[PATCH v1] Vect: Distribute truncation into .SAT_SUB operands

2024-06-29 Thread pan2 . li
From: Pan Li 

To get better vectorized code for .SAT_SUB,  we would like to avoid the
truncation operation on the assignment.  For example, as below.

unsigned int _1;
unsigned int _2;
_9 = (unsigned short int).SAT_SUB (_1, _2);

If we can make sure that _1 is in the range of unsigned short int, such
as when it has a def similar to:

_1 = (unsigned short int)_4;

Then we can distribute the truncation operation to:

_3 = MIN_EXPR (_2, 65535);
_9 = .SAT_SUB ((unsigned short int)_1, (unsigned short int)_3);
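
To see why the rewrite is sound, here is a scalar sketch in C (helper
names are illustrative only), valid whenever the first operand is known
to fit in the narrow type:

#include <stdint.h>

/* Saturating subtract in the narrow type.  */
static uint16_t sat_sub_u16 (uint16_t a, uint16_t b)
{
  return a >= b ? a - b : 0;
}

/* Original form: saturating subtract in the wide type, then truncate.  */
static uint16_t trunc_result (uint32_t a, uint32_t b)
{
  uint32_t s = a >= b ? a - b : 0;
  return (uint16_t) s;
}

/* Rewritten form: clamp b, truncate both operands, subtract narrow.
   When a <= 65535, any b > 65535 saturates both forms to 0, and for
   b <= 65535 the narrow subtract matches the wide one exactly.  */
static uint16_t trunc_operands (uint32_t a, uint32_t b)
{
  uint32_t b_min = b < 65535 ? b : 65535;   /* MIN_EXPR (b, 65535) */
  return sat_sub_u16 ((uint16_t) a, (uint16_t) b_min);
}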

Let's take RISC-V vector as an example to illustrate the changes.  For
the sample code below:

__attribute__((noinline))
void test (uint16_t *x, unsigned b, unsigned n)
{
  unsigned a = 0;
  uint16_t *p = x;

  do {
a = *--p;
*p = (uint16_t)(a >= b ? a - b : 0);
  } while (--n);
}

Before this patch:
  ...
  .L3:
  vle16.v   v1,0(a3)
  vrsub.vx  v5,v2,t1
  mv        t3,a4
  addw  a4,a4,t5
  vrgather.vv   v3,v1,v5
  vsetvli   zero,zero,e32,m1,ta,ma
  vzext.vf2 v1,v3
  vssubu.vx v1,v1,a1
  vsetvli   zero,zero,e16,mf2,ta,ma
  vncvt.x.x.w   v1,v1
  vrgather.vv   v3,v1,v5
  vse16.v   v3,0(a3)
  sub   a3,a3,t4
  bgtu  t6,a4,.L3
  ...

After this patch:
test:
  ...
  .L3:
  vle16.v   v3,0(a3)
  vrsub.vx  v5,v2,a6
  mv        a7,a4
  addw  a4,a4,t3
  vrgather.vv   v1,v3,v5
  vssubu.vv v1,v1,v6
  vrgather.vv   v3,v1,v5
  vse16.v   v3,0(a3)
  sub   a3,a3,t1
  bgtu  t4,a4,.L3
  ...

The below test suites are passed for this patch:
1. The rv64gcv fully regression tests.
2. The rv64gcv build with glibc.
3. The x86 bootstrap tests.
4. The x86 fully regression tests.

gcc/ChangeLog:

* tree-vect-patterns.cc (vect_recog_sat_sub_pattern_distribute):
Add new function to perform the truncation distribution.
(vect_recog_sat_sub_pattern): Perform the above optimization before
generating the .SAT_SUB call.

Signed-off-by: Pan Li 
---
 gcc/tree-vect-patterns.cc | 73 +++
 1 file changed, 73 insertions(+)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 519d15f2a43..7329ecec2c4 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -4565,6 +4565,77 @@ vect_recog_sat_add_pattern (vec_info *vinfo, stmt_vec_info stmt_vinfo,
   return NULL;
 }
 
+/*
+ * Try to distribute the truncation for the .SAT_SUB pattern, which mostly
+ * occurs in the zip benchmark.  Aka:
+ *
+ *   unsigned int _1;
+ *   unsigned int _2;
+ *   _9 = (unsigned short int).SAT_SUB (_1, _2);
+ *
+ *   If _1 is known to be in the range of unsigned short int, for example
+ *   when there is a def _1 = (unsigned short int)_4, then we can distribute
+ *   the truncation to:
+ *
+ *   _3 = MIN (65535, _2);
+ *   _9 = .SAT_SUB ((unsigned short int)_1, (unsigned short int)_3);
+ *
+ *   Then we can get better vectorized code and avoid the unnecessary
+ *   narrowing stmt during vectorization.
+ */
+static void
+vect_recog_sat_sub_pattern_distribute (vec_info *vinfo,
+  stmt_vec_info stmt_vinfo,
+  gimple *stmt, tree lhs, tree *ops)
+{
+  tree otype = TREE_TYPE (lhs);
+  tree itype = TREE_TYPE (ops[0]);
+
+  if (types_compatible_p (otype, itype))
+    return;
+
+  unsigned itype_prec = TYPE_PRECISION (itype);
+  unsigned otype_prec = TYPE_PRECISION (otype);
+
+  if (otype_prec >= itype_prec)
+    return;
+
+  int_range_max r;
+  gimple_ranger granger;
+
+  if (granger.range_of_expr (r, ops[0], stmt) && !r.undefined_p ())
+    {
+  wide_int bound = r.upper_bound ();
+  wide_int otype_max = wi::mask (otype_prec, /* negate */false, itype_prec);
+
+  if (bound != otype_max)
+   return;
+
+  tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
+  tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
+
+  /* 1. Build truncated op_0  */
+  tree op_0_out = vect_recog_temp_ssa_var (otype, NULL);
+  gimple *op_0_cast = gimple_build_assign (op_0_out, NOP_EXPR, ops[0]);
+  append_pattern_def_seq (vinfo, stmt_vinfo, op_0_cast, v_otype);
+
+  /* 2. Build MIN_EXPR (op_1, 65535)  */
+  tree max = wide_int_to_tree (itype, otype_max);
+  tree op_1_in = vect_recog_temp_ssa_var (itype, NULL);
+  gimple *op_1_min = gimple_build_assign (op_1_in, MIN_EXPR, ops[1], max);
+  append_pattern_def_seq (vinfo, stmt_vinfo, op_1_min, v_itype);
+
+  /* 3. Build truncated op_1  */
+  tree op_1_out = vect_recog_temp_ssa_var (otype, NULL);
+  gimple *op_1_cast = gimple_build_assign (op_1_out, NOP_EXPR, op_1_in);
+  append_pattern_def_seq (vinfo, stmt_vinfo, op_1_cast, v_otype);
+
+  /* 4. Update the ops  */
+  ops[0] = op_0_out;
+  ops[1] = op_1_out;
+    }
+}
+
 /*
  * Try to detect saturation sub pattern (SAT_SUB), aka below gimple:
  *   _7 = _1 >= _2;
@@ -4590,6 +4661,8 @@ vect_recog_sat_sub_pattern (vec_info *vinfo, stmt_vec_info stmt_vinfo,
 
   if (gimple_unsigned_integer_sat_sub (lhs, ops, NULL))
 {
+  

[Patch, rtl-optimization]: Loop unroll factor based on register pressure

2024-06-29 Thread Ajit Agarwal
Hello All:

This patch determines the unroll factor based on loop register pressure.

The unroll factor is the quotient of the maximum number of registers
available in the loop divided by the number of live registers.

As the number of available registers increases, the unroll factor
increases; conversely, the unroll factor decreases as the number of
live registers increases.

Loop unrolling is driven by the loop variables that determine the
unroll factor.  Loop variables are the variables that increase register
pressure, and the patch takes advantage of the existing register
pressure calculation.

The number of available registers is the number of hard registers
available in each register class minus the maximum register pressure
of the loop for that register class.
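
As a sketch, the formula described above amounts to the following
(illustrative names only, not the patch's actual interface):

/* Unroll factor for one pressure class: the quotient of the registers
   still available in the loop by the number of live registers.  */
static int
unroll_factor_for_class (int n_hard_regs, int max_loop_pressure, int n_live)
{
  int available = n_hard_regs - max_loop_pressure;
  if (available <= 0 || n_live <= 0)
    return 1;                   /* no headroom: do not unroll */
  return available / n_live;
}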

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rtl-optimization: Loop unroll factor based on register pressure

The unroll factor is calculated based on loop register pressure.

The unroll factor is the quotient of the maximum number of registers
available in the loop divided by the number of live registers.

As the number of available registers increases, the unroll factor
increases; conversely, the unroll factor decreases as the number of
live registers increases.

Loop unrolling is driven by the loop variables that determine the
unroll factor.  Loop variables are the variables that increase register
pressure, and the patch takes advantage of the existing register
pressure calculation.

The number of available registers is the number of hard registers
available in each register class minus the maximum register pressure
of the loop for that register class.

2024-06-29  Ajit Kumar Agarwal  

gcc/ChangeLog:

* loop-unroll.cc: Add calculation of the register pressure of
the loop and use it to calculate the unroll factor.
---
 gcc/loop-unroll.cc | 331 -
 1 file changed, 328 insertions(+), 3 deletions(-)

diff --git a/gcc/loop-unroll.cc b/gcc/loop-unroll.cc
index bfdfe6c2bb7..6936ba7afb9 100644
--- a/gcc/loop-unroll.cc
+++ b/gcc/loop-unroll.cc
@@ -35,6 +35,11 @@ along with GCC; see the file COPYING3.  If not see
 #include "dojump.h"
 #include "expr.h"
 #include "dumpfile.h"
+#include "regs.h"
+#include "ira.h"
+#include "rtl-iter.h"
+#include "regset.h"
+#include "df.h"
 
 /* This pass performs loop unrolling.  We only perform this
optimization on innermost loops (with single exception) because
@@ -65,6 +70,38 @@ along with GCC; see the file COPYING3.  If not see
showed that this choice may affect performance in order of several %.
*/
 
+class loop_data
+{
+public:
+  class loop *outermost_exit;  /* The outermost exit of the loop.  */
+  bool has_call;   /* True if the loop contains a call.  */
+  /* Maximal register pressure inside loop for given register class
+ (defined only for the pressure classes).  */
+  int max_reg_pressure[N_REG_CLASSES];
+  /* Loop regs referenced and live pseudo-registers.  */
+  bitmap_head regs_ref;
+  bitmap_head regs_live;
+};
+
+#define LOOP_DATA(LOOP) ((class loop_data *) (LOOP)->aux)
+
+/* Record all regs that are set in any one insn.  Communication from
+   mark_reg_{store,clobber} and global_conflicts.  Asm can refer to
+   all hard-registers.  */
+static rtx regs_set[(FIRST_PSEUDO_REGISTER > MAX_RECOG_OPERANDS
+? FIRST_PSEUDO_REGISTER : MAX_RECOG_OPERANDS) * 2];
+/* Number of regs stored in the previous array.  */
+static int n_regs_set;
+
+/* Currently processed loop.  */
+static class loop *curr_loop;
+
+/* Registers currently living.  */
+static bitmap_head curr_regs_live;
+
+/* Current reg pressure for each pressure class.  */
+static int curr_reg_pressure[N_REG_CLASSES];
+
 /* Information about induction variables to split.  */
 
 struct iv_to_split
@@ -272,11 +309,262 @@ decide_unrolling (int flags)
 }
 }
 
+/* Return pressure class and number of needed hard registers (through
+   *NREGS) of register REGNO.  */
+static enum reg_class
+get_regno_pressure_class (int regno, int *nregs)
+{
+  if (regno >= FIRST_PSEUDO_REGISTER)
+{
+  enum reg_class pressure_class;
+  pressure_class = reg_allocno_class (regno);
+  pressure_class = ira_pressure_class_translate[pressure_class];
+  *nregs
+   = ira_reg_class_max_nregs[pressure_class][PSEUDO_REGNO_MODE (regno)];
+  return pressure_class;
+}
+  else if (! TEST_HARD_REG_BIT (ira_no_alloc_regs, regno)
+  && ! TEST_HARD_REG_BIT (eliminable_regset, regno))
+{
+  *nregs = 1;
+  return ira_pressure_class_translate[REGNO_REG_CLASS (regno)];
+}
+  else
+{
+  *nregs = 0;
+  return NO_REGS;
+}
+}
+
+/* Increase (if INCR_P) or decrease current register pressure for
+   register REGNO.  */
+static void
+change_pressure (int regno, bool incr_p)
+{
+  int nregs;
+  enum reg_class pressure_class;
+
+  pressure_class = get_regno_pressure_class (regno, &nregs);
+  if (! incr_p)
+curr_reg_pressure[pressure_class] -= nregs;
+  else
+{
+  curr_reg_pressure[pressure_class] += nregs;
+  if (LOOP_DATA (curr_loop)->max_reg_pressure[pressure_class]
+ < 

Re: [PATCH][PR115565] cse: Don't use a valid regno for non-register in comparison_qty

2024-06-29 Thread Maciej W. Rozycki
On Fri, 21 Jun 2024, Richard Sandiford wrote:

> >  This has passed verification in native `powerpc64le-linux-gnu' and 
> > `x86_64-linux-gnu' regstraps, as well as with the `alpha-linux-gnu' 
> > target.  OK to apply and backport to the release branches?
> 
> Huh!  Nice detective work.

 Thank you.  It did help that `call_pal 158' or the lack of it is so easy 
to spot in disassembly or compiler output.

 Inspired by Roger Sayle's observation that the use of $0 hard register 
pre-reload is uncommon I tried to come up with an RTL test case that I 
*might* be able to tweak enough, knowing the nature of the bug, to still 
trigger with trunk.  For that I backported `print_rtx_function' to 4.1.2. 

 The output it produced was, however, sufficiently different from the 
syntax now accepted for input that it wasn't suitable at all.  So I thought 
I could perhaps tweak it by hand, using output from GCC 15 as a reference.

 But that did not work either: GCC 15 cannot accept its own output, it 
would seem, complaining about integer suffixes in a couple of places:

rwlock-test.i:112:100: error: invalid suffix "B" on integer constant
  112 | 32)) [1 MEM[(struct _pthread_descr_struct * *)rwlock_8(D) + 32B]+0 S8 
A64])) "rwlock-test.i":84:5 discrim 1)
  | ^~~

and most importantly, the asm statement:

rwlock-test.i:53:62: error: expected character `)', found `:'
rwlock-test.i:53:65: note: following context is `37)'

where input is:

  (cinsn 18 (set (reg/v:DI $0 [ __self ])
(asm_operands:DI ("call_pal %1") ("=r") 0 [
(const_int 158)
]
 [
(asm_input:SI ("i") rwlock-test.i:37)
]
 [] rwlock-test.i:37)) "rwlock-test.i":37:58)

and then there is the complaint about the `asm_input' expression; it's not 
clear to me what is expected here.  So I have given up.

 I guess the RTL parser could be improved, or maybe output produced from 
`print_rtx_function' isn't right, I don't know.

> The patch is OK for trunk, thanks.  I agree that it's a regression
> from 08a692679fb8.  Since it's fixing such a hard-to-diagnose wrong
> code bug, and since it seems very safe, I think it's worth backporting
> to all active branches, after a grace period.

 Agreed as to the grace period.  I have since additionally run the patch 
through `riscv64-linux-gnu' verification and John David Anglin was kind 
enough to do so with `hppa-unknown-linux-gnu'.

 I have pushed the change to trunk now, thank you for your review.

  Maciej



[PATCH 5/5] Document return value in write_cv_integer

2024-06-29 Thread Mark Harmstone
gcc/
* dwarf2codeview.cc (write_cv_integer): Document return value in comment.
---
 gcc/dwarf2codeview.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/dwarf2codeview.cc b/gcc/dwarf2codeview.cc
index 5a33b439b14..df53d8bab9d 100644
--- a/gcc/dwarf2codeview.cc
+++ b/gcc/dwarf2codeview.cc
@@ -1113,7 +1113,7 @@ write_lf_modifier (codeview_custom_type *t)
 /* Write a CodeView extensible integer.  If the value is non-negative and
< 0x8000, the value gets written directly as an uint16_t.  Otherwise, we
output two bytes for the integer type (LF_CHAR, LF_SHORT, ...), and the
-   actual value follows.  */
+   actual value follows.  Returns the total number of bytes written.  */
 
 static size_t
 write_cv_integer (codeview_integer *i)
-- 
2.44.2
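
For illustration, the layout that comment describes is roughly the
following (a simplified sketch for non-negative values; the names and
size accounting are illustrative, not the patch's code):

#include <stdint.h>
#include <stddef.h>

/* Number of bytes a CodeView extensible integer would occupy.  */
static size_t
cv_integer_encoded_size (uint64_t value)
{
  if (value < 0x8000)
    return 2;          /* written directly as a uint16_t */
  else if (value <= UINT16_MAX)
    return 2 + 2;      /* two-byte type marker (e.g. LF_USHORT), then value */
  else if (value <= UINT32_MAX)
    return 2 + 4;      /* LF_ULONG marker, then the 4-byte value */
  else
    return 2 + 8;      /* LF_UQUADWORD marker, then the 8-byte value */
}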



[PATCH 4/5] Make sure CodeView symbols are aligned

2024-06-29 Thread Mark Harmstone
CodeView symbols have to be multiples of four bytes; add an alignment
directive to write_data_symbol to ensure this.

Note that these can be zeroes, so we can rely on GAS to do this for us;
it's only types that need f3, f2, f1 values.

gcc/
* dwarf2codeview.cc (write_data_symbol): Add alignment directive.
---
 gcc/dwarf2codeview.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/dwarf2codeview.cc b/gcc/dwarf2codeview.cc
index 71049ccf878..5a33b439b14 100644
--- a/gcc/dwarf2codeview.cc
+++ b/gcc/dwarf2codeview.cc
@@ -958,6 +958,8 @@ write_data_symbol (codeview_symbol *s)
   ASM_OUTPUT_ASCII (asm_out_file, s->data_symbol.name,
strlen (s->data_symbol.name) + 1);
 
+  ASM_OUTPUT_ALIGN (asm_out_file, 2);
+
   targetm.asm_out.internal_label (asm_out_file, SYMBOL_END_LABEL, label_num);
 
 end:
-- 
2.44.2



[PATCH 3/5] Avoid magic numbers when writing CodeView padding

2024-06-29 Thread Mark Harmstone
Adds names for the padding magic numbers to enum cv_leaf_type.

gcc/
* dwarf2codeview.cc (enum cv_leaf_type): Add padding constants.
(write_cv_padding): Use names for padding constants.
---
 gcc/dwarf2codeview.cc | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/gcc/dwarf2codeview.cc b/gcc/dwarf2codeview.cc
index 921d5f41e5a..71049ccf878 100644
--- a/gcc/dwarf2codeview.cc
+++ b/gcc/dwarf2codeview.cc
@@ -77,6 +77,9 @@ enum cv_sym_type {
 /* This is enum LEAF_ENUM_e in Microsoft's cvinfo.h.  */
 
 enum cv_leaf_type {
+  LF_PAD1 = 0xf1,
+  LF_PAD2 = 0xf2,
+  LF_PAD3 = 0xf3,
   LF_MODIFIER = 0x1001,
   LF_POINTER = 0x1002,
   LF_PROCEDURE = 0x1008,
@@ -1037,7 +1040,7 @@ write_lf_pointer (codeview_custom_type *t)
 
 /* All CodeView type definitions have to be aligned to a four-byte boundary,
so write some padding bytes if necessary.  These have to be specific values:
-   f3, f2, f1.  */
+   LF_PAD3, LF_PAD2, LF_PAD1.  */
 
 static void
 write_cv_padding (size_t padding)
@@ -1048,19 +1051,19 @@ write_cv_padding (size_t padding)
   if (padding == 3)
 {
   fputs (integer_asm_op (1, false), asm_out_file);
-  fprint_whex (asm_out_file, 0xf3);
+  fprint_whex (asm_out_file, LF_PAD3);
   putc ('\n', asm_out_file);
 }
 
   if (padding >= 2)
 {
   fputs (integer_asm_op (1, false), asm_out_file);
-  fprint_whex (asm_out_file, 0xf2);
+  fprint_whex (asm_out_file, LF_PAD2);
   putc ('\n', asm_out_file);
 }
 
   fputs (integer_asm_op (1, false), asm_out_file);
-  fprint_whex (asm_out_file, 0xf1);
+  fprint_whex (asm_out_file, LF_PAD1);
   putc ('\n', asm_out_file);
 }
 
-- 
2.44.2



[PATCH 2/5] Add CodeView enum cv_sym_type

2024-06-29 Thread Mark Harmstone
Make everything more gdb-friendly by using an enum for symbol constants
rather than #defines.

gcc/
* dwarf2codeview.cc (S_LDATA32, S_GDATA32, S_COMPILE3): Undefine.
(enum cv_sym_type): Define.
(struct codeview_symbol): Use enum cv_sym_type.
(write_codeview_symbols): Add default to switch.
---
 gcc/dwarf2codeview.cc | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/gcc/dwarf2codeview.cc b/gcc/dwarf2codeview.cc
index 5155aa70139..921d5f41e5a 100644
--- a/gcc/dwarf2codeview.cc
+++ b/gcc/dwarf2codeview.cc
@@ -46,10 +46,6 @@ along with GCC; see the file COPYING3.  If not see
 
 #define CHKSUM_TYPE_MD5    1
 
-#define S_LDATA32  0x110c
-#define S_GDATA32  0x110d
-#define S_COMPILE3 0x113c
-
 #define CV_CFL_80386   0x03
 #define CV_CFL_X64 0xD0
 
@@ -70,6 +66,14 @@ along with GCC; see the file COPYING3.  If not see
 
 #define HASH_SIZE 16
 
+/* This is enum SYM_ENUM_e in Microsoft's cvinfo.h.  */
+
+enum cv_sym_type {
+  S_LDATA32 = 0x110c,
+  S_GDATA32 = 0x110d,
+  S_COMPILE3 = 0x113c
+};
+
 /* This is enum LEAF_ENUM_e in Microsoft's cvinfo.h.  */
 
 enum cv_leaf_type {
@@ -168,7 +172,7 @@ struct codeview_function
 struct codeview_symbol
 {
   codeview_symbol *next;
-  uint16_t kind;
+  enum cv_sym_type kind;
 
   union
   {
@@ -983,6 +987,8 @@ write_codeview_symbols (void)
case S_GDATA32:
  write_data_symbol (sym);
  break;
+   default:
+ break;
}
 
   free (sym);
-- 
2.44.2



[PATCH 1/5] Add CodeView enum cv_leaf_type

2024-06-29 Thread Mark Harmstone
Make everything more gdb-friendly by using an enum for type constants
rather than #defines.

gcc/
* dwarf2codeview.cc (enum cv_leaf_type): Define.
(struct codeview_subtype): Use enum cv_leaf_type.
(struct codeview_custom_type): Use enum cv_leaf_type.
(write_lf_fieldlist): Add default to switch.
(write_custom_types): Add default to switch.
* dwarf2codeview.h (LF_MODIFIER, LF_POINTER): Undefine.
(LF_PROCEDURE, LF_ARGLIST, LF_FIELDLIST, LF_BITFIELD): Likewise.
(LF_INDEX, LF_ENUMERATE, LF_ARRAY, LF_CLASS): Likewise.
(LF_STRUCTURE, LF_UNION, LF_ENUM, LF_MEMBER, LF_CHAR): Likewise.
(LF_SHORT, LF_USHORT, LF_LONG, LF_ULONG, LF_QUADWORD): Likewise.
(LF_UQUADWORD): Likewise.
---
 gcc/dwarf2codeview.cc | 37 +++--
 gcc/dwarf2codeview.h  | 23 ---
 2 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/gcc/dwarf2codeview.cc b/gcc/dwarf2codeview.cc
index e8ed3713480..5155aa70139 100644
--- a/gcc/dwarf2codeview.cc
+++ b/gcc/dwarf2codeview.cc
@@ -70,6 +70,33 @@ along with GCC; see the file COPYING3.  If not see
 
 #define HASH_SIZE 16
 
+/* This is enum LEAF_ENUM_e in Microsoft's cvinfo.h.  */
+
+enum cv_leaf_type {
+  LF_MODIFIER = 0x1001,
+  LF_POINTER = 0x1002,
+  LF_PROCEDURE = 0x1008,
+  LF_ARGLIST = 0x1201,
+  LF_FIELDLIST = 0x1203,
+  LF_BITFIELD = 0x1205,
+  LF_INDEX = 0x1404,
+  LF_ENUMERATE = 0x1502,
+  LF_ARRAY = 0x1503,
+  LF_CLASS = 0x1504,
+  LF_STRUCTURE = 0x1505,
+  LF_UNION = 0x1506,
+  LF_ENUM = 0x1507,
+  LF_MEMBER = 0x150d,
+  LF_FUNC_ID = 0x1601,
+  LF_CHAR = 0x8000,
+  LF_SHORT = 0x8001,
+  LF_USHORT = 0x8002,
+  LF_LONG = 0x8003,
+  LF_ULONG = 0x8004,
+  LF_QUADWORD = 0x8009,
+  LF_UQUADWORD = 0x800a
+};
+
 struct codeview_string
 {
   codeview_string *next;
@@ -185,7 +212,7 @@ struct codeview_integer
 struct codeview_subtype
 {
   struct codeview_subtype *next;
-  uint16_t kind;
+  enum cv_leaf_type kind;
 
   union
   {
@@ -212,7 +239,7 @@ struct codeview_custom_type
 {
   struct codeview_custom_type *next;
   uint32_t num;
-  uint16_t kind;
+  enum cv_leaf_type kind;
 
   union
   {
@@ -1336,6 +1363,9 @@ write_lf_fieldlist (codeview_custom_type *t)
  putc ('\n', asm_out_file);
 
  break;
+
+   default:
+ break;
}
 
   t->lf_fieldlist.subtypes = next;
@@ -1790,6 +1820,9 @@ write_custom_types (void)
case LF_ARGLIST:
  write_lf_arglist (custom_types);
  break;
+
+   default:
+ break;
}
 
   free (custom_types);
diff --git a/gcc/dwarf2codeview.h b/gcc/dwarf2codeview.h
index e6ad517bf28..8fd3632e524 100644
--- a/gcc/dwarf2codeview.h
+++ b/gcc/dwarf2codeview.h
@@ -60,29 +60,6 @@ along with GCC; see the file COPYING3.  If not see
 #define MOD_const  0x1
 #define MOD_volatile   0x2
 
-/* Constants for type definitions.  */
-#define LF_MODIFIER0x1001
-#define LF_POINTER 0x1002
-#define LF_PROCEDURE   0x1008
-#define LF_ARGLIST 0x1201
-#define LF_FIELDLIST   0x1203
-#define LF_BITFIELD0x1205
-#define LF_INDEX   0x1404
-#define LF_ENUMERATE   0x1502
-#define LF_ARRAY   0x1503
-#define LF_CLASS   0x1504
-#define LF_STRUCTURE   0x1505
-#define LF_UNION   0x1506
-#define LF_ENUM0x1507
-#define LF_MEMBER  0x150d
-#define LF_CHAR0x8000
-#define LF_SHORT   0x8001
-#define LF_USHORT  0x8002
-#define LF_LONG0x8003
-#define LF_ULONG   0x8004
-#define LF_QUADWORD0x8009
-#define LF_UQUADWORD   0x800a
-
 #define CV_ACCESS_PRIVATE  1
 #define CV_ACCESS_PROTECTED2
 #define CV_ACCESS_PUBLIC   3
-- 
2.44.2



[to-be-committed] [RISC-V] DCE analysis for extension elimination

2024-06-29 Thread Jeff Law


This was actually ack'd late in the gcc-14 cycle, but I chose not to 
integrate it given how late we were in the cycle.


The basic idea here is to track liveness of subobjects within a word and 
if we find an extension where the bits set aren't actually used, then we 
convert the extension into a subreg.  The subreg typically simplifies away.
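
As a small C illustration of the situation the pass targets (a made-up
example, not from the patch):

/* Only the low byte of the result is used, so the zero-extensions of
   A and B are dead: the pass can turn each one into a subreg of the
   original QImode value, and the subreg then simplifies away.  */
unsigned int
low_byte_sum (unsigned char a, unsigned char b)
{
  unsigned int x = a;           /* zero_extend:QI->SI */
  unsigned int y = b;           /* zero_extend:QI->SI */
  return (x + y) & 0xff;        /* bits 8 and up of x and y are unused */
}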


I've seen this help a few routines in coremark, fix one bug in the 
testsuite (pr111384) and fix a couple internally reported bugs in Ventana.


The original idea and code were from Joern; Jivan and I hacked it into 
usable shape.  I've had this in my tester for ~8 months, so it's been 
through more build/test cycles than I care to contemplate and nearly 
every architecture we support.


But just in case, I'm going to wait for it to spin through the 
pre-commit CI tester.  I'll find my old ChangeLog before committing.


Jeff

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index deb12e17d25..cdefdd0fa27 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1453,6 +1453,7 @@ OBJS = \
explow.o \
expmed.o \
expr.o \
+   ext-dce.o \
fibonacci_heap.o \
file-prefix-map.o \
final.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 5f0a101bccb..d6b40edb57b 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3846,4 +3846,8 @@ fipa-ra
 Common Var(flag_ipa_ra) Optimization
 Use caller save register across calls if possible.
 
+fext-dce
+Common Var(flag_ext_dce, 1) Optimization Init(0)
+Perform dead code elimination on zero and sign extensions with special dataflow analysis.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/df-scan.cc b/gcc/df-scan.cc
index 1bade2cd71e..46806fe6e67 100644
--- a/gcc/df-scan.cc
+++ b/gcc/df-scan.cc
@@ -78,7 +78,6 @@ static void df_get_eh_block_artificial_uses (bitmap);
 
 static void df_record_entry_block_defs (bitmap);
 static void df_record_exit_block_uses (bitmap);
-static void df_get_exit_block_use_set (bitmap);
 static void df_get_entry_block_def_set (bitmap);
 static void df_grow_ref_info (struct df_ref_info *, unsigned int);
 static void df_ref_chain_delete_du_chain (df_ref);
@@ -3642,7 +3641,7 @@ df_epilogue_uses_p (unsigned int regno)
 
 /* Set the bit for regs that are considered being used at the exit. */
 
-static void
+void
 df_get_exit_block_use_set (bitmap exit_block_uses)
 {
   unsigned int i;
diff --git a/gcc/df.h b/gcc/df.h
index 84e5aa8b524..2b9997eb978 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -1091,6 +1091,7 @@ extern bool df_epilogue_uses_p (unsigned int);
 extern void df_set_regs_ever_live (unsigned int, bool);
 extern void df_compute_regs_ever_live (bool);
 extern void df_scan_verify (void);
+extern void df_get_exit_block_use_set (bitmap);
 
 
 /*
diff --git a/gcc/ext-dce.cc b/gcc/ext-dce.cc
new file mode 100644
index 000..ffea057ee11
--- /dev/null
+++ b/gcc/ext-dce.cc
@@ -0,0 +1,943 @@
+/* RTL dead zero/sign extension (code) elimination.
+   Copyright (C) 2000-2022 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "tree.h"
+#include "memmodel.h"
+#include "insn-config.h"
+#include "emit-rtl.h"
+#include "recog.h"
+#include "cfganal.h"
+#include "tree-pass.h"
+#include "cfgrtl.h"
+#include "rtl-iter.h"
+#include "df.h"
+#include "print-rtl.h"
+
+/* These should probably move into a C++ class.  */
+static vec livein;
+static bitmap all_blocks;
+static bitmap livenow;
+static bitmap changed_pseudos;
+static bool modify;
+
+/* We consider four bit groups for liveness:
+   bit 0..7   (least significant byte)
+   bit 8..15  (second least significant byte)
+   bit 16..31
+   bit 32..BITS_PER_WORD-1  */
+
+/* Note this pass could be used to narrow memory loads too.  It's
+   not clear if that's profitable or not in general.  */
+
+#define UNSPEC_P(X) (GET_CODE (X) == UNSPEC || GET_CODE (X) == UNSPEC_VOLATILE)
+
+/* If we know the destination of CODE only uses some low bits
+   (say just the QI bits of an SI operation), then return true
+   if we can propagate the need for just the subset of bits
+   from the destination to the sources.
+
+   FIXME: This is safe for operands 1 and 2 of an IF_THEN_ELSE, but not
+   operand 0.  Thus is 


[to-be-committed][v3][RISC-V] Handle bit manipulation of SImode values

2024-06-29 Thread Jeff Law
Third time is a charm perhaps?  I'm not sure how I keep mucking this 
patch up, but clearly I do as I've sent the wrong patch twice!




--

Last patch in this round of bitmanip work...  At least I think I'm going 
to pause here and switch gears to other projects that need attention.



This patch introduces the ability to generate bitmanip instructions for 
rv64 when operating on SI objects when we know something about the range 
of the bit position (due to masking of the position).


I've got a note that the (7 - pos % 8) bit position form was discovered by 
RAU in 500.perl.  I took that and expanded it to the simple (pos & mask) 
form as well as covering bset, binv and bclr.
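
In C terms, the two forms look like this (illustrative examples, not
from the patch):

/* Masked position: (pos & 7) keeps the shift amount in [0, 7], so the
   SImode result's sign bit is provably never set by the bset.  */
unsigned int
bset_masked (unsigned int x, unsigned int pos)
{
  return x | (1u << (pos & 7));
}

/* The (7 - pos % 8) form from 500.perl: the amount is again in [0, 7],
   so the same reasoning applies.  */
unsigned int
bset_reversed (unsigned int x, unsigned int pos)
{
  return x | (1u << (7 - pos % 8));
}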


As far as the implementation is concerned

This turns the recently added define_splits into define_insn_and_split 
constructs.  This allows combine to "see" enough RTL to realize a sign 
extension is unnecessary.  Otherwise we get undesirable sign extensions 
for the new testcases.


Second it adds new patterns for the logical operations.  Two patterns 
for IOR/XOR and two patterns for AND.


I think a key concept to keep in mind is that once we determine a Zbs 
operation is safe to perform on a SI value, we can rewrite the RTL in 
64bit form.  If we were ever to try and use range information at expand 
time for this stuff (and we probably should investigate that), that's 
the path I'd suggest.


This is notably cleaner than my original implementation which actually 
kept the more complex RTL form through final and emitted 2/3 
instructions (mask the bit position, then the bset/bclr/binv).



Tested in my tester, but waiting for pre-commit CI to report back before 
taking further action.


Jeff

gcc/


* config/riscv/bitmanip.md (bset splitters): Turn into define_insn_and_splits.
Don't depend on combine splitting the "andn with constant" form.
(bset, binv, bclr with masked bit position): New patterns.

gcc/testsuite
* gcc.target/riscv/binv-for-simode.c: New test.
* gcc.target/riscv/bset-for-simode.c: New test.
* gcc.target/riscv/bclr-for-simode.c: New test.


diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index 3eedabffca0..f403ba8dbba 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -615,37 +615,140 @@ (define_insn "*bsetdi_2"
 ;; shift constant.  With the limited range we know the SImode sign
 ;; bit is never set, thus we can treat this as zero extending and
 ;; generate the bsetdi_2 pattern.
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
+(define_insn_and_split ""
+  [(set (match_operand:DI 0 "register_operand" "=r")
(any_extend:DI
 (ashift:SI (const_int 1)
(subreg:QI
- (and:DI (not:DI (match_operand:DI 1 "register_operand"))
+ (and:DI (not:DI (match_operand:DI 1 "register_operand" "r"))
  (match_operand 2 "const_int_operand")) 0))))
-   (clobber (match_operand:DI 3 "register_operand"))]
+   (clobber (match_scratch:X 3 "=&r"))]
   "TARGET_64BIT
&& TARGET_ZBS
&& (TARGET_ZBB || TARGET_ZBKB)
&& (INTVAL (operands[2]) & 0x1f) != 0x1f"
-   [(set (match_dup 0) (and:DI (not:DI (match_dup 1)) (match_dup 2)))
-(set (match_dup 0) (zero_extend:DI (ashift:SI
-  (const_int 1)
-  (subreg:QI (match_dup 0) 0))))])
+  "#"
+  "&& reload_completed"
+   [(set (match_dup 3) (match_dup 2))
+(set (match_dup 3) (and:DI (not:DI (match_dup 1)) (match_dup 3)))
+(set (match_dup 0) (zero_extend:DI
+(ashift:SI (const_int 1) (match_dup 4))))]
+  { operands[4] = gen_lowpart (QImode, operands[3]); }
+  [(set_attr "type" "bitmanip")])
 
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
-   (any_extend:DI
+(define_insn_and_split ""
+  [(set (match_operand:DI 0 "register_operand" "=r")
+(any_extend:DI
 (ashift:SI (const_int 1)
(subreg:QI
- (and:DI (match_operand:DI 1 "register_operand")
+ (and:DI (match_operand:DI 1 "register_operand" "r")
  (match_operand 2 "const_int_operand")) 0))))]
   "TARGET_64BIT
&& TARGET_ZBS
&& (INTVAL (operands[2]) & 0x1f) != 0x1f"
-   [(set (match_dup 0) (and:DI (match_dup 1) (match_dup 2)))
-(set (match_dup 0) (zero_extend:DI (ashift:SI
-  (const_int 1)
-  (subreg:QI (match_dup 0) 0))))])
+  "#"
+  "&& 1"
+  [(set (match_dup 0) (and:DI (match_dup 1) (match_dup 2)))
+   (set (match_dup 0) (zero_extend:DI (ashift:SI
+(const_int 1)
+(subreg:QI (match_dup 0) 0))))]
+  { }
+  [(set_attr "type" "bitmanip")])
+
+;; Similarly two patterns for IOR/XOR generating bset/binv to
+;; manipulate a bit in a register
+(define_insn_and_split ""
+  

Re: [PATCH] RISC-V: use fclass insns to implement isfinite and isnormal builtins

2024-06-29 Thread Vineet Gupta
On 6/29/24 06:44, Jeff Law wrote:
>> +;; fclass instruction output bitmap
>> +;;   0 negative infinity
>> +;;   1 negative normal number.
>> +;;   2 negative subnormal number.
>> +;;   3 -0
>> +;;   4 +0
>> +;;   5 positive subnormal number.
>> +;;   6 positive normal number.
>> +;;   7 positive infinity
>> +;;   8 signaling NaN.
>> +;;   9 quiet NaN
>> +(define_insn "fclass"
>> +  [(set (match_operand:SI   0 "register_operand" "=r")
>> +(unspec:SI [(match_operand:ANYF 1 "register_operand" " f")]
>> +   UNSPEC_FCLASS))]
>> +  "TARGET_HARD_FLOAT"
>> +  "fclass.\t%0,%1"
>> +  [(set_attr "type" "fcmp")
>> +   (set_attr "mode" "")])
> So I realize the result only has 10 bits of output, but I think would it 
> make more sense to use X rather than SI for the result.  When we use 
> SImode on rv64 we have to deal with potential extensions.  In this case 
> we know the values are properly extended, so we could just claim it's 
> DImode and I think everything would "just work" and we wouldn't have to 
> worry about unnecessary sign extensions creeping in.

Indeed the perils of sign extension on RV are not lost on me and this is
exactly how I started.
But my md syntax foo/bar is, let's just say, a work in progress :-)

I started with

+ "fclass" so its invocation became + emit_insn
(gen_fclass (tmp, operands[1])); which then led to
expander itself needing X in the definition lest we get duplicate
definitions due to X's variants.

+(define_expand "isnormal2"
+  [(set (match_operand:X   0 "register_operand" "=r")
+   (unspec:X [(match_operand:ANYF 1 "register_operand" " f")]
+  UNSPEC_ISNORMAL))]
+  "TARGET_HARD_FLOAT"

But this was not getting recognized as a well known pattern:
CODE_FOR_isnormalxx was not getting generated.

Keeping it as following did make it work.

+(define_expand "isnormal2"

Any ideas on how I can keep this and then adjust the rest of the patterns?

Would it help to make it define_insn *name ?

>> +;; TODO: isinf is a bit tricky as it require trimodal return
>> +;;  1 if 0x80, -1 if 0x1, 0 otherwise
> It shouldn't be terrible, but it's not trivial either.
>
> bext t0, a0, 0
> neg t0
> bext t1, a0, 7
> czero.nez res, t0, t1
> snez t1, t1
> add a0, a1, a0
>
> Or something reasonably close to that.

I wrote the "C" code and saw what the compiler would do ;-) for the
baseline ISA build.

    andi    a5,a0,128
    bne    a5,zero,.L7
    slli    a5,a0,63
    srai    a0,a5,63
    ret
.L7:
    li    a0,1
    ret

But again, labels are hard (for me) in md.
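
For reference, the C model behind that assembly is essentially the
following (my reconstruction from the fclass bitmap above, not the
exact source):

/* Trimodal isinf from the fclass result: 1 for +inf (bit 7),
   -1 for -inf (bit 0), 0 otherwise.  */
int
isinf_from_fclass (unsigned int cls)
{
  if (cls & 0x80)
    return 1;
  if (cls & 0x1)
    return -1;
  return 0;
}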


> Of course that depends on zicond and zbs.  So we probably want the 
> expansion to not depend on those extensions, but generate code that is 
> easily recognized and converted into that kind of a sequence.

Is this a common enough paradigm: {bimodal,trimodal} values based on
{2,3} conditions?  If so we could do a helper for the baseline and then
the optimization.
Otherwise I can just hack up isinf conditional on zicond and zbs based
on your code above - after all, both of these extensions are likely to
be fairly common going forward.

Thx for the quick feedback.

-Vineet


[PATCH] c: Diagnose declarations that are used only in their own initializer [PR115027]

2024-06-29 Thread Martin Uecker


Probably not entirely fool-proof when using statement
expressions in initializers, but should be good enough.


Bootstrapped and regression tested on x86_64.



c: Diagnose declarations that are used only in their own initializer 
[PR115027]

Track the declaration that is currently being initialized and do not
mark it as read when it is used in its own initializer.  This then
allows it to be diagnosed as set-but-unused when it is not used
elsewhere.

PR c/115027

gcc/c/
* c-tree.h (in_decl_init): Declare variable.
* c-parser.cc (c_parser_initializer): Record decl being initialized.
* c-typeck.cc (in_decl_init): Define variable.
(mark_exp_read): Ignore decl currently being initialized.

gcc/testsuite/
* gcc.dg/pr115027.c: New test.

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 8c4e697a4e1..46060665115 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -6126,11 +6126,14 @@ c_parser_type_name (c_parser *parser, bool alignas_ok)
 static struct c_expr
 c_parser_initializer (c_parser *parser, tree decl)
 {
+  struct c_expr ret;
+  tree save = in_decl_init;
+  in_decl_init = decl;
+
   if (c_parser_next_token_is (parser, CPP_OPEN_BRACE))
-return c_parser_braced_init (parser, NULL_TREE, false, NULL, decl);
+ret = c_parser_braced_init (parser, NULL_TREE, false, NULL, decl);
   else
 {
-  struct c_expr ret;
   location_t loc = c_parser_peek_token (parser)->location;
   ret = c_parser_expr_no_commas (parser, NULL);
   if (decl != error_mark_node && C_DECL_VARIABLE_SIZE (decl))
@@ -6154,8 +6157,9 @@ c_parser_initializer (c_parser *parser, tree decl)
  || C_DECL_DECLARED_CONSTEXPR (COMPOUND_LITERAL_EXPR_DECL
(ret.value
ret = convert_lvalue_to_rvalue (loc, ret, true, true, true);
-  return ret;
 }
+  in_decl_init = save;
+  return ret;
 }
 
 /* The location of the last comma within the current initializer list,
diff --git a/gcc/c/c-tree.h b/gcc/c/c-tree.h
index 15da875a029..8013963b06d 100644
--- a/gcc/c/c-tree.h
+++ b/gcc/c/c-tree.h
@@ -740,6 +740,8 @@ extern int in_typeof;
 extern bool c_in_omp_for;
 extern bool c_omp_array_section_p;
 
+extern tree in_decl_init;
+
 extern tree c_last_sizeof_arg;
 extern location_t c_last_sizeof_loc;
 
diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc
index 455dc374b48..34279dc1d1a 100644
--- a/gcc/c/c-typeck.cc
+++ b/gcc/c/c-typeck.cc
@@ -73,6 +73,9 @@ int in_sizeof;
 /* The level of nesting inside "typeof".  */
 int in_typeof;
 
+/* When inside an initializer, this is set to the decl being initialized.  */
+tree in_decl_init;
+
 /* True when parsing OpenMP loop expressions.  */
 bool c_in_omp_for;
 
@@ -2047,7 +2050,8 @@ mark_exp_read (tree exp)
 {
 case VAR_DECL:
 case PARM_DECL:
-  DECL_READ_P (exp) = 1;
+  if (exp != in_decl_init)
+   DECL_READ_P (exp) = 1;
   break;
 case ARRAY_REF:
 case COMPONENT_REF:
diff --git a/gcc/testsuite/gcc.dg/pr115027.c b/gcc/testsuite/gcc.dg/pr115027.c
new file mode 100644
index 000..ac2699f8392
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr115027.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-Wunused-but-set-variable" } */
+
+void f(void)
+{
+   struct foo { void *p; };
+   struct foo g = { &g };  /* { dg-warning "set but not used" } */
+}



[x86 PATCH]: Additional peephole2 to use lea in round-up integer division.

2024-06-29 Thread Roger Sayle

A common idiom for implementing an integer division that rounds upwards is
to write (x + y - 1) / y.  Conveniently on x86, the two additions to form
the numerator can be performed by a single lea instruction, and indeed gcc
currently generates a lea when x and y are both registers.

int foo(int x, int y) {
  return (x+y-1)/y;
}

generates with -O2:

foo:    leal    -1(%rsi,%rdi), %eax     // 4 bytes
cltd
idivl   %esi
ret

Oddly, however, if x is a memory, gcc currently uses two instructions:

int m;
int bar(int y) {
  return (m+y-1)/y;
}

generates:

foo:    movl    m(%rip), %eax
        addl    %edi, %eax      // 2 bytes
        subl    $1, %eax        // 3 bytes
cltd
idivl   %edi
ret

This discrepancy is caused by the late decision (in peephole2) to split
an addition with a memory operand into a load followed by a reg-reg
addition.  This patch improves the situation by adding a peephole2
to recognize consecutive additions and transform them into a lea when
profitable.

My first attempt at fixing this was to use a define_insn_and_split:

(define_insn_and_split "*lea3_reg_mem_imm"
  [(set (match_operand:SWI48 0 "register_operand")
   (plus:SWI48 (plus:SWI48 (match_operand:SWI48 1 "register_operand")
   (match_operand:SWI48 2 "memory_operand"))
   (match_operand:SWI48 3 "x86_64_immediate_operand")))]
  "ix86_pre_reload_split ()"
  "#"
  "&& 1"
  [(set (match_dup 4) (match_dup 2))
   (set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
 (match_dup 3)))]
  "operands[4] = gen_reg_rtx (mode);")

using combine to combine instructions.  Unfortunately, this approach
interferes with (reload's) subtle balance of deciding when to use/avoid lea,
which can be observed as a code size regression in CSiBE.  The peephole2
approach (proposed here) uniformly improves CSiBE results.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2024-06-29  Roger Sayle  

gcc/ChangeLog
* config/i386/i386.md (peephole2): Transform two consecutive
additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.

gcc/testsuite/ChangeLog
* gcc.target/i386/lea-3.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index fd48e76..66ef234 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -6332,6 +6332,21 @@
   "TARGET_APX_NF && reload_completed"
   [(set (match_dup 0) (ashift:SWI48 (match_dup 0) (match_dup 1)))]
   "operands[1] = GEN_INT (exact_log2 (INTVAL (operands[1])));")
+
+;; The peephole2 pass may expose consecutive additions suitable for lea.
+(define_peephole2
+  [(parallel [(set (match_operand:SWI48 0 "register_operand")
+  (plus:SWI48 (match_dup 0)
+  (match_operand 1 "register_operand")))
+ (clobber (reg:CC FLAGS_REG))])
+   (parallel [(set (match_dup 0)
+  (plus:SWI48 (match_dup 0)
+  (match_operand 2 "x86_64_immediate_operand")))
+ (clobber (reg:CC FLAGS_REG))])]
+  "!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun)"
+  [(set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 0)
+ (match_dup 1))
+ (match_dup 2)))])
 
 ;; Add instructions
 


[PATCH] c: Fix ICE for incorrect code in comptypes_verify [PR115696]

2024-06-29 Thread Martin Uecker


This adds missing code for handling error marks.


Bootstrapped and regression tested on x86_64.



c: Fix ICE for incorrect code in comptypes_verify [PR115696]

The new verification code produces an ICE for incorrect code.  Add the
same logic as already used in comptypes to bail out under certain
conditions.

PR c/115696

gcc/c/
* c-typeck.cc (comptypes_verify): Bail out for
identical, empty, and erroneous input types.

gcc/testsuite/
* gcc.dg/pr115696.c: New test.

diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc
index ffcab7df4d3..e486ac04f9c 100644
--- a/gcc/c/c-typeck.cc
+++ b/gcc/c/c-typeck.cc
@@ -1175,6 +1175,10 @@ common_type (tree t1, tree t2)
 static bool
 comptypes_verify (tree type1, tree type2)
 {
+  if (type1 == type2 || !type1 || !type2
+  || TREE_CODE (type1) == ERROR_MARK || TREE_CODE (type2) == ERROR_MARK)
+return true;
+
   if (TYPE_CANONICAL (type1) != TYPE_CANONICAL (type2)
   && !TYPE_STRUCTURAL_EQUALITY_P (type1)
   && !TYPE_STRUCTURAL_EQUALITY_P (type2))
diff --git a/gcc/testsuite/gcc.dg/pr115696.c b/gcc/testsuite/gcc.dg/pr115696.c
new file mode 100644
index 000..a7c8d87cb06
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr115696.c
@@ -0,0 +1,7 @@
+/* { dg-do compile } */
+/* { dg-options "-Wno-implicit-int" } */
+
+a();   /* { dg-warning "no type or storage" } */
+a; /* { dg-error "redeclared" } */
+   /* { dg-warning "no type or storage" "" { target *-*-* } .-1 } */
+a();   /* { dg-warning "no type or storage" } */



[PATCH] c: Fix ICE for redeclaration of structs with different alignment [PR114727]

2024-06-29 Thread Martin Uecker


This fixes an ICE when redeclaring a struct and having
an aligned attribute in one version in C23.


Bootstrapped and regression tested on x86_64.



c: Fix ICE for redeclaration of structs with different alignment [PR114727]

For redeclarations of struct in C23, if one has an alignment attribute
that makes the alignment different, we later get an ICE in verify_types.
This patch disallows such redeclarations by declaring such types to
be different.

PR c/114727

gcc/c/
* c-typeck.cc (tagged_types_tu_compatible): Add test.

gcc/testsuite/
* gcc.dg/pr114727.c: New test.

diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc
index e486ac04f9c..455dc374b48 100644
--- a/gcc/c/c-typeck.cc
+++ b/gcc/c/c-typeck.cc
@@ -1603,6 +1603,9 @@ tagged_types_tu_compatible_p (const_tree t1, const_tree t2,
  != TYPE_REVERSE_STORAGE_ORDER (t2)))
 return false;
 
+  if (TYPE_USER_ALIGN (t1) != TYPE_USER_ALIGN (t2))
+data->different_types_p = true;
+
   /* For types already being looked at in some active
  invocation of this function, assume compatibility.
  The cache is built as a linked list on the stack
diff --git a/gcc/testsuite/gcc.dg/pr114727.c b/gcc/testsuite/gcc.dg/pr114727.c
new file mode 100644
index 000..12949590ce0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr114727.c
@@ -0,0 +1,6 @@
+/* { dg-do compile }
+ * { dg-options "-std=c23 -g" } */
+
+#define Y [[gnu::aligned(128)]]
+extern struct Y foo { int x; } x;
+struct foo { int x; }; /* { dg-error "redefinition" } */



Re: [RFC PATCH] cse: Add another CSE pass after split1

2024-06-29 Thread Jeff Law




On 6/27/24 3:56 PM, Palmer Dabbelt wrote:

This is really more of a question than a patch.

Looking at PR/115687 I managed to convince myself there's a general
class of problems here: splitting might produce constant subexpressions,
but as far as I can tell there's nothing to eliminate those constant
subexpressions.  So I very quickly threw together a CSE that doesn't
fold expressions, and it does eliminate the high-part constants in
question.

At that point I realized the implementation here is bogus: it's not the
folding that's the problem, but introducing new expressions post-split
would break things -- or at least I think it would, we'd end up with
insns the backends don't expect to have that late.  I'm not sure if
split2 would end up cleaning all that up at a functional level, but it
certainly seems like optimization would be pretty far off the rails at
that point and thus doesn't seem like a good idea.  I'm also not sure
how effective this would be without doing the folding, as without
folding we can only eliminate the last insn in the constant sequence --
that's fine here, but it wouldn't work for more complicated stuff.

So I think if this was to go anywhere we'd want to have a CSE that
really only eliminates expressions (ie, doesn't do any of the other
juggling to try and produce more constant subexpressions).  There's a
few places where new expressions can be introduced, so it'd probably be
better done as a new cse_insn-type function instead of just a flag.  It
seems somewhat manageable to write, though.

That said, I really don't know what I'm doing here.  So I figured I'd
just send out what I'd put together, mostly as a way to ask if it's
worth putting time into this?
The biggest problem with running CSE again is the cost.  It's been a 
while since I've dug into rtl compile-time issues, but traditionally CSE 
is the biggest time hog in the RTL pipeline.


There's a natural tension between exposing the synthesis early which 
improves CSE, but harms combine vs exposing it late which helps combine 
but hurts CSE.  Some (myself included) tend to lean towards improving 
combine, while others lean towards improving CSE.



As I mentioned in the context of Vineet's recent changes for cactubssn, 
I do think we want to go back and revisit the mvconst_internal pattern. 
The idea was that it would help combine discover cases where it can 
simplify logical/shift ops, without having to write some quite ugly 
combiner patterns to rediscover the constants.  But it has some 
undesirable fallout as well, particularly in forcing us to write 
define_insn_and_split patterns rather than define_split patterns which 
has the undesirable effect of mucking up combine's costing model for 
splitting.


The space you're poking at has been mined quite a bit through the years 
by numerous sharp folks and there are no simple answers for how to 
handle constants, splitting and the like.  I fully expect that anything 
we do is going to have negative fallout.  It's inherent in this problem 
space.


The other thing to remember is that constant synthesis is rarely a major 
performance driver.  They're constants :-)  I spent months on a similar 
project many years ago (customer contract) -- and while the end result 
looked really good if you were staring at assembly code all day, but 
from a real world performance standpoint it was undetectable.  Certainly 
wasn't worth the effort put in (though some of the infrastructure work 
along the way really cleaned up warts in the tree data structures).


jeff


Re: [PATCH] _Hashtable fancy pointer support

2024-06-29 Thread François Dumont


On 27/06/2024 22:30, Jonathan Wakely wrote:

On Thu, 27 Jun 2024 at 20:25, François Dumont  wrote:

Thanks for the link; based on it I removed some of the nullptr usages,
keeping only assignments.

That's not necessary. A nullable pointer type is equality comparable
with nullptr_t, and nullptr can be implicitly converted to the pointer
type.


Do you prefer that I fully roll back this part, then?

In this new version I restored some nullptr usages and kept the removals 
only when avoiding the conversion without impacting code clarity. But 
that's questionable of course, let me know.




Your _S_cast function is wrong though: you can't construct the fancy
pointer type from a raw pointer. You need to use
pointer_traits<__node_ptr>::pointer_to(*rawptr).
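
A minimal sketch of that conversion (illustrative, not the patch's
code):

#include <memory>

/* Convert a raw pointer back to the allocator's fancy pointer type;
   pointer_traits::pointer_to is the portable way to do this.  */
template<typename _NodePtr>
_NodePtr
to_fancy (typename std::pointer_traits<_NodePtr>::element_type* __raw)
{
  return std::pointer_traits<_NodePtr>::pointer_to(*__raw);
}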


The _S_cast taking a __node_base* is only used when the allocator does 
not define any fancy pointer.


In this new version I've added a comment on it and simply used 
__node_base* for clarity.





François

On 26/06/2024 23:41, Jonathan Wakely wrote:

On Wed, 26 Jun 2024 at 21:39, François Dumont  wrote:

Hi

Here is my proposal to add support for fancy allocator pointer.

The only place where we still have C pointers is at the
iterator::pointer level but it's consistent with std::list
implementation and also logical considering that we do not get
value_type pointers from the allocator.

I also wondered if it was ok to use nullptr in different places or if I
should rather do __node_ptr{}. But recent modifications are using
nullptr so I think it's fine.

I haven't reviewed the patch yet, but this answers the nullptr question:
https://en.cppreference.com/w/cpp/named_req/NullablePointer
(aka Cpp17NullablePointer in the C++ standard).
diff --git a/libstdc++-v3/include/bits/hashtable.h 
b/libstdc++-v3/include/bits/hashtable.h
index 361da2b3b4d..474b5e7a96c 100644
--- a/libstdc++-v3/include/bits/hashtable.h
+++ b/libstdc++-v3/include/bits/hashtable.h
@@ -200,8 +200,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 _RehashPolicy, _Traits>,
   private __detail::_Hashtable_alloc<
__alloc_rebind<_Alloc,
-  __detail::_Hash_node<_Value,
-   _Traits::__hash_cached::value>>>,
+  __detail::__get_node_type<_Alloc, _Value,
+
_Traits::__hash_cached::value>>>,
   private _Hashtable_enable_default_ctor<_Equal, _Hash, _Alloc>
 {
   static_assert(is_same::type, _Value>::value,
@@ -216,21 +216,23 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   using __traits_type = _Traits;
   using __hash_cached = typename __traits_type::__hash_cached;
   using __constant_iterators = typename 
__traits_type::__constant_iterators;
-  using __node_type = __detail::_Hash_node<_Value, __hash_cached::value>;
+  using __node_type = __detail::__get_node_type<_Alloc, _Value,
+   _Traits::__hash_cached::value>;
   using __node_alloc_type = __alloc_rebind<_Alloc, __node_type>;
-
   using __hashtable_alloc = __detail::_Hashtable_alloc<__node_alloc_type>;
 
   using __node_value_type =
__detail::_Hash_node_value<_Value, __hash_cached::value>;
   using __node_ptr = typename __hashtable_alloc::__node_ptr;
-  using __value_alloc_traits =
-   typename __hashtable_alloc::__value_alloc_traits;
   using __node_alloc_traits =
typename __hashtable_alloc::__node_alloc_traits;
+  using __value_alloc_traits =
+   typename __node_alloc_traits::template rebind_traits<_Value>;
   using __node_base = typename __hashtable_alloc::__node_base;
   using __node_base_ptr = typename __hashtable_alloc::__node_base_ptr;
+  using __node_base_ptr_traits = std::pointer_traits<__node_base_ptr>;
   using __buckets_ptr = typename __hashtable_alloc::__buckets_ptr;
+  using __buckets_ptr_traits = std::pointer_traits<__buckets_ptr>;
 
   using __insert_base = __detail::_Insert<_Key, _Value, _Alloc, 
_ExtractKey,
  _Equal, _Hash,
@@ -258,15 +260,15 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
   using const_iterator = typename __insert_base::const_iterator;
 
-  using local_iterator = __detail::_Local_iterator;
+  using local_iterator = __detail::__local_iterator<
+   __node_ptr, key_type, value_type,
+   _ExtractKey, _Hash, _RangeHash, _Unused,
+   __constant_iterators::value, __hash_cached::value>;
 
-  using const_local_iterator = __detail::_Local_const_iterator<
-   key_type, _Value,
-   _ExtractKey, _Hash, _RangeHash, _Unused,
-   __constant_iterators::value, __hash_cached::value>;
+  using const_local_iterator = __detail::__const_local_iterator<
+   __node_ptr, key_type, value_type,
+   _ExtractKey, _Hash, _RangeHash, _Unused,
+   __constant_iterators::value, __hash_cached::value>;
 
 private:
  

Re: [PATCH] RISC-V: use fclass insns to implement isfinite and isnormal builtins

2024-06-29 Thread Jeff Law




On 6/28/24 6:53 PM, Vineet Gupta wrote:

Currently isfinite and isnormal use float compare instructions with fp
flags save/restored around them. Our perf team complained this could be
costly in uarch. RV Base ISA already has FCLASS.{d,s,h} instruction to
do FP compares w/o disturbing FP exception flags.

Coincidentally, upstream just a few days back got support for the
corresponding optabs.  All that is needed is to wire these up in the
backend.

I was also hoping to get __builtin_isinf() done but unfortunately it
requires a little more rtl foo/bar to implement a tri-modal return.

Currently going thru CI testing.

gcc/ChangeLog:
* config/riscv/riscv.md: Add UNSPEC_FCLASS, UNSPEC_ISFINITE,
USPEC_ISNORMAL.
define_insn for fclass.
define_expand for isfinite and isnormal.

gcc/testsuite/ChangeLog:
* gcc.target/riscv/fclass.c: New test.







+;; fclass instruction output bitmap
+;;   0 negative infinity
+;;   1 negative normal number.
+;;   2 negative subnormal number.
+;;   3 -0
+;;   4 +0
+;;   5 positive subnormal number.
+;;   6 positive normal number.
+;;   7 positive infinity
+;;   8 signaling NaN.
+;;   9 quiet NaN
+(define_insn "fclass"
+  [(set (match_operand:SI  0 "register_operand" "=r")
+   (unspec:SI [(match_operand:ANYF 1 "register_operand" " f")]
+  UNSPEC_FCLASS))]
+  "TARGET_HARD_FLOAT"
+  "fclass.\t%0,%1"
+  [(set_attr "type" "fcmp")
+   (set_attr "mode" "")])
So I realize the result only has 10 bits of output, but I think would it 
make more sense to use X rather than SI for the result.  When we use 
SImode on rv64 we have to deal with potential extensions.  In this case 
we know the values are properly extended, so we could just claim it's 
DImode and I think everything would "just work" and we wouldn't have to 
worry about unnecessary sign extensions creeping in.





+
+;; TODO: isinf is a bit tricky as it require trimodal return
+;;  1 if 0x80, -1 if 0x1, 0 otherwise

It shouldn't be terrible, but it's not trivial either.

bext t0, a0, 0
neg t0
bext t1, a0, 7
czero.nez res, t0, t1
snez t1, t1
add a0, a1, a0

Or something reasonably close to that.

Of course that depends on zicond and zbs.  So we probably want the 
expansion to not depend on those extensions, but generate code that is 
easily recognized and converted into that kind of a sequence.


Jeff


[PATCH v10] C, ObjC: Add -Wunterminated-string-initialization

2024-06-29 Thread Alejandro Colomar
Warn about the following:

char  s[3] = "foo";

Initializing a char array with a string literal of the same length as
the size of the array is usually a mistake.  Rarely is the case where
one wants to create a non-terminated character sequence from a string
literal.

In some cases, for writing faster code, one may want to use arrays
instead of pointers, since that removes the need for storing an array of
pointers apart from the strings themselves.

char  *log_levels[]   = { "info", "warning", "err" };
vs.
char  log_levels[][7] = { "info", "warning", "err" };

This forces the programmer to specify a size, which might change if a
new entry is later added.  Having no way to enforce null termination is
very dangerous, however, so it is useful to have a warning for this, so
that the compiler can make sure that the programmer didn't make any
mistakes.  This warning catches the bug above, so that the programmer
will be able to fix it and write:

char  log_levels[][8] = { "info", "warning", "err" };

This warning already existed as part of -Wc++-compat, but this patch
allows enabling it separately.  It is also included in -Wextra, since
it may not always be desired (when unterminated character sequences are
wanted), but it's likely to be desired in most cases.

Since Wc++-compat now includes this warning, the test has to be modified
to expect the text of the new warning too, in .

gcc/c-family/ChangeLog:

* c.opt: Add -Wunterminated-string-initialization.

gcc/c/ChangeLog:

* c-typeck.cc (digest_init): Separate warnings about character
  arrays being initialized as unterminated character sequences
  with string literals, from -Wc++-compat, into a new warning,
  -Wunterminated-string-initialization.

gcc/ChangeLog:

* doc/invoke.texi: Document the new
  -Wunterminated-string-initialization.

gcc/testsuite/ChangeLog:

* gcc.dg/Wcxx-compat-14.c: Adapt the test to match the new text
  of the warning, which doesn't say anything about C++ anymore.
* gcc.dg/Wunterminated-string-initialization.c: New test.

Link: 
Link: 
Link: 

Closes: 
Acked-by: Doug McIlroy 
Acked-by: Mike Stump 
[Sandra: The documentation parts of the patch are OK.]
Reviewed-by: Sandra Loosemore 
Reviewed-by: Martin Uecker 
Cc: "G. Branden Robinson" 
Cc: Ralph Corderoy 
Cc: Dave Kemper 
Cc: Larry McVoy 
Cc: Andrew Pinski 
Cc: Jonathan Wakely 
Cc: Andrew Clayton 
Cc: David Malcolm 
Cc: Joseph Myers 
Cc: Konstantin Kharlamov 
Signed-off-by: Alejandro Colomar 
---

Hi!

v10 changes:

-  Fix accident introduced while fixing rebase conflict in v9.
-  Upgrade an informal "As a member of the peanut gallery, I like the
   patch." into an Acked-by: Mike Stump .

See full range-diff below.

Have a lovely day!
Alex


Range-diff against v9:
1:  1010e7d7ec2 ! 1:  5a567664d7c C, ObjC: Add 
-Wunterminated-string-initialization
@@ Commit message
 Link: 

 Closes: 
 Acked-by: Doug McIlroy 
+Acked-by: Mike Stump 
 [Sandra: The documentation parts of the patch are OK.]
 Reviewed-by: Sandra Loosemore 
 Reviewed-by: Martin Uecker 
@@ Commit message
 Cc: Jonathan Wakely 
 Cc: Andrew Clayton 
 Cc: David Malcolm 
-Cc: Mike Stump 
 Cc: Joseph Myers 
 Cc: Konstantin Kharlamov 
 Signed-off-by: Alejandro Colomar 
@@ gcc/doc/invoke.texi: name is still supported, but the newer name is more 
descriptive.)
  -Wstring-compare
  -Wtype-limits
  -Wuninitialized
-+-Wshift-negative-value @r{(in C++11 to C++17 and in C99 and newer)}
 +-Wunterminated-string-initialization
  -Wunused-parameter @r{(only with} @option{-Wunused} @r{or} 
@option{-Wall}@r{)}
  -Wunused-but-set-parameter @r{(only with} @option{-Wunused} @r{or} 
@option{-Wall}@r{)}}

 gcc/c-family/c.opt|  4 
 gcc/c/c-typeck.cc |  6 +++---
 gcc/doc/invoke.texi   | 20 ++-
 gcc/testsuite/gcc.dg/Wcxx-compat-14.c |  2 +-
 .../Wunterminated-string-initialization.c |  6 ++
 5 files changed, 33 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/Wunterminated-string-initialization.c

diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index 864ef4e3b3d..c2e43fe13de 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -1464,6 +1464,10 @@ Wunsuffixed-float-constants
 C ObjC Var(warn_unsuffixed_float_constants) Warning
 Warn about unsuffixed float constants.
 

Re: [PATCH v9] C, ObjC: Add -Wunterminated-string-initialization

2024-06-29 Thread Alejandro Colomar
On Sat, Jun 29, 2024 at 02:58:48PM GMT, Alejandro Colomar wrote:
> On Sat, Jun 29, 2024 at 02:52:40PM GMT, Alejandro Colomar wrote:
> > @@ -6450,6 +6452,8 @@ name is still supported, but the newer name is more 
> > descriptive.)
> >  -Wstring-compare
> >  -Wtype-limits
> >  -Wuninitialized
> > +-Wshift-negative-value @r{(in C++11 to C++17 and in C99 and newer)}
> > +-Wunterminated-string-initialization
> 
> Whoops; while in the rebase resolution of conflicts this seemed to
> be an addition from elsewhere.  I didn't intend to add it here.  I'll
> fix that, and send a v10.  I'll investigate how this line appeared.

Ahhh; it seems that line had been removed, not added, somewhere between
my v8 and v9.  I misunderstood the git conflict.

> 
> >  -Wunused-parameter @r{(only with} @option{-Wunused} @r{or} 
> > @option{-Wall}@r{)}
> >  -Wunused-but-set-parameter @r{(only with} @option{-Wunused} @r{or} 
> > @option{-Wall}@r{)}}


-- 



signature.asc
Description: PGP signature


Re: [PATCH v9] C, ObjC: Add -Wunterminated-string-initialization

2024-06-29 Thread Alejandro Colomar
On Sat, Jun 29, 2024 at 02:52:40PM GMT, Alejandro Colomar wrote:
> Warn about the following:
> 
> char  s[3] = "foo";
> 
> Initializing a char array with a string literal of the same length as
> the size of the array is usually a mistake.  Rare is the case where
> one wants to create a non-terminated character sequence from a string
> literal.
> 
> In some cases, for writing faster code, one may want to use arrays
> instead of pointers, since that removes the need for storing an array of
> pointers apart from the strings themselves.
> 
> char  *log_levels[]   = { "info", "warning", "err" };
> vs.
> char  log_levels[][7] = { "info", "warning", "err" };
> 
> This forces the programmer to specify a size, which might change if a
> new entry is later added.  Having no way to enforce null termination is
> very dangerous, however, so it is useful to have a warning for this, so
> that the compiler can make sure that the programmer didn't make any
> mistakes.  This warning catches the bug above, so that the programmer
> will be able to fix it and write:
> 
> char  log_levels[][8] = { "info", "warning", "err" };
> 
> This warning already existed as part of -Wc++-compat, but this patch
> allows enabling it separately.  It is also included in -Wextra, since
> it may not always be desired (when unterminated character sequences are
> wanted), but it's likely to be desired in most cases.
> 
> Since Wc++-compat now includes this warning, the test has to be modified
> to expect the text of the new warning too, in .
> 
> gcc/c-family/ChangeLog:
> 
>   * c.opt: Add -Wunterminated-string-initialization.
> 
> gcc/c/ChangeLog:
> 
>   * c-typeck.cc (digest_init): Separate warnings about character
> arrays being initialized as unterminated character sequences
> with string literals, from -Wc++-compat, into a new warning,
> -Wunterminated-string-initialization.
> 
> gcc/ChangeLog:
> 
>   * doc/invoke.texi: Document the new
> -Wunterminated-string-initialization.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/Wcxx-compat-14.c: Adapt the test to match the new text
> of the warning, which doesn't say anything about C++ anymore.
>   * gcc.dg/Wunterminated-string-initialization.c: New test.
> 
> Link: 
> Link: 
> Link: 
> 
> Closes: 
> Acked-by: Doug McIlroy 
> [Sandra: The documentation parts of the patch are OK.]
> Reviewed-by: Sandra Loosemore 
> Reviewed-by: Martin Uecker 
> Cc: "G. Branden Robinson" 
> Cc: Ralph Corderoy 
> Cc: Dave Kemper 
> Cc: Larry McVoy 
> Cc: Andrew Pinski 
> Cc: Jonathan Wakely 
> Cc: Andrew Clayton 
> Cc: David Malcolm 
> Cc: Mike Stump 
> Cc: Joseph Myers 
> Cc: Konstantin Kharlamov 
> Signed-off-by: Alejandro Colomar 
> ---
> 
> Hi!
> 
> Here's another round of this patch.
> 
> v9 changes:
> 
> -  Reviewed by Martin.
> -  Add link to related bugzilla bug (and CC its reporter).
> -  Rebase on top of git master.
> 
> See full range-diff below.
> 
> Have a lovely day,
> Alex
> 
> P.S.: I'm looking for a job; if anyone is interested, please contact me.
> 
> Range-diff against v8:
> 1:  06236d0aa05 ! 1:  1010e7d7ec2 C, ObjC: Add 
> -Wunterminated-string-initialization
> @@ Commit message
>  Link: 
> 
>  Link: 
> 
>  Link: 
> 
> +Closes: 
>  Acked-by: Doug McIlroy 
>  [Sandra: The documentation parts of the patch are OK.]
>  Reviewed-by: Sandra Loosemore 
> +Reviewed-by: Martin Uecker 
>  Cc: "G. Branden Robinson" 
>  Cc: Ralph Corderoy 
>  Cc: Dave Kemper 
> @@ Commit message
>  Cc: Andrew Pinski 
>  Cc: Jonathan Wakely 
>  Cc: Andrew Clayton 
> -Cc: Martin Uecker 
>  Cc: David Malcolm 
>  Cc: Mike Stump 
>  Cc: Joseph Myers 
> +Cc: Konstantin Kharlamov 
>  Signed-off-by: Alejandro Colomar 
>  
>   ## gcc/c-family/c.opt ##
> @@ gcc/c/c-typeck.cc: digest_init (location_t init_loc, tree type, tree 
> init, tree
>   ## gcc/doc/invoke.texi ##
>  @@ gcc/doc/invoke.texi: Objective-C and Objective-C++ Dialects}.
>   -Wsystem-headers  -Wtautological-compare  -Wtrampolines  -Wtrigraphs
> - -Wtrivial-auto-var-init -Wtsan -Wtype-limits  -Wundef
> + -Wtrivial-auto-var-init  -Wno-tsan  -Wtype-limits  -Wundef
>   -Wuninitialized  -Wunknown-pragmas
>  --Wunsuffixed-float-constants  -Wunused
>  

[PATCH v9] C, ObjC: Add -Wunterminated-string-initialization

2024-06-29 Thread Alejandro Colomar
Warn about the following:

char  s[3] = "foo";

Initializing a char array with a string literal of the same length as
the size of the array is usually a mistake.  Rare is the case where
one wants to create a non-terminated character sequence from a string
literal.

In some cases, for writing faster code, one may want to use arrays
instead of pointers, since that removes the need for storing an array of
pointers apart from the strings themselves.

char  *log_levels[]   = { "info", "warning", "err" };
vs.
char  log_levels[][7] = { "info", "warning", "err" };

This forces the programmer to specify a size, which might change if a
new entry is later added.  Having no way to enforce null termination is
very dangerous, however, so it is useful to have a warning for this, so
that the compiler can make sure that the programmer didn't make any
mistakes.  This warning catches the bug above, so that the programmer
will be able to fix it and write:

char  log_levels[][8] = { "info", "warning", "err" };

This warning already existed as part of -Wc++-compat, but this patch
allows enabling it separately.  It is also included in -Wextra, since
it may not always be desired (when unterminated character sequences are
wanted), but it's likely to be desired in most cases.

Since Wc++-compat now includes this warning, the test has to be modified
to expect the text of the new warning too, in .

gcc/c-family/ChangeLog:

* c.opt: Add -Wunterminated-string-initialization.

gcc/c/ChangeLog:

* c-typeck.cc (digest_init): Separate warnings about character
  arrays being initialized as unterminated character sequences
  with string literals, from -Wc++-compat, into a new warning,
  -Wunterminated-string-initialization.

gcc/ChangeLog:

* doc/invoke.texi: Document the new
  -Wunterminated-string-initialization.

gcc/testsuite/ChangeLog:

* gcc.dg/Wcxx-compat-14.c: Adapt the test to match the new text
  of the warning, which doesn't say anything about C++ anymore.
* gcc.dg/Wunterminated-string-initialization.c: New test.

Link: 
Link: 
Link: 

Closes: 
Acked-by: Doug McIlroy 
[Sandra: The documentation parts of the patch are OK.]
Reviewed-by: Sandra Loosemore 
Reviewed-by: Martin Uecker 
Cc: "G. Branden Robinson" 
Cc: Ralph Corderoy 
Cc: Dave Kemper 
Cc: Larry McVoy 
Cc: Andrew Pinski 
Cc: Jonathan Wakely 
Cc: Andrew Clayton 
Cc: David Malcolm 
Cc: Mike Stump 
Cc: Joseph Myers 
Cc: Konstantin Kharlamov 
Signed-off-by: Alejandro Colomar 
---

Hi!

Here's another round of this patch.

v9 changes:

-  Reviewed by Martin.
-  Add link to related bugzilla bug (and CC its reporter).
-  Rebase on top of git master.

See full range-diff below.

Have a lovely day,
Alex

P.S.: I'm looking for a job; if anyone is interested, please contact me.

Range-diff against v8:
1:  06236d0aa05 ! 1:  1010e7d7ec2 C, ObjC: Add 
-Wunterminated-string-initialization
@@ Commit message
 Link: 
 Link: 
 Link: 

+Closes: 
 Acked-by: Doug McIlroy 
 [Sandra: The documentation parts of the patch are OK.]
 Reviewed-by: Sandra Loosemore 
+Reviewed-by: Martin Uecker 
 Cc: "G. Branden Robinson" 
 Cc: Ralph Corderoy 
 Cc: Dave Kemper 
@@ Commit message
 Cc: Andrew Pinski 
 Cc: Jonathan Wakely 
 Cc: Andrew Clayton 
-Cc: Martin Uecker 
 Cc: David Malcolm 
 Cc: Mike Stump 
 Cc: Joseph Myers 
+Cc: Konstantin Kharlamov 
 Signed-off-by: Alejandro Colomar 
 
  ## gcc/c-family/c.opt ##
@@ gcc/c/c-typeck.cc: digest_init (location_t init_loc, tree type, tree 
init, tree
  ## gcc/doc/invoke.texi ##
 @@ gcc/doc/invoke.texi: Objective-C and Objective-C++ Dialects}.
  -Wsystem-headers  -Wtautological-compare  -Wtrampolines  -Wtrigraphs
- -Wtrivial-auto-var-init -Wtsan -Wtype-limits  -Wundef
+ -Wtrivial-auto-var-init  -Wno-tsan  -Wtype-limits  -Wundef
  -Wuninitialized  -Wunknown-pragmas
 --Wunsuffixed-float-constants  -Wunused
 +-Wunsuffixed-float-constants
@@ gcc/doc/invoke.texi: Objective-C and Objective-C++ Dialects}.
  -Wunused-const-variable  -Wunused-const-variable=@var{n}
  -Wunused-function  -Wunused-label  -Wunused-local-typedefs
 @@ gcc/doc/invoke.texi: name is still supported, but the newer name is 
more descriptive.)
- -Wredundant-move 

[to-be-committed][RISC-V][V4] movmem for RISCV with V extension

2024-06-29 Thread Jeff Law


I hadn't updated my repo on the host where I handle email, so it picked 
up the older version of this patch without the testsuite fix.  So, V4 
with the testsuite option for lmul fixed.




--

And Sergei's movmem patch.  Just trivial testsuite adjustment for an
option name change and a whitespace fix from me.

I've spun this in my tester for rv32 and rv64.  I'll wait for pre-commit
CI before taking further action.

Just a reminder, this patch is designed to handle the case where we can
issue a single vector load/store which avoids all the complexities of
determining which direction to copy.
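
As a minimal sketch of the direction problem being sidestepped (names are
ours, not from the patch): a plain forward copy is wrong once dst overlaps
src at a higher address, because it clobbers bytes it has yet to read,
whereas a single full-width vector load followed by one store completes
every read before any write.

void forward_copy (char *dst, const char *src, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    dst[i] = src[i];   /* wrong when dst > src and the regions overlap;
                          memmove must copy backwards in that case */
}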

--



gcc/ChangeLog

 * config/riscv/riscv.md (movmem<mode>): New expander.

gcc/testsuite/ChangeLog

 PR target/112109
 * gcc.target/riscv/rvv/base/movmem-1.c: New test.

---
 gcc/config/riscv/riscv.md | 22 +++
 .../gcc.target/riscv/rvv/base/movmem-1.c  | 60 +++
 2 files changed, 82 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c

diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index ff37125e3f2..c0c960353eb 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -2723,6 +2723,28 @@ (define_expand "setmem<mode>"
 FAIL;
 })
 
+;; Inlining general memmove is a pessimisation: we can't avoid having to
+;; decide which direction to copy at runtime, which is costly in instruction
+;; count.  However, when the entire move fits in one vector operation we can
+;; do all reads before doing any writes, so the direction doesn't matter;
+;; generate the inline vector code in such situations.
+;; N.B. prefer the scalar path for tiny memmoves.
+(define_expand "movmem<mode>"
+  [(parallel [(set (match_operand:BLK 0 "general_operand")
+                   (match_operand:BLK 1 "general_operand"))
+              (use (match_operand:P 2 "const_int_operand"))
+              (use (match_operand:SI 3 "const_int_operand"))])]
+  "TARGET_VECTOR"
+{
+  if ((INTVAL (operands[2]) >= TARGET_MIN_VLEN / 8)
+      && (INTVAL (operands[2]) <= TARGET_MIN_VLEN)
+      && riscv_vector::expand_block_move (operands[0], operands[1],
+                                          operands[2]))
+    DONE;
+  else
+    FAIL;
+})
+
 ;; Expand in-line code to clear the instruction cache between operand[0] and
 ;; operand[1].
 (define_expand "clear_cache"
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c
new file mode 100644
index 000..d9d4a70a392
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/movmem-1.c
@@ -0,0 +1,60 @@
+/* { dg-do compile } */
+/* { dg-add-options riscv_v } */
+/* { dg-additional-options "-O3 -mrvv-max-lmul=dynamic" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#define MIN_VECTOR_BYTES (__riscv_v_min_vlen / 8)
+
+/* Tiny memmoves should not be vectorised.
+** f1:
+**  li\s+a2,\d+
+**  tail\s+memmove
+*/
+char *
+f1 (char *a, char const *b)
+{
+  return __builtin_memmove (a, b, MIN_VECTOR_BYTES - 1);
+}
+
+/* Vectorise+inline minimum vector register width with LMUL=1
+** f2:
+**  (
+**  vsetivli\s+zero,16,e8,m1,ta,ma
+**  |
+**  li\s+[ta][0-7],\d+
+**  vsetvli\s+zero,[ta][0-7],e8,m1,ta,ma
+**  )
+**  vle8\.v\s+v\d+,0\(a1\)
+**  vse8\.v\s+v\d+,0\(a0\)
+**  ret
+*/
+char *
+f2 (char *a, char const *b)
+{
+  return __builtin_memmove (a, b, MIN_VECTOR_BYTES);
+}
+
+/* Vectorise+inline up to LMUL=8
+** f3:
+**  li\s+[ta][0-7],\d+
+**  vsetvli\s+zero,[ta][0-7],e8,m8,ta,ma
+**  vle8\.v\s+v\d+,0\(a1\)
+**  vse8\.v\s+v\d+,0\(a0\)
+**  ret
+*/
+char *
+f3 (char *a, char const *b)
+{
+  return __builtin_memmove (a, b, MIN_VECTOR_BYTES * 8);
+}
+
+/* Don't vectorise if the move is too large for one operation
+** f4:
+**  li\s+a2,\d+
+**  tail\s+memmove
+*/
+char *
+f4 (char *a, char const *b)
+{
+  return __builtin_memmove (a, b, MIN_VECTOR_BYTES * 8 + 1);
+}


[PATCH] c: Add support for byte arrays in C2Y

2024-06-29 Thread Martin Uecker


This marks structures which include a byte array
as typeless storage.


Bootstrapped and regression tested on x86_64.



c: Add support for byte arrays in C2Y

Getting correct aliasing behavior requires that structures and unions
that contain a byte array, i.e. an array of non-atomic character
type (N3254), are marked with TYPE_TYPELESS_STORAGE.

gcc/c/
* c-decl.cc (grokdeclarator, finish_struct): Set and
propagate TYPE_TYPELESS_STORAGE.

gcc/testsuite/
* gcc.dg/c2y-byte-alias-1.c: New test.
* gcc.dg/c2y-byte-alias-2.c: New test.
* gcc.dg/c2y-byte-alias-3.c: New test.

diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc
index 0eac266471f..65561f3cbcc 100644
--- a/gcc/c/c-decl.cc
+++ b/gcc/c/c-decl.cc
@@ -7499,12 +7499,17 @@ grokdeclarator (const struct c_declarator *declarator,
   modify the shared type, so we gcc_assert (itype)
   below.  */
  {
+          bool typeless = flag_isoc2y
+                          && ((char_type_p (type)
+                               && !(type_quals & TYPE_QUAL_ATOMIC))
+                              || (AGGREGATE_TYPE_P (type)
+                                  && TYPE_TYPELESS_STORAGE (type)));
+
addr_space_t as = DECODE_QUAL_ADDR_SPACE (type_quals);
if (!ADDR_SPACE_GENERIC_P (as) && as != TYPE_ADDR_SPACE (type))
  type = build_qualified_type (type,
   ENCODE_QUAL_ADDR_SPACE (as));
-
-   type = build_array_type (type, itype);
+   type = build_array_type (type, itype, typeless);
  }
 
if (type != error_mark_node)
@@ -9656,6 +9661,10 @@ finish_struct (location_t loc, tree t, tree fieldlist, 
tree attributes,
   if (DECL_NAME (x)
  || RECORD_OR_UNION_TYPE_P (TREE_TYPE (x)))
saw_named_field = true;
+
+  if (AGGREGATE_TYPE_P (TREE_TYPE (x))
+ && TYPE_TYPELESS_STORAGE (TREE_TYPE (x)))
+   TYPE_TYPELESS_STORAGE (t) = true;
 }
 
   detect_field_duplicates (fieldlist);
@@ -9856,6 +9865,7 @@ finish_struct (location_t loc, tree t, tree fieldlist, 
tree attributes,
   TYPE_FIELDS (x) = TYPE_FIELDS (t);
   TYPE_LANG_SPECIFIC (x) = TYPE_LANG_SPECIFIC (t);
   TYPE_TRANSPARENT_AGGR (x) = TYPE_TRANSPARENT_AGGR (t);
+  TYPE_TYPELESS_STORAGE (x) = TYPE_TYPELESS_STORAGE (t);
   C_TYPE_FIELDS_READONLY (x) = C_TYPE_FIELDS_READONLY (t);
   C_TYPE_FIELDS_VOLATILE (x) = C_TYPE_FIELDS_VOLATILE (t);
   C_TYPE_FIELDS_NON_CONSTEXPR (x) = C_TYPE_FIELDS_NON_CONSTEXPR (t);
diff --git a/gcc/testsuite/gcc.dg/c2y-byte-alias-1.c 
b/gcc/testsuite/gcc.dg/c2y-byte-alias-1.c
new file mode 100644
index 000..30bc2c09c2f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/c2y-byte-alias-1.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-options "-std=c2y -O2" } */
+
+struct f { _Alignas(int) char buf[sizeof(int)]; };
+struct f2 { struct f x; };
+union g { _Alignas(int) char buf[sizeof(int)]; };
+
+[[gnu::noinline]]
+int foo(struct f *p, int *q)
+{
+   *q = 1;
+   *p = (struct f){ };
+   return *q;
+}
+
+[[gnu::noinline]]
+int foo2(struct f2 *p, int *q)
+{
+   *q = 1;
+   *p = (struct f2){ };
+   return *q;
+}
+
+[[gnu::noinline]]
+int bar(union g *p, int *q)
+{
+   *q = 1;
+   *p = (union g){ };
+   return *q;
+}
+
+
+int main()
+{
+   struct f p;
+   if (0 != foo(&p, (void*)&p))
+   __builtin_abort();
+
+   struct f2 p2;
+   if (0 != foo2(&p2, (void*)&p2))
+   __builtin_abort();
+
+   union g q;
+   if (0 != bar(&q, (void*)&q))
+   __builtin_abort();
+}
diff --git a/gcc/testsuite/gcc.dg/c2y-byte-alias-2.c 
b/gcc/testsuite/gcc.dg/c2y-byte-alias-2.c
new file mode 100644
index 000..9bd2d18b386
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/c2y-byte-alias-2.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-options "-std=c2y -O2" } */
+
+struct f2 {
+   struct f {
+   _Alignas(int) char buf[sizeof(int)];
+   } x[2];
+   int i;
+};
+
+[[gnu::noinline]]
+int foo2(struct f2 *p, int *q)
+{
+   *q = 1;
+   *p = (struct f2){ };
+   return *q;
+}
+
+struct g2 {
+   union g {
+   _Alignas(int) char buf[sizeof(int)];
+   } x[2];
+   int i;
+};
+
+[[gnu::noinline]]
+int bar2(struct g2 *p, int *q)
+{
+   *q = 1;
+   *p = (struct g2){ };
+   return *q;
+}
+
+int main()
+{
+   struct f2 p2;
+   if (0 != foo2(&p2, (void*)&p2.x[0].buf))
+   __builtin_abort();
+
+   struct g2 q2;
+   if (0 != bar2(&q2, (void*)&q2.x[0].buf))
+   __builtin_abort();
+}
diff --git a/gcc/testsuite/gcc.dg/c2y-byte-alias-3.c 
b/gcc/testsuite/gcc.dg/c2y-byte-alias-3.c
new file mode 100644
index 000..f88eab2e92f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/c2y-byte-alias-3.c
@@ -0,0 +1,47 @@
+/* { dg-do run } */
+/* { 

Re: [PATCH] libgccjit: Add ability to get the alignment of a type

2024-06-28 Thread Iain Sandoe
Hi Folks,

As noted, it seems to me that the failure here is a false positive, but it still 
needs handling.

> On 29 Jun 2024, at 02:28, Iain Sandoe  wrote:
>> On 28 Jun 2024, at 12:50, Rainer Orth  wrote:

> … I am going to fix this with the obvious (provide a default init for the 
> vars) - later today.

Fixed as attached,
Iain



0001-jit-Fix-Darwin-bootstrap-after-r15-1699.patch
Description: Binary data


Re: [PATCH] Hard register asm constraint

2024-06-28 Thread Stefan Schulze Frielinghaus
On Fri, Jun 28, 2024 at 11:46:08AM +0200, Georg-Johann Lay wrote:
> On 27.06.24 at 10:51, Stefan Schulze Frielinghaus wrote:
> > On Thu, Jun 27, 2024 at 09:45:32AM +0200, Georg-Johann Lay wrote:
> > > On 24.05.24 at 11:13, Stefan Schulze Frielinghaus wrote:
> > > > On 25.06.24 at 16:03, Paul Koning wrote:
> > > > > On Jun 24, 2024, at 1:50 AM, Stefan Schulze Frielinghaus 
> > > > >  wrote:
> > > > > On Mon, Jun 10, 2024 at 07:19:19AM +0200, Stefan Schulze Frielinghaus 
> > > > > wrote:
> > > > > > On Fri, May 24, 2024 at 11:13:12AM +0200, Stefan Schulze 
> > > > > > Frielinghaus wrote:
> > > > > > > This implements hard register constraints for inline asm.  A hard 
> > > > > > > register
> > > > > > > constraint is of the form {regname} where regname is any valid 
> > > > > > > register.  This
> > > > > > > basically renders register asm superfluous.  For example, the 
> > > > > > > snippet
> > > > > > > 
> > > > > > > int test (int x, int y)
> > > > > > > {
> > > > > > >register int r4 asm ("r4") = x;
> > > > > > >register int r5 asm ("r5") = y;
> > > > > > >unsigned int copy = y;
> > > > > > >asm ("foo %0,%1,%2" : "+d" (r4) : "d" (r5), "d" (copy));
> > > > > > >return r4;
> > > > > > > }
> > > > > > > 
> > > > > > > could be rewritten into
> > > > > > > 
> > > > > > > int test (int x, int y)
> > > > > > > {
> > > > > > >asm ("foo %0,%1,%2" : "+{r4}" (x) : "{r5}" (y), "d" (y));
> > > > > > >return x;
> > > > > > > }
> > > > 
> > > > I like this idea but I'm wondering: regular constraints specify what 
> > > > sort of value is needed, for example an int vs. a short int vs. a 
> > > > float.  The notation you've shown doesn't seem to have that aspect.
> > > > 
> > > > The other comment is that I didn't see documentation updates to reflect 
> > > > this new feature.
> > > > 
> > > > paul
> > > > 
> > > Stefan Schulze Frielinghaus wrote:
> > > > This implements hard register constraints for inline asm.  A hard 
> > > > register
> > > > constraint is of the form {regname} where regname is any valid 
> > > > register.  This
> > > > basically renders register asm superfluous.  For example, the snippet
> > > > 
> > > > int test (int x, int y)
> > > > {
> > > > register int r4 asm ("r4") = x;
> > > > register int r5 asm ("r5") = y;
> > > > unsigned int copy = y;
> > > > asm ("foo %0,%1,%2" : "+d" (r4) : "d" (r5), "d" (copy));
> > > > return r4;
> > > > }
> > > > 
> > > > could be rewritten into
> > > > 
> > > > int test (int x, int y)
> > > > {
> > > > asm ("foo %0,%1,%2" : "+{r4}" (x) : "{r5}" (y), "d" (y));
> > > > return x;
> > > > }
> > > 
> > > Hi, can this also be used in machine descriptions?
> > > 
> > > It would make some insn handling much simpler, for example in
> > > the avr backend.
> > > 
> > > That backend has insns that represent assembly sequences in libgcc
> > > which have a smaller register footprint than plain calls.  However
> > > this requires that such insns have explicit description of which regs
> > > go in and out.
> > > 
> > > The current solution uses hard regs, which works, but a proper
> > > implementation would use register constraints.  I tries that a while
> > > ago, and register constraints lead to a code bloat even in places that
> > > don't use these constraints due to the zillions of new register classes
> > > like R22_1, R22;2, R22_4, R20_1, R20_2, R20_4 etc. that were required.
> > > 
> > > Your approach would allow to use hard register constraints in insns,
> > > and so far the only problem is to determine how much hard regs are
> > > used by the constraint.  The gen tools that generates cc code from md
> > > would use the operand's machine mode to infer the number of hard regs.
> > 
> > I have this on my todo list but ignored it for the very first draft.  At
> > the moment this already fails because genoutput cannot parse the
> > constraint format.
> > 
> > In my "alpha draft" I implemented this feature by emitting moves to hard
> > registers during expand.  This had the limitation that I couldn't
> 
> One problem is that you cannot just introduce hard registers at that
> time because a hard reg may live across the sequence, see for example
> avr.cc::avr_emit3_fix_outputs() and avr_fix_operands().

Yea I was fearing this.  I did some testing on x86_64 and s390 including
explicit function calls, sanitizers etc. but of course this was not
complete which is why I think that the current draft is more robust.

> 
> > support multiple alternatives in combination with hard-register
> > constraints.  I'm still not sure whether this is a feature we really
> > want or whether it should be rather denied.  Anyhow, with this kind of
> > implementation I doubt that this would be feasible for machine
> > descriptions.  I moved on with my current draft where the constraint
> > manifests during register allocation.  This also allows multiple
> > alternatives.  I think one of the (major?) advantages of doing it this
> > way is that operands are kept in pseudos which means 

Re: [PATCH] RISC-V: use fclass insns to implement isfinite and isnormal builtins

2024-06-28 Thread Vineet Gupta



On 6/28/24 17:53, Vineet Gupta wrote:
> Currently isfinite and isnormal use float compare instructions with fp
> flags save/restored around them. Our perf team complained this could be
> costly in uarch. RV Base ISA already has FCLASS.{d,s,h} instruction to
> do FP compares w/o disturbing FP exception flags.
>
> Coincidentally, upstream just a few days back got support for the
> corresponding optabs. All that is needed is to wire these up in the
> backend.
>
> I was also hoping to get __builtin_isinf() done but unfortunately it
> requires a little more RTL foo/bar to implement a tri-modal return.
>
> Currently going thru CI testing.

My local testing spotted one additional failure.

FAIL: g++.dg/opt/pr107569.C  -std=gnu++20  scan-tree-dump-times vrp1
"return 1;" 2

The reason being

bool
bar (double x)
{
  [[assume (std::isfinite (x))]];
  return std::isfinite (x);
}

generating the new seq

.LFB4:
    fclass.d    a0,fa0
    andi    a0,a0,126
    snez    a0,a0
    ret

vs.

    li    a0,1
    ret

I have a hunch this requires the pending value range patch from Hao Chen
GUI.

Thx,
-Vineet

[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-May/653094.html


Re: [PATCH] libgccjit: Add ability to get the alignment of a type

2024-06-28 Thread Iain Sandoe
Hi Folks,

> On 28 Jun 2024, at 12:50, Rainer Orth  wrote:
> 
> David Malcolm  writes:
> 
>> On Thu, 2024-04-04 at 18:59 -0400, Antoni Boucher wrote:
>>> Hi.
>>> This patch adds a new API to produce an rvalue representing the 
>>> alignment of a type.
>>> Thanks for the review.
>> 
>> Patch looks good to me (but may need the usual ABI version updates when
>> merging).
> 
> This patch broke macOS bootstrap:
> 
> /vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc: In member function 
> 'virtual gcc::jit::recording::string* 
> gcc::jit::recording::memento_of_typeinfo::make_debug_string()': 
> /vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc:5529:30: error: 
> 'ident' may be used uninitialized [-Werror=maybe-uninitialized]
> 5529 |   return string::from_printf (m_ctxt,
>  |  ^~~~
> 5530 |   "%s (%s)",
>  |   ~~
> 5531 |   ident,
>  |   ~~
> 5532 |   m_type->get_debug_string ());
>  |   
> /vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc:5519:15: note: 'ident' 
> was declared here
> 5519 |   const char* ident;
>  |   ^
> 
> /vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc: In member function 
> 'virtual void 
> gcc::jit::recording::memento_of_typeinfo::write_reproducer(gcc::jit::reproducer&)':
>   
> /vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc:5552:11: error: 'type' 
> may be used uninitialized [-Werror=maybe-uninitialized]
> 5552 |   r.write ("  gcc_jit_rvalue *%s =\n"
>  |   ^~~
> 5553 | "gcc_jit_context_new_%sof (%s, /* gcc_jit_context *ctxt */\n"
>  | ~
> 5554 | "(gcc_jit_type *) %s); /* 
> gcc_jit_type *type */\n",
>  | 
> ~~~
>  | id,
>  | ~~~
> 5556 | type,
>  | ~
> 5557 | r.get_identifier (get_context ()),
>  | ~~
> 5558 | r.get_identifier (m_type));
>  | ~~
> /vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc:5541:15: note: 'type' 
> was declared here
> 5541 |   const char* type;
>  |   ^~~~
> 
> I wonder how this can have worked anywhere (apart from jit not being
> enabled by default on non-Darwin targets).

Well, in principle, all values of the m_info_type enum are covered (there are 
only 2) - and therefore the two vars should be seen as initialized on some 
path.   It is quite disappointing that we cannot track this in a 12 line 
function with such a small enumeration… 
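
A reduced sketch of the pattern at issue (names only approximate those in
jit-recording.cc): every enumerator assigns the variable, but the warning
cannot prove the switch exhaustive, since an enum may carry values outside
its enumerators.

enum class info_type { align_of, size_of };

const char *info_ident (info_type t)
{
  const char *ident;
  switch (t)
    {
    case info_type::align_of: ident = "alignof"; break;
    case info_type::size_of:  ident = "sizeof";  break;
    }
  return ident;  /* -Wmaybe-uninitialized fires here */
}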

… I am going to fix this with the obvious (provide a default init for the vars) 
- later today.

Iain




Re: [PATCH] RISC-V: use fclass insns to implement isfinite and isnormal builtins

2024-06-28 Thread Andrew Waterman
+1 to any change that reduces the number of fflags accesses.


On Fri, Jun 28, 2024 at 5:54 PM Vineet Gupta  wrote:
>
> Currently isfinite and isnormal use float compare instructions with fp
> flags save/restored around them. Our perf team complained this could be
> costly in uarch. RV Base ISA already has FCLASS.{d,s,h} instruction to
> do FP compares w/o disturbing FP exception flags.
>
> Coincidentally, upstream just a few days back got support for the
> corresponding optabs. All that is needed is to wire these up in the
> backend.
>
> I was also hoping to get __builtin_isinf() done but unfortunately it
> requires a little more RTL foo/bar to implement a tri-modal return.
>
> Currently going thru CI testing.
>
> gcc/ChangeLog:
> * config/riscv/riscv.md: Add UNSPEC_FCLASS, UNSPEC_ISFINITE,
> UNSPEC_ISNORMAL.
> define_insn for fclass.
> define_expand for isfinite and isnormal.
>
> gcc/testsuite/ChangeLog:
> * gcc.target/riscv/fclass.c: New test.
>
> Signed-off-by: Vineet Gupta 
> ---
>  gcc/config/riscv/riscv.md   | 56 +
>  gcc/testsuite/gcc.target/riscv/fclass.c | 18 
>  2 files changed, 74 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/fclass.c
>
> diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
> index ff37125e3f28..fc4441916137 100644
> --- a/gcc/config/riscv/riscv.md
> +++ b/gcc/config/riscv/riscv.md
> @@ -68,6 +68,9 @@
>UNSPEC_FMAX
>UNSPEC_FMINM
>UNSPEC_FMAXM
> +  UNSPEC_FCLASS
> +  UNSPEC_ISFINITE
> +  UNSPEC_ISNORMAL
>
>;; Stack tie
>UNSPEC_TIE
> @@ -3436,6 +3439,59 @@
> (set_attr "mode" "")
> (set (attr "length") (const_int 16))])
>
> +;; fclass instruction output bitmap
> +;;   0 negative infinity
> +;;   1 negative normal number.
> +;;   2 negative subnormal number.
> +;;   3 -0
> +;;   4 +0
> +;;   5 positive subnormal number.
> +;;   6 positive normal number.
> +;;   7 positive infinity
> +;;   8 signaling NaN.
> +;;   9 quiet NaN
> +(define_insn "fclass<mode>"
> +  [(set (match_operand:SI  0 "register_operand" "=r")
> +       (unspec:SI [(match_operand:ANYF 1 "register_operand" " f")]
> +                  UNSPEC_FCLASS))]
> +  "TARGET_HARD_FLOAT"
> +  "fclass.<fmt>\t%0,%1"
> +  [(set_attr "type" "fcmp")
> +   (set_attr "mode" "<UNITMODE>")])
> +
> +(define_expand "isfinite<mode>2"
> +  [(set (match_operand:SI  0 "register_operand" "=r")
> +       (unspec:SI [(match_operand:ANYF 1 "register_operand" " f")]
> +                  UNSPEC_ISFINITE))]
> +  "TARGET_HARD_FLOAT"
> +{
> +  rtx tmp = gen_reg_rtx (SImode);
> +  emit_insn (gen_fclass<mode> (tmp, operands[1]));
> +  riscv_emit_binary (AND, tmp, tmp, GEN_INT (0x7e));
> +  rtx cmp = gen_rtx_NE (SImode, tmp, const0_rtx);
> +  emit_insn (gen_cstoresi4 (operands[0], cmp, tmp, const0_rtx));
> +
> +  DONE;
> +})
> +
> +;; TODO: isinf is a bit tricky as it requires a trimodal return
> +;;  1 if 0x80, -1 if 0x1, 0 otherwise
> +
> +(define_expand "isnormal<mode>2"
> +  [(set (match_operand:SI  0 "register_operand" "=r")
> +       (unspec:SI [(match_operand:ANYF 1 "register_operand" " f")]
> +                  UNSPEC_ISNORMAL))]
> +  "TARGET_HARD_FLOAT"
> +{
> +  rtx tmp = gen_reg_rtx (SImode);
> +  emit_insn (gen_fclass<mode> (tmp, operands[1]));
> +  riscv_emit_binary (AND, tmp, tmp, GEN_INT (0x42));
> +  rtx cmp = gen_rtx_NE (SImode, tmp, const0_rtx);
> +  emit_insn (gen_cstoresi4 (operands[0], cmp, tmp, const0_rtx));
> +
> +  DONE;
> +})
> +
>  (define_insn "*seq_zero_<GPR:mode><X:mode>"
>[(set (match_operand:GPR   0 "register_operand" "=r")
> (eq:GPR (match_operand:X 1 "register_operand" " r")
> diff --git a/gcc/testsuite/gcc.target/riscv/fclass.c 
> b/gcc/testsuite/gcc.target/riscv/fclass.c
> new file mode 100644
> index ..0dfac982ebeb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/fclass.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target hard_float } */
> +/* { dg-options "-march=rv64gc -mabi=lp64d  -ftrapping-math" { target { rv64 } } } */
> +/* { dg-options "-march=rv32gc -mabi=ilp32d -ftrapping-math" { target { rv32 } } } */
> +
> +int t_isfinite(double a)
> +{
> +  return __builtin_isfinite(a);
> +}
> +
> +int t_isnormal(double a)
> +{
> +  return __builtin_isnormal(a);
> +}
> +
> +/* { dg-final { scan-assembler-not   {\mfrflags}  } } */
> +/* { dg-final { scan-assembler-not   {\mfsflags}  } } */
> +/* { dg-final { scan-assembler-times {\tfclass} 2 } } */
> --
> 2.34.1
>


[PATCH] RISC-V: use fclass insns to implement isfinite and isnormal builtins

2024-06-28 Thread Vineet Gupta
Currently isfinite and isnormal use float compare instructions with fp
flags save/restored around them. Our perf team complained this could be
costly in uarch. RV Base ISA already has FCLASS.{d,s,h} instruction to
do FP compares w/o disturbing FP exception flags.

Coincidentally, upstream just a few days back got support for the
corresponding optabs. All that is needed is to wire these up in the
backend.

I was also hoping to get __builtin_isinf() done but unfortunately it
requires a little more RTL foo/bar to implement a tri-modal return.

Currently going thru CI testing.
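
As a C model of the two bit tests the expanders below emit (masks from the
patch; function names are ours):

/* isfinite keeps bits 1..6 (0x7e): normals, subnormals and zeros.
   isnormal keeps bits 1 and 6 (0x42): negative and positive normals.  */
int isfinite_model (unsigned klass) { return (klass & 0x7e) != 0; }
int isnormal_model (unsigned klass) { return (klass & 0x42) != 0; }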

gcc/ChangeLog:
* config/riscv/riscv.md: Add UNSPEC_FCLASS, UNSPEC_ISFINITE,
UNSPEC_ISNORMAL.
define_insn for fclass.
define_expand for isfinite and isnormal.

gcc/testsuite/ChangeLog:
* gcc.target/riscv/fclass.c: New test.

Signed-off-by: Vineet Gupta 
---
 gcc/config/riscv/riscv.md   | 56 +
 gcc/testsuite/gcc.target/riscv/fclass.c | 18 
 2 files changed, 74 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/fclass.c

diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index ff37125e3f28..fc4441916137 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -68,6 +68,9 @@
   UNSPEC_FMAX
   UNSPEC_FMINM
   UNSPEC_FMAXM
+  UNSPEC_FCLASS
+  UNSPEC_ISFINITE
+  UNSPEC_ISNORMAL
 
   ;; Stack tie
   UNSPEC_TIE
@@ -3436,6 +3439,59 @@
(set_attr "mode" "")
(set (attr "length") (const_int 16))])
 
+;; fclass instruction output bitmap
+;;   0 negative infinity
+;;   1 negative normal number.
+;;   2 negative subnormal number.
+;;   3 -0
+;;   4 +0
+;;   5 positive subnormal number.
+;;   6 positive normal number.
+;;   7 positive infinity
+;;   8 signaling NaN.
+;;   9 quiet NaN
+(define_insn "fclass<mode>"
+  [(set (match_operand:SI  0 "register_operand" "=r")
+       (unspec:SI [(match_operand:ANYF 1 "register_operand" " f")]
+                  UNSPEC_FCLASS))]
+  "TARGET_HARD_FLOAT"
+  "fclass.<fmt>\t%0,%1"
+  [(set_attr "type" "fcmp")
+   (set_attr "mode" "<UNITMODE>")])
+
+(define_expand "isfinite<mode>2"
+  [(set (match_operand:SI  0 "register_operand" "=r")
+       (unspec:SI [(match_operand:ANYF 1 "register_operand" " f")]
+                  UNSPEC_ISFINITE))]
+  "TARGET_HARD_FLOAT"
+{
+  rtx tmp = gen_reg_rtx (SImode);
+  emit_insn (gen_fclass<mode> (tmp, operands[1]));
+  riscv_emit_binary (AND, tmp, tmp, GEN_INT (0x7e));
+  rtx cmp = gen_rtx_NE (SImode, tmp, const0_rtx);
+  emit_insn (gen_cstoresi4 (operands[0], cmp, tmp, const0_rtx));
+
+  DONE;
+})
+
+;; TODO: isinf is a bit tricky as it requires a trimodal return
+;;  1 if 0x80, -1 if 0x1, 0 otherwise
+
+(define_expand "isnormal<mode>2"
+  [(set (match_operand:SI  0 "register_operand" "=r")
+       (unspec:SI [(match_operand:ANYF 1 "register_operand" " f")]
+                  UNSPEC_ISNORMAL))]
+  "TARGET_HARD_FLOAT"
+{
+  rtx tmp = gen_reg_rtx (SImode);
+  emit_insn (gen_fclass<mode> (tmp, operands[1]));
+  riscv_emit_binary (AND, tmp, tmp, GEN_INT (0x42));
+  rtx cmp = gen_rtx_NE (SImode, tmp, const0_rtx);
+  emit_insn (gen_cstoresi4 (operands[0], cmp, tmp, const0_rtx));
+
+  DONE;
+})
+
 (define_insn "*seq_zero_<GPR:mode><X:mode>"
   [(set (match_operand:GPR   0 "register_operand" "=r")
(eq:GPR (match_operand:X 1 "register_operand" " r")
diff --git a/gcc/testsuite/gcc.target/riscv/fclass.c 
b/gcc/testsuite/gcc.target/riscv/fclass.c
new file mode 100644
index ..0dfac982ebeb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/fclass.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target hard_float } */
+/* { dg-options "-march=rv64gc -mabi=lp64d  -ftrapping-math" { target { rv64 } } } */
+/* { dg-options "-march=rv32gc -mabi=ilp32d -ftrapping-math" { target { rv32 } } } */
+
+int t_isfinite(double a)
+{
+  return __builtin_isfinite(a);
+}
+
+int t_isnormal(double a)
+{
+  return __builtin_isnormal(a);
+}
+
+/* { dg-final { scan-assembler-not   {\mfrflags}  } } */
+/* { dg-final { scan-assembler-not   {\mfsflags}  } } */
+/* { dg-final { scan-assembler-times {\tfclass} 2 } } */
-- 
2.34.1



[committed] Fix mcore-elf regression after recent IRA change

2024-06-28 Thread Jeff Law

So the recent IRA change exposed a bug in the mcore backend.

The mcore has a special instruction (xtrb3) which can zero extend a GPR 
into R1.  It's useful because zextb requires a matching 
source/destination.  Unfortunately xtrb3 modifies CC.


The IRA changes twiddle register allocation such that we want to use 
xtrb3.  Unfortunately CC is live at the point where we want to use xtrb3 
and clobbering CC causes the test to fail.


Exposing the clobber in the expander and insn seems like the best path 
forward.  We could also drop the xtrb3 alternative, but that seems like 
it would hurt codegen more than exposing the clobber.


The bitfield extraction patterns using xtrb look problematic as well, 
but I didn't try to fix those.


This fixes the builtin-arith-overflow regressions and appears to fix 
20010122-1.c as a side effect.




Pushing to the trunk.

Jeff

commit 9fbbad9b6c6e7fa7eaf37552173f5b8b2958976b
Author: Jeff Law 
Date:   Fri Jun 28 18:36:50 2024 -0600

[committed] Fix mcore-elf regression after recent IRA change

So the recent IRA change exposed a bug in the mcore backend.

The mcore has a special instruction (xtrb3) which can zero extend a GPR into
R1.  It's useful because zextb requires a matching source/destination.
Unfortunately xtrb3 modifies CC.

The IRA changes twiddle register allocation such that we want to use xtrb3.
Unfortunately CC is live at the point where we want to use xtrb3 and 
clobbering
CC causes the test to fail.

Exposing the clobber in the expander and insn seems like the best path 
forward.
We could also drop the xtrb3 alternative, but that seems like it would hurt
codegen more than exposing the clobber.

The bitfield extraction patterns using xtrb look problematic as well, but I
didn't try to fix those.

This fixes the builtin-arith-overflow regressions and appears to fix
20010122-1.c as a side effect.

gcc/
* config/mcore/mcore.md  (zero_extendqihi2): Clobber CC in expander
and matching insn.
(zero_extendqisi2): Likewise.

diff --git a/gcc/config/mcore/mcore.md b/gcc/config/mcore/mcore.md
index d416ce24a97..432b89520d7 100644
--- a/gcc/config/mcore/mcore.md
+++ b/gcc/config/mcore/mcore.md
@@ -1057,15 +1057,17 @@ (define_insn ""
   [(set_attr "type" "load")])
 
 (define_expand "zero_extendqisi2"
-  [(set (match_operand:SI 0 "mcore_arith_reg_operand" "")
-   (zero_extend:SI (match_operand:QI 1 "general_operand" "")))]
+  [(parallel [(set (match_operand:SI 0 "mcore_arith_reg_operand" "")
+ (zero_extend:SI (match_operand:QI 1 "general_operand" "")))
+ (clobber (reg:CC 17))])]
   ""
   "") 
 
 ;; RBE: XXX: we don't recognize that the xtrb3 kills the CC register.
 (define_insn ""
   [(set (match_operand:SI 0 "mcore_arith_reg_operand" "=r,b,r")
-   (zero_extend:SI (match_operand:QI 1 "general_operand" "0,r,m")))]
+   (zero_extend:SI (match_operand:QI 1 "general_operand" "0,r,m")))
+   (clobber (reg:CC 17))]
   ""
   "@
zextb   %0
@@ -1091,15 +1093,17 @@ (define_insn ""
   [(set_attr "type" "load")])
 
 (define_expand "zero_extendqihi2"
-  [(set (match_operand:HI 0 "mcore_arith_reg_operand" "")
-   (zero_extend:HI (match_operand:QI 1 "general_operand" "")))]
+  [(parallel [(set (match_operand:HI 0 "mcore_arith_reg_operand" "")
+  (zero_extend:HI (match_operand:QI 1 "general_operand" "")))
+ (clobber (reg:CC 17))])]
   ""
   "") 
 
 ;; RBE: XXX: we don't recognize that the xtrb3 kills the CC register.
 (define_insn ""
   [(set (match_operand:HI 0 "mcore_arith_reg_operand" "=r,b,r")
-   (zero_extend:HI (match_operand:QI 1 "general_operand" "0,r,m")))]
+   (zero_extend:HI (match_operand:QI 1 "general_operand" "0,r,m")))
+   (clobber (reg:CC 17))]
   ""
   "@
zextb   %0


Re: [PATCH] Fortran: fix ALLOCATE with SOURCE of deferred character length [PR114019]

2024-06-28 Thread Steve Kargl
On Fri, Jun 28, 2024 at 10:00:53PM +0200, Harald Anlauf wrote:
> 
> the attached patch fixes an ICE occurring for ALLOCATE with SOURCE
> (or MOLD) of deferred character length in the scalar case, which
> looked obscure because the ICE disappears at -O1 and higher.
> 
> The dump tree suggests that it is a wrong decl for the temporary
> source that was e.g.
> 
> character(kind=1) source.2[1:];
> 
> whereas I had expected
> 
> character(kind=1)[1:] * source.2;
> 
> and which we now get after the patch.  Or am I missing something?
> 
> Regtested on x86_64-pc-linux-gnu.  OK for mainline?

I don't think you're missing anything.  We've a number of bugs
where one needs to distinguish between various declarations:

character(len=2), allocatable :: a
character(len=:), allocatable :: a(2)
character(len=:), allocatable :: a(:)

I can certainly imagine you've fixed another (corner) case that
was originally missed.

OK to commit.  Thanks for the patch.

-- 
Steve


[PATCH] c++: DR2627, Bit-fields and narrowing conversions [PR94058]

2024-06-28 Thread Marek Polacek
Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?

-- >8 --
This DR (https://cplusplus.github.io/CWG/issues/2627.html) says that
even if we are converting from an integer type or unscoped enumeration type
to an integer type that cannot represent all the values of the original
type, it's not narrowing if "the source is a bit-field whose width w is
less than that of its type (or, for an enumeration type, its underlying
type) and the target type can represent all the values of a hypothetical
extended integer type with width w and with the same signedness as the
original type".

DR 2627
PR c++/94058
PR c++/104392

gcc/cp/ChangeLog:

* typeck2.cc (check_narrowing): Don't warn if the conversion isn't
narrowing as per DR 2627.

gcc/testsuite/ChangeLog:

* g++.dg/DRs/dr2627.C: New test.
* g++.dg/cpp0x/Wnarrowing22.C: New test.
* g++.dg/cpp2a/spaceship-narrowing1.C: New test.
* g++.dg/cpp2a/spaceship-narrowing2.C: New test.
---
 gcc/cp/typeck2.cc | 12 +
 gcc/testsuite/g++.dg/DRs/dr2627.C | 13 +
 gcc/testsuite/g++.dg/cpp0x/Wnarrowing22.C | 49 +++
 .../g++.dg/cpp2a/spaceship-narrowing1.C   | 34 +
 .../g++.dg/cpp2a/spaceship-narrowing2.C   | 26 ++
 5 files changed, 134 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/DRs/dr2627.C
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/Wnarrowing22.C
 create mode 100644 gcc/testsuite/g++.dg/cpp2a/spaceship-narrowing1.C
 create mode 100644 gcc/testsuite/g++.dg/cpp2a/spaceship-narrowing2.C

diff --git a/gcc/cp/typeck2.cc b/gcc/cp/typeck2.cc
index 7782f38da43..30a6fbe95c9 100644
--- a/gcc/cp/typeck2.cc
+++ b/gcc/cp/typeck2.cc
@@ -1012,6 +1012,18 @@ check_narrowing (tree type, tree init, tsubst_flags_t 
complain,
   if (TREE_CODE (ftype) == ENUMERAL_TYPE)
/* Check for narrowing based on the values of the enumeration. */
ftype = ENUM_UNDERLYING_TYPE (ftype);
+  /* Undo convert_bitfield_to_declared_type (STRIP_NOPS isn't enough).  */
+  tree op = init;
+  while (CONVERT_EXPR_P (op))
+   op = TREE_OPERAND (op, 0);
+  /* Core 2627 says that we shouldn't warn when "the source is a bit-field
+whose width w is less than that of its type (or, for an enumeration
+type, its underlying type) and the target type can represent all the
+values of a hypothetical extended integer type with width w and with
+the same signedness as the original type".  */
+  if (is_bitfield_expr_with_lowered_type (op)
+ && TYPE_PRECISION (TREE_TYPE (op)) < TYPE_PRECISION (ftype))
+   ftype = TREE_TYPE (op);
   if ((tree_int_cst_lt (TYPE_MAX_VALUE (type),
TYPE_MAX_VALUE (ftype))
   || tree_int_cst_lt (TYPE_MIN_VALUE (ftype),
diff --git a/gcc/testsuite/g++.dg/DRs/dr2627.C 
b/gcc/testsuite/g++.dg/DRs/dr2627.C
new file mode 100644
index 000..fe7f28613ca
--- /dev/null
+++ b/gcc/testsuite/g++.dg/DRs/dr2627.C
@@ -0,0 +1,13 @@
+// DR 2627 - Bit-fields and narrowing conversions
+// { dg-do compile { target c++20 } }
+
+#include <compare>
+
+struct C {
+  long long i : 8;
+};
+
+void f() {
+  C x{1}, y{2};
+  x.i <=> y.i;
+}
diff --git a/gcc/testsuite/g++.dg/cpp0x/Wnarrowing22.C 
b/gcc/testsuite/g++.dg/cpp0x/Wnarrowing22.C
new file mode 100644
index 000..dd30451a7cc
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/Wnarrowing22.C
@@ -0,0 +1,49 @@
+// DR 2627 - Bit-fields and narrowing conversions
+// PR c++/94058
+// { dg-do compile { target c++11 } }
+// { dg-options "-Wno-error=narrowing" }
+
+using int64_t = __INT64_TYPE__;
+using int32_t = __INT32_TYPE__;
+
+struct A {
+  int64_t i1 : __CHAR_BIT__;
+  int64_t i2 : sizeof (int32_t) * __CHAR_BIT__ - 1;
+  int64_t i3 : sizeof (int32_t) * __CHAR_BIT__;
+  int64_t i4 : sizeof (int32_t) * __CHAR_BIT__ + 1;
+  int64_t i5 : sizeof (int64_t) * __CHAR_BIT__ - 1;
+  int64_t i6 : sizeof (int64_t) * __CHAR_BIT__;
+} a;
+
+int32_t i1{a.i1};
+int32_t i2{a.i2};
+int32_t i3{a.i3};
+int32_t i4{a.i4}; // { dg-warning "narrowing conversion" }
+int32_t i5{a.i5}; // { dg-warning "narrowing conversion" }
+int32_t i6{a.i6}; // { dg-warning "narrowing conversion" }
+
+struct B {
+  bool b1 : sizeof (bool) * __CHAR_BIT__;
+  bool b2 : sizeof (bool);
+} b;
+
+signed char b1{b.b1};
+signed char b2{b.b2};
+
+enum E : int64_t { E1 };
+
+struct C {
+  E e1 : __CHAR_BIT__;
+  E e2 : sizeof (int32_t) * __CHAR_BIT__ - 1;
+  E e3 : sizeof (int32_t) * __CHAR_BIT__;
+  E e4 : sizeof (int32_t) * __CHAR_BIT__ + 1;
+  E e5 : sizeof (int64_t) * __CHAR_BIT__ - 1;
+  E e6 : sizeof (int64_t) * __CHAR_BIT__;
+} c;
+
+int32_t e1{c.e1};
+int32_t e2{c.e2};
+int32_t e3{c.e3};
+int32_t e4{c.e4}; // { dg-warning "narrowing conversion" }
+int32_t e5{c.e5}; // { dg-warning "narrowing conversion" }
+int32_t e6{c.e6}; // { dg-warning "narrowing conversion" }
diff --git 

Re: [PATCH] c++: Relax too strict assert in stabilize_expr [PR111160]

2024-06-28 Thread Patrick Palka
On Wed, 26 Jun 2024, Simon Martin wrote:

> The case in the ticket is an ICE on invalid due to an assert in 
> stabilize_expr,
> but the underlying issue can actually trigger on this *valid* code:
> 
> === cut here ===
> struct TheClass {
>   TheClass() {}
>   TheClass(volatile TheClass& t) {}
>   TheClass operator=(volatile TheClass& t) volatile { return t; }
> };
> void the_func() {
>   volatile TheClass x, y, z;
>   (false ? x : y) = z;
> }
> === cut here ===
> 
> The problem is that stabilize_expr asserts that it returns an expression
> without TREE_SIDE_EFFECTS, which can't be if the involved type is volatile.
> 
> This patch relaxes the assert to accept having TREE_THIS_VOLATILE on the
> returned expression.
> 
> Successfully tested on x86_64-pc-linux-gnu.
> 
  PR c++/111160
> 
> gcc/cp/ChangeLog:
> 
>   * tree.cc (stabilize_expr): Stabilized expressions can have
>   TREE_SIDE_EFFECTS if they're volatile.

LGTM (although I can't formally approve the patch)

> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.dg/overload/error8.C: New test.
>   * g++.dg/overload/volatile2.C: New test.
> 
> ---
>  gcc/cp/tree.cc|  2 +-
>  gcc/testsuite/g++.dg/overload/error8.C|  9 +
>  gcc/testsuite/g++.dg/overload/volatile2.C | 12 
>  3 files changed, 22 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/g++.dg/overload/error8.C
>  create mode 100644 gcc/testsuite/g++.dg/overload/volatile2.C
> 
> diff --git a/gcc/cp/tree.cc b/gcc/cp/tree.cc
> index 28648c14c6d..dfd4a3a948b 100644
> --- a/gcc/cp/tree.cc
> +++ b/gcc/cp/tree.cc
> @@ -5969,7 +5969,7 @@ stabilize_expr (tree exp, tree* initp)
>  }
>*initp = init_expr;
>  
> -  gcc_assert (!TREE_SIDE_EFFECTS (exp));
> +  gcc_assert (!TREE_SIDE_EFFECTS (exp) || TREE_THIS_VOLATILE (exp));
>return exp;
>  }
>  
> diff --git a/gcc/testsuite/g++.dg/overload/error8.C 
> b/gcc/testsuite/g++.dg/overload/error8.C
> new file mode 100644
> index 000..a7e745860e0
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/overload/error8.C
> @@ -0,0 +1,9 @@
+// PR c++/111160
> +// { dg-do compile { target c++11 } }
> +
> +class TheClass {}; // { dg-error "discards|bind|discards|bind" }
> +void the_func() {
> +  TheClass x;
> +  volatile TheClass y;
> +  (false ? x : x) = y; // { dg-error "ambiguous|ambiguous" }
> +}
> diff --git a/gcc/testsuite/g++.dg/overload/volatile2.C 
> b/gcc/testsuite/g++.dg/overload/volatile2.C
> new file mode 100644
> index 000..9f27357aed6
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/overload/volatile2.C
> @@ -0,0 +1,12 @@
+// PR c++/111160
> +// { dg-do compile { target c++11 } }
> +
> +struct TheClass {
> +  TheClass() {}
> +  TheClass(volatile TheClass& t) {}
> +  TheClass operator=(volatile TheClass& t) volatile { return t; }
> +};
> +void the_func() {
> +  volatile TheClass x, y, z;
> +  (false ? x : y) = z;
> +}
> -- 
> 2.44.0
> 
> 
> 
> 



Re: [PATCH] c++: Fix ICE locating 'this' for (not matching) template member function [PR115364]

2024-06-28 Thread Patrick Palka
On Fri, 28 Jun 2024, Simon Martin wrote:

> We currently ICE when emitting the error message for this invalid code:
> 
> === cut here ===
> struct foo {
> >   template<int> void not_const() {}
> };
> void fn(const foo& obj) {
>   obj.not_const<5>();
> }
> === cut here ===
> 
> The problem is that get_fndecl_argument_location assumes that it has a
> FUNCTION_DECL in its hands to find the location of the bad argument. It might
> however have a TEMPLATE_DECL if there's a single candidate that cannot be
> instantiated, like here.
> 
> This patch simply defaults to using the FNDECL's location in this case, which
> fixes this PR.
> 
> Successfully tested on x86_64-pc-linux-gnu.
> 
>   PR c++/115364
> 
> gcc/cp/ChangeLog:
> 
>   * call.cc (get_fndecl_argument_location): Use FNDECL's location for
>   TEMPLATE_DECLs.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.dg/overload/template7.C: New test.
> 
> ---
>  gcc/cp/call.cc| 4 
>  gcc/testsuite/g++.dg/overload/template7.C | 9 +
>  2 files changed, 13 insertions(+)
>  create mode 100644 gcc/testsuite/g++.dg/overload/template7.C
> 
> diff --git a/gcc/cp/call.cc b/gcc/cp/call.cc
> index 7bbc1fb0c78..d5ff2311e63 100644
> --- a/gcc/cp/call.cc
> +++ b/gcc/cp/call.cc
> @@ -8347,6 +8347,10 @@ get_fndecl_argument_location (tree fndecl, int argnum)
>if (DECL_ARTIFICIAL (fndecl))
>  return DECL_SOURCE_LOCATION (fndecl);
>  
> +  /* Use FNDECL's location for TEMPLATE_DECLs.  */
> +  if (TREE_CODE (fndecl) == TEMPLATE_DECL)
> +return DECL_SOURCE_LOCATION (fndecl);
> +

For TEMPLATE_DECL fndecl, it'd be more natural to return the
corresponding argument location of its DECL_TEMPLATE_RESULT (which
should be a FUNCTION_DECL).  The STRIP_TEMPLATE macro would be
convenient to use here.


It seems this doesn't fix the regression completely however because
in GCC 11 the code was rejected with a "permerror" (which can be
downgraded to a warning with -fpermissive):

  115364.C: In function ‘void fn(const foo&)’:
  115364.C:5:43: error: passing ‘const foo’ as ‘this’ argument discards 
qualifiers [-fpermissive]
  5 | void fn(const foo& obj) { obj.not_const<5>(); }
|   ^~
  115364.C:3:24: note:   in call to ‘void foo::not_const() [with int 
 <anonymous> = 5]’
  3 | template<int> void not_const() {}
|^

and we now reject with an ordinary error:

  115364.C: In function ‘void fn(const foo&)’:
  115364.C:5:27: error: cannot convert ‘const foo*’ to ‘foo*’
  5 | void fn(const foo& obj) { obj.not_const<5>(); }
|   ^~~
|   |
|   const foo*
  115364.C:3:24: note:   initializing argument 'this' of ‘template<int <anonymous> > void foo::not_const()’
  3 | template<int> void not_const() {}
|^

To restore the error into a permerror, we need to figure out why we're
unexpectedly hitting this code path with a TEMPLATE_DECL, and why it's
necessary that the member function needs to take no arguments.  It turns
out I looked into this and submitted a patch for PR106760 (of which this
PR115364 is a dup) last year:
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620514.html

The patch was approved, but I lost track of it and never pushed it :/
I'm going to go ahead and push that fix shortly, sorry for not doing so
earlier.  Thanks for looking into this issue!

>int i;
>tree param;
>  
> diff --git a/gcc/testsuite/g++.dg/overload/template7.C 
> b/gcc/testsuite/g++.dg/overload/template7.C
> new file mode 100644
> index 000..67191c4ff62
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/overload/template7.C
> @@ -0,0 +1,9 @@
> +// PR c++/115364
> +// { dg-do compile }
> +
> +struct foo {
> +  template<int> void not_const() {} // { dg-note "initializing" }
> +};
> +void fn(const foo& obj) {
> +  obj.not_const<5>(); // { dg-error "cannot convert" }
> +}
> -- 
> 2.44.0
> 
> 
> 
> 

[PATCH] Fortran: fix ALLOCATE with SOURCE of deferred character length [PR114019]

2024-06-28 Thread Harald Anlauf
Dear all,

the attached patch fixes an ICE occurring for ALLOCATE with SOURCE
(or MOLD) of deferred character length in the scalar case, which
looked obscure because the ICE disappears at -O1 and higher.

The dump tree suggests that it is a wrong decl for the temporary
source that was e.g.

character(kind=1) source.2[1:];

whereas I had expected

character(kind=1)[1:] * source.2;

and which we now get after the patch.  Or am I missing something?

Regtested on x86_64-pc-linux-gnu.  OK for mainline?

Thanks,
Harald

From 4d12f6d0cf63ea6a2deb5398e6478dde114e76b8 Mon Sep 17 00:00:00 2001
From: Harald Anlauf 
Date: Fri, 28 Jun 2024 21:44:06 +0200
Subject: [PATCH] Fortran: fix ALLOCATE with SOURCE of deferred character
 length [PR114019]

gcc/fortran/ChangeLog:

	PR fortran/114019
	* trans-stmt.cc (gfc_trans_allocate): Fix handling of case of
	scalar character expression being used for SOURCE.

gcc/testsuite/ChangeLog:

	PR fortran/114019
	* gfortran.dg/allocate_with_source_33.f90: New test.
---
 gcc/fortran/trans-stmt.cc |  5 +-
 .../gfortran.dg/allocate_with_source_33.f90   | 53 +++
 2 files changed, 57 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gfortran.dg/allocate_with_source_33.f90

diff --git a/gcc/fortran/trans-stmt.cc b/gcc/fortran/trans-stmt.cc
index 93b633e212e..60275e18867 100644
--- a/gcc/fortran/trans-stmt.cc
+++ b/gcc/fortran/trans-stmt.cc
@@ -6464,7 +6464,10 @@ gfc_trans_allocate (gfc_code * code, gfc_omp_namelist *omp_allocate)
   else if (se.expr != NULL_TREE && temp_var_needed)
 	{
 	  tree var, desc;
-	  tmp = GFC_DESCRIPTOR_TYPE_P (TREE_TYPE (se.expr)) || is_coarray ?
+	  tmp = (GFC_DESCRIPTOR_TYPE_P (TREE_TYPE (se.expr))
+		 || is_coarray
+		 || (code->expr3->ts.type == BT_CHARACTER
+		 && code->expr3->rank == 0)) ?
 		se.expr
 	  : build_fold_indirect_ref_loc (input_location, se.expr);

diff --git a/gcc/testsuite/gfortran.dg/allocate_with_source_33.f90 b/gcc/testsuite/gfortran.dg/allocate_with_source_33.f90
new file mode 100644
index 000..7b1a26c464c
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/allocate_with_source_33.f90
@@ -0,0 +1,53 @@
+! { dg-do compile }
+! { dg-options "-O0" }
+!
+! PR fortran/114019 - allocation with source of deferred character length
+
+subroutine s
+  implicit none
+  character(1)  :: w   = "4"
+  character(*), parameter   :: str = "123"
+  character(5), pointer :: chr_pointer1
+  character(:), pointer :: chr_pointer2
+  character(:), pointer :: chr_ptr_arr(:)
+  character(5), allocatable :: chr_alloc1
+  character(:), allocatable :: chr_alloc2
+  character(:), allocatable :: chr_all_arr(:)
+  allocate (chr_pointer1, source=w// str//w)
+  allocate (chr_pointer2, source=w// str//w)
+  allocate (chr_ptr_arr,  source=w//[str//w])
+  allocate (chr_alloc1,   source=w// str//w)
+  allocate (chr_alloc2,   source=w// str//w)
+  allocate (chr_all_arr,  source=w//[str//w])
+  allocate (chr_pointer1, mold  =w// str//w)
+  allocate (chr_pointer2, mold  =w// str//w)
+  allocate (chr_ptr_arr,  mold  =w//[str//w])
+  allocate (chr_alloc1,   mold  =w// str//w)
+  allocate (chr_alloc2,   mold  =w// str//w)
+  allocate (chr_all_arr,  mold  =w//[str//w])
+end
+
+subroutine s2
+  implicit none
+  integer, parameter :: ck=4
+  character(kind=ck,len=1)  :: w   = ck_"4"
+  character(kind=ck,len=*), parameter   :: str = ck_"123"
+  character(kind=ck,len=5), pointer :: chr_pointer1
+  character(kind=ck,len=:), pointer :: chr_pointer2
+  character(kind=ck,len=:), pointer :: chr_ptr_arr(:)
+  character(kind=ck,len=5), allocatable :: chr_alloc1
+  character(kind=ck,len=:), allocatable :: chr_alloc2
+  character(kind=ck,len=:), allocatable :: chr_all_arr(:)
+  allocate (chr_pointer1, source=w// str//w)
+  allocate (chr_pointer2, source=w// str//w)
+  allocate (chr_ptr_arr,  source=w//[str//w])
+  allocate (chr_alloc1,   source=w// str//w)
+  allocate (chr_alloc2,   source=w// str//w)
+  allocate (chr_all_arr,  source=w//[str//w])
+  allocate (chr_pointer1, mold  =w// str//w)
+  allocate (chr_pointer2, mold  =w// str//w)
+  allocate (chr_ptr_arr,  mold  =w//[str//w])
+  allocate (chr_alloc1,   mold  =w// str//w)
+  allocate (chr_alloc2,   mold  =w// str//w)
+  allocate (chr_all_arr,  mold  =w//[str//w])
+end
--
2.35.3



[committed] libstdc++: Define __glibcxx_assert_fail for non-verbose build [PR115585]

2024-06-28 Thread Jonathan Wakely
Tested x86_64-linux. Pushed to trunk. Backports needed.

-- >8 --

When the library is configured with --disable-libstdcxx-verbose the
assertions just abort instead of calling __glibcxx_assert_fail, and so I
didn't export that function for the non-verbose build. However, that
option is documented to not change the library ABI, so we still need to
export the symbol from the library. It could be needed by programs
compiled against the headers from a verbose build.

The non-verbose definition can just call abort so that it doesn't pull
in I/O symbols, which are unwanted in a non-verbose build.
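
As a hypothetical caller (my illustration, not part of the patch), this is
how code built against verbose-build headers ends up referencing the symbol:

#include <cstddef>

namespace std
{
  [[__noreturn__]] void
  __glibcxx_assert_fail(const char*, int, const char*, const char*) noexcept;
}

void
check (std::size_t i, std::size_t size)
{
  // Roughly what __glibcxx_assert(i < size) expands to in a verbose build:
  if (i >= size)
    std::__glibcxx_assert_fail (__FILE__, __LINE__, __func__, "i < size");
}

Such a translation unit must still link against a non-verbose libstdc++.so,
which is why the symbol now exists there and simply calls abort.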

libstdc++-v3/ChangeLog:

PR libstdc++/115585
* src/c++11/assert_fail.cc (__glibcxx_assert_fail): Add
definition for non-verbose builds.
---
 libstdc++-v3/src/c++11/assert_fail.cc | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/libstdc++-v3/src/c++11/assert_fail.cc 
b/libstdc++-v3/src/c++11/assert_fail.cc
index 6d99c7958f3..76c8a5a5c2f 100644
--- a/libstdc++-v3/src/c++11/assert_fail.cc
+++ b/libstdc++-v3/src/c++11/assert_fail.cc
@@ -22,10 +22,10 @@
 // see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
// <http://www.gnu.org/licenses/>.
 
-#include <cstdio>  // for std::fprintf, stderr
 #include <cstdlib> // for std::abort
 
 #ifdef _GLIBCXX_VERBOSE_ASSERT
+#include <cstdio>  // for std::fprintf, stderr
 namespace std
 {
   [[__noreturn__]]
@@ -41,4 +41,12 @@ namespace std
 abort();
   }
 }
+#else
+namespace std
+{
+  [[__noreturn__]]
+  void
+  __glibcxx_assert_fail(const char*, int, const char*, const char*) noexcept
+  { abort(); }
+}
 #endif
-- 
2.45.2



[committed] libstdc++: Extend std::equal memcmp optimization to std::byte [PR101485]

2024-06-28 Thread Jonathan Wakely
Tested x86_64-linux. Pushed to trunk.

-- >8 --

We optimize std::equal to memcmp for integers and pointers, which means
that std::byte comparisons generate bigger code than char comparisons.

We can't use memcmp for arbitrary enum types, because they could have an
overloaded operator== that has custom semantics, but we know that
std::byte doesn't do that.
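
For illustration (my example, not from the patch), this is why the memcmp
path must be opt-in per type:

#include <algorithm>

enum class Tag : unsigned char { A = 1, B = 2 };

// Custom semantics: all Tags compare equal, so a bytewise memcmp
// would give the wrong answer for {A} versus {B}.
constexpr bool operator== (Tag, Tag) { return true; }

bool eq (const Tag* p, const Tag* q, unsigned n)
{
  return std::equal (p, p + n, q); // must call the overloaded operator==
}

std::byte has no such overload, so comparing it with memcmp is safe.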

libstdc++-v3/ChangeLog:

PR libstdc++/101485
* include/bits/stl_algobase.h (__equal_aux1): Check for
std::byte as well.
* testsuite/25_algorithms/equal/101485.cc: New test.
---
 libstdc++-v3/include/bits/stl_algobase.h |  6 +-
 libstdc++-v3/testsuite/25_algorithms/equal/101485.cc | 11 +++
 2 files changed, 16 insertions(+), 1 deletion(-)
 create mode 100644 libstdc++-v3/testsuite/25_algorithms/equal/101485.cc

diff --git a/libstdc++-v3/include/bits/stl_algobase.h 
b/libstdc++-v3/include/bits/stl_algobase.h
index 57ff2f7cb08..dec1e4c79d8 100644
--- a/libstdc++-v3/include/bits/stl_algobase.h
+++ b/libstdc++-v3/include/bits/stl_algobase.h
@@ -1257,7 +1257,11 @@ _GLIBCXX_END_NAMESPACE_CONTAINER
   typedef typename iterator_traits<_II1>::value_type _ValueType1;
   const bool __simple = ((__is_integer<_ValueType1>::__value
 #if _GLIBCXX_USE_BUILTIN_TRAIT(__is_pointer)
- || __is_pointer(_ValueType1)
+   || __is_pointer(_ValueType1)
+#endif
+#if __glibcxx_byte && __glibcxx_type_trait_variable_templates
+   // bits/cpp_type_traits.h declares std::byte
+   || is_same_v<_ValueType1, byte>
 #endif
 ) && __memcmpable<_II1, _II2>::__value);
   return std::__equal<__simple>::equal(__first1, __last1, __first2);
diff --git a/libstdc++-v3/testsuite/25_algorithms/equal/101485.cc 
b/libstdc++-v3/testsuite/25_algorithms/equal/101485.cc
new file mode 100644
index 000..1fbb40acae9
--- /dev/null
+++ b/libstdc++-v3/testsuite/25_algorithms/equal/101485.cc
@@ -0,0 +1,11 @@
+// { dg-options "-O0" }
+// { dg-do compile { target c++17 } }
+// { dg-final { scan-assembler "memcmp" } }
+
+#include <algorithm>
+#include <cstddef>
+
+bool eq(std::byte const* p, std::byte const* q, unsigned n)
+{
+  return std::equal(p, p + n, q);
+}
-- 
2.45.2



Re: [PATCH 2/2] libstdc++: Do not use C++11 alignof in C++98 mode [PR104395]

2024-06-28 Thread Jonathan Wakely
Pushed to trunk.

On Thu, 27 Jun 2024 at 10:01, Jonathan Wakely  wrote:
>
> As I commented in the PR, I think it would be nice if the compiler
> accepted C++11 alignof in C++98 mode when -faligned-new is used. But
> even if G++ added that, we'd need Clang to use it, and then wait a few
> releases for that new Clang support to be in widespread use.
>
> So let's just disable the extended alignment support in allocators.
> Using -std=c++98 -faligned-new seems like a silly combination anyway.
>
> Tested x86_64-linux.
>
> -- >8 --
>
> When -faligned-new (or Clang's -faligned-allocation) is used our
> allocators try to support extended alignments, gated on the
> __cpp_aligned_new macro. However, because they use alignof(_Tp) which is
> not a keyword in C++98 mode, using -std=c++98 -faligned-new results in
> errors from <memory> and other headers.
>
> We could change them to use __alignof__ instead of alignof, but that
> would potentially alter the result of the conditions, because e.g.
> alignof(long long) != __alignof__(long long) on some targets. That's
> probably not an issue for any types with extended alignment, so maybe it
> would be a safe change.
>
> For now, it seems acceptable to just disable the extended alignment
> support in C++98 mode, so that -faligned-new enables std::align_val_t
> and the corresponding operator new overloads, but doesn't affect
> std::allocator, __gnu_cxx::__bitmap_allocator etc.
>
> libstdc++-v3/ChangeLog:
>
> PR libstdc++/104395
> * include/bits/new_allocator.h: Disable extended alignment
> support in C++98 mode.
> * include/bits/stl_tempbuf.h: Likewise.
> * include/ext/bitmap_allocator.h: Likewise.
> * include/ext/malloc_allocator.h: Likewise.
> * include/ext/mt_allocator.h: Likewise.
> * include/ext/pool_allocator.h: Likewise.
> * testsuite/ext/104395.cc: New test.
> ---
>  libstdc++-v3/include/bits/new_allocator.h   | 4 ++--
>  libstdc++-v3/include/bits/stl_tempbuf.h | 6 +++---
>  libstdc++-v3/include/ext/bitmap_allocator.h | 4 ++--
>  libstdc++-v3/include/ext/malloc_allocator.h | 2 +-
>  libstdc++-v3/include/ext/mt_allocator.h | 4 ++--
>  libstdc++-v3/include/ext/pool_allocator.h   | 4 ++--
>  libstdc++-v3/testsuite/ext/104395.cc| 8 
>  7 files changed, 20 insertions(+), 12 deletions(-)
>  create mode 100644 libstdc++-v3/testsuite/ext/104395.cc
>
> diff --git a/libstdc++-v3/include/bits/new_allocator.h 
> b/libstdc++-v3/include/bits/new_allocator.h
> index 0e90c8819ac..5dcdee11c4d 100644
> --- a/libstdc++-v3/include/bits/new_allocator.h
> +++ b/libstdc++-v3/include/bits/new_allocator.h
> @@ -140,7 +140,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> std::__throw_bad_alloc();
>   }
>
> -#if __cpp_aligned_new
> +#if __cpp_aligned_new && __cplusplus >= 201103L
> if (alignof(_Tp) > __STDCPP_DEFAULT_NEW_ALIGNMENT__)
>   {
> std::align_val_t __al = std::align_val_t(alignof(_Tp));
> @@ -161,7 +161,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>  # define _GLIBCXX_SIZED_DEALLOC(p, n) (p)
>  #endif
>
> -#if __cpp_aligned_new
> +#if __cpp_aligned_new && __cplusplus >= 201103L
> if (alignof(_Tp) > __STDCPP_DEFAULT_NEW_ALIGNMENT__)
>   {
> _GLIBCXX_OPERATOR_DELETE(_GLIBCXX_SIZED_DEALLOC(__p, __n),
> diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h 
> b/libstdc++-v3/include/bits/stl_tempbuf.h
> index 759c4937744..0f267054613 100644
> --- a/libstdc++-v3/include/bits/stl_tempbuf.h
> +++ b/libstdc++-v3/include/bits/stl_tempbuf.h
> @@ -85,7 +85,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> if (__builtin_expect(size_t(__len) > (size_t(-1) / sizeof(_Tp)), 0))
>   return 0;
>
> -#if __cpp_aligned_new
> +#if __cpp_aligned_new && __cplusplus >= 201103L
> if (alignof(_Tp) > __STDCPP_DEFAULT_NEW_ALIGNMENT__)
>   return (_Tp*) _GLIBCXX_OPERATOR_NEW(__len * sizeof(_Tp),
>   align_val_t(alignof(_Tp)),
> @@ -107,7 +107,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>  # define _GLIBCXX_SIZED_DEALLOC(T, p, n) (p)
>  #endif
>
> -#if __cpp_aligned_new
> +#if __cpp_aligned_new && __cplusplus >= 201103L
> if (alignof(_Tp) > __STDCPP_DEFAULT_NEW_ALIGNMENT__)
>   {
> _GLIBCXX_OPERATOR_DELETE(_GLIBCXX_SIZED_DEALLOC(_Tp, __p, __len),
> @@ -168,7 +168,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>  inline void
>  return_temporary_buffer(_Tp* __p)
>  {
> -#if __cpp_aligned_new
> +#if __cpp_aligned_new && __cplusplus >= 201103L
>if (alignof(_Tp) > __STDCPP_DEFAULT_NEW_ALIGNMENT__)
> _GLIBCXX_OPERATOR_DELETE(__p, align_val_t(alignof(_Tp)));
>else
> diff --git a/libstdc++-v3/include/ext/bitmap_allocator.h 
> b/libstdc++-v3/include/ext/bitmap_allocator.h
> index ef2ee13187b..45b2283ca30 100644
> --- a/libstdc++-v3/include/ext/bitmap_allocator.h
> +++ b/libstdc++-v3/include/ext/bitmap_allocator.h
> @@ -1017,7 +1017,7 @@ 

Re: [PATCH 1/2] libstdc++: Simplify class templates

2024-06-28 Thread Jonathan Wakely
Pushed to trunk.

On Thu, 27 Jun 2024 at 10:03, Jonathan Wakely  wrote:
>
> I'm planning to push this, although arguably the first change isn't
> worth doing if we can't use it everywhere. If we need to keep the old
> code for EDG, maybe we should just keep using that? The new version
> probably compiles faster though.
>
> Removing the dependency on std::aligned_storage and adding the test is
> surely useful though.
>
> Tested x86_64-linux.
>
> -- >8 --
>
> As noted in a comment, the __gnu_cxx::__aligned_membuf class template
> can be simplified, because alignof(T) and alignas(T) use the correct
> alignment for a data member. That's true since GCC 8 and Clang 8. The
> EDG front end (as used by Intel icc, aka "Intel C++ Compiler Classic")
> does not implement the PR c++/69560 change, so keep using the old
> implementation when __EDG__ is defined, to avoid an ABI change for icc.
>
> For __gnu_cxx::__aligned_buffer all supported compilers agree on the
> value of __alignof__(T), but we can still simplify it by removing the
> dependency on std::aligned_storage.
>
> Add a test that checks that the aligned buffer types have the expected
> alignment, so that we can tell if changes like this affect their ABI
> properties.
>
> libstdc++-v3/ChangeLog:
>
> * include/ext/aligned_buffer.h (__aligned_membuf): Use
> alignas(T) directly instead of defining a struct and using its
> alignment.
> (__aligned_buffer): Remove use of std::aligned_storage.
> * testsuite/abi/aligned_buffers.cc: New test.
> ---
>  libstdc++-v3/include/ext/aligned_buffer.h | 20 -
>  libstdc++-v3/testsuite/abi/aligned_buffers.cc | 42 +++
>  2 files changed, 52 insertions(+), 10 deletions(-)
>  create mode 100644 libstdc++-v3/testsuite/abi/aligned_buffers.cc
>
> diff --git a/libstdc++-v3/include/ext/aligned_buffer.h 
> b/libstdc++-v3/include/ext/aligned_buffer.h
> index 26b36609fa5..9c2c628e54a 100644
> --- a/libstdc++-v3/include/ext/aligned_buffer.h
> +++ b/libstdc++-v3/include/ext/aligned_buffer.h
> @@ -49,11 +49,15 @@ namespace __gnu_cxx
>// Target macro ADJUST_FIELD_ALIGN can produce different alignment for
>// types when used as class members. __aligned_membuf is intended
>// for use as a class member, so align the buffer as for a class 
> member.
> -  // Since GCC 8 we could just use alignof(_Tp) instead, but older
> -  // versions of non-GNU compilers might still need this trick.
> +  // Since GCC 8 we can just use alignas(_Tp) to get the right alignment.
> +#ifdef __EDG__
> +  // The EDG front end does not implement the PR c++/69560 alignof 
> change.
>struct _Tp2 { _Tp _M_t; };
> -
> -  alignas(__alignof__(_Tp2::_M_t)) unsigned char _M_storage[sizeof(_Tp)];
> +  alignas(__alignof__(_Tp2::_M_t))
> +#else
> +  alignas(_Tp)
> +#endif
> +   unsigned char _M_storage[sizeof(_Tp)];
>
>__aligned_membuf() = default;
>
> @@ -81,8 +85,6 @@ namespace __gnu_cxx
>    template<typename _Tp>
>  using __aligned_buffer = __aligned_membuf<_Tp>;
>  #else
> -#pragma GCC diagnostic push
> -#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
>// Similar to __aligned_membuf but aligned for complete objects, not 
> members.
>// This type is used in , , 
>// and , but ideally they would use 
> __aligned_membuf
> @@ -90,10 +92,9 @@ namespace __gnu_cxx
>// This type is still used to avoid an ABI change.
>    template<typename _Tp>
>      struct __aligned_buffer
> -    : std::aligned_storage<sizeof(_Tp), __alignof__(_Tp)>
>      {
> -      typename
> -        std::aligned_storage<sizeof(_Tp), __alignof__(_Tp)>::type _M_storage;
> +  // Using __alignof__ gives the alignment for a complete object.
> +  alignas(__alignof__(_Tp)) unsigned char _M_storage[sizeof(_Tp)];
>
>__aligned_buffer() = default;
>
> @@ -120,7 +121,6 @@ namespace __gnu_cxx
>_M_ptr() const noexcept
>{ return static_cast(_M_addr()); }
>  };
> -#pragma GCC diagnostic pop
>  #endif
>
>  } // namespace
> diff --git a/libstdc++-v3/testsuite/abi/aligned_buffers.cc 
> b/libstdc++-v3/testsuite/abi/aligned_buffers.cc
> new file mode 100644
> index 000..b4b8ea13970
> --- /dev/null
> +++ b/libstdc++-v3/testsuite/abi/aligned_buffers.cc
> @@ -0,0 +1,42 @@
> +// { dg-do compile { target c++11 } }
> +
> +// Check alignment of the buffer types used for uninitialized storage.
> +
> +#include <ext/aligned_buffer.h>
> +
> +template<typename T> using membuf = __gnu_cxx::__aligned_membuf<T>;
> +template<typename T> using objbuf = __gnu_cxx::__aligned_buffer<T>;
> +
> +template<typename T>
> +constexpr bool
> +check_alignof_membuf()
> +{
> +  return alignof(membuf<T>) == alignof(T)
> +    && __alignof__(membuf<T>) == alignof(T);
> +}
> +
> +template<typename T>
> +constexpr bool
> +check_alignof_objbuf()
> +{
> +#if _GLIBCXX_INLINE_VERSION
> +  // For the gnu-versioned-namespace ABI __aligned_buffer == __aligned_membuf.
> +  return check_alignof_membuf<T>();
> +#else
> +  return alignof(objbuf<T>) == __alignof__(T)
> +    && __alignof__(objbuf<T>) == __alignof__(T);
> +#endif
> +}
> +
> +struct 

[PATCH v9] aarch64: Add vector popcount besides QImode [PR113859]

2024-06-28 Thread Pengxuan Zheng
This patch improves GCC’s vectorization of __builtin_popcount for aarch64 target
by adding popcount patterns for vector modes besides QImode, i.e., HImode,
SImode and DImode.

With this patch, we now generate the following for V8HI:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b

For V4HI, we generate:
  cnt     v1.8b, v0.8b
  uaddlp  v2.4h, v1.8b

For V4SI, we generate:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h

For V4SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b

For V2SI, we generate:
  cnt     v1.8b, v0.8b
  uaddlp  v2.4h, v1.8b
  uaddlp  v3.2s, v2.4h

For V2SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.8b, #0
  movi    v1.8b, #1
  cnt     v3.8b, v2.8b
  udot    v0.2s, v3.8b, v1.8b

For V2DI, we generate:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h
  uaddlp  v4.2d, v3.4s

For V2DI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b
  uaddlp  v0.2d, v0.4s
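
As a usage sketch (my own reduced example, not part of the patch), the V4SI
sequences above are what the vectorizer can now emit for a loop like:

/* Illustrative only; compiled with -O2 on aarch64.  */
void
popcnt32 (unsigned int *__restrict d, unsigned int *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    d[i] = __builtin_popcount (b[i]);
}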

PR target/113859

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_<su>addlp<mode>): Rename to...
(@aarch64_<su>addlp<mode>): ... This.
(popcount<mode>2): New define_expand.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/popcnt-udot.c: New test.
* gcc.target/aarch64/popcnt-vec.c: New test.

Signed-off-by: Pengxuan Zheng 
---
 gcc/config/aarch64/aarch64-simd.md| 41 ++-
 .../gcc.target/aarch64/popcnt-udot.c  | 58 
 gcc/testsuite/gcc.target/aarch64/popcnt-vec.c | 69 +++
 3 files changed, 167 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-vec.c

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 01b084d8ccb..fd0c5e612b5 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -3461,7 +3461,7 @@ (define_insn 
"*aarch64_addlv_ze"
   [(set_attr "type" "neon_reduc_add")]
 )
 
-(define_expand "aarch64_<su>addlp<mode>"
+(define_expand "@aarch64_<su>addlp<mode>"
   [(set (match_operand: 0 "register_operand")
(plus:
  (vec_select:
@@ -3517,6 +3517,45 @@ (define_insn "popcount<mode>2"
   [(set_attr "type" "neon_cnt")]
 )
 
+(define_expand "popcount<mode>2"
+  [(set (match_operand:VDQHSD 0 "register_operand")
+	(popcount:VDQHSD (match_operand:VDQHSD 1 "register_operand")))]
+  "TARGET_SIMD"
+  {
+    /* Generate a byte popcount.  */
+    machine_mode mode = <bitsize> == 64 ? V8QImode : V16QImode;
+    rtx tmp = gen_reg_rtx (mode);
+    auto icode = optab_handler (popcount_optab, mode);
+    emit_insn (GEN_FCN (icode) (tmp, gen_lowpart (mode, operands[1])));
+
+    if (TARGET_DOTPROD
+	&& (<VEL>mode == SImode || <VEL>mode == DImode))
+      {
+	/* For V4SI and V2SI, we can generate a UDOT with a 0 accumulator and a
+	   1 multiplicand.  For V2DI, another UAADDLP is needed.  */
+	rtx ones = force_reg (mode, CONST1_RTX (mode));
+	auto icode = optab_handler (udot_prod_optab, mode);
+	mode = <bitsize> == 64 ? V2SImode : V4SImode;
+	rtx dest = mode == <MODE>mode ? operands[0] : gen_reg_rtx (mode);
+	rtx zeros = force_reg (mode, CONST0_RTX (mode));
+	emit_insn (GEN_FCN (icode) (dest, tmp, ones, zeros));
+	tmp = dest;
+      }
+
+    /* Use a sequence of UADDLPs to accumulate the counts.  Each step doubles
+       the element size and halves the number of elements.  */
+    while (mode != <MODE>mode)
+      {
+	auto icode = code_for_aarch64_addlp (ZERO_EXTEND, GET_MODE (tmp));
+	mode = insn_data[icode].operand[0].mode;
+	rtx dest = mode == <MODE>mode ? operands[0] : gen_reg_rtx (mode);
+	emit_insn (GEN_FCN (icode) (dest, tmp));
+	tmp = dest;
+      }
+    DONE;
+  }
+)
+
 ;; 'across lanes' max and min ops.
 
 ;; Template for outputting a scalar, so we can create __builtins which can be
diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c 
b/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
new file mode 100644
index 000..f6a968dae95
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=armv8.2-a+dotprod -fno-vect-cost-model 
-fno-schedule-insns -fno-schedule-insns2" } */
+
+/*
+** bar:
+** movi v([0-9]+).16b, 0x1
+** movi v([0-9]+).4s, 0
+** ldr q([0-9]+), \[x0\]
+** cnt v([0-9]+).16b, v\3.16b
+** udot v\2.4s, v\4.16b, v\1.16b
+** str q\2, \[x1\]
+** ret
+*/
+void
+bar (unsigned int *__restrict b, unsigned int *__restrict d)
+{
+  d[0] = __builtin_popcount (b[0]);
+  d[1] = __builtin_popcount (b[1]);
+  d[2] = __builtin_popcount (b[2]);
+  d[3] = __builtin_popcount (b[3]);
+}
+
+/*
+** bar1:
+** movi v([0-9]+).8b, 0x1
+** 

[COMMITTED] ssa_lazy_cache takes an optional bitmap_obstack pointer.

2024-06-28 Thread Andrew MacLeod
There are times when a  bitmap_obstack could be provided to the lazy 
cache, in which case it does not need to manage an obstack on its own.


fast_vrp can have a few  of these live at once, and I anticipate some 
changes to GORI where we may use them a bit more too, so this just 
provides a little more flexibility.
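
As a minimal usage sketch (mine, not from the patch), the new constructor
argument lets several caches share one allocation region:

  bitmap_obstack obs;
  bitmap_obstack_initialize (&obs);
  {
    ssa_lazy_cache c1 (&obs);  // bitmaps allocated on the shared obstack
    ssa_lazy_cache c2 (&obs);
    ssa_lazy_cache c3;         // no argument: manages a private obstack
  }
  bitmap_obstack_release (&obs);  // after the sharing caches are destroyed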


Bootstrapped on x86_64-pc-linux-gnu with no regressions.  Pushed.

Andrew



From 5612541834c063dd4126fb059e59c5dc8d5f2f8e Mon Sep 17 00:00:00 2001
From: Andrew MacLeod 
Date: Wed, 26 Jun 2024 14:53:54 -0400
Subject: [PATCH] ssa_lazy_cache takes an optional bitmap_obstack pointer.

Allow ssa_lazy_cache to allocate bitmaps from a client-provided obstack
if so desired.

	* gimple-range-cache.cc (ssa_lazy_cache::ssa_lazy_cache): Relocate here.
	Check for provided obstack.
	(ssa_lazy_cache::~ssa_lazy_cache): Relocate here.  Free bitmap or obstack.
	* gimple-range-cache.h (ssa_lazy_cache::ssa_lazy_cache): Move.
	(ssa_lazy_cache::~ssa_lazy_cache): Move.
	(ssa_lazy_cache::m_ob): New.
	* gimple-range.cc (dom_ranger::dom_ranger): Initialize obstack.
	(dom_ranger::~dom_ranger): Release obstack.
	(dom_ranger::pre_bb): Create ssa_lazy_cache using obstack.
	* gimple-range.h (m_bitmaps): New.
---
 gcc/gimple-range-cache.cc | 26 ++
 gcc/gimple-range-cache.h  |  9 +++--
 gcc/gimple-range.cc   |  4 +++-
 gcc/gimple-range.h|  1 +
 4 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index 6979a14cbaa..0fffd7c16a1 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -683,6 +683,32 @@ ssa_cache::dump (FILE *f)
 
 }
 
+// Construct an ssa_lazy_cache.  If OB is specified, use it, otherwise use
+// a local bitmap obstack.
+
+ssa_lazy_cache::ssa_lazy_cache (bitmap_obstack *ob)
+{
+  if (!ob)
+    {
+      bitmap_obstack_initialize (&m_bitmaps);
+      m_ob = &m_bitmaps;
+    }
+  else
+    m_ob = ob;
+  active_p = BITMAP_ALLOC (m_ob);
+}
+
+// Destruct an ssa_lazy_cache.  Free the bitmap if it came from a different
+// obstack, or release the obstack if it was a local one.
+
+ssa_lazy_cache::~ssa_lazy_cache ()
+{
+  if (m_ob == &m_bitmaps)
+    bitmap_obstack_release (&m_bitmaps);
+  else
+    BITMAP_FREE (active_p);
+}
+
 // Return true if NAME has an active range in the cache.
 
 bool
diff --git a/gcc/gimple-range-cache.h b/gcc/gimple-range-cache.h
index 0ea34d3f686..539c06753dd 100644
--- a/gcc/gimple-range-cache.h
+++ b/gcc/gimple-range-cache.h
@@ -78,12 +78,8 @@ protected:
 class ssa_lazy_cache : public ssa_cache
 {
 public:
-  inline ssa_lazy_cache ()
-  {
-    bitmap_obstack_initialize (&m_bitmaps);
-    active_p = BITMAP_ALLOC (&m_bitmaps);
-  }
-  inline ~ssa_lazy_cache () { bitmap_obstack_release (&m_bitmaps); }
+  ssa_lazy_cache (bitmap_obstack *ob = NULL);
+  ~ssa_lazy_cache ();
   inline bool empty_p () const { return bitmap_empty_p (active_p); }
   virtual bool has_range (tree name) const;
   virtual bool set_range (tree name, const vrange &r);
@@ -94,6 +90,7 @@ public:
   void merge (const ssa_lazy_cache &);
 protected:
   bitmap_obstack m_bitmaps;
+  bitmap_obstack *m_ob;
   bitmap active_p;
 };
 
diff --git a/gcc/gimple-range.cc b/gcc/gimple-range.cc
index 5df649e268c..7ba7d464b5e 100644
--- a/gcc/gimple-range.cc
+++ b/gcc/gimple-range.cc
@@ -908,6 +908,7 @@ assume_query::dump (FILE *f)
 
 dom_ranger::dom_ranger () : m_global ()
 {
+  bitmap_obstack_initialize (&m_bitmaps);
   m_freelist.create (0);
   m_freelist.truncate (0);
   m_bb.create (0);
@@ -928,6 +929,7 @@ dom_ranger::~dom_ranger ()
 }
   m_bb.release ();
   m_freelist.release ();
+  bitmap_obstack_release (&m_bitmaps);
 }
 
 // Implement range of EXPR on stmt S, and return it in R.
@@ -1071,7 +1073,7 @@ dom_ranger::pre_bb (basic_block bb)
   if (!m_freelist.is_empty ())
 e_cache = m_freelist.pop ();
   else
-e_cache = new ssa_lazy_cache;
+    e_cache = new ssa_lazy_cache (&m_bitmaps);
   gcc_checking_assert (e_cache->empty_p ());
 
   // If there is a single pred, check if there are any ranges on
diff --git a/gcc/gimple-range.h b/gcc/gimple-range.h
index 91177567947..62bd8a87112 100644
--- a/gcc/gimple-range.h
+++ b/gcc/gimple-range.h
@@ -116,6 +116,7 @@ public:
   void pre_bb (basic_block bb);
   void post_bb (basic_block bb);
 protected:
+  bitmap_obstack m_bitmaps;
   void range_in_bb (vrange , basic_block bb, tree name);
   DISABLE_COPY_AND_ASSIGN (dom_ranger);
   ssa_cache m_global;
-- 
2.45.0



RE: [PATCH v6] aarch64: Add vector popcount besides QImode [PR113859]

2024-06-28 Thread Pengxuan Zheng (QUIC)
> On 6/28/24 6:18 AM, Pengxuan Zheng wrote:
> > This patch improves GCC’s vectorization of __builtin_popcount for
> > aarch64 target by adding popcount patterns for vector modes besides
> > QImode, i.e., HImode, SImode and DImode.
> >
> > With this patch, we now generate the following for V8HI:
> >cnt v1.16b, v0.16b
> >uaddlp  v2.8h, v1.16b
> >
> > For V4HI, we generate:
> >cnt v1.8b, v0.8b
> >uaddlp  v2.4h, v1.8b
> >
> > For V4SI, we generate:
> >cnt v1.16b, v0.16b
> >uaddlp  v2.8h, v1.16b
> >uaddlp  v3.4s, v2.8h
> >
> > For V4SI with TARGET_DOTPROD, we generate the following instead:
> >moviv0.4s, #0
> >moviv1.16b, #1
> >cnt v3.16b, v2.16b
> >udotv0.4s, v3.16b, v1.16b
> >
> > For V2SI, we generate:
> >cnt v1.8b, v.8b
> >uaddlp  v2.4h, v1.8b
> >uaddlp  v3.2s, v2.4h
> >
> > For V2SI with TARGET_DOTPROD, we generate the following instead:
> >moviv0.8b, #0
> >moviv1.8b, #1
> >cnt v3.8b, v2.8b
> >udotv0.2s, v3.8b, v1.8b
> >
> > For V2DI, we generate:
> >cnt v1.16b, v.16b
> >uaddlp  v2.8h, v1.16b
> >uaddlp  v3.4s, v2.8h
> >uaddlp  v4.2d, v3.4s
> >
> > For V4SI with TARGET_DOTPROD, we generate the following instead:
> >moviv0.4s, #0
> >moviv1.16b, #1
> >cnt v3.16b, v2.16b
> >udotv0.4s, v3.16b, v1.16b
> >uaddlp  v0.2d, v0.4s
> >
> > PR target/113859
> >
> > gcc/ChangeLog:
> >
> > * config/aarch64/aarch64-simd.md (aarch64_addlp):
> Rename to...
> > (@aarch64_addlp): ... This.
> > (popcount2): New define_expand.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/aarch64/popcnt-udot.c: New test.
> > * gcc.target/aarch64/popcnt-vec.c: New test.
> >
> > Signed-off-by: Pengxuan Zheng 
> > ---
> >   gcc/config/aarch64/aarch64-simd.md| 41 ++-
> >   .../gcc.target/aarch64/popcnt-udot.c  | 58 
> >   gcc/testsuite/gcc.target/aarch64/popcnt-vec.c | 69 +++
> >   3 files changed, 167 insertions(+), 1 deletion(-)
> >   create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
> >   create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-vec.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > b/gcc/config/aarch64/aarch64-simd.md
> > index 01b084d8ccb..afdf3ec7873 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -3461,7 +3461,7 @@ (define_insn
> "*aarch64_addlv_ze"
> > [(set_attr "type" "neon_reduc_add")]
> >   )
> >
> > -(define_expand "aarch64_addlp"
> > +(define_expand "@aarch64_addlp"
> > [(set (match_operand: 0 "register_operand")
> > (plus:
> >   (vec_select:
> > @@ -3517,6 +3517,45 @@ (define_insn
> "popcount2"
> > [(set_attr "type" "neon_cnt")]
> >   )
> >
> > +(define_expand "popcount2"
> > +  [(set (match_operand:VDQHSD 0 "register_operand")
> > +(popcount:VDQHSD (match_operand:VDQHSD 1
> > +"register_operand")))]
> > +  "TARGET_SIMD"
> > +  {
> > +/* Generate a byte popcount. */
> 
> A couple of formatting nits. Two spaces before end of comment.

I noticed this in other places, but didn't realize it's intentional. Glad you 
pointed this out!

> 
> > +machine_mode mode =  == 64 ? V8QImode : V16QImode;
> > +rtx tmp = gen_reg_rtx (mode);
> > +auto icode = optab_handler (popcount_optab, mode);
> > +emit_insn (GEN_FCN (icode) (tmp, gen_lowpart (mode,
> > + operands[1])));
> > +
> > +if (TARGET_DOTPROD
> > +&& (mode == SImode || mode == DImode))
> > +  {
> > +/* For V4SI and V2SI, we can generate a UDOT with a 0 accumulator
> and a
> > +   1 multiplicand. For V2DI, another UAADDLP is needed. */
> 
> Likewise.
> 
> > +rtx ones = force_reg (mode, CONST1_RTX (mode));
> > +auto icode = optab_handler (udot_prod_optab, mode);
> > +mode =  == 64 ? V2SImode : V4SImode;
> > +rtx dest = mode == mode ? operands[0] : gen_reg_rtx
> (mode);
> > +rtx zeros = force_reg (mode, CONST0_RTX (mode));
> > +emit_insn (GEN_FCN (icode) (dest, tmp, ones, zeros));
> > +tmp = dest;
> > +  }
> > +
> > +/* Use a sequence of UADDLPs to accumulate the counts. Each step
> doubles
> > +   the element size and halves the number of elements. */
> 
> Likewise. Also two spaces after the dot before a new sentence.
> 
> You could run your patch through gcc/contrib/check_GNU_style.sh to check
> for formatting nits.

Thanks for the info, Tejas. I just tried running gcc/contrib/check_GNU_style.sh 
on the file I changed, but it didn't seem to warn about this. Maybe I am not using it 
correctly?

Anyway, here's the updated version. Please let me know if you notice anything 
else.
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655991.html

Thanks,
Pengxuan
> 
> Thanks,
> Tejas.
> 
> > +while (mode != mode)
> > +  {
> > +auto icode = code_for_aarch64_addlp (ZERO_EXTEND, 

[PATCH v8] aarch64: Add vector popcount besides QImode [PR113859]

2024-06-28 Thread Pengxuan Zheng
This patch improves GCC’s vectorization of __builtin_popcount for aarch64 target
by adding popcount patterns for vector modes besides QImode, i.e., HImode,
SImode and DImode.

With this patch, we now generate the following for V8HI:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b

For V4HI, we generate:
  cnt     v1.8b, v0.8b
  uaddlp  v2.4h, v1.8b

For V4SI, we generate:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h

For V4SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b

For V2SI, we generate:
  cnt     v1.8b, v0.8b
  uaddlp  v2.4h, v1.8b
  uaddlp  v3.2s, v2.4h

For V2SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.8b, #0
  movi    v1.8b, #1
  cnt     v3.8b, v2.8b
  udot    v0.2s, v3.8b, v1.8b

For V2DI, we generate:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h
  uaddlp  v4.2d, v3.4s

For V2DI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b
  uaddlp  v0.2d, v0.4s

PR target/113859

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_<su>addlp<mode>): Rename to...
(@aarch64_<su>addlp<mode>): ... This.
(popcount<mode>2): New define_expand.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/popcnt-udot.c: New test.
* gcc.target/aarch64/popcnt-vec.c: New test.

Signed-off-by: Pengxuan Zheng 
---
 gcc/config/aarch64/aarch64-simd.md| 41 ++-
 .../gcc.target/aarch64/popcnt-udot.c  | 58 
 gcc/testsuite/gcc.target/aarch64/popcnt-vec.c | 69 +++
 3 files changed, 167 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-vec.c

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 01b084d8ccb..04c97d076a9 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -3461,7 +3461,7 @@ (define_insn 
"*aarch64_addlv_ze"
   [(set_attr "type" "neon_reduc_add")]
 )
 
-(define_expand "aarch64_<su>addlp<mode>"
+(define_expand "@aarch64_<su>addlp<mode>"
   [(set (match_operand: 0 "register_operand")
(plus:
  (vec_select:
@@ -3517,6 +3517,45 @@ (define_insn "popcount<mode>2"
   [(set_attr "type" "neon_cnt")]
 )
 
+(define_expand "popcount<mode>2"
+  [(set (match_operand:VDQHSD 0 "register_operand")
+	(popcount:VDQHSD (match_operand:VDQHSD 1 "register_operand")))]
+  "TARGET_SIMD"
+  {
+    /* Generate a byte popcount.  */
+    machine_mode mode = <bitsize> == 64 ? V8QImode : V16QImode;
+    rtx tmp = gen_reg_rtx (mode);
+    auto icode = optab_handler (popcount_optab, mode);
+    emit_insn (GEN_FCN (icode) (tmp, gen_lowpart (mode, operands[1])));
+
+    if (TARGET_DOTPROD
+	&& (<VEL>mode == SImode || <VEL>mode == DImode))
+      {
+	/* For V4SI and V2SI, we can generate a UDOT with a 0 accumulator and a
+	   1 multiplicand.  For V2DI, another UAADDLP is needed.  */
+	rtx ones = force_reg (mode, CONST1_RTX (mode));
+	auto icode = optab_handler (udot_prod_optab, mode);
+	mode = <bitsize> == 64 ? V2SImode : V4SImode;
+	rtx dest = mode == <MODE>mode ? operands[0] : gen_reg_rtx (mode);
+	rtx zeros = force_reg (mode, CONST0_RTX (mode));
+	emit_insn (GEN_FCN (icode) (dest, tmp, ones, zeros));
+	tmp = dest;
+      }
+
+    /* Use a sequence of UADDLPs to accumulate the counts.  Each step doubles
+       the element size and halves the number of elements.  */
+    while (mode != <MODE>mode)
+      {
+	auto icode = code_for_aarch64_addlp (ZERO_EXTEND, GET_MODE (tmp));
+	mode = insn_data[icode].operand[0].mode;
+	rtx dest = mode == <MODE>mode ? operands[0] : gen_reg_rtx (mode);
+	emit_insn (GEN_FCN (icode) (dest, tmp));
+	tmp = dest;
+      }
+    DONE;
+  }
+)
+
 ;; 'across lanes' max and min ops.
 
 ;; Template for outputting a scalar, so we can create __builtins which can be
diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c 
b/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
new file mode 100644
index 000..f6a968dae95
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=armv8.2-a+dotprod -fno-vect-cost-model 
-fno-schedule-insns -fno-schedule-insns2" } */
+
+/*
+** bar:
+** movi v([0-9]+).16b, 0x1
+** movi v([0-9]+).4s, 0
+** ldr q([0-9]+), \[x0\]
+** cnt v([0-9]+).16b, v\3.16b
+** udot v\2.4s, v\4.16b, v\1.16b
+** str q\2, \[x1\]
+** ret
+*/
+void
+bar (unsigned int *__restrict b, unsigned int *__restrict d)
+{
+  d[0] = __builtin_popcount (b[0]);
+  d[1] = __builtin_popcount (b[1]);
+  d[2] = __builtin_popcount (b[2]);
+  d[3] = __builtin_popcount (b[3]);
+}
+
+/*
+** bar1:
+** movi

RE: [PATCH v7] aarch64: Add vector popcount besides QImode [PR113859]

2024-06-28 Thread Pengxuan Zheng (QUIC)
Please ignore this patch. I accidentally added unrelated changes. I'll push a 
correct version shortly.

Sorry for the noise.

Thanks,
Pengxuan
> This patch improves GCC’s vectorization of __builtin_popcount for aarch64
> target by adding popcount patterns for vector modes besides QImode, i.e.,
> HImode, SImode and DImode.
> 
> With this patch, we now generate the following for V8HI:
>   cnt v1.16b, v0.16b
>   uaddlp  v2.8h, v1.16b
> 
> For V4HI, we generate:
>   cnt v1.8b, v0.8b
>   uaddlp  v2.4h, v1.8b
> 
> For V4SI, we generate:
>   cnt v1.16b, v0.16b
>   uaddlp  v2.8h, v1.16b
>   uaddlp  v3.4s, v2.8h
> 
> For V4SI with TARGET_DOTPROD, we generate the following instead:
>   moviv0.4s, #0
>   moviv1.16b, #1
>   cnt v3.16b, v2.16b
>   udotv0.4s, v3.16b, v1.16b
> 
> For V2SI, we generate:
>   cnt v1.8b, v.8b
>   uaddlp  v2.4h, v1.8b
>   uaddlp  v3.2s, v2.4h
> 
> For V2SI with TARGET_DOTPROD, we generate the following instead:
>   moviv0.8b, #0
>   moviv1.8b, #1
>   cnt v3.8b, v2.8b
>   udotv0.2s, v3.8b, v1.8b
> 
> For V2DI, we generate:
>   cnt v1.16b, v.16b
>   uaddlp  v2.8h, v1.16b
>   uaddlp  v3.4s, v2.8h
>   uaddlp  v4.2d, v3.4s
> 
> For V4SI with TARGET_DOTPROD, we generate the following instead:
>   moviv0.4s, #0
>   moviv1.16b, #1
>   cnt v3.16b, v2.16b
>   udotv0.4s, v3.16b, v1.16b
>   uaddlp  v0.2d, v0.4s
> 
>   PR target/113859
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-simd.md (aarch64_addlp):
> Rename to...
>   (@aarch64_addlp): ... This.
>   (popcount2): New define_expand.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/aarch64/popcnt-udot.c: New test.
>   * gcc.target/aarch64/popcnt-vec.c: New test.
> 
> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-simd.md| 41 ++-
>  .../gcc.target/aarch64/popcnt-udot.c  | 58 
>  gcc/testsuite/gcc.target/aarch64/popcnt-vec.c | 69 +++
>  3 files changed, 167 insertions(+), 1 deletion(-)  create mode 100644
> gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-vec.c
> 
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index 01b084d8ccb..04c97d076a9 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3461,7 +3461,7 @@ (define_insn
> "*aarch64_addlv_ze"
>[(set_attr "type" "neon_reduc_add")]
>  )
> 
> -(define_expand "aarch64_addlp"
> +(define_expand "@aarch64_addlp"
>[(set (match_operand: 0 "register_operand")
>   (plus:
> (vec_select:
> @@ -3517,6 +3517,45 @@ (define_insn "popcount2"
>[(set_attr "type" "neon_cnt")]
>  )
> 
> +(define_expand "popcount2"
> +  [(set (match_operand:VDQHSD 0 "register_operand")
> +(popcount:VDQHSD (match_operand:VDQHSD 1 "register_operand")))]
> +  "TARGET_SIMD"
> +  {
> +/* Generate a byte popcount.  */
> +machine_mode mode =  == 64 ? V8QImode : V16QImode;
> +rtx tmp = gen_reg_rtx (mode);
> +auto icode = optab_handler (popcount_optab, mode);
> +emit_insn (GEN_FCN (icode) (tmp, gen_lowpart (mode, operands[1])));
> +
> +if (TARGET_DOTPROD
> +&& (mode == SImode || mode == DImode))
> +  {
> +/* For V4SI and V2SI, we can generate a UDOT with a 0 accumulator and
> a
> +   1 multiplicand.  For V2DI, another UAADDLP is needed.  */
> +rtx ones = force_reg (mode, CONST1_RTX (mode));
> +auto icode = optab_handler (udot_prod_optab, mode);
> +mode =  == 64 ? V2SImode : V4SImode;
> +rtx dest = mode == mode ? operands[0] : gen_reg_rtx (mode);
> +rtx zeros = force_reg (mode, CONST0_RTX (mode));
> +emit_insn (GEN_FCN (icode) (dest, tmp, ones, zeros));
> +tmp = dest;
> +  }
> +
> +/* Use a sequence of UADDLPs to accumulate the counts.  Each step
> doubles
> +   the element size and halves the number of elements.  */
> +while (mode != mode)
> +  {
> +auto icode = code_for_aarch64_addlp (ZERO_EXTEND, GET_MODE
> (tmp));
> +mode = insn_data[icode].operand[0].mode;
> +rtx dest = mode == mode ? operands[0] : gen_reg_rtx (mode);
> +emit_insn (GEN_FCN (icode) (dest, tmp));
> +tmp = dest;
> +  }
> +DONE;
> +  }
> +)
> +
>  ;; 'across lanes' max and min ops.
> 
>  ;; Template for outputting a scalar, so we can create __builtins which can be
> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
> b/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
> new file mode 100644
> index 000..f6a968dae95
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
> @@ -0,0 +1,58 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=armv8.2-a+dotprod -fno-vect-cost-model
> +-fno-schedule-insns -fno-schedule-insns2" } */
> +
> +/*
> +** bar:
> +**   moviv([0-9]+).16b, 0x1
> +**   moviv([0-9]+).4s, 0
> +**   

[PATCH v7] aarch64: Add vector popcount besides QImode [PR113859]

2024-06-28 Thread Pengxuan Zheng
This patch improves GCC’s vectorization of __builtin_popcount for aarch64 target
by adding popcount patterns for vector modes besides QImode, i.e., HImode,
SImode and DImode.

With this patch, we now generate the following for V8HI:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b

For V4HI, we generate:
  cnt     v1.8b, v0.8b
  uaddlp  v2.4h, v1.8b

For V4SI, we generate:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h

For V4SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b

For V2SI, we generate:
  cnt     v1.8b, v0.8b
  uaddlp  v2.4h, v1.8b
  uaddlp  v3.2s, v2.4h

For V2SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.8b, #0
  movi    v1.8b, #1
  cnt     v3.8b, v2.8b
  udot    v0.2s, v3.8b, v1.8b

For V2DI, we generate:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h
  uaddlp  v4.2d, v3.4s

For V2DI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b
  uaddlp  v0.2d, v0.4s

PR target/113859

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_<su>addlp<mode>): Rename to...
(@aarch64_<su>addlp<mode>): ... This.
(popcount<mode>2): New define_expand.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/popcnt-udot.c: New test.
* gcc.target/aarch64/popcnt-vec.c: New test.

Signed-off-by: Pengxuan Zheng 
---
 gcc/config/aarch64/aarch64-simd.md| 41 ++-
 .../gcc.target/aarch64/popcnt-udot.c  | 58 
 gcc/testsuite/gcc.target/aarch64/popcnt-vec.c | 69 +++
 3 files changed, 167 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-vec.c

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 01b084d8ccb..04c97d076a9 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -3461,7 +3461,7 @@ (define_insn 
"*aarch64_addlv_ze"
   [(set_attr "type" "neon_reduc_add")]
 )
 
-(define_expand "aarch64_<su>addlp<mode>"
+(define_expand "@aarch64_<su>addlp<mode>"
   [(set (match_operand: 0 "register_operand")
(plus:
  (vec_select:
@@ -3517,6 +3517,45 @@ (define_insn "popcount<mode>2"
   [(set_attr "type" "neon_cnt")]
 )
 
+(define_expand "popcount<mode>2"
+  [(set (match_operand:VDQHSD 0 "register_operand")
+	(popcount:VDQHSD (match_operand:VDQHSD 1 "register_operand")))]
+  "TARGET_SIMD"
+  {
+    /* Generate a byte popcount.  */
+    machine_mode mode = <bitsize> == 64 ? V8QImode : V16QImode;
+    rtx tmp = gen_reg_rtx (mode);
+    auto icode = optab_handler (popcount_optab, mode);
+    emit_insn (GEN_FCN (icode) (tmp, gen_lowpart (mode, operands[1])));
+
+    if (TARGET_DOTPROD
+	&& (<VEL>mode == SImode || <VEL>mode == DImode))
+      {
+	/* For V4SI and V2SI, we can generate a UDOT with a 0 accumulator and a
+	   1 multiplicand.  For V2DI, another UAADDLP is needed.  */
+	rtx ones = force_reg (mode, CONST1_RTX (mode));
+	auto icode = optab_handler (udot_prod_optab, mode);
+	mode = <bitsize> == 64 ? V2SImode : V4SImode;
+	rtx dest = mode == <MODE>mode ? operands[0] : gen_reg_rtx (mode);
+	rtx zeros = force_reg (mode, CONST0_RTX (mode));
+	emit_insn (GEN_FCN (icode) (dest, tmp, ones, zeros));
+	tmp = dest;
+      }
+
+    /* Use a sequence of UADDLPs to accumulate the counts.  Each step doubles
+       the element size and halves the number of elements.  */
+    while (mode != <MODE>mode)
+      {
+	auto icode = code_for_aarch64_addlp (ZERO_EXTEND, GET_MODE (tmp));
+	mode = insn_data[icode].operand[0].mode;
+	rtx dest = mode == <MODE>mode ? operands[0] : gen_reg_rtx (mode);
+	emit_insn (GEN_FCN (icode) (dest, tmp));
+	tmp = dest;
+      }
+    DONE;
+  }
+)
+
 ;; 'across lanes' max and min ops.
 
 ;; Template for outputting a scalar, so we can create __builtins which can be
diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c 
b/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
new file mode 100644
index 000..f6a968dae95
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/popcnt-udot.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=armv8.2-a+dotprod -fno-vect-cost-model 
-fno-schedule-insns -fno-schedule-insns2" } */
+
+/*
+** bar:
+** movi v([0-9]+).16b, 0x1
+** movi v([0-9]+).4s, 0
+** ldr q([0-9]+), \[x0\]
+** cnt v([0-9]+).16b, v\3.16b
+** udot v\2.4s, v\4.16b, v\1.16b
+** str q\2, \[x1\]
+** ret
+*/
+void
+bar (unsigned int *__restrict b, unsigned int *__restrict d)
+{
+  d[0] = __builtin_popcount (b[0]);
+  d[1] = __builtin_popcount (b[1]);
+  d[2] = __builtin_popcount (b[2]);
+  d[3] = __builtin_popcount (b[3]);
+}
+
+/*
+** bar1:
+** movi

[wwwdocs, committed] git: Move current devel/omp/gcc branch to 14

2024-06-28 Thread Paul-Antoine Arras

Committed as debf3885965604c81541a549d531ec450f498058
https://gcc.gnu.org/git.html#general
--
PA

commit debf3885965604c81541a549d531ec450f498058
Author: Paul-Antoine Arras 
Date:   Fri Jun 28 12:08:57 2024 +0200

git: Move current devel/omp/gcc branch to 14

diff --git htdocs/git.html htdocs/git.html
index a6e88566..b5c2737a 100644
--- htdocs/git.html
+++ htdocs/git.html
@@ -280,17 +280,17 @@ in Git.
   Makarov vmaka...@redhat.com.
   
 
-  https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;a=shortlog;h=refs/heads/devel/omp/gcc-13;>devel/omp/gcc-13
+  https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;a=shortlog;h=refs/heads/devel/omp/gcc-14;>devel/omp/gcc-14
   This branch is for collaborative development of
   https://gcc.gnu.org/wiki/OpenACC;>OpenACC and
   https://gcc.gnu.org/wiki/openmp;>OpenMP support and related
   functionality, such
   as https://gcc.gnu.org/wiki/Offloading;>offloading support (OMP:
   offloading and multi processing).
-  The branch is based on releases/gcc-13.
-  Please send patch emails with a short-hand [og13] tag in the
+  The branch is based on releases/gcc-14.
+  Please send patch emails with a short-hand [og14] tag in the
   subject line, and use ChangeLog.omp files. (Likewise but now
-  stale branches exists for the prior GCC releases 9 to 12.)
+  stale branches exists for the prior GCC releases 9 to 13.)
 
   unified-autovect
   This branch is for work on improving effectiveness and generality of GCC's
@@ -897,14 +897,15 @@ merged.
   https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;a=shortlog;h=refs/heads/devel/omp/gcc-9;>devel/omp/gcc-9
   https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;a=shortlog;h=refs/heads/devel/omp/gcc-10;>devel/omp/gcc-10
   https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;a=shortlog;h=refs/heads/devel/omp/gcc-11;>devel/omp/gcc-11
+  https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;a=shortlog;h=refs/heads/devel/omp/gcc-12;>devel/omp/gcc-12
+  https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;a=shortlog;h=refs/heads/devel/omp/gcc-13;>devel/omp/gcc-13
   These branches were used for collaborative development of
   https://gcc.gnu.org/wiki/OpenACC;>OpenACC and
   https://gcc.gnu.org/wiki/openmp;>OpenMP support and related
   functionality as the successors to openacc-gcc-9-branch after the move to
   Git.
-  The branches were based on releases/gcc-9, releases/gcc-10 and
-  releases/gcc-11 respectively.
-  Development has now moved to the devel/omp/gcc-12 branch.
+  The branches were based on releases/gcc-9, releases/gcc-10, etc.
+  Development has now moved to the devel/omp/gcc-14 branch.
 
   hammer-3_3-branch
   The goal of this branch was to have a stable compiler based on GCC 3.3


Re: nvptx vs. [PATCH] Add a late-combine pass [PR106594]

2024-06-28 Thread Richard Sandiford
Richard Sandiford  writes:
> Thomas Schwinge  writes:
>> Hi!
>>
>> On 2024-06-27T23:20:18+0200, I wrote:
>>> On 2024-06-27T22:27:21+0200, I wrote:
 On 2024-06-27T18:49:17+0200, I wrote:
> On 2023-10-24T19:49:10+0100, Richard Sandiford 
>  wrote:
>> This patch adds a combine pass that runs late in the pipeline.

 [After sending, I realized I replied to a previous thread of this work.]

> I've been looking a bit through recent nvptx target code generation
> changes for GCC target libraries, and thought I'd also share here my
> findings for the "late-combine" changes in isolation, for nvptx target.
> 
> First the unexpected thing:

 So much for "unexpected thing" -- next level of unexpected here...
 Appreciated if anyone feels like helping me find my way through this, but
 I totally understand if you've got other things to do.
>>>
>>> OK, I found something already.  (Unexpectedly quickly...)  ;-)
>>>
> there are a few cases where we now see unused
> registers get declared
>>
>>> But in fact, for both cases
>>
>> Now tested: 's%both%all'.  :-)
>>
>>> the unexpected difference goes away if after
>>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run
>>> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's
>>> all not active for nvptx target given '!reload_completed', given nvptx is
>>> 'targetm.no_register_allocation'.  Maybe we need to enable a few more
>>> passes, or is there anything in 'pass_late_combine' to change, so that we
>>> don't run into this?  Does it inadvertently mark registers live or
>>> something like that?
>>
>> Basically, is 'pass_late_combine' potentially doing things that depend
>> on later clean-up?  (..., or shouldn't it be doing these things in the
>> first place?)
>
> It's possible that late-combine could expose dead code, but I imagine
> it's a niche case.
>
> I had a look at the nvptx logs from my comparison, and the cases in
> which I saw this seemed to be those where late-combine doesn't find
> anything to do.  Does that match your examples?  Specifically,
> the effect should be the same with -fdbg-cnt=late_combine:0-0
>
> I think what's happening is that:
>
> - combine exposes dead code
>
> - ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared
>   up the dead code
>
> - late-combine instead runs df_analyze without that flag (since late-combine
>   itself doesn't really care whether dead code is present)
>
> - if late-combine doesn't do anything, ce2's df_analyze call has nothing
>   to do, and skips even the DCE
>
> The easiest fix would be to add:
>
>   df_set_flags (DF_LR_RUN_DCE);
>
> before df_analyze in late-combine.cc, so that it behaves like ce2.
> But the arrangement feels wrong.  I would have expected DF_LR_RUN_DCE
> to depend on whether df_analyze had been called since the last DCE pass
> (whether DF_LR_RUN_DCE or a full DCE).

I'm testing the attached patch to do that.  I'll submit it properly if
testing passes, but it seems to fix the extra-register problem for me.

Thanks,
Richard

---
Give fast DCE a separate dirty flag

Thomas pointed out that we sometimes failed to eliminate some dead code
(specifically clobbers of otherwise unused registers) on nvptx when
late-combine is enabled.  This happens because:

- combine is able to optimise the function in a way that exposes dead code.
  This leaves the df information in a "dirty" state.

- late_combine calls df_analyze without DF_LR_RUN_DCE set.
  This updates the df information and clears the "dirty" state.

- late_combine doesn't find any extra optimisations, and so leaves
  the df information up-to-date.

- if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set.
  Because the df information is already up-to-date, fast DCE is
  not run.

The upshot is that running late-combine has the effect of suppressing
a DCE opportunity that would have been noticed without late_combine.

I think this shows that we should track the state of the DCE separately
from the LR problem.  Every pass updates the latter, but not all passes
update the former.
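
In sketch form (names as in the patch below; illustrative, not the complete
change):

  /* dce.cc: after fast DCE runs, mark the DCE sub-problem clean.  */
  df_lr_dce->solutions_dirty = false;

  /* A later df_analyze call with DF_LR_RUN_DCE set then re-runs fast DCE
     whenever that separate flag is still dirty, even if the LR solution
     itself is already up to date.  */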

gcc/
* df.h (DF_LR_DCE): New df_problem_id.
(df_lr_dce): New macro.
* df-core.cc (rest_of_handle_df_finish): Check for a null free_fun.
* df-problems.cc (df_lr_finalize): Split out fast DCE handling to...
(df_lr_dce_finalize): ...this new function.
(problem_LR_DCE): New df_problem.
(df_lr_add_problem): Register LR_DCE rather than LR itself.
* dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty.
---
 gcc/dce.cc |  3 ++
 gcc/df-core.cc |  3 +-
 gcc/df-problems.cc | 96 --
 gcc/df.h   |  2 +
 4 files changed, 74 insertions(+), 30 deletions(-)

diff --git a/gcc/dce.cc b/gcc/dce.cc
index be1a2a87732..04e8d98818d 100644
--- a/gcc/dce.cc
+++ b/gcc/dce.cc
@@ -1182,6 +1182,9 @@ fast_dce (bool word_level)
   BITMAP_FREE 

[PATCH] i386: Cleanup tmp variable usage in ix86_expand_move

2024-06-28 Thread Uros Bizjak
Remove extra assignment, extra temp variable and variable shadowing.

No functional changes intended.

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_expand_move): Remove extra
assignment to tmp variable, reuse tmp variable instead of
declaring new temporary variable and remove tmp variable shadowing.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Also built cross-compilers to x86_64-pc-cygwin and x86_64-apple-darwin16.

Uros.
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index a4434c19272..a773b45bf03 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -414,9 +414,6 @@ ix86_expand_move (machine_mode mode, rtx operands[])
{
 #if TARGET_PECOFF
  tmp = legitimize_pe_coff_symbol (op1, addend != NULL_RTX);
-#else
- tmp = NULL_RTX;
-#endif
 
  if (tmp)
{
@@ -425,6 +422,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
break;
}
  else
+#endif
{
  op1 = operands[1];
  break;
@@ -482,12 +480,12 @@ ix86_expand_move (machine_mode mode, rtx operands[])
  /* dynamic-no-pic */
  if (MACHOPIC_INDIRECT)
{
- rtx temp = (op0 && REG_P (op0) && mode == Pmode)
-? op0 : gen_reg_rtx (Pmode);
- op1 = machopic_indirect_data_reference (op1, temp);
+ tmp = (op0 && REG_P (op0) && mode == Pmode)
+   ? op0 : gen_reg_rtx (Pmode);
+ op1 = machopic_indirect_data_reference (op1, tmp);
  if (MACHOPIC_PURE)
op1 = machopic_legitimize_pic_address (op1, mode,
-  temp == op1 ? 0 : temp);
+  tmp == op1 ? 0 : tmp);
}
  if (op0 != op1 && GET_CODE (op0) != MEM)
{
@@ -542,9 +540,9 @@ ix86_expand_move (machine_mode mode, rtx operands[])
  op1 = validize_mem (force_const_mem (mode, op1));
  if (!register_operand (op0, mode))
{
- rtx temp = gen_reg_rtx (mode);
- emit_insn (gen_rtx_SET (temp, op1));
- emit_move_insn (op0, temp);
+ tmp = gen_reg_rtx (mode);
+ emit_insn (gen_rtx_SET (tmp, op1));
+ emit_move_insn (op0, tmp);
  return;
}
}
@@ -565,7 +563,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
   if (SUBREG_BYTE (op0) == 0)
{
  wide_int mask = wi::mask (64, true, 128);
- rtx tmp = immed_wide_int_const (mask, TImode);
+ tmp = immed_wide_int_const (mask, TImode);
  op0 = SUBREG_REG (op0);
  tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
  if (mode == DFmode)
@@ -577,7 +575,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
   else if (SUBREG_BYTE (op0) == 8)
{
  wide_int mask = wi::mask (64, false, 128);
- rtx tmp = immed_wide_int_const (mask, TImode);
+ tmp = immed_wide_int_const (mask, TImode);
  op0 = SUBREG_REG (op0);
  tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
  if (mode == DFmode)


Re: [PATCH v3] Arm: Fix disassembly error in Thumb-1 relaxed load/store [PR115188]

2024-06-28 Thread Richard Earnshaw (lists)
On 27/06/2024 17:16, Wilco Dijkstra wrote:
> Hi Richard,
> 
>> Doing just this will mean that the register allocator will have to undo a 
>> pre/post memory operand that was accepted by the predicate (memory_operand). 
>>  I think we really need a tighter predicate (lets call it noautoinc_mem_op) 
>> here to avoid that.  Note that the existing uses of Uw also had another 
>> alternative that did permit 'm', so this wasn't previously practical, but 
>> they had alternative ways of being reloaded.
>>
>> No, sorry that won't work; there's another 'm' alternative here as well.
>> The correct fix is to add alternatives for T1, I think, similar to the one 
>> in thumb1_movsi_insn.
>>
>> Also, by observation I think there's a similar problem in the load 
>> operations.
> 
> Just using 'Uw' works fine, but restricting the memory operand too is better 
> indeed.
> I added 'restricted_memory_operand' that only disallows Thumb-1 postincrement.
> 
> There were also a few more cases in unaligned accesses where 'm' was used
> incorrectly when emitting Thumb-1 LDR/STR alternatives (and where no LDM/STM
> is allowed), so those also use 'Uw' and 'restricted_memory_operand'.
> 
> Long term it seems like a better idea to remove support for this odd
> post-increment in the general memory operand and only emit it from a
> peephole pass.
> 
> Cheers,
> Wilco
> 
> 
> v3: Use 'Uw' in a few more cases. Add 'restricted_memory_operand'.
> 
> A Thumb-1 memory operand allows single-register LDMIA/STMIA. This doesn't get
> printed as LDR/STR with writeback in unified syntax, resulting in strange
> assembler errors if writeback is selected.  To work around this, use the 'Uw'
> constraint that blocks writeback.  Also use a new 'restricted_memory_operand'
> which is a general memory operand that disallows writeback in Thumb-1.
> A few other patterns were using 'm' for Thumb-1 in a similar way, update these
> to also use 'restricted_memory_operand' and 'Uw'.
> 
> Passes bootstrap & regress, OK for commit (and backport to GCC14.2)?

I'm not a major fan of the name restricted_memory_operand as it doesn't 
describe which restriction is being applied and something like 
t1_restricted_memory_operand would not be any clearer.  Perhaps 
mem_and_no_t1_wback_op would be better?
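
Something along these lines, perhaps (an untested sketch on my part, not
the actual definition, which is cut off in the quoted patch below):

  (define_predicate "mem_and_no_t1_wback_op"
    (and (match_operand 0 "memory_operand")
         (match_test "!(TARGET_THUMB1 && side_effects_p (XEXP (op, 0)))")))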

OK with that change.

R.

> 
> gcc:
> PR target/115188
> * config/arm/arm.md (unaligned_loadsi): Use 'Uw' constraint and
> 'restricted_memory_operand'.
> (unaligned_loadhiu): Likewise.
> (unaligned_storesi): Likewise.
> (unaligned_storehi): Likewise.
> * config/arm/predicates.md (restricted_memory_operand): Add new 
> predicate.
> * config/arm/sync.md (arm_atomic_load): Use 'Uw' constraint.
> (arm_atomic_store): Likewise.
> 
> gcc/testsuite:
> PR target/115188
> * gcc.target/arm/pr115188.c: Add new test.
> 
> ---
> 
> diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
> index 
> f47e036a8034ed16c61bbd753c7a7cd3efb1ecbd..c962a9341779e4da38f4e1afb26d4a364fc5aee4
>  100644
> --- a/gcc/config/arm/arm.md
> +++ b/gcc/config/arm/arm.md
> @@ -5011,7 +5011,7 @@
>  
>  (define_insn "unaligned_loadsi"
>[(set (match_operand:SI 0 "s_register_operand" "=l,l,r")
> - (unspec:SI [(match_operand:SI 1 "memory_operand" "m,Uw,m")]
> + (unspec:SI [(match_operand:SI 1 "restricted_memory_operand" "Uw,Uw,m")]
>  UNSPEC_UNALIGNED_LOAD))]
>"unaligned_access"
>"@
> @@ -5041,7 +5041,7 @@
>  (define_insn "unaligned_loadhiu"
>[(set (match_operand:SI 0 "s_register_operand" "=l,l,r")
>   (zero_extend:SI
> -   (unspec:HI [(match_operand:HI 1 "memory_operand" "m,Uw,m")]
> +   (unspec:HI [(match_operand:HI 1 "restricted_memory_operand" 
> "Uw,Uw,m")]
>UNSPEC_UNALIGNED_LOAD)))]
>"unaligned_access"
>"@
> @@ -5066,7 +5066,7 @@
> (set_attr "type" "store_8")])
>  
>  (define_insn "unaligned_storesi"
> -  [(set (match_operand:SI 0 "memory_operand" "=m,Uw,m")
> +  [(set (match_operand:SI 0 "restricted_memory_operand" "=Uw,Uw,m")
>   (unspec:SI [(match_operand:SI 1 "s_register_operand" "l,l,r")]
>  UNSPEC_UNALIGNED_STORE))]
>"unaligned_access"
> @@ -5081,7 +5081,7 @@
> (set_attr "type" "store_4")])
>  
>  (define_insn "unaligned_storehi"
> -  [(set (match_operand:HI 0 "memory_operand" "=m,Uw,m")
> +  [(set (match_operand:HI 0 "restricted_memory_operand" "=Uw,Uw,m")
>   (unspec:HI [(match_operand:HI 1 "s_register_operand" "l,l,r")]
>  UNSPEC_UNALIGNED_STORE))]
>"unaligned_access"
> diff --git a/gcc/config/arm/predicates.md b/gcc/config/arm/predicates.md
> index 
> 4994c0c57d6431117c16f7a05e800821dee93408..3dfe381c098c06517dca6026f8dafe87b46135ae
>  100644
> --- a/gcc/config/arm/predicates.md
> +++ b/gcc/config/arm/predicates.md
> @@ -907,3 +907,8 @@
>  ;; A special predicate that doesn't match a particular mode.
>  (define_special_predicate "arm_any_register_operand"
>(match_code 

Document 'pass_postreload' vs. 'pass_late_compilation' (was: The nvptx port [4/11+] Post-RA pipeline)

2024-06-28 Thread Thomas Schwinge
Hi!

Before we start looking into enabling certain 'pass_postreload' passes
for nvptx, as we've been discussing in

"nvptx vs. [PATCH] Add a late-combine pass [PR106594]", let's first
document the (not quite obvious) status quo:

On 2014-10-20T16:24:43+0200, Bernd Schmidt  wrote:
> This stops most of the post-regalloc passes to be run if the target 
> doesn't want register allocation. I'd previously moved them all out of 
> postreload to the toplevel, but Jakub (I think) pointed out that the 
> idea is not to run them to avoid crashes if reload fails e.g. for an 
> invalid asm. So I've made a new container pass.

OK to push "Document 'pass_postreload' vs. 'pass_late_compilation'", see
attached?


Regards
 Thomas


> A later patch will make thread_prologue_and_epilogue_insns callable from 
> the backend.
>
>
> Bernd
>
>   gcc/
>   * passes.def (pass_compute_alignments, pass_duplicate_computed_gotos,
>   pass_variable_tracking, pass_free_cfg, pass_machine_reorg,
>   pass_cleanup_barriers, pass_delay_slots,
>   pass_split_for_shorten_branches, pass_convert_to_eh_region_ranges,
>   pass_shorten_branches, pass_est_nothrow_function_flags,
>   pass_dwarf2_frame, pass_final): Move outside of pass_postreload and
>   into pass_late_compilation.
>   (pass_late_compilation): Add.
>   * passes.c (pass_data_late_compilation, pass_late_compilation,
>   make_pass_late_compilation): New.
>   * timevar.def (TV_LATE_COMPILATION): New.
>
> 
> Index: gcc/passes.def
> ===
> --- gcc/passes.def.orig
> +++ gcc/passes.def
> @@ -415,6 +415,9 @@ along with GCC; see the file COPYING3.
> NEXT_PASS (pass_split_before_regstack);
> NEXT_PASS (pass_stack_regs_run);
> POP_INSERT_PASSES ()
> +  POP_INSERT_PASSES ()
> +  NEXT_PASS (pass_late_compilation);
> +  PUSH_INSERT_PASSES_WITHIN (pass_late_compilation)
> NEXT_PASS (pass_compute_alignments);
> NEXT_PASS (pass_variable_tracking);
> NEXT_PASS (pass_free_cfg);
> Index: gcc/passes.c
> ===
> --- gcc/passes.c.orig
> +++ gcc/passes.c
> @@ -569,6 +569,44 @@ make_pass_postreload (gcc::context *ctxt
>return new pass_postreload (ctxt);
>  }
>  
> +namespace {
> +
> +const pass_data pass_data_late_compilation =
> +{
> +  RTL_PASS, /* type */
> +  "*all-late_compilation", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_LATE_COMPILATION, /* tv_id */
> +  PROP_rtl, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
> +
> +class pass_late_compilation : public rtl_opt_pass
> +{
> +public:
> +  pass_late_compilation (gcc::context *ctxt)
> +: rtl_opt_pass (pass_data_late_compilation, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +  {
> +return reload_completed || targetm.no_register_allocation;
> +  }
> +
> +}; // class pass_late_compilation
> +
> +} // anon namespace
> +
> +static rtl_opt_pass *
> +make_pass_late_compilation (gcc::context *ctxt)
> +{
> +  return new pass_late_compilation (ctxt);
> +}
> +
>  
>  
>  /* Set the static pass number of pass PASS to ID and record that
> Index: gcc/timevar.def
> ===
> --- gcc/timevar.def.orig
> +++ gcc/timevar.def
> @@ -270,6 +270,7 @@ DEFTIMEVAR (TV_EARLY_LOCAL , "early
>  DEFTIMEVAR (TV_OPTIMIZE   , "unaccounted optimizations")
>  DEFTIMEVAR (TV_REST_OF_COMPILATION   , "rest of compilation")
>  DEFTIMEVAR (TV_POSTRELOAD , "unaccounted post reload")
> +DEFTIMEVAR (TV_LATE_COMPILATION   , "unaccounted late compilation")
>  DEFTIMEVAR (TV_REMOVE_UNUSED  , "remove unused locals")
>  DEFTIMEVAR (TV_ADDRESS_TAKEN  , "address taken")
>  DEFTIMEVAR (TV_TODO   , "unaccounted todo")


>From 7f708dd9774773e704cb06b7a6f296927f9057df Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Fri, 28 Jun 2024 16:04:18 +0200
Subject: [PATCH] Document 'pass_postreload' vs. 'pass_late_compilation'

See Subversion r217124 (Git commit 433e4164339f18d0b8798968444a56b681b5232c)
"Reorganize post-ra pipeline for targets without register allocation".

	gcc/
	* passes.cc: Document 'pass_postreload' vs. 'pass_late_compilation'.
	* passes.def: Likewise.
---
 gcc/passes.cc  | 14 +-
 gcc/passes.def |  3 +++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/gcc/passes.cc b/gcc/passes.cc
index d3648a24b58..e444b462113 100644
--- a/gcc/passes.cc
+++ b/gcc/passes.cc
@@ -660,6 +660,10 @@ make_pass_rest_of_compilation (gcc::context *ctxt)
 
 namespace {
 
+/* A container pass (only) 

RE: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip

2024-06-28 Thread Li, Pan2
Thanks Tamar and Richard for the enlightening comments.

> I think you're doing the MIN_EXPR wrong - the above says MIN_EXPR
>  which doesn't make
> sense anyway.  I suspect you fail to put the MIN_EXPR to a separate statement?

Make sense, will have another try for this.

> Aye, you need to emit the additional statements through  
> append_pattern_def_seq,
> This is also because the scalar statement doesn’t require them, so it makes 
> costing easier.
> The vectorizer expects arguments to be simple use, so compound statements 
> aren't
> Supported as they make costing and codegen harder.

Yes, you are right. The operand is not an ssa_name during the simple-use
check, so vectorizable_conversion returns failure.
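
For the record, I'd expect the separate statement to be emitted roughly
like this (a sketch only; op_1, itype and vectype stand for whatever the
pattern function has at hand, and the 65535 bound is an assumption):

  /* Keep the MIN_EXPR in its own pattern statement so that the .SAT_SUB
     operand is a plain SSA name and passes the simple-use check.  */
  tree max_cst = build_int_cst (itype, 65535);
  tree min_var = vect_recog_temp_ssa_var (itype, NULL);
  gimple *min_stmt = gimple_build_assign (min_var, MIN_EXPR, op_1, max_cst);
  append_pattern_def_seq (vinfo, stmt_vinfo, min_stmt, vectype);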

Pan

-Original Message-
From: Tamar Christina  
Sent: Friday, June 28, 2024 9:39 PM
To: Richard Biener ; Li, Pan2 
Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
jeffreya...@gmail.com; rdapp@gmail.com
Subject: RE: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip

> -Original Message-
> From: Richard Biener 
> Sent: Friday, June 28, 2024 6:39 AM
> To: Li, Pan2 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
> jeffreya...@gmail.com; rdapp@gmail.com; Tamar Christina
> 
> Subject: Re: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip
> 
> On Thu, Jun 27, 2024 at 4:45 PM Li, Pan2  wrote:
> >
> > Hi Richard,
> >
> > As mentioned by tamar in previous, would like to try even more optimization
> based on this patch.
> > Assume we take zip benchmark as example, we may have gimple similar as below
> >
> > unsigned int _1, _2;
> > unsigned short int _9;
> >
> > _9 = (unsigned short int).SAT_SUB (_1, _2);
> >
> > If we can determine that _1 is in the range of unsigned short, we can
> > distribute the conversion into the .SAT_SUB, aka:
> >
> > From:
> > _1 = (unsigned short int)_other;
> > _9 = (unsigned short int).SAT_SUB (_1, _2);
> >
> > To:
> > _9 = .SAT_SUB ((unsigned short int)_1, (unsigned short int)MIN_EXPR (_2, 65535));
> >
> > Unfortunately, it failed to vectorize when I tried to perform the above
> > changes.  The vectorizable_conversion considers it not a simple use and
> > then returns failure to vect_analyze_loop_2.
> >
> > zip.test.c:15:12: note:   ==> examining pattern def statement: patt_42 = (short unsigned int) MIN_EXPR <...>;
> > zip.test.c:15:12: note:   ==> examining statement: patt_42 = (short unsigned int) MIN_EXPR <...>;
> > zip.test.c:15:12: note:   vect_is_simple_use: operand MIN_EXPR <..., b_12(D)>, type of def: unknown
> > zip.test.c:15:12: missed:   Unsupported pattern.
> > zip.test.c:15:12: missed:   use not simple.
> > zip.test.c:15:12: note:   vect_is_simple_use: operand MIN_EXPR <..., b_12(D)>, type of def: unknown
> > zip.test.c:15:12: missed:   Unsupported pattern.
> > zip.test.c:15:12: missed:   use not simple.
> > zip.test.c:15:12: note:   vect_is_simple_use: operand MIN_EXPR <..., b_12(D)>, type of def: unknown
> > zip.test.c:15:12: missed:   Unsupported pattern.
> > zip.test.c:15:12: missed:   use not simple.
> > zip.test.c:7:6: missed:   not vectorized: relevant stmt not supported: patt_42 = (short unsigned int) MIN_EXPR <...>;
> > zip.test.c:15:12: missed:  bad operation or unsupported loop bound.
> >
> > I tried to take COND_EXPR here instead of MIN_EXPR but got almost the same
> > behavior.  I am not sure if we can unblock this in vectorizable_conversion
> > or whether we need some improvements in another pass.
> 
> I think you're doing the MIN_EXPR wrong - the above says MIN_EXPR
>  which doesn't make
> sense anyway.  I suspect you fail to put the MIN_EXPR to a separate statement?
> 

Aye, you need to emit the additional statements through  append_pattern_def_seq,
This is also because the scalar statement doesn’t require them, so it makes 
costing easier.

The vectorizer expects arguments to be simple use, so compound statements aren't
Supported as they make costing and codegen harder.

Cheers,
Tamar

> > Thanks a lot.
> >
> > Pan
> >
> > -Original Message-
> > From: Li, Pan2
> > Sent: Thursday, June 27, 2024 2:14 PM
> > To: Richard Biener 
> > Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
> jeffreya...@gmail.com; rdapp@gmail.com
> > Subject: RE: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip
> >
> > > OK
> >
> > Committed, thanks Richard.
> >
> > Pan
> >
> > -Original Message-
> > From: Richard Biener 
> > Sent: Thursday, June 27, 2024 2:04 PM
> > To: Li, Pan2 
> > Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
> jeffreya...@gmail.com; rdapp@gmail.com
> > Subject: Re: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip
> >
> > On Thu, Jun 27, 2024 at 3:31 AM  wrote:
> > >
> > > From: Pan Li 
> >
> > OK
> >
> > > The zip benchmark of coremark-pro have one SAT_SUB like pattern but
> > > truncated as below:
> > >
> > > void test (uint16_t *x, unsigned b, unsigned n)
> > > {
> > >   unsigned a = 0;
> > 

Re: [PATCHv2 2/2] libiberty/buildargv: handle input consisting of only white space

2024-06-28 Thread Andrew Burgess


Hi,

Am I OK to push these patches given the testing went OK?  I'm thinking
probably, but I don't want to overstep.

Thanks,
Andrew


Andrew Burgess  writes:

> Jeff Law  writes:
>
>> On 2/10/24 10:26 AM, Andrew Burgess wrote:
>>> GDB makes use of the libiberty function buildargv for splitting the
>>> inferior (program being debugged) argument string in the case where
>>> the inferior is not being started under a shell.
>>> 
>>> I have recently been working to improve this area of GDB, and noticed
>>> some unexpected behaviour to the libiberty function buildargv, when
>>> the input is a string consisting only of white space.
>>> 
>>> What I observe is that if the input to buildargv is a string
>>> containing only white space, then buildargv will return an argv list
>>> containing a single empty argument, e.g.:
>>> 
>>>char **argv = buildargv (" ");
>>>assert (*argv[0] == '\0');
>>>assert (argv[1] == NULL);
>>> 
>>> We get the same output from buildargv if the input is a single space,
>>> or multiple spaces.  Other white space characters give the same
>>> results.
>>> 
>>> This doesn't seem right to me, and in fact, there appears to be a work
>>> around for this issue in expandargv where we have this code:
>>> 
>>>/* If the file is empty or contains only whitespace, buildargv would
>>>   return a single empty argument.  In this context we want no arguments,
>>>   instead.  */
>>>if (only_whitespace (buffer))
>>>  {
>>>file_argv = (char **) xmalloc (sizeof (char *));
>>>file_argv[0] = NULL;
>>>  }
>>>else
>>>  /* Parse the string.  */
>>>  file_argv = buildargv (buffer);
>>> 
>>> I think that the correct behaviour in this situation is to return an
>>> empty argv array, e.g.:
>>> 
>>>char **argv = buildargv (" ");
>>>assert (argv[0] == NULL);
>>> 
>>> And it turns out that this is a trivial change to buildargv.  The diff
>>> does look big, but this is because I've re-indented a block.  Check
>>> with 'git diff -b' to see the minimal changes.  I've also removed the
>>> work around from expandargv.
>>> 
>>> When testing this sort of thing I normally write the tests first, and
>>> then fix the code.  In this case test-expandargv.c has sort-of been
>>> used as a mechanism for testing the buildargv function (expandargv
>>> does call buildargv most of the time), however, for this particular
>>> issue the work around in expandargv (mentioned above) masked the
>>> buildargv bug.
>>> 
>>> I did consider adding a new test-buildargv.c file, however, this would
>>> have basically been a copy & paste of test-expandargv.c (with some
>>> minor changes to call buildargv).  This would be fine now, but feels
>>> like we would eventually end up with one file not being updated as
>>> much as the other, and so test coverage would suffer.
>>> 
>>> Instead, I have added some explicit buildargv testing to the
>>> test-expandargv.c file, this reuses the test input that is already
>>> defined for expandargv.
>>> 
>>> Of course, once I removed the work around from expandargv then we now
>>> do always call buildargv from expandargv, and so the bug I'm fixing
>>> would impact both expandargv and buildargv, so maybe the new testing
>>> is redundant?  I tend to think more testing is always better, so I've
>>> left it in for now.
>> So just an FYI.  Sometimes folks include the -b diffs as well for these 
>> scenarios.  THe problem with doing so (as I recently stumbled over 
>> myself) is the bots which monitor the list and apply patches get quite 
>> confused by that practice.  Anyway, just something to be aware of.
>>
>> As for testing, I tend to agree, more is better unless we're highly 
>> confident its redundant.  So I'll go with your judgment on 
>> redundant-ness of the test.
>>
>> As with the prior patch, you'll need to run it through the usual 
>> bootstrap/regression cycle and cobble together a ChangeLog.
>>
>> OK once those things are taken care of.
>
> Jeff,
>
> Thanks for looking these patches over.
>
> For testing, using current(ish) gcc HEAD, on x86-64 GNU/Linux, I:
>
>   ../src/configure --prefix=$(cd .. && pwd)/install
>   make
>   make check
>
> I did this with / without my patch and then:
>
>   find . -name "*.sum"
>   ... compare all .sum files ...
>
> There was no change in any of the .sum files.
>
>   1. Am I correct that this will have run the bootstrap test by default?
>
>   2. Is there any other testing I should be doing?
>
>   3. If not, am I OK to apply both patches in this series?
>
> Thanks,
> Andrew



RE: [PATCH v1] Match: Support imm form for unsigned scalar .SAT_ADD

2024-06-28 Thread Li, Pan2
> OK with those changes.

Thanks Richard for comments, will make the changes and commit if no surprise 
from test suites.

Pan

-Original Message-
From: Richard Biener  
Sent: Friday, June 28, 2024 9:12 PM
To: Li, Pan2 
Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
jeffreya...@gmail.com; rdapp@gmail.com
Subject: Re: [PATCH v1] Match: Support imm form for unsigned scalar .SAT_ADD

On Fri, Jun 28, 2024 at 5:44 AM  wrote:
>
> From: Pan Li 
>
> This patch would like to support the form of unsigned scalar .SAT_ADD
> when one of the op is IMM.  For example as below:
>
> Form IMM:
>   #define DEF_SAT_U_ADD_IMM_FMT_1(T)   \
>   T __attribute__((noinline))  \
>   sat_u_add_imm_##T##_fmt_1 (T x)  \
>   {\
> return (T)(x + 9) >= x ? (x + 9) : -1; \
>   }
>
> DEF_SAT_U_ADD_IMM_FMT_1(uint64_t)
>
> Before this patch:
> __attribute__((noinline))
> uint64_t sat_u_add_imm_uint64_t_fmt_1 (uint64_t x)
> {
>   long unsigned int _1;
>   uint64_t _3;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _1 = MIN_EXPR <x_2(D), 18446744073709551606>;
>   _3 = _1 + 9;
>   return _3;
> ;;succ:   EXIT
>
> }
>
> After this patch:
> __attribute__((noinline))
> uint64_t sat_u_add_imm_uint64_t_fmt_1 (uint64_t x)
> {
>   uint64_t _3;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _3 = .SAT_ADD (x_2(D), 9); [tail call]
>   return _3;
> ;;succ:   EXIT
>
> }
>
> The below test suites are passed for this patch:
> 1. The rv64gcv fully regression test with newlib.
> 2. The x86 bootstrap test.
> 3. The x86 fully regression test.
>
> gcc/ChangeLog:
>
> * match.pd: Add imm form for .SAT_ADD matching.
> * tree-ssa-math-opts.cc (math_opts_dom_walker::after_dom_children):
> Add .SAT_ADD matching under PLUS_EXPR.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd  | 22 ++
>  gcc/tree-ssa-math-opts.cc |  2 ++
>  2 files changed, 24 insertions(+)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3fa3f2e8296..d738c7ee9b4 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3154,6 +3154,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (match (unsigned_integer_sat_add @0 @1)
>   (cond^ (gt @0 (usadd_left_part_1@2 @0 @1)) integer_minus_onep @2))
>
> +/* Unsigned saturation add, case 9 (one op is imm):
> +   SAT_U_ADD = (X + 3) >= x ? (X + 3) : -1.  */
> +(match (unsigned_integer_sat_add @0 @1)
> + (plus:c (min @0 INTEGER_CST@2) INTEGER_CST@1)

No :c necessary on the plus.

> + (with {
> +   unsigned precision = TYPE_PRECISION (type);
> +   wide_int cst_1 = wi::to_wide (@1, precision);
> +   wide_int cst_2 = wi::to_wide (@2, precision);

Just use wi::to_wide (@1/@2);

> +   wide_int max = wi::mask (precision, false, precision);
> +   wide_int sum = wi::add (cst_1, cst_2);
> +  }
> +  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && types_match (type, @0, @1) && wi::eq_p (max, sum)

Can you refactor to put the non-max/sum tests before the (with {...}?

> +
> +/* Unsigned saturation add, case 10 (one op is imm):
> +   SAT_U_ADD = __builtin_add_overflow (X, 3, &ret) == 0 ? ret : -1.  */
> +(match (unsigned_integer_sat_add @0 @1)
> + (cond^ (ne (imagpart (IFN_ADD_OVERFLOW:c@2 @0 INTEGER_CST@1)) integer_zerop)

No need for :c on the IFN_ADD_OVERFLOW.

OK with those changes.

Richard.

> +  integer_minus_onep (realpart @2))
> +  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && types_match (type, @0
> +
>  /* Unsigned saturation sub, case 1 (branch with gt):
> SAT_U_SUB = X > Y ? X - Y : 0  */
>  (match (unsigned_integer_sat_sub @0 @1)
> diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
> index 3783a874699..3b5433ec000 100644
> --- a/gcc/tree-ssa-math-opts.cc
> +++ b/gcc/tree-ssa-math-opts.cc
> @@ -6195,6 +6195,8 @@ math_opts_dom_walker::after_dom_children (basic_block 
> bb)
>   break;
>
> case PLUS_EXPR:
> + match_unsigned_saturation_add (&gsi, as_a <gassign *> (stmt));
> + /* fall-through  */
> case MINUS_EXPR:
>   if (!convert_plusminus_to_widen (, stmt, code))
> {
> --
> 2.34.1
>


[PATCH] RISC-V: Handle NULL stmt in SLP_TREE_SCALAR_STMTS

2024-06-28 Thread Richard Biener
The following starts to handle NULL elements in SLP_TREE_SCALAR_STMTS
with the first candidate being the two-operator nodes where some
lanes are do-not-care and also do not have a scalar stmt computing
the result.  I've so far whack-a-moled the vect.exp testsuite.

I do plan to use NULL elements for loads from groups with gaps,
which we currently get around by using a load permutation.

I want to separate changing places where I have coverage from those
I do not.  So this is for the CI, and I'll follow up with patching up
all remaining iterations over SLP_TREE_SCALAR_STMTS.
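
The consumer-side pattern is then simply (illustrative only; it matches
the hunks below):

  stmt_vec_info stmt_info;
  unsigned i;
  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
    if (stmt_info)
      /* ... process the lane; NULL lanes are do-not-care ...  */;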

* tree-vect-slp.cc (vect_build_slp_tree_2): Make two-operator
nodes have SLP_TREE_SCALAR_STMTS with do-not-care lanes NULL.
(bst_traits::hash): Handle NULL elements in SLP_TREE_SCALAR_STMTS.
(vect_print_slp_tree): Likewise.
(vect_mark_slp_stmts): Likewise.
(vect_mark_slp_stmts_relevant): Likewise.
(vect_find_last_scalar_stmt_in_slp): Likewise.
(vect_bb_slp_mark_live_stmts): Likewise.
(vect_slp_prune_covered_roots): Likewise.
(vect_bb_partition_graph_r): Likewise.
(vect_remove_slp_scalar_calls): Likewise.
* tree-vect-stmts.cc (can_vectorize_live_stmts): Likewise.
---
 gcc/tree-vect-slp.cc   | 56 --
 gcc/tree-vect-stmts.cc | 11 +
 2 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index dd9017e5b3a..47664158e57 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1571,7 +1571,7 @@ bst_traits::hash (value_type x)
 {
   inchash::hash h;
   for (unsigned i = 0; i < x.length (); ++i)
-h.add_int (gimple_uid (x[i]->stmt));
+h.add_int (x[i] ? gimple_uid (x[i]->stmt) : -1);
   return h.end ();
 }
 inline bool
@@ -2702,6 +2702,8 @@ fail:
   SLP_TREE_VECTYPE (two) = vectype;
   SLP_TREE_CHILDREN (one).safe_splice (children);
   SLP_TREE_CHILDREN (two).safe_splice (children);
+  SLP_TREE_SCALAR_STMTS (one).create (stmts.length ());
+  SLP_TREE_SCALAR_STMTS (two).create (stmts.length ());
   slp_tree child;
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child)
SLP_TREE_REF_COUNT (child)++;
@@ -2726,9 +2728,15 @@ fail:
  SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (1, 
i));
  ocode = gimple_assign_rhs_code (ostmt);
  j = i;
+ SLP_TREE_SCALAR_STMTS (one).quick_push (NULL);
+ SLP_TREE_SCALAR_STMTS (two).quick_push (stmts[i]);
}
  else
-   SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (0, i));
+   {
+ SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (0, 
i));
+ SLP_TREE_SCALAR_STMTS (one).quick_push (stmts[i]);
+ SLP_TREE_SCALAR_STMTS (two).quick_push (NULL);
+   }
}
   SLP_TREE_CODE (one) = code0;
   SLP_TREE_CODE (two) = ocode;
@@ -2781,9 +2789,12 @@ vect_print_slp_tree (dump_flags_t dump_kind, 
dump_location_t loc,
 }
   if (SLP_TREE_SCALAR_STMTS (node).exists ())
 FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
-  dump_printf_loc (metadata, user_loc, "\t%sstmt %u %G",
-  STMT_VINFO_LIVE_P (stmt_info) ? "[l] " : "",
-  i, stmt_info->stmt);
+  if (stmt_info)
+   dump_printf_loc (metadata, user_loc, "\t%sstmt %u %G",
+STMT_VINFO_LIVE_P (stmt_info) ? "[l] " : "",
+i, stmt_info->stmt);
+  else
+   dump_printf_loc (metadata, user_loc, "\tstmt %u ---\n", i);
   else
 {
   dump_printf_loc (metadata, user_loc, "\t{ ");
@@ -2924,7 +2935,8 @@ vect_mark_slp_stmts (slp_tree node, hash_set<slp_tree> &visited)
 return;
 
   FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
-STMT_SLP_TYPE (stmt_info) = pure_slp;
+if (stmt_info)
+  STMT_SLP_TYPE (stmt_info) = pure_slp;
 
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
 if (child)
@@ -2954,11 +2966,12 @@ vect_mark_slp_stmts_relevant (slp_tree node, hash_set<slp_tree> &visited)
 return;
 
   FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
-{
-  gcc_assert (!STMT_VINFO_RELEVANT (stmt_info)
-  || STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope);
-  STMT_VINFO_RELEVANT (stmt_info) = vect_used_in_scope;
-}
+if (stmt_info)
+  {
+   gcc_assert (!STMT_VINFO_RELEVANT (stmt_info)
+   || STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope);
+   STMT_VINFO_RELEVANT (stmt_info) = vect_used_in_scope;
+  }
 
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
 if (child)
@@ -3009,10 +3022,11 @@ vect_find_last_scalar_stmt_in_slp (slp_tree node)
   stmt_vec_info stmt_vinfo;
 
   for (int i = 0; SLP_TREE_SCALAR_STMTS (node).iterate (i, &stmt_vinfo); i++)
-{
-  stmt_vinfo = vect_orig_stmt (stmt_vinfo);
-  last = last ? get_later_stmt 

[PATCH][v2] RISC-V: Harden SLP reduction support wrt STMT_VINFO_REDUC_IDX

2024-06-28 Thread Richard Biener
The following makes sure that for SLP reductions all lanes have
the same STMT_VINFO_REDUC_IDX.  Once we move that info and can adjust
it we can implement swapping.  It also makes the existing protection
against operand swapping trigger for all stmts participating in a
reduction, not just the final one marked as reduction-def.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

The first version had a thinko, and the two remaining FAILs were
because of SLP reduction chains, where a uniform reduc_idx isn't
really relevant.  I'll see what the CI says about this.

Richard.

* tree-vect-slp.cc (vect_build_slp_tree_1): Compare
STMT_VINFO_REDUC_IDX.
(vect_build_slp_tree_2): Prevent operand swapping for
all stmts participating in a reduction.
---
 gcc/tree-vect-slp.cc | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index cb8604eb611..a02795e9d8f 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1072,6 +1072,7 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char 
*swap,
   stmt_vec_info first_load = NULL, prev_first_load = NULL;
   bool first_stmt_ldst_p = false, ldst_p = false;
   bool first_stmt_phi_p = false, phi_p = false;
+  int first_reduc_idx = -1;
   bool maybe_soft_fail = false;
   tree soft_fail_nunits_vectype = NULL_TREE;
 
@@ -1204,6 +1205,7 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char 
*swap,
  first_stmt_code = rhs_code;
  first_stmt_ldst_p = ldst_p;
  first_stmt_phi_p = phi_p;
+ first_reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
 
  /* Shift arguments should be equal in all the packed stmts for a
 vector shift with scalar shift operand.  */
@@ -1267,6 +1269,24 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char 
*swap,
}
   else
{
+ if (first_reduc_idx != STMT_VINFO_REDUC_IDX (stmt_info)
+ /* For SLP reduction groups the index isn't necessarily
+uniform but only that of the first stmt matters.  */
+ && !(first_reduc_idx != -1
+  && STMT_VINFO_REDUC_IDX (stmt_info) != -1
+  && REDUC_GROUP_FIRST_ELEMENT (stmt_info)))
+   {
+ if (dump_enabled_p ())
+   {
+ dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+  "Build SLP failed: different reduc_idx "
+  "%d instead of %d in %G",
+  STMT_VINFO_REDUC_IDX (stmt_info),
+  first_reduc_idx, stmt);
+   }
+ /* Mismatch.  */
+ continue;
+   }
  if (first_stmt_code != rhs_code
  && alt_stmt_code == ERROR_MARK)
alt_stmt_code = rhs_code;
@@ -2535,8 +2555,7 @@ out:
  && oprnds_info[1]->first_dt == vect_internal_def
  && is_gimple_assign (stmt_info->stmt)
  /* Swapping operands for reductions breaks assumptions later on.  */
- && STMT_VINFO_DEF_TYPE (stmt_info) != vect_reduction_def
- && STMT_VINFO_DEF_TYPE (stmt_info) != vect_double_reduction_def)
+ && STMT_VINFO_REDUC_IDX (stmt_info) == -1)
{
  /* See whether we can swap the matching or the non-matching
 stmt operands.  */
-- 
2.43.0


Re: nvptx vs. [PATCH] Add a late-combine pass [PR106594]

2024-06-28 Thread Richard Sandiford
Thomas Schwinge  writes:
> Hi!
>
> On 2024-06-27T23:20:18+0200, I wrote:
>> On 2024-06-27T22:27:21+0200, I wrote:
>>> On 2024-06-27T18:49:17+0200, I wrote:
 On 2023-10-24T19:49:10+0100, Richard Sandiford  
 wrote:
> This patch adds a combine pass that runs late in the pipeline.
>>>
>>> [After sending, I realized I replied to a previous thread of this work.]
>>>
 I've beek looking a bit through recent nvptx target code generation
 changes for GCC target libraries, and thought I'd also share here my
 findings for the "late-combine" changes in isolation, for nvptx target.
 
 First the unexpected thing:
>>>
>>> So much for "unexpected thing" -- next level of unexpected here...
>>> Appreciated if anyone feels like helping me find my way through this, but
>>> I totally understand if you've got other things to do.
>>
>> OK, I found something already.  (Unexpectedly quickly...)  ;-)
>>
 there are a few cases where we now see unused
 registers get declared
>
>> But in fact, for both cases
>
> Now tested: 's%both%all'.  :-)
>
>> the unexpected difference goes away if after
>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run
>> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's
>> all not active for nvptx target given '!reload_completed', given nvptx is
>> 'targetm.no_register_allocation'.  Maybe we need to enable a few more
>> passes, or is there anything in 'pass_late_combine' to change, so that we
>> don't run into this?  Does it inadvertently mark registers live or
>> something like that?
>
> Basically, is 'pass_late_combine' potentionally doing things that depend
> on later clean-up?  (..., or shouldn't it be doing these things in the
> first place?)

It's possible that late-combine could expose dead code, but I imagine
it's a niche case.

I had a look at the nvptx logs from my comparison, and the cases in
which I saw this seemed to be those where late-combine doesn't find
anything to do.  Does that match your examples?  Specifically,
the effect should be the same with -fdbg-cnt=late_combine:0-0

I think what's happening is that:

- combine exposes dead code

- ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared
  up the dead code

- late-combine instead runs df_analyze without that flag (since late-combine
  itself doesn't really care whether dead code is present)

- if late-combine doesn't do anything, ce2's df_analyze call has nothing
  to do, and skips even the DCE

The easiest fix would be to add:

  df_set_flags (DF_LR_RUN_DCE);

before df_analyze in late-combine.cc, so that it behaves like ce2.
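
In context that would be something like this (a sketch; I have not
checked the surrounding late-combine.cc code):

  /* Match ce2: let the LR recomputation also run fast DCE, so that dead
     code exposed by combine is cleaned up even when late-combine itself
     finds nothing to do.  */
  df_set_flags (DF_LR_RUN_DCE);
  df_analyze ();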
But the arrangement feels wrong.  I would have expected DF_LR_RUN_DCE
to depend on whether df_analyze had been called since the last DCE pass
(whether DF_LR_RUN_DCE or a full DCE).

Thanks,
Richard


RE: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip

2024-06-28 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Friday, June 28, 2024 6:39 AM
> To: Li, Pan2 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
> jeffreya...@gmail.com; rdapp@gmail.com; Tamar Christina
> 
> Subject: Re: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip
> 
> On Thu, Jun 27, 2024 at 4:45 PM Li, Pan2  wrote:
> >
> > Hi Richard,
> >
> > As mentioned by tamar in previous, would like to try even more optimization
> based on this patch.
> > Assume we take zip benchmark as example, we may have gimple similar as below
> >
> > unsigned int _1, _2;
> > unsigned short int _9;
> >
> > _9 = (unsigned short int).SAT_SUB (_1, _2);
> >
> > If we can determine that _1 is in the range of unsigned short, we can
> > distribute the conversion into the .SAT_SUB, aka:
> >
> > From:
> > _1 = (unsigned short int)_other;
> > _9 = (unsigned short int).SAT_SUB (_1, _2);
> >
> > To:
> > _9 = .SAT_SUB ((unsigned short int)_1, (unsigned short int)MIN_EXPR (_2, 65535));
> >
> > Unfortunately, it failed to vectorize when I tried to perform the above
> > changes.  The vectorizable_conversion considers it not a simple use and
> > then returns failure to vect_analyze_loop_2.
> >
> > zip.test.c:15:12: note:   ==> examining pattern def statement: patt_42 = (short unsigned int) MIN_EXPR <...>;
> > zip.test.c:15:12: note:   ==> examining statement: patt_42 = (short unsigned int) MIN_EXPR <...>;
> > zip.test.c:15:12: note:   vect_is_simple_use: operand MIN_EXPR <..., b_12(D)>, type of def: unknown
> > zip.test.c:15:12: missed:   Unsupported pattern.
> > zip.test.c:15:12: missed:   use not simple.
> > zip.test.c:15:12: note:   vect_is_simple_use: operand MIN_EXPR <..., b_12(D)>, type of def: unknown
> > zip.test.c:15:12: missed:   Unsupported pattern.
> > zip.test.c:15:12: missed:   use not simple.
> > zip.test.c:15:12: note:   vect_is_simple_use: operand MIN_EXPR <..., b_12(D)>, type of def: unknown
> > zip.test.c:15:12: missed:   Unsupported pattern.
> > zip.test.c:15:12: missed:   use not simple.
> > zip.test.c:7:6: missed:   not vectorized: relevant stmt not supported: patt_42 = (short unsigned int) MIN_EXPR <...>;
> > zip.test.c:15:12: missed:  bad operation or unsupported loop bound.
> >
> > I tried to take COND_EXPR here instead of MIN_EXPR but got almost the same
> > behavior.  I am not sure if we can unblock this in vectorizable_conversion
> > or whether we need some improvements in another pass.
> 
> I think you're doing the MIN_EXPR wrong - the above says MIN_EXPR
>  which doesn't make
> sense anyway.  I suspect you fail to put the MIN_EXPR to a separate statement?
> 

Aye, you need to emit the additional statements through  append_pattern_def_seq,
This is also because the scalar statement doesn’t require them, so it makes 
costing easier.

The vectorizer expects arguments to be simple use, so compound statements aren't
Supported as they make costing and codegen harder.

Cheers,
Tamar

> > Thanks a lot.
> >
> > Pan
> >
> > -Original Message-
> > From: Li, Pan2
> > Sent: Thursday, June 27, 2024 2:14 PM
> > To: Richard Biener 
> > Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
> jeffreya...@gmail.com; rdapp@gmail.com
> > Subject: RE: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip
> >
> > > OK
> >
> > Committed, thanks Richard.
> >
> > Pan
> >
> > -Original Message-
> > From: Richard Biener 
> > Sent: Thursday, June 27, 2024 2:04 PM
> > To: Li, Pan2 
> > Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
> jeffreya...@gmail.com; rdapp@gmail.com
> > Subject: Re: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip
> >
> > On Thu, Jun 27, 2024 at 3:31 AM  wrote:
> > >
> > > From: Pan Li 
> >
> > OK
> >
> > > The zip benchmark of coremark-pro have one SAT_SUB like pattern but
> > > truncated as below:
> > >
> > > void test (uint16_t *x, unsigned b, unsigned n)
> > > {
> > >   unsigned a = 0;
> > >   register uint16_t *p = x;
> > >
> > >   do {
> > > a = *--p;
> > > *p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
> > >   } while (--n);
> > > }
> > >
> > > It will have gimple before vect pass,  it cannot hit any pattern of
> > > SAT_SUB and then cannot vectorize to SAT_SUB.
> > >
> > > _2 = a_11 - b_12(D);
> > > iftmp.0_13 = (short unsigned int) _2;
> > > _18 = a_11 >= b_12(D);
> > > iftmp.0_5 = _18 ? iftmp.0_13 : 0;
> > >
> > > This patch would like to improve the pattern match to recog above
> > > as truncate after .SAT_SUB pattern.  Then we will have the pattern
> > > similar to below,  as well as eliminate the first 3 dead stmt.
> > >
> > > _2 = a_11 - b_12(D);
> > > iftmp.0_13 = (short unsigned int) _2;
> > > _18 = a_11 >= b_12(D);
> > > iftmp.0_5 = (short unsigned int).SAT_SUB (a_11, b_12(D));
> > >
> > > The below tests are passed for this patch.
> > > 1. The rv64gcv fully regression tests.
> > > 2. The rv64gcv build with glibc.
> > > 3. The x86 bootstrap tests.
> > > 

Re: [PATCH] Use move-aware auto_vec in map

2024-06-28 Thread Jørgen Kvalsvik

On 6/28/24 13:55, Richard Biener wrote:

On Fri, Jun 28, 2024 at 8:43 AM Jørgen Kvalsvik  wrote:


Using auto_vec rather than vec means the vectors are released
automatically upon return, stopping the leak. The problem is that
auto_vec with embedded storage is not really move-aware; only the
auto_vec<T, 0> specialization is.
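
To illustrate the distinction (a hedged sketch, not code from the patch):

  auto_vec<basic_block, 32> a;  /* embedded storage: no cheap move; the
                                   inline elements would have to be copied */
  auto_vec<basic_block> b;      /* the <T, 0> specialization is heap-backed
                                   and can steal the pointer on move, so it
                                   is safe as a hash_map value */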


Indeed.


This is actually Jan's original suggestion
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655600.html which I
improvised on by also using embedded storage. I think it should fix this
regression:
https://gcc.gnu.org/pipermail/gcc-regression/2024-June/080152.html

I could not reproduce it on x86-64 linux, so if someone could help me
test it on aarch64 that would be much appreciated.


OK.


Pushed. Sorry for the noise.

Thanks,
Jørgen




--

gcc/ChangeLog:

 * tree-profile.cc (find_conditions): Use auto_vec without
   embedded storage.
---
  gcc/tree-profile.cc | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
index 8c9945847ca..153c9323040 100644
--- a/gcc/tree-profile.cc
+++ b/gcc/tree-profile.cc
@@ -876,7 +876,7 @@ find_conditions (struct function *fn)
  make_top_index (fnblocks, ctx.B1, ctx.top_index);

  /* Bin the Boolean expressions so that exprs[id] -> [x1, x2, ...].  */
-hash_map<..., auto_vec<basic_block, ...>> exprs;
+hash_map<..., auto_vec<basic_block>> exprs;
  for (basic_block b : fnblocks)
  {
 const unsigned uid = condition_uid (fn, b);
--
2.39.2





Re: [PATCH v1] Match: Support imm form for unsigned scalar .SAT_ADD

2024-06-28 Thread Richard Biener
On Fri, Jun 28, 2024 at 5:44 AM  wrote:
>
> From: Pan Li 
>
> This patch would like to support the form of unsigned scalar .SAT_ADD
> when one of the op is IMM.  For example as below:
>
> Form IMM:
>   #define DEF_SAT_U_ADD_IMM_FMT_1(T)   \
>   T __attribute__((noinline))  \
>   sat_u_add_imm_##T##_fmt_1 (T x)  \
>   {\
> return (T)(x + 9) >= x ? (x + 9) : -1; \
>   }
>
> DEF_SAT_U_ADD_IMM_FMT_1(uint64_t)
>
> Before this patch:
> __attribute__((noinline))
> uint64_t sat_u_add_imm_uint64_t_fmt_1 (uint64_t x)
> {
>   long unsigned int _1;
>   uint64_t _3;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _1 = MIN_EXPR <x_2(D), 18446744073709551606>;
>   _3 = _1 + 9;
>   return _3;
> ;;succ:   EXIT
>
> }
>
> After this patch:
> __attribute__((noinline))
> uint64_t sat_u_add_imm_uint64_t_fmt_1 (uint64_t x)
> {
>   uint64_t _3;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _3 = .SAT_ADD (x_2(D), 9); [tail call]
>   return _3;
> ;;succ:   EXIT
>
> }
>
> The below test suites are passed for this patch:
> 1. The rv64gcv fully regression test with newlib.
> 2. The x86 bootstrap test.
> 3. The x86 fully regression test.
>
> gcc/ChangeLog:
>
> * match.pd: Add imm form for .SAT_ADD matching.
> * tree-ssa-math-opts.cc (math_opts_dom_walker::after_dom_children):
> Add .SAT_ADD matching under PLUS_EXPR.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd  | 22 ++
>  gcc/tree-ssa-math-opts.cc |  2 ++
>  2 files changed, 24 insertions(+)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3fa3f2e8296..d738c7ee9b4 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3154,6 +3154,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (match (unsigned_integer_sat_add @0 @1)
>   (cond^ (gt @0 (usadd_left_part_1@2 @0 @1)) integer_minus_onep @2))
>
> +/* Unsigned saturation add, case 9 (one op is imm):
> +   SAT_U_ADD = (X + 3) >= x ? (X + 3) : -1.  */
> +(match (unsigned_integer_sat_add @0 @1)
> + (plus:c (min @0 INTEGER_CST@2) INTEGER_CST@1)

No :c necessary on the plus.

> + (with {
> +   unsigned precision = TYPE_PRECISION (type);
> +   wide_int cst_1 = wi::to_wide (@1, precision);
> +   wide_int cst_2 = wi::to_wide (@2, precision);

Just use wi::to_wide (@1/@2);

> +   wide_int max = wi::mask (precision, false, precision);
> +   wide_int sum = wi::add (cst_1, cst_2);
> +  }
> +  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && types_match (type, @0, @1) && wi::eq_p (max, sum)

Can you refactor to put the non-max/sum tests before the (with {...}?
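
I.e. roughly this shape (an untested sketch, with the wi::to_wide
simplification above folded in):

  (match (unsigned_integer_sat_add @0 @1)
   (plus (min @0 INTEGER_CST@2) INTEGER_CST@1)
   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
        && types_match (type, @0, @1))
    (with
     {
       unsigned precision = TYPE_PRECISION (type);
       wide_int max = wi::mask (precision, false, precision);
       wide_int sum = wi::add (wi::to_wide (@1), wi::to_wide (@2));
     }
     (if (wi::eq_p (max, sum))))))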

> +
> +/* Unsigned saturation add, case 10 (one op is imm):
> +   SAT_U_ADD = __builtin_add_overflow (X, 3, &ret) == 0 ? ret : -1.  */
> +(match (unsigned_integer_sat_add @0 @1)
> + (cond^ (ne (imagpart (IFN_ADD_OVERFLOW:c@2 @0 INTEGER_CST@1)) integer_zerop)

No need for :c on the IFN_ADD_OVERFLOW.

OK with those changes.

Richard.

> +  integer_minus_onep (realpart @2))
> +  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && types_match (type, @0
> +
>  /* Unsigned saturation sub, case 1 (branch with gt):
> SAT_U_SUB = X > Y ? X - Y : 0  */
>  (match (unsigned_integer_sat_sub @0 @1)
> diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
> index 3783a874699..3b5433ec000 100644
> --- a/gcc/tree-ssa-math-opts.cc
> +++ b/gcc/tree-ssa-math-opts.cc
> @@ -6195,6 +6195,8 @@ math_opts_dom_walker::after_dom_children (basic_block 
> bb)
>   break;
>
> case PLUS_EXPR:
> + match_unsigned_saturation_add (&gsi, as_a <gassign *> (stmt));
> + /* fall-through  */
> case MINUS_EXPR:
>   if (!convert_plusminus_to_widen (, stmt, code))
> {
> --
> 2.34.1
>


Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-06-28 Thread Richard Biener
On Wed, Jun 26, 2024 at 4:50 PM Feng Xue OS  wrote:
>
> Updated the patch.
>
> For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction, the
> current vectorizer can only handle the pattern if the reduction chain contains
> no other operation, whether normal or lane-reducing.
>
> Actually, to allow multiple arbitrary lane-reducing operations, we need to
> support vectorization of a loop reduction chain with mixed input vectypes. Since
> the number of lanes in a vectype may vary with the operation, the effective
> ncopies of the vectorized statements may also differ between operations, which
> causes a mismatch in the vectorized def-use cycles. A simple way is to align all
> operations with the one that has the most ncopies; the gap can be filled by
> generating extra trivial pass-through copies. For example:
>
>int sum = 0;
>for (i)
>  {
>sum += d0[i] * d1[i];  // dot-prod 
>sum += w[i];   // widen-sum 
>sum += abs(s0[i] - s1[i]); // sad 
>sum += n[i];   // normal 
>  }
>
> The vector size is 128-bit and the vectorization factor is 16.  Reduction
> statements would be transformed as:
>
>vector<4> int sum_v0 = { 0, 0, 0, 0 };
>vector<4> int sum_v1 = { 0, 0, 0, 0 };
>vector<4> int sum_v2 = { 0, 0, 0, 0 };
>vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>for (i / 16)
>  {
>sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>sum_v1 = sum_v1;  // copy
>sum_v2 = sum_v2;  // copy
>sum_v3 = sum_v3;  // copy
>
>sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
>sum_v1 = sum_v1;  // copy
>sum_v2 = sum_v2;  // copy
>sum_v3 = sum_v3;  // copy
>
>sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>sum_v2 = sum_v2;  // copy
>sum_v3 = sum_v3;  // copy
>
>sum_v0 += n_v0[i: 0  ~ 3 ];
>sum_v1 += n_v1[i: 4  ~ 7 ];
>sum_v2 += n_v2[i: 8  ~ 11];
>sum_v3 += n_v3[i: 12 ~ 15];
>  }
>
> 2024-03-22 Feng Xue 
>
> gcc/
> PR tree-optimization/114440
> * tree-vectorizer.h (vectorizable_lane_reducing): New function
> declaration.
> * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
> vectorizable_lane_reducing to analyze lane-reducing operation.
> * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost 
> computation
> code related to emulated_mixed_dot_prod.
> (vect_reduction_update_partial_vector_usage): Compute ncopies as the
> original means for single-lane slp node.
> (vectorizable_lane_reducing): New function.
> (vectorizable_reduction): Allow multiple lane-reducing operations in
> loop reduction. Move some original lane-reducing related code to
> vectorizable_lane_reducing.
> (vect_transform_reduction): Extend transformation to support reduction
> statements with mixed input vectypes.
>
> gcc/testsuite/
> PR tree-optimization/114440
> * gcc.dg/vect/vect-reduc-chain-1.c
> * gcc.dg/vect/vect-reduc-chain-2.c
> * gcc.dg/vect/vect-reduc-chain-3.c
> * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
> * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
> * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
> * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
> * gcc.dg/vect/vect-reduc-dot-slp-1.c
> ---
>  .../gcc.dg/vect/vect-reduc-chain-1.c  |  62 
>  .../gcc.dg/vect/vect-reduc-chain-2.c  |  77 
>  .../gcc.dg/vect/vect-reduc-chain-3.c  |  66 
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 
>  .../gcc.dg/vect/vect-reduc-dot-slp-1.c|  60 
>  gcc/tree-vect-loop.cc | 333 ++
>  gcc/tree-vect-stmts.cc|   2 +
>  gcc/tree-vectorizer.h |   2 +
>  11 files changed, 836 insertions(+), 70 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c 
> b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
> new file mode 100644
> index 000..04bfc419dbd
> 

Handle 'NUM' in 'PUSH_INSERT_PASSES_WITHIN' (was: [PATCH 03/11] Handwritten part of conversion of passes to C++ classes)

2024-06-28 Thread Thomas Schwinge
Hi!

As part of this:

On 2013-07-26T11:04:33-0400, David Malcolm  wrote:
> This patch is the hand-written part of the conversion of passes from
> C structs to C++ classes.

> --- a/gcc/passes.c
> +++ b/gcc/passes.c

..., we did hard-code 'PUSH_INSERT_PASSES_WITHIN(PASS)' to always refer
to the first instance of 'PASS':

>  #define PUSH_INSERT_PASSES_WITHIN(PASS) \
>{ \
> -struct opt_pass **p = &(PASS).pass.sub;
> +struct opt_pass **p = &(PASS ## _1)->sub;

..., however we did change 'NEXT_PASS(PASS, NUM)' to actually use 'NUM':

> -#define NEXT_PASS(PASS, NUM)  (p = next_pass_1 (p, &((PASS).pass)))
> +#define NEXT_PASS(PASS, NUM) \
> +  do { \
> +gcc_assert (NULL == PASS ## _ ## NUM); \
> +if ((NUM) == 1)  \
> +  PASS ## _1 = make_##PASS (ctxt_);  \
> +else \
> +  {  \
> +gcc_assert (PASS ## _1); \
> +PASS ## _ ## NUM = PASS ## _1->clone (); \
> +  }  \
> +p = next_pass_1 (p, PASS ## _ ## NUM);  \
> +  } while (0)

This was never re-synchronized later on, and is problematic if you try to
do something like this; change:

[...]
NEXT_PASS (pass_postreload);
PUSH_INSERT_PASSES_WITHIN (pass_postreload)
NEXT_PASS (pass_postreload_cse);
[...]
NEXT_PASS (pass_cprop_hardreg);
NEXT_PASS (pass_fast_rtl_dce);
NEXT_PASS (pass_reorder_blocks);
[...]
POP_INSERT_PASSES ()
[...]

... into:

[...]
NEXT_PASS (pass_postreload);
PUSH_INSERT_PASSES_WITHIN (pass_postreload)
NEXT_PASS (pass_postreload_cse);
[...]
NEXT_PASS (pass_cprop_hardreg);
POP_INSERT_PASSES ()
NEXT_PASS (pass_fast_rtl_dce);
NEXT_PASS (pass_postreload);
PUSH_INSERT_PASSES_WITHIN (pass_postreload)
NEXT_PASS (pass_reorder_blocks);
[...]
POP_INSERT_PASSES ()
[...]

That is, interrupt the pass pipeline within 'pass_postreload', in order
to unconditionally run 'pass_fast_rtl_dce' even if not running
'pass_postreload'.  What happens is that the second
'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' overwrites the first
'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' instead of applying to the
second (preceding) 'NEXT_PASS (pass_postreload);'.

(I ran into this in the context of what I tried in

"nvptx vs. [PATCH] Add a late-combine pass [PR106594]"; let's discuss that
specific use case over there, not here.)

OK to address this with the attached
"Handle 'NUM' in 'PUSH_INSERT_PASSES_WITHIN'"?

This depends on

"Rewrite usage comment at the top of 'gcc/passes.def'" to avoid running
into the 'ERROR: Can't locate [...]' that I'm adding, while processing
the 'PUSH_INSERT_PASSES_WITHIN (PASS)' in the usage comment at the top of
'gcc/passes.def', where 'NEXT_PASS (PASS)' only appears later.  ;-)

I've verified this does the expected thing for the main 'gcc/passes.def',
and that 'PUSH_INSERT_PASSES_WITHIN' is not used/not applicable for
'PASSES_EXTRA' ('gcc/config/*/*-passes.def').


Regards
 Thomas


>From e368ccba93f5bbaee882076c80849adb55a68fa0 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Fri, 28 Jun 2024 12:10:12 +0200
Subject: [PATCH] Handle 'NUM' in 'PUSH_INSERT_PASSES_WITHIN'

..., such that also for repeated 'NEXT_PASS', 'PUSH_INSERT_PASSES_WITHIN' for a
given 'PASS', the 'PUSH_INSERT_PASSES_WITHIN' applies to the preceeding
'NEXT_PASS', and not unconditionally applies to the first 'NEXT_PASS'.

	gcc/
	* gen-pass-instances.awk: Handle 'PUSH_INSERT_PASSES_WITHIN'.
	* pass_manager.h (PUSH_INSERT_PASSES_WITHIN): Adjust.
	* passes.cc (PUSH_INSERT_PASSES_WITHIN): Likewise.
---
 gcc/gen-pass-instances.awk | 28 +---
 gcc/pass_manager.h |  2 +-
 gcc/passes.cc  |  6 +++---
 3 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/gcc/gen-pass-instances.awk b/gcc/gen-pass-instances.awk
index 449889663f7..871ac0cdb52 100644
--- a/gcc/gen-pass-instances.awk
+++ b/gcc/gen-pass-instances.awk
@@ -16,7 +16,7 @@
 
 # This Awk script takes passes.def and writes pass-instances.def,
 # counting the instances of each kind of pass, adding an instance number
-# to everywhere that NEXT_PASS is used.
+# to everywhere that NEXT_PASS or PUSH_INSERT_PASSES_WITHIN are used.
 # Also handle INSERT_PASS_AFTER, INSERT_PASS_BEFORE and REPLACE_PASS
 # directives.
 #
@@ -222,9 +222,31 @@ END {
 	  if (with_arg)
 	printf ",%s", with_arg;
 	  printf ")%s\n", postfix;
+
+	  continue;
 	}
-  else
-	print lines[i];
+
+  ret = parse_line(lines[i], "PUSH_INSERT_PASSES_WITHIN");
+  if (ret)
+	{
+	  pass_name = args[1];
+
+	  pass_num = pass_final_counts[pass_name];
+	  if (!pass_num)
+	{
+	  print "ERROR: Can't locate instance of the pass 

Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing

2024-06-28 Thread Richard Biener
On Wed, Jun 26, 2024 at 4:48 PM Feng Xue OS  wrote:
>
> Updated the patches based on comments.
>
> The input vectype of the reduction PHI statement must be determined before
> vect cost computation for the reduction. Since a lane-reducing operation has a
> different input vectype from a normal one, we need to traverse all reduction
> statements to find the input vectype with the least lanes, and set that on
> the PHI statement.

OK

> ---
>  gcc/tree-vect-loop.cc | 79 ++-
>  1 file changed, 56 insertions(+), 23 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 347dac97e49..419f4b08d2b 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>  {
>stmt_vec_info def = loop_vinfo->lookup_def (reduc_def);
>stmt_vec_info vdef = vect_stmt_to_vectorize (def);
> -  if (STMT_VINFO_REDUC_IDX (vdef) == -1)
> +  int reduc_idx = STMT_VINFO_REDUC_IDX (vdef);
> +
> +  if (reduc_idx == -1)
> {
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -7686,10 +7688,57 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>   return false;
> }
> }
> -  else if (!stmt_info)
> -   /* First non-conversion stmt.  */
> -   stmt_info = vdef;
> -  reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)];
> +  else
> +   {
> + /* First non-conversion stmt.  */
> + if (!stmt_info)
> +   stmt_info = vdef;
> +
> + if (lane_reducing_op_p (op.code))
> +   {
> + enum vect_def_type dt;
> + tree vectype_op;
> +
> + /* The last operand of lane-reducing operation is for
> +reduction.  */
> + gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1);
> +
> + if (!vect_is_simple_use (op.ops[0], loop_vinfo, , 
> _op))
> +   return false;
> +
> + tree type_op = TREE_TYPE (op.ops[0]);
> +
> + if (!vectype_op)
> +   {
> + vectype_op = get_vectype_for_scalar_type (loop_vinfo,
> +   type_op);
> + if (!vectype_op)
> +   return false;
> +   }
> +
> + /* For lane-reducing operation vectorizable analysis needs the
> +reduction PHI information */
> + STMT_VINFO_REDUC_DEF (def) = phi_info;
> +
> + /* Each lane-reducing operation has its own input vectype, while
> +reduction PHI will record the input vectype with the least
> +lanes.  */
> + STMT_VINFO_REDUC_VECTYPE_IN (vdef) = vectype_op;
> +
> + /* To accommodate lane-reducing operations of mixed input
> +vectypes, choose input vectype with the least lanes for the
> +reduction PHI statement, which would result in the most
> +ncopies for vectorized reduction results.  */
> + if (!vectype_in
> + || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE 
> (vectype_in)))
> +  < GET_MODE_SIZE (SCALAR_TYPE_MODE (type_op
> +   vectype_in = vectype_op;
> +   }
> + else
> +   vectype_in = STMT_VINFO_VECTYPE (phi_info);
> +   }
> +
> +  reduc_def = op.ops[reduc_idx];
>reduc_chain_length++;
>if (!stmt_info && slp_node)
> slp_for_stmt_info = SLP_TREE_CHILDREN (slp_for_stmt_info)[0];
> @@ -7747,6 +7796,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>
>tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
>STMT_VINFO_REDUC_VECTYPE (reduc_info) = vectype_out;
> +  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
> +
>gimple_match_op op;
>if (!gimple_extract_op (stmt_info->stmt, ))
>  gcc_unreachable ();
> @@ -7831,16 +7882,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>   = get_vectype_for_scalar_type (loop_vinfo,
>  TREE_TYPE (op.ops[i]), slp_op[i]);
>
> -  /* To properly compute ncopies we are interested in the widest
> -non-reduction input type in case we're looking at a widening
> -accumulation that we later handle in vect_transform_reduction.  */
> -  if (lane_reducing
> - && vectype_op[i]
> - && (!vectype_in
> - || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
> - < GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE 
> (vectype_op[i]))
> -   vectype_in = vectype_op[i];
> -
>/* Record how the non-reduction-def value of COND_EXPR is defined.
>  ???  For a chain of multiple CONDs we'd have to match them up all.  
> */
>if (op.code == COND_EXPR && reduc_chain_length == 1)
> @@ -7859,14 +7900,6 @@ 

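To make the "least lanes" rule concrete, a source-level sketch (illustrative
only, not from the patch; function and variable names are made up).  With
128-bit vectors the dot-product term has input vectype vector(16) signed char
while the widening-sum term has input vectype vector(8) short int; the latter
has fewer lanes, so under the patch it is the vectype recorded on the
reduction PHI, giving the most ncopies for the vectorized reduction result:

  #include <stdint.h>

  int32_t
  mixed_reduction (int8_t *a, int8_t *b, int16_t *c, int n)
  {
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
      /* The first term may be recognized as a lane-reducing
         DOT_PROD_EXPR, the second as a lane-reducing WIDEN_SUM_EXPR;
         both feed the same reduction.  */
      sum += (int32_t) a[i] * b[i] + c[i];
    return sum;
  }
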
Re: Rewrite usage comment at the top of 'gcc/passes.def' (was: [PATCH 02/11] Generate pass-instances.def)

2024-06-28 Thread Richard Biener
On Fri, Jun 28, 2024 at 2:14 PM Thomas Schwinge  wrote:
>
> Hi!
>
> On 2013-07-26T11:04:32-0400, David Malcolm  wrote:
> > Introduce a new gen-pass-instances.awk script, and use it at build time
> > to make a pass-instances.def from passes.def.
>
> (The script has later been rewritten and extended, but the issue I'm
> discussing is relevant already in its original version.)
>
> > The generated pass-instances.def contains similar content to passes.def,
> > but the pass instances within it are explicitly numbered, so that e.g.
> > the third instance of:
> >
> >   NEXT_PASS (pass_copy_prop)
> >
> > becomes:
> >
> >   NEXT_PASS (pass_copy_prop, 3)
>
> > --- a/gcc/passes.c
> > +++ b/gcc/passes.c
> > @@ -1315,12 +1315,12 @@ pipeline::pipeline (context *ctxt)
> >  #define POP_INSERT_PASSES() \
> >}
> >
> > -#define NEXT_PASS(PASS)  (p = next_pass_1 (p, &((PASS).pass)))
> > +#define NEXT_PASS(PASS, NUM)  (p = next_pass_1 (p, &((PASS).pass)))
> >
> >  #define TERMINATE_PASS_LIST() \
> >*p = NULL;
> >
> > -#include "passes.def"
> > +#include "pass-instances.def"
>
> Given this, the usage comment at the top of 'gcc/passes.def' (see below)
> no longer is accurate (even if that latter file does continue to use the
> 'NEXT_PASS' form without 'NUM') -- and, worse, the 'NEXT_PASS' etc. in
> that usage comment are processed by the 'gcc/gen-pass-instances.awk'
> script:
>
> --- source-gcc/gcc/passes.def   2024-06-24 18:55:15.132561641 +0200
> +++ build-gcc/gcc/pass-instances.def2024-06-24 18:55:27.768562714 
> +0200
> [...]
> @@ -20,546 +22,578 @@
>  /*
>   Macros that should be defined when using this file:
> INSERT_PASSES_AFTER (PASS)
> PUSH_INSERT_PASSES_WITHIN (PASS)
> POP_INSERT_PASSES ()
> -   NEXT_PASS (PASS)
> +   NEXT_PASS (PASS, 1)
> TERMINATE_PASS_LIST (PASS)
>   */
> [...]
>
> (That is, this is 'NEXT_PASS' for the first instance of pass 'PASS'.)
> That's benign so far, but with another thing that I'll be extending, I'd
> then run into an error while the script handles this comment block.  ;-\
>
> OK to push "Rewrite usage comment at the top of 'gcc/passes.def'", see
> attached?

OK

>
> Grüße
>  Thomas
>
>

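For reference, the consuming side works exactly as the hunk quoted above
shows: pass-instances.def supplies an explicit instance number, and each
consumer defines NEXT_PASS to use or ignore it.  A short sketch (the second
consumer is hypothetical, not GCC code):

  /* Consumer that ignores the instance number, as in passes.cc.  */
  #define NEXT_PASS(PASS, NUM)  (p = next_pass_1 (p, &((PASS).pass)))
  #include "pass-instances.def"
  #undef NEXT_PASS

  /* Hypothetical consumer that keys per-instance data off NUM.  */
  #define NEXT_PASS(PASS, NUM)  register_pass_instance (&((PASS).pass), (NUM))
  #include "pass-instances.def"
  #undef NEXT_PASS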

Re: LoongArch vs. [PATCH 0/6] Add a late-combine pass

2024-06-28 Thread chenglulu



在 2024/6/28 下午8:35, Xi Ruoyao 写道:

On Fri, 2024-06-28 at 20:34 +0800, chenglulu wrote:

在 2024/6/28 下午8:25, Xi Ruoyao 写道:

Hi Richard,

The late combine pass has triggered some FAILs on LoongArch and I'm
investigating.  One of them is movcf2gr-via-fr.c.  In
315r.postreload:

(insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
  (reg:FCC 64 $fcc0 [87]))
"../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
168 {movfcc_internal}
   (nil))
(insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
  (reg:FCC 32 $f0 [87]))
"../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
168 {movfcc_internal}
   (nil))

The late combine pass combines these to:

(insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
  (reg:FCC 64 $fcc0 [87]))
"../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
168 {movfcc_internal}
   (nil))

But we are using a FPR ($f0) here deliberately to work around an
architectural issue in LA464 causing a direct FCC-to-GPR move very
slow.

Could you suggest how to fix this issue?

Hi, Ruoyao:

We need to define TARGET_INSN_COST and set the cost of
movcf2gr/movgr2cf.

I've fixed this and am doing correctness testing now.

Ah thanks!  So it uses insn cost instead of rtx cost and I didn't
realize.



That's right.:-D

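A minimal sketch of the hook being suggested here (illustrative only: the
predicates and the cost value are assumptions, not the committed LoongArch
fix).  The idea is to make a direct FCC-to-GPR move expensive enough that
late-combine keeps the FPR round-trip:

  /* Charge a high cost for a direct $fcc -> GPR move, which is very slow
     on LA464; otherwise fall back to the generic pattern cost.  */
  static int
  loongarch_insn_cost (rtx_insn *insn, bool speed)
  {
    rtx set = single_set (insn);
    if (set
        && GET_MODE (SET_DEST (set)) == FCCmode
        && REG_P (SET_DEST (set)) && REG_P (SET_SRC (set))
        && GP_REG_P (REGNO (SET_DEST (set)))
        && FCC_REG_P (REGNO (SET_SRC (set))))
      return COSTS_N_INSNS (20);
    return pattern_cost (PATTERN (insn), speed);
  }

  #undef  TARGET_INSN_COST
  #define TARGET_INSN_COST loongarch_insn_cost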


Re: LoongArch vs. [PATCH 0/6] Add a late-combine pass

2024-06-28 Thread Xi Ruoyao
On Fri, 2024-06-28 at 20:34 +0800, chenglulu wrote:
> 
> 在 2024/6/28 下午8:25, Xi Ruoyao 写道:
> > Hi Richard,
> > 
> > The late combine pass has triggered some FAILs on LoongArch and I'm
> > investigating.  One of them is movcf2gr-via-fr.c.  In
> > 315r.postreload:
> > 
> > (insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
> >  (reg:FCC 64 $fcc0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >   (nil))
> > (insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
> >  (reg:FCC 32 $f0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >   (nil))
> > 
> > The late combine pass combines these to:
> > 
> > (insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
> >  (reg:FCC 64 $fcc0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >   (nil))
> > 
> > But we are using a FPR ($f0) here deliberately to work around an
> > architectural issue in LA464 causing a direct FCC-to-GPR move very
> > slow.
> > 
> > Could you suggest how to fix this issue?
> 
> Hi, Ruoyao:
> 
> We need to define TARGET_INSN_COST and set the cost of
> movcf2gr/movgr2cf.
> 
> I've fixed this and am doing correctness testing now.

Ah thanks!  So it uses insn cost instead of rtx cost and I didn't
realize.


-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: LoongArch vs. [PATCH 0/6] Add a late-combine pass

2024-06-28 Thread chenglulu



在 2024/6/28 下午8:25, Xi Ruoyao 写道:

Hi Richard,

The late combine pass has triggered some FAILs on LoongArch and I'm
investigating.  One of them is movcf2gr-via-fr.c.  In 315r.postreload:

(insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
 (reg:FCC 64 $fcc0 [87])) 
"../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 
{movfcc_internal}
  (nil))
(insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
 (reg:FCC 32 $f0 [87])) 
"../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 
{movfcc_internal}
  (nil))

The late combine pass combines these to:

(insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
 (reg:FCC 64 $fcc0 [87])) 
"../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 
{movfcc_internal}
  (nil))

But we are using a FPR ($f0) here deliberately to work around an
architectural issue in LA464 causing a direct FCC-to-GPR move very slow.

Could you suggest how to fix this issue?


Hi, Ruoyao:

We need to define TARGET_INSN_COST and set the cost of movcf2gr/movgr2cf.

I've fixed this and am doing correctness testing now.



On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:

This series is a resubmission of the late-combine work.  I've fixed
some bugs that Jeff's cross-target CI found last time and some others
that I hit since then.

/* snip */





Re: [PATCH] Fix native_encode_vector_part for itype when TYPE_PRECISION (itype) == BITS_PER_UNIT

2024-06-28 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, Jun 28, 2024 at 2:16 PM Richard Biener
>  wrote:
>>
>> On Fri, Jun 28, 2024 at 11:06 AM Richard Biener
>>  wrote:
>> >
>> >
>> >
>> > > Am 28.06.2024 um 10:27 schrieb Richard Sandiford 
>> > > :
>> > >
>> > > Richard Biener  writes:
>> > >>> On Fri, Jun 28, 2024 at 8:01 AM Richard Biener
>> > >>>  wrote:
>> > >>>
>> > >>> On Fri, Jun 28, 2024 at 3:15 AM liuhongt  wrote:
>> > 
>> >  for the testcase in the PR115406, here is part of the dump.
>> > 
>> >   char D.4882;
>> >   vector(1) <signed-boolean:8> _1;
>> >   vector(1) signed char _2;
>> >   char _5;
>> > 
>> >   <bb 2> :
>> >   _1 = { -1 };
>> > 
>> >  When assigning { -1 } to vector(1) <signed-boolean:8>,
>> >  since TYPE_PRECISION (itype) <= BITS_PER_UNIT, it sets each bit of dest
>> >  with each vector element.  But I think the bit setting should only apply
>> >  for TYPE_PRECISION (itype) < BITS_PER_UNIT, i.e. for
>> >  vector(1) <signed-boolean:4>.  For <signed-boolean:8>, it will be
>> >  assigned as -1, instead of 1.
>> >  Is there any specific reason vector(1) <signed-boolean:8> is handled
>> >  differently from vector<1> <signed-boolean:16> ?
>> > 
>> >  Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
>> >  Ok for trunk?
>> > >>>
>> > >>> I agree that <= BITS_PER_UNIT is suspicious, but the bit-precision
>> > >>> code should work for 8 bit
>> > >>> entities as well, it seems we only set the LSB of each element in the
>> > >>> "mask".  ISTR that SVE
>> > >>> masks can have up to 8 bit elements (for 8 byte data elements), so
>> > >>> maybe that's why
>> > >>> <= BITS_PER_UNIT.
>> > >
>> > > Yeah.
>> >
>> > So is it necessary that only one bit is set for SVE?

TBH I can't remember now.  It matches what SVE instructions produce, and
lines up with the associated RTL code (which at the time was SVE-specific).
But when dealing with multibyte elements, upper predicate element bits
are ignored on read, so matching the instructions might not matter.

>> > >>> So maybe instead of just setting one bit in
>> > >>>
>> > >>>  ptr[bit / BITS_PER_UNIT] |= 1 << (bit % BITS_PER_UNIT);
>> > >>>
>> > >>> we should set elt_bits bits, aka (without testing)
>> > >>>
>> > >>>  ptr[bit / BITS_PER_UNIT] |= (1 << elt_bits - 1) << (bit
>> > >>> % BITS_PER_UNIT);
>> > >>>
>> > >>> ?
>> > >>
>> > >> Alternatively
>> > >>
>> > >>  if (VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (expr))
>> > >>  && TYPE_PRECISION (itype) <= BITS_PER_UNIT)
>> > >>
>> > >> should be amended with
>> > >>
>> > >>   && GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (expr))) != MODE_VECTOR_INT
>> > >
>> > > How about:
>> > >
>> > >  if (GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (expr))) == MODE_VECTOR_BOOL)
>> > >{
>> > >  gcc_assert (TYPE_PRECISION (itype) <= BITS_PER_UNIT);
>> > >
>> > > ?
>> >
>> > Note the path is also necessary for avx512 and gcn mask modes which are 
>> > integer modes.
>> >
>> > > Is it OK for TYPE_MODE to affect tree-level semantics though, especially
>> > > since it can change with the target attribute?  (At least TYPE_MODE_RAW
>> > > would be stable.)
>> >
>> > That’s a good question and also related to GCC vector extension which can 
>> > result in both BLKmode and integer modes to be used.  But I’m not sure how 
>> > we expose masks to the middle end here.  A too large vector bool could be 
>> > lowered to AVX512 mode.  Maybe we should simply reject interpret/encode of 
>> > BLKmode vectors and make sure to never assign integer modes to vector 
>> > bools (if the target didn’t specify that mode)?
>> >
>> > I guess some test coverage would be nice here.
>>
>> To continue on that, we do not currently have a way to capture a
>> vector comparison output
>> but the C++ frontend has vector ?:
>>
>> typedef int v8si __attribute__((vector_size(32)));
>>
>> void foo (v8si *a, v8si *b, v8si *c)
>> {
>>   *c = *a < *b ? (v8si){-1,-1,-1,-1,-1,-1,-1,-1 } : (v8si){0,0,0,0,0,0,0,0};
>> }
>>
>> with SSE2 we get a <signed-boolean:32> temporary, with AVX512 enabled
>> that becomes <signed-boolean:1>.  When we enlarge the vector to size 128
>> then even with AVX512 enabled I see <signed-boolean:32> here and
>> vector lowering decomposes that to scalar (also with AVX or SSE, so maybe
>> just a missed optimization).  But note that to decompose this into two
>> AVX512 vectors the temporary would have to change from <signed-boolean:32>
>> elements to <signed-boolean:1>.
>>
>> The unsupported vector bool types have BLKmode so far.
>>
>> But for example on i?86-linux with -mno-sse (like -march=i586) for
>>
>> typedef short v2hi __attribute__((vector_size(4)));
>>
>> void foo (v2hi *a, v2hi *b, v2hi *c)
>> {
>>   *c = *a < *b ? (v2hi){-1,-1} : (v2hi){0,0};
>> }
>>
>> we get a SImode vector <signed-boolean:16> as I feared.  That means
>> <signed-boolean:8> (the BITS_PER_UNIT case) can be ambiguous
>> between SVE (bool for an 8-byte data vector) and emulated vectors
>> ("word-mode" vectors; for 1-byte data vectors).
>>
>> And without knowing that SVE would have used VnBImode given that
>> AVX512 uses an integer mode.
>>
>> Aside from the too large vector and AVX512 issue above 

LoongArch vs. [PATCH 0/6] Add a late-combine pass

2024-06-28 Thread Xi Ruoyao
Hi Richard,

The late combine pass has triggered some FAILs on LoongArch and I'm
investigating.  One of them is movcf2gr-via-fr.c.  In 315r.postreload:

(insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
(reg:FCC 64 $fcc0 [87])) 
"../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 
{movfcc_internal}
 (nil))
(insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
(reg:FCC 32 $f0 [87])) 
"../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 
{movfcc_internal}
 (nil))

The late combine pass combines these to:

(insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
(reg:FCC 64 $fcc0 [87])) 
"../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 
{movfcc_internal}
 (nil))

But we are using a FPR ($f0) here deliberately to work around an
architectural issue in LA464 causing a direct FCC-to-GPR move very slow.

Could you suggest how to fix this issue?

On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
> This series is a resubmission of the late-combine work.  I've fixed
> some bugs that Jeff's cross-target CI found last time and some others
> that I hit since then.

/* snip */

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: [PATCH] Fix native_encode_vector_part for itype when TYPE_PRECISION (itype) == BITS_PER_UNIT

2024-06-28 Thread Richard Biener
On Fri, Jun 28, 2024 at 2:16 PM Richard Biener
 wrote:
>
> On Fri, Jun 28, 2024 at 11:06 AM Richard Biener
>  wrote:
> >
> >
> >
> > > Am 28.06.2024 um 10:27 schrieb Richard Sandiford 
> > > :
> > >
> > > Richard Biener  writes:
> > >>> On Fri, Jun 28, 2024 at 8:01 AM Richard Biener
> > >>>  wrote:
> > >>>
> > >>> On Fri, Jun 28, 2024 at 3:15 AM liuhongt  wrote:
> > 
> >  for the testcase in the PR115406, here is part of the dump.
> > 
> >   char D.4882;
> >   vector(1) <signed-boolean:8> _1;
> >   vector(1) signed char _2;
> >   char _5;
> > 
> >   <bb 2> :
> >   _1 = { -1 };
> > 
> >  When assigning { -1 } to vector(1) <signed-boolean:8>,
> >  since TYPE_PRECISION (itype) <= BITS_PER_UNIT, it sets each bit of dest
> >  with each vector element.  But I think the bit setting should only apply
> >  for TYPE_PRECISION (itype) < BITS_PER_UNIT, i.e. for
> >  vector(1) <signed-boolean:4>.  For <signed-boolean:8>, it will be
> >  assigned as -1, instead of 1.
> >  Is there any specific reason vector(1) <signed-boolean:8> is handled
> >  differently from vector<1> <signed-boolean:16> ?
> > 
> >  Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> >  Ok for trunk?
> > >>>
> > >>> I agree that <= BITS_PER_UNIT is suspicious, but the bit-precision
> > >>> code should work for 8 bit
> > >>> entities as well, it seems we only set the LSB of each element in the
> > >>> "mask".  ISTR that SVE
> > >>> masks can have up to 8 bit elements (for 8 byte data elements), so
> > >>> maybe that's why
> > >>> <= BITS_PER_UNIT.
> > >
> > > Yeah.
> >
> > So is it necessary that only one bit is set for SVE?
> >
> > >>> So maybe instead of just setting one bit in
> > >>>
> > >>>  ptr[bit / BITS_PER_UNIT] |= 1 << (bit % BITS_PER_UNIT);
> > >>>
> > >>> we should set elt_bits bits, aka (without testing)
> > >>>
> > >>>  ptr[bit / BITS_PER_UNIT] |= (1 << elt_bits - 1) << (bit
> > >>> % BITS_PER_UNIT);
> > >>>
> > >>> ?
> > >>
> > >> Alternatively
> > >>
> > >>  if (VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (expr))
> > >>  && TYPE_PRECISION (itype) <= BITS_PER_UNIT)
> > >>
> > >> should be amended with
> > >>
> > >>   && GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (expr))) != MODE_VECTOR_INT
> > >
> > > How about:
> > >
> > >  if (GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (expr))) == MODE_VECTOR_BOOL)
> > >{
> > >  gcc_assert (TYPE_PRECISION (itype) <= BITS_PER_UNIT);
> > >
> > > ?
> >
> > Note the path is also necessary for avx512 and gcn mask modes which are 
> > integer modes.
> >
> > > Is it OK for TYPE_MODE to affect tree-level semantics though, especially
> > > since it can change with the target attribute?  (At least TYPE_MODE_RAW
> > > would be stable.)
> >
> > That’s a good question and also related to GCC vector extension which can 
> > result in both BLKmode and integer modes to be used.  But I’m not sure how 
> > we expose masks to the middle end here.  A too large vector bool could be 
> > lowered to AVX512 mode.  Maybe we should simply reject interpret/encode of 
> > BLKmode vectors and make sure to never assign integer modes to vector bools 
> > (if the target didn’t specify that mode)?
> >
> > I guess some test coverage would be nice here.
>
> To continue on that, we do not currently have a way to capture a
> vector comparison output
> but the C++ frontend has vector ?:
>
> typedef int v8si __attribute__((vector_size(32)));
>
> void foo (v8si *a, v8si *b, v8si *c)
> {
>   *c = *a < *b ? (v8si){-1,-1,-1,-1,-1,-1,-1,-1 } : (v8si){0,0,0,0,0,0,0,0};
> }
>
> with SSE2 we get a <signed-boolean:32> temporary, with AVX512 enabled
> that becomes <signed-boolean:1>.  When we enlarge the vector to size 128
> then even with AVX512 enabled I see <signed-boolean:32> here and
> vector lowering decomposes that to scalar (also with AVX or SSE, so maybe
> just a missed optimization).  But note that to decompose this into two
> AVX512 vectors the temporary would have to change from <signed-boolean:32>
> elements to <signed-boolean:1>.
>
> The unsupported vector bool types have BLKmode so far.
>
> But for example on i?86-linux with -mno-sse (like -march=i586) for
>
> typedef short v2hi __attribute__((vector_size(4)));
>
> void foo (v2hi *a, v2hi *b, v2hi *c)
> {
>   *c = *a < *b ? (v2hi){-1,-1} : (v2hi){0,0};
> }
>
> we get a SImode vector <signed-boolean:16> as I feared.  That means
> <signed-boolean:8> (the BITS_PER_UNIT case) can be ambiguous
> between SVE (bool for an 8-byte data vector) and emulated vectors
> ("word-mode" vectors; for 1-byte data vectors).
>
> And without knowing that SVE would have used VnBImode given that
> AVX512 uses an integer mode.
>
> Aside from the too large vector and AVX512 issue above I think we can use
> MODE_VECTOR_BOOL || TYPE_PRECISION == 1 and for the latter we
> can assert the mode is a scalar integer mode (AVX512 or GCN)?

So like the attached?

Richard.


p
Description: Binary data


Re: [PATCH] Fix native_encode_vector_part for itype when TYPE_PRECISION (itype) == BITS_PER_UNIT

2024-06-28 Thread Richard Biener
On Fri, Jun 28, 2024 at 11:06 AM Richard Biener
 wrote:
>
>
>
> > Am 28.06.2024 um 10:27 schrieb Richard Sandiford 
> > :
> >
> > Richard Biener  writes:
> >>> On Fri, Jun 28, 2024 at 8:01 AM Richard Biener
> >>>  wrote:
> >>>
> >>> On Fri, Jun 28, 2024 at 3:15 AM liuhongt  wrote:
> 
>  for the testcase in the PR115406, here is part of the dump.
> 
>   char D.4882;
>   vector(1) <signed-boolean:8> _1;
>   vector(1) signed char _2;
>   char _5;
> 
>   <bb 2> :
>   _1 = { -1 };
> 
>  When assigning { -1 } to vector(1) <signed-boolean:8>,
>  since TYPE_PRECISION (itype) <= BITS_PER_UNIT, it sets each bit of dest
>  with each vector element.  But I think the bit setting should only apply
>  for TYPE_PRECISION (itype) < BITS_PER_UNIT, i.e. for
>  vector(1) <signed-boolean:4>.  For <signed-boolean:8>, it will be
>  assigned as -1, instead of 1.
>  Is there any specific reason vector(1) <signed-boolean:8> is handled
>  differently from vector<1> <signed-boolean:16> ?
> 
>  Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
>  Ok for trunk?
> >>>
> >>> I agree that <= BITS_PER_UNIT is suspicious, but the bit-precision
> >>> code should work for 8 bit
> >>> entities as well, it seems we only set the LSB of each element in the
> >>> "mask".  ISTR that SVE
> >>> masks can have up to 8 bit elements (for 8 byte data elements), so
> >>> maybe that's why
> >>> <= BITS_PER_UNIT.
> >
> > Yeah.
>
> So is it necessary that only one bit is set for SVE?
>
> >>> So maybe instead of just setting one bit in
> >>>
> >>>  ptr[bit / BITS_PER_UNIT] |= 1 << (bit % BITS_PER_UNIT);
> >>>
> >>> we should set elt_bits bits, aka (without testing)
> >>>
> >>>  ptr[bit / BITS_PER_UNIT] |= (1 << elt_bits - 1) << (bit
> >>> % BITS_PER_UNIT);
> >>>
> >>> ?
> >>
> >> Alternatively
> >>
> >>  if (VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (expr))
> >>  && TYPE_PRECISION (itype) <= BITS_PER_UNIT)
> >>
> >> should be amended with
> >>
> >>   && GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (expr))) != MODE_VECTOR_INT
> >
> > How about:
> >
> >  if (GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (expr))) == MODE_VECTOR_BOOL)
> >{
> >  gcc_assert (TYPE_PRECISION (itype) <= BITS_PER_UNIT);
> >
> > ?
>
> Note the path is also necessary for avx512 and gcn mask modes which are 
> integer modes.
>
> > Is it OK for TYPE_MODE to affect tree-level semantics though, especially
> > since it can change with the target attribute?  (At least TYPE_MODE_RAW
> > would be stable.)
>
> That’s a good question and also related to GCC vector extension which can 
> result in both BLKmode and integer modes to be used.  But I’m not sure how we 
> expose masks to the middle end here.  A too large vector bool could be 
> lowered to AVX512 mode.  Maybe we should simply reject interpret/encode of 
> BLKmode vectors and make sure to never assign integer modes to vector bools 
> (if the target didn’t specify that mode)?
>
> I guess some test coverage would be nice here.

To continue on that, we do not currently have a way to capture a
vector comparison output
but the C++ frontend has vector ?:

typedef int v8si __attribute__((vector_size(32)));

void foo (v8si *a, v8si *b, v8si *c)
{
  *c = *a < *b ? (v8si){-1,-1,-1,-1,-1,-1,-1,-1 } : (v8si){0,0,0,0,0,0,0,0};
}

with SSE2 we get a <signed-boolean:32> temporary, with AVX512 enabled
that becomes <signed-boolean:1>.  When we enlarge the vector to size 128
then even with AVX512 enabled I see <signed-boolean:32> here and
vector lowering decomposes that to scalar (also with AVX or SSE, so maybe
just a missed optimization).  But note that to decompose this into two
AVX512 vectors the temporary would have to change from <signed-boolean:32>
elements to <signed-boolean:1>.

The unsupported vector bool types have BLKmode so far.

But for example on i?86-linux with -mno-sse (like -march=i586) for

typedef short v2hi __attribute__((vector_size(4)));

void foo (v2hi *a, v2hi *b, v2hi *c)
{
  *c = *a < *b ? (v2hi){-1,-1} : (v2hi){0,0};
}

we get a SImode vector <signed-boolean:16> as I feared.  That means
<signed-boolean:8> (the BITS_PER_UNIT case) can be ambiguous
between SVE (bool for an 8-byte data vector) and emulated vectors
("word-mode" vectors; for 1-byte data vectors).

And without knowing that SVE would have used VnBImode given that
AVX512 uses an integer mode.

Aside from the too large vector and AVX512 issue above I think we can use
MODE_VECTOR_BOOL || TYPE_PRECISION == 1 and for the latter we
can assert the mode is a scalar integer mode (AVX512 or GCN)?

Richard.


> >> maybe.  Still for the possibility of a vector(n) <signed-boolean:16>
> >> mask for an int128 element vector
> >> we'd have 16bit mask elements, encoding that differently would be
> >> inconsistent as well
> >> (but of course 16bit elements are not handled by the code right now).
> >
> > Yeah, 16-bit predicate elements aren't a thing for SVE, so we've not
> > had to add support for them.
> >
> > Richard

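A side note on the "set elt_bits bits" variant quoted in this thread: the
inline expression needs parentheses, since 1 << elt_bits - 1 parses as
1 << (elt_bits - 1), while an all-ones element mask is (1 << elt_bits) - 1.
A hedged sketch of that variant (illustrative, not the committed fix;
assumes an element never straddles a byte):

  for (unsigned int i = 0; i < count; i++)
    {
      unsigned int bit = i * elt_bits;
      unsigned int mask = (1u << elt_bits) - 1;  /* not 1 << elt_bits - 1 */
      /* Replicate the truth value across the whole elt_bits-wide element
         instead of setting only its LSB.  */
      if (wi::to_wide (VECTOR_CST_ELT (expr, i)) != 0)
        ptr[bit / BITS_PER_UNIT] |= mask << (bit % BITS_PER_UNIT);
    }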

Rewrite usage comment at the top of 'gcc/passes.def' (was: [PATCH 02/11] Generate pass-instances.def)

2024-06-28 Thread Thomas Schwinge
Hi!

On 2013-07-26T11:04:32-0400, David Malcolm  wrote:
> Introduce a new gen-pass-instances.awk script, and use it at build time
> to make a pass-instances.def from passes.def.

(The script has later been rewritten and extended, but the issue I'm
discussing is relevant already in its original version.)

> The generated pass-instances.def contains similar content to passes.def,
> but the pass instances within it are explicitly numbered, so that e.g.
> the third instance of:
>
>   NEXT_PASS (pass_copy_prop)
>
> becomes:
>
>   NEXT_PASS (pass_copy_prop, 3)

> --- a/gcc/passes.c
> +++ b/gcc/passes.c
> @@ -1315,12 +1315,12 @@ pipeline::pipeline (context *ctxt)
>  #define POP_INSERT_PASSES() \
>}
>  
> -#define NEXT_PASS(PASS)  (p = next_pass_1 (p, &((PASS).pass)))
> +#define NEXT_PASS(PASS, NUM)  (p = next_pass_1 (p, &((PASS).pass)))
>  
>  #define TERMINATE_PASS_LIST() \
>*p = NULL;
>  
> -#include "passes.def"
> +#include "pass-instances.def"

Given this, the usage comment at the top of 'gcc/passes.def' (see below)
no longer is accurate (even if that latter file does continue to use the
'NEXT_PASS' form without 'NUM') -- and, worse, the 'NEXT_PASS' etc. in
that usage comment are processed by the 'gcc/gen-pass-instances.awk'
script:

--- source-gcc/gcc/passes.def   2024-06-24 18:55:15.132561641 +0200
+++ build-gcc/gcc/pass-instances.def2024-06-24 18:55:27.768562714 +0200
[...]
@@ -20,546 +22,578 @@
 /*
  Macros that should be defined when using this file:
INSERT_PASSES_AFTER (PASS)
PUSH_INSERT_PASSES_WITHIN (PASS)
POP_INSERT_PASSES ()
-   NEXT_PASS (PASS)
+   NEXT_PASS (PASS, 1)
TERMINATE_PASS_LIST (PASS)
  */
[...]

(That is, this is 'NEXT_PASS' for the first instance of pass 'PASS'.)
That's benign so far, but with another thing that I'll be extending, I'd
then run into an error while the script handles this comment block.  ;-\

OK to push "Rewrite usage comment at the top of 'gcc/passes.def'", see
attached?


Grüße
 Thomas


From 072cdf7d9cf86fb2b0553b93365648e153b4376b Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Fri, 28 Jun 2024 14:05:04 +0200
Subject: [PATCH] Rewrite usage comment at the top of 'gcc/passes.def'

Since Subversion r201359 (Git commit a167b052dfe9a8509bb23c374ffaeee953df0917)
"Introduce gen-pass-instances.awk and pass-instances.def", the usage comment at
the top of 'gcc/passes.def' no longer is accurate (even if that latter file
does continue to use the 'NEXT_PASS' form without 'NUM') -- and, worse, the
'NEXT_PASS' etc. in that usage comment are processed by the
'gcc/gen-pass-instances.awk' script:

--- source-gcc/gcc/passes.def   2024-06-24 18:55:15.132561641 +0200
+++ build-gcc/gcc/pass-instances.def2024-06-24 18:55:27.768562714 +0200
[...]
@@ -20,546 +22,578 @@
 /*
  Macros that should be defined when using this file:
INSERT_PASSES_AFTER (PASS)
PUSH_INSERT_PASSES_WITHIN (PASS)
POP_INSERT_PASSES ()
-   NEXT_PASS (PASS)
+   NEXT_PASS (PASS, 1)
TERMINATE_PASS_LIST (PASS)
  */
[...]

(That is, this is 'NEXT_PASS' for the first instance of pass 'PASS'.)
That's benign so far, but with another thing that I'll be extending, I'd
then run into an error while the script handles this comment block.  ;-\

	gcc/
	* passes.def: Rewrite usage comment at the top.
---
 gcc/passes.def | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/gcc/passes.def b/gcc/passes.def
index 1f222729d39..3f65fcf71d6 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -17,14 +17,11 @@ You should have received a copy of the GNU General Public License
 along with GCC; see the file COPYING3.  If not see
 .  */
 
-/*
- Macros that should be defined when using this file:
-   INSERT_PASSES_AFTER (PASS)
-   PUSH_INSERT_PASSES_WITHIN (PASS)
-   POP_INSERT_PASSES ()
-   NEXT_PASS (PASS)
-   TERMINATE_PASS_LIST (PASS)
- */
+/* Note that this file is processed by a simple parser:
+   'gen-pass-instances.awk', so carefully verify the generated
+   'pass-instances.def' if you deviate from the syntax otherwise used in
+   here.  */
+
 
  /* All passes needed to lower the function into shape optimizers can
 operate on.  These passes are always run first on the function, but
-- 
2.34.1



Re: [PATCH] Use move-aware auto_vec in map

2024-06-28 Thread Richard Biener
On Fri, Jun 28, 2024 at 8:43 AM Jørgen Kvalsvik  wrote:
>
> Using auto_vec rather than vec for the map values means the vectors are
> released automatically upon return, to stop the leak.  The problem seems
> to be that auto_vec is not really move-aware, only the auto_vec<T, 0>
> specialization is.

Indeed.

> This is actually Jan's original suggestion
> https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655600.html which I
> improvised on by also using embedded storage. I think it should fix this
> regression:
> https://gcc.gnu.org/pipermail/gcc-regression/2024-June/080152.html
>
> I could not reproduce it on x86-64 linux, so if someone could help me
> test it on aarch64 that would be much appreciated.

OK.

> --
>
> gcc/ChangeLog:
>
> * tree-profile.cc (find_conditions): Use auto_vec without
>   embedded storage.
> ---
>  gcc/tree-profile.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
> index 8c9945847ca..153c9323040 100644
> --- a/gcc/tree-profile.cc
> +++ b/gcc/tree-profile.cc
> @@ -876,7 +876,7 @@ find_conditions (struct function *fn)
>  make_top_index (fnblocks, ctx.B1, ctx.top_index);
>
>  /* Bin the Boolean expressions so that exprs[id] -> [x1, x2, ...].  */
> -hash_map<unsigned, auto_vec<basic_block, 32>> exprs;
> +hash_map<unsigned, auto_vec<basic_block>> exprs;
>  for (basic_block b : fnblocks)
>  {
> const unsigned uid = condition_uid (fn, b);
> --
> 2.39.2
>

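For readers unfamiliar with the distinction at play here, a rough
illustration (sketch only; see vec.h for the real definitions):

  auto_vec<int> heap_only;    /* auto_vec<T, 0>: storage always on the
                                 heap, so the object is movable and safe
                                 as a hash_map value.  */
  auto_vec<int, 8> embedded;  /* auto_vec<T, N>: the first N elements
                                 live inside the object itself, so a
                                 bitwise move would leave the internal
                                 pointer referring to the moved-from
                                 object's storage.  */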

Re: [PATCH] libgccjit: Add ability to get the alignment of a type

2024-06-28 Thread Rainer Orth
David Malcolm  writes:

> On Thu, 2024-04-04 at 18:59 -0400, Antoni Boucher wrote:
>> Hi.
>> This patch adds a new API to produce an rvalue representing the 
>> alignment of a type.
>> Thanks for the review.
>
> Patch looks good to me (but may need the usual ABI version updates when
> merging).

This patch broke macOS bootstrap:

/vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc: In member function 
'virtual gcc::jit::recording::string* 
gcc::jit::recording::memento_of_typeinfo::make_debug_string()': 
/vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc:5529:30: error: 'ident' 
may be used uninitialized [-Werror=maybe-uninitialized]
 5529 |   return string::from_printf (m_ctxt,
  |  ^~~~
 5530 |   "%s (%s)",
  |   ~~
 5531 |   ident,
  |   ~~
 5532 |   m_type->get_debug_string ());
  |   
/vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc:5519:15: note: 'ident' 
was declared here
 5519 |   const char* ident;
  |   ^

/vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc: In member function 
'virtual void 
gcc::jit::recording::memento_of_typeinfo::write_reproducer(gcc::jit::reproducer&)':
  
/vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc:5552:11: error: 'type' 
may be used uninitialized [-Werror=maybe-uninitialized]
 5552 |   r.write ("  gcc_jit_rvalue *%s =\n"
  |   ^~~
 5553 | "gcc_jit_context_new_%sof (%s, /* gcc_jit_context *ctxt */\n"
  | ~
 5554 | "(gcc_jit_type *) %s); /* 
gcc_jit_type *type */\n",
  | 
~~~
  | id,
  | ~~~
 5556 | type,
  | ~
 5557 | r.get_identifier (get_context ()),
  | ~~
 5558 | r.get_identifier (m_type));
  | ~~
/vol/gcc/src/hg/master/darwin/gcc/jit/jit-recording.cc:5541:15: note: 'type' 
was declared here
 5541 |   const char* type;
  |   ^~~~

I wonder how this can have worked anywhere (apart from jit not being
enabled by default on non-Darwin targets).

Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University

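One conventional fix for such warnings is to make the initialization
provably exhaustive.  A hedged sketch (the enum and member names are
guesses based on the diagnostics above, not the actual jit-recording.cc
code):

  const char *ident = NULL;
  switch (m_info_type)
    {
    case TYPE_INFO_ALIGN_OF:
      ident = "alignof";
      break;
    case TYPE_INFO_SIZE_OF:
      ident = "sizeof";
      break;
    default:
      gcc_unreachable ();
    }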

Re: [PATCH] i386: Fix regression after refactoring legitimize_pe_coff_symbol, ix86_GOT_alias_set and PE_COFF_LEGITIMIZE_EXTERN_DECL

2024-06-28 Thread Uros Bizjak
On Fri, Jun 28, 2024 at 1:41 PM Evgeny Karpov
 wrote:
>
> Thursday, June 27, 2024 8:13 PM
> Uros Bizjak  wrote:
>
> >
> > So, there is no problem having #endif just after else.
> >
> > Anyway, it's your call, this is not a hill I'm willing to die on. ;)
> >
> > Thanks,
> > Uros.
>
> It looks like the patch resolves 3 reported issues.
> Uros, I suggest merging the patch as it is, without minor refactoring, to 
> avoid triggering another round of testing, if you agree.

Yes, please go ahead.

Thanks,
Uros.


[PATCH] tree-optimization/115652 - more fixing of the fix

2024-06-28 Thread Richard Biener
The following addresses the corner case of an outer loop with an empty
header where we end up asking for the BB of a NULL stmt by
special-casing this case.

Bootstrap and regtest running on x86_64-unknown-linux-gnu, the patch
fixes observed ICEs on GCN.

PR tree-optimization/115652
* tree-vect-slp.cc (vect_schedule_slp_node): Handle the case
where the outer loop header block is empty.
---
 gcc/tree-vect-slp.cc | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 174b4800fa9..dd9017e5b3a 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -9750,8 +9750,15 @@ vect_schedule_slp_node (vec_info *vinfo,
  {
gimple_stmt_iterator si2
  = gsi_after_labels (LOOP_VINFO_LOOP (loop_vinfo)->header);
-   if (last_stmt != *si2
-   && vect_stmt_dominates_stmt_p (last_stmt, *si2))
+   if ((gsi_end_p (si2)
+&& (LOOP_VINFO_LOOP (loop_vinfo)->header
+!= gimple_bb (last_stmt))
+&& dominated_by_p (CDI_DOMINATORS,
+   LOOP_VINFO_LOOP (loop_vinfo)->header,
+   gimple_bb (last_stmt)))
+   || (!gsi_end_p (si2)
+   && last_stmt != *si2
+   && vect_stmt_dominates_stmt_p (last_stmt, *si2)))
  si = si2;
  }
}
-- 
2.35.3


[PATCH] i386: Fix regression after refactoring legitimize_pe_coff_symbol, ix86_GOT_alias_set and PE_COFF_LEGITIMIZE_EXTERN_DECL

2024-06-28 Thread Evgeny Karpov
Thursday, June 27, 2024 8:13 PM
Uros Bizjak  wrote:

> 
> So, there is no problem having #endif just after else.
> 
> Anyway, it's your call, this is not a hill I'm willing to die on. ;)
> 
> Thanks,
> Uros.

It looks like the patch resolves 3 reported issues.
Uros, I suggest merging the patch as it is, without minor refactoring, to avoid 
triggering another round of testing, if you agree.
Thanks.

Regards,
Evgeny


RE: [RFC PATCH] cse: Add another CSE pass after split1

2024-06-28 Thread Tamar Christina
Hi,

> -Original Message-
> From: Palmer Dabbelt 
> Sent: Thursday, June 27, 2024 10:57 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Palmer Dabbelt 
> Subject: [RFC PATCH] cse: Add another CSE pass after split1
> 
> This is really more of a question than a patch.
> 
> Looking at PR/115687 I managed to convince myself there's a general
> class of problems here: splitting might produce constant subexpressions,
> but as far as I can tell there's nothing to eliminate those constant
> subexpressions.  So I very quickly threw together a CSE that doesn't
> fold expressions, and it does eliminate the high-part constants in
> question.
> 
> At that point I realized the implementation here is bogus: it's not the
> folding that's the problem, but introducing new expressions post-split
> would break things -- or at least I think it would, we'd end up with
> insns the backends don't expect to have that late.  I'm not sure if
> split2 would end up cleaning all that up at a functional level, but it
> certainly seems like optimization would be pretty far off the rails at
> that point and thus doesn't seem like a good idea.  I'm also not sure
> how effective this would be without doing the folding, as without
> folding we can only eliminate the last insn in the constant sequence --
> that's fine here, but it wouldn't work for more complicated stuff.
> 
> So I think if this was to go anywhere we'd want to have a CSE that
> really only eliminates expressions (ie, doesn't do any of the other
> juggling to try and produce more constant subexpressions).  There's a
> few places where new expressions can be introduced, so it'd probably be
> better done as a new cse_insn-type function instead of just a flag.  It
> seems somewhat manageable to write, though.
> 
> That said, I really don't know what I'm doing here.  So I figured I'd
> just send out what I'd put together, mostly as a way to ask if it's
> worth putting time into this?

I've tried a similar thing in the past, as it's useful for cases where we
optimize predicates in RTL.  The general problem is that predicates in
gimple on unmasked instructions are missing.

I had, similarly to you, good results using another CSE pass after split,
and also ran into the issue that, since we're out of CFG mode, you can't
do any jump-related optimizations.  We also had issues with cases where
we would have converted FP operations to integer ones.

The new CSE pass would convert them back to FP ops,  but it looks like
your patch prevents any simplification at all?

I think that might be worth relaxing into any simplification where we would
end up requiring a reload, which was the internal suggestion I got from
Richard S last time but didn't get the time to work out.

Cheers,
Tamar

> ---
>  gcc/common.opt  |   4 ++
>  gcc/cse.cc  | 112 ++--
>  gcc/opts.cc |   1 +
>  gcc/passes.def  |   1 +
>  gcc/tree-pass.h |   1 +
>  5 files changed, 105 insertions(+), 14 deletions(-)
> 
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 327230967ea..efc4b8ddaf3 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -2695,6 +2695,10 @@ frerun-cse-after-loop
>  Common Var(flag_rerun_cse_after_loop) Optimization
>  Add a common subexpression elimination pass after loop optimizations.
> 
> +frerun-cse-after-split
> +Common Var(flag_rerun_cse_after_split) Optimization
> +Add a common subexpression elimination pass after splitting instructions.
> +
>  frerun-loop-opt
>  Common Ignore
>  Does nothing.  Preserved for backward compatibility.
> diff --git a/gcc/cse.cc b/gcc/cse.cc
> index c53deecbe54..d3955001ce7 100644
> --- a/gcc/cse.cc
> +++ b/gcc/cse.cc
> @@ -543,11 +543,11 @@ static rtx fold_rtx (rtx, rtx_insn *);
>  static rtx equiv_constant (rtx);
>  static void record_jump_equiv (rtx_insn *, bool);
>  static void record_jump_cond (enum rtx_code, machine_mode, rtx, rtx);
> -static void cse_insn (rtx_insn *);
> +static void cse_insn (rtx_insn *, int);
>  static void cse_prescan_path (struct cse_basic_block_data *);
>  static void invalidate_from_clobbers (rtx_insn *);
>  static void invalidate_from_sets_and_clobbers (rtx_insn *);
> -static void cse_extended_basic_block (struct cse_basic_block_data *);
> +static void cse_extended_basic_block (struct cse_basic_block_data *, int);
>  extern void dump_class (struct table_elt*);
>  static void get_cse_reg_info_1 (unsigned int regno);
>  static struct cse_reg_info * get_cse_reg_info (unsigned int regno);
> @@ -4511,12 +4511,13 @@ canonicalize_insn (rtx_insn *insn, vec<struct set> *psets)
> 
> 
> 
>  /* Main function of CSE.
> First simplify sources and addresses of all assignments
> -   in the instruction, using previously-computed equivalents values.
> +   in the instruction, using previously-computed equivalents values when
> +   simplification is allowed.
> Then install the new sources and destinations in the table
> of available values.  */
> 
>  static void
> -cse_insn (rtx_insn *insn)
> 

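To illustrate the "CSE that only eliminates, never introduces" idea
discussed in the thread above, a conceptual sketch in pseudo-GCC style
(the helper names mirror cse.cc, but this is not part of the posted
patch):

  /* Replace SET_SRC with an existing register equivalent when the hash
     table already has one; never fold or construct new RTL.  */
  static void
  cse_insn_eliminate_only (rtx_insn *insn)
  {
    rtx set = single_set (insn);
    if (!set)
      return;
    machine_mode mode = GET_MODE (SET_SRC (set));
    struct table_elt *elt
      = lookup (SET_SRC (set), HASH (SET_SRC (set), mode), mode);
    if (elt && REG_P (elt->exp))
      validate_change (insn, &SET_SRC (set), copy_rtx (elt->exp), false);
    /* No fold_rtx, no equiv_constant: nothing new is created, so no
       unexpected insns appear after split1.  */
  }
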
Re: Re: [PATCH 0/2] fix RISC-V zcmp popretz [PR113715]

2024-06-28 Thread Fei Gao
On 2024-06-09 04:36  Jeff Law  wrote:
>
>
>
>On 6/5/24 8:42 PM, Fei Gao wrote:
>
>>> But let's back up and get a good explanation of what the problem is.
>>> Based on patch 2/2 it looks like we have lost an assignment to the
>>> return register.
>>>
>>> To someone not familiar with this code, it sounds to me like we've made
>>> a mistake earlier and we're now defining a hook that lets us go back and
>>> fix that earlier mistake.   I'm probably wrong, but so far that's what
>>> it sounds like.
>> Hi Jeff
>>
>> You're right.  Let me rephrase patch 2/2 with more details.  Search
>> /* feigao to locate the point I'm trying to explain.
>>
>> code snippets from gcc/function.cc
>> void
>> thread_prologue_and_epilogue_insns (void)
>> {
>> ...
>>    /*feigao:
>>          targetm.gen_epilogue () is called here to generate epilogue 
>>sequence.
>> https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=b27d323a368033f0b37e93c57a57a35fd9997864
>> Commit above tries in targetm.gen_epilogue () to detect if
>> there's li   a0,0 insn at the end of insn chain, if so, cm.popret 
Hi Jeff
I should have made it clear that there are both 'li a0,0' and 'use a0'
insns here, not just 'li a0,0'.
>> is replaced by cm.popretz and li a0,0 insn is deleted.
>So that seems like the critical issue.  Generation of the
>prologue/epilogue really shouldn't be changing other instructions in the
>instruction stream.  I'm not immediately aware of another target that
>does that, an it seems like a rather risky thing to do.
>
>
>It looks like the cm.popretz's RTL exposes the assignment to a0 and
>there's a DCE pass that runs after insertion of the prologue/epilogue.
>So I would suggest leaving the assignment to a0 in the RTL chain and see
>if the later DCE pass after prologue generation eliminates the redundant
>assignment.  That seems a lot cleaner. 
The DCE pass  after prologue generation may not help here for the following 
reasons:
1. The use a0 insn is not deletable, and then li a0,0 that defines a0 cannot be 
deleted.
2. We need to detect pattern (clear a0, use a0 and cm.popret) before generating 
cm.popretz. 
    I don't think DCE is a good place to put this piece of code.
    And I insist the prologue and epilogue pass is a better place to do it,
    with simplicity and clear logic, as I explained earlier to Kito.
    The hook was added here safely, without any impact on other targets.

Please let me know your idea. 

Thanks. 
Fei
>
>
>
>Jeff

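For illustration, a conceptual sketch of the pattern detection Fei Gao
describes (the helper and register-number names are assumptions, not the
committed RISC-V code):

  /* Return true if the insn chain ends with a clear of a0 followed by a
     use of a0 -- the shape cm.popretz can replace.  A0_REGNO stands in
     for the backend's own macro for register a0.  */
  static bool
  zcmp_popretz_candidate_p (void)
  {
    rtx_insn *use = get_last_insn ();
    while (use && !INSN_P (use))
      use = PREV_INSN (use);
    if (!use
        || GET_CODE (PATTERN (use)) != USE
        || !REG_P (XEXP (PATTERN (use), 0))
        || REGNO (XEXP (PATTERN (use), 0)) != A0_REGNO)
      return false;

    rtx_insn *clear = PREV_INSN (use);
    while (clear && !INSN_P (clear))
      clear = PREV_INSN (clear);
    rtx set = clear ? single_set (clear) : NULL_RTX;
    return set
           && REG_P (SET_DEST (set))
           && REGNO (SET_DEST (set)) == A0_REGNO
           && SET_SRC (set) == const0_rtx;
  }
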
Re: [PATCH v2] MIPS: Output $0 for conditional trap if !ISA_HAS_COND_TRAPI

2024-06-28 Thread Maciej W. Rozycki
On Fri, 28 Jun 2024, YunQiang Su wrote:

> > > >  Overall ISTM there is no need for distinct insns for ISA_HAS_COND_TRAPI
> > > > and !ISA_HAS_COND_TRAPI cases each and this would better be sorted with
> > > > predicates and constraints, especially as the output pattern is the same
> > > > in both cases anyway.  This would prevent special-casing from being 
> > > > needed
> > > > in `mips_expand_conditional_trap' as well.
> > > >
> > >
> > > I agree. The patch should be quite simple
> > >
> > >[(trap_if (match_operator:GPR 0 "trap_comparison_operator"
> > > [(match_operand:GPR 1 "reg_or_0_operand" "dJ")
> > >  (match_operand:GPR 2 "arith_operand" "dI")])
> > > (const_int 0))]
> > >"ISA_HAS_COND_TRAPI"
> > > -  "t%C0\t%z1,%2"
> > > +  "t%C0\t%z1,%z2"
> > >[(set_attr "type" "trap")])
> >
> >  Nope, this is wrong.
> >
> 
> > in both cases anyway.  This would prevent special-casing from being needed
> > in `mips_expand_conditional_trap' as well.
> 
> We cannot make  `mips_expand_conditional_trap' simpler at this point.

 This is simply not true.  However as the platform maintainer you are the 
expert in this area, so I am leaving it up to you to figure out.  If you 
want, that is, of course.  All the necessary details are in the paragraph 
I've left quoted at the top.

 NB given that this is a fix for an easily reproducible bug, there should 
have been a test case committed along with it.

  Maciej


[PATCH v2 8/8] libgomp: Map omp_default_mem_space to USM

2024-06-28 Thread Andrew Stubbs
When unified shared memory is required, the default memory space should also be
unified.

libgomp/ChangeLog:

* config/linux/allocator.c (linux_memspace_alloc): Check
omp_requires_mask.
(linux_memspace_calloc): Likewise.
(linux_memspace_free): Likewise.
(linux_memspace_realloc): Likewise.
* libgomp.h (omp_requires_mask): New extern.
* target.c (omp_requires_mask): Remove static.
* testsuite/libgomp.c-c++-common/target-implicit-map-4.c: Add
NO_USM_STACK conditional code.
---
 libgomp/config/linux/allocator.c | 16 
 libgomp/libgomp.h|  1 +
 libgomp/target.c |  2 +-
 .../libgomp.c-c++-common/target-implicit-map-4.c | 16 
 4 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
index 81d2877b8f1..a026f49be16 100644
--- a/libgomp/config/linux/allocator.c
+++ b/libgomp/config/linux/allocator.c
@@ -101,7 +101,9 @@ linux_memspace_alloc (omp_memspace_handle_t memspace, 
size_t size, int pin,
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
-  if (memspace == ompx_gnu_unified_shared_mem_space)
+  if (memspace == ompx_gnu_unified_shared_mem_space
+  || (memspace == omp_default_mem_space
+ && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)))
 addr = gomp_usm_alloc (size);
   else if (pin)
 {
@@ -194,7 +196,9 @@ linux_memspace_calloc (omp_memspace_handle_t memspace, 
size_t size, int pin)
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
-  if (memspace == ompx_gnu_unified_shared_mem_space)
+  if (memspace == ompx_gnu_unified_shared_mem_space
+  || (memspace == omp_default_mem_space
+ && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)))
 {
   void *ret = gomp_usm_alloc (size);
   memset (ret, 0, size);
@@ -216,7 +220,9 @@ linux_memspace_free (omp_memspace_handle_t memspace, void 
*addr, size_t size,
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
-  if (memspace == ompx_gnu_unified_shared_mem_space)
+  if (memspace == ompx_gnu_unified_shared_mem_space
+  || (memspace == omp_default_mem_space
+ && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)))
 gomp_usm_free (addr);
   else if (pin)
 {
@@ -244,7 +250,9 @@ linux_memspace_realloc (omp_memspace_handle_t memspace, 
void *addr,
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
-  if (memspace == ompx_gnu_unified_shared_mem_space)
+  if (memspace == ompx_gnu_unified_shared_mem_space
+  || (memspace == omp_default_mem_space
+ && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)))
 /* Realloc is not implemented for USM.  */
 ;
   else if (oldpin && pin)
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 707fcdb39d7..4c5c89c8454 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1123,6 +1123,7 @@ extern int gomp_pause_host (void);
 
 /* target.c */
 
+extern int omp_requires_mask;
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
diff --git a/libgomp/target.c b/libgomp/target.c
index f0ee2c84197..455cac917c9 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -107,7 +107,7 @@ static int num_devices;
 static int num_devices_openmp;
 
 /* OpenMP requires mask.  */
-static int omp_requires_mask;
+int omp_requires_mask;
 
 /* Similar to gomp_realloc, but release register_lock before gomp_fatal.  */
 
diff --git a/libgomp/testsuite/libgomp.c-c++-common/target-implicit-map-4.c 
b/libgomp/testsuite/libgomp.c-c++-common/target-implicit-map-4.c
index 2766312292b..de865352e9b 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/target-implicit-map-4.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/target-implicit-map-4.c
@@ -6,6 +6,9 @@
 
 /* { dg-skip-if "Not all devices allow USM" { offload_device_gcn && { ! 
omp_usm } } } */
 
+/* { dg-additional-options "-DNO_USM_STACK" { target offload_target_nvptx } } 
*/
+/* { dg-additional-options "-DNO_USM_STACK" { target offload_target_amdgcn } } 
*/
+
 #pragma omp requires unified_shared_memory
 
 /* Ensure that defaultmap(default : pointer) uses correct OpenMP 5.2
@@ -27,10 +30,23 @@ test_device (int dev)
   intptr_t ip = (intptr_t) p2;
   intptr_t ipa = (intptr_t) p2a;
 
+#if NO_USM_STACK
+  int A_init[3] = {1,2,3};
+  int B_init[5] = {4,5,6,7,8};
+  int *A = (int*) malloc (sizeof (A_init));
+  int *B = (int*) malloc (sizeof (B_init));
+  int *p3 = &A[0];
+  int *p3a = &B[0];
+
+  /* Not all USM supports stack variables.  */
+  __builtin_memcpy (A, A_init, sizeof (A_init));
+  __builtin_memcpy (B, B_init, sizeof (B_init));
+#else
   int A[3] = {1,2,3};
   int B[5] = {4,5,6,7,8};
   int *p3 = &A[0];
   int *p3a = &B[0];
+#endif
 
   
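To illustrate the behaviour this patch enables (a hypothetical usage
example, not a test from the series): with unified shared memory required,
an allocation from the default memory space is directly usable inside a
target region.

  #include <omp.h>

  #pragma omp requires unified_shared_memory

  int
  main (void)
  {
    int *p = (int *) omp_alloc (sizeof (int), omp_default_mem_alloc);
    *p = 41;
  #pragma omp target   /* p now points into unified memory.  */
    *p += 1;
    int ok = (*p == 42);
    omp_free (p, omp_default_mem_alloc);
    return ok ? 0 : 1;
  }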

[PATCH v2 6/8] amdgcn: libgomp plugin USM implementation

2024-06-28 Thread Andrew Stubbs
From: Andrew Stubbs 

Implement the Unified Shared Memory API calls in the GCN plugin.

The AMD equivalent of "Managed Memory" means registering previously allocated
host memory as "coarse-grained" (whereas allocating coarse-grained memory via
hsa_allocate_memory allocates device-side memory, initially).  It's possible to
do this to ordinary host heap memory (i.e. from "malloc"), but a) this caused
mysterious crashes inside the HSA runtime (presumably an unfortunate
page-sharing situation), and b) it's unlikely that the malloc/free
implementation is optimized for avoiding page migrations (in general).

This implementation reuses the "usmpin" allocator (introduced in my previous
patch-set to optimize pinned memory allocation) to solve these issues.
Firstly, all USM memory is allocated from specially memmap'd pages to ensure
that as few pages as possible get migrated.  Secondly, the free chain is stored
in a side-table so that we can be sure that walking the chain doesn't migrate
all the pages back to the host, for no reason.

The HSA header file updates included here were relicensed by AMD and sent to me
explicitly to enable this project. AMD retain the copyright (Q4 2022), as they
do for the headers already in-tree.  This is *not* just a random copy from the
other project with the incompatible license.  (The small change made recently
by Tobias has not been erased, however.)

include/ChangeLog:

* hsa.h: Import a new version from AMD.
* hsa_ext_amd.h: Likewise.
* hsa_ext_image.h: Likewise.

libgomp/ChangeLog:

* Makefile.in: Regenerate.
* config/gcn/allocator.c (gcn_memspace_alloc): Disallow
ompx_gnu_host_mem_space.
(gcn_memspace_calloc): Likewise.
(gcn_memspace_free): Likewise.
(gcn_memspace_realloc): Likewise.
* plugin/Makefrag.am
(libgomp_plugin_gcn_la_SOURCES): Add usmpin-allocator.c.
* plugin/plugin-gcn.c: Include libgomp.h, sys/mman.h, and unistd.h.
(struct hsa_runtime_fn_info): Add hsa_amd_svm_attributes_set_fn.
(dump_hsa_system_info): Dump HSA_AMD_SYSTEM_INFO_SVM_SUPPORTED and
HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT data.
(init_hsa_runtime_functions): Load hsa_amd_svm_attributes_set.
(usm_ctx): New variable.
(usm_heap_pages): New.
(usm_heap_create): New function.
(GOMP_OFFLOAD_get_num_devices): Update comment only.
(GOMP_OFFLOAD_usm_alloc): New function.
(GOMP_OFFLOAD_usm_free): New function.
(GOMP_OFFLOAD_is_usm_ptr): New function.
* testsuite/lib/libgomp.exp (check_effective_target_omp_usm): Add
amdgcn test.
* testsuite/libgomp.c++/usm-1.C: Switch to omp_usm effective target.
* testsuite/libgomp.c-c++-common/requires-1.c: Require omp_usm.
* testsuite/libgomp.c-c++-common/requires-4.c: Skip AMD devices that
don't support USM.
* testsuite/libgomp.c-c++-common/requires-4a.c: Likewise.
* testsuite/libgomp.c-c++-common/requires-5.c: Likewise.
* testsuite/libgomp.c-c++-common/target-implicit-map-4.c: Likewise.
* testsuite/libgomp.c/usm-1.c: Set amdgcn options.
* testsuite/libgomp.c/usm-2.c: Likewise.
* testsuite/libgomp.c/usm-3.c: Likewise.
* testsuite/libgomp.c/usm-4.c: Likewise.
* testsuite/libgomp.c/usm-5.c: Clarify host-fallback behaviour.
* testsuite/libgomp.c/usm-6.c: Require omp_usm.
* usmpin-allocator.c (gomp_fatal): Define.
* usm-allocator.c: New file.
---
 include/hsa.h |  28 +-
 include/hsa_ext_amd.h | 459 +-
 include/hsa_ext_image.h   |   2 +-
 libgomp/Makefile.in   |  13 +-
 libgomp/config/gcn/allocator.c|  10 +
 libgomp/plugin/Makefrag.am|   2 +-
 libgomp/plugin/plugin-gcn.c   | 169 ++-
 libgomp/testsuite/lib/libgomp.exp |  12 +
 libgomp/testsuite/libgomp.c++/usm-1.C |   2 +-
 .../libgomp.c-c++-common/requires-1.c |   1 +
 .../libgomp.c-c++-common/requires-4.c |   5 +-
 .../libgomp.c-c++-common/requires-4a.c|   2 +
 .../libgomp.c-c++-common/requires-5.c |   2 +
 .../target-implicit-map-4.c   |   2 +
 libgomp/testsuite/libgomp.c/usm-1.c   |   1 +
 libgomp/testsuite/libgomp.c/usm-2.c   |   1 +
 libgomp/testsuite/libgomp.c/usm-3.c   |   1 +
 libgomp/testsuite/libgomp.c/usm-4.c   |   1 +
 libgomp/testsuite/libgomp.c/usm-5.c   |   2 +
 libgomp/testsuite/libgomp.c/usm-6.c   |   2 +-
 libgomp/usm-allocator.c   | 232 +
 libgomp/usmpin-allocator.c|   3 +
 22 files changed, 922 insertions(+), 30 deletions(-)
 mode change 100644 => 100755 include/hsa.h
 mode change 100644 => 100755 include/hsa_ext_amd.h
 mode change 100644 => 100755 

[PATCH v2 7/8] openmp, libgomp: Handle unified shared memory in omp_target_is_accessible

2024-06-28 Thread Andrew Stubbs
From: Marcel Vollweiler 

This patch handles Unified Shared Memory (USM) in the OpenMP runtime routine
omp_target_is_accessible.

libgomp/ChangeLog:

* target.c (omp_target_is_accessible): Handle unified shared memory.
* testsuite/libgomp.c-c++-common/target-is-accessible-1.c: Updated.
* testsuite/libgomp.fortran/target-is-accessible-1.f90: Updated.
* testsuite/libgomp.c-c++-common/target-is-accessible-2.c: New test.
* testsuite/libgomp.fortran/target-is-accessible-2.f90: New test.
---
 libgomp/target.c  |  8 +--
 .../target-is-accessible-1.c  | 22 +--
 .../target-is-accessible-2.c  | 21 ++
 .../target-is-accessible-1.f90| 20 +++--
 .../target-is-accessible-2.f90| 22 +++
 5 files changed, 77 insertions(+), 16 deletions(-)
 create mode 100644 
libgomp/testsuite/libgomp.c-c++-common/target-is-accessible-2.c
 create mode 100644 libgomp/testsuite/libgomp.fortran/target-is-accessible-2.f90

diff --git a/libgomp/target.c b/libgomp/target.c
index 754dea4e031..f0ee2c84197 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -5281,9 +5281,13 @@ omp_target_is_accessible (const void *ptr, size_t size, 
int device_num)
   if (devicep == NULL)
 return false;
 
-  /* TODO: Unified shared memory must be handled when available.  */
+  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+return true;
 
-  return devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM;
+  if (devicep->is_usm_ptr_func && devicep->is_usm_ptr_func ((void *) ptr))
+return true;
+
+  return false;
 }
 
 int
diff --git a/libgomp/testsuite/libgomp.c-c++-common/target-is-accessible-1.c 
b/libgomp/testsuite/libgomp.c-c++-common/target-is-accessible-1.c
index 2e75c6300ae..e7f9cf27a42 100644
--- a/libgomp/testsuite/libgomp.c-c++-common/target-is-accessible-1.c
+++ b/libgomp/testsuite/libgomp.c-c++-common/target-is-accessible-1.c
@@ -1,3 +1,5 @@
+/* { dg-do run } */
+
 #include 
 
 int
@@ -6,7 +8,8 @@ main ()
   int d = omp_get_default_device ();
   int id = omp_get_initial_device ();
   int n = omp_get_num_devices ();
-  void *p;
+  int i = 42;
+  void *p = &i;
 
   if (d < 0 || d >= n)
 d = id;
@@ -26,23 +29,28 @@ main ()
   if (omp_target_is_accessible (p, sizeof (int), n + 1))
 __builtin_abort ();
 
-  /* Currently, a host pointer is accessible if the device supports shared
- memory or omp_target_is_accessible is executed on the host. This
- test case must be adapted when unified shared memory is avialable.  */
   int a[128];
   for (int d = 0; d <= omp_get_num_devices (); d++)
 {
+  /* SHARED_MEM is 1 if and only if host and device share the same memory.
+OMP_TARGET_IS_ACCESSIBLE should not return 0 for shared memory.  */
   int shared_mem = 0;
   #pragma omp target map (alloc: shared_mem) device (d)
shared_mem = 1;
-  if (omp_target_is_accessible (p, sizeof (int), d) != shared_mem)
+
+  if (shared_mem && !omp_target_is_accessible (p, sizeof (int), d))
+   __builtin_abort ();
+
+  /* USM is disabled by default.  Hence OMP_TARGET_IS_ACCESSIBLE should
+return 0 if shared_mem is false.  */
+  if (!shared_mem && omp_target_is_accessible (p, sizeof (int), d))
__builtin_abort ();
 
-  if (omp_target_is_accessible (a, 128 * sizeof (int), d) != shared_mem)
+  if (shared_mem && !omp_target_is_accessible (a, 128 * sizeof (int), d))
__builtin_abort ();
 
   for (int i = 0; i < 128; i++)
-   if (omp_target_is_accessible (&a[i], sizeof (int), d) != shared_mem)
+   if (shared_mem && !omp_target_is_accessible (&a[i], sizeof (int), d))
  __builtin_abort ();
 }
 
diff --git a/libgomp/testsuite/libgomp.c-c++-common/target-is-accessible-2.c 
b/libgomp/testsuite/libgomp.c-c++-common/target-is-accessible-2.c
new file mode 100644
index 000..24c77232f5d
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/target-is-accessible-2.c
@@ -0,0 +1,21 @@
+/* { dg-do run } */
+/* { dg-require-effective-target omp_usm } */
+
+#include 
+
+#pragma omp requires unified_shared_memory
+
+int
+main ()
+{
+  int *a = (int *) omp_alloc (sizeof (int), ompx_gnu_unified_shared_mem_alloc);
+  if (!a)
+__builtin_abort ();
+
+  for (int d = 0; d <= omp_get_num_devices (); d++)
+if (!omp_target_is_accessible (a, sizeof (int), d))
+  __builtin_abort ();
+
+  omp_free(a, ompx_gnu_unified_shared_mem_alloc);
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.fortran/target-is-accessible-1.f90 
b/libgomp/testsuite/libgomp.fortran/target-is-accessible-1.f90
index 150df6f8a4f..0df43aae095 100644
--- a/libgomp/testsuite/libgomp.fortran/target-is-accessible-1.f90
+++ b/libgomp/testsuite/libgomp.fortran/target-is-accessible-1.f90
@@ -1,3 +1,5 @@
+! { dg-do run }
+
 program main
   use omp_lib
   use iso_c_binding
@@ -28,24 +30,28 @@ program main
   

[PATCH v2 5/8] amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK

2024-06-28 Thread Andrew Stubbs
From: Andrew Stubbs 

The AMD GCN runtime must be set to the correct mode for Unified Shared Memory
to work, but this is not always clear at compile and link time due to the split
nature of the offload compilation pipeline.

This patch sets a new attribute on OpenMP offload functions to ensure that the
information is passed all the way to the backend.  The backend then places a
marker in the assembler code for mkoffload to find. Finally mkoffload places a
constructor function into the final program to ensure that the HSA_XNACK
environment variable passes the correct mode to the GPU.

The HSA_XNACK variable must be set before the HSA runtime is even loaded, so
it makes more sense to have this set within the constructor than at some point
later within libgomp or the GCN plugin.

Other toolchains require the end-user to set HSA_XNACK manually (or else wonder
why it's not working), so the constructor also checks that any existing manual
setting is compatible with the binary's requirements.
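
As an illustration, the generated constructor could look roughly like this
(a hand-written sketch, not the code mkoffload actually emits; the name
configure_xnack is taken from the ChangeLog below, everything else is
assumed):

/* Sketch of an mkoffload-style constructor for a USM binary.  */
#include <stdio.h>
#include <stdlib.h>

static __attribute__((constructor)) void
configure_xnack (void)
{
  const char *val = getenv ("HSA_XNACK");
  if (val == NULL || val[0] == '\0')
    /* Select the mode the binary was built for (USM implies XNACK on).  */
    setenv ("HSA_XNACK", "1", 1);
  else if (!(val[0] == '1' && val[1] == '\0'))
    {
      /* An existing manual setting conflicts with this binary.  */
      fprintf (stderr, "HSA_XNACK=%s is incompatible with this program\n",
	       val);
      exit (1);
    }
}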

gcc/ChangeLog:

* config/gcn/gcn.cc (unified_shared_memory_enabled): New variable.
(gcn_init_cumulative_args): Handle attribute "omp unified memory".
(gcn_hsa_declare_function_name): Emit "MKOFFLOAD OPTIONS: USM+".
* config/gcn/mkoffload.cc (TEST_XNACK_OFF): New macro.
(process_asm): Detect "MKOFFLOAD OPTIONS: USM+".
Emit configure_xnack constructor, as required.
* omp-low.cc (create_omp_child_function): Add attribute "omp unified
memory".
---
 gcc/config/gcn/gcn.cc   | 32 +++-
 gcc/config/gcn/mkoffload.cc | 35 ++-
 gcc/omp-low.cc  |  4 
 3 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index d6531f55190..6a83ff2a1b4 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -70,6 +70,11 @@ static bool ext_gcn_constants_init = 0;
 
 enum gcn_isa gcn_isa = ISA_GCN3;   /* Default to GCN3.  */
 
+/* Record whether the host compiler added "omp unified memory" attributes to
+   any functions.  We can then pass this on to mkoffload to ensure xnack is
+   compatible there too.  */
+static bool unified_shared_memory_enabled = false;
+
 /* Reserve this much space for LDS (for propagating variables from
worker-single mode to worker-partitioned mode), per workgroup.  Global
analysis could calculate an exact bound, but we don't do that yet.
@@ -2942,6 +2947,29 @@ gcn_init_cumulative_args (CUMULATIVE_ARGS *cum /* Argument info to init */ ,
   if (!caller && cfun->machine->normal_function)
 gcn_detect_incoming_pointer_arg (fndecl);
 
+  if (fndecl && lookup_attribute ("omp unified memory",
+ DECL_ATTRIBUTES (fndecl)))
+{
+  unified_shared_memory_enabled = true;
+
+  switch (gcn_arch)
+   {
+   case PROCESSOR_FIJI:
+   case PROCESSOR_VEGA10:
+   case PROCESSOR_VEGA20:
+   case PROCESSOR_GFX908:
+   case PROCESSOR_GFX1030:
+   case PROCESSOR_GFX1036:
+   case PROCESSOR_GFX1100:
+   case PROCESSOR_GFX1103:
+ error ("GPU architecture does not support Unified Shared Memory");
+ break;
+   default:
+ if (flag_xnack == HSACO_ATTR_OFF)
+   error ("Unified Shared Memory is enabled, but XNACK is disabled");
+   }
+}
+
   reinit_regs ();
 }
 
@@ -6820,12 +6848,14 @@ gcn_hsa_declare_function_name (FILE *file, const char *name,
   fputs (",@function\n", file);
   ASM_OUTPUT_FUNCTION_LABEL (file, name, decl);
 
-  /* This comment is read by mkoffload.  */
+  /* These comments are read by mkoffload.  */
   if (flag_openacc)
 fprintf (file, "\t;; OPENACC-DIMS: %d, %d, %d : %s\n",
 oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_GANG),
 oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_WORKER),
 oacc_get_fn_dim_size (cfun->decl, GOMP_DIM_VECTOR), name);
+  if (unified_shared_memory_enabled)
+fprintf (asm_out_file, "\t;; MKOFFLOAD OPTIONS: USM+\n");
 }
 
 /* Implement TARGET_ASM_SELECT_SECTION.
diff --git a/gcc/config/gcn/mkoffload.cc b/gcc/config/gcn/mkoffload.cc
index 810298a799b..3dcb6943c45 100644
--- a/gcc/config/gcn/mkoffload.cc
+++ b/gcc/config/gcn/mkoffload.cc
@@ -487,6 +487,7 @@ process_asm (FILE *in, FILE *out, FILE *cfile)
 {
   int fn_count = 0, var_count = 0, ind_fn_count = 0;
   int dims_count = 0, regcount_count = 0;
+  bool unified_shared_memory_enabled = false;
   struct obstack fns_os, dims_os, regcounts_os;
   obstack_init (&fns_os);
   obstack_init (&dims_os);
@@ -511,6 +512,7 @@ process_asm (FILE *in, FILE *out, FILE *cfile)
   fn_count += 2;
 
   char buf[1000];
+  char dummy;
   enum
 { IN_CODE,
   IN_METADATA,
@@ -531,6 +533,9 @@ process_asm (FILE *in, FILE *out, FILE *cfile)
dims_count++;
  }
 
+   if (sscanf (buf, " ;; MKOFFLOAD OPTIONS: USM+%c", &dummy) > 0)
+ unified_shared_memory_enabled = true;
+

[PATCH v2 4/8] openmp: Use libgomp memory allocation functions with unified shared memory.

2024-06-28 Thread Andrew Stubbs
From: Hafiz Abid Qadeer 

This patch changes calls to malloc/free/calloc/realloc and operator new to
memory allocation functions in libgomp with
allocator=ompx_gnu_unified_shared_mem_alloc.  This helps existing code to
benefit from unified shared memory, and is necessary to implement "requires
unified_shared_memory" using managed memory.  libgomp does the correct
thing with all the mapping constructs and there are no memory copies if the
pointer is pointing to unified shared memory.

We only replace the standard new operator and not the class member or
placement new.
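
Conceptually, the pass turns user code like the first snippet below into the
second (illustrative only; the actual rewrite happens on GIMPLE and goes
through libgomp's allocator machinery):

/* Original user code: */
int *p = (int *) malloc (n * sizeof (int));
/* ... use p on host and in target regions ... */
free (p);

/* Behaves as if the user had written: */
int *p = (int *) omp_alloc (n * sizeof (int),
			    ompx_gnu_unified_shared_mem_alloc);
/* ... */
omp_free (p, ompx_gnu_unified_shared_mem_alloc);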

gcc/ChangeLog:

* omp-low.cc (usm_transform): New function.
(pass_data_usm_transform): New.
(class pass_usm_transform): New.
(make_pass_usm_transform): New function.
* passes.def: Add pass_usm_transform pass.
* tree-pass.h (make_pass_usm_transform): New prototype.

libgomp/ChangeLog:

* testsuite/libgomp.c-c++-common/requires-4.c: Add xfail.
* testsuite/libgomp.c++/usm-1.C: New test.
* testsuite/libgomp.c++/usm-2.C: New test.
* testsuite/libgomp.c/usm-6.c: New test.
* testsuite/libgomp.fortran/usm-2.f90: New test.

gcc/testsuite/ChangeLog:

* c-c++-common/gomp/usm-2.c: New test.
* c-c++-common/gomp/usm-3.c: New test.
* g++.dg/gomp/usm-1.C: New test.
* g++.dg/gomp/usm-2.C: New test.
* g++.dg/gomp/usm-3.C: New test.
* g++.dg/gomp/usm-4.C: New test.
* g++.dg/gomp/usm-5.C: New test.
* gfortran.dg/gomp/usm-2.f90: New test.
* gfortran.dg/gomp/usm-3.f90: New test.

co-authored-by: Andrew Stubbs 
---
 gcc/omp-low.cc| 184 ++
 gcc/passes.def|   1 +
 gcc/testsuite/c-c++-common/gomp/usm-2.c   |  46 +
 gcc/testsuite/c-c++-common/gomp/usm-3.c   |  44 +
 gcc/testsuite/g++.dg/gomp/usm-1.C |  32 +++
 gcc/testsuite/g++.dg/gomp/usm-2.C |  30 +++
 gcc/testsuite/g++.dg/gomp/usm-3.C |  38 
 gcc/testsuite/g++.dg/gomp/usm-4.C |  32 +++
 gcc/testsuite/g++.dg/gomp/usm-5.C |  30 +++
 gcc/testsuite/gfortran.dg/gomp/usm-2.f90  |  16 ++
 gcc/testsuite/gfortran.dg/gomp/usm-3.f90  |  13 ++
 gcc/tree-pass.h   |   1 +
 libgomp/testsuite/libgomp.c++/usm-1.C |  54 +
 libgomp/testsuite/libgomp.c++/usm-2.C |  33 
 .../libgomp.c-c++-common/requires-4.c |   2 +
 libgomp/testsuite/libgomp.c/usm-6.c   |  94 +
 libgomp/testsuite/libgomp.fortran/usm-2.f90   |  33 
 17 files changed, 683 insertions(+)
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-2.c
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-3.c
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-1.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-2.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-3.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-4.C
 create mode 100644 gcc/testsuite/g++.dg/gomp/usm-5.C
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-2.f90
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-3.f90
 create mode 100644 libgomp/testsuite/libgomp.c++/usm-1.C
 create mode 100644 libgomp/testsuite/libgomp.c++/usm-2.C
 create mode 100644 libgomp/testsuite/libgomp.c/usm-6.c
 create mode 100644 libgomp/testsuite/libgomp.fortran/usm-2.f90

diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index cf3f57748d8..d3f9ccc4567 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -15075,6 +15075,190 @@ make_pass_diagnose_omp_blocks (gcc::context *ctxt)
 {
   return new pass_diagnose_omp_blocks (ctxt);
 }
+
+/* Provide transformation required for using unified shared memory
+   by replacing calls to standard memory allocation functions with
+   function provided by the libgomp.  */
+
+static tree
+usm_transform (gimple_stmt_iterator *gsi_p, bool *,
+  struct walk_stmt_info *wi)
+{
+  gimple *stmt = gsi_stmt (*gsi_p);
+  /* ompx_gnu_unified_shared_mem_alloc is 201.
+ This must match the definition in libgomp/omp.h.in.  */
+  const unsigned int unified_shared_mem_alloc = 201;
+
+  switch (gimple_code (stmt))
+{
+case GIMPLE_CALL:
+  {
+   gcall *gs = as_a <gcall *> (stmt);
+   tree fndecl = gimple_call_fndecl (gs);
+   unsigned int args = gimple_call_num_args (gs);
+   if (fndecl)
+ {
+   tree allocator = build_int_cst (pointer_sized_int_node,
+   unified_shared_mem_alloc);
+   const char *name = IDENTIFIER_POINTER (DECL_NAME (fndecl));
+   if ((strcmp (name, "malloc") == 0)
+|| (fndecl_built_in_p (fndecl, BUILT_IN_NORMAL)
+&& DECL_FUNCTION_CODE (fndecl) == BUILT_IN_MALLOC)
+|| (DECL_IS_REPLACEABLE_OPERATOR_NEW_P (fndecl)
+&& args == 1)
+|| strcmp (name, "omp_target_alloc") == 0)
+ {
+  

[PATCH v2 2/8] openmp, nvptx: ompx_gnu_unified_shared_mem_alloc

2024-06-28 Thread Andrew Stubbs
From: Andrew Stubbs 

This adds support for using Cuda Managed Memory with omp_alloc.  It will be
used as the underpinnings for "requires unified_shared_memory" in a later
patch.

There are two new predefined allocators, ompx_gnu_unified_shared_mem_alloc and
ompx_gnu_host_mem_alloc, plus corresponding memory spaces, which can be used to
allocate memory in the "managed" space and explicitly on the host (it is
intended that "malloc" will be intercepted by the compiler).

The nvptx plugin is modified to make the necessary Cuda calls, and libgomp
is modified to switch to shared-memory mode for USM allocated mappings.
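
For example, a program can opt a single buffer into managed memory without
converting anything else (a minimal sketch; per the cover letter, an explicit
map clause on such memory is detected by the runtime and becomes a no-op):

#include <omp.h>

#define N 1024

int
main (void)
{
  int *p = (int *) omp_alloc (N * sizeof (int),
			      ompx_gnu_unified_shared_mem_alloc);

  #pragma omp target map(tofrom: p[0:N])  /* no-op mapping for USM */
  for (int i = 0; i < N; i++)
    p[i] = i;

  int ok = (p[N - 1] == N - 1);
  omp_free (p, ompx_gnu_unified_shared_mem_alloc);
  return !ok;
}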

gcc/fortran/ChangeLog:

* openmp.cc (is_predefined_allocator): Recognise new allocators.

include/ChangeLog:

* cuda/cuda.h (CUdevice_attribute): Add definitions for
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR and
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR.
(CUmemAttach_flags): New.
(CUpointer_attribute): New.
(cuMemAllocManaged): New prototype.
(cuPointerGetAttribute): New prototype.

libgomp/ChangeLog:

* allocator.c (ompx_gnu_max_predefined_alloc): Update.
(predefined_ompx_gnu_alloc_mapping): Add
ompx_gnu_unified_shared_mem_space and ompx_gnu_host_mem_space.
(omp_init_allocator): Recognise ompx_gnu_pinned_mem_alloc and
ompx_gnu_host_mem_space.
* config/linux/allocator.c (linux_memspace_alloc): Support USM.
(linux_memspace_calloc): Likewise.
(linux_memspace_free): Likewise.
(linux_memspace_realloc): Likewise.
* config/nvptx/allocator.c (nvptx_memspace_alloc): Disallow host
memory.
(nvptx_memspace_calloc): Likewise.
(nvptx_memspace_free): Likewise.
(nvptx_memspace_realloc): Likewise.
* libgomp-plugin.h (GOMP_OFFLOAD_usm_alloc): New prototype.
(GOMP_OFFLOAD_usm_free): New prototype.
(GOMP_OFFLOAD_is_usm_ptr): New prototype.
* libgomp.h (gomp_usm_alloc): New prototype.
(gomp_usm_free): New prototype.
(OFFSET_USM): New define.
(struct gomp_device_descr): Add USM functions.
* omp.h.in (omp_memspace_handle_t): Add
ompx_gnu_unified_shared_mem_space and ompx_gnu_host_mem_space.
(omp_allocator_handle_t): Add ompx_gnu_unified_shared_mem_alloc and
ompx_gnu_host_mem_alloc.
* omp_lib.f90.in: Likewise.
* omp_lib.h.in: Likewise.
* plugin/cuda-lib.def (cuMemAllocManaged): Add new call.
(cuPointerGetAttribute): Likewise.
* plugin/plugin-nvptx.c (nvptx_alloc): Add "usm" parameter.
Call cuMemAllocManaged as appropriate.
(GOMP_OFFLOAD_get_num_devices): Allow
GOMP_REQUIRES_UNIFIED_SHARED_MEMORY if the device supports managed
memory or integrated memory.
(GOMP_OFFLOAD_alloc): Move internals to ...
(GOMP_OFFLOAD_alloc_1): ... this, and add usm parameter.
(GOMP_OFFLOAD_usm_alloc): New function.
(GOMP_OFFLOAD_usm_free): New function.
(GOMP_OFFLOAD_is_usm_ptr): New function.
* target.c (gomp_map_pointer): Add USM support.
(gomp_attach_pointer): Likewise.
(gomp_map_val): Likewise.
(gomp_map_vars_internal): Likewise.
(gomp_usm_alloc): New function.
(gomp_usm_free): New function.
(gomp_load_plugin_for_device): Add usm_alloc, usm_free, and is_usm_ptr.
* testsuite/lib/libgomp.exp (check_effective_target_omp_usm): New.
* testsuite/libgomp.c/alloc-ompx_gnu_host_mem_alloc-1.c: New test.
* testsuite/libgomp.c/usm-1.c: New test.
* testsuite/libgomp.c/usm-2.c: New test.
* testsuite/libgomp.c/usm-3.c: New test.
* testsuite/libgomp.c/usm-4.c: New test.
* testsuite/libgomp.c/usm-5.c: New test.
* testsuite/libgomp.fortran/usm-3.f90: New test.
* testsuite/libgomp.c-c++-common/requires-5.c: Fix static data failure.

co-authored-by: Kwok Cheung Yeung  
co-authored-by: Thomas Schwinge  
---
 gcc/fortran/openmp.cc |  8 +-
 include/cuda/cuda.h   | 13 
 libgomp/allocator.c   | 17 ++--
 libgomp/config/linux/allocator.c  | 21 -
 libgomp/config/nvptx/allocator.c  | 10 +++
 libgomp/libgomp-plugin.h  |  3 +
 libgomp/libgomp.h |  6 ++
 libgomp/omp.h.in  |  4 +
 libgomp/omp_lib.f90.in|  8 ++
 libgomp/omp_lib.h.in  | 10 +++
 libgomp/plugin/cuda-lib.def   |  2 +
 libgomp/plugin/plugin-nvptx.c | 52 +++--
 libgomp/target.c  | 77 ++-
 libgomp/testsuite/lib/libgomp.exp | 10 +++
 .../libgomp.c-c++-common/requires-5.c |  3 +-
 .../alloc-ompx_gnu_host_mem_alloc-1.c | 77 +++
 libgomp/testsuite/libgomp.c/usm-1.c  

[PATCH v2 3/8] openmp: Enable -foffload-memory=unified

2024-06-28 Thread Andrew Stubbs
From: Andrew Stubbs 

Ensure that "requires unified_shared_memory" plays nicely with the
-foffload-memory options, and that enabling the option has the same effect as
enabling USM in the code.

Also adds some testcases.
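
The intended equivalence, in sketch form (illustrative only):

/* Writing this in the source ... */
#pragma omp requires unified_shared_memory

/* ... is to behave like compiling the same file with:
     gcc -fopenmp -foffload-memory=unified file.c  */

and combining the directive with a conflicting -foffload-memory setting is
diagnosed, as the parser changes below show.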

gcc/c/ChangeLog:

* c-parser.cc (c_parser_omp_target): Add
OMP_REQUIRES_UNIFIED_SHARED_MEMORY to omp_requires_mask, if needed.
(c_parser_omp_requires): Check requires doesn't conflict with
-foffload-memory.

gcc/cp/ChangeLog:

* parser.cc (cp_parser_omp_target): Add
OMP_REQUIRES_UNIFIED_SHARED_MEMORY to omp_requires_mask, if needed.
(cp_parser_omp_requires): Check requires doesn't conflict with
-foffload-memory.

gcc/fortran/ChangeLog:

* openmp.cc (gfc_match_omp_requires): Check requires doesn't conflict
with -foffload-memory.
* parse.cc (gfc_parse_file): Check -foffload-memory option when setting
omp_requires_mask.

libgomp/ChangeLog:

* testsuite/libgomp.fortran/usm-1.f90: New test.

gcc/testsuite/ChangeLog:

* c-c++-common/gomp/usm-1.c: New test.
* gfortran.dg/gomp/usm-1.f90: New test.
---
 gcc/c/c-parser.cc   | 20 ---
 gcc/cp/parser.cc| 20 ---
 gcc/fortran/openmp.cc   |  6 +
 gcc/fortran/parse.cc|  3 ++-
 gcc/testsuite/c-c++-common/gomp/usm-1.c |  4 +++
 gcc/testsuite/gfortran.dg/gomp/usm-1.f90|  6 +
 libgomp/testsuite/libgomp.fortran/usm-1.f90 | 28 +
 7 files changed, 80 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/gomp/usm-1.c
 create mode 100644 gcc/testsuite/gfortran.dg/gomp/usm-1.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/usm-1.f90

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 00f8bf4376e..3d8c40185cd 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -24223,8 +24223,14 @@ c_parser_omp_target (c_parser *parser, enum pragma_context context, bool *if_p)
 }
 
   if (flag_openmp)
-omp_requires_mask
-  = (enum omp_requires) (omp_requires_mask | OMP_REQUIRES_TARGET_USED);
+{
+  omp_requires_mask
+   = (enum omp_requires) (omp_requires_mask | OMP_REQUIRES_TARGET_USED);
+  if (flag_offload_memory == OFFLOAD_MEMORY_UNIFIED)
+   omp_requires_mask
+ = (enum omp_requires) (omp_requires_mask
+| OMP_REQUIRES_UNIFIED_SHARED_MEMORY);
+}
 
   if (c_parser_next_token_is (parser, CPP_NAME))
 {
@@ -25871,7 +25877,15 @@ c_parser_omp_requires (c_parser *parser)
  if (!strcmp (p, "unified_address"))
this_req = OMP_REQUIRES_UNIFIED_ADDRESS;
  else if (!strcmp (p, "unified_shared_memory"))
-   this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY;
+   {
+ this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY;
+
+ if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
+ && flag_offload_memory != OFFLOAD_MEMORY_NONE)
+   error_at (cloc,
+ "% is incompatible with the "
+ "selected %<-foffload-memory%> option");
+   }
  else if (!strcmp (p, "dynamic_allocators"))
this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS;
  else if (!strcmp (p, "reverse_offload"))
diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
index 779625144db..5ad41034496 100644
--- a/gcc/cp/parser.cc
+++ b/gcc/cp/parser.cc
@@ -47290,8 +47290,14 @@ cp_parser_omp_target (cp_parser *parser, cp_token *pragma_tok,
  enum pragma_context context, bool *if_p)
 {
   if (flag_openmp)
-omp_requires_mask
-  = (enum omp_requires) (omp_requires_mask | OMP_REQUIRES_TARGET_USED);
+{
+  omp_requires_mask
+   = (enum omp_requires) (omp_requires_mask | OMP_REQUIRES_TARGET_USED);
+  if (flag_offload_memory == OFFLOAD_MEMORY_UNIFIED)
+   omp_requires_mask
+ = (enum omp_requires) (omp_requires_mask
+| OMP_REQUIRES_UNIFIED_SHARED_MEMORY);
+}
 
   if (cp_lexer_next_token_is (parser->lexer, CPP_NAME))
 {
@@ -49866,7 +49872,15 @@ cp_parser_omp_requires (cp_parser *parser, cp_token *pragma_tok)
  if (!strcmp (p, "unified_address"))
this_req = OMP_REQUIRES_UNIFIED_ADDRESS;
  else if (!strcmp (p, "unified_shared_memory"))
-   this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY;
+   {
+ this_req = OMP_REQUIRES_UNIFIED_SHARED_MEMORY;
+
+ if (flag_offload_memory != OFFLOAD_MEMORY_UNIFIED
+ && flag_offload_memory != OFFLOAD_MEMORY_NONE)
+   error_at (cloc,
+ "% is incompatible with the "
+ "selected %<-foffload-memory%> option");
+   }
  else if (!strcmp (p, "dynamic_allocators"))
this_req = OMP_REQUIRES_DYNAMIC_ALLOCATORS;
  else if (!strcmp (p, 

[PATCH v2 0/8] OpenMP: Unified Shared Memory via Managed Memory

2024-06-28 Thread Andrew Stubbs
These patches are an evolution of the USM portion of the patches previously
posted in July 2022 (yes, it's taken a while!)

https://patchwork.sourceware.org/project/gcc/list/?series=10748&state=%2A&archive=both

The pinned memory portion was already posted (and partially approved
already) and must be applied before this series (v5 version).

https://patchwork.sourceware.org/project/gcc/list/?series=35022&state=%2A&archive=both

The series implements OpenMP's "Unified Shared Memory" concept, first
for NVidia GPUs, and then for AMD GPUs.  We already have a very simple
implementation of USM that works on integrated APU devices and any other
device that supports shared memory access natively.  This new
implementation replaces that implementation in the case where using
"managed memory" is likely to be a win (the usual non-APU case).

In theory, explicit mapping of exactly the right memory with carefully
hand-optimized "to" and "from" directives is the most optimal implementation
(except possibly in the case where the data is too large for the device).
Experimentally, the "dumb" USM implementation we already have performs
quite well with modern devices and drivers.  This new managed memory
implementation appears to fall between the two, and can outperform
explicit mapping in the non-trivial cases (e.g. many small mappings, sparse
data, rectangular copies, etc.)

The trade-off for the additional performance is added complexity and
malloc/free is no longer compatible with external libraries (e.g. strdup).

To help mitigate these incompatibility issues, two new GNU extensions
are added:

1. ompx_gnu_unified_shared_mem_alloc / ompx_gnu_unified_shared_mem_space

  This new pre-defined allocator, used with omp_alloc, allows a
  programmer to explicitly allocate managed memory without converting
  the whole program to USM.  Creating explicit mappings for this memory is
  now optional, and if they do occur the runtime will detect the USM and apply
  no-op mappings.

2. ompx_gnu_host_mem_alloc / ompx_gnu_host_mem_space

  Conversely, this new pre-defined allocator allows a programmer to
  override "requires unified_shared_memory" and obtain regular host
  memory from the regular system heap.  This might be desirable when a
  large amount of memory is needed in a completely unrelated context, or
  for interacting with external libraries.
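
  For instance (a sketch; external_library_fn stands in for any third-party
  code that expects ordinary heap memory):

    /* Even under "requires unified_shared_memory", get plain host memory: */
    void *buf = omp_alloc (4096, ompx_gnu_host_mem_alloc);
    external_library_fn (buf);   /* hypothetical external consumer */
    omp_free (buf, ompx_gnu_host_mem_alloc);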

Known limitation: We can intercept dynamic heap allocations, but static
data and automatic stack variables are generally not accessible from the
device.  (Migrating stack pages used by an active thread seems like a
bad idea, in any case.)

I can approve the amdgcn patches myself, but comments are welcome.

OK for mainline?  (Once the pinned memory dependencies are committed.)

Thanks

Andrew

P.S. This series includes contributions from (at least) Thomas Schwinge,
Marcel Vollweiler, Kwok Cheung Yeung, and Abid Qadeer.

Andrew Stubbs (6):
  libgomp: Disentangle shared memory from managed
  openmp, nvptx: ompx_gnu_unified_shared_mem_alloc
  openmp: Enable -foffload-memory=unified
  amdgcn, openmp: Auto-detect USM mode and set HSA_XNACK
  amdgcn: libgomp plugin USM implementation
  libgomp: Map omp_default_mem_space to USM

Hafiz Abid Qadeer (1):
  openmp: Use libgomp memory allocation functions with unified shared
memory.

Marcel Vollweiler (1):
  openmp, libgomp: Handle unified shared memory in
omp_target_is_accessible

 gcc/c/c-parser.cc |  20 +-
 gcc/config/gcn/gcn.cc |  32 +-
 gcc/config/gcn/mkoffload.cc   |  35 +-
 gcc/cp/parser.cc  |  20 +-
 gcc/fortran/openmp.cc |  14 +-
 gcc/fortran/parse.cc  |   3 +-
 gcc/omp-low.cc| 188 +++
 gcc/passes.def|   1 +
 gcc/testsuite/c-c++-common/gomp/usm-1.c   |   4 +
 gcc/testsuite/c-c++-common/gomp/usm-2.c   |  46 ++
 gcc/testsuite/c-c++-common/gomp/usm-3.c   |  44 ++
 gcc/testsuite/g++.dg/gomp/usm-1.C |  32 ++
 gcc/testsuite/g++.dg/gomp/usm-2.C |  30 ++
 gcc/testsuite/g++.dg/gomp/usm-3.C |  38 ++
 gcc/testsuite/g++.dg/gomp/usm-4.C |  32 ++
 gcc/testsuite/g++.dg/gomp/usm-5.C |  30 ++
 gcc/testsuite/gfortran.dg/gomp/usm-1.f90  |   6 +
 gcc/testsuite/gfortran.dg/gomp/usm-2.f90  |  16 +
 gcc/testsuite/gfortran.dg/gomp/usm-3.f90  |  13 +
 gcc/tree-pass.h   |   1 +
 include/cuda/cuda.h   |  13 +
 include/hsa.h |  28 +-
 include/hsa_ext_amd.h | 459 +-
 include/hsa_ext_image.h   |   2 +-
 libgomp/Makefile.in   |  13 +-
 libgomp/allocator.c   |  17 +-
 libgomp/config/gcn/allocator.c|  10 +
 libgomp/config/linux/allocator.c  |  29 +-
 libgomp/config/nvptx/allocator.c

[PATCH v2 1/8] libgomp: Disentangle shared memory from managed

2024-06-28 Thread Andrew Stubbs
Some GPU compute systems allow the GPU to access host memory without much
prior setup, but that's not necessarily the fastest way to do it.  For shared
memory APUs this is almost certainly the correct choice, but for AMD there
is the difference between "fine-grained" and "coarse-grained" memory, and
for NVidia Cuda generally runs better if it knows the status of the memory
you access.

Therefore, for performance, we want to use "managed memory", in which the OS
drivers handle page migration on the fly, but this will require some
additional configuration steps that I will implement in later patches.  There
may be a temporary regression in USM support.

This patch disables the basic stop-gap shared memory so we can introduce
fast Unified Shared Memory using the managed memory APIs in the next patches.

If a device has integrated memory then the patch attempts to continue using
the current behaviour.  The new plugin API to achieve this is made optional
so as not to break compatibility.  It needs to be a new API because the
existing capability setting runs before the devices have been scanned and does
not allow different capabilities for different devices.
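
In sketch form, the intended call pattern in libgomp is (function pointer
names follow the prototypes below; devicep, dev_num and the surrounding
control flow are assumed):

  /* During gomp_target_init, after the devices have been scanned: */
  unsigned caps = devicep->get_caps_func ();
  if (devicep->get_dev_caps_func)
    /* Optional per-device refinement, e.g. shared memory only on APUs.  */
    caps |= devicep->get_dev_caps_func (dev_num);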

libgomp/ChangeLog:

* libgomp-plugin.h (GOMP_OFFLOAD_get_dev_caps): New prototype.
* libgomp.h (struct gomp_device_descr): Add get_dev_caps_func.
* plugin/plugin-gcn.c (GOMP_OFFLOAD_get_dev_caps): New function.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_dev_caps): New function.
* target.c (gomp_load_plugin_for_device): Load the get_dev_caps API.
(gomp_target_init): Don't assume unified shared memory is the same
as actual shared memory.  Use get_dev_caps to allow plugins to set
different capabilities for different devices.
---
 libgomp/libgomp-plugin.h  |  1 +
 libgomp/libgomp.h |  1 +
 libgomp/plugin/plugin-gcn.c   | 40 ---
 libgomp/plugin/plugin-nvptx.c | 16 ++
 libgomp/target.c  |  9 
 5 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 0d2e3f0a6ec..100dbca1633 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -128,6 +128,7 @@ extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, 
uint64_t, uint64_t,
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
+extern unsigned int GOMP_OFFLOAD_get_dev_caps (int);
 extern int GOMP_OFFLOAD_get_type (void);
 extern int GOMP_OFFLOAD_get_num_devices (unsigned int);
 extern bool GOMP_OFFLOAD_init_device (int);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index c3aabd4b7d3..f48bf7418f0 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1402,6 +1402,7 @@ struct gomp_device_descr
   /* Function handlers.  */
   __typeof (GOMP_OFFLOAD_get_name) *get_name_func;
   __typeof (GOMP_OFFLOAD_get_caps) *get_caps_func;
+  __typeof (GOMP_OFFLOAD_get_dev_caps) *get_dev_caps_func;
   __typeof (GOMP_OFFLOAD_get_type) *get_type_func;
   __typeof (GOMP_OFFLOAD_get_num_devices) *get_num_devices_func;
   __typeof (GOMP_OFFLOAD_init_device) *init_device_func;
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 3d882b5ab63..c8c588e8efa 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -3321,9 +3321,43 @@ GOMP_OFFLOAD_get_name (void)
 unsigned int
 GOMP_OFFLOAD_get_caps (void)
 {
-  /* FIXME: Enable shared memory for APU, but not discrete GPU.  */
-  return /*GOMP_OFFLOAD_CAP_SHARED_MEM |*/ GOMP_OFFLOAD_CAP_OPENMP_400
-   | GOMP_OFFLOAD_CAP_OPENACC_200;
+  return GOMP_OFFLOAD_CAP_OPENMP_400 | GOMP_OFFLOAD_CAP_OPENACC_200;
+}
+
+/* Return any capabilities that are specific to one device only.  */
+
+unsigned int
+GOMP_OFFLOAD_get_dev_caps (int n)
+{
+  /* The device agents have been enumerated, but might not have been
+ initialized, so get_agent_info won't work here.  */
+  struct agent_info *agent = &hsa_context.agents[n];
+
+  char name[64];
+  hsa_status_t status = hsa_fns.hsa_agent_get_info_fn (agent->id,
+  HSA_AGENT_INFO_NAME,
+  &name);
+  if (status != HSA_STATUS_SUCCESS)
+return 0;
+
+  gcn_isa device_isa = isa_code (name);
+  unsigned int caps = 0;
+
+  /* APU devices might have shared memory.
+ Don't add devices to this check if they support shared memory
+ via XNACK and page migration!  */
+  if (device_isa == EF_AMDGPU_MACH_AMDGCN_GFX1036 /* Expect "yes".  */
+  || device_isa == EF_AMDGPU_MACH_AMDGCN_GFX1103 /* Observed "no".  */)
+{
+  bool b;
+  hsa_system_info_t type = HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT;
+  status = hsa_fns.hsa_system_get_info_fn (type, &b);
+  if (status == HSA_STATUS_SUCCESS
+ && b)
+   caps |= GOMP_OFFLOAD_CAP_SHARED_MEM;
+}
+
+  

Re: [PATCH] Hard register asm constraint

2024-06-28 Thread Georg-Johann Lay

On 27.06.24 at 10:51, Stefan Schulze Frielinghaus wrote:

On Thu, Jun 27, 2024 at 09:45:32AM +0200, Georg-Johann Lay wrote:

On 25.06.24 at 16:03, Paul Koning wrote:

On Jun 24, 2024, at 1:50 AM, Stefan Schulze Frielinghaus wrote:
On Mon, Jun 10, 2024 at 07:19:19AM +0200, Stefan Schulze Frielinghaus wrote:

On Fri, May 24, 2024 at 11:13:12AM +0200, Stefan Schulze Frielinghaus wrote:

This implements hard register constraints for inline asm.  A hard register
constraint is of the form {regname} where regname is any valid register.  This
basically renders register asm superfluous.  For example, the snippet

int test (int x, int y)
{
   register int r4 asm ("r4") = x;
   register int r5 asm ("r5") = y;
   unsigned int copy = y;
   asm ("foo %0,%1,%2" : "+d" (r4) : "d" (r5), "d" (copy));
   return r4;
}

could be rewritten into

int test (int x, int y)
{
   asm ("foo %0,%1,%2" : "+{r4}" (x) : "{r5}" (y), "d" (y));
   return x;
}


I like this idea but I'm wondering: regular constraints specify what sort of 
value is needed, for example an int vs. a short int vs. a float.  The notation 
you've shown doesn't seem to have that aspect.

The other comment is that I didn't see documentation updates to reflect this 
new feature.

paul


  Stefan Schulze Frielinghaus:

This implements hard register constraints for inline asm.  A hard register
constraint is of the form {regname} where regname is any valid register.  This
basically renders register asm superfluous.  For example, the snippet

int test (int x, int y)
{
register int r4 asm ("r4") = x;
register int r5 asm ("r5") = y;
unsigned int copy = y;
asm ("foo %0,%1,%2" : "+d" (r4) : "d" (r5), "d" (copy));
return r4;
}

could be rewritten into

int test (int x, int y)
{
asm ("foo %0,%1,%2" : "+{r4}" (x) : "{r5}" (y), "d" (y));
return x;
}


Hi, can this also be used in machine descriptions?

It would make some insn handling much simpler, for example in
the avr backend.

That backend has insns that represent assembly sequences in libgcc
which have a smaller register footprint than plain calls.  However
this requires that such insns have explicit description of which regs
go in and out.

The current solution uses hard regs, which works, but a proper
implementation would use register constraints.  I tries that a while
ago, and register constraints lead to a code bloat even in places that
don't use these constraints due to the zillions of new register classes
like R22_1, R22;2, R22_4, R20_1, R20_2, R20_4 etc. that were required.

Your approach would allow using hard register constraints in insns,
and so far the only problem is to determine how many hard regs are
used by the constraint.  The gen tools that generate cc code from md
would use the operand's machine mode to infer the number of hard regs.
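
(Purely hypothetical illustration of what such an insn might look like if
{reg} constraints were accepted in md files; this syntax does not exist
today, and the concrete registers are only for the sake of the example:)

;; Hypothetical: describe a libgcc call with its real register footprint.
(define_insn "*mulsi3_libgcc"
  [(set (match_operand:SI 0 "register_operand" "={r22}")
        (mult:SI (match_operand:SI 1 "register_operand" "{r22}")
                 (match_operand:SI 2 "register_operand" "{r18}")))]
  ""
  "%~call __mulsi3")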


I have this on my todo list but ignored it for the very first draft.  At
the moment this already fails because genoutput cannot parse the
constraint format.

In my "alpha draft" I implemented this feature by emitting moves to hard
registers during expand.  This had the limitation that I couldn't


One problem is that you cannot just introduce hard registers at that
time because a hard reg may live across the sequence, see for example
avr.cc::avr_emit3_fix_outputs() and avr_fix_operands().


support multiple alternatives in combination with hard-register
constraints.  I'm still not sure whether this is a feature we really
want or whether it should be rather denied.  Anyhow, with this kind of
implementation I doubt that this would be feasible for machine
descriptions.  I moved on with my current draft where the constraint
manifests during register allocation.  This also allows multiple
alternatives.  I think one of the (major?) advantages of doing it this
way is that operands are kept in pseudos which means they are
automagically saved/restored over function boundaries and what not.  Or
in other words, the register constraint manifests at the asm boundary
which is probably what users expect and should be less error prone


As far as I know, a local register variable is only supposed to be
loaded to the specified register when the variable is used as an
operand to some inline asm.  Only in such asm statements, the
variable will live in the specified register.  So "surviving" a
function call is not even a problem to solve with the current local
regvar semantic?


(again just thinking of implicit code which gets injected e.g. by
sanitizers introducing calls etc.).

So long story short, I would like to look into this but currently it
doesn't work.  I'm also not sure to which extend this could be used.
However, once I have some more time I will have a look at the avr
backend for examples.

Cheers,
Stefan


Great.  When you have any questions about the avr backend, don't
hesitate to ask me.

Cheers,
Johann



[PATCH] c++: Fix ICE locating 'this' for (not matching) template member function [PR115364]

2024-06-28 Thread Simon Martin
We currently ICE when emitting the error message for this invalid code:

=== cut here ===
struct foo {
  template <int> void not_const() {}
};
void fn(const foo& obj) {
  obj.not_const<5>();
}
=== cut here ===

The problem is that get_fndecl_argument_location assumes that it has a
FUNCTION_DECL in its hands to find the location of the bad argument. It might
however have a TEMPLATE_DECL if there's a single candidate that cannot be
instantiated, like here.

This patch simply defaults to using the FNDECL's location in this case, which
fixes this PR.

Successfully tested on x86_64-pc-linux-gnu.

PR c++/115364

gcc/cp/ChangeLog:

* call.cc (get_fndecl_argument_location): Use FNDECL's location for
TEMPLATE_DECLs.

gcc/testsuite/ChangeLog:

* g++.dg/overload/template7.C: New test.

---
 gcc/cp/call.cc| 4 
 gcc/testsuite/g++.dg/overload/template7.C | 9 +
 2 files changed, 13 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/overload/template7.C

diff --git a/gcc/cp/call.cc b/gcc/cp/call.cc
index 7bbc1fb0c78..d5ff2311e63 100644
--- a/gcc/cp/call.cc
+++ b/gcc/cp/call.cc
@@ -8347,6 +8347,10 @@ get_fndecl_argument_location (tree fndecl, int argnum)
   if (DECL_ARTIFICIAL (fndecl))
 return DECL_SOURCE_LOCATION (fndecl);
 
+  /* Use FNDECL's location for TEMPLATE_DECLs.  */
+  if (TREE_CODE (fndecl) == TEMPLATE_DECL)
+return DECL_SOURCE_LOCATION (fndecl);
+
   int i;
   tree param;
 
diff --git a/gcc/testsuite/g++.dg/overload/template7.C b/gcc/testsuite/g++.dg/overload/template7.C
new file mode 100644
index 000..67191c4ff62
--- /dev/null
+++ b/gcc/testsuite/g++.dg/overload/template7.C
@@ -0,0 +1,9 @@
+// PR c++/115364
+// { dg-do compile }
+
+struct foo {
+  template <int> void not_const() {} // { dg-note "initializing" }
+};
+void fn(const foo& obj) {
+  obj.not_const<5>(); // { dg-error "cannot convert" }
+}
-- 
2.44.0




[PATCH] Remove unused hybrid_* operators in range-ops.

2024-06-28 Thread Aldy Hernandez
Now that the dust has settled on the prange work, we can remove the
hybrid operators.  I will push this once tests complete.

gcc/ChangeLog:

* range-op-ptr.cc (class hybrid_and_operator): Remove.
(class hybrid_or_operator): Same.
(class hybrid_min_operator): Same.
(class hybrid_max_operator): Same.
---
 gcc/range-op-ptr.cc | 156 
 1 file changed, 156 deletions(-)

diff --git a/gcc/range-op-ptr.cc b/gcc/range-op-ptr.cc
index 9421d3cd21d..1f41236e710 100644
--- a/gcc/range-op-ptr.cc
+++ b/gcc/range-op-ptr.cc
@@ -612,162 +612,6 @@ operator_pointer_diff::op1_op2_relation_effect (irange &lhs_range, tree type,
rel);
 }
 
-// --
-// Hybrid operators for the 4 operations which integer and pointers share,
-// but which have different implementations.  Simply check the type in
-// the call and choose the appropriate method.
-// Once there is a PRANGE signature, simply add the appropriate
-// prototypes in the rmixed range class, and remove these hybrid classes.
-
-class hybrid_and_operator : public operator_bitwise_and
-{
-public:
-  using range_operator::update_bitmask;
-  using range_operator::op1_range;
-  using range_operator::op2_range;
-  using range_operator::lhs_op1_relation;
-  bool op1_range (irange &r, tree type,
- const irange &lhs, const irange &op2,
- relation_trio rel = TRIO_VARYING) const final override
-{
-  if (INTEGRAL_TYPE_P (type))
-   return operator_bitwise_and::op1_range (r, type, lhs, op2, rel);
-  else
-   return false;
-}
-  bool op2_range (irange &r, tree type,
- const irange &lhs, const irange &op1,
- relation_trio rel = TRIO_VARYING) const final override
-{
-  if (INTEGRAL_TYPE_P (type))
-   return operator_bitwise_and::op2_range (r, type, lhs, op1, rel);
-  else
-   return false;
-}
-  relation_kind lhs_op1_relation (const irange &lhs,
- const irange &op1, const irange &op2,
- relation_kind rel) const final override
-{
-  if (!lhs.undefined_p () && INTEGRAL_TYPE_P (lhs.type ()))
-   return operator_bitwise_and::lhs_op1_relation (lhs, op1, op2, rel);
-  else
-   return VREL_VARYING;
-}
-  void update_bitmask (irange &r, const irange &lh,
-  const irange &rh) const final override
-{
-  if (!r.undefined_p () && INTEGRAL_TYPE_P (r.type ()))
-   operator_bitwise_and::update_bitmask (r, lh, rh);
-}
-
-  void wi_fold (irange &r, tree type, const wide_int &lh_lb,
-   const wide_int &lh_ub, const wide_int &rh_lb,
-   const wide_int &rh_ub) const final override
-{
-  if (INTEGRAL_TYPE_P (type))
-   return operator_bitwise_and::wi_fold (r, type, lh_lb, lh_ub,
- rh_lb, rh_ub);
-  else
-   return op_pointer_and.wi_fold (r, type, lh_lb, lh_ub, rh_lb, rh_ub);
-}
-} op_hybrid_and;
-
-// Temporary class which dispatches routines to either the INT version or
-// the pointer version depending on the type.  Once PRANGE is a range
-// class, we can remove the hybrid.
-
-class hybrid_or_operator : public operator_bitwise_or
-{
-public:
-  using range_operator::update_bitmask;
-  using range_operator::op1_range;
-  using range_operator::op2_range;
-  using range_operator::lhs_op1_relation;
-  bool op1_range (irange &r, tree type,
- const irange &lhs, const irange &op2,
- relation_trio rel = TRIO_VARYING) const final override
-{
-  if (INTEGRAL_TYPE_P (type))
-   return operator_bitwise_or::op1_range (r, type, lhs, op2, rel);
-  else
-   return op_pointer_or.op1_range (r, type, lhs, op2, rel);
-}
-  bool op2_range (irange &r, tree type,
- const irange &lhs, const irange &op1,
- relation_trio rel = TRIO_VARYING) const final override
-{
-  if (INTEGRAL_TYPE_P (type))
-   return operator_bitwise_or::op2_range (r, type, lhs, op1, rel);
-  else
-   return op_pointer_or.op2_range (r, type, lhs, op1, rel);
-}
-  void update_bitmask (irange &r, const irange &lh,
-  const irange &rh) const final override
-{
-  if (!r.undefined_p () && INTEGRAL_TYPE_P (r.type ()))
-   operator_bitwise_or::update_bitmask (r, lh, rh);
-}
-
-  void wi_fold (irange &r, tree type, const wide_int &lh_lb,
-   const wide_int &lh_ub, const wide_int &rh_lb,
-   const wide_int &rh_ub) const final override
-{
-  if (INTEGRAL_TYPE_P (type))
-   return operator_bitwise_or::wi_fold (r, type, lh_lb, lh_ub,
- rh_lb, rh_ub);
-  else
-   return op_pointer_or.wi_fold (r, type, lh_lb, lh_ub, rh_lb, rh_ub);
-}
-} op_hybrid_or;
-
-// Temporary class which dispatches routines to either the INT version or
-// the 

Re: [Patch, Fortran] 2/3 Refactor locations where _vptr is (re)set.

2024-06-28 Thread Andre Vehreschild
Hi Paul,

thanks for the review. I have removed the commented assert and committed as
gcc-15-1704-gaa3599a10ca

What about your pr59104 patch? It is living happily in my dev branch and
causing no problems.

Thanks again and regards,
Andre


On Thu, 27 Jun 2024 07:29:40 +0100
Paul Richard Thomas  wrote:

> Hi Andre,
>
> Thanks for resending the patches. I fear that daytime work and visitors
> have taken my attention the last few days - hence the delay in reviewing,
> for which I apologise,
>
> The patches do what they are advertised to do, without regressions on my
> side. I like gfc_class_set_vptr. Please remove the commented out assert,
> unless you intend to deploy it.
>
> OK for mainline.
>
> Thanks for the patches.
>
> Regards
>
> Paul
>
>
> On Fri, 21 Jun 2024 at 07:39, Andre Vehreschild  wrote:
>
> > Hi Paul,
> >
> > I am sorry for the delay. I am fighting with PR96992, where Harald finds
> > more
> > and more issues. I think I am approaching that one wrongly. We will see.
> >
> > Anyway, please find attached updated version of the 2/3 and 3/3 patches,
> > which
> > apply cleanly onto master at 1f974c3a24b76e25a2b7f31a6c7f4aee93a9eaab .
> >
> > Hope that helps and thanks in advance for looking at the patches.
> >
> > Regards,
> > Andre
> >
> > PS. I have attached them in plain text and as archive to prevent mailers
> > from
> > corrupting them.
> >
> > On Thu, 20 Jun 2024 07:42:31 +0100
> > Paul Richard Thomas  wrote:
> >
> > > Hi Andre,
> > >
> > > Both this patch and 3/3 are corrupt according to git apply:
> > > [pault@pc30 gcc]$ git apply --ignore-space-change --ignore-whitespace <
> > > ~/prs/andre/u*.patch
> > > error: corrupt patch at line 45
> > > [pault@pc30 gcc]$ git apply --ignore-space-change --ignore-whitespace <
> > > ~/prs/andre/i*.patch
> > > error: corrupt patch at line 36
> > >
> > > I tried messing with the offending lines, to no avail. I can apply them
> > by
> > > hand or, perhaps, you could supply me with clean patches?
> > >
> > > The patches look OK but I want to check the code that they generate.
> > >
> > > Cheers
> > >
> > > Paul
> > >
> > >
> > > On Tue, 11 Jun 2024 at 13:57, Andre Vehreschild  wrote:
> > >
> > > > Hi all,
> > > >
> > > > this patch refactors most of the locations where the _vptr of a class
> > data
> > > > type
> > > > is reset. The code was inconsistent in most of the locations. The goal
> > of
> > > > using
> > > > only one routine for setting the _vptr is to be able to later modify it
> > > > more
> > > > easily.
> > > >
> > > > The ultimate goal being that every time one assigns to a class data
> > type a
> > > > consistent way is used to prevent forgetting the corner cases. So this
> > is
> > > > just a
> > > > small step in this direction. I think it is worth to simplify the code
> > to
> > > > something consistent to reduce maintenance efforts anyhow.
> > > >
> > > > Regtested ok on x86_64 Fedora 39. Ok for mainline?
> > > >
> > > > Regards,
> > > > Andre
> > > > --
> > > > Andre Vehreschild * Email: vehre ad gmx dot de
> > > >
> >
> >
> > --
> > Andre Vehreschild * Kreuzherrenstr. 8 * 52062 Aachen
> > Tel.: +49 241 9291018 * Email: ve...@gmx.de
> >


--
Andre Vehreschild * Email: vehre ad gmx dot de


Re: [PATCH 2/3] libstdc++: Optimize __uninitialized_default using memset

2024-06-28 Thread Jonathan Wakely
On Fri, 28 Jun 2024 at 07:53, Maciej Cencora  wrote:
>
> But constexpr-ness of bit_cast has additional limitations and e.g. providing 
> a union as _Tp would be a hard error. So we have two options:
>  - before bitcasting check if type can be bitcast-ed at compile-time,
>  - change the 'if constexpr' to regular 'if'.
>
> If we go with the second solution then we will include classes with pointers, 
> and unions.

I don't think we want to add runtime comparisons, the point is to
optimize the code not do more work :-)

> Additionally we could also include types with padding by passing 
> zero-initialized object (like a class-scope static constexpr or global) into 
> bit_cast... but then such a variable would be ODR-used and most-likely won't 
> be optimized out.
>
> I guess the best option would be to introduce in C++ language a new 
> compiler-backed type trait like: 
> std::zero_initialized_object_has_all_zeros_object_representation.

Yes, I think a new built-in is the only approach that will work for
class types. I'll just limit the optimization to scalars (excluding
member pointers).
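
(A minimal illustration of why member pointers must stay excluded, assuming
the Itanium C++ ABI, where the null pointer to data member is represented as
-1: value-initialization is then not all-zero bytes, and a memset to zero
would instead produce a pointer to offset 0, i.e. to the first member.)

#include <cstdio>
#include <cstring>

struct S { int x; };

int main()
{
  int S::*pm{};                        // value-initialized: null member pointer
  unsigned char zeros[sizeof pm] = {}; // what a memset-to-zero would produce
  // Expected to print "differs" on the Itanium ABI: null is all-one bits.
  std::printf ("%s\n",
	       std::memcmp (&pm, zeros, sizeof pm) != 0 ? "differs" : "same");
  return 0;
}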



>
> Regards,
> Maciej
>
> On Fri, 28 Jun 2024 at 00:25, Jonathan Wakely wrote:
>>
>> On Thu, 27 Jun 2024 at 14:27, Maciej Cencora  wrote:
>> >
>> > I think going the bit_cast way would be the best because it enables the 
>> > optimization for many more classes including common wrappers like 
>> > optional, variant, pair, tuple and std::array.
>>
>> This isn't tested but seems to work on simple cases. But for large
>> objects the loop hits the constexpr iteration limit and compilation
>> fails, so it needs a sizeof(_Tp) < 64 or something.
>>
>>   using _ValueType
>> = typename iterator_traits<_ForwardIterator>::value_type;
>>   using _Tp = remove_all_extents_t<_ValueType>;
>>   // Need value-init to be equivalent to zero-init.
>>   if constexpr (is_member_pointer<_Tp>::value)
>> return nullptr;
>>   else if constexpr (!is_scalar<_Tp>::value)
>> {
>>   using __trivial
>> = __and_,
>>  is_trivially_constructible<_ValueType>>;
>>   if constexpr (__trivial::value)
>> {
>>   struct _Bytes
>>   {
>> unsigned char __b[sizeof(_Tp)];
>>
>> #if __cpp_constexpr >= 201304
>> constexpr bool _M_nonzero() const
>> {
>>   for (auto __c : __b)
>> if (__c)
>>   return true;
>>   return false;
>> }
>> #else
>> constexpr bool _M_nonzero(size_t __n = 0) const
>> {
>>   return __n < sizeof(_Tp)
>>  && (__b[__n] || _M_nonzero(__n + 1));
>> }
>> #endif
>>   };
>>   if constexpr (__builtin_bit_cast(_Bytes, _Tp())._M_nonzero())
>> return nullptr;
>> }
>> }
>>   using _Ptr = decltype(std::__to_address(__first));
>>   // Cannot use memset if _Ptr is cv-qualified.
>>   if constexpr (is_convertible<_Ptr, void*>::value)
>> return std::__to_address(__first);
>>



Re: [PATCH] Fix native_encode_vector_part for itype when TYPE_PRECISION (itype) == BITS_PER_UNIT

2024-06-28 Thread Richard Biener



> On 28.06.2024 at 10:27, Richard Sandiford wrote:
> 
> Richard Biener  writes:
>>> On Fri, Jun 28, 2024 at 8:01 AM Richard Biener
>>>  wrote:
>>> 
>>> On Fri, Jun 28, 2024 at 3:15 AM liuhongt  wrote:
 
 for the testcase in the PR115406, here is part of the dump.
 
  char D.4882;
  vector(1) <signed-boolean:8> _1;
  vector(1) signed char _2;
  char _5;
 
  <bb 2> :
  _1 = { -1 };
 
When assigning { -1 } to vector(1) <signed-boolean:8>,
since TYPE_PRECISION (itype) <= BITS_PER_UNIT, it sets each bit of the dest
with each vector element. But I think the bit setting should only apply for
TYPE_PRECISION (itype) < BITS_PER_UNIT, i.e. for vector(1) <signed-boolean:1>.
For vector(1) <signed-boolean:8>, it will be assigned as -1, instead of 1.
Is there any specific reason vector(1) <signed-boolean:8> is handled
differently from vector(1) <signed-boolean:1>?
 
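(Concretely: with 8-bit mask elements, native-encoding { -1 } should produce
the byte 0xff, matching an ordinary 8-bit integer element, whereas the
current LSB-only path produces 0x01.)
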
 Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
 Ok for trunk?
>>> 
>>> I agree that <= BITS_PER_UNIT is suspicious, but the bit-precision
>>> code should work for 8 bit
>>> entities as well, it seems we only set the LSB of each element in the
>>> "mask".  ISTR that SVE
>>> masks can have up to 8 bit elements (for 8 byte data elements), so
>>> maybe that's why
>>> <= BITS_PER_UNIT.
> 
> Yeah.

So is it necessary that only one bit is set for SVE?

>>> So maybe instead of just setting one bit in
>>> 
>>>  ptr[bit / BITS_PER_UNIT] |= 1 << (bit % BITS_PER_UNIT);
>>> 
>>> we should set elt_bits bits, aka (without testing)
>>> 
>>>  ptr[bit / BITS_PER_UNIT] |= (1 << elt_bits - 1) << (bit
>>> % BITS_PER_UNIT);
>>> 
>>> ?
>> 
>> Alternatively
>> 
>>  if (VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (expr))
>>  && TYPE_PRECISION (itype) <= BITS_PER_UNIT)
>> 
>> should be amended with
>> 
>>   && GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (expr))) != MODE_VECTOR_INT
> 
> How about:
> 
>  if (GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (expr))) == MODE_VECTOR_BOOL)
>{
>  gcc_assert (TYPE_PRECISION (itype) <= BITS_PER_UNIT);
> 
> ?

Note the path is also necessary for avx512 and gcn mask modes which are integer 
modes.

> Is it OK for TYPE_MODE to affect tree-level semantics though, especially
> since it can change with the target attribute?  (At least TYPE_MODE_RAW
> would be stable.)

That’s a good question and also related to GCC vector extension which can 
result in both BLKmode and integer modes to be used.  But I’m not sure how we 
expose masks to the middle end here.  A too large vector bool could be lowered 
to AVX512 mode.  Maybe we should simply reject interpret/encode of BLKmode 
vectors and make sure to never assign integer modes to vector bools (if the 
target didn’t specify that mode)?

I guess some test coverage would be nice here.

>> maybe.  Still for the possibility of vector(n) 
>> mask for a int128 element vector
>> we'd have 16bit mask elements, encoding that differently would be
>> inconsistent as well
>> (but of course 16bit elements are not handled by the code right now).
> 
> Yeah, 16-bit predicate elements aren't a thing for SVE, so we've not
> had to add support for them.
> 
> Richard


Re: [PATCH] Fix native_encode_vector_part for itype when TYPE_PRECISION (itype) == BITS_PER_UNIT

2024-06-28 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, Jun 28, 2024 at 8:01 AM Richard Biener
>  wrote:
>>
>> On Fri, Jun 28, 2024 at 3:15 AM liuhongt  wrote:
>> >
>> > for the testcase in the PR115406, here is part of the dump.
>> >
>> >   char D.4882;
>> >   vector(1) <signed-boolean:8> _1;
>> >   vector(1) signed char _2;
>> >   char _5;
>> >
>> >   <bb 2> :
>> >   _1 = { -1 };
>> >
>> > When assigning { -1 } to vector(1) <signed-boolean:8>,
>> > since TYPE_PRECISION (itype) <= BITS_PER_UNIT, it sets each bit of the dest
>> > with each vector element. But I think the bit setting should only apply for
>> > TYPE_PRECISION (itype) < BITS_PER_UNIT, i.e. for vector(1) <signed-boolean:1>.
>> > For vector(1) <signed-boolean:8>, it will be assigned as -1, instead of 1.
>> > Is there any specific reason vector(1) <signed-boolean:8> is handled
>> > differently from vector(1) <signed-boolean:1>?
>> >
>> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
>> > Ok for trunk?
>>
>> I agree that <= BITS_PER_UNIT is suspicious, but the bit-precision
>> code should work for 8 bit
>> entities as well, it seems we only set the LSB of each element in the
>> "mask".  ISTR that SVE
>> masks can have up to 8 bit elements (for 8 byte data elements), so
>> maybe that's why
>> <= BITS_PER_UNIT.

Yeah.

>>  So maybe instead of just setting one bit in
>>
>>   ptr[bit / BITS_PER_UNIT] |= 1 << (bit % BITS_PER_UNIT);
>>
>> we should set elt_bits bits, aka (without testing)
>>
>>   ptr[bit / BITS_PER_UNIT] |= (1 << elt_bits - 1) << (bit
>> % BITS_PER_UNIT);
>>
>> ?
>
> Alternatively
>
>   if (VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (expr))
>   && TYPE_PRECISION (itype) <= BITS_PER_UNIT)
>
> should be amended with
>
>&& GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (expr))) != MODE_VECTOR_INT

How about:

  if (GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (expr))) == MODE_VECTOR_BOOL)
{
  gcc_assert (TYPE_PRECISION (itype) <= BITS_PER_UNIT);

?

Is it OK for TYPE_MODE to affect tree-level semantics though, especially
since it can change with the target attribute?  (At least TYPE_MODE_RAW
would be stable.)

> maybe.  Still for the possibility of vector(n) 
> mask for a int128 element vector
> we'd have 16bit mask elements, encoding that differently would be
> inconsistent as well
> (but of course 16bit elements are not handled by the code right now).

Yeah, 16-bit predicate elements aren't a thing for SVE, so we've not
had to add support for them.

Richard


  1   2   3   4   5   6   7   8   9   10   >