[PATCH] [APX] Support Intel APX PUSH2POP2

2023-10-09 Thread Hongyu Wang
From: "Mo, Zewei" 

Hi, 

The Intel APX PUSH2POP2 feature has been released in [1].

This feature requires the stack to be aligned to 16 bytes, so in the
prologue/epilogue a standalone push/pop is emitted before any
push2/pop2 if the stack is not 16-byte aligned.
Also, the current implementation only supports push2/pop2 in the
function prologue/epilogue, for callee-saved registers.
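
For illustration, with three callee-saved GPRs to save and an RSP that is
not 16-byte aligned at the save point, the intended prologue/epilogue shape
would roughly be (register choice and operand order are illustrative, not
taken from the patch):

    push   %r13           # standalone push re-establishes 16-byte alignment
    push2  %r12, %rbx     # paired save, requires 16-byte-aligned RSP
    ...
    pop2   %rbx, %r12
    pop    %r13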

Bootstrapped/regtested on x86_64-pc-linux-gnu{-m32,} and SDE.

OK for master?

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html

gcc/ChangeLog:

* config/i386/i386.cc (gen_push2): New function to emit push2
and adjust cfa offset.
(ix86_use_push2_pop2): New function to determine whether
push2/pop2 can be used.
(ix86_compute_frame_layout): Adjust preferred stack boundary
and stack alignment needed for push2/pop2.
(ix86_emit_save_regs): Emit push2 when available.
(ix86_emit_restore_reg_using_pop2): New function to emit pop2
and adjust cfa info.
(ix86_emit_restore_regs_using_pop2): New function to loop
through the saved regs and call above.
(ix86_expand_epilogue): Call ix86_emit_restore_regs_using_pop2
when push2pop2 available.
* config/i386/i386.md (push2_di): New pattern for push2.
(pop2_di): Likewise for pop2.

gcc/testsuite/ChangeLog:

* gcc.target/i386/apx-push2pop2-1.c: New test.
* gcc.target/i386/apx-push2pop2_force_drap-1.c: Likewise.
* gcc.target/i386/apx-push2pop2_interrupt-1.c: Likewise.

Co-authored-by: Hu Lin1 
Co-authored-by: Hongyu Wang 
---
 gcc/config/i386/i386.cc   | 252 --
 gcc/config/i386/i386.md   |  26 ++
 .../gcc.target/i386/apx-push2pop2-1.c |  45 
 .../i386/apx-push2pop2_force_drap-1.c |  29 ++
 .../i386/apx-push2pop2_interrupt-1.c  |  28 ++
 5 files changed, 365 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/apx-push2pop2-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/apx-push2pop2_force_drap-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/apx-push2pop2_interrupt-1.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 6244f64a619..8251b67e2d6 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -6473,6 +6473,26 @@ gen_pop (rtx arg)
 stack_pointer_rtx)));
 }
 
+/* Generate a "push2" pattern for input ARG.  */
+rtx
+gen_push2 (rtx mem, rtx reg1, rtx reg2)
+{
+  struct machine_function *m = cfun->machine;
+  const int offset = UNITS_PER_WORD * 2;
+
+  if (m->fs.cfa_reg == stack_pointer_rtx)
+m->fs.cfa_offset += offset;
+  m->fs.sp_offset += offset;
+
+  if (REG_P (reg1) && GET_MODE (reg1) != word_mode)
+reg1 = gen_rtx_REG (word_mode, REGNO (reg1));
+
+  if (REG_P (reg2) && GET_MODE (reg2) != word_mode)
+reg2 = gen_rtx_REG (word_mode, REGNO (reg2));
+
+  return gen_push2_di (mem, reg1, reg2);
+}
+
 /* Return >= 0 if there is an unused call-clobbered register available
for the entire function.  */
 
@@ -6714,6 +6734,18 @@ get_probe_interval (void)
 
 #define SPLIT_STACK_AVAILABLE 256
 
+/* Helper function to determine whether push2/pop2 can be used in prologue or
+   epilogue for register save/restore.  */
+static bool
+ix86_pro_and_epilogue_can_use_push2pop2 (int nregs)
+{
+  int aligned = cfun->machine->fs.sp_offset % 16 == 0;
+  return TARGET_APX_PUSH2POP2
+&& !cfun->machine->frame.save_regs_using_mov
+&& cfun->machine->func_type == TYPE_NORMAL
+&& (nregs + aligned) >= 3;
+}
+
 /* Fill structure ix86_frame about frame of currently computed function.  */
 
 static void
@@ -6771,16 +6803,20 @@ ix86_compute_frame_layout (void)
 
  Darwin's ABI specifies 128b alignment for both 32 and  64 bit variants
  at call sites, including profile function calls.
- */
-  if (((TARGET_64BIT_MS_ABI || TARGET_MACHO)
-&& crtl->preferred_stack_boundary < 128)
-  && (!crtl->is_leaf || cfun->calls_alloca != 0
- || ix86_current_function_calls_tls_descriptor
- || (TARGET_MACHO && crtl->profile)
- || ix86_incoming_stack_boundary < 128))
+
+ For APX push2/pop2, the stack also requires 128b alignment.  */
+  if ((ix86_pro_and_epilogue_can_use_push2pop2 (frame->nregs)
+   && crtl->preferred_stack_boundary < 128)
+  || (((TARGET_64BIT_MS_ABI || TARGET_MACHO)
+  && crtl->preferred_stack_boundary < 128)
+ && (!crtl->is_leaf || cfun->calls_alloca != 0
+ || ix86_current_function_calls_tls_descriptor
+ || (TARGET_MACHO && crtl->profile)
+ || ix86_incoming_stack_boundary < 128)))
 {
   crtl->preferred_stack_boundary = 128;
-  crtl->stack_alignment_needed = 128;
+  if (crtl->stack_alignment_needed < 128)
+   crtl->stack_alignment_needed = 128;
 }

[PATCH] x86: set spincount 1 for x86 hybrid platform [PR109812]

2023-10-09 Thread Jun Zhang
From: "Zhang, Jun" 

Testing shows that on hybrid platforms a spin count of 1 is better.

Use '-march=native -Ofast -funroll-loops -flto',
results as follows:

spec2017 speed   RPL ADL
657.xz_s 0.00%   0.50%
603.bwaves_s 10.90%  26.20%
607.cactuBSSN_s  5.50%   72.50%
619.lbm_s2.40%   2.50%
621.wrf_s-7.70%  2.40%
627.cam4_s   0.50%   0.70%
628.pop2_s   48.20%  153.00%
638.imagick_s-0.10%  0.20%
644.nab_s2.30%   1.40%
649.fotonik3d_s  8.00%   13.80%
654.roms_s   1.20%   1.10%
Geomean-int  0.00%   0.50%
Geomean-fp   6.30%   21.10%
Geomean-all  5.70%   19.10%

omp2012  RPL ADL
350.md   -1.81%  -1.75%
351.bwaves   7.72%   12.50%
352.nab  14.63%  19.71%
357.bt331-0.20%  1.77%
358.botsalgn 0.00%   0.00%
359.botsspar 0.00%   0.65%
360.ilbdc0.00%   0.25%
362.fma3d2.66%   -0.51%
363.swim 10.44%  0.00%
367.imagick  0.00%   0.12%
370.mgrid331 2.49%   25.56%
371.applu331 1.06%   4.22%
372.smithwa  0.74%   3.34%
376.kdtree   10.67%  16.03%
GEOMEAN  3.34%   5.53%
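
Note that the hook added below only kicks in when neither GOMP_SPINCOUNT
nor the wait policy has been set explicitly, so user settings still win,
e.g.:

    $ GOMP_SPINCOUNT=100000 ./app   # explicit value, the hook does not override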

include/ChangeLog:

* omphook.h: Define RUNOMPHOOK macro.

libgomp/ChangeLog:

* env.c (initialize_env): Add RUNOMPHOOK macro call.
* config/linux/x86/omphook.h: Define RUNOMPHOOK macro.
---
 include/omphook.h  |  1 +
 libgomp/config/linux/x86/omphook.h | 19 +++
 libgomp/env.c  |  3 +++
 3 files changed, 23 insertions(+)
 create mode 100644 include/omphook.h
 create mode 100644 libgomp/config/linux/x86/omphook.h

diff --git a/include/omphook.h b/include/omphook.h
new file mode 100644
index 000..2ebe3ad57e6
--- /dev/null
+++ b/include/omphook.h
@@ -0,0 +1 @@
+#define RUNOMPHOOK()
diff --git a/libgomp/config/linux/x86/omphook.h 
b/libgomp/config/linux/x86/omphook.h
new file mode 100644
index 000..aefb311cc07
--- /dev/null
+++ b/libgomp/config/linux/x86/omphook.h
@@ -0,0 +1,19 @@
+#ifdef __x86_64__
+#include "cpuid.h"
+
+/* only for x86 hybrid platform */
+#define RUNOMPHOOK()  \
+  do \
+{ \
+  unsigned int eax, ebx, ecx, edx; \
+  if ((getenv ("GOMP_SPINCOUNT") == NULL) && (wait_policy < 0) \
+ && __get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx) \
+ && ((edx >> 15) & 1)) \
+   gomp_spin_count_var = 1LL; \
+  if (gomp_throttled_spin_count_var > gomp_spin_count_var) \
+   gomp_throttled_spin_count_var = gomp_spin_count_var; \
+} \
+  while (0)
+#else
+# include "../../../../include/omphook.h"
+#endif
diff --git a/libgomp/env.c b/libgomp/env.c
index a21adb3fd4b..1f13a148694 100644
--- a/libgomp/env.c
+++ b/libgomp/env.c
@@ -61,6 +61,7 @@
 
 #include "secure_getenv.h"
 #include "environ.h"
+#include "omphook.h"
 
 /* Default values of ICVs according to the OpenMP standard,
except for default-device-var.  */
@@ -2496,5 +2497,7 @@ initialize_env (void)
   goacc_runtime_initialize ();
 
   goacc_profiling_initialize ();
+
+  RUNOMPHOOK ();
 }
 #endif /* LIBGOMP_OFFLOADED_ONLY */
-- 
2.31.1



[PATCH v2 3/4] RISC-V: Extend riscv_subset_list, preparatory for target attribute support

2023-10-09 Thread Kito Cheng
riscv_subset_list previously only accepted a full arch string, but we need
to parse a single extension when supporting the target attribute; we may
also want to set a riscv_subset_list directly rather than re-parsing the
ISA string again.
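
A sketch of the intended use from a target-attribute handler (signatures
approximate, handling e.g. "+zba" from an attribute string):

  riscv_subset_list *subsets = riscv_current_subset_list ()->clone ();
  subsets->set_loc (loc);
  subsets->parse_single_ext ("zba");
  riscv_set_arch_by_subset_list (subsets, opts);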

gcc/ChangeLog:

* config/riscv/riscv-subset.h (riscv_subset_list::parse_single_std_ext):
New.
(riscv_subset_list::parse_single_multiletter_ext): Ditto.
(riscv_subset_list::clone): Ditto.
(riscv_subset_list::parse_single_ext): Ditto.
(riscv_subset_list::set_loc): Ditto.
(riscv_set_arch_by_subset_list): Ditto.
* common/config/riscv/riscv-common.cc
(riscv_subset_list::parse_single_std_ext): New.
(riscv_subset_list::parse_single_multiletter_ext): Ditto.
(riscv_subset_list::clone): Ditto.
(riscv_subset_list::parse_single_ext): Ditto.
(riscv_subset_list::set_loc): Ditto.
(riscv_set_arch_by_subset_list): Ditto.
---
 gcc/common/config/riscv/riscv-common.cc | 203 
 gcc/config/riscv/riscv-subset.h |  11 ++
 2 files changed, 214 insertions(+)

diff --git a/gcc/common/config/riscv/riscv-common.cc 
b/gcc/common/config/riscv/riscv-common.cc
index 9a0a68fe5db..25630d5923e 100644
--- a/gcc/common/config/riscv/riscv-common.cc
+++ b/gcc/common/config/riscv/riscv-common.cc
@@ -1036,6 +1036,41 @@ riscv_subset_list::parse_std_ext (const char *p)
   return p;
 }
 
+/* Parsing function for a single standard extension.
+
+   Return Value:
+ Points to the end of extensions.
+
+   Arguments:
+ `p`: Current parsing position.  */
+
+const char *
+riscv_subset_list::parse_single_std_ext (const char *p)
+{
+  if (*p == 'x' || *p == 's' || *p == 'z')
+{
+  error_at (m_loc,
+   "%<-march=%s%>: Not single-letter extension. "
+   "%<%c%>",
+   m_arch, *p);
+  return nullptr;
+}
+
+  unsigned major_version = 0;
+  unsigned minor_version = 0;
+  bool explicit_version_p = false;
+  char subset[2] = {0, 0};
+
+  subset[0] = *p;
+
+  p++;
+
+  p = parsing_subset_version (subset, p, &major_version, &minor_version,
+ /* std_ext_p= */ true, &explicit_version_p);
+
+  add (subset, major_version, minor_version, explicit_version_p, false);
+  return p;
+}
 
 /* Check any implied extensions for EXT.  */
 void
@@ -1138,6 +1173,102 @@ riscv_subset_list::handle_combine_ext ()
 }
 }
 
+/* Parsing function for multi-letter extensions.
+
+   Return Value:
+ Points to the end of extensions.
+
+   Arguments:
+ `p`: Current parsing position.
+ `ext_type`: What kind of extensions, 's', 'z' or 'x'.
+ `ext_type_str`: Full name for kind of extension.  */
+
+
+const char *
+riscv_subset_list::parse_single_multiletter_ext (const char *p,
+const char *ext_type,
+const char *ext_type_str)
+{
+  unsigned major_version = 0;
+  unsigned minor_version = 0;
+  size_t ext_type_len = strlen (ext_type);
+
+  if (strncmp (p, ext_type, ext_type_len) != 0)
+return NULL;
+
+  char *subset = xstrdup (p);
+  const char *end_of_version;
+  bool explicit_version_p = false;
+  char *ext;
+  char backup;
+  size_t len = strlen (p);
+  size_t end_of_version_pos, i;
+  bool found_any_number = false;
+  bool found_minor_version = false;
+
+  end_of_version_pos = len;
+  /* Find the begin of version string.  */
+  for (i = len -1; i > 0; --i)
+{
+  if (ISDIGIT (subset[i]))
+   {
+ found_any_number = true;
+ continue;
+   }
+  /* Might be version separator, but need to check one more char,
+we only allow p, so we could stop parsing if found
+any more `p`.  */
+  if (subset[i] == 'p' &&
+ !found_minor_version &&
+ found_any_number && ISDIGIT (subset[i-1]))
+   {
+ found_minor_version = true;
+ continue;
+   }
+
+  end_of_version_pos = i + 1;
+  break;
+}
+
+  backup = subset[end_of_version_pos];
+  subset[end_of_version_pos] = '\0';
+  ext = xstrdup (subset);
+  subset[end_of_version_pos] = backup;
+
+  end_of_version
+= parsing_subset_version (ext, subset + end_of_version_pos, &major_version,
+ &minor_version, /* std_ext_p= */ false,
+ &explicit_version_p);
+  free (ext);
+
+  if (end_of_version == NULL)
+return NULL;
+
+  subset[end_of_version_pos] = '\0';
+
+  if (strlen (subset) == 1)
+{
+  error_at (m_loc, "%<-march=%s%>: name of %s must be more than 1 letter",
+   m_arch, ext_type_str);
+  free (subset);
+  return NULL;
+}
+
+  add (subset, major_version, minor_version, explicit_version_p, false);
+  p += end_of_version - subset;
+  free (subset);
+
+  if (*p != '\0' && *p != '_')
+{
+  error_at (m_loc, "%<-march=%s%>: %s must separate with %<_%>",
+   m_arch, ext_type_str);
+  return NULL;
+}
+
+  re

[PATCH v2 4/4] RISC-V: Implement target attribute

2023-10-09 Thread Kito Cheng
The target attribute, as proposed in [1], allows the user to specify
local settings on a per-function basis.

The syntax of the target attribute is
`__attribute__((target("<ATTR-STRING>")))`,

and the syntax of `<ATTR-STRING>` is described below:
```
ATTR-STRING := ATTR-STRING ';' ATTR
             | ATTR

ATTR        := ARCH-ATTR
             | CPU-ATTR
             | TUNE-ATTR

ARCH-ATTR   := 'arch=' EXTENSIONS-OR-FULLARCH

EXTENSIONS-OR-FULLARCH := <EXTENSIONS>
                        | <FULLARCHSTR>

EXTENSIONS             := <EXTENSION> ',' <EXTENSIONS>
                        | <EXTENSION>

FULLARCHSTR            := <full-arch-string>

EXTENSION              := <OP> <EXTENSION-NAME> <VERSION>

OP                     := '+'

VERSION                := [0-9]+ 'p' [0-9]+
                        | [1-9][0-9]*
                        |

EXTENSION-NAME         := Naming rule is defined in RISC-V ISA manual

CPU-ATTR    := 'cpu=' <valid-cpu-name>
TUNE-ATTR   := 'tune=' <valid-tune-name>
```

[1] https://github.com/riscv-non-isa/riscv-c-api-doc/pull/35
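
For example, under this syntax (extension and CPU names below are
illustrative):

```
__attribute__ ((target ("arch=+zba,+zbb")))
int foo (int x);                  /* Zba/Zbb enabled locally.  */

__attribute__ ((target ("arch=rv64gc_zba;tune=rocket")))
int bar (int x);                  /* Full arch string plus tune.  */
```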

gcc/ChangeLog:

* config.gcc (riscv): Add riscv-target-attr.o.
* config/riscv/riscv-opts.h (TARGET_MIN_VLEN_OPTS): New.
* config/riscv/riscv-protos.h (riscv_declare_function_size): New.
(riscv_option_valid_attribute_p): New.
(riscv_override_options_internal): New.
(struct riscv_tune_info): New.
(riscv_parse_tune): New.
* config/riscv/riscv-target-attr.cc
(class riscv_target_attr_parser): New.
(struct riscv_attribute_info): New.
(riscv_attributes): New.
(riscv_target_attr_parser::parse_arch): New.
(riscv_target_attr_parser::handle_arch): New.
(riscv_target_attr_parser::handle_cpu): New.
(riscv_target_attr_parser::handle_tune): New.
(riscv_target_attr_parser::update_settings): New.
(riscv_process_one_target_attr): New.
(num_occurences_in_str): New.
(riscv_process_target_attr): New.
(riscv_option_valid_attribute_p): New.
* config/riscv/riscv.cc: Include target-globals.h and
riscv-subset.h.
(struct riscv_tune_info): Move to riscv-protos.h.
(get_tune_str):
(riscv_parse_tune):
(riscv_declare_function_size):
(riscv_option_override): Build target_option_default_node and
target_option_current_node.
(riscv_save_restore_target_globals):
(riscv_option_restore):
(riscv_previous_fndecl):
(riscv_set_current_function): Apply the target attribute.
(TARGET_OPTION_RESTORE): Define.
(TARGET_OPTION_VALID_ATTRIBUTE_P): Ditto.
* config/riscv/riscv.h (SWITCHABLE_TARGET): Define to 1.
(ASM_DECLARE_FUNCTION_SIZE): Define.
* config/riscv/riscv.opt (mtune=): Add Save attribute.
(mcpu=): Ditto.
(mcmodel=): Ditto.
* config/riscv/t-riscv: Add build rule for riscv-target-attr.o.
* doc/extend.texi: Add doc for target attribute.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/target-attr-01.c: New.
* gcc.target/riscv/target-attr-02.c: Ditto.
* gcc.target/riscv/target-attr-03.c: Ditto.
* gcc.target/riscv/target-attr-04.c: Ditto.
* gcc.target/riscv/target-attr-05.c: Ditto.
* gcc.target/riscv/target-attr-06.c: Ditto.
* gcc.target/riscv/target-attr-07.c: Ditto.
* gcc.target/riscv/target-attr-bad-01.c: Ditto.
* gcc.target/riscv/target-attr-bad-02.c: Ditto.
* gcc.target/riscv/target-attr-bad-03.c: Ditto.
* gcc.target/riscv/target-attr-bad-04.c: Ditto.
* gcc.target/riscv/target-attr-bad-05.c: Ditto.
* gcc.target/riscv/target-attr-bad-06.c: Ditto.
* gcc.target/riscv/target-attr-bad-07.c: Ditto.
* gcc.target/riscv/target-attr-warning-01.c: Ditto.
* gcc.target/riscv/target-attr-warning-02.c: Ditto.
* gcc.target/riscv/target-attr-warning-03.c: Ditto.
---
 gcc/config.gcc|   2 +-
 gcc/config/riscv/riscv-opts.h |   6 +
 gcc/config/riscv/riscv-protos.h   |  21 +
 gcc/config/riscv/riscv-target-attr.cc | 395 ++
 gcc/config/riscv/riscv.cc | 192 +++--
 gcc/config/riscv/riscv.h  |   6 +
 gcc/config/riscv/riscv.opt|   6 +-
 gcc/config/riscv/t-riscv  |   5 +
 gcc/doc/extend.texi   |  58 +++
 .../gcc.target/riscv/target-attr-01.c |  31 ++
 .../gcc.target/riscv/target-attr-02.c |  31 ++
 .../gcc.target/riscv/target-attr-03.c |  26 ++
 .../gcc.target/riscv/target-attr-04.c |  28 ++
 .../gcc.target/riscv/target-attr-05.c |  27 ++
 .../gcc.target/riscv/target-attr-06.c |  27 ++
 .../gcc.target/riscv/target-attr-07.c |  25 ++
 .../gcc.target/riscv/target-attr-bad-01.c |  13 +
 .../gcc.target/riscv/target-attr-bad-02.c |  13 +
 .../gcc.target/riscv/target-attr-bad-03.c |  13 +
 .../gcc.target/riscv/target-attr-bad-04.c |  13 +
 .../gcc.target/riscv/target-attr-bad-05.c |  13 +
 .../gcc.target/riscv/target-attr-b

[PATCH v2 2/4] RISC-V: Refactor riscv_option_override and riscv_convert_vector_bits. [NFC]

2023-10-09 Thread Kito Cheng
Allow those functions to apply settings from a local gcc_options rather
than the global options.

This is preparatory work for the target attribute; the change is separated
out for easier review since it is an NFC.
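
After this change the same code path can work on either the global options
or a local copy, e.g. (sketch):

  riscv_override_options_internal (&global_options);  /* TARGET_OPTION_OVERRIDE */

  struct gcc_options local_opts = global_options;     /* per-function override */
  riscv_override_options_internal (&local_opts);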

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_convert_vector_bits): Get the setting
from the argument rather than from the global setting.
(riscv_override_options_internal): New, split from
riscv_option_override; also takes a gcc_options argument.
(riscv_option_override): Split most parts out into
riscv_override_options_internal.
---
 gcc/config/riscv/riscv.cc | 93 ++-
 1 file changed, 52 insertions(+), 41 deletions(-)

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index b7acf836d02..c7d0d300345 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -8066,10 +8066,11 @@ riscv_init_machine_status (void)
 /* Return the VLEN value associated with -march.
TODO: So far we only support length-agnostic value. */
 static poly_uint16
-riscv_convert_vector_bits (void)
+riscv_convert_vector_bits (struct gcc_options *opts)
 {
   int chunk_num;
-  if (TARGET_MIN_VLEN > 32)
+  int min_vlen = TARGET_MIN_VLEN_OPTS (opts);
+  if (min_vlen > 32)
 {
   /* When targetting minimum VLEN > 32, we should use 64-bit chunk size.
 Otherwise we can not include SEW = 64bits.
@@ -8087,7 +8088,7 @@ riscv_convert_vector_bits (void)
   - TARGET_MIN_VLEN = 2048bit: [256,256]
   - TARGET_MIN_VLEN = 4096bit: [512,512]
   FIXME: We currently DON'T support TARGET_MIN_VLEN > 4096bit.  */
-  chunk_num = TARGET_MIN_VLEN / 64;
+  chunk_num = min_vlen / 64;
 }
   else
 {
@@ -8106,10 +8107,10 @@ riscv_convert_vector_bits (void)
  to set RVV mode size. The RVV machine modes size are run-time constant if
  TARGET_VECTOR is enabled. The RVV machine modes size remains default
  compile-time constant if TARGET_VECTOR is disabled.  */
-  if (TARGET_VECTOR)
+  if (TARGET_VECTOR_OPTS_P (opts))
 {
-  if (riscv_autovec_preference == RVV_FIXED_VLMAX)
-   return (int) TARGET_MIN_VLEN / (riscv_bytes_per_vector_chunk * 8);
+  if (opts->x_riscv_autovec_preference == RVV_FIXED_VLMAX)
+   return (int) min_vlen / (riscv_bytes_per_vector_chunk * 8);
   else
return poly_uint16 (chunk_num, chunk_num);
 }
@@ -8117,40 +8118,33 @@ riscv_convert_vector_bits (void)
 return 1;
 }
 
-/* Implement TARGET_OPTION_OVERRIDE.  */
-
-static void
-riscv_option_override (void)
+/* 'Unpack' up the internal tuning structs and update the options
+in OPTS.  The caller must have set up selected_tune and selected_arch
+as all the other target-specific codegen decisions are
+derived from them.  */
+void
+riscv_override_options_internal (struct gcc_options *opts)
 {
   const struct riscv_tune_info *cpu;
 
-#ifdef SUBTARGET_OVERRIDE_OPTIONS
-  SUBTARGET_OVERRIDE_OPTIONS;
-#endif
-
-  flag_pcc_struct_return = 0;
-
-  if (flag_pic)
-g_switch_value = 0;
-
   /* The presence of the M extension implies that division instructions
  are present, so include them unless explicitly disabled.  */
-  if (TARGET_MUL && (target_flags_explicit & MASK_DIV) == 0)
-target_flags |= MASK_DIV;
-  else if (!TARGET_MUL && TARGET_DIV)
+  if (TARGET_MUL_OPTS_P (opts) && (target_flags_explicit & MASK_DIV) == 0)
+opts->x_target_flags |= MASK_DIV;
+  else if (!TARGET_MUL_OPTS_P (opts) && TARGET_DIV_OPTS_P (opts))
 error ("%<-mdiv%> requires %<-march%> to subsume the % extension");
 
   /* Likewise floating-point division and square root.  */
   if ((TARGET_HARD_FLOAT || TARGET_ZFINX) && (target_flags_explicit & 
MASK_FDIV) == 0)
-target_flags |= MASK_FDIV;
+opts->x_target_flags |= MASK_FDIV;
 
   /* Handle -mtune, use -mcpu if -mtune is not given, and use default -mtune
  if both -mtune and -mcpu are not given.  */
-  cpu = riscv_parse_tune (riscv_tune_string ? riscv_tune_string :
- (riscv_cpu_string ? riscv_cpu_string :
+  cpu = riscv_parse_tune (opts->x_riscv_tune_string ? 
opts->x_riscv_tune_string :
+ (opts->x_riscv_cpu_string ? opts->x_riscv_cpu_string :
   RISCV_TUNE_STRING_DEFAULT));
   riscv_microarchitecture = cpu->microarchitecture;
-  tune_param = optimize_size ? &optimize_size_tune_info : cpu->tune_param;
+  tune_param = opts->x_optimize_size ? &optimize_size_tune_info : 
cpu->tune_param;
 
   /* Use -mtune's setting for slow_unaligned_access, even when optimizing
  for size.  For architectures that trap and emulate unaligned accesses,
@@ -8166,15 +8160,38 @@ riscv_option_override (void)
 
   if ((target_flags_explicit & MASK_STRICT_ALIGN) == 0
   && cpu->tune_param->slow_unaligned_access)
-target_flags |= MASK_STRICT_ALIGN;
+opts->x_target_flags |= MASK_STRICT_ALIGN;
 
   /* If the user hasn't specified a branch cost, use the processor's
  default.  */
-  if (riscv_branch_c

[PATCH v2 1/4] options: Define TARGET_<NAME>_P and TARGET_<NAME>_OPTS_P macros for Mask and InverseMask

2023-10-09 Thread Kito Cheng
We have the TARGET_<NAME>_P macro to test a Mask and InverseMask against a
user-specified target_variable; however, we may want to test against a
specific gcc_options variable rather than target_variable.

RISC-V, for example, has defined lots of Masks with TargetVariable, which is
not easy to use, because it means we need to know which Mask is associated
with which TargetVariable; taking a gcc_options variable is a better
interface for such use cases.
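
For a hypothetical option entry "Mask(FOO) Var(riscv_bar_flags)", the
generated macros would then look like:

  #define TARGET_FOO ((riscv_bar_flags & MASK_FOO) != 0)
  #define TARGET_FOO_P(riscv_bar_flags) (((riscv_bar_flags) & MASK_FOO) != 0)
  #define TARGET_FOO_OPTS_P(opts) (((opts->x_riscv_bar_flags) & MASK_FOO) != 0)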

gcc/ChangeLog:

* doc/options.texi (Mask): Document TARGET_<NAME>_P and
TARGET_<NAME>_OPTS_P.
(InverseMask): Ditto.
* opth-gen.awk (Mask): Generate TARGET_<NAME>_P and
TARGET_<NAME>_OPTS_P macros.
(InverseMask): Ditto.
---
 gcc/doc/options.texi | 23 ---
 gcc/opth-gen.awk | 13 -
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/gcc/doc/options.texi b/gcc/doc/options.texi
index 1f7c15b8eb4..715f0a1479c 100644
--- a/gcc/doc/options.texi
+++ b/gcc/doc/options.texi
@@ -404,18 +404,27 @@ You may also specify @code{Var} to select a variable 
other than
 The options-processing script will automatically allocate a unique bit
 for the option.  If the option is attached to @samp{target_flags} or @code{Var}
 which is defined by @code{TargetVariable},  the script will set the macro
-@code{MASK_@var{name}} to the appropriate bitmask.  It will also declare a 
-@code{TARGET_@var{name}} macro that has the value 1 when the option is active
-and 0 otherwise.  If you use @code{Var} to attach the option to a different 
variable
-which is not defined by @code{TargetVariable}, the bitmask macro with be
-called @code{OPTION_MASK_@var{name}}.
+@code{MASK_@var{name}} to the appropriate bitmask.  It will also declare a
+@code{TARGET_@var{name}}, @code{TARGET_@var{name}_P} and
+@code{TARGET_@var{name}_OPTS_P}: @code{TARGET_@var{name}} macros that has the
+value 1 when the option is active and 0 otherwise, @code{TARGET_@var{name}_P} 
is
+similar to @code{TARGET_@var{name}} but take an argument as @samp{target_flags}
+or @code{TargetVariable}, and @code{TARGET_@var{name}_OPTS_P} also similar to
+@code{TARGET_@var{name}} but take an argument as @code{gcc_options}.
+If you use @code{Var} to attach the option to a different variable which is not
+defined by @code{TargetVariable}, the bitmask macro will be called
+@code{OPTION_MASK_@var{name}}.
 
 @item InverseMask(@var{othername})
 @itemx InverseMask(@var{othername}, @var{thisname})
 The option is the inverse of another option that has the
 @code{Mask(@var{othername})} property.  If @var{thisname} is given,
-the options-processing script will declare a @code{TARGET_@var{thisname}}
-macro that is 1 when the option is active and 0 otherwise.
+the options-processing script will declare @code{TARGET_@var{thisname}},
+@code{TARGET_@var{name}_P} and @code{TARGET_@var{name}_OPTS_P} macros:
+@code{TARGET_@var{thisname}} is 1 when the option is active and 0 otherwise,
+@code{TARGET_@var{name}_P} is similar to @code{TARGET_@var{name}} but take an
+argument as @samp{target_flags}, and @code{TARGET_@var{name}_OPTS_P} also
+similar to @code{TARGET_@var{name}} but take an argument as @code{gcc_options}.
 
 @item Enum(@var{name})
 The option's argument is a string from the set of strings associated
diff --git a/gcc/opth-gen.awk b/gcc/opth-gen.awk
index c4398be2f3a..26551575d55 100644
--- a/gcc/opth-gen.awk
+++ b/gcc/opth-gen.awk
@@ -439,6 +439,10 @@ for (i = 0; i < n_target_vars; i++)
{
print "#define TARGET_" other_masks[i "," j] \
  " ((" target_vars[i] " & MASK_" other_masks[i "," j] ") 
!= 0)"
+   print "#define TARGET_" other_masks[i "," j] "_P(" 
target_vars[i] ")" \
+ " (((" target_vars[i] ") & MASK_" other_masks[i "," j] ") 
!= 0)"
+   print "#define TARGET_" other_masks[i "," j] "_OPTS_P(opts)" \
+ " (((opts->x_" target_vars[i] ") & MASK_" other_masks[i 
"," j] ") != 0)"
}
 }
 print ""
@@ -469,15 +473,22 @@ for (i = 0; i < n_opts; i++) {
  " ((" vname " & " mask original_name ") != 0)"
print "#define TARGET_" name "_P(" vname ")" \
  " (((" vname ") & " mask original_name ") != 0)"
+   print "#define TARGET_" name "_OPTS_P(opts)" \
+ " (((opts->x_" vname ") & " mask original_name ") != 0)"
print "#define TARGET_EXPLICIT_" name "_P(opts)" \
  " ((opts->x_" vname "_explicit & " mask original_name ") 
!= 0)"
print "#define SET_TARGET_" name "(opts) opts->x_" vname " |= " 
mask original_name
}
 }
 for (i = 0; i < n_extra_masks; i++) {
-   if (extra_mask_macros[extra_masks[i]] == 0)
+   if (extra_mask_macros[extra_masks[i]] == 0) {
print "#define TARGET_" extra_masks[i] \
  " ((target_flags & MASK_" extra_masks[i] ") != 0)"
+   print "#define TARGET_" extra_masks[i] "_P(target_flags)" \
+   

[PATCH v2 0/4] RISC-V target attribute

2023-10-09 Thread Kito Cheng
This patch set implements the target attribute for the RISC-V target,
similar to other targets like x86 or ARM, letting users apply some local
settings on a per-function basis without changing the global settings.

We support arch, tune and cpu first, and will support other target
attributes later; this version DOES NOT include multi-version function
support yet, which is future work, probably for GCC 15.

The full proposal is in the RISC-V C-API document [1], which has been
discussed with the RISC-V LLVM community, so we have consistent syntax
and semantics.

[1] https://github.com/riscv-non-isa/riscv-c-api-doc/pull/35

v2 changelog:
- Resolve awk multi-dimensional issue.
- Tweak code format
- Tweak testcases




RE: [PATCH] RISC-V: Add available vector size for RVV

2023-10-09 Thread Li, Pan2
Committed, thanks Kito.

Pan

-Original Message-
From: Kito Cheng  
Sent: Tuesday, October 10, 2023 11:20 AM
To: Juzhe-Zhong 
Cc: gcc-patches@gcc.gnu.org; kito.ch...@gmail.com; jeffreya...@gmail.com; 
rdapp@gmail.com
Subject: Re: [PATCH] RISC-V: Add available vector size for RVV

LGTM

On Mon, Oct 9, 2023 at 4:23 PM Juzhe-Zhong  wrote:
>
> For RVV, we have VLS modes enable according to TARGET_MIN_VLEN
> from M1 to M8.
>
> For example, when TARGET_MIN_VLEN = 128 bits, we enable
> 128/256/512/1024 bits VLS modes.
>
> This patch fixes following FAIL:
> FAIL: gcc.dg/vect/bb-slp-subgroups-2.c -flto -ffat-lto-objects  
> scan-tree-dump-times slp2 "optimized: basic block" 2
> FAIL: gcc.dg/vect/bb-slp-subgroups-2.c scan-tree-dump-times slp2 "optimized: 
> basic block" 2
>
> gcc/testsuite/ChangeLog:
>
> * lib/target-supports.exp: Add 256/512/1024
>
> ---
>  gcc/testsuite/lib/target-supports.exp | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/lib/target-supports.exp 
> b/gcc/testsuite/lib/target-supports.exp
> index af52c38433d..dc366d35a0a 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -8881,7 +8881,7 @@ proc available_vector_sizes { } {
> lappend result 4096 2048 1024 512 256 128 64 32 16 8 4 2
>  } elseif { [istarget riscv*-*-*] } {
> if { [check_effective_target_riscv_v] } {
> -   lappend result 0 32 64 128
> +   lappend result 0 32 64 128 256 512 1024
> }
> lappend result 128
>  } else {
> --
> 2.36.3
>


[PATCH 2/2] c++: note other candidates when diagnosing deletedness

2023-10-09 Thread Patrick Palka
With the previous improvements in place, we can easily extend our
deletedness diagnostic to note the other candidates:

  deleted16.C: In function ‘int main()’:
  deleted16.C:10:4: error: use of deleted function ‘void f(int)’
 10 |   f(0);
|   ~^~~
  deleted16.C:5:6: note: declared here
  5 | void f(int) = delete;
|  ^
  deleted16.C:5:6: note: candidate: ‘void f(int)’ (deleted)
  deleted16.C:6:6: note: candidate: ‘void f(...)’
  6 | void f(...);
|  ^
  deleted16.C:7:6: note: candidate: ‘void f(int, int)’
  7 | void f(int, int);
|  ^
  deleted16.C:7:6: note:   candidate expects 2 arguments, 1 provided

These notes are disabled when a deleted special member function is
selected, primarily because noting non-viable candidates then introduces
a lot of new "cannot bind reference" errors in the testsuite; e.g. in
cpp0x/initlist-opt1.C we would need to expect an error at A(A&&).

gcc/cp/ChangeLog:

* call.cc (build_over_call): Call print_z_candidates when
diagnosing deletedness.

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/deleted16.C: New test.
---
 gcc/cp/call.cc | 10 +-
 gcc/testsuite/g++.dg/cpp0x/deleted16.C | 11 +++
 2 files changed, 20 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/deleted16.C

diff --git a/gcc/cp/call.cc b/gcc/cp/call.cc
index 648d383ca4e..55fd71636b1 100644
--- a/gcc/cp/call.cc
+++ b/gcc/cp/call.cc
@@ -9873,7 +9873,15 @@ build_over_call (struct z_candidate *cand, int flags, 
tsubst_flags_t complain)
   if (DECL_DELETED_FN (fn))
 {
   if (complain & tf_error)
-   mark_used (fn);
+   {
+ mark_used (fn);
+ /* Note the other candidates we considered unless we selected a
+special member function since the mismatch reasons for other
+candidates are usually uninteresting, e.g. rvalue vs lvalue
+reference binding .  */
+ if (cand->next && !special_memfn_p (fn))
+   print_z_candidates (input_location, cand, /*only_viable_p=*/false);
+   }
   return error_mark_node;
 }
 
diff --git a/gcc/testsuite/g++.dg/cpp0x/deleted16.C 
b/gcc/testsuite/g++.dg/cpp0x/deleted16.C
new file mode 100644
index 000..9fd2fbb1465
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/deleted16.C
@@ -0,0 +1,11 @@
+// Verify we note other candidates when a deleted function is
+// selected by overload resolution.
+// { dg-do compile { target c++11 } }
+
+void f(int) = delete; // { dg-message "declared here|candidate" }
+void f(...); // { dg-message "candidate" }
+void f(int, int); // { dg-message "candidate" }
+
+int main() {
+  f(0); // { dg-error "deleted" }
+}
-- 
2.42.0.325.g3a06386e31



[PATCH 1/2] c++: sort candidates according to viability

2023-10-09 Thread Patrick Palka
This patch:

  * changes splice_viable to move the non-viable candidates to the end
of the list instead of removing them outright
  * makes tourney move the best candidate to the front of the candidate
list
  * adjusts print_z_candidates to preserve our behavior of printing only
viable candidates when diagnosing ambiguity
  * adds a parameter to print_z_candidates to control this default behavior
(the follow-up patch will want to print all candidates when diagnosing
deletedness)

Thus after this patch we have access to the entire candidate list through
the best viable candidate.

This change also happens to fix diagnostics for the below testcase where
we currently neglect to note the third candidate, since the presence of
the two unordered non-strictly viable candidates causes splice_viable to
prematurely get rid of the non-viable third candidate.
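
With the list sorted this way, callers can stop at the first unviable
candidate; in cp/call.cc, cand->viable is 1 for strictly viable, -1 for
non-strictly viable and 0 for unviable, so e.g. (sketch):

  for (z_candidate *c = cands; c && c->viable != 0; c = c->next)
    /* Visits all viable candidates, strictly viable ones first.  */;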

gcc/cp/ChangeLog:

* call.cc: Include "tristate.h".
(splice_viable): Sort the candidate list according to viability.
Don't remove non-viable candidates from the list.
(print_z_candidates): Add defaulted only_viable_p parameter.
By default only print non-viable candidates if there is no
viable candidate.
(tourney): Make 'candidates' parameter a reference.  Ignore
non-viable candidates.  Move the true champ to the front
of the candidates list, and update 'candidates' to point to
the front.

gcc/testsuite/ChangeLog:

* g++.dg/overload/error5.C: New test.
---
 gcc/cp/call.cc | 161 +++--
 gcc/testsuite/g++.dg/overload/error5.C |  11 ++
 2 files changed, 111 insertions(+), 61 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/overload/error5.C

diff --git a/gcc/cp/call.cc b/gcc/cp/call.cc
index 15079ddf6dc..648d383ca4e 100644
--- a/gcc/cp/call.cc
+++ b/gcc/cp/call.cc
@@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "attribs.h"
 #include "decl.h"
 #include "gcc-rich-location.h"
+#include "tristate.h"
 
 /* The various kinds of conversion.  */
 
@@ -160,7 +161,7 @@ static struct obstack conversion_obstack;
 static bool conversion_obstack_initialized;
 struct rejection_reason;
 
-static struct z_candidate * tourney (struct z_candidate *, tsubst_flags_t);
+static struct z_candidate * tourney (struct z_candidate *&, tsubst_flags_t);
 static int equal_functions (tree, tree);
 static int joust (struct z_candidate *, struct z_candidate *, bool,
  tsubst_flags_t);
@@ -176,7 +177,8 @@ static void op_error (const op_location_t &, enum 
tree_code, enum tree_code,
 static struct z_candidate *build_user_type_conversion_1 (tree, tree, int,
 tsubst_flags_t);
 static void print_z_candidate (location_t, const char *, struct z_candidate *);
-static void print_z_candidates (location_t, struct z_candidate *);
+static void print_z_candidates (location_t, struct z_candidate *,
+   tristate = tristate::unknown ());
 static tree build_this (tree);
 static struct z_candidate *splice_viable (struct z_candidate *, bool, bool *);
 static bool any_strictly_viable (struct z_candidate *);
@@ -3718,68 +3720,60 @@ add_template_conv_candidate (struct z_candidate 
**candidates, tree tmpl,
 }
 
 /* The CANDS are the set of candidates that were considered for
-   overload resolution.  Return the set of viable candidates, or CANDS
-   if none are viable.  If any of the candidates were viable, set
+   overload resolution.  Sort CANDS so that the strictly viable
+   candidates appear first, followed by non-strictly viable candidates,
+   followed by unviable candidates.  Returns the first candidate
+   in this sorted list.  If any of the candidates were viable, set
*ANY_VIABLE_P to true.  STRICT_P is true if a candidate should be
-   considered viable only if it is strictly viable.  */
+   considered viable only if it is strictly viable when setting
+   *ANY_VIABLE_P.  */
 
 static struct z_candidate*
 splice_viable (struct z_candidate *cands,
   bool strict_p,
   bool *any_viable_p)
 {
-  struct z_candidate *viable;
-  struct z_candidate **last_viable;
-  struct z_candidate **cand;
-  bool found_strictly_viable = false;
+  z_candidate *strictly_viable = nullptr;
+  z_candidate **strictly_viable_tail = &strictly_viable;
+
+  z_candidate *non_strictly_viable = nullptr;
+  z_candidate **non_strictly_viable_tail = &non_strictly_viable;
+
+  z_candidate *unviable = nullptr;
+  z_candidate **unviable_tail = &unviable;
 
   /* Be strict inside templates, since build_over_call won't actually
  do the conversions to get pedwarns.  */
   if (processing_template_decl)
 strict_p = true;
 
-  viable = NULL;
-  last_viable = &viable;
-  *any_viable_p = false;
-
-  cand = &cands;
-  while (*cand)
+  for (z_candidate *cand = cands; cand; cand = cand->next)
 {
-  struct z_candidate *c = *cand;
   

Re: [PATCH] RISC-V: Add available vector size for RVV

2023-10-09 Thread Kito Cheng
LGTM

On Mon, Oct 9, 2023 at 4:23 PM Juzhe-Zhong  wrote:
>
> For RVV, we have VLS modes enable according to TARGET_MIN_VLEN
> from M1 to M8.
>
> For example, when TARGET_MIN_VLEN = 128 bits, we enable
> 128/256/512/1024 bits VLS modes.
>
> This patch fixes following FAIL:
> FAIL: gcc.dg/vect/bb-slp-subgroups-2.c -flto -ffat-lto-objects  
> scan-tree-dump-times slp2 "optimized: basic block" 2
> FAIL: gcc.dg/vect/bb-slp-subgroups-2.c scan-tree-dump-times slp2 "optimized: 
> basic block" 2
>
> gcc/testsuite/ChangeLog:
>
> * lib/target-supports.exp: Add 256/512/1024
>
> ---
>  gcc/testsuite/lib/target-supports.exp | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/lib/target-supports.exp 
> b/gcc/testsuite/lib/target-supports.exp
> index af52c38433d..dc366d35a0a 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -8881,7 +8881,7 @@ proc available_vector_sizes { } {
> lappend result 4096 2048 1024 512 256 128 64 32 16 8 4 2
>  } elseif { [istarget riscv*-*-*] } {
> if { [check_effective_target_riscv_v] } {
> -   lappend result 0 32 64 128
> +   lappend result 0 32 64 128 256 512 1024
> }
> lappend result 128
>  } else {
> --
> 2.36.3
>


Re: [PATCH] RISC-V: Make xtheadcondmov-indirect tests robust against instruction reordering

2023-10-09 Thread Kito Cheng
I guess you may also want to clean up those bodies for "check-function-bodies"?

On Mon, Oct 9, 2023 at 3:47 PM Christoph Muellner
 wrote:
>
> From: Christoph Müllner 
>
> Fixes: c1bc7513b1d7 ("RISC-V: const: hide mvconst splitter from IRA")
>
> A recent change broke the xtheadcondmov-indirect tests, because the order of
> emitted instructions changed. Since the test is too strict when testing for
> a fixed instruction order, let's change the tests to simply count instruction,
> like it is done for similar tests.
>
> Reported-by: Patrick O'Neill 
> Signed-off-by: Christoph Müllner 
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/xtheadcondmov-indirect.c: Make robust against
> instruction reordering.
>
> Signed-off-by: Christoph Müllner 
> ---
>  .../gcc.target/riscv/xtheadcondmov-indirect.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect.c 
> b/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect.c
> index c3253ba5239..eba1b86137b 100644
> --- a/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect.c
> +++ b/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect.c
> @@ -1,8 +1,7 @@
>  /* { dg-do compile } */
> -/* { dg-options "-march=rv32gc_xtheadcondmov -fno-sched-pressure" { target { 
> rv32 } } } */
> -/* { dg-options "-march=rv64gc_xtheadcondmov -fno-sched-pressure" { target { 
> rv64 } } } */
> +/* { dg-options "-march=rv32gc_xtheadcondmov" { target { rv32 } } } */
> +/* { dg-options "-march=rv64gc_xtheadcondmov" { target { rv64 } } } */
>  /* { dg-skip-if "" { *-*-* } {"-O0" "-Os" "-Og" "-Oz" "-flto" } } */
> -/* { dg-final { check-function-bodies "**" "" } } */
>
>  /*
>  ** ConEmv_imm_imm_reg:
> @@ -116,3 +115,9 @@ int ConNmv_reg_reg_reg(int x, int y, int z, int n)
>  return z;
>return n;
>  }
> +
> +/* { dg-final { scan-assembler-times "addi\t" 5 } } */
> +/* { dg-final { scan-assembler-times "li\t" 4 } } */
> +/* { dg-final { scan-assembler-times "sub\t" 4 } } */
> +/* { dg-final { scan-assembler-times "th.mveqz\t" 4 } } */
> +/* { dg-final { scan-assembler-times "th.mvnez\t" 4 } } */
> --
> 2.41.0
>


[PATCH] RISC-V Regression: Fix FAIL of predcom-2.c

2023-10-09 Thread Juzhe-Zhong
Like GCN, add -fno-tree-vectorize.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/predcom-2.c: Add riscv.

---
 gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c 
b/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c
index f19edd4cd74..681ff7c696b 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-2.c
@@ -1,6 +1,6 @@
 /* { dg-do run } */
 /* { dg-options "-O2 -funroll-loops --param max-unroll-times=8 
-fpredictive-commoning -fdump-tree-pcom-details-blocks -fno-tree-pre" } */
-/* { dg-additional-options "-fno-tree-vectorize" { target amdgcn-*-* } } */
+/* { dg-additional-options "-fno-tree-vectorize" { target amdgcn-*-* 
riscv*-*-* } } */
 
 void abort (void);
 
-- 
2.36.3



[PATCH] use get_range_query to replace get_global_range_query

2023-10-09 Thread Jiufu Guo
Hi,

For "get_global_range_query" SSA_NAME_RANGE_INFO can be queried.
For "get_range_query", it could get more context-aware range info.
And look at the implementation of "get_range_query",  it returns
global range if no local fun info.
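
Roughly, the fallback looks like this (a sketch of value-query.h, not the
exact source):

  inline range_query *
  get_range_query (const struct function *fun)
  {
    return (fun && fun->x_range_query) ? fun->x_range_query : &global_ranges;
  }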

So, when not querying for an SSA_NAME, it would be OK to use
get_range_query to replace get_global_range_query.

Patch https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630389.html
uses get_range_query, which could handle more cases.

This patch replaces get_global_range_query with get_range_query in
most places where this is possible (but does not draft new test cases).

Passes bootstrap & regtest on ppc64{,le} and x86_64.
Is this ok for trunk?


BR,
Jeff (Jiufu Guo)

gcc/ChangeLog:

* builtins.cc (expand_builtin_strnlen): Replace get_global_range_query
by get_range_query.
* fold-const.cc (expr_not_equal_to): Likewise.
* gimple-fold.cc (size_must_be_zero_p): Likewise.
* gimple-range-fold.cc (fur_source::fur_source): Likewise.
* gimple-ssa-warn-access.cc (check_nul_terminated_array): Likewise.
* tree-dfa.cc (get_ref_base_and_extent): Likewise.
* tree-ssa-loop-split.cc (split_at_bb_p): Likewise.
* tree-ssa-loop-unswitch.cc (evaluate_control_stmt_using_entry_checks):
Likewise.

---
 gcc/builtins.cc   | 2 +-
 gcc/fold-const.cc | 6 +-
 gcc/gimple-fold.cc| 6 ++
 gcc/gimple-range-fold.cc  | 4 +---
 gcc/gimple-ssa-warn-access.cc | 2 +-
 gcc/tree-dfa.cc   | 5 +
 gcc/tree-ssa-loop-split.cc| 2 +-
 gcc/tree-ssa-loop-unswitch.cc | 2 +-
 8 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index cb90bd03b3e..4e0a77ff8e0 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -3477,7 +3477,7 @@ expand_builtin_strnlen (tree exp, rtx target, 
machine_mode target_mode)
 
   wide_int min, max;
   value_range r;
-  get_global_range_query ()->range_of_expr (r, bound);
+  get_range_query (cfun)->range_of_expr (r, bound);
   if (r.varying_p () || r.undefined_p ())
 return NULL_RTX;
   min = r.lower_bound ();
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 4f8561509ff..15134b21b9f 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -11056,11 +11056,7 @@ expr_not_equal_to (tree t, const wide_int &w)
   if (!INTEGRAL_TYPE_P (TREE_TYPE (t)))
return false;
 
-  if (cfun)
-   get_range_query (cfun)->range_of_expr (vr, t);
-  else
-   get_global_range_query ()->range_of_expr (vr, t);
-
+  get_range_query (cfun)->range_of_expr (vr, t);
   if (!vr.undefined_p () && !vr.contains_p (w))
return true;
   /* If T has some known zero bits and W has any of those bits set,
diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
index dc89975270c..853edd9e5d4 100644
--- a/gcc/gimple-fold.cc
+++ b/gcc/gimple-fold.cc
@@ -876,10 +876,8 @@ size_must_be_zero_p (tree size)
   wide_int zero = wi::zero (TYPE_PRECISION (type));
   value_range valid_range (type, zero, ssize_max);
   value_range vr;
-  if (cfun)
-get_range_query (cfun)->range_of_expr (vr, size);
-  else
-get_global_range_query ()->range_of_expr (vr, size);
+  get_range_query (cfun)->range_of_expr (vr, size);
+
   if (vr.undefined_p ())
 vr.set_varying (TREE_TYPE (size));
   vr.intersect (valid_range);
diff --git a/gcc/gimple-range-fold.cc b/gcc/gimple-range-fold.cc
index d1945ccb554..6e9530c3d7f 100644
--- a/gcc/gimple-range-fold.cc
+++ b/gcc/gimple-range-fold.cc
@@ -50,10 +50,8 @@ fur_source::fur_source (range_query *q)
 {
   if (q)
 m_query = q;
-  else if (cfun)
-m_query = get_range_query (cfun);
   else
-m_query = get_global_range_query ();
+m_query = get_range_query (cfun);
   m_gori = NULL;
 }
 
diff --git a/gcc/gimple-ssa-warn-access.cc b/gcc/gimple-ssa-warn-access.cc
index fcaff128d60..e439d1b9b68 100644
--- a/gcc/gimple-ssa-warn-access.cc
+++ b/gcc/gimple-ssa-warn-access.cc
@@ -332,7 +332,7 @@ check_nul_terminated_array (GimpleOrTree expr, tree src, 
tree bound)
 {
   Value_Range r (TREE_TYPE (bound));
 
-  get_global_range_query ()->range_of_expr (r, bound);
+  get_range_query (cfun)->range_of_expr (r, bound);
 
   if (r.undefined_p () || r.varying_p ())
return true;
diff --git a/gcc/tree-dfa.cc b/gcc/tree-dfa.cc
index af8e9243947..5355af2c869 100644
--- a/gcc/tree-dfa.cc
+++ b/gcc/tree-dfa.cc
@@ -531,10 +531,7 @@ get_ref_base_and_extent (tree exp, poly_int64 *poffset,
 
value_range vr;
range_query *query;
-   if (cfun)
- query = get_range_query (cfun);
-   else
- query = get_global_range_query ();
+   query = get_range_query (cfun);
 
if (TREE_CODE (index) == SSA_NAME
&& (low_bound = array_ref_low_bound (exp),
diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index 64464802c1e..e85a1881526 100644
--- a/g

[PATCH] RISC-V Regression: Make match patterns more accurate

2023-10-09 Thread Juzhe-Zhong
This patch fixes the following 2 FAILs in the RVV regression, since the
check was not accurate.

It's inspired by Robin's previous patch:
https://patchwork.sourceware.org/project/gcc/patch/dde89b9e-49a0-d70b-0906-fb3022cac...@gmail.com/
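
The tempered pattern `detected(?:(?!failed)(?!Re-trying).)*succeeded` only
counts a "detected" occurrence when it is followed by "succeeded" with no
intervening "failed" or "Re-trying", i.e. (illustrative dump lines, not
verbatim):

  vect_recog_widen_mult_pattern: detected ... succeeded      <- counted
  vect_recog_widen_mult_pattern: detected ... Re-trying ...  <- not counted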

gcc/testsuite/ChangeLog:

* gcc.dg/vect/no-scevccp-outer-7.c: Adjust regex pattern.
* gcc.dg/vect/no-scevccp-vect-iv-3.c: Ditto.

---
 gcc/testsuite/gcc.dg/vect/no-scevccp-outer-7.c   | 2 +-
 gcc/testsuite/gcc.dg/vect/no-scevccp-vect-iv-3.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-7.c 
b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-7.c
index 543ee98b5a4..058d1d2db2d 100644
--- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-7.c
+++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-7.c
@@ -77,4 +77,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
target vect_widen_mult_hi_to_si } } } */
-/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 
1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-vect-iv-3.c 
b/gcc/testsuite/gcc.dg/vect/no-scevccp-vect-iv-3.c
index 7049e4936b9..6f2b2210b11 100644
--- a/gcc/testsuite/gcc.dg/vect/no-scevccp-vect-iv-3.c
+++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-vect-iv-3.c
@@ -30,4 +30,4 @@ unsigned int main1 ()
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target 
vect_widen_sum_hi_to_si } } } */
-/* { dg-final { scan-tree-dump-times "vect_recog_widen_sum_pattern: detected" 
1 "vect" { target vect_widen_sum_hi_to_si } } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_widen_sum_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" { target 
vect_widen_sum_hi_to_si } } } */
-- 
2.36.3



[PATCH V1] introduce light expander sra

2023-10-09 Thread Jiufu Guo
Hi,

There are a few PRs (meta-bug PR101926) on various targets.
Their root causes are similar: the aggregate params/returns
are passed in multiple registers, but they are first stored
to the stack from those registers, and the parameter is then
accessed through the stack slot.

A general idea to enhance this: access the aggregate
parameters/returns directly through registers.  This would
be a kind of SRA (using the scalar registers to access the
aggregate parameters/returns).
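
A typical shape of the motivating code (PR65421 on ppc64le; a sketch, the
exact registers depend on the target ABI):

  typedef struct { double a, b; } DF;
  DF foo (DF arg) { return arg; }  /* arg/result live in two FPRs; without
                                      expander-SRA they bounce through a
                                      stack slot.  */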

This experimental patch for light-expander-sra contains the
following parts:

a. Check if the parameters/returns are ok/profitable to
   scalarize, and set the scalar pseudos for the
   parameter/return.
  - This is done in "expand_function_start", after the
incoming/outgoing hard registers are determined for the
paramter(s)/return.
The scalarized registers are recorded in DECL_RTL for
the parameter/return in parallel form.
  - At the time when setting DECL_RTL, "scalarizable_aggregate"
is called to check the accesses are ok/profitable to
scalarize.
We can continue to enhance this function, to support
more cases.  For example:
- 'reverse storage order'.
- 'TImode/vector-mode from multi-regs'.
- some cases on 'writing to parameter'/'overlap accesses'.

b. When expanding the accesses of the parameters/returns,
   according to the info of the access(e.g. bitpos,bitsize,
   mode), the scalar(pseudos) can be figured out to expand
   the access.  This may happen when expand below accesses:
  - The component access of a parameter: "_1 = arg.f1".
Or whole parameter access: rhs of "_2 = arg"
  - The assignment to a return val:
"D.xx = yy; or D.xx.f = zz" where D.xx occurs on return
stmt.
  - This is mainly done in expr.cc(expand_expr_real_1, and
expand_assignment).  Function "extract_sub_member" is
used to figure out the scalar rtxs(pseudos).

Besides the above two parts, some work is done on the GIMPLE
tree: collecting SRA candidates for parameters/returns, and
collecting the SRA access info.
This is mainly done at the beginning of the expander pass by
the class (named expand_sra) and its member functions.
Below are two major items of this part.
 - Collect light-expand-sra candidates.
  Each parameter is checked if it has the proper aggregate type.
  Collect return val (VAR_P) on each return stmts if the
  function is returning via registers.  
  This is implemented in expand_sra::collect_sra_candidates. 

 - Build/collect/manage all the access on the candidates.
  The function "scan_function" is used to do this work, it
  goes through all basicblocks, and all interesting stmts (
  phi, return, assign, call, asm) are checked.
  If there is an interesting expression (e.g. COMPONENT_REF
  or PARM_DECL), then record the required info for the access
  (e.g. pos, size, type, base).
  And if it is risky to do SRA, the candidates may be removed.
  e.g. address-taken and accessed via memory.
  "foo(struct S arg) {bar (&arg);}"

This patch also tries to share common code between light-expander-sra,
tree-sra, and ipa-sra.
We can continue refactoring to share similar functionality.

Compared with the previous version, this version avoids storing
the parameter to the stack if it is scalarized.
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631177.html

This patch is tested on ppc64{,le} and x86_64.
Is this ok for trunk?

BR,
Jeff (Jiufu Guo)

PR target/65421

gcc/ChangeLog:

* cfgexpand.cc (struct access): New class.
(struct expand_sra): New class.
(expand_sra::collect_sra_candidates): New member function.
(expand_sra::add_sra_candidate): Likewise.
(expand_sra::build_access): Likewise.
(expand_sra::analyze_phi): Likewise.
(expand_sra::analyze_assign): Likewise.
(expand_sra::visit_base): Likewise.
(expand_sra::protect_mem_access_in_stmt): Likewise.
(expand_sra::expand_sra):  Class constructor.
(expand_sra::~expand_sra): Class destructor.
(expand_sra::scalarizable_access):  New member function.
(expand_sra::scalarizable_accesses):  Likewise.
(scalarizable_aggregate):  New function.
(set_scalar_rtx_for_returns):  New function.
(expand_value_return): Updated.
(expand_debug_expr): Updated.
(pass_expand::execute): Updated to use expand_sra.
* cfgexpand.h (scalarizable_aggregate): New declare.
(set_scalar_rtx_for_returns): New declare.
* expr.cc (expand_assignment): Updated.
(expand_constructor): Updated.
(query_position_in_parallel): New function.
(extract_sub_member): New function.
(expand_expr_real_1): Updated.
* expr.h (query_position_in_parallel): New declare.
* function.cc (assign_parm_setup_block): Updated.
(assign_parms): Updated.
(expand_function_start): Updated.
* tree-sra.h (struct sra_base_access): New class.
(struct sra_default_analyzer): New class.
(sca

[PATCH] RISC-V Regression: Fix FAIL of bb-slp-pr65935.c for RVV

2023-10-09 Thread Juzhe-Zhong
Here is a reference comparing the dump IR between ARM SVE and RVV:

https://godbolt.org/z/zqess8Gss

We can see RVV has one more dump line:
optimized: basic block part vectorized using 128 byte vectors
since RVV has 1024-bit vectors.

The codegen is reasonably good.

However, I saw GCN also has 1024-bit vectors.
Could this patch cause this case to FAIL on the GCN port?

Hi, GCN folks, could you check this patch on the GCN port for me?

gcc/testsuite/ChangeLog:

* gcc.dg/vect/bb-slp-pr65935.c: Add vect1024 variant.
* lib/target-supports.exp: Ditto.

---
 gcc/testsuite/gcc.dg/vect/bb-slp-pr65935.c | 3 ++-
 gcc/testsuite/lib/target-supports.exp  | 6 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr65935.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-pr65935.c
index 8df35327e7a..9ef1330b47c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-pr65935.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr65935.c
@@ -67,7 +67,8 @@ int main()
 
 /* We should also be able to use 2-lane SLP to initialize the real and
imaginary components in the first loop of main.  */
-/* { dg-final { scan-tree-dump-times "optimized: basic block" 10 "slp1" } } */
+/* { dg-final { scan-tree-dump-times "optimized: basic block" 10 "slp1" { 
target {! { vect1024 } } } } } */
+/* { dg-final { scan-tree-dump-times "optimized: basic block" 11 "slp1" { 
target { { vect1024 } } } } } */
 /* We should see the s->phase[dir] operand splatted and no other operand built
from scalars.  See PR97334.  */
 /* { dg-final { scan-tree-dump "Using a splat" "slp1" } } */
diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index dc366d35a0a..95c489d7f76 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -8903,6 +8903,12 @@ proc check_effective_target_vect_variable_length { } {
 return [expr { [lindex [available_vector_sizes] 0] == 0 }]
 }
 
+# Return 1 if the target supports vectors of 1024 bits.
+
+proc check_effective_target_vect1024 { } {
+return [expr { [lsearch -exact [available_vector_sizes] 1024] >= 0 }]
+}
+
 # Return 1 if the target supports vectors of 512 bits.
 
 proc check_effective_target_vect512 { } {
-- 
2.36.3



[PATCH] RISC-V Regression: Fix dump check of bb-slp-68.c

2023-10-09 Thread Juzhe-Zhong
Like GCN, RVV also has 64-byte vectors (512 bits), which causes a FAIL in
this test.

It's more reasonable to use "vect512" instead of amdgcn-*-*.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/bb-slp-68.c: Use vect512.

---
 gcc/testsuite/gcc.dg/vect/bb-slp-68.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-68.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-68.c
index e7573a14933..2dd3d8ee90c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-68.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-68.c
@@ -20,4 +20,4 @@ void foo ()
 
 /* We want to have the store group split into 4, 2, 4 when using 32byte 
vectors.
Unfortunately it does not work when 64-byte vectors are available.  */
-/* { dg-final { scan-tree-dump-not "from scalars" "slp2" { xfail amdgcn-*-* } 
} } */
+/* { dg-final { scan-tree-dump-not "from scalars" "slp2" { xfail vect512 } } } 
*/
-- 
2.36.3



Re: Re: [PATCH] RISC-V/testsuite: Enable `vect_pack_trunc'

2023-10-09 Thread juzhe.zh...@rivai.ai
Oh. I realize this patch reintroduces a FAIL that I recently fixed:
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632247.html

This fails because RVV doesn't have vec_pack_trunc_optab (the loop
vectorizer fails the first time but succeeds the second time),
so RVV dumps FOLD_EXTRACT_LAST 4 times instead of 2 (ARM SVE dumps it
2 times because they have vec_pack_trunc_optab).

I think the root cause of RVV failing multiple "vect" tests is that we
don't enable the vec_pack/vec_unpack/... optabs; we still succeed at the
vectorizations and we want to enable those tests (mostly they just use a
different approach to vectorize, causing dump FAILs, because of some
changes I made previously in the middle end).

So enabling "vec_pack" for RVV would fix some FAILs but introduce some
other FAILs.

CC to Richi to see more reasonable suggestions.



juzhe.zh...@rivai.ai
 
From: Maciej W. Rozycki
Sent: 2023-10-10 06:38
To: 钟居哲
Cc: gcc-patches; Jeff Law; rdapp.gcc; kito.cheng
Subject: Re: Re: [PATCH] RISC-V/testsuite: Enable `vect_pack_trunc'
On Tue, 10 Oct 2023, 钟居哲 wrote:
 
> Btw, could you rebase to the trunk and run regression again?
 
Full regression-testing takes roughly 40 hours here and I do not normally
update the tree midway through my work so as not to add variables and end 
up chasing a moving target, especially with such an unstable state that we 
have ended up with recently with the RISC-V port.  Since I'm done with 
this part I can refresh and schedule another run if you are curious as to 
how it looks from my side.  For the C subset alone it'll take less.
 
  Maciej
 


[PATCH] RISC-V: Add available vector size for RVV

2023-10-09 Thread Juzhe-Zhong
For RVV, we have VLS modes enabled according to TARGET_MIN_VLEN,
from M1 to M8.

For example, when TARGET_MIN_VLEN = 128 bits, we enable
128/256/512/1024 bits VLS modes.

This patch fixes the following FAILs:
FAIL: gcc.dg/vect/bb-slp-subgroups-2.c -flto -ffat-lto-objects  
scan-tree-dump-times slp2 "optimized: basic block" 2
FAIL: gcc.dg/vect/bb-slp-subgroups-2.c scan-tree-dump-times slp2 "optimized: 
basic block" 2

gcc/testsuite/ChangeLog:

* lib/target-supports.exp: Add 256/512/1024 to the available vector
sizes for RVV.

---
 gcc/testsuite/lib/target-supports.exp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index af52c38433d..dc366d35a0a 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -8881,7 +8881,7 @@ proc available_vector_sizes { } {
lappend result 4096 2048 1024 512 256 128 64 32 16 8 4 2
 } elseif { [istarget riscv*-*-*] } {
if { [check_effective_target_riscv_v] } {
-   lappend result 0 32 64 128
+   lappend result 0 32 64 128 256 512 1024
}
lappend result 128
 } else {
-- 
2.36.3



Re: xthead regression with [COMMITTED] RISC-V: const: hide mvconst splitter from IRA

2023-10-09 Thread Christoph Müllner
On Mon, Oct 9, 2023 at 10:48 PM Vineet Gupta  wrote:
>
> On 10/9/23 13:46, Christoph Müllner wrote:
> > Given that this causes repeated issues, I think that a fall-back to
> > counting occurrences is the right thing to do. I can do that if that's ok.
>
> Thanks Christoph.

Tested patch on list:
  https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632393.html

>
> -Vineet


[PATCH] RISC-V: Make xtheadcondmov-indirect tests robust against instruction reordering

2023-10-09 Thread Christoph Muellner
From: Christoph Müllner 

Fixes: c1bc7513b1d7 ("RISC-V: const: hide mvconst splitter from IRA")

A recent change broke the xtheadcondmov-indirect tests, because the order of
emitted instructions changed. Since the test is too strict when testing for
a fixed instruction order, let's change the tests to simply count instructions,
as is done for similar tests.

Reported-by: Patrick O'Neill 
Signed-off-by: Christoph Müllner 

gcc/testsuite/ChangeLog:

* gcc.target/riscv/xtheadcondmov-indirect.c: Make robust against
instruction reordering.

Signed-off-by: Christoph Müllner 
---
 .../gcc.target/riscv/xtheadcondmov-indirect.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect.c 
b/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect.c
index c3253ba5239..eba1b86137b 100644
--- a/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect.c
+++ b/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect.c
@@ -1,8 +1,7 @@
 /* { dg-do compile } */
-/* { dg-options "-march=rv32gc_xtheadcondmov -fno-sched-pressure" { target { 
rv32 } } } */
-/* { dg-options "-march=rv64gc_xtheadcondmov -fno-sched-pressure" { target { 
rv64 } } } */
+/* { dg-options "-march=rv32gc_xtheadcondmov" { target { rv32 } } } */
+/* { dg-options "-march=rv64gc_xtheadcondmov" { target { rv64 } } } */
 /* { dg-skip-if "" { *-*-* } {"-O0" "-Os" "-Og" "-Oz" "-flto" } } */
-/* { dg-final { check-function-bodies "**" "" } } */
 
 /*
 ** ConEmv_imm_imm_reg:
@@ -116,3 +115,9 @@ int ConNmv_reg_reg_reg(int x, int y, int z, int n)
 return z;
   return n;
 }
+
+/* { dg-final { scan-assembler-times "addi\t" 5 } } */
+/* { dg-final { scan-assembler-times "li\t" 4 } } */
+/* { dg-final { scan-assembler-times "sub\t" 4 } } */
+/* { dg-final { scan-assembler-times "th.mveqz\t" 4 } } */
+/* { dg-final { scan-assembler-times "th.mvnez\t" 4 } } */
-- 
2.41.0



Re: Re: [PATCH] RISC-V/testsuite: Enable `vect_pack_trunc'

2023-10-09 Thread Maciej W. Rozycki
On Tue, 10 Oct 2023, 钟居哲 wrote:

> Btw, could you rebase to the trunk and run regression again?

 Full regression-testing takes roughly 40 hours here and I do not normally
update the tree midway through my work so as not to add variables and end 
up chasing a moving target, especially with such an unstable state that we 
have ended up with recently with the RISC-V port.  Since I'm done with 
this part I can refresh and schedule another run if you are curious as to 
how it looks from my side.  For the C subset alone it'll take less time.

  Maciej


Re: Re: [PATCH] RISC-V/testsuite: Enable `vect_pack_trunc'

2023-10-09 Thread 钟居哲
I know you want vect_int to block the test for rv64gc, 
but unfortunately it fails.

And I have changed everything to run the vect testsuite with "riscv_v":
[PATCH] RISC-V: Enable more tests of "vect" for RVV (gnu.org)

So, to be consistent, please add "riscv_v".



juzhe.zh...@rivai.ai
 
From: Maciej W. Rozycki
Date: 2023-10-10 06:29
To: 钟居哲
CC: gcc-patches; Jeff Law; rdapp.gcc; kito.cheng
Subject: Re: [PATCH] RISC-V/testsuite: Enable `vect_pack_trunc'
On Tue, 10 Oct 2023, 钟居哲 wrote:
 
>  && [check_effective_target_arm_little_endian])
>   || ([istarget mips*-*-*]
>  && [et-is-effective-target mips_msa])
> +  || [istarget riscv*-*-*]
>   || ([istarget s390*-*-*]
>  && [check_effective_target_s390_vx])
>   || [istarget amdgcn*-*-*] }}]
> 
> You should change it into:
> 
> || ([istarget riscv*-*-*]
>  && [check_effective_target_riscv_v])
> 
> Then, these additional FAILs will be removed:
> 
> with no changes (except for intermittent Python failures for C++) with the 
> remaining testsuites.  There are a few regressions in `-march=rv64gc' 
> testing:
> +FAIL: gcc.dg/vect/pr97678.c scan-tree-dump vect "vectorizing stmts using SLP"
> +FAIL: gcc.dg/vect/slp-13-big-array.c scan-tree-dump-times vect "vectorizing 
> stmts using SLP" 3
> +FAIL: gcc.dg/vect/slp-13.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 3
> +FAIL: gcc.dg/vect/pr97678.c -flto -ffat-lto-objects  scan-tree-dump vect 
> "vectorizing stmts using SLP"
> +FAIL: gcc.dg/vect/slp-13-big-array.c -flto -ffat-lto-objects  
> scan-tree-dump-times vect "vectorizing stmts using SLP" 3
> +FAIL: gcc.dg/vect/slp-13.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 3
 
I explained in the change description why the check for `riscv_v' isn't 
needed here: the tests mustn't run in the first place, so naturally they 
cannot fail either.  If I missed anything, then please elaborate.
 
  Maciej
 


Re: [PATCH] RISC-V/testsuite: Enable `vect_pack_trunc'

2023-10-09 Thread Maciej W. Rozycki
On Tue, 10 Oct 2023, 钟居哲 wrote:

>&& [check_effective_target_arm_little_endian])
>|| ([istarget mips*-*-*]
>&& [et-is-effective-target mips_msa])
> +  || [istarget riscv*-*-*]
>|| ([istarget s390*-*-*]
>&& [check_effective_target_s390_vx])
>   || [istarget amdgcn*-*-*] }}]
> 
> You should change it into:
> 
> || ([istarget riscv*-*-*]
>  && [check_effective_target_riscv_v])
> 
> Then, these additional FAILs will be removed:
> 
> with no changes (except for intermittent Python failures for C++) with the 
> remaining testsuites.  There are a few regressions in `-march=rv64gc' 
> testing:
> +FAIL: gcc.dg/vect/pr97678.c scan-tree-dump vect "vectorizing stmts using SLP"
> +FAIL: gcc.dg/vect/slp-13-big-array.c scan-tree-dump-times vect "vectorizing 
> stmts using SLP" 3
> +FAIL: gcc.dg/vect/slp-13.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 3
> +FAIL: gcc.dg/vect/pr97678.c -flto -ffat-lto-objects  scan-tree-dump vect 
> "vectorizing stmts using SLP"
> +FAIL: gcc.dg/vect/slp-13-big-array.c -flto -ffat-lto-objects  
> scan-tree-dump-times vect "vectorizing stmts using SLP" 3
> +FAIL: gcc.dg/vect/slp-13.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 3

 I explained in the change description why the check for `riscv_v' isn't 
needed here: the tests mustn't run in the first place, so naturally they 
cannot fail either.  If I missed anything, then please elaborate.

  Maciej


Re: [PATCH] RISC-V/testsuite: Enable `vect_pack_trunc'

2023-10-09 Thread 钟居哲
Btw, could you rebase to the trunk and run regression again?

I saw your report 670 FAILs:
# of expected passes   187616
# of unexpected failures   672
# of unexpected successes  14
# of expected failures 1436
# of unresolved testcases  615
# of unsupported tests 4731

I have recently been working on fixing FAILs in the RISC-V regression 
testing, so your report looks odd.
This is my report:

# of expected passes183613
# of unexpected failures92
# of unexpected successes   12
# of expected failures  1383
# of unresolved testcases   4
# of unsupported tests  4223

As my report shows, there should be fewer than 100 FAILs.


juzhe.zh...@rivai.ai
 
From: 钟居哲
Date: 2023-10-10 06:17
To: gcc-patches
Cc: macro; Jeff Law; rdapp.gcc; kito.cheng
Subject: [PATCH] RISC-V/testsuite: Enable `vect_pack_trunc'
 && [check_effective_target_arm_little_endian])
 || ([istarget mips*-*-*]
 && [et-is-effective-target mips_msa])
+|| [istarget riscv*-*-*]
 || ([istarget s390*-*-*]
 && [check_effective_target_s390_vx])
  || [istarget amdgcn*-*-*] }}]

You should change it into:

|| ([istarget riscv*-*-*]
 && [check_effective_target_riscv_v])

Then, these additional FAILs will be removed:

with no changes (except for intermittent Python failures for C++) with the 
remaining testsuites.  There are a few regressions in `-march=rv64gc' 
testing:
+FAIL: gcc.dg/vect/pr97678.c scan-tree-dump vect "vectorizing stmts using SLP"
+FAIL: gcc.dg/vect/slp-13-big-array.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 3
+FAIL: gcc.dg/vect/slp-13.c scan-tree-dump-times vect "vectorizing stmts using 
SLP" 3
+FAIL: gcc.dg/vect/pr97678.c -flto -ffat-lto-objects  scan-tree-dump vect 
"vectorizing stmts using SLP"
+FAIL: gcc.dg/vect/slp-13-big-array.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorizing stmts using SLP" 3
+FAIL: gcc.dg/vect/slp-13.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorizing stmts using SLP" 3


juzhe.zh...@rivai.ai


[PATCH] RISC-V/testsuite: Enable `vect_pack_trunc'

2023-10-09 Thread 钟居哲
 && [check_effective_target_arm_little_endian])
 || ([istarget mips*-*-*]
 && [et-is-effective-target mips_msa])
+|| [istarget riscv*-*-*]
 || ([istarget s390*-*-*]
 && [check_effective_target_s390_vx])
  || [istarget amdgcn*-*-*] }}]

You should change it into:

|| ([istarget riscv*-*-*]
 && [check_effective_target_riscv_v])

Then, these additional FAILs will be removed:

with no changes (except for intermittent Python failures for C++) with the 
remaining testsuites.  There are a few regressions in `-march=rv64gc' 
testing:
+FAIL: gcc.dg/vect/pr97678.c scan-tree-dump vect "vectorizing stmts using SLP"
+FAIL: gcc.dg/vect/slp-13-big-array.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 3
+FAIL: gcc.dg/vect/slp-13.c scan-tree-dump-times vect "vectorizing stmts using 
SLP" 3
+FAIL: gcc.dg/vect/pr97678.c -flto -ffat-lto-objects  scan-tree-dump vect 
"vectorizing stmts using SLP"
+FAIL: gcc.dg/vect/slp-13-big-array.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorizing stmts using SLP" 3
+FAIL: gcc.dg/vect/slp-13.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorizing stmts using SLP" 3


juzhe.zh...@rivai.ai


[PATCH] RISC-V/testsuite: Enable `vect_pack_trunc'

2023-10-09 Thread Maciej W. Rozycki
Despite not defining `vec_pack_trunc_<mode>' standard named patterns the 
backend provides vector pack operations via its own `@pred_trunc<mode>' 
set of patterns, and they do trigger in vectorization, producing narrowing 
VNCVT.X.X.W assembly instructions as expected.

Enable the `vect_pack_trunc' setting for RISC-V targets then, improving
GCC C test results in `-march=rv64gcv' testing as follows:

-FAIL: gcc.dg/vect/pr57705.c scan-tree-dump-times vect "vectorized 1 loop" 2
+PASS: gcc.dg/vect/pr57705.c scan-tree-dump-times vect "vectorized 1 loop" 3
+PASS: gcc.dg/vect/pr59354.c scan-tree-dump vect "vectorized 1 loop"
-UNSUPPORTED: gcc.dg/vect/pr97678.c
+PASS: gcc.dg/vect/pr97678.c (test for excess errors)
+PASS: gcc.dg/vect/pr97678.c execution test
+XFAIL: gcc.dg/vect/pr97678.c scan-tree-dump vect "vectorizing stmts using SLP"
-UNSUPPORTED: gcc.dg/vect/vect-bool-cmp.c
+PASS: gcc.dg/vect/vect-bool-cmp.c (test for excess errors)
+PASS: gcc.dg/vect/vect-bool-cmp.c execution test
+PASS: gcc.dg/vect/vect-iv-4.c scan-tree-dump-times vect "vectorized 1 loops" 1
+PASS: gcc.dg/vect/vect-multitypes-14.c scan-tree-dump-times vect "vectorized 1 
loops" 1
+PASS: gcc.dg/vect/vect-multitypes-8.c scan-tree-dump-times vect "vectorized 1 
loops" 1
+PASS: gcc.dg/vect/vect-reduc-dot-u16b.c scan-tree-dump-times vect "vectorized 
1 loops" 1
+PASS: gcc.dg/vect/vect-strided-store-u16-i4.c scan-tree-dump-times vect 
"vectorized 1 loops" 2
-PASS: gcc.dg/vect/slp-13-big-array.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 2
+XFAIL: gcc.dg/vect/slp-13-big-array.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 3
-PASS: gcc.dg/vect/slp-13.c scan-tree-dump-times vect "vectorizing stmts using 
SLP" 2
+XFAIL: gcc.dg/vect/slp-13.c scan-tree-dump-times vect "vectorizing stmts using 
SLP" 3
+PASS: gcc.dg/vect/slp-multitypes-10.c scan-tree-dump-times vect "vectorized 1 
loops" 1
+PASS: gcc.dg/vect/slp-multitypes-10.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 1
+PASS: gcc.dg/vect/slp-multitypes-5.c scan-tree-dump-times vect "vectorized 1 
loops" 1
+PASS: gcc.dg/vect/slp-multitypes-5.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 1
+PASS: gcc.dg/vect/slp-multitypes-6.c scan-tree-dump-times vect "vectorized 1 
loops" 1
+PASS: gcc.dg/vect/slp-multitypes-6.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 1
+PASS: gcc.dg/vect/slp-multitypes-9.c scan-tree-dump-times vect "vectorized 1 
loops" 1
+PASS: gcc.dg/vect/slp-multitypes-9.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 1
-UNSUPPORTED: gcc.dg/vect/slp-perm-12.c
+PASS: gcc.dg/vect/slp-perm-12.c (test for excess errors)
+PASS: gcc.dg/vect/slp-perm-12.c execution test
-UNSUPPORTED: gcc.dg/vect/bb-slp-11.c
+PASS: gcc.dg/vect/bb-slp-11.c (test for excess errors)
+PASS: gcc.dg/vect/bb-slp-11.c execution test
-FAIL: gcc.dg/vect/pr57705.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorized 1 loop" 2
+PASS: gcc.dg/vect/pr57705.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorized 1 loop" 3
+PASS: gcc.dg/vect/pr59354.c -flto -ffat-lto-objects  scan-tree-dump vect 
"vectorized 1 loop"
-UNSUPPORTED: gcc.dg/vect/pr97678.c -flto -ffat-lto-objects
+PASS: gcc.dg/vect/pr97678.c -flto -ffat-lto-objects (test for excess errors)
+PASS: gcc.dg/vect/pr97678.c -flto -ffat-lto-objects execution test
+XFAIL: gcc.dg/vect/pr97678.c -flto -ffat-lto-objects  scan-tree-dump vect 
"vectorizing stmts using SLP"
-UNSUPPORTED: gcc.dg/vect/vect-bool-cmp.c -flto -ffat-lto-objects
+PASS: gcc.dg/vect/vect-bool-cmp.c -flto -ffat-lto-objects (test for excess 
errors)
+PASS: gcc.dg/vect/vect-bool-cmp.c -flto -ffat-lto-objects execution test
+PASS: gcc.dg/vect/vect-iv-4.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "vectorized 1 loops" 1
+PASS: gcc.dg/vect/vect-multitypes-14.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorized 1 loops" 1
+PASS: gcc.dg/vect/vect-multitypes-8.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorized 1 loops" 1
+PASS: gcc.dg/vect/vect-reduc-dot-u16b.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorized 1 loops" 1
+PASS: gcc.dg/vect/vect-strided-store-u16-i4.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorized 1 loops" 2
-PASS: gcc.dg/vect/slp-13-big-array.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorizing stmts using SLP" 2
+XFAIL: gcc.dg/vect/slp-13-big-array.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorizing stmts using SLP" 3
-PASS: gcc.dg/vect/slp-13.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorizing stmts using SLP" 2
+XFAIL: gcc.dg/vect/slp-13.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorizing stmts using SLP" 3
+PASS: gcc.dg/vect/slp-multitypes-10.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorized 1 loops" 1
+PASS: gcc.dg/vect/slp-multitypes-10.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorizing stmts using SLP" 1
+PASS: gcc.dg/vect/slp-multitypes-5.c -flto -ffat-lto-obje

[PATCH] MATCH: [PR111679] Add alternative simplification of `a | ((~a) ^ b)`

2023-10-09 Thread Andrew Pinski
So currently we have a simplification for `a | ~(a ^ b)`, but it does not
match the case where we originally had `(~a) | (a ^ b)`, so we need to add
a new pattern that matches that and uses bitwise_inverted_equal_p, which
also catches comparisons.
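
As a quick brute-force check of the identity (my sanity check, not part of 
the patch):

#include <assert.h>

int main (void)
{
  /* (~a) | (a ^ b) should equal ~(a & b), i.e. (~a) | (~b),
     for every 4-bit pattern.  */
  for (unsigned a = 0; a < 16; a++)
    for (unsigned b = 0; b < 16; b++)
      assert ((((~a) | (a ^ b)) & 0xf) == ((~(a & b)) & 0xf));
  return 0;
}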

OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.

PR tree-optimization/111679

gcc/ChangeLog:

* match.pd (`a | ((~a) ^ b)`): New pattern.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/bitops-5.c: New test.
---
 gcc/match.pd |  8 +++
 gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c | 27 
 2 files changed, 35 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 31bfd8b6b68..49740d189a7 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1350,6 +1350,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   && TYPE_PRECISION (TREE_TYPE (@0)) == 1)
   (bit_ior @0 (bit_xor @1 { build_one_cst (type); }
 
+/* a | ((~a) ^ b)  -->  a | (~b) (alt version of the above 2) */
+(simplify
+ (bit_ior:c @0 (bit_xor:cs @1 @2))
+ (with { bool wascmp; }
+ (if (bitwise_inverted_equal_p (@0, @1, wascmp)
+  && (!wascmp || element_precision (type) == 1))
+  (bit_ior @0 (bit_not @2)
+
 /* (a | b) | (a &^ b)  -->  a | b  */
 (for op (bit_and bit_xor)
  (simplify
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c 
b/gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c
new file mode 100644
index 000..990610e3002
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/bitops-5.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized-raw" } */
+/* PR tree-optimization/111679 */
+
+int f1(int a, int b)
+{
+return (~a) | (a ^ b); // ~(a & b) or (~a) | (~b)
+}
+
+_Bool fb(_Bool c, _Bool d)
+{
+return (!c) | (c ^ d); // ~(c & d) or (~c) | (~d)
+}
+
+_Bool fb1(int x, int y)
+{
+_Bool a = x == 10,  b = y > 100;
+return (!a) | (a ^ b); // ~(a & b) or (~a) | (~b)
+// or (x != 10) | (y <= 100)
+}
+
+/* { dg-final { scan-tree-dump-not   "bit_xor_expr, "   "optimized" } } */
+/* { dg-final { scan-tree-dump-times "bit_not_expr, " 2 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "bit_and_expr, " 2 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "bit_ior_expr, " 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "ne_expr, _\[0-9\]+, x_\[0-9\]+"  1 
"optimized" } } */
+/* { dg-final { scan-tree-dump-times "le_expr, _\[0-9\]+, y_\[0-9\]+"  1 
"optimized" } } */
-- 
2.39.3



[RFC] RISC-V: Handle new types in scheduling descriptions

2023-10-09 Thread Edwin Lu
Now that every insn is guaranteed a type, we want to ensure the types are 
handled by the existing scheduling descriptions. 

There are 2 approaches I see:
1. Create a new pipeline intended to eventually abort (sifive-7.md) 
2. Add the types to an existing pipeline (generic.md)

Which approach do we want to go with? If there is a different approach we
want to take instead, please let me know as well.

Additionally, should types associated with specific extensions 
(vector, crypto, etc) have specific pipelines dedicated to them? 

* config/riscv/generic.md (generic_alu): Add trap and cbo to the
handled types.
* config/riscv/sifive-7.md (sifive_7_abort): New cpu unit.
(sifive_7_other): New insn reservation for trap and cbo types.

Signed-off-by: Edwin Lu 
---
 gcc/config/riscv/generic.md  | 3 ++-
 gcc/config/riscv/sifive-7.md | 7 +++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/gcc/config/riscv/generic.md b/gcc/config/riscv/generic.md
index 57d3c3b4adc..338d2e85b77 100644
--- a/gcc/config/riscv/generic.md
+++ b/gcc/config/riscv/generic.md
@@ -27,7 +27,8 @@ (define_cpu_unit "fdivsqrt" "pipe0")
 
 (define_insn_reservation "generic_alu" 1
   (and (eq_attr "tune" "generic")
-   (eq_attr "type" 
"unknown,const,arith,shift,slt,multi,auipc,nop,logical,move,bitmanip,min,max,minu,maxu,clz,ctz,cpop"))
+   (eq_attr "type" "unknown,const,arith,shift,slt,multi,auipc,nop,
+ logical,move,bitmanip,min,max,minu,maxu,clz,ctz,cpop,trap,cbo"))
   "alu")
 
 (define_insn_reservation "generic_load" 3
diff --git a/gcc/config/riscv/sifive-7.md b/gcc/config/riscv/sifive-7.md
index 526278e46d4..e76d82614d6 100644
--- a/gcc/config/riscv/sifive-7.md
+++ b/gcc/config/riscv/sifive-7.md
@@ -12,6 +12,8 @@ (define_cpu_unit "sifive_7_B" "sifive_7")
 (define_cpu_unit "sifive_7_idiv" "sifive_7")
 (define_cpu_unit "sifive_7_fpu" "sifive_7")
 
+(define_cpu_unit "sifive_7_abort" "sifive_7")
+
 (define_insn_reservation "sifive_7_load" 3
   (and (eq_attr "tune" "sifive_7")
(eq_attr "type" "load"))
@@ -106,6 +108,11 @@ (define_insn_reservation "sifive_7_f2i" 3
(eq_attr "type" "mfc"))
   "sifive_7_A")
 
+(define_insn_reservation "sifive_7_other" 3
+  (and (eq_attr "tune" "sifive_7")
+   (eq_attr "type" "trap,cbo"))
+  "sifive_7_abort")
+
 (define_bypass 1 
"sifive_7_load,sifive_7_alu,sifive_7_mul,sifive_7_f2i,sifive_7_sfb_alu"
   "sifive_7_alu,sifive_7_branch")
 
-- 
2.34.1



Re: xthead regression with [COMMITTED] RISC-V: const: hide mvconst splitter from IRA

2023-10-09 Thread Vineet Gupta

On 10/9/23 13:46, Christoph Müllner wrote:
Given that this causes repeated issues, I think that a fall-back to 
counting occurrences is the right thing to do. I can do that if that's ok.


Thanks Christoph.

-Vineet


Re: xthead regression with [COMMITTED] RISC-V: const: hide mvconst splitter from IRA

2023-10-09 Thread Christoph Müllner
On Mon, Oct 9, 2023 at 10:36 PM Vineet Gupta  wrote:
>
> Hi Christoph,
>
> On 10/9/23 12:06, Patrick O'Neill wrote:
> >
> > Hi Vineet,
> >
> > We're seeing a regression on all riscv targets after this patch:
> >
> > FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O2
> > check-function-bodies ConNmv_imm_imm_reg
> > FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O3 -g
> > check-function-bodies ConNmv_imm_imm_reg
> >
> > Debug log output:
> > body: \taddi    a[0-9]+,a[0-9]+,-1000+
> > \tli    a[0-9]+,9998336+
> > \taddi    a[0-9]+,a[0-9]+,1664+
> > \tth.mveqz    a[0-9]+,a[0-9]+,a[0-9]+
> > \tret
> >
> > against:     li    a5,9998336
> >     addi    a4,a0,-1000
> >     addi    a0,a5,1664
> >     th.mveqz    a0,a1,a4
> >     ret
> >
> > https://github.com/patrick-rivos/gcc-postcommit-ci/issues/8
> > https://github.com/ewlu/riscv-gnu-toolchain/issues/286
> >
>
> It seems with my patch, exactly the same instructions are emitted out of
> order (for -O2/-O3), tripping up the test results, and differ from, say,
> -O1 for the exact same build.
>
> -O2 w/ patch
> ConNmv_imm_imm_reg:
>  lia5,9998336
>  addia4,a0,-1000
>  addia0,a5,1664
>  th.mveqza0,a1,a4
>  ret
>
> -O1 w/ patch
> ConNmv_imm_imm_reg:
>  addia4,a0,-1000
>  lia5,9998336
>  addia0,a5,1664
>  th.mveqza0,a1,a4
>  ret
>
> I'm not sure if there is an easy way to handle that.
> Is there a real reason for testing the full sequences verbatim, or is
> testing the number of occurrences of th.mv{eqz,nez} enough?

I did not write the test cases, I just merged two non-functional test files
into one that works without changing the actual test approach.

Given that this causes repeated issues, I think that a fall-back to counting
occurrences is the right thing to do.

I can do that if that's ok.

BR
Christoph



> It seems Jeff recently added -fno-sched-pressure to avoid similar issues
> but that apparently is no longer sufficient.
>
> Thx,
> -Vineet
>
> > Thanks,
> > Patrick
> >
> > On 10/6/23 11:22, Vineet Gupta wrote:
> >> Vlad recently introduced a new gate @ira_in_progress, similar to
> >> counterparts @{reload,lra}_in_progress.
> >>
> >> Use this to hide the constant synthesis splitter from being recog* ()
> >> by IRA register equivalence logic which is eager to undo the splits,
> >> generating worse code for constants (and sometimes no code at all).
> >>
> >> See PR/109279 (large constant), PR/110748 (const -0.0) ...
> >>
> >> Granted the IRA logic is subsided with -fsched-pressure which is now
> >> enabled for RISC-V backend, the gate makes this future-proof in
> >> addition to helping with -O1 etc.
> >>
> >> This fixes 1 additional test
> >>
> >> = Summary of gcc testsuite =
> >>  | # of unexpected case / # of unique 
> >> unexpected case
> >>  |  gcc |  g++ | gfortran |
> >>
> >> rv32imac/  ilp32/ medlow |  416 /   103 |   13 / 6 |   67 /12 |
> >>   rv32imafdc/ ilp32d/ medlow |  416 /   103 |   13 / 6 |   24 / 4 |
> >> rv64imac/   lp64/ medlow |  417 /   104 |9 / 3 |   67 /12 |
> >>   rv64imafdc/  lp64d/ medlow |  416 /   103 |5 / 2 |6 / 1 |
> >>
> >> Also similar to v1, this doesn't move RISC-V SPEC scores at all.
> >>
> >> gcc/ChangeLog:
> >>  * config/riscv/riscv.md (mvconst_internal): Add !ira_in_progress.
> >>
> >> Suggested-by: Jeff Law
> >> Signed-off-by: Vineet Gupta
> >> ---
> >>   gcc/config/riscv/riscv.md | 9 ++---
> >>   1 file changed, 6 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
> >> index 1ebe8f92284d..da84b9357bd3 100644
> >> --- a/gcc/config/riscv/riscv.md
> >> +++ b/gcc/config/riscv/riscv.md
> >> @@ -1997,13 +1997,16 @@
> >>
> >>   ;; Pretend to have the ability to load complex const_int in order to get
> >>   ;; better code generation around them.
> >> -;;
> >>   ;; But avoid constants that are special cased elsewhere.
> >> +;;
> >> +;; Hide it from IRA register equiv recog* () to elide potential undoing 
> >> of split
> >> +;;
> >>   (define_insn_and_split "*mvconst_internal"
> >> [(set (match_operand:GPR 0 "register_operand" "=r")
> >>   (match_operand:GPR 1 "splittable_const_int_operand" "i"))]
> >> -  "!(p2m1_shift_operand (operands[1], mode)
> >> - || high_mask_shift_operand (operands[1], mode))"
> >> +  "!ira_in_progress
> >> +   && !(p2m1_shift_operand (operands[1], mode)
> >> +|| high_mask_shift_operand (operands[1], mode))"
> >> "#"
> >> "&& 1"
> >> [(const_int 0)]
>


Re: [PATCH v4] c++: Check for indirect change of active union member in constexpr [PR101631,PR102286]

2023-10-09 Thread Jason Merrill

On 10/8/23 21:03, Nathaniel Shead wrote:

Ping for https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631203.html

+ && (TREE_CODE (t) == MODIFY_EXPR
+ /* Also check if initializations have implicit change of active
+member earlier up the access chain.  */
+ || !refs->is_empty())


I'm not sure what the cumulative point of these two tests is.  TREE_CODE 
(t) will be either MODIFY_EXPR or INIT_EXPR, and either should be OK.


As I understand it, the problematic case is something like 
constexpr-union2.C, where we're also looking at a MODIFY_EXPR.  So what 
is this check doing?


Incidentally, I think constexpr-union6.C could use a test where we pass 
&u.s to a function other than construct_at, and then try (and fail) to 
assign to the b member from that function.
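
Something along these lines, perhaps (a rough sketch; the member names and 
exact shape of constexpr-union6.C are assumed here):

struct S { int b; };
union U { int a; S s; };

constexpr void try_write_b (S *p) { p->b = 1; }   // not std::construct_at

constexpr bool f ()
{
  U u = { .a = 0 };     // active member is a
  try_write_b (&u.s);   // should be rejected: a plain assignment through
                        // the pointer must not change the active member
  return true;
}
constexpr bool ok = f ();   // expected to fail to be a constant expression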


Jason



Re: xthead regression with [COMMITTED] RISC-V: const: hide mvconst splitter from IRA

2023-10-09 Thread Jeff Law




On 10/9/23 14:36, Vineet Gupta wrote:

Hi Christoph,

On 10/9/23 12:06, Patrick O'Neill wrote:


Hi Vineet,

We're seeing a regression on all riscv targets after this patch:

FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O2 
check-function-bodies ConNmv_imm_imm_reg
FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O3 -g 
check-function-bodies ConNmv_imm_imm_reg


Debug log output:
body: \taddi    a[0-9]+,a[0-9]+,-1000+
\tli    a[0-9]+,9998336+
\taddi    a[0-9]+,a[0-9]+,1664+
\tth.mveqz    a[0-9]+,a[0-9]+,a[0-9]+
\tret

against:     li    a5,9998336
    addi    a4,a0,-1000
    addi    a0,a5,1664
    th.mveqz    a0,a1,a4
    ret

https://github.com/patrick-rivos/gcc-postcommit-ci/issues/8
https://github.com/ewlu/riscv-gnu-toolchain/issues/286



It seems with my patch, exactly the same instructions are emitted out of 
order (for -O2/-O3), tripping up the test results, and differ from, say, 
-O1 for the exact same build.


-O2 w/ patch
ConNmv_imm_imm_reg:
     li    a5,9998336
     addi    a4,a0,-1000
     addi    a0,a5,1664
     th.mveqz    a0,a1,a4
     ret

-O1 w/ patch
ConNmv_imm_imm_reg:
     addi    a4,a0,-1000
     li    a5,9998336
     addi    a0,a5,1664
     th.mveqz    a0,a1,a4
     ret

I'm not sure if there is an easy way to handle that.
Is there a real reason for testing the full sequences verbatim, or is 
testing the number of occurrences of th.mv{eqz,nez} enough?
It seems Jeff recently added -fno-sched-pressure to avoid similar issues 
but that apparently is no longer sufficient.

I'd suggest doing a count test rather than an exact match.

Verify you get a single li, two addis and one th.mveqz.
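
A minimal sketch of that style of check (the directive counts are 
illustrative and apply file-wide, so a real fix must total them across all 
functions in the test, as the committed fix earlier in this digest does):

/* { dg-final { scan-assembler-times "li\t" 1 } } */
/* { dg-final { scan-assembler-times "addi\t" 2 } } */
/* { dg-final { scan-assembler-times "th.mveqz\t" 1 } } */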

Jeff


xthead regression with [COMMITTED] RISC-V: const: hide mvconst splitter from IRA

2023-10-09 Thread Vineet Gupta

Hi Christoph,

On 10/9/23 12:06, Patrick O'Neill wrote:


Hi Vineet,

We're seeing a regression on all riscv targets after this patch:

FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O2 
check-function-bodies ConNmv_imm_imm_reg
FAIL: gcc.target/riscv/xtheadcondmov-indirect.c -O3 -g 
check-function-bodies ConNmv_imm_imm_reg


Debug log output:
body: \taddi    a[0-9]+,a[0-9]+,-1000+
\tli    a[0-9]+,9998336+
\taddi    a[0-9]+,a[0-9]+,1664+
\tth.mveqz    a[0-9]+,a[0-9]+,a[0-9]+
\tret

against:     li    a5,9998336
    addi    a4,a0,-1000
    addi    a0,a5,1664
    th.mveqz    a0,a1,a4
    ret

https://github.com/patrick-rivos/gcc-postcommit-ci/issues/8
https://github.com/ewlu/riscv-gnu-toolchain/issues/286



It seems with my patch, exactly the same instructions are emitted out of 
order (for -O2/-O3), tripping up the test results, and differ from, say, 
-O1 for the exact same build.


-O2 w/ patch
ConNmv_imm_imm_reg:
    li    a5,9998336
    addi    a4,a0,-1000
    addi    a0,a5,1664
    th.mveqz    a0,a1,a4
    ret

-O1 w/ patch
ConNmv_imm_imm_reg:
    addi    a4,a0,-1000
    li    a5,9998336
    addi    a0,a5,1664
    th.mveqz    a0,a1,a4
    ret

I'm not sure if there is an easy way to handle that.
Is there a real reason for testing the full sequences verbatim, or is 
testing the number of occurrences of th.mv{eqz,nez} enough?
It seems Jeff recently added -fno-sched-pressure to avoid similar issues 
but that apparently is no longer sufficient.


Thx,
-Vineet


Thanks,
Patrick

On 10/6/23 11:22, Vineet Gupta wrote:

Vlad recently introduced a new gate @ira_in_progress, similar to
counterparts @{reload,lra}_in_progress.

Use this to hide the constant synthesis splitter from being recog* ()
by IRA register equivalence logic which is eager to undo the splits,
generating worse code for constants (and sometimes no code at all).

See PR/109279 (large constant), PR/110748 (const -0.0) ...

Granted the IRA logic is subsided with -fsched-pressure which is now
enabled for RISC-V backend, the gate makes this future-proof in
addition to helping with -O1 etc.

This fixes 1 additional test

= Summary of gcc testsuite =
 | # of unexpected case / # of unique unexpected 
case
 |  gcc |  g++ | gfortran |

rv32imac/  ilp32/ medlow |  416 /   103 |   13 / 6 |   67 /12 |
  rv32imafdc/ ilp32d/ medlow |  416 /   103 |   13 / 6 |   24 / 4 |
rv64imac/   lp64/ medlow |  417 /   104 |9 / 3 |   67 /12 |
  rv64imafdc/  lp64d/ medlow |  416 /   103 |5 / 2 |6 / 1 |

Also similar to v1, this doesn't move RISC-V SPEC scores at all.

gcc/ChangeLog:
* config/riscv/riscv.md (mvconst_internal): Add !ira_in_progress.

Suggested-by: Jeff Law
Signed-off-by: Vineet Gupta
---
  gcc/config/riscv/riscv.md | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index 1ebe8f92284d..da84b9357bd3 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -1997,13 +1997,16 @@
  
  ;; Pretend to have the ability to load complex const_int in order to get

  ;; better code generation around them.
-;;
  ;; But avoid constants that are special cased elsewhere.
+;;
+;; Hide it from IRA register equiv recog* () to elide potential undoing of 
split
+;;
  (define_insn_and_split "*mvconst_internal"
[(set (match_operand:GPR 0 "register_operand" "=r")
  (match_operand:GPR 1 "splittable_const_int_operand" "i"))]
-  "!(p2m1_shift_operand (operands[1], mode)
- || high_mask_shift_operand (operands[1], mode))"
+  "!ira_in_progress
+   && !(p2m1_shift_operand (operands[1], mode)
+|| high_mask_shift_operand (operands[1], mode))"
"#"
"&& 1"
[(const_int 0)]




Re: [PATCH] c++: Improve diagnostics for constexpr cast from void*

2023-10-09 Thread Jason Merrill

On 10/9/23 06:03, Nathaniel Shead wrote:

Bootstrapped and regtested on x86_64-pc-linux-gnu with
GXX_TESTSUITE_STDS=98,11,14,17,20,23,26,impcx.

-- >8 --

This patch improves the errors given when casting from void* in C++26 to
include the expected type if the type of the pointed-to object was
not similar to the casted-to type.

It also ensures (for all standard modes) that void* casts are checked
even for DECL_ARTIFICIAL declarations, such as lifetime-extended
temporaries, and that the check is only skipped for cases where we know
it's OK (heap identifiers and source_location::current). This provides
more accurate diagnostics when using the pointer and ensures that some
other casts from void* are now correctly rejected.
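
For illustration, the kind of C++26 constant-evaluation casts being 
diagnosed (my example, not from the testsuite):

constexpr int f ()
{
  int i = 42;
  void *vp = &i;
  return *static_cast<int *> (vp);   // OK in C++26: the object's type is
                                     // similar to the cast-to type
}

constexpr int g ()
{
  float x = 1.0f;
  void *vp = &x;
  return *static_cast<int *> (vp);   // rejected: 'float' is not similar to
                                     // 'int'; the error should now name the
                                     // expected type
}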

gcc/cp/ChangeLog:

* constexpr.cc (is_std_source_location_current): New.
(cxx_eval_constant_expression): Only ignore cast from void* for
specific cases and improve other diagnostics.

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/constexpr-cast4.C: New test.

Signed-off-by: Nathaniel Shead 
---
  gcc/cp/constexpr.cc  | 83 +---
  gcc/testsuite/g++.dg/cpp0x/constexpr-cast4.C |  7 ++
  2 files changed, 78 insertions(+), 12 deletions(-)
  create mode 100644 gcc/testsuite/g++.dg/cpp0x/constexpr-cast4.C

diff --git a/gcc/cp/constexpr.cc b/gcc/cp/constexpr.cc
index 0f948db7c2d..f38d541a662 100644
--- a/gcc/cp/constexpr.cc
+++ b/gcc/cp/constexpr.cc
@@ -2301,6 +2301,36 @@ is_std_allocator_allocate (const constexpr_call *call)
  && is_std_allocator_allocate (call->fundef->decl));
  }
  
+/* Return true if FNDECL is std::source_location::current.  */

+
+static inline bool
+is_std_source_location_current (tree fndecl)
+{
+  if (!decl_in_std_namespace_p (fndecl))
+return false;
+
+  tree name = DECL_NAME (fndecl);
+  if (name == NULL_TREE || !id_equal (name, "current"))
+return false;
+
+  tree ctx = DECL_CONTEXT (fndecl);
+  if (ctx == NULL_TREE || !CLASS_TYPE_P (ctx) || !TYPE_MAIN_DECL (ctx))
+return false;
+
+  name = DECL_NAME (TYPE_MAIN_DECL (ctx));
+  return name && id_equal (name, "source_location");
+}
+
+/* Overload for the above taking constexpr_call*.  */
+
+static inline bool
+is_std_source_location_current (const constexpr_call *call)
+{
+  return (call
+ && call->fundef
+ && is_std_source_location_current (call->fundef->decl));
+}
+
  /* Return true if FNDECL is __dynamic_cast.  */
  
  static inline bool

@@ -7850,33 +7880,62 @@ cxx_eval_constant_expression (const constexpr_ctx *ctx, 
tree t,
if (TYPE_PTROB_P (type)
&& TYPE_PTR_P (TREE_TYPE (op))
&& VOID_TYPE_P (TREE_TYPE (TREE_TYPE (op)))
-   /* Inside a call to std::construct_at or to
-  std::allocator::{,de}allocate, we permit casting from void*
+   /* Inside a call to std::construct_at,
+  std::allocator::{,de}allocate, or
+  std::source_location::current, we permit casting from void*
   because that is compiler-generated code.  */
&& !is_std_construct_at (ctx->call)
-   && !is_std_allocator_allocate (ctx->call))
+   && !is_std_allocator_allocate (ctx->call)
+   && !is_std_source_location_current (ctx->call))
  {
/* Likewise, don't error when casting from void* when OP is
   &heap uninit and similar.  */
tree sop = tree_strip_nop_conversions (op);
-   if (TREE_CODE (sop) == ADDR_EXPR
-   && VAR_P (TREE_OPERAND (sop, 0))
-   && DECL_ARTIFICIAL (TREE_OPERAND (sop, 0)))
+   tree decl = NULL_TREE;
+   if (TREE_CODE (sop) == ADDR_EXPR)
+ decl = TREE_OPERAND (sop, 0);
+   if (decl
+   && VAR_P (decl)
+   && DECL_ARTIFICIAL (decl)
+   && (DECL_NAME (decl) == heap_identifier
+   || DECL_NAME (decl) == heap_uninit_identifier
+   || DECL_NAME (decl) == heap_vec_identifier
+   || DECL_NAME (decl) == heap_vec_uninit_identifier))
  /* OK */;
/* P2738 (C++26): a conversion from a prvalue P of type "pointer to
   cv void" to a pointer-to-object type T unless P points to an
   object whose type is similar to T.  */
-   else if (cxx_dialect > cxx23
-&& (sop = cxx_fold_indirect_ref (ctx, loc,
- TREE_TYPE (type), sop)))
+   else if (cxx_dialect > cxx23)
  {
-   r = build1 (ADDR_EXPR, type, sop);
-   break;
+   r = cxx_fold_indirect_ref (ctx, loc, TREE_TYPE (type), sop);
+   if (r)
+ {
+   r = build1 (ADDR_EXPR, type, r);
+   break;
+ }
+   if (!ctx->quiet)
+ {
+   if (TREE_CODE (sop) == ADDR_EXPR)
+ {
+ 

Re: [PATCH v1 1/4] options: Define TARGET__P and TARGET__OPTS_P macro for Mask and InverseMask

2023-10-09 Thread Kito Cheng
> Doesn't this need to be updated to avoid multi-dimensional arrays in awk
> and rebased?

Oh, yeah, I should update that; it was posted before that issue was reported.
Let me send a v2 soon :P


Re: [RFC 1/2] RISC-V: Add support for _Bfloat16.

2023-10-09 Thread Jeff Law




On 10/9/23 00:18, Jin Ma wrote:


+;; The conversion of DF to BF needs to be done with SF if there is a
+;; chance to generate at least one instruction, otherwise just using
+;; libfunc __truncdfbf2.
+(define_expand "truncdfbf2"
+  [(set (match_operand:BF 0 "register_operand" "=f")
+   (float_truncate:BF
+   (match_operand:DF 1 "register_operand" " f")))]
+  "TARGET_DOUBLE_FLOAT || TARGET_ZDINX"
+  {
+convert_move (operands[0],
+ convert_modes (SFmode, DFmode, operands[1], 0), 0);
+DONE;
+  })

So for conversions to/from BFmode, doesn't generic code take care of
this for us?  Search for convert_mode_scalar in expr.cc. That code will
utilize SFmode as an intermediate step just like your expander.   Is
there some reason that generic code is insufficient?

Similarly for the other conversions.


As far as I can see, the function 'convert_mode_scalar' doesn't seem to be 
perfect for dealing with the conversions to/from BFmode.  It can only handle 
BF to HF, SF, DF and SF to BF well; the rest of the conversions get no 
special processing and directly use the libcall.

Maybe I should choose to enhance its functionality?  This seems to be a 
good choice, but I'm not sure.

My recollection was that BF could be converted to/from SF trivially and 
if we wanted BF->DF we'd first convert to SF, then to DF.

Direct BF<->DF conversions aren't actually important from a performance 
standpoint.  So it's OK if they have an extra step IMHO.
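
Expressed in C, the two-step conversion under discussion looks roughly like 
this (a sketch; assumes a target with __bf16 support):

/* DF -> BF via SF, mirroring what the truncdfbf2 expander above emits.  */
__bf16
truncate_df_to_bf (double d)
{
  float f = (float) d;  /* DF -> SF first...  */
  return (__bf16) f;    /* ...then SF -> BF.  */
}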


jeff


Re: [pushed] analyzer: improvements to out-of-bounds diagrams [PR111155]

2023-10-09 Thread David Malcolm
On Mon, 2023-10-09 at 17:01 +0200, Tobias Burnus wrote:
> Hi David,
> 
> On 09.10.23 16:08, David Malcolm wrote:
> > On Mon, 2023-10-09 at 12:09 +0200, Tobias Burnus wrote:
> > > The following works:
> > > (A) Using "kind == boundaries::kind::HARD" - i.e. adding
> > > "boundaries::"
> > > (B) Renaming the parameter name "kind" to something else - like
> > > "k"
> > > as used
> > >   in the other functions.
> > > 
> > > Can you fix it?
> > Sorry about the breakage, and thanks for the investigation.
> Well, without an older compiler, one does not see it. It also worked
> flawlessly on my laptop today.
> > Does the following patch fix the build for you?
> 
> Yes – as mentioned either of the variants above should work and (A)
> is
> what you have in your patch.
> 
> And it is what I actually tried for the full build. Hence, yes, it
> works :-)

Thanks!

I've pushed this to trunk as r14-4521-g08d0f840dc7ad2.



Re: [PATCH] wide-int: Allow up to 16320 bits wide_int and change widest_int precision to 32640 bits [PR102989]

2023-10-09 Thread Jakub Jelinek
On Mon, Oct 09, 2023 at 03:44:10PM +0200, Jakub Jelinek wrote:
> Thanks, just quick answers, will work on patch adjustments after trying to
> get rid of rwide_int (seems dwarf2out has very limited needs from it, just
> some routine to construct it in GCed memory (and never change afterwards)
> from const wide_int_ref & or so, and then working operator ==,
> get_precision, elt, get_len and get_val methods, so I think we could just
> have a struct dw_wide_int { unsigned int prec, len; HOST_WIDE_INT val[1]; };
> and perform the methods on it after converting to a storage ref.

Now in patch form (again, incremental).

> > Does the variable-length memcpy pay for itself?  If so, perhaps that's a
> > sign that we should have a smaller inline buffer for this class (say 2 
> > HWIs).
> 
> Guess I'll try to see what results in smaller .text size.

I've left the memcpy changes into a separate patch (incremental, attached).
Seems that second patch results in .text growth by 16256 bytes (0.04%),
though I'd bet it probably makes compile time a tiny bit faster because it
replaces an out-of-line memcpy (caused by the variable length) with an inlined one.

With the third one as well, it shrinks by 84544 bytes (0.21% down), but the
extra statistics patch then shows a massive number of allocations after
running make check-gcc check-g++ check-gfortran for just a minute or two.
On the widest_int side, I see (first number from sort | uniq -c | sort -nr,
second the estimated or final len)
7289034 4
 173586 5
  21819 6
i.e. there are tons of widest_ints which need len 4 (or perhaps just
have it as upper estimation), maybe even 5 would be nice.
On the wide_int side, I see
 155291 576
(supposedly because of bound_wide_int, where we create wide_int_ref from
the 576-bit precision bound_wide_int and then create 576-bit wide_int when
using unary or binary operation on that).

So, perhaps we could get away with say WIDEST_INT_MAX_INL_ELTS of 5 or 6
instead of 9 but keep WIDE_INT_MAX_INL_ELTS at 9 (or whatever is computed
from MAX_BITSIZE_MODE_ANY_INT?).  Or keep it at 9 for both (i.e. without
the third patch).

--- gcc/poly-int.h.jj   2023-10-09 14:37:45.883940062 +0200
+++ gcc/poly-int.h  2023-10-09 17:05:26.629828329 +0200
@@ -96,7 +96,7 @@ struct poly_coeff_traits
-struct poly_coeff_traits
+struct poly_coeff_traits
 {
   typedef WI_UNARY_RESULT (T) result;
   typedef int int_type;
@@ -110,14 +110,13 @@ struct poly_coeff_traits
-struct poly_coeff_traits
+struct poly_coeff_traits
 {
   typedef WI_UNARY_RESULT (T) result;
   typedef int int_type;
   /* These types are always signed.  */
   static const int signedness = 1;
   static const int precision = wi::int_traits::precision;
-  static const int inl_precision = wi::int_traits::inl_precision;
   static const int rank = precision * 2 / CHAR_BIT;
 
   template
--- gcc/double-int.h.jj 2023-01-02 09:32:22.747280053 +0100
+++ gcc/double-int.h2023-10-09 17:06:03.446317336 +0200
@@ -440,7 +440,7 @@ namespace wi
   template <>
   struct int_traits 
   {
-static const enum precision_type precision_type = CONST_PRECISION;
+static const enum precision_type precision_type = INL_CONST_PRECISION;
 static const bool host_dependent_precision = true;
 static const unsigned int precision = HOST_BITS_PER_DOUBLE_INT;
 static unsigned int get_precision (const double_int &);
--- gcc/wide-int.h.jj   2023-10-09 16:06:39.326805176 +0200
+++ gcc/wide-int.h  2023-10-09 17:29:20.016951691 +0200
@@ -343,8 +343,8 @@ template  class widest_int_storag
 
 typedef generic_wide_int  wide_int;
 typedef FIXED_WIDE_INT (ADDR_MAX_PRECISION) offset_int;
-typedef generic_wide_int  > 
widest_int;
-typedef generic_wide_int  
> widest2_int;
+typedef generic_wide_int  > 
widest_int;
+typedef generic_wide_int  > 
widest2_int;
 
 /* wi::storage_ref can be a reference to a primitive type,
so this is the conservatively-correct setting.  */
@@ -394,13 +394,13 @@ namespace wi
 /* The integer has a variable precision but no defined signedness.  */
 VAR_PRECISION,
 
-/* The integer has a constant precision (known at GCC compile time)
-   and is signed.  */
-CONST_PRECISION,
-
-/* Like CONST_PRECISION, but with WIDEST_INT_MAX_PRECISION or larger
-   precision where not all elements of arrays are always present.  */
-WIDEST_CONST_PRECISION
+/* The integer has a constant precision (known at GCC compile time),
+   is signed and all elements are in inline buffer.  */
+INL_CONST_PRECISION,
+
+/* Like INL_CONST_PRECISION, but elements can be heap allocated for
+   larger lengths.  */
+CONST_PRECISION
   };
 
   /* This class, which has no default implementation, is expected to
@@ -410,15 +410,10 @@ namespace wi
Classifies the type of T.
 
  static const unsigned int precision;
-   Only defined if precision_type == CONST_PRECISION or
-   precision_type == WIDEST_CONST_PRECISION.  Specifies the
+   Only defined if precision_type == INL_CONST_PRE

Re: [PATCH] sso-string@gnu-versioned-namespace [PR83077]

2023-10-09 Thread François Dumont



On 09/10/2023 16:42, Iain Sandoe wrote:

Hi François,


On 7 Oct 2023, at 20:32, François Dumont  wrote:

I've been told that previous patch generated with 'git diff -b' was not 
applying properly so here is the same patch again with a simple 'git diff'.

Thanks, that did fix it - there is some trailing whitespace in the config 
files, but I suspect that it needs to be there since those have values 
appended during the configuration.


You're talking about the ones coming from the regenerated Makefile.in and 
configure, I guess. I prefer not to edit those; that trailing 
whitespace is already there.





Anyway, with this + the coroutines and contract v2 (weak def) fix, plus a local 
patch to enable the versioned namespace on Darwin, I get results comparable with 
the non-versioned case - but one more patchlet is needed on top of yours (to allow 
for targets using emulated TLS):

diff --git a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver 
b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
index 9fab8bead15..b7167fc0c2f 100644
--- a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
+++ b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
@@ -78,6 +78,7 @@ GLIBCXX_8.0 {
  
  # thread/mutex/condition_variable/future

  __once_proxy;
+__emutls_v._ZNSt3__81?__once_call*;


I can add this one, sure, even if it could be part of a dedicated patch. 
I'm surprised that we do not need the __once_callable emutls symbol too; 
that would be more consistent with the non-versioned mode.


I'm pretty sure there are a bunch of other symbols missing, but this 
mode is seldom tested...


  
  # std::__convert_to_v

  _ZNSt3__814__convert_to_v*;


thanks
Iain



On 07/10/2023 14:25, François Dumont wrote:

Hi

Here is a rebased version of this patch.

There are a few test failures when running 'make check-c++' but nothing new.

Still, there are 2 patches awaiting validation to fix some of them, and PR 
c++/111524 to fix another bunch; I fear that we will have to live with the 
others.

 libstdc++: [_GLIBCXX_INLINE_VERSION] Use cxx11 abi [PR83077]

 Use cxx11 abi when activating versioned namespace mode. To do so, support
 a new configuration mode where !_GLIBCXX_USE_DUAL_ABI and 
_GLIBCXX_USE_CXX11_ABI.

 The main change is that std::__cow_string is now defined whenever 
_GLIBCXX_USE_DUAL_ABI
 or _GLIBCXX_USE_CXX11_ABI is true. Implementation is using available 
std::string in
 case of dual abi and a subset of it when it's not.

 On the other side std::__sso_string is defined only when 
_GLIBCXX_USE_DUAL_ABI is true
 and _GLIBCXX_USE_CXX11_ABI is false. Meaning that std::__sso_string is a 
typedef for the
 cow std::string implementation when dual abi is disabled and cow string is 
being used.

 libstdcxx-v3/ChangeLog:

 PR libstdc++/83077
 * acinclude.m4 [GLIBCXX_ENABLE_LIBSTDCXX_DUAL_ABI]: Default to 
"new" libstdcxx abi
 when enable_symvers is gnu-versioned-namespace.
 * config/locale/dragonfly/monetary_members.cc 
[!_GLIBCXX_USE_DUAL_ABI]: Define money_base
 members.
 * config/locale/generic/monetary_members.cc 
[!_GLIBCXX_USE_DUAL_ABI]: Likewise.
 * config/locale/gnu/monetary_members.cc [!_GLIBCXX_USE_DUAL_ABI]: 
Likewise.
 * config/locale/gnu/numeric_members.cc
 [!_GLIBCXX_USE_DUAL_ABI](__narrow_multibyte_chars): Define.
 * configure: Regenerate.
 * include/bits/c++config
 [_GLIBCXX_INLINE_VERSION](_GLIBCXX_NAMESPACE_CXX11, 
_GLIBCXX_BEGIN_NAMESPACE_CXX11):
 Define empty.
[_GLIBCXX_INLINE_VERSION](_GLIBCXX_END_NAMESPACE_CXX11, 
_GLIBCXX_DEFAULT_ABI_TAG):
 Likewise.
 * include/bits/cow_string.h [!_GLIBCXX_USE_CXX11_ABI]: Define a 
light version of COW
 basic_string as __std_cow_string for use in stdexcept.
 * include/std/stdexcept [_GLIBCXX_USE_CXX11_ABI]: Define 
__cow_string.
 (__cow_string(const char*)): New.
 (__cow_string::c_str()): New.
 * python/libstdcxx/v6/printers.py (StdStringPrinter::__init__): 
Set self.new_string to True
 when std::__8::basic_string type is found.
 * src/Makefile.am 
[ENABLE_SYMVERS_GNU_NAMESPACE](ldbl_alt128_compat_sources): Define empty.
 * src/Makefile.in: Regenerate.
 * src/c++11/Makefile.am (cxx11_abi_sources): Rename into...
 (dual_abi_sources): ...this. Also move cow-local_init.cc, 
cxx11-hash_tr1.cc,
 cxx11-ios_failure.cc entries to...
 (sources): ...this.
 (extra_string_inst_sources): Move cow-fstream-inst.cc, 
cow-sstream-inst.cc, cow-string-inst.cc,
 cow-string-io-inst.cc, cow-wtring-inst.cc, cow-wstring-io-inst.cc, 
cxx11-locale-inst.cc,
 cxx11-wlocale-inst.cc entries to...
 (inst_sources): ...this.
 * src/c++11/Makefile.in: Regenerat

[COMMITTED] PR tree-optimization/111694 - Ensure float equivalences include + and - zero.

2023-10-09 Thread Andrew MacLeod
When ranger propagates ranges in the on-entry cache, it also checks for 
equivalences and incorporates the equivalence into the range for a name 
if it is known.


With floating point values, the equivalence that is generated by 
comparison must also take into account that if the equivalence contains 
zero, both positive and negative zeros could be in the range.


This PR demonstrates that once we establish an equivalence, even though 
we know one value may only have a positive zero, the equivalence may 
have been formed earlier and included a negative zero.  This patch 
pessimistically assumes that if the equivalence contains zero, we should 
include both + and - 0 in the equivalence that we utilize.


I audited the other places, and found no other place where this issue 
might arise.  Cache propagation is the only place where we augment the 
range with random equivalences.


Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed.

Andrew
From b0892b1fc637fadf14d7016858983bc5776a1e69 Mon Sep 17 00:00:00 2001
From: Andrew MacLeod 
Date: Mon, 9 Oct 2023 10:15:07 -0400
Subject: [PATCH 2/2] Ensure float equivalences include + and - zero.

A floating point equivalence may not properly reflect both signs of
zero, so be pessimistic and ensure both signs are included.

	PR tree-optimization/111694
	gcc/
	* gimple-range-cache.cc (ranger_cache::fill_block_cache): Adjust
	equivalence range.
	* value-relation.cc (adjust_equivalence_range): New.
	* value-relation.h (adjust_equivalence_range): New prototype.

	gcc/testsuite/
	* gcc.dg/pr111694.c: New.
---
 gcc/gimple-range-cache.cc   |  3 +++
 gcc/testsuite/gcc.dg/pr111694.c | 19 +++
 gcc/value-relation.cc   | 19 +++
 gcc/value-relation.h|  3 +++
 4 files changed, 44 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr111694.c

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index 3c819933c4e..89c0845457d 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -1470,6 +1470,9 @@ ranger_cache::fill_block_cache (tree name, basic_block bb, basic_block def_bb)
 		{
 		  if (rel != VREL_EQ)
 		range_cast (equiv_range, type);
+		  else
+		adjust_equivalence_range (equiv_range);
+
 		  if (block_result.intersect (equiv_range))
 		{
 		  if (DEBUG_RANGE_CACHE)
diff --git a/gcc/testsuite/gcc.dg/pr111694.c b/gcc/testsuite/gcc.dg/pr111694.c
new file mode 100644
index 000..a70b03069dc
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr111694.c
@@ -0,0 +1,19 @@
+/* PR tree-optimization/111694 */
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#define signbit(x) __builtin_signbit(x)
+
+static void test(double l, double r)
+{
+  if (l == r && (signbit(l) || signbit(r)))
+;
+  else
+__builtin_abort();
+}
+
+int main()
+{
+  test(0.0, -0.0);
+}
+
diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc
index a2ae39692a6..0326fe7cde6 100644
--- a/gcc/value-relation.cc
+++ b/gcc/value-relation.cc
@@ -183,6 +183,25 @@ relation_transitive (relation_kind r1, relation_kind r2)
   return relation_kind (rr_transitive_table[r1][r2]);
 }
 
+// When one name is an equivalence of another, ensure the equivalence
+// range is correct.  Specifically for floating point, a +0 is also
+// equivalent to a -0 which may not be reflected.  See PR 111694.
+
+void
+adjust_equivalence_range (vrange &range)
+{
+  if (range.undefined_p () || !is_a (range))
+return;
+
+  frange fr = as_a (range);
+  // If range includes 0 make sure both signs of zero are included.
+  if (fr.contains_p (dconst0) || fr.contains_p (dconstm0))
+{
+  frange zeros (range.type (), dconstm0, dconst0);
+  range.union_ (zeros);
+}
+ }
+
 // This vector maps a relation to the equivalent tree code.
 
 static const tree_code relation_to_code [VREL_LAST] = {
diff --git a/gcc/value-relation.h b/gcc/value-relation.h
index be6e277421b..31d48908678 100644
--- a/gcc/value-relation.h
+++ b/gcc/value-relation.h
@@ -91,6 +91,9 @@ inline bool relation_equiv_p (relation_kind r)
 
 void print_relation (FILE *f, relation_kind rel);
 
+// Adjust range as an equivalence.
+void adjust_equivalence_range (vrange &range);
+
 class relation_oracle
 {
 public:
-- 
2.41.0



[COMMITTED] Remove unused get_identity_relation.

2023-10-09 Thread Andrew MacLeod
I added this routine for Aldy when he thought we were going to have to 
add explicit versions for unordered relations.


It seems that with accurate tracking of NANs, we do not need the 
explicit versions in the oracle, so we will not need this identity 
routine to pick the appropriate version of VREL_EQ... as there is only 
one.  As it stands, it always returns VREL_EQ, so simply use VREL_EQ in the 
2 calling locations.


Bootstrapped on x86_64-pc-linux-gnu with no regressions. Pushed.

Andrew
From 5ee51119d1345f3f13af784455a4ae466766912b Mon Sep 17 00:00:00 2001
From: Andrew MacLeod 
Date: Mon, 9 Oct 2023 10:01:11 -0400
Subject: [PATCH 1/2] Remove unused get_identity_relation.

Turns out we didn't need this, as there are no unordered relations
managed by the oracle.

	* gimple-range-gori.cc (gori_compute::compute_operand1_range): Do
	not call get_identity_relation.
	(gori_compute::compute_operand2_range): Ditto.
	* value-relation.cc (get_identity_relation): Remove.
	* value-relation.h (get_identity_relation): Remove prototype.
---
 gcc/gimple-range-gori.cc | 10 ++
 gcc/value-relation.cc| 14 --
 gcc/value-relation.h |  3 ---
 3 files changed, 2 insertions(+), 25 deletions(-)

diff --git a/gcc/gimple-range-gori.cc b/gcc/gimple-range-gori.cc
index 1b5eda43390..887da0ff094 100644
--- a/gcc/gimple-range-gori.cc
+++ b/gcc/gimple-range-gori.cc
@@ -1146,10 +1146,7 @@ gori_compute::compute_operand1_range (vrange &r,
 
   // If op1 == op2, create a new trio for just this call.
   if (op1 == op2 && gimple_range_ssa_p (op1))
-	{
-	  relation_kind k = get_identity_relation (op1, op1_range);
-	  trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), k);
-	}
+	trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), VREL_EQ);
   if (!handler.calc_op1 (r, lhs, op2_range, trio))
 	return false;
 }
@@ -1225,10 +1222,7 @@ gori_compute::compute_operand2_range (vrange &r,
 
   // If op1 == op2, create a new trio for this stmt.
   if (op1 == op2 && gimple_range_ssa_p (op1))
-{
-  relation_kind k = get_identity_relation (op1, op1_range);
-  trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), k);
-}
+trio = relation_trio (trio.lhs_op1 (), trio.lhs_op2 (), VREL_EQ);
   // Intersect with range for op2 based on lhs and op1.
   if (!handler.calc_op2 (r, lhs, op1_range, trio))
 return false;
diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc
index 8fea4aad345..a2ae39692a6 100644
--- a/gcc/value-relation.cc
+++ b/gcc/value-relation.cc
@@ -183,20 +183,6 @@ relation_transitive (relation_kind r1, relation_kind r2)
   return relation_kind (rr_transitive_table[r1][r2]);
 }
 
-// When operands of a statement are identical ssa_names, return the
-// approriate relation between operands for NAME == NAME, given RANGE.
-//
-relation_kind
-get_identity_relation (tree name, vrange &range ATTRIBUTE_UNUSED)
-{
-  // Return VREL_UNEQ when it is supported for floats as appropriate.
-  if (frange::supports_p (TREE_TYPE (name)))
-return VREL_EQ;
-
-  // Otherwise return VREL_EQ.
-  return VREL_EQ;
-}
-
 // This vector maps a relation to the equivalent tree code.
 
 static const tree_code relation_to_code [VREL_LAST] = {
diff --git a/gcc/value-relation.h b/gcc/value-relation.h
index f00f84f93b6..be6e277421b 100644
--- a/gcc/value-relation.h
+++ b/gcc/value-relation.h
@@ -91,9 +91,6 @@ inline bool relation_equiv_p (relation_kind r)
 
 void print_relation (FILE *f, relation_kind rel);
 
-// Return relation for NAME == NAME with RANGE.
-relation_kind get_identity_relation (tree name, vrange &range);
-
 class relation_oracle
 {
 public:
-- 
2.41.0



Re: [PATCH] sso-string@gnu-versioned-namespace [PR83077]

2023-10-09 Thread Iain Sandoe



> On 9 Oct 2023, at 15:42, Iain Sandoe  wrote:

>> On 7 Oct 2023, at 20:32, François Dumont  wrote:
>> 
>> I've been told that previous patch generated with 'git diff -b' was not 
>> applying properly so here is the same patch again with a simple 'git diff'.
> 
> Thanks, that did fix it - There is some trailing whitespace in the config 
> files, but I suspect that it needs to be there since those have values 
> appended during the configuration.
> 
> Anyway, with this + the coroutines and contract v2 (weak def) fix, plus a 
> local patch to enable versioned namespace on Darwin, I get results comparable 
> with the non-versioned case - but one more patchlet is needed on yours (to 
> allow for targets using emulated TLS):
> 
> diff --git a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver 
> b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
> index 9fab8bead15..b7167fc0c2f 100644
> --- a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
> +++ b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
> @@ -78,6 +78,7 @@ GLIBCXX_8.0 {
> 
> # thread/mutex/condition_variable/future
> __once_proxy;
> +__emutls_v._ZNSt3__81?__once_call*;
> 
> # std::__convert_to_v
> _ZNSt3__814__convert_to_v*;

Having said this, since the versioned lib is an ABI-break, perhaps we should 
also take the opportunity
to fix the once_call impl. here too?

(at least the fix I made locally does not need the TLS var, so this would then 
be moot)

Iain

> 
> thanks
> Iain
> 
>> 
>> 
>> On 07/10/2023 14:25, François Dumont wrote:
>>> Hi
>>> 
>>> Here is a rebased version of this patch.
>>> 
>>> There are a few test failures when running 'make check-c++' but nothing new.
>>> 
>>> Still, there are 2 patches awaiting validation to fix some of them, PR 
>>> c++/111524 to fix another bunch and I fear that we will have to live with 
>>> the others.
>>> 
>>>libstdc++: [_GLIBCXX_INLINE_VERSION] Use cxx11 abi [PR83077]
>>> 
>>>    Use the cxx11 ABI when activating versioned namespace mode. To do so,
>>> support a new configuration mode where !_GLIBCXX_USE_DUAL_ABI and 
>>> _GLIBCXX_USE_CXX11_ABI.
>>> 
>>>The main change is that std::__cow_string is now defined whenever 
>>> _GLIBCXX_USE_DUAL_ABI
>>>or _GLIBCXX_USE_CXX11_ABI is true. Implementation is using available 
>>> std::string in
>>>case of dual abi and a subset of it when it's not.
>>> 
>>>On the other side std::__sso_string is defined only when 
>>> _GLIBCXX_USE_DUAL_ABI is true
>>>and _GLIBCXX_USE_CXX11_ABI is false. Meaning that std::__sso_string is a 
>>> typedef for the
>>>cow std::string implementation when dual abi is disabled and cow string 
>>> is being used.
>>> 
>>>libstdc++-v3/ChangeLog:
>>> 
>>>PR libstdc++/83077
>>>* acinclude.m4 [GLIBCXX_ENABLE_LIBSTDCXX_DUAL_ABI]: Default to 
>>> "new" libstdcxx abi
>>>when enable_symvers is gnu-versioned-namespace.
>>>* config/locale/dragonfly/monetary_members.cc 
>>> [!_GLIBCXX_USE_DUAL_ABI]: Define money_base
>>>members.
>>>* config/locale/generic/monetary_members.cc 
>>> [!_GLIBCXX_USE_DUAL_ABI]: Likewise.
>>>* config/locale/gnu/monetary_members.cc 
>>> [!_GLIBCXX_USE_DUAL_ABI]: Likewise.
>>>* config/locale/gnu/numeric_members.cc
>>>[!_GLIBCXX_USE_DUAL_ABI](__narrow_multibyte_chars): Define.
>>>* configure: Regenerate.
>>>* include/bits/c++config
>>>[_GLIBCXX_INLINE_VERSION](_GLIBCXX_NAMESPACE_CXX11, 
>>> _GLIBCXX_BEGIN_NAMESPACE_CXX11):
>>>Define empty.
>>> [_GLIBCXX_INLINE_VERSION](_GLIBCXX_END_NAMESPACE_CXX11, 
>>> _GLIBCXX_DEFAULT_ABI_TAG):
>>>Likewise.
>>>* include/bits/cow_string.h [!_GLIBCXX_USE_CXX11_ABI]: Define a 
>>> light version of COW
>>>basic_string as __std_cow_string for use in stdexcept.
>>>* include/std/stdexcept [_GLIBCXX_USE_CXX11_ABI]: Define 
>>> __cow_string.
>>>(__cow_string(const char*)): New.
>>>(__cow_string::c_str()): New.
>>>* python/libstdcxx/v6/printers.py (StdStringPrinter::__init__): 
>>> Set self.new_string to True
>>>when std::__8::basic_string type is found.
>>>* src/Makefile.am 
>>> [ENABLE_SYMVERS_GNU_NAMESPACE](ldbl_alt128_compat_sources): Define empty.
>>>* src/Makefile.in: Regenerate.
>>>* src/c++11/Makefile.am (cxx11_abi_sources): Rename into...
>>>(dual_abi_sources): ...this. Also move cow-local_init.cc, 
>>> cxx11-hash_tr1.cc,
>>>cxx11-ios_failure.cc entries to...
>>>(sources): ...this.
>>>(extra_string_inst_sources): Move cow-fstream-inst.cc, 
>>> cow-sstream-inst.cc, cow-string-inst.cc,
>>>cow-string-io-inst.cc, cow-wstring-inst.cc, 
>>> cow-wstring-io-inst.cc, cxx11-locale-inst.cc,
>>>cxx11-wlocale-inst.cc entries to...
>>>(inst_sources): ...this.
>>>   

Re: [PATCH-2, rs6000] Enable vector mode for memory equality compare [PR111449]

2023-10-09 Thread David Edelsohn
On Sun, Oct 8, 2023 at 10:30 PM HAO CHEN GUI  wrote:

> Hi,
>   This patch enables vector mode for memory equality compares by adding
> a new expand pattern cbranchv16qi4 and implementing it. The corresponding
> CC reg and compare code are also set in rs6000_generate_compare. With the
> patch, a 16-byte equality compare can be implemented by one vector compare
> instruction instead of two 8-byte compares with branches.
>
>   The test case is in the second patch which is rs6000 specific.
>
>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> regressions.
>

Thanks for working on this.



>
> Thanks
> Gui Haochen
>
> ChangeLog
> rs6000: Enable vector compare for memory equality compare
>
> gcc/
> PR target/111449
> * config/rs6000/altivec.md (cbranchv16qi4): New expand pattern.
> * config/rs6000/rs6000.cc (rs6000_generate_compare): Generate insn
> sequence for V16QImode equality compare.
> * config/rs6000/rs6000.h (MOVE_MAX_PIECES): Define.
> (COMPARE_MAX_PIECES): Define.
>
> gcc/testsuite/
> PR target/111449
> * gcc.target/powerpc/pr111449.c: New.
>
> patch.diff
> diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
> index e8a596fb7e9..c69bf266402 100644
> --- a/gcc/config/rs6000/altivec.md
> +++ b/gcc/config/rs6000/altivec.md
> @@ -2605,6 +2605,39 @@ (define_insn "altivec_vupklpx"
>  }
>[(set_attr "type" "vecperm")])
>
> +(define_expand "cbranchv16qi4"
> +  [(use (match_operator 0 "equality_operator"
> +   [(match_operand:V16QI 1 "gpc_reg_operand")
> +(match_operand:V16QI 2 "gpc_reg_operand")]))
> +   (use (match_operand 3))]
> +  "VECTOR_UNIT_ALTIVEC_P (V16QImode)"
> +{
> +  if (!TARGET_P9_VECTOR
> +  && MEM_P (operands[1])
> +  && !altivec_indexed_or_indirect_operand (operands[1], V16QImode)
> +  && MEM_P (operands[2])
> +  && !altivec_indexed_or_indirect_operand (operands[2], V16QImode))
> +{
> +  /* Use direct move as the byte order doesn't matter for equality
> +compare.  */
> +  rtx reg_op1 = gen_reg_rtx (V16QImode);
> +  rtx reg_op2 = gen_reg_rtx (V16QImode);
> +  rs6000_emit_le_vsx_permute (reg_op1, operands[1], V16QImode);
> +  rs6000_emit_le_vsx_permute (reg_op2, operands[2], V16QImode);
> +  operands[1] = reg_op1;
> +  operands[2] = reg_op2;
> +}
> +  else
> +{
> +  operands[1] = force_reg (V16QImode, operands[1]);
> +  operands[2] = force_reg (V16QImode, operands[2]);
> +}
> +  rtx_code code = GET_CODE (operands[0]);
> +  operands[0] = gen_rtx_fmt_ee (code, V16QImode, operands[1],
> operands[2]);
> +  rs6000_emit_cbranch (V16QImode, operands);
> +  DONE;
> +})
> +
>  ;; Compare vectors producing a vector result and a predicate, setting CR6
> to
>  ;; indicate a combined status
>  (define_insn "altivec_vcmpequ_p"
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index efe9adce1f8..0087d786840 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -15264,6 +15264,15 @@ rs6000_generate_compare (rtx cmp, machine_mode
> mode)
>   else
> emit_insn (gen_stack_protect_testsi (compare_result, op0,
> op1b));
> }
> +  else if (mode == V16QImode)
> +   {
> + gcc_assert (code == EQ || code == NE);
> +
> + rtx result_vector = gen_reg_rtx (V16QImode);
> + compare_result = gen_rtx_REG (CCmode, CR6_REGNO);
> + emit_insn (gen_altivec_vcmpequb_p (result_vector, op0, op1));
> + code = (code == NE) ? GE : LT;
> +   }
>else
> emit_insn (gen_rtx_SET (compare_result,
> gen_rtx_COMPARE (comp_mode, op0, op1)));
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 3503614efbd..dc33bca0802 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -1730,6 +1730,8 @@ typedef struct rs6000_args
> in one reasonably fast instruction.  */
>  #define MOVE_MAX (! TARGET_POWERPC64 ? 4 : 8)
>  #define MAX_MOVE_MAX 8
> +#define MOVE_MAX_PIECES (!TARGET_POWERPC64 ? 4 : 16)
> +#define COMPARE_MAX_PIECES (!TARGET_POWERPC64 ? 4 : 16)
>

How are the definitions of MOVE_MAX_PIECES and COMPARE_MAX_PIECES
determined?  The email does not provide any explanation for the
implementation.  The rest of the patch is related to vector support, but
vector support is not dependent on TARGET_POWERPC64.

Thanks, David


>
>  /* Nonzero if access to memory by bytes is no faster than for words.
> Also nonzero if doing byte operations (specifically shifts) in
> registers
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr111449.c
> b/gcc/testsuite/gcc.target/powerpc/pr111449.c
> new file mode 100644
> index 000..a8c30b92a41
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr111449.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p8vector_ok } */
> +/* { dg-options "-maltivec -O2" } */
> +
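
The quoted testcase is truncated here; the kind of function this change
targets is roughly the following (a sketch, not the actual test):

  /* With the patch, a 16-byte equality compare like this can be expanded
     as a single vector compare setting CR6 instead of two 8-byte scalar
     compares with branches.  */
  int compare (const char *a, const char *b)
  {
    return __builtin_memcmp (a, b, 16) == 0;
  }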

Re: [pushed] analyzer: improvements to out-of-bounds diagrams [PR111155]

2023-10-09 Thread Tobias Burnus

Hi David,

On 09.10.23 16:08, David Malcolm wrote:
> On Mon, 2023-10-09 at 12:09 +0200, Tobias Burnus wrote:
>> The following works:
>> (A) Using "kind == boundaries::kind::HARD" - i.e. adding "boundaries::"
>> (B) Renaming the parameter name "kind" to something else - like "k"
>>    as used in the other functions.
>>
>> Can you fix it?
>
> Sorry about the breakage, and thanks for the investigation.

Well, without an older compiler, one does not see it. It also worked
flawlessly on my laptop today.

> Does the following patch fix the build for you?

Yes – as mentioned either of the variants above should work and (A) is
what you have in your patch.

And it is what I actually tried for the full build. Hence, yes, it works :-)

Thanks for the quick action!

Tobias


gcc/analyzer/ChangeLog:
  * access-diagram.cc (boundaries::add): Explicitly state
  "boundaries::" scope for "kind" enum.
---
  gcc/analyzer/access-diagram.cc | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/analyzer/access-diagram.cc b/gcc/analyzer/access-diagram.cc
index 2197ec63f53..c7d190e3188 100644
--- a/gcc/analyzer/access-diagram.cc
+++ b/gcc/analyzer/access-diagram.cc
@@ -652,7 +652,8 @@ public:
  m_logger->log_partial ("added access_range: ");
  range.dump_to_pp (m_logger->get_printer (), true);
  m_logger->log_partial (" (%s)",
-(kind == kind::HARD) ? "HARD" : "soft");
+(kind == boundaries::kind::HARD)
+? "HARD" : "soft");
  m_logger->end_log_line ();
}
}



[PATCH] TEST: Add vectorization check

2023-10-09 Thread Juzhe-Zhong
These cases don't check for SLP on targets that support load_lanes.

Add a vectorization check for those situations.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr97832-2.c: Add vectorization check.
* gcc.dg/vect/pr97832-3.c: Ditto.
* gcc.dg/vect/pr97832-4.c: Ditto.

---
 gcc/testsuite/gcc.dg/vect/pr97832-2.c | 1 +
 gcc/testsuite/gcc.dg/vect/pr97832-3.c | 1 +
 gcc/testsuite/gcc.dg/vect/pr97832-4.c | 1 +
 3 files changed, 3 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-2.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-2.c
index 7d8d2691432..60e8e8516fc 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-2.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-2.c
@@ -27,3 +27,4 @@ void foo1x1(double* restrict y, const double* restrict x, int 
clen)
 
 /* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
 /* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-3.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-3.c
index c0603e1432e..2dc76e5b565 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-3.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-3.c
@@ -48,3 +48,4 @@ void foo(double* restrict y, const double* restrict x0, const 
double* restrict x
 
 /* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
 /* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-4.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-4.c
index c03442816a4..7e74c9313d5 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-4.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-4.c
@@ -26,3 +26,4 @@ void foo1x1(double* restrict y, const double* restrict x, int 
clen)
 
 /* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
 /* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
-- 
2.36.3



[PATCH] wide-int: Remove rwide_int, introduce dw_wide_int

2023-10-09 Thread Jakub Jelinek
On Mon, Oct 09, 2023 at 12:55:02PM +0200, Jakub Jelinek wrote:
> This makes wide_int unusable in GC structures, so for dwarf2out
> which was the only place which needed it there is a new rwide_int type
> (restricted wide_int) which supports only up to RWIDE_INT_MAX_ELTS limbs
> inline and is trivially copyable (dwarf2out should never deal with large
> _BitInt constants, those should have been lowered earlier).

As discussed on IRC, the dwarf2out.{h,cc} needs are actually quite limited,
it just needs to allocate new GC structures val_wide points to (constructed
from some const wide_int_ref &) and needs to call operator==,
get_precision, elt, get_len and get_val methods on it.
Even trailing_wide_int would be overkill for that, the following just adds
a new struct with precision/len and trailing val array members and
implements the needed methods (only 2 of them using wide_int_ref constructed
from those).
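
Since the incremental patch below is cut off before the new struct appears,
here is a minimal sketch of the layout being described (member and method
names are assumptions, not necessarily the final interface):

  /* Trailing-array wide int for dwarf2out's GC needs (sketch).  */
  struct GTY ((variable_size)) dw_wide_int
  {
    unsigned int precision;
    unsigned int len;
    HOST_WIDE_INT val[1];	/* LEN elements, allocated with the struct.  */

    unsigned int get_precision () const { return precision; }
    unsigned int get_len () const { return len; }
    const HOST_WIDE_INT *get_val () const { return val; }
    /* Elements beyond LEN are implicit sign extensions.  */
    HOST_WIDE_INT elt (unsigned int i) const
    { return i < len ? val[i] : (val[len - 1] < 0 ? -1 : 0); }
  };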

Incremental patch, so far compile time tested only:

--- gcc/wide-int.h.jj   2023-10-09 14:37:45.878940132 +0200
+++ gcc/wide-int.h  2023-10-09 16:06:39.326805176 +0200
@@ -27,7 +27,7 @@ along with GCC; see the file COPYING3.
other longer storage GCC representations (rtl and tree).
 
The actual precision of a wide_int depends on the flavor.  There
-   are four predefined flavors:
+   are three predefined flavors:
 
  1) wide_int (the default).  This flavor does the math in the
  precision of its input arguments.  It is assumed (and checked)
@@ -80,12 +80,7 @@ along with GCC; see the file COPYING3.
wi::leu_p (a, b) as a more efficient short-hand for
"a >= 0 && a <= b". ]
 
- 3) rwide_int.  Restricted wide_int.  This is similar to
- wide_int, but maximum possible precision is RWIDE_INT_MAX_PRECISION
- and it always uses an inline buffer.  offset_int and rwide_int are
- GC-friendly, wide_int and widest_int are not.
-
- 4) widest_int.  This representation is an approximation of
+ 3) widest_int.  This representation is an approximation of
  infinite precision math.  However, it is not really infinite
  precision math as in the GMP library.  It is really finite
  precision math where the precision is WIDEST_INT_MAX_PRECISION.
@@ -257,9 +252,6 @@ along with GCC; see the file COPYING3.
 #define WIDE_INT_MAX_ELTS 255
 #define WIDE_INT_MAX_PRECISION (WIDE_INT_MAX_ELTS * HOST_BITS_PER_WIDE_INT)
 
-#define RWIDE_INT_MAX_ELTS WIDE_INT_MAX_INL_ELTS
-#define RWIDE_INT_MAX_PRECISION WIDE_INT_MAX_INL_PRECISION
-
 /* Precision of widest_int and largest _BitInt precision + 1 we can
support.  */
 #define WIDEST_INT_MAX_ELTS 510
@@ -343,7 +335,6 @@ STATIC_ASSERT (WIDE_INT_MAX_INL_ELTS < W
 template  class generic_wide_int;
 template  class fixed_wide_int_storage;
 class wide_int_storage;
-class rwide_int_storage;
 template  class widest_int_storage;
 
 /* An N-bit integer.  Until we can use typedef templates, use this instead.  */
@@ -352,7 +343,6 @@ template  class widest_int_storag
 
 typedef generic_wide_int  wide_int;
 typedef FIXED_WIDE_INT (ADDR_MAX_PRECISION) offset_int;
-typedef generic_wide_int  rwide_int;
 typedef generic_wide_int  > 
widest_int;
 typedef generic_wide_int  
> widest2_int;
 
@@ -1371,180 +1361,6 @@ wi::int_traits ::get_b
 return wi::get_precision (x);
 }
 
-/* The storage used by rwide_int.  */
-class GTY(()) rwide_int_storage
-{
-private:
-  HOST_WIDE_INT val[RWIDE_INT_MAX_ELTS];
-  unsigned int len;
-  unsigned int precision;
-
-public:
-  rwide_int_storage () = default;
-  template 
-  rwide_int_storage (const T &);
-
-  /* The standard generic_rwide_int storage methods.  */
-  unsigned int get_precision () const;
-  const HOST_WIDE_INT *get_val () const;
-  unsigned int get_len () const;
-  HOST_WIDE_INT *write_val (unsigned int);
-  void set_len (unsigned int, bool = false);
-
-  template 
-  rwide_int_storage &operator = (const T &);
-
-  static rwide_int from (const wide_int_ref &, unsigned int, signop);
-  static rwide_int from_array (const HOST_WIDE_INT *, unsigned int,
-  unsigned int, bool = true);
-  static rwide_int create (unsigned int);
-};
-
-namespace wi
-{
-  template <>
-  struct int_traits 
-  {
-static const enum precision_type precision_type = VAR_PRECISION;
-/* Guaranteed by a static assert in the rwide_int_storage constructor.  */
-static const bool host_dependent_precision = false;
-static const bool is_sign_extended = true;
-static const bool needs_write_val_arg = false;
-template 
-static rwide_int get_binary_result (const T1 &, const T2 &);
-template 
-static unsigned int get_binary_precision (const T1 &, const T2 &);
-  };
-}
-
-/* Initialize the storage from integer X, in its natural precision.
-   Note that we do not allow integers with host-dependent precision
-   to become rwide_ints; rwide_ints must always be logically independent
-   of the host.  */
-template 
-inline rwide_int_storage::rwide_int_storage (const T &x)
-{
-  ST

Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-09 Thread Richard Sandiford
Robin Dapp  writes:
>> It'd be good to expand on this comment a bit.  What kind of COND are you
>> anticipating?  A COND with the neutral op as the else value, so that the
>> PLUS_EXPR (or whatever) can remain unconditional?  If so, it would be
>> good to sketch briefly how that happens, and why it's better than using
>> the conditional PLUS_EXPR.
>> 
>> If that's the reason, perhaps we want a single-use check as well.
>> It's possible that OP1 is used elsewhere in the loop body, in a
>> context that would prefer a different else value.
>
> Would something like the following on top work?
>
> -  /* If possible try to create an IFN_COND_ADD instead of a COND_EXPR and
> - a PLUS_EXPR.  Don't do this if the reduction def operand itself is
> +  /* If possible create a COND_OP instead of a COND_EXPR and an OP_EXPR.
> + The COND_OP will have a neutral_op else value.
> +
> + This allows re-using the mask directly in a masked reduction instead
> + of creating a vector merge (or similar) and then an unmasked reduction.
> +
> + Don't do this if the reduction def operand itself is
>   a vectorizable call as we can create a COND version of it directly.  */

It wasn't very clear, sorry, but it was the last sentence I was asking
for clarification on, not the other bits.  Why do we want to avoid
generating a COND_ADD when the operand is a vectorisable call?

Thanks,
Richard

>
>if (ifn != IFN_LAST
>&& vectorized_internal_fn_supported_p (ifn, TREE_TYPE (lhs))
> -  && try_cond_op && !swap)
> +  && use_cond_op && !swap && has_single_use (op1))
>
> Regards
>  Robin
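
For reference, the two shapes under discussion look roughly like this
(a GIMPLE-flavored sketch with illustrative names):

  /* COND_EXPR plus an unconditional op: the mask selects the neutral
     value (0 for PLUS) before an unmasked reduction.  */
  _t = _mask ? val_1 : 0;
  sum_2 = sum_1 + _t;

  /* Single masked COND_OP: inactive lanes keep the accumulator, which
     is equivalent to adding the neutral value, and the mask is reused
     directly by the masked reduction.  */
  sum_2 = .COND_ADD (_mask, sum_1, val_1, sum_1);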


Re: [PATCH] sso-string@gnu-versioned-namespace [PR83077]

2023-10-09 Thread Iain Sandoe
Hi François,

> On 7 Oct 2023, at 20:32, François Dumont  wrote:
> 
> I've been told that previous patch generated with 'git diff -b' was not 
> applying properly so here is the same patch again with a simple 'git diff'.

Thanks, that did fix it - There is some trailing whitespace in the config 
files, but I suspect that it needs to be there since those have values 
appended during the configuration.

Anyway, with this + the coroutines and contract v2 (weak def) fix, plus a local 
patch to enable versioned namespace on Darwin, I get results comparable with 
the non-versioned case - but one more patchlet is needed on yours (to allow 
for targets using emulated TLS):

diff --git a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver 
b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
index 9fab8bead15..b7167fc0c2f 100644
--- a/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
+++ b/libstdc++-v3/config/abi/pre/gnu-versioned-namespace.ver
@@ -78,6 +78,7 @@ GLIBCXX_8.0 {
 
 # thread/mutex/condition_variable/future
 __once_proxy;
+__emutls_v._ZNSt3__81?__once_call*;
 
 # std::__convert_to_v
 _ZNSt3__814__convert_to_v*;


thanks
Iain

> 
> 
> On 07/10/2023 14:25, François Dumont wrote:
>> Hi
>> 
>> Here is a rebased version of this patch.
>> 
>> There are a few test failures when running 'make check-c++' but nothing new.
>> 
>> Still, there are 2 patches awaiting validation to fix some of them, PR 
>> c++/111524 to fix another bunch and I fear that we will have to live with 
>> the others.
>> 
>> libstdc++: [_GLIBCXX_INLINE_VERSION] Use cxx11 abi [PR83077]
>> 
>> Use the cxx11 ABI when activating versioned namespace mode. To do so,
>> support a new configuration mode where !_GLIBCXX_USE_DUAL_ABI and 
>> _GLIBCXX_USE_CXX11_ABI.
>> 
>> The main change is that std::__cow_string is now defined whenever 
>> _GLIBCXX_USE_DUAL_ABI
>> or _GLIBCXX_USE_CXX11_ABI is true. Implementation is using available 
>> std::string in
>> case of dual abi and a subset of it when it's not.
>> 
>> On the other side std::__sso_string is defined only when 
>> _GLIBCXX_USE_DUAL_ABI is true
>> and _GLIBCXX_USE_CXX11_ABI is false. Meaning that std::__sso_string is a 
>> typedef for the
>> cow std::string implementation when dual abi is disabled and cow string 
>> is being used.
>> 
>> libstdc++-v3/ChangeLog:
>> 
>> PR libstdc++/83077
>> * acinclude.m4 [GLIBCXX_ENABLE_LIBSTDCXX_DUAL_ABI]: Default to 
>> "new" libstdcxx abi
>> when enable_symvers is gnu-versioned-namespace.
>> * config/locale/dragonfly/monetary_members.cc 
>> [!_GLIBCXX_USE_DUAL_ABI]: Define money_base
>> members.
>> * config/locale/generic/monetary_members.cc 
>> [!_GLIBCXX_USE_DUAL_ABI]: Likewise.
>> * config/locale/gnu/monetary_members.cc 
>> [!_GLIBCXX_USE_DUAL_ABI]: Likewise.
>> * config/locale/gnu/numeric_members.cc
>> [!_GLIBCXX_USE_DUAL_ABI](__narrow_multibyte_chars): Define.
>> * configure: Regenerate.
>> * include/bits/c++config
>> [_GLIBCXX_INLINE_VERSION](_GLIBCXX_NAMESPACE_CXX11, 
>> _GLIBCXX_BEGIN_NAMESPACE_CXX11):
>> Define empty.
>> [_GLIBCXX_INLINE_VERSION](_GLIBCXX_END_NAMESPACE_CXX11, 
>> _GLIBCXX_DEFAULT_ABI_TAG):
>> Likewise.
>> * include/bits/cow_string.h [!_GLIBCXX_USE_CXX11_ABI]: Define a 
>> light version of COW
>> basic_string as __std_cow_string for use in stdexcept.
>> * include/std/stdexcept [_GLIBCXX_USE_CXX11_ABI]: Define 
>> __cow_string.
>> (__cow_string(const char*)): New.
>> (__cow_string::c_str()): New.
>> * python/libstdcxx/v6/printers.py (StdStringPrinter::__init__): 
>> Set self.new_string to True
>> when std::__8::basic_string type is found.
>> * src/Makefile.am 
>> [ENABLE_SYMVERS_GNU_NAMESPACE](ldbl_alt128_compat_sources): Define empty.
>> * src/Makefile.in: Regenerate.
>> * src/c++11/Makefile.am (cxx11_abi_sources): Rename into...
>> (dual_abi_sources): ...this. Also move cow-local_init.cc, 
>> cxx11-hash_tr1.cc,
>> cxx11-ios_failure.cc entries to...
>> (sources): ...this.
>> (extra_string_inst_sources): Move cow-fstream-inst.cc, 
>> cow-sstream-inst.cc, cow-string-inst.cc,
>> cow-string-io-inst.cc, cow-wstring-inst.cc, 
>> cow-wstring-io-inst.cc, cxx11-locale-inst.cc,
>> cxx11-wlocale-inst.cc entries to...
>> (inst_sources): ...this.
>> * src/c++11/Makefile.in: Regenerate.
>> * src/c++11/cow-fstream-inst.cc [_GLIBCXX_USE_CXX11_ABI]: Skip 
>> definitions.
>> * src/c++11/cow-locale_init.cc [_GLIBCXX_USE_CXX11_ABI]: Skip 
>> definitions.
>> * src/c++11/cow-sstream-inst.cc [_GLIBCXX_USE_CXX11_ABI]: Skip 
>> definitions.
>> * src/c++11/cow-stdexce

RE: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

2023-10-09 Thread Li, Pan2
Committed, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, October 9, 2023 10:28 PM
To: juzhe.zhong 
Cc: gcc-patches@gcc.gnu.org; rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV



On 10/9/23 08:21, juzhe.zhong wrote:
> Do you mean add a check whether it is vectorized or not?
Yes.

> 
> Sounds reasonable, I can add that in another patch.
Sounds good.  Thanks.

jeff


RE: [PATCH] RISC-V Regression tests: Fix FAIL of pr97832* for RVV

2023-10-09 Thread Li, Pan2
Committed, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, October 9, 2023 9:53 PM
To: Juzhe-Zhong ; gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression tests: Fix FAIL of pr97832* for RVV



On 10/9/23 07:15, Juzhe-Zhong wrote:
> These cases are vectorized by vec_load_lanes with stride = 8 instead of SLP
> with -fno-vect-cost-model.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/pr97832-2.c: Adapt dump check for target supports 
> load_lanes with stride = 8.
>   * gcc.dg/vect/pr97832-3.c: Ditto.
>   * gcc.dg/vect/pr97832-4.c: Ditto.
OK.  Same question as last 3 acks.

jeff


RE: [PATCH] RISC-V Regression test: Fix FAIL of slp-12a.c

2023-10-09 Thread Li, Pan2
Committed, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, October 9, 2023 9:53 PM
To: Juzhe-Zhong ; gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression test: Fix FAIL of slp-12a.c



On 10/9/23 07:35, Juzhe-Zhong wrote:
> This case is vectorized by stride8 load_lanes.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/slp-12a.c: Adapt for stride 8 load_lanes.
OK.  Same question as last two ACKs.

jeff


RE: [PATCH] RISC-V Regression test: Fix FAIL of slp-reduc-4.c for RVV

2023-10-09 Thread Li, Pan2
Committed, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, October 9, 2023 9:52 PM
To: Juzhe-Zhong ; gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression test: Fix FAIL of slp-reduc-4.c for RVV



On 10/9/23 07:41, Juzhe-Zhong wrote:
> RVV vectortizes this case with stride8 load_lanes.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/slp-reduc-4.c: Adapt test for stride8 load_lanes.
OK.  Similar question as my last ack.  Do we want a follow-up here which 
tests the .vect dump for the ! { vect_load_lanes && vec_strided8 } case?

jeff


Re: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

2023-10-09 Thread Jeff Law




On 10/9/23 08:21, juzhe.zhong wrote:

Do you mean add a check whether it is vectorized or not?

Yes.



Sounds reasonable, I can add that in another patch.

Sounds good.  Thanks.

jeff


RE: [PATCH] RISC-V Regression test: Adapt SLP tests like ARM SVE

2023-10-09 Thread Li, Pan2
Committed, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, October 9, 2023 9:49 PM
To: Juzhe-Zhong ; gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression test: Adapt SLP tests like ARM SVE



On 10/9/23 07:37, Juzhe-Zhong wrote:
> Like ARM SVE, RVV is vectorizing these 2 cases in the same way.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/slp-23.c: Add RVV like ARM SVE.
>   * gcc.dg/vect/slp-perm-10.c: Ditto.
OK
jeff


Re: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

2023-10-09 Thread juzhe.zhong
Do you mean add a check whether it is vectorized or not?

Sounds reasonable, I can add that in another patch.

 Replied Message 
From: Jeff Law
Date: 10/09/2023 21:51
To: Juzhe-Zhong, gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

On 10/9/23 07:39, Juzhe-Zhong wrote:
> RVV vectorize it with stride5 load_lanes.
>  
> gcc/testsuite/ChangeLog:
>  
>     * gcc.dg/vect/slp-perm-4.c: Adapt test for stride5 load_lanes.
OK.

As a follow-up, would it make sense to test the .vect dump for something  
else in the ! {vec_load_lanes && vect_strided5 } case to verify that it  
does and continues to be vectorized for that configuration?

jeff



Re: [PATCH v1 2/4] RISC-V: Refactor riscv_option_override and riscv_convert_vector_bits. [NFC]

2023-10-09 Thread Jeff Law




On 10/3/23 03:09, Kito Cheng wrote:

Allow those functions to apply settings from a local gcc_options rather
than the global options.

Preparatory for the target attribute; separate this change out for easier
review since it's an NFC.

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_convert_vector_bits): Get the setting
from the argument rather than from the global setting.
(riscv_override_options_internal): New, split from
riscv_override_options; also take a gcc_options argument.
(riscv_option_override): Split most parts out to
riscv_override_options_internal.

OK once prerequisites are approved and installed.

jeff


Re: [PATCH v1 1/4] options: Define TARGET__P and TARGET__OPTS_P macro for Mask and InverseMask

2023-10-09 Thread Jeff Law




On 10/3/23 03:09, Kito Cheng wrote:

We use the TARGET_<NAME>_P macro to test a Mask and InverseMask against the
user-specified target_variable; however, we may want to test against a
specific gcc_options variable rather than the target_variable.

For example, RISC-V has defined lots of Masks with TargetVariable, which is
not easy to use, because it means we need to know which Mask is associated
with which TargetVariable, so taking a gcc_options variable is a better
interface for such use cases.
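
For illustration, the intent is roughly the following, using a hypothetical
MASK_FOO kept in a hypothetical TargetVariable x_riscv_foo_flags (a sketch,
not the exact generated text):

  /* Test the mask against an explicit flags value...  */
  #define TARGET_FOO_P(VALUE) (((VALUE) & MASK_FOO) != 0)
  /* ...or against the right TargetVariable inside a gcc_options.  */
  #define TARGET_FOO_OPTS_P(opts) (((opts)->x_riscv_foo_flags & MASK_FOO) != 0)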

gcc/ChangeLog:

* doc/options.texi (Mask): Document TARGET_<NAME>_P and
TARGET_<NAME>_OPTS_P.
(InverseMask): Ditto.
* opth-gen.awk (Mask): Generate the TARGET_<NAME>_P and
TARGET_<NAME>_OPTS_P macros.
(InverseMask): Ditto.
Doesn't this need to be updated to avoid multi-dimensional arrays in awk 
and rebased?


Jeff


Re: [pushed] analyzer: improvements to out-of-bounds diagrams [PR111155]

2023-10-09 Thread David Malcolm
On Mon, 2023-10-09 at 12:09 +0200, Tobias Burnus wrote:
> Hi David,
> 
> your commit breaks compilation with GCC < 6, here with GCC 5.2:
> 
> gcc/analyzer/access-diagram.cc: In member function 'void
> ana::boundaries::add(const ana::access_range&,
> ana::boundaries::kind)':
> gcc/analyzer/access-diagram.cc:655:20: error: 'kind' is not a class,
> namespace, or enumeration
>     (kind == kind::HARD) ? "HARD" : "soft");
>  ^
> The problem is ...
> 
> On 09.10.23 00:58, David Malcolm wrote:
> 
> > Update out-of-bounds diagrams to show existing string values,
> > diff --git a/gcc/analyzer/access-diagram.cc b/gcc/analyzer/access-
> > diagram.cc
> > index a51d594b5b2..2197ec63f53 100644
> > --- a/gcc/analyzer/access-diagram.cc
> > +++ b/gcc/analyzer/access-diagram.cc
> > @@ -630,8 +630,8 @@ class boundaries
> >   public:
> >     enum class kind { HARD, SOFT};
> 
> ...
> 
> > @@ -646,6 +646,15 @@ public:
> 
> Just above the following diff is the line:
> 
>    void add (const access_range &range, enum kind kind)
> 
> >     {
> >   add (range.m_start, kind);
> >   add (range.m_next, kind);
> > +    if (m_logger)
> > +  {
> > + m_logger->start_log_line ();
> > + m_logger->log_partial ("added access_range: ");
> > + range.dump_to_pp (m_logger->get_printer (), true);
> > + m_logger->log_partial (" (%s)",
> > +    (kind == kind::HARD) ? "HARD" :
> > "soft");
> > + m_logger->end_log_line ();
> 
> Actual problem:
> 
> Playing around also with the compiler explorer shows that GCC 5.2 or
> likewise 5.5
> do not like the variable (PARAM_DECL) name "kind" combined with 
> "kind::HARD".
> 
> The following works:
> (A) Using "kind == boundaries::kind::HARD" - i.e. adding
> "boundaries::"
> (B) Renaming the parameter name "kind" to something else - like "k"
> as used
>  in the other functions.
> 
> Can you fix it?

Sorry about the breakage, and thanks for the investigation.

Does the following patch fix the build for you?
Thanks


gcc/analyzer/ChangeLog:
* access-diagram.cc (boundaries::add): Explicitly state
"boundaries::" scope for "kind" enum.
---
 gcc/analyzer/access-diagram.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/analyzer/access-diagram.cc b/gcc/analyzer/access-diagram.cc
index 2197ec63f53..c7d190e3188 100644
--- a/gcc/analyzer/access-diagram.cc
+++ b/gcc/analyzer/access-diagram.cc
@@ -652,7 +652,8 @@ public:
m_logger->log_partial ("added access_range: ");
range.dump_to_pp (m_logger->get_printer (), true);
m_logger->log_partial (" (%s)",
-  (kind == kind::HARD) ? "HARD" : "soft");
+  (kind == boundaries::kind::HARD)
+  ? "HARD" : "soft");
m_logger->end_log_line ();
   }
   }
-- 
2.26.3
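
For reference, a minimal sketch of the construct that the older compiler
rejects (illustrative only, not part of the patch):

  class boundaries
  {
  public:
    enum class kind { HARD, SOFT };
    bool is_hard (enum kind kind)
    {
      /* GCC 5.2 reports "'kind' is not a class, namespace, or enumeration"
         here, because the parameter shadows the enum name; GCC >= 6 accepts
         it, and variants (A) and (B) above sidestep the ambiguity.  */
      return kind == kind::HARD;
    }
  };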




RE: [PATCH V2] RISC-V: Support movmisalign of RVV VLA modes

2023-10-09 Thread Li, Pan2
Committed, thanks Robin.

Pan

-Original Message-
From: Robin Dapp  
Sent: Monday, October 9, 2023 9:54 PM
To: Juzhe-Zhong ; gcc-patches@gcc.gnu.org
Cc: rdapp@gmail.com; kito.ch...@gmail.com; kito.ch...@sifive.com; 
jeffreya...@gmail.com
Subject: Re: [PATCH V2] RISC-V: Support movmisalign of RVV VLA modes

Thanks, for now this LGTM.

Regards
 Robin


Re: [PATCH V2] RISC-V: Support movmisalign of RVV VLA modes

2023-10-09 Thread Robin Dapp
Thanks, for now this LGTM.

Regards
 Robin


Re: [PATCH] RISC-V Regression tests: Fix FAIL of pr97832* for RVV

2023-10-09 Thread Jeff Law




On 10/9/23 07:15, Juzhe-Zhong wrote:

These cases are vectorized by vec_load_lanes with stride = 8 instead of SLP
with -fno-vect-cost-model.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr97832-2.c: Adapt dump check for target supports 
load_lanes with stride = 8.
* gcc.dg/vect/pr97832-3.c: Ditto.
* gcc.dg/vect/pr97832-4.c: Ditto.

OK.  Same question as last 3 acks.

jeff


Re: [PATCH] RISC-V Regression test: Fix FAIL of slp-12a.c

2023-10-09 Thread Jeff Law




On 10/9/23 07:35, Juzhe-Zhong wrote:

This case is vectorized by stride8 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-12a.c: Adapt for stride 8 load_lanes.

OK.  Same question as last two ACKs.

jeff


Re: [PATCH] RISC-V Regression test: Fix FAIL of slp-reduc-4.c for RVV

2023-10-09 Thread Jeff Law




On 10/9/23 07:41, Juzhe-Zhong wrote:

RVV vectorizes this case with stride8 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-reduc-4.c: Adapt test for stride8 load_lanes.
OK.  Similar question as my last ack.  Do we want a follow-up here which 
tests the .vect dump for the ! { vect_load_lanes && vec_strided8 } case?


jeff


Re: [PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

2023-10-09 Thread Jeff Law




On 10/9/23 07:39, Juzhe-Zhong wrote:

RVV vectorizes it with stride5 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-perm-4.c: Adapt test for stride5 load_lanes.

OK.

As a follow-up, would it make sense to test the .vect dump for something 
else in the ! {vec_load_lanes && vect_strided5 } case to verify that it 
does and continues to be vectorized for that configuration?


jeff
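
A follow-up check along those lines could look like this (a sketch in the
same dg-final style used by these tests; the actual follow-up may differ):

  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */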


Re: [PATCH] RISC-V Regression test: Adapt SLP tests like ARM SVE

2023-10-09 Thread Jeff Law




On 10/9/23 07:37, Juzhe-Zhong wrote:

Like ARM SVE, RVV is vectorizing these 2 cases in the same way.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-23.c: Add RVV like ARM SVE.
* gcc.dg/vect/slp-perm-10.c: Ditto.

OK
jeff


Re: [PATCH] wide-int: Allow up to 16320 bits wide_int and change widest_int precision to 32640 bits [PR102989]

2023-10-09 Thread Jakub Jelinek
On Mon, Oct 09, 2023 at 01:54:19PM +0100, Richard Sandiford wrote:
> > I've additionally built it with the incremental attached patch and
> > on make -C gcc check-gcc check-g++ -j32 -k it didn't show any
> > wide_int/widest_int heap allocations unless a > 128-bit _BitInt or wb/uwb
> > constant needing > 128-bit _BitInt was used in a testcase.
> 
> Overall it looks really good to me FWIW.  Some comments about the
> wide-int.h changes below.  Will send a separate message about wide-int.cc.

Thanks, just quick answers, will work on patch adjustments after trying to
get rid of rwide_int (seems dwarf2out has very limited needs from it, just
some routine to construct it in GCed memory (and never change afterwards)
from const wide_int_ref & or so, and then working operator ==,
get_precision, elt, get_len and get_val methods, so I think we could just
have a struct dw_wide_int { unsigned int prec, len; HOST_WIDE_INT val[1]; };
and perform the methods on it after converting to a storage ref.

> > @@ -380,7 +406,11 @@ namespace wi
> >  
> >  /* The integer has a constant precision (known at GCC compile time)
> > and is signed.  */
> > -CONST_PRECISION
> > +CONST_PRECISION,
> > +
> > +/* Like CONST_PRECISION, but with WIDEST_INT_MAX_PRECISION or larger
> > +   precision where not all elements of arrays are always present.  */
> > +WIDEST_CONST_PRECISION
> >};
> 
> Sorry to bring this up so late, but how about using INL_CONST_PRECISION
> for the fully inline case and CONST_PRECISION for the general case?
> That seems more consistent with the other naming in the patch.

Ok.

> > @@ -482,6 +541,18 @@ namespace wi
> >};
> >  
> >template 
> > +  struct binary_traits  > WIDEST_CONST_PRECISION>
> > +  {
> > +STATIC_ASSERT (int_traits ::precision == int_traits 
> > ::precision);
> 
> Should this assert for equal inl_precision too?  Although it probably
> isn't necessary computationally, it seems a bit arbitrary to pick the
> first inl_precision...

inl_precision is only used for widest_int/widest2_int, so if precision is
equal, inl_precision is as well.

> > +inline wide_int_storage::wide_int_storage (const wide_int_storage &x)
> > +{
> > +  len = x.len;
> > +  precision = x.precision;
> > +  if (UNLIKELY (precision > WIDE_INT_MAX_INL_PRECISION))
> > +{
> > +  u.valp = XNEWVEC (HOST_WIDE_INT, CEIL (precision, 
> > HOST_BITS_PER_WIDE_INT));
> > +  memcpy (u.valp, x.u.valp, len * sizeof (HOST_WIDE_INT));
> > +}
> > +  else if (LIKELY (precision))
> > +memcpy (u.val, x.u.val, len * sizeof (HOST_WIDE_INT));
> > +}
> 
> Does the variable-length memcpy pay for itself?  If so, perhaps that's a
> sign that we should have a smaller inline buffer for this class (say 2 HWIs).

Guess I'll try to see what results in smaller .text size.

> > +namespace wi
> > +{
> > +  template 
> > +  struct int_traits < widest_int_storage  >
> > +  {
> > +static const enum precision_type precision_type = 
> > WIDEST_CONST_PRECISION;
> > +static const bool host_dependent_precision = false;
> > +static const bool is_sign_extended = true;
> > +static const bool needs_write_val_arg = true;
> > +static const unsigned int precision
> > +  = N / WIDE_INT_MAX_INL_PRECISION * WIDEST_INT_MAX_PRECISION;
> 
> What's the reasoning behind this calculation?  It would give 0 for
> N < WIDE_INT_MAX_INL_PRECISION, and the "MAX" suggests that N
> shouldn't be > WIDE_INT_MAX_INL_PRECISION either.
> 
> I wonder whether this should be a second template parameter, with an
> assert that precision > inl_precision.

Maybe.  Yet another option would be to always use WIDE_INT_MAX_INL_PRECISION
as the inline precision (and use N template parameter just to decide about
the overall precision), regardless of whether it is widest_int or
widest2_int.  The latter is very rare and even much rarer that something
wouldn't fit into the WIDE_INT_MAX_INL_PRECISION when not using _BitInt.
The reason for introducing inl_precision was to avoid the heap allocation
for widest2_int unless _BitInt is in use, but maybe that isn't worth it.

> Nit: might format more naturally with:
> 
>   using res_traits = wi::int_traits :
>   ...

Ok.

> > @@ -2203,6 +2781,9 @@ wi::sext (const T &x, unsigned int offse
> >unsigned int precision = get_precision (result);
> >WIDE_INT_REF_FOR (T) xi (x, precision);
> >  
> > +  if (result.needs_write_val_arg)
> > +val = result.write_val (MAX (xi.len,
> > +CEIL (offset, HOST_BITS_PER_WIDE_INT)));
> 
> Why MAX rather than MIN?

Because it needs to be an upper bound.
In this case, sext_large has
  unsigned int len = offset / HOST_BITS_PER_WIDE_INT;
  /* Extending beyond the precision is a no-op.  If we have only stored
 OFFSET bits or fewer, the rest are already signs.  */
  if (offset >= precision || len >= xlen)
{
  for (unsigned i = 0; i < xlen; ++i)
val[i] = xval[i];
  return xlen;
}
  unsigned int suboffset = of

[PATCH] RISC-V Regression test: Fix FAIL of slp-reduc-4.c for RVV

2023-10-09 Thread Juzhe-Zhong
RVV vectorizes this case with stride8 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-reduc-4.c: Adapt test for stride8 load_lanes.

---
 gcc/testsuite/gcc.dg/vect/slp-reduc-4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-reduc-4.c 
b/gcc/testsuite/gcc.dg/vect/slp-reduc-4.c
index 15f5c259e98..e2fe01bb13d 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-reduc-4.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-reduc-4.c
@@ -60,6 +60,6 @@ int main (void)
 /* For variable-length SVE, the number of scalar statements in the
reduction exceeds the number of elements in a 128-bit granule.  */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { ! vect_multiple_sizes } xfail { vect_no_int_min_max || { aarch64_sve 
&& vect_variable_length } } } } } */
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
vect_multiple_sizes } } } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
vect_multiple_sizes && { ! { vect_load_lanes && vect_strided8 } } } } } } */
 /* { dg-final { scan-tree-dump-times "VEC_PERM_EXPR" 0 "vect" { xfail { 
aarch64_sve && vect_variable_length } } } } */
 
-- 
2.36.3



[PATCH] RISC-V Regression test: Fix slp-perm-4.c FAIL for RVV

2023-10-09 Thread Juzhe-Zhong
RVV vectorizes it with stride5 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-perm-4.c: Adapt test for stride5 load_lanes.

---
 gcc/testsuite/gcc.dg/vect/slp-perm-4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-4.c 
b/gcc/testsuite/gcc.dg/vect/slp-perm-4.c
index 107968f1f7c..f4bda39c837 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-4.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-4.c
@@ -115,4 +115,4 @@ int main (int argc, const char* argv[])
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "gaps requires scalar epilogue loop" 0 
"vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } 
} */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { ! { vect_load_lanes && vect_strided5 } } } } } */
-- 
2.36.3



[PATCH] RISC-V Regression test: Adapt SLP tests like ARM SVE

2023-10-09 Thread Juzhe-Zhong
Like ARM SVE, RVV is vectorizing these 2 cases in the same way.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-23.c: Add RVV like ARM SVE.
* gcc.dg/vect/slp-perm-10.c: Ditto.

---
 gcc/testsuite/gcc.dg/vect/slp-23.c  | 2 +-
 gcc/testsuite/gcc.dg/vect/slp-perm-10.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-23.c 
b/gcc/testsuite/gcc.dg/vect/slp-23.c
index d32ee5ba73b..8836acf0330 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-23.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-23.c
@@ -114,5 +114,5 @@ int main (void)
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { ! vect_perm } } } } */
 /* SLP fails for the second loop with variable-length SVE because
the load size is greater than the minimum vector size.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { 
target vect_perm xfail { aarch64_sve && vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { 
target vect_perm xfail { { aarch64_sve || riscv_v } && vect_variable_length } } 
} } */
   
diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-10.c 
b/gcc/testsuite/gcc.dg/vect/slp-perm-10.c
index 2cce30c2444..03de4c61b50 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-10.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-10.c
@@ -53,4 +53,4 @@ int main ()
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target 
vect_perm } } } */
 /* SLP fails for variable-length SVE because the load size is greater
than the minimum vector size.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target vect_perm xfail { aarch64_sve && vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target vect_perm xfail { { aarch64_sve || riscv_v } && vect_variable_length } } 
} } */
-- 
2.36.3



Re: [PATCH 1/3]middle-end: Refactor vectorizer loop conditionals and separate out IV to new variables

2023-10-09 Thread Richard Biener
On Mon, 2 Oct 2023, Tamar Christina wrote:

> Hi All,
> 
> This is extracted out of the patch series to support early break vectorization
> in order to simplify the review of that patch series.
> 
> The goal of this one is to separate out the refactoring from the new
> functionality.
> 
> This first patch separates out the vectorizer's definition of an exit to their
> own values inside loop_vinfo.  During vectorization we can have three separate
> copies for each loop: scalar, vectorized, epilogue.  The scalar loop can also 
> be
> the versioned loop before peeling.
> 
> Because of this we track 3 different exits inside loop_vinfo corresponding to
> each of these loops.  Additionally, each function that uses an exit, when it
> is not obviously clear which exit is needed, will now take the exit
> explicitly as an argument.
> 
> This is because oftentimes the callers switch the loops being passed around.
> While the caller knows which loop it is, the callee does not.
> 
> For now the loop exits are simply initialized to the same value as before,
> determined by single_exit (..).
> 
> No change in functionality is expected throughout this patch series.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu, and
> no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   * tree-loop-distribution.cc (copy_loop_before): Pass exit explicitly.
>   (loop_distribution::distribute_loop): Bail out of not single exit.
>   (loop_distribution::distribute_loop): Bail out if not single exit.
>   * tree-scalar-evolution.h (get_loop_exit_condition): New.
>   * tree-vect-data-refs.cc (vect_enhance_data_refs_alignment): Pass exit
>   explicitly.
>   * tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors,
>   vect_set_loop_condition_partial_vectors_avx512,
>   vect_set_loop_condition_normal, vect_set_loop_condition): Explicitly
>   take exit.
>   (slpeel_tree_duplicate_loop_to_edge_cfg): Explicitly take exit and
>   return new peeled corresponding peeled exit.
>   (slpeel_can_duplicate_loop_p): Explicitly take exit.
>   (find_loop_location): Handle not knowing an explicit exit.
>   (vect_update_ivs_after_vectorizer, vect_gen_vector_loop_niters_mult_vf,
>   find_guard_arg, slpeel_update_phi_nodes_for_loops,
>   slpeel_update_phi_nodes_for_guard2): Use new exits.
>   (vect_do_peeling): Update bookkeeping to keep track of exits.
>   * tree-vect-loop.cc (vect_get_loop_niters): Explicitly take exit to
>   analyze.
>   (vec_init_loop_exit_info): New.
>   (_loop_vec_info::_loop_vec_info): Initialize vec_loop_iv,
>   vec_epilogue_loop_iv, scalar_loop_iv.
>   (vect_analyze_loop_form): Initialize exits.
>   (vect_create_loop_vinfo): Set main exit.
>   (vect_create_epilog_for_reduction, vectorizable_live_operation,
>   vect_transform_loop): Use it.
>   (scale_profile_for_vect_loop): Explicitly take exit to scale.
>   * tree-vectorizer.cc (set_uid_loop_bbs): Initialize loop exit.
>   * tree-vectorizer.h (LOOP_VINFO_IV_EXIT, LOOP_VINFO_EPILOGUE_IV_EXIT,
>   LOOP_VINFO_SCALAR_IV_EXIT): New.
>   (struct loop_vec_info): Add vec_loop_iv, vec_epilogue_loop_iv,
>   scalar_loop_iv.
>   (vect_set_loop_condition, slpeel_can_duplicate_loop_p,
>   slpeel_tree_duplicate_loop_to_edge_cfg): Take explicit exits.
>   (vec_init_loop_exit_info): New.
>   (struct vect_loop_form_info): Add loop_exit.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
> index 
> a28470b66ea935741a61fb73961ed7c927543a3d..902edc49ab588152a5b845f2c8a42a7e2a1d6080
>  100644
> --- a/gcc/tree-loop-distribution.cc
> +++ b/gcc/tree-loop-distribution.cc
> @@ -949,7 +949,8 @@ copy_loop_before (class loop *loop, bool 
> redirect_lc_phi_defs)
>edge preheader = loop_preheader_edge (loop);
>  
>initialize_original_copy_tables ();
> -  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
> +  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, single_exit (loop), 
> NULL,
> + NULL, preheader, NULL);
>gcc_assert (res != NULL);
>  
>/* When a not last partition is supposed to keep the LC PHIs computed
> @@ -3043,6 +3044,24 @@ loop_distribution::distribute_loop (class loop *loop,
>return 0;
>  }
>  
> +  /* Loop distribution only does prologue peeling but we still need to
> + initialize loop exit information.  However we only support single exits 
> at
> + the moment.  As such, should exit information not have been provided 
> and we
> + have more than one exit, bail out.  */
> +  if (!single_exit (loop))
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file,
> +  "Loop %d not distributed: too many exits.\n",
> +  loop->num);
> +
> +  free_rdg (rdg);
> +  loop_nest.release ();
> +  free_d

[PATCH] RISC-V Regression test: Fix FAIL of slp-12a.c

2023-10-09 Thread Juzhe-Zhong
This case is vectorized by stride8 load_lanes.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/slp-12a.c: Adapt for stride 8 load_lanes.

---
 gcc/testsuite/gcc.dg/vect/slp-12a.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-12a.c 
b/gcc/testsuite/gcc.dg/vect/slp-12a.c
index f0dda55acae..973de6ada21 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-12a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-12a.c
@@ -76,5 +76,5 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { 
vect_strided8 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { 
! { vect_strided8 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { vect_strided8 && vect_int_mult } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { 
target { ! { vect_strided8 && vect_int_mult } } } } } */
-- 
2.36.3



Re: [PATCH] RISC-V: THead: Fix missing CFI directives for th.sdd in prologue.

2023-10-09 Thread Jeff Law




On 10/4/23 01:49, Xianmiao Qu wrote:

From: quxm 

When generating CFI directives for the store-pair instruction,
if we add two parallel REG_FRAME_RELATED_EXPR expr_lists like
   (expr_list:REG_FRAME_RELATED_EXPR (set (mem/c:DI (plus:DI (reg/f:DI 2 sp)
 (const_int 8 [0x8])) [1  S8 A64])
 (reg:DI 1 ra))
   (expr_list:REG_FRAME_RELATED_EXPR (set (mem/c:DI (reg/f:DI 2 sp) [1  S8 A64])
 (reg:DI 8 s0))
only the first expr_list will be recognized by the dwarf2out_frame_debug
function. So, here we generate a SEQUENCE expression of REG_FRAME_RELATED_EXPR,
which includes two sub-expressions of RTX_FRAME_RELATED_P. Then the
dwarf2out_frame_debug_expr function will iterate through all the sub-expressions
and generate the corresponding CFI directives.
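
In outline, the fix takes roughly this shape (a sketch, not the committed
code; mem1/mem2, reg1/reg2 and insn stand for the pair's operands and the
emitted store-pair insn):

  /* Wrap both frame-related sets in one SEQUENCE note so that
     dwarf2out_frame_debug_expr walks both sub-expressions.  */
  rtx set1 = gen_rtx_SET (mem1, reg1);
  rtx set2 = gen_rtx_SET (mem2, reg2);
  RTX_FRAME_RELATED_P (set1) = 1;
  RTX_FRAME_RELATED_P (set2) = 1;
  rtx seq = gen_rtx_SEQUENCE (VOIDmode, gen_rtvec (2, set1, set2));
  add_reg_note (insn, REG_FRAME_RELATED_EXPR, seq);
  RTX_FRAME_RELATED_P (insn) = 1;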

gcc/
* config/riscv/thead.cc (th_mempair_save_regs): Fix missing CFI
directives for store-pair instruction.

gcc/testsuite/
* gcc.target/riscv/xtheadmempair-4.c: New test.

Thanks.  I pushed this to the trunk.
jeff



Re: [PATCH] tree-optimization/111715 - improve TBAA for access paths with pun

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Richard Biener wrote:

> The following improves basic TBAA for access paths formed by
> C++ abstraction where we are able to combine a path from an
> address-taking operation with a path based on that access using
> a pun to avoid memory access semantics on the address-taking part.
> 
> The trick is to identify the point the semantic memory access path
> starts which allows us to use the alias set of the outermost access
> instead of only that of the base of this path.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu for all languages
> with a slightly different variant, re-bootstrapping/testing now
> (with doing the extra walk just for AGGREGATE_TYPE_P).

I ended up pushing the original version below after botching the
AGGREGATE_TYPE_P variant by improperly hiding the local variable.  It's
a micro-optimization not worth the trouble, I think.

Bootstrapped and tested on x86_64-unknown-linux-gnu for all languages, 
pushed.

From 9cf3fca604db73866d0dc69dc88f95155027b3d7 Mon Sep 17 00:00:00 2001
From: Richard Biener 
Date: Mon, 9 Oct 2023 13:05:10 +0200
Subject: [PATCH] tree-optimization/111715 - improve TBAA for access paths with
 pun
To: gcc-patches@gcc.gnu.org

The following improves basic TBAA for access paths formed by
C++ abstraction where we are able to combine a path from an
address-taking operation with a path based on that access using
a pun to avoid memory access semantics on the address-taking part.

The trick is to identify the point the semantic memory access path
starts which allows us to use the alias set of the outermost access
instead of only that of the base of this path.

PR tree-optimization/111715
* alias.cc (reference_alias_ptr_type_1): When we have
a type-punning ref at the base search for the access
path part that's still semantically valid.

* gcc.dg/tree-ssa/ssa-fre-102.c: New testcase.
---
 gcc/alias.cc| 17 ++-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c | 32 +
 2 files changed, 48 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c

diff --git a/gcc/alias.cc b/gcc/alias.cc
index 7c1af1fe96e..86d8f7104ad 100644
--- a/gcc/alias.cc
+++ b/gcc/alias.cc
@@ -774,7 +774,22 @@ reference_alias_ptr_type_1 (tree *t)
   && (TYPE_MAIN_VARIANT (TREE_TYPE (inner))
  != TYPE_MAIN_VARIANT
   (TREE_TYPE (TREE_TYPE (TREE_OPERAND (inner, 1))
-return TREE_TYPE (TREE_OPERAND (inner, 1));
+{
+  tree alias_ptrtype = TREE_TYPE (TREE_OPERAND (inner, 1));
+  /* Unless we have the (aggregate) effective type of the access
+somewhere on the access path.  If we have for example
+(&a->elts[i])->l.len exposed by abstraction we'd see
+MEM  [(B *)a].elts[i].l.len and we can use the alias set
+of 'len' when typeof (MEM  [(B *)a].elts[i]) == B for
+example.  See PR111715.  */
+  tree inner = *t;
+  while (handled_component_p (inner)
+&& (TYPE_MAIN_VARIANT (TREE_TYPE (inner))
+!= TYPE_MAIN_VARIANT (TREE_TYPE (alias_ptrtype
+   inner = TREE_OPERAND (inner, 0);
+  if (TREE_CODE (inner) == MEM_REF)
+   return alias_ptrtype;
+}
 
   /* Otherwise, pick up the outermost object that we could have
  a pointer to.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c
new file mode 100644
index 000..afd48050819
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c
@@ -0,0 +1,32 @@
+/* PR/111715 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-fre1" } */
+
+struct B {
+   struct { int len; } l;
+   long n;
+};
+struct A {
+   struct B elts[8];
+};
+
+static void
+set_len (struct B *b, int len)
+{
+  b->l.len = len;
+}
+
+static int
+get_len (struct B *b)
+{
+  return b->l.len;
+}
+
+int foo (struct A *a, int i, long *q)
+{
+  set_len (&a->elts[i], 1);
+  *q = 2;
+  return get_len (&a->elts[i]);
+}
+
+/* { dg-final { scan-tree-dump "return 1;" "fre1" } } */
-- 
2.35.3



[PATCH] RISC-V Regression tests: Fix FAIL of pr97832* for RVV

2023-10-09 Thread Juzhe-Zhong
These cases are vectorized by vec_load_lanes with stride = 8 instead of SLP
with -fno-vect-cost-model.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr97832-2.c: Adapt dump check for target supports 
load_lanes with stride = 8.
* gcc.dg/vect/pr97832-3.c: Ditto.
* gcc.dg/vect/pr97832-4.c: Ditto.

---
 gcc/testsuite/gcc.dg/vect/pr97832-2.c | 4 ++--
 gcc/testsuite/gcc.dg/vect/pr97832-3.c | 4 ++--
 gcc/testsuite/gcc.dg/vect/pr97832-4.c | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-2.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-2.c
index 4f0578120ee..7d8d2691432 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-2.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-2.c
@@ -25,5 +25,5 @@ void foo1x1(double* restrict y, const double* restrict x, int 
clen)
   }
 }
 
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
-/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-3.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-3.c
index ad1225ddbaa..c0603e1432e 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-3.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-3.c
@@ -46,5 +46,5 @@ void foo(double* restrict y, const double* restrict x0, const 
double* restrict x
   }
 }
 
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
-/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr97832-4.c 
b/gcc/testsuite/gcc.dg/vect/pr97832-4.c
index 74ae27ff873..c03442816a4 100644
--- a/gcc/testsuite/gcc.dg/vect/pr97832-4.c
+++ b/gcc/testsuite/gcc.dg/vect/pr97832-4.c
@@ -24,5 +24,5 @@ void foo1x1(double* restrict y, const double* restrict x, int 
clen)
   }
 }
 
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
-/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { 
! { vect_load_lanes && vect_strided8 } } } } } */
+/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" "vect" { target 
{ ! { vect_load_lanes && vect_strided8 } } } } } */
-- 
2.36.3



RE: [PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen

2023-10-09 Thread Li, Pan2
Committed, thanks Juzhe.

Pan

From: juzhe.zh...@rivai.ai 
Sent: Monday, October 9, 2023 9:11 PM
To: Li, Pan2 ; gcc-patches 
Cc: Li, Pan2 ; Wang, Yanzhang ; 
kito.cheng 
Subject: Re: [PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen

LGTM now.

Thanks.


juzhe.zh...@rivai.ai

From: pan2.li
Date: 2023-10-09 21:09
To: gcc-patches
CC: juzhe.zhong; 
pan2.li; 
yanzhang.wang; 
kito.cheng
Subject: [PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen
From: Pan Li <pan2...@intel.com>

Update in v2

* Remove emit helper functions.
* Take expand_binop instead.

Original log:

This patch refines the code gen for bswap16.

We will have VEC_PERM_EXPR after RTL expansion when invoking
__builtin_bswap. It will generate about 9 instructions in
the loop as below, no matter whether it is bswap16, bswap32 or bswap64.

  .L2:
1 vle16.v v4,0(a0)
2 vmv.v.x v2,a7
3 vand.vv v2,v6,v2
4 sllia2,a5,1
5 vrgatherei16.vv v1,v4,v2
6 sub a4,a4,a5
7 vse16.v v1,0(a3)
8 add a0,a0,a2
9 add a3,a3,a2
  bne a4,zero,.L2

But for bswap16 we may have an even simpler code gen, which
has only 7 instructions in the loop as below.

  .L5
1 vle8.v  v2,0(a5)
2 addi a5,a5,32
3 vsrl.vi v4,v2,8
4 vsll.vi v2,v2,8
5 vor.vv  v4,v4,v2
6 vse8.v  v4,0(a4)
7 addi a4,a4,32
  bne a5,a6,.L5

Unfortunately, this way the insn count in the loop will grow up to
13 and 24 for bswap32 and bswap64. Thus, we will refine the code
gen for bswap16 only, and leave both bswap32 and bswap64
as is.

gcc/ChangeLog:

* config/riscv/riscv-v.cc (shuffle_bswap_pattern): New func impl
for shuffle bswap.
(expand_vec_perm_const_1): Add handling for shuffle bswap pattern.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Adjust checker.
* gcc.target/riscv/rvv/autovec/unop/bswap16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/bswap16-0.c: New test.

Signed-off-by: Pan Li <pan2...@intel.com>
---
gcc/config/riscv/riscv-v.cc   | 91 +++
.../riscv/rvv/autovec/unop/bswap16-0.c| 17 
.../riscv/rvv/autovec/unop/bswap16-run-0.c| 44 +
.../riscv/rvv/autovec/vls/bswap16-0.c | 34 +++
.../gcc.target/riscv/rvv/autovec/vls/perm-4.c |  4 +-
5 files changed, 188 insertions(+), 2 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/bswap16-0.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 23633a2a74d..c72e411f125 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -3030,6 +3030,95 @@ shuffle_decompress_patterns (struct expand_vec_perm_d *d)
   return true;
}
+static bool
+shuffle_bswap_pattern (struct expand_vec_perm_d *d)
+{
+  HOST_WIDE_INT diff;
+  unsigned i, size, step;
+
+  if (!d->one_vector_p || !d->perm[0].is_constant (&diff) || !diff)
+return false;
+
+  step = diff + 1;
+  size = step * GET_MODE_UNIT_BITSIZE (d->vmode);
+
+  switch (size)
+{
+case 16:
+  break;
+case 32:
+case 64:
+  /* We will have VEC_PERM_EXPR after rtl expand when invoking
+ __builtin_bswap. It will generate about 9 instructions in
+ the loop as below, no matter whether it is bswap16, bswap32 or bswap64.
+.L2:
+ 1 vle16.v v4,0(a0)
+ 2 vmv.v.x v2,a7
+ 3 vand.vv v2,v6,v2
+ 4 slli a2,a5,1
+ 5 vrgatherei16.vv v1,v4,v2
+ 6 sub a4,a4,a5
+ 7 vse16.v v1,0(a3)
+ 8 add a0,a0,a2
+ 9 add a3,a3,a2
+bne a4,zero,.L2
+
+ But for bswap16 we may have an even simpler code gen, which
+ has only 7 instructions in the loop as below.
+.L5
+ 1 vle8.v  v2,0(a5)
+ 2 addi a5,a5,32
+ 3 vsrl.vi v4,v2,8
+ 4 vsll.vi v2,v2,8
+ 5 vor.vv  v4,v4,v2
+ 6 vse8.v  v4,0(a4)
+ 7 addi a4,a4,32
+bne a5,a6,.L5
+
+ Unfortunately, the instruction count in the loop grows to 13 and 24
+ for bswap32 and bswap64. Thus, we will leverage vrgather (9 insn)
+ for both the bswap64 and bswap32, but take shift and or (7 insn)
+ for bswap16.
+   */
+default:
+  return false;
+}
+
+  for (i = 0; i < step; i++)
+if (!d->perm.series_p (i, step, diff - i, step))
+  return false;
+
+  if (d->testing_p)
+return true;
+
+  machine_mode vhi_mode;
+  poly_uint64 vhi_nunits = exact_div (GET_MODE_NUNITS (d->vmode), 2);
+
+  if (!get_vector_mode (HImode, vhi_nunits).exists (&vhi_mode))
+return false;
+
+  /* Step-1: Move op0 to src with VHI mode.  */
+  rtx src = gen_reg_rtx (vhi_mode);
+  emit_move_insn (src, gen_lowpart (vhi_mode, d->op0));
+
+  /* Step-2: Shift right 8 bits to dest.  */
+  rtx dest = expand_binop (vhi_mod

Re: [PATCH 6/6] aarch64: Add front-end argument type checking for target builtins

2023-10-09 Thread Victor Do Nascimento




On 10/7/23 12:53, Richard Sandiford wrote:

Richard Earnshaw  writes:

On 03/10/2023 16:18, Victor Do Nascimento wrote:

In implementing the ACLE read/write system register builtins it was
observed that leaving argument type checking to be done at expand-time
meant that poorly-formed function calls were being "fixed" by certain
optimization passes, meaning bad code wasn't being properly picked up
in checking.

Example:

const char *regname = "amcgcr_el0";
long long a = __builtin_aarch64_rsr64 (regname);

is reduced by the ccp1 pass to

long long a = __builtin_aarch64_rsr64 ("amcgcr_el0");

As these functions require an argument of STRING_CST type, there needs
to be a check carried out by the front-end capable of picking this up.

The introduced `check_general_builtin_call' function will be called by
the TARGET_CHECK_BUILTIN_CALL hook whenever a call to a builtin
belonging to the AARCH64_BUILTIN_GENERAL category is encountered,
carrying out any appropriate checks associated with a particular
builtin function code.


Doesn't this prevent reasonable wrapping of the __builtin... names with
something more palatable?  Eg:

static inline __attribute__((always_inline)) long long get_sysreg_ll
(const char *regname)
{
return __builtin_aarch64_rsr64 (regname);
}

...
long long x = get_sysreg_ll("amcgcr_el0");
...


I think it's case of picking your poison.  If we didn't do this,
and only checked later, then it's unlikely that GCC and Clang would
be consistent about when a constant gets folded soon enough.

But yeah, it means that the above would need to be a macro in C.
Enlightened souls using C++ could instead do:

   template<const char *regname>
   long long get_sysreg_ll()
   {
 return __builtin_aarch64_rsr64(regname);
   }

   ... get_sysreg_ll<"amcgcr_el0">() ...

Or at least I hope so.  Might be nice to have a test for this.

Thanks,
Richard
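
As an aside, the C macro route mentioned above is straightforward (an
editor's sketch; GET_SYSREG_LL is a made-up name, only the builtin
comes from this thread):

  /* A macro keeps the string literal visible to the builtin at parse
     time, so no constant folding is needed for the check to pass.  */
  #define GET_SYSREG_LL(regname) __builtin_aarch64_rsr64 (regname)

  long long x = GET_SYSREG_LL ("amcgcr_el0");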


As Richard Earnshaw mentioned, this does break the use of `static inline 
__attribute__((always_inline))', something I had found out in my 
testing.  My chosen implementation was indeed, to quote Richard 
Sandiford, a case of "picking your poison" to have things line up with 
Clang and behave consistently across optimization levels.


Relaxing the use of `TARGET_CHECK_BUILTIN_CALL' meant optimizations 
were letting too many things through. Example:


const char *regname = "amcgcr_el0";
long long a = __builtin_aarch64_rsr64 (regname);

gets folded to

long long a = __builtin_aarch64_rsr64 ("amcgcr_el0");

and compilation passes at -O1 even though it fails at -O0.

I had, however, not given any thought to the use of a template as a 
valid C++ alternative.


I will evaluate the use of templates and add tests accordingly.

Cheers,
Victor


RE: [PATCH] RISC-V Regression test: Fix FAIL of pr45752.c for RVV

2023-10-09 Thread Li, Pan2
Committed, thanks Richard.

Pan

-Original Message-
From: Richard Biener  
Sent: Monday, October 9, 2023 9:07 PM
To: Juzhe-Zhong 
Cc: gcc-patches@gcc.gnu.org; jeffreya...@gmail.com
Subject: Re: [PATCH] RISC-V Regression test: Fix FAIL of pr45752.c for RVV

On Mon, 9 Oct 2023, Juzhe-Zhong wrote:

> RVV uses load_lanes with stride = 5 to vectorize this case with 
> -fno-vect-cost-model
> instead of SLP.

OK

> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/pr45752.c: Adapt dump check for targets that support 
> load_lanes with stride = 5.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/pr45752.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/pr45752.c 
> b/gcc/testsuite/gcc.dg/vect/pr45752.c
> index e8b364f29eb..3c87d9b04fc 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr45752.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr45752.c
> @@ -159,4 +159,4 @@ int main (int argc, const char* argv[])
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>  /* { dg-final { scan-tree-dump-times "gaps requires scalar epilogue loop" 0 
> "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
> {target { ! { vect_load_lanes && vect_strided5 } } } } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen

2023-10-09 Thread juzhe.zh...@rivai.ai
LGTM now.

Thanks.



juzhe.zh...@rivai.ai
 
From: pan2.li
Date: 2023-10-09 21:09
To: gcc-patches
CC: juzhe.zhong; pan2.li; yanzhang.wang; kito.cheng
Subject: [PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen
From: Pan Li 
 
Update in v2
 
* Remove emit helper functions.
* Take expand_binop instead.
 
Original log:
 
This patch refines the code gen for bswap16.

We will have VEC_PERM_EXPR after RTL expansion when invoking
__builtin_bswap. It will generate about 9 instructions in
the loop as below, no matter whether it is bswap16, bswap32 or bswap64.
 
  .L2:
1 vle16.v v4,0(a0)
2 vmv.v.x v2,a7
3 vand.vv v2,v6,v2
4 slli a2,a5,1
5 vrgatherei16.vv v1,v4,v2
6 sub a4,a4,a5
7 vse16.v v1,0(a3)
8 add a0,a0,a2
9 add a3,a3,a2
  bne a4,zero,.L2
 
But for bswap16 we may have an even simpler code gen, which
has only 7 instructions in the loop as below.
 
  .L5
1 vle8.v  v2,0(a5)
2 addi a5,a5,32
3 vsrl.vi v4,v2,8
4 vsll.vi v2,v2,8
5 vor.vv  v4,v4,v2
6 vse8.v  v4,0(a4)
7 addi a4,a4,32
  bne a5,a6,.L5
 
Unfortunately, this way the insn count in the loop will grow up to
13 and 24 for bswap32 and bswap64. Thus, we will refine the code
gen for bswap16 only, and leave both bswap32 and bswap64
as is.
 
gcc/ChangeLog:
 
* config/riscv/riscv-v.cc (shuffle_bswap_pattern): New func impl
for shuffle bswap.
(expand_vec_perm_const_1): Add handling for shuffle bswap pattern.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Adjust checker.
* gcc.target/riscv/rvv/autovec/unop/bswap16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/bswap16-0.c: New test.
 
Signed-off-by: Pan Li 
---
gcc/config/riscv/riscv-v.cc   | 91 +++
.../riscv/rvv/autovec/unop/bswap16-0.c| 17 
.../riscv/rvv/autovec/unop/bswap16-run-0.c| 44 +
.../riscv/rvv/autovec/vls/bswap16-0.c | 34 +++
.../gcc.target/riscv/rvv/autovec/vls/perm-4.c |  4 +-
5 files changed, 188 insertions(+), 2 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/bswap16-0.c
 
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 23633a2a74d..c72e411f125 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -3030,6 +3030,95 @@ shuffle_decompress_patterns (struct expand_vec_perm_d *d)
   return true;
}
+static bool
+shuffle_bswap_pattern (struct expand_vec_perm_d *d)
+{
+  HOST_WIDE_INT diff;
+  unsigned i, size, step;
+
+  if (!d->one_vector_p || !d->perm[0].is_constant (&diff) || !diff)
+return false;
+
+  step = diff + 1;
+  size = step * GET_MODE_UNIT_BITSIZE (d->vmode);
+
+  switch (size)
+{
+case 16:
+  break;
+case 32:
+case 64:
+  /* We will have VEC_PERM_EXPR after rtl expand when invoking
+ __builtin_bswap. It will generate about 9 instructions in
+ the loop as below, no matter whether it is bswap16, bswap32 or bswap64.
+.L2:
+ 1 vle16.v v4,0(a0)
+ 2 vmv.v.x v2,a7
+ 3 vand.vv v2,v6,v2
+ 4 slli a2,a5,1
+ 5 vrgatherei16.vv v1,v4,v2
+ 6 sub a4,a4,a5
+ 7 vse16.v v1,0(a3)
+ 8 add a0,a0,a2
+ 9 add a3,a3,a2
+bne a4,zero,.L2
+
+ But for bswap16 we may have an even simpler code gen, which
+ has only 7 instructions in the loop as below.
+.L5
+ 1 vle8.v  v2,0(a5)
+ 2 addi a5,a5,32
+ 3 vsrl.vi v4,v2,8
+ 4 vsll.vi v2,v2,8
+ 5 vor.vv  v4,v4,v2
+ 6 vse8.v  v4,0(a4)
+ 7 addi a4,a4,32
+bne a5,a6,.L5
+
+ Unfortunately, the instruction count in the loop grows to 13 and 24
+ for bswap32 and bswap64. Thus, we will leverage vrgather (9 insn)
+ for both the bswap64 and bswap32, but take shift and or (7 insn)
+ for bswap16.
+   */
+default:
+  return false;
+}
+
+  for (i = 0; i < step; i++)
+if (!d->perm.series_p (i, step, diff - i, step))
+  return false;
+
+  if (d->testing_p)
+return true;
+
+  machine_mode vhi_mode;
+  poly_uint64 vhi_nunits = exact_div (GET_MODE_NUNITS (d->vmode), 2);
+
+  if (!get_vector_mode (HImode, vhi_nunits).exists (&vhi_mode))
+return false;
+
+  /* Step-1: Move op0 to src with VHI mode.  */
+  rtx src = gen_reg_rtx (vhi_mode);
+  emit_move_insn (src, gen_lowpart (vhi_mode, d->op0));
+
+  /* Step-2: Shift right 8 bits to dest.  */
+  rtx dest = expand_binop (vhi_mode, lshr_optab, src, gen_int_mode (8, Pmode),
+NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-3: Shift left 8 bits to src.  */
+  src = expand_binop (vhi_mode, ashl_optab, src, gen_int_mode (8, Pmode),
+   NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-4: Logic Or dest and src to dest.  */
+  dest = expand_binop (vhi_mode, ior_optab, dest, src,
+NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-5: Move src to target with VQI mode.  */
+  emit_move_insn (d->target, gen_lowpart (d->vmode, dest));
+
+  return true;
+}
+
/*

[PATCH v2] RISC-V: Refine bswap16 auto vectorization code gen

2023-10-09 Thread pan2 . li
From: Pan Li 

Update in v2

* Remove emit helper functions.
* Take expand_binop instead.

Original log:

This patch refines the code gen for bswap16.

We will have VEC_PERM_EXPR after RTL expansion when invoking
__builtin_bswap. It will generate about 9 instructions in
the loop as below, no matter whether it is bswap16, bswap32 or bswap64.

  .L2:
1 vle16.v v4,0(a0)
2 vmv.v.x v2,a7
3 vand.vv v2,v6,v2
4 slli a2,a5,1
5 vrgatherei16.vv v1,v4,v2
6 sub a4,a4,a5
7 vse16.v v1,0(a3)
8 add a0,a0,a2
9 add a3,a3,a2
  bne a4,zero,.L2

But for bswap16 we may have an even simpler code gen, which
has only 7 instructions in the loop as below.

  .L5
1 vle8.v  v2,0(a5)
2 addi a5,a5,32
3 vsrl.vi v4,v2,8
4 vsll.vi v2,v2,8
5 vor.vv  v4,v4,v2
6 vse8.v  v4,0(a4)
7 addi a4,a4,32
  bne a5,a6,.L5

Unfortunately, this way the insn count in the loop will grow up to
13 and 24 for bswap32 and bswap64. Thus, we will refine the code
gen for bswap16 only, and leave both bswap32 and bswap64
as is.
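
For reference, two small sketches (an editor's illustration, not part
of the patch): the selector shape shuffle_bswap_pattern accepts, and
the per-lane scalar equivalent of the shift/or sequence.

  /* bswap16 on a vector of 8 QI elements: diff = 1, step = 2,
     size = step * GET_MODE_UNIT_BITSIZE (QImode) = 2 * 8 = 16, and
     the permutation selector must be
       { 1, 0, 3, 2, 5, 4, 7, 6 }
     i.e. d->perm.series_p (0, 2, 1, 2) and series_p (1, 2, 0, 2)
     both hold.  */

  /* Per 16-bit lane, the vsrl/vsll/vor sequence computes:  */
  static inline unsigned short
  bswap16_lane (unsigned short x)
  {
    return (unsigned short) ((x >> 8) | (x << 8));
  }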

gcc/ChangeLog:

* config/riscv/riscv-v.cc (shuffle_bswap_pattern): New func impl
for shuffle bswap.
(expand_vec_perm_const_1): Add handling for shuffle bswap pattern.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Adjust checker.
* gcc.target/riscv/rvv/autovec/unop/bswap16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/bswap16-0.c: New test.

Signed-off-by: Pan Li 
---
 gcc/config/riscv/riscv-v.cc   | 91 +++
 .../riscv/rvv/autovec/unop/bswap16-0.c| 17 
 .../riscv/rvv/autovec/unop/bswap16-run-0.c| 44 +
 .../riscv/rvv/autovec/vls/bswap16-0.c | 34 +++
 .../gcc.target/riscv/rvv/autovec/vls/perm-4.c |  4 +-
 5 files changed, 188 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-0.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/bswap16-0.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 23633a2a74d..c72e411f125 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -3030,6 +3030,95 @@ shuffle_decompress_patterns (struct expand_vec_perm_d *d)
   return true;
 }
 
+static bool
+shuffle_bswap_pattern (struct expand_vec_perm_d *d)
+{
+  HOST_WIDE_INT diff;
+  unsigned i, size, step;
+
+  if (!d->one_vector_p || !d->perm[0].is_constant (&diff) || !diff)
+return false;
+
+  step = diff + 1;
+  size = step * GET_MODE_UNIT_BITSIZE (d->vmode);
+
+  switch (size)
+{
+case 16:
+  break;
+case 32:
+case 64:
+  /* We will have VEC_PERM_EXPR after rtl expand when invoking
+__builtin_bswap. It will generate about 9 instructions in
+the loop as below, no matter whether it is bswap16, bswap32 or bswap64.
+  .L2:
+1 vle16.v v4,0(a0)
+2 vmv.v.x v2,a7
+3 vand.vv v2,v6,v2
+4 slli a2,a5,1
+5 vrgatherei16.vv v1,v4,v2
+6 sub a4,a4,a5
+7 vse16.v v1,0(a3)
+8 add a0,a0,a2
+9 add a3,a3,a2
+  bne a4,zero,.L2
+
+But for bswap16 we may have an even simpler code gen, which
+has only 7 instructions in the loop as below.
+  .L5
+1 vle8.v  v2,0(a5)
+2 addi a5,a5,32
+3 vsrl.vi v4,v2,8
+4 vsll.vi v2,v2,8
+5 vor.vv  v4,v4,v2
+6 vse8.v  v4,0(a4)
+7 addi a4,a4,32
+  bne a5,a6,.L5
+
+Unfortunately, the instruction count in the loop grows to 13 and 24
+for bswap32 and bswap64. Thus, we will leverage vrgather (9 insn)
+for both the bswap64 and bswap32, but take shift and or (7 insn)
+for bswap16.
+   */
+default:
+  return false;
+}
+
+  for (i = 0; i < step; i++)
+if (!d->perm.series_p (i, step, diff - i, step))
+  return false;
+
+  if (d->testing_p)
+return true;
+
+  machine_mode vhi_mode;
+  poly_uint64 vhi_nunits = exact_div (GET_MODE_NUNITS (d->vmode), 2);
+
+  if (!get_vector_mode (HImode, vhi_nunits).exists (&vhi_mode))
+return false;
+
+  /* Step-1: Move op0 to src with VHI mode.  */
+  rtx src = gen_reg_rtx (vhi_mode);
+  emit_move_insn (src, gen_lowpart (vhi_mode, d->op0));
+
+  /* Step-2: Shift right 8 bits to dest.  */
+  rtx dest = expand_binop (vhi_mode, lshr_optab, src, gen_int_mode (8, Pmode),
+  NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-3: Shift left 8 bits to src.  */
+  src = expand_binop (vhi_mode, ashl_optab, src, gen_int_mode (8, Pmode),
+ NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-4: Logic Or dest and src to dest.  */
+  dest = expand_binop (vhi_mode, ior_optab, dest, src,
+  NULL_RTX, 0, OPTAB_DIRECT);
+
+  /* Step-5: Move src to target with VQI mode.  */
+  emit_move

Re: [PATCH] RISC-V Regression test: Fix FAIL of pr45752.c for RVV

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Juzhe-Zhong wrote:

> RVV uses load_lanes with stride = 5 to vectorize this case with 
> -fno-vect-cost-model
> instead of SLP.

OK

> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/pr45752.c: Adapt dump check for targets that support 
> load_lanes with stride = 5.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/pr45752.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/pr45752.c 
> b/gcc/testsuite/gcc.dg/vect/pr45752.c
> index e8b364f29eb..3c87d9b04fc 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr45752.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr45752.c
> @@ -159,4 +159,4 @@ int main (int argc, const char* argv[])
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>  /* { dg-final { scan-tree-dump-times "gaps requires scalar epilogue loop" 0 
> "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
> {target { ! { vect_load_lanes && vect_strided5 } } } } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Robin Dapp wrote:

> > Hmm, the function is called at transform time so this shouldn't help
> > avoiding the ICE.  I expected we refuse to vectorize _any_ reduction
> > when sign dependent rounding is in effect?  OTOH maybe sign-dependent
> > rounding is OK but only when we use an unconditional fold-left
> > (so a loop mask from fully masking is OK but not an original COND_ADD?).
> 
> So we currently only disable the use of partial vectors
> 
>   else if (reduction_type == FOLD_LEFT_REDUCTION
>  && reduc_fn == IFN_LAST

aarch64 probably chokes because reduc_fn is not IFN_LAST.

>  && FLOAT_TYPE_P (vectype_in)
>  && HONOR_SIGNED_ZEROS (vectype_in)

so with your change we'd support signed zeros correctly.

>  && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in))
>   {
> if (dump_enabled_p ())
>   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>"can't operate on partial vectors because"
>" signed zeros cannot be preserved.\n");
> LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> 
> which is inside a LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P block.
> 
> For the fully masked case we continue (and then fail the assertion
> on aarch64 at transform time).
> 
> I didn't get why that case is ok, though?  We still merge the initial
> definition with the identity/neutral op (i.e. possibly -0.0) based on
> the loop mask.  Is that different to partial masking?

I think the main point with my earlier change is that without
native support for a fold-left reduction (like on x86) we get

 ops = mask ? ops : neutral;
 acc += ops[0];
 acc += ops[1];
 ...

so we wouldn't use a COND_ADD but add neutral elements for masked
elements.  That's OK for signed zeros after your change (great)
but not OK for sign dependent rounding (because we can't decide on
the sign of the neutral zero then).
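
A small aside on why the sign of the neutral zero is undecidable under
sign-dependent rounding (an editor's sketch in plain C, compile with
-frounding-math):

  #include <fenv.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* An exact zero sum rounds to -0.0 only under FE_DOWNWARD, so
       whether +0.0 or -0.0 is "neutral" for a masked-out lane
       depends on the dynamic rounding mode.  */
    double acc = -0.0;
    fesetround (FE_TONEAREST);
    printf ("%g\n", acc + 0.0);   /* prints 0 */
    fesetround (FE_DOWNWARD);
    printf ("%g\n", acc + 0.0);   /* prints -0 */
    return 0;
  }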

For the case of using an internal function, thus direct target support,
it should be OK to have sign-dependent rounding if we can use
the masked-fold-left reduction op.  As we do

  /* On the first iteration the input is simply the scalar phi
 result, and for subsequent iterations it is the output of
 the preceding operation.  */
  if (reduc_fn != IFN_LAST || (mask && mask_reduc_fn != IFN_LAST))
{
  if (mask && len && mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS)
new_stmt = gimple_build_call_internal (mask_reduc_fn, 5, 
reduc_var,
   def0, mask, len, bias);
  else if (mask && mask_reduc_fn == IFN_MASK_FOLD_LEFT_PLUS)
new_stmt = gimple_build_call_internal (mask_reduc_fn, 3, 
reduc_var,
   def0, mask);
  else
new_stmt = gimple_build_call_internal (reduc_fn, 2, reduc_var,
   def0);

the last case should be able to assert that 
!HONOR_SIGN_DEPENDENT_ROUNDING (also the reduc_fn == IFN_LAST case).

The quoted condition above should change to drop the HONOR_SIGNED_ZEROS
condition and the reduc_fn == IFN_LAST should change, maybe to
internal_fn_mask_index (reduc_fn) == -1?

Richard.
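
Putting those two suggestions together, the revised guard might look
roughly like the following untested sketch:

  else if (reduction_type == FOLD_LEFT_REDUCTION
	   && internal_fn_mask_index (reduc_fn) == -1
	   && FLOAT_TYPE_P (vectype_in)
	   && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in))
    /* ...disable the use of partial vectors as before...  */

i.e. the HONOR_SIGNED_ZEROS test is dropped and the reduc_fn ==
IFN_LAST check is generalized to "no masked variant of the reduction".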


[PATCH] RISC-V Regression test: Fix FAIL of pr45752.c for RVV

2023-10-09 Thread Juzhe-Zhong
RVV uses load_lanes with stride = 5 to vectorize this case with -fno-vect-cost-model
instead of SLP.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr45752.c: Adapt dump check for targets that support 
load_lanes with stride = 5.

---
 gcc/testsuite/gcc.dg/vect/pr45752.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr45752.c 
b/gcc/testsuite/gcc.dg/vect/pr45752.c
index e8b364f29eb..3c87d9b04fc 100644
--- a/gcc/testsuite/gcc.dg/vect/pr45752.c
+++ b/gcc/testsuite/gcc.dg/vect/pr45752.c
@@ -159,4 +159,4 @@ int main (int argc, const char* argv[])
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "gaps requires scalar epilogue loop" 0 
"vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" } 
} */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" 
{target { ! { vect_load_lanes && vect_strided5 } } } } } */
-- 
2.36.3



Re: [PATCH] wide-int: Allow up to 16320 bits wide_int and change widest_int precision to 32640 bits [PR102989]

2023-10-09 Thread Richard Sandiford
Jakub Jelinek  writes:
> Hi!
>
> As mentioned in the _BitInt support thread, _BitInt(N) is currently limited
> by the wide_int/widest_int maximum precision limitation, which depending
> on the target is 191, 319, 575 or 703 bits (one less than WIDE_INT_MAX_PRECISION).
> That is a fairly low limit for _BitInt, especially on the targets with the 191
> bit limitation.
>
> The following patch bumps that limit to 16319 bits on all arches, which is
> the limit imposed by INTEGER_CST representation (unsigned char members
> holding number of HOST_WIDE_INT limbs).
>
> In order to achieve that, wide_int is changed from a trivially copyable type
> which contained just an inline array of WIDE_INT_MAX_ELTS (3, 5, 9 or
> 11 limbs depending on target) limbs into a non-trivially copy constructible,
> copy assignable and destructible type which for the usual small cases (up
> to WIDE_INT_MAX_INL_ELTS which is the former WIDE_INT_MAX_ELTS) still uses
> an inline array of limbs, but for larger precisions uses heap allocated
> limb array.  This makes wide_int unusable in GC structures, so for dwarf2out
> which was the only place which needed it there is a new rwide_int type
> (restricted wide_int) which supports only up to RWIDE_INT_MAX_ELTS limbs
> inline and is trivially copyable (dwarf2out should never deal with large
> _BitInt constants, those should have been lowered earlier).
>
> Similarly, widest_int has been changed from a trivially copyable type which
> contained also an inline array of WIDE_INT_MAX_ELTS limbs (but unlike
> wide_int didn't contain precision and assumed that to be
> WIDE_INT_MAX_PRECISION) into a non-trivially copy constructible, copy
> assignable and destructible type which has always WIDEST_INT_MAX_PRECISION
> precision (32640 bits currently, twice as much as INTEGER_CST limitation
> allows) and unlike wide_int decides depending on get_len () value whether
> it uses an inline array (again, up to WIDE_INT_MAX_INL_ELTS) or heap
> allocated one.  In wide-int.h this means we need to estimate an upper
> bound on how many limbs will wide-int.cc (usually, sometimes wide-int.h)
> need to write, heap allocate if needed based on that estimation and upon
> set_len which is done at the end if we guessed over WIDE_INT_MAX_INL_ELTS
> and allocated dynamically, while we actually need less than that
> copy/deallocate.  The unexact guesses are needed because the exact
> computation of the length in wide-int.cc is sometimes quite complex and
> especially canonicalize at the end can decrease it.  widest_int is again
> because of this not usable in GC structures, so cfgloop.h has been changed
> to use fixed_wide_int_storage <128> and punt if
> we'd have larger _BitInt based iterators, programs having more than 128-bit
> iterators will be hopefully rare and I think it is fine to treat loops with
> more than 2^127 iterations as effectively possibly infinite, omp-general.cc
> is changed to use fixed_wide_int_storage <1024>, as it better should support
> scores with the same precision on all arches.
>
> Code which used WIDE_INT_PRINT_BUFFER_SIZE sized buffers for printing
> wide_int/widest_int into buffer had to be changed to use XALLOCAVEC for
> larger lengths.
>
> On x86_64, the patch in --enable-checking=yes,rtl,extra configured
> bootstrapped cc1plus enlarges the .text section by 1.01% - from
> 0x25725a5 to 0x25e and similarly at least when compiling insn-recog.cc
> with the usual bootstrap option slows compilation down by 1.01%,
> user 4m22.046s and 4m22.384s on vanilla trunk vs.
> 4m25.947s and 4m25.581s on patched trunk.  I'm afraid some code size growth
> and compile time slowdown is unavoidable in this case, we use wide_int and
> widest_int everywhere, and while the rare cases are marked with UNLIKELY
> macros, it still means extra checks for it.

Yeah, it's unfortunate, but like you say, it's probably unavoidable.
Having effectively arbitrary-size integers breaks most of the simplifying
assumptions.
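
To picture the storage scheme described above, a rough sketch (an
editor's illustration, not the actual GCC layout; the limb type and
inline count are made up):

  struct wide_int_sketch
  {
    static const unsigned inl_elts = 9;  /* WIDE_INT_MAX_INL_ELTS-like */
    unsigned len;
    union
    {
      long inl[inl_elts];  /* common small case: limbs stay inline */
      long *heap;          /* rare large case: heap allocated */
    } u;
    ~wide_int_sketch () { if (len > inl_elts) delete[] u.heap; }
    /* ...plus copy constructor/assignment doing the deep copy, which
       is exactly why the type stops being trivially copyable and can
       no longer live in GC structures.  */
  };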

> The patch also regresses
> +FAIL: gm2/pim/fail/largeconst.mod,  -O  
> +FAIL: gm2/pim/fail/largeconst.mod,  -O -g  
> +FAIL: gm2/pim/fail/largeconst.mod,  -O3 -fomit-frame-pointer  
> +FAIL: gm2/pim/fail/largeconst.mod,  -O3 -fomit-frame-pointer 
> -finline-functions  
> +FAIL: gm2/pim/fail/largeconst.mod,  -Os  
> +FAIL: gm2/pim/fail/largeconst.mod,  -g  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O -g  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O3 -fomit-frame-pointer  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -O3 -fomit-frame-pointer 
> -finline-functions  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -Os  
> +FAIL: gm2/pim/fail/largeconst2.mod,  -g  
> tests, which previously were rejected with
> error: constant literal 
> ‘12345678912345678912345679123456789123456789123456789123456789123456791234567891234567891234567891234567891234567912345678912345678912345678912345678912345679123456789123456789’
>  exceeds internal ZTYPE range
> kind of errors, but now are accepted.  Seems the F

Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-09 Thread Robin Dapp
> Hmm, the function is called at transform time so this shouldn't help
> avoiding the ICE.  I expected we refuse to vectorize _any_ reduction
> when sign dependent rounding is in effect?  OTOH maybe sign-dependent
> rounding is OK but only when we use an unconditional fold-left
> (so a loop mask from fully masking is OK but not an original COND_ADD?).

So we currently only disable the use of partial vectors

  else if (reduction_type == FOLD_LEFT_REDUCTION
   && reduc_fn == IFN_LAST
   && FLOAT_TYPE_P (vectype_in)
   && HONOR_SIGNED_ZEROS (vectype_in)
   && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in))
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "can't operate on partial vectors because"
 " signed zeros cannot be preserved.\n");
  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;

which is inside a LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P block.

For the fully masked case we continue (and then fail the assertion
on aarch64 at transform time).

I didn't get why that case is ok, though?  We still merge the initial
definition with the identity/neutral op (i.e. possibly -0.0) based on
the loop mask.  Is that different to partial masking?

Regards
 Robin



Re: [PATCH 1/6] aarch64: Sync system register information with Binutils

2023-10-09 Thread Victor Do Nascimento




On 10/9/23 01:02, Ramana Radhakrishnan wrote:




On 5 Oct 2023, at 14:04, Victor Do Nascimento  
wrote:



On 10/5/23 12:42, Richard Earnshaw wrote:



On 03/10/2023 16:18, Victor Do Nascimento wrote:

This patch adds the `aarch64-sys-regs.def' file to GCC, teaching
the compiler about system registers known to the assembler and how
these can be used.

The macros used to hold system register information reflect those in
use by binutils, a design choice made to facilitate the sharing of data
between different parts of the toolchain.

By aligning the representation of data common to different parts of
the toolchain we can greatly reduce the duplication of work,
facilitating the maintenance of the aarch64 back-end across different
parts of the toolchain; any `SYSREG (...)' that is added in one
project can just as easily be added to its counterpart.

GCC does not implement the full range of ISA flags present in
Binutils.  Where this is the case, aliases must be added to aarch64.h
with the unknown architectural extension being mapped to its
associated base architecture, such that any flag present in Binutils
and used in system register definitions is understood in GCC.  Again,
this is done such that flags can be used interchangeably between
projects making use of the aarch64-system-regs.def file.  This is done
in the next patch in the series.

`.arch' directives missing from the emitted assembly files as a
consequence of this aliasing are accounted for by the compiler using
the generic S<op0>_<op1>_<Cn>_<Cm>_<op2> encoding of system registers when
issuing mrs/msr instructions.  This design choice ensures the
assembler will accept anything that was deemed acceptable by the
compiler.

gcc/ChangeLog:

* gcc/config/aarch64/aarch64-system-regs.def: New.
---
  gcc/config/aarch64/aarch64-sys-regs.def | 1059 +++
  1 file changed, 1059 insertions(+)
  create mode 100644 gcc/config/aarch64/aarch64-sys-regs.def


This file is supposed to be /identical/ to the one in GNU Binutils,
right?


You're right Richard.

We want the same file to be compatible with both parts of the toolchain
and, consequently, there is no compelling reason as to why the copy of
the file found in GCC should in any way diverge from its Binutils
counterpart.


If so, I think it needs to continue to say that it is part of
GNU Binutils, not part of GCC.  Ramana, has this happened before?  If
not, does the SC have a position here?



I’ve not had the time to delve into the patch, apologies.


Is the intention here to keep a copy of the file with the main copy being in 
binutils i.e. modifications are made in binutils and then sync’d with GCC at 
the same time ?


In which case the comments in the file should make the mechanics of updates 
abundantly clear.


That is indeed correct.
I will make this clear in the comments for the file.  Thanks for picking 
up on this.



Is there any reason why if the 2 versions were different, you’d have problems 
between gcc and binutils ?

If so, what kinds of problems would they be ? i.e. would they be no more than 
gas not knowing about a system register that GCC claimed to know because 
binutils and gcc were built with different versions of the system register file.


There would be no problem, should the two versions be different for 
whatever reason.  Even the issue you mention of gas not knowing about a 
system register that GCC claimed to know is circumvented.


GCC is configured to always emit generic register names in the resulting 
asm, decoupling the system register validation mechanisms of the two 
parts of the toolchain.  If gcc deems the requirements of a particular 
system register to be satisfied, it won't trigger the assembler's 
validation mechanism when the assembly stage is reached.  Consequently, 
a stale copy of `aarch64-sys-reg.def' in binutils will bear no impact on 
gcc's execution.
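
As a concrete illustration (an editor's sketch; the exact encoding is
taken to be the one the Arm ARM gives for AMCGCR_EL0 and should be
double-checked):

  long long x = __builtin_aarch64_rsr64 ("amcgcr_el0");

is emitted with the generic spelling, roughly

  mrs x0, s3_3_c13_c2_2   // amcgcr_el0

so gas accepts it without consulting its own copy of the .def file.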


Conversely, a stale `aarch64-sys-reg.def' on gcc's end will result in 
some register names not being recognized by gcc but, as in the above 
scenario, no ill-behavior will be triggered as a consequence of 
mismatches in `aarch64-sys-reg.def' version between different parts of 
the toolchain.



Speaking for myself, I do not see this request being any different from the 
requests for imports from other repositories into the GCC repository.




R.


This does raise a very interesting question on the intellectual property
front and one that is well beyond my competence to opine about.

Nonetheless, this is a question which may arise again if we abstract
away more target description data into such .def files, as has been
discussed for architectural feature flags (for example).

So what might be nice (but not necessarily tenable) is if we had
appropriate provisions in place for where files were shared across
different parts of the toolchain.

Something like "This file is a shared resource of GCC and Binutils."




This model of an additional shared repository with a 

Re: Re: [PATCH] RISC-V Regression test: Fix FAIL of fast-math-slp-38.c for RVV

2023-10-09 Thread juzhe.zh...@rivai.ai
>> OK.  
Thanks.  Committed.

>> Note load/store-lanes is specifically pre-empting SLP if all
>> loads/stores of a SLP intance can support that.  Not sure if this
>> heuristic is good for load/store lanes with high stride?

Yeah, I understand your concern. 
Hmm, I am not sure either.
But the RVV ISA defines lane load/store from 2 to 8 and LLVM already supports them.
I think we can fully support them, then let the RISC-V cost model decide whether 
it is profitable or not.

Also, I found RVV can vectorize a TSVC case with stride = 5 
lane_load/lane_store:

tsvc-s353.c:

-/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { xfail *-*-* } } } 
*/
+/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { xfail { ! riscv_v 
} } } } */

https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632213.html

So, I think overall it is beneficial to support high-stride lane load/store, 
which can help us vectorize more cases.



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-10-09 20:41
To: Juzhe-Zhong
CC: gcc-patches; jeffreyalaw
Subject: Re: [PATCH] RISC-V Regression test: Fix FAIL of fast-math-slp-38.c for 
RVV
On Mon, 9 Oct 2023, Juzhe-Zhong wrote:
 
> Reference: https://godbolt.org/z/G9jzf5Grh
> 
> RVV is able to vectorize this case using SLP. However, with 
> -fno-vect-cost-model, RVV vectorizes it by vec_load_lanes with stride 6.
 
OK.  Note load/store-lanes is specifically pre-empting SLP if all
loads/stores of an SLP instance can support that.  Not sure if this
heuristic is good for load/store lanes with high stride?
 
> gcc/testsuite/ChangeLog:
> 
> * gcc.dg/vect/fast-math-slp-38.c: Add ! vect_strided6.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c 
> b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> index 7c7acd5bab6..96751faae7f 100644
> --- a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> +++ b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> @@ -18,4 +18,4 @@ foo (void)
>  }
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { target { ! vect_strided6 } } } } */
> 
 
-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 


Re: [PATCH] RISC-V Regression test: Fix FAIL of fast-math-slp-38.c for RVV

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Juzhe-Zhong wrote:

> Reference: https://godbolt.org/z/G9jzf5Grh
> 
> RVV is able to vectorize this case using SLP. However, with 
> -fno-vect-cost-model, RVV vectorizes it by vec_load_lanes with stride 6.

OK.  Note load/store-lanes is specifically pre-empting SLP if all
loads/stores of an SLP instance can support that.  Not sure if this
heuristic is good for load/store lanes with high stride?

> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/fast-math-slp-38.c: Add ! vect_strided6.
> 
> ---
>  gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c 
> b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> index 7c7acd5bab6..96751faae7f 100644
> --- a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> +++ b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
> @@ -18,4 +18,4 @@ foo (void)
>  }
>  
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" 
> { target { ! vect_strided6 } } } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[PATCH] RISC-V Regression test: Fix FAIL of fast-math-slp-38.c for RVV

2023-10-09 Thread Juzhe-Zhong
Reference: https://godbolt.org/z/G9jzf5Grh

RVV is able to vectorize this case using SLP. However, with 
-fno-vect-cost-model,
RVV vectorizes it by vec_load_lanes with stride 6.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/fast-math-slp-38.c: Add ! vect_strided6.

---
 gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c 
b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
index 7c7acd5bab6..96751faae7f 100644
--- a/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
+++ b/gcc/testsuite/gcc.dg/vect/fast-math-slp-38.c
@@ -18,4 +18,4 @@ foo (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } 
} */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { ! vect_strided6 } } } } */
-- 
2.36.3



[PATCH V2] RISC-V: Support movmisalign of RVV VLA modes

2023-10-09 Thread Juzhe-Zhong
This patch fixes the following FAILs in the regression tests:
FAIL: gcc.dg/vect/slp-perm-11.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "vectorizing stmts using SLP" 1
FAIL: gcc.dg/vect/slp-perm-11.c scan-tree-dump-times vect "vectorizing stmts 
using SLP" 1
FAIL: gcc.dg/vect/vect-bitfield-read-2.c -flto -ffat-lto-objects  
scan-tree-dump-not optimized "Invalid sum"
FAIL: gcc.dg/vect/vect-bitfield-read-2.c scan-tree-dump-not optimized "Invalid 
sum"
FAIL: gcc.dg/vect/vect-bitfield-read-4.c -flto -ffat-lto-objects  
scan-tree-dump-not optimized "Invalid sum"
FAIL: gcc.dg/vect/vect-bitfield-read-4.c scan-tree-dump-not optimized "Invalid 
sum"
FAIL: gcc.dg/vect/vect-bitfield-write-2.c -flto -ffat-lto-objects  
scan-tree-dump-not optimized "Invalid sum"
FAIL: gcc.dg/vect/vect-bitfield-write-2.c scan-tree-dump-not optimized "Invalid 
sum"
FAIL: gcc.dg/vect/vect-bitfield-write-3.c -flto -ffat-lto-objects  
scan-tree-dump-not optimized "Invalid sum"
FAIL: gcc.dg/vect/vect-bitfield-write-3.c scan-tree-dump-not optimized "Invalid 
sum"

Previously, I removed the movmisalign pattern to fix the execution FAILs in 
this commit:
https://github.com/gcc-mirror/gcc/commit/f7bff24905a6959f85f866390db2fff1d6f95520

I had thought that RVV doesn't allow misaligned accesses at all, so I removed 
that pattern.
However, after deeper investigation, reading the RVV ISA again and experimenting 
on SPIKE,
I realized I was wrong.

RVV ISA reference: 
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-memory-alignment-constraints

"If an element accessed by a vector memory instruction is not naturally aligned 
to the size of the element, 
 either the element is transferred successfully or an address misaligned 
exception is raised on that element."

It's obvious that RVV ISA does allow misaligned vector load/store.

And experiment and confirm on SPIKE:

[jzzhong@rios-cad122:/work/home/jzzhong/work/toolchain/riscv/gcc/gcc/testsuite/gcc.dg/vect]$~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/bin/spike
 --isa=rv64gcv --varch=vlen:128,elen:64 
~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/riscv64-unknown-elf/bin/pk64
  a.out
bbl loader
z   ra 00010158 sp 003ffb40 gp 00012c48
tp  t0 000110da t1 000f t2 
s0 00013460 s1  a0 00012ef5 a1 00012018
a2 00012a71 a3 000d a4 0004 a5 00012a71
a6 00012a71 a7 00012018 s2  s3 
s4  s5  s6  s7 
s8  s9  sA  sB 
t3  t4  t5  t6 
pc 00010258 va/inst 020660a7 sr 80026620
Store/AMO access fault!

[jzzhong@rios-cad122:/work/home/jzzhong/work/toolchain/riscv/gcc/gcc/testsuite/gcc.dg/vect]$~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/bin/spike
 --misaligned --isa=rv64gcv --varch=vlen:128,elen:64 
~/work/toolchain/riscv/build/dev-rv64gcv_zfh-lp64d-medany-newlib-spike-debug/install/riscv64-unknown-elf/bin/pk64
  a.out
bbl loader

We can see SPIKE passes the previously *FAILED* execution tests when 
--misaligned is specified.

So, to honor the RVV ISA spec, we should add the movmisalign pattern back, based 
on the investigations above, since
it improves multiple vectorization tests and fixes the dump FAILs.

This patch adds TARGET_VECTOR_MISALIGN_SUPPORTED to decide whether we support 
the misalign pattern for VLA modes (by default it is enabled).
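
The shape of the re-added expander body is roughly as follows (an
editor's sketch of the idea, not the patch itself):

  /* movmisalign<mode>: present only when the target opts in, and then
     simply an ordinary vector move, per the RVV alignment rule quoted
     above.  */
  if (!TARGET_VECTOR_MISALIGN_SUPPORTED)
    FAIL;                 /* fall back to the strict-align path */
  emit_move_insn (operands[0], operands[1]);
  DONE;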

Consider the following case:

struct s {
unsigned i : 31;
char a : 4;
};

#define N 32
#define ELT0 {0x7FFFUL, 0}
#define ELT1 {0x7FFFUL, 1}
#define ELT2 {0x7FFFUL, 2}
#define ELT3 {0x7FFFUL, 3}
#define RES 48
struct s A[N]
  = { ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
  ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
  ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
  ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3};

int __attribute__ ((noipa))
f(struct s *ptr, unsigned n) {
int res = 0;
for (int i = 0; i < n; ++i)
  res += ptr[i].a;
return res;
}

-O3 -S -fno-vect-cost-model (default strict-align):

f:
mv  a4,a0
beq a1,zero,.L9
addiw   a5,a1,-1
li  a3,14
vsetivli zero,16,e64,m8,ta,ma
bleu a5,a3,.L3
andi a5,a0,127
bne a5,zero,.L3
srliw   a3,a1,4
slli a3,a3,7
li  a0,15
slli a0,a0,32
add a3,a3,a4
mv  a5,a4
li  a2,32
vmv.v.x v16,a0
vsetvli zero,zero,e32,m4,ta,ma
vmv.v.i v4,0
.L4:
vsetvli zero,zero,e64,m8,ta,ma
vle64.v v8,0(a5)
addi a5,a5,128
 

Re: [PATCH] ifcvt/vect: Emit COND_ADD for conditional scalar reduction.

2023-10-09 Thread Robin Dapp
> It'd be good to expand on this comment a bit.  What kind of COND are you
> anticipating?  A COND with the neutral op as the else value, so that the
> PLUS_EXPR (or whatever) can remain unconditional?  If so, it would be
> good to sketch briefly how that happens, and why it's better than using
> the conditional PLUS_EXPR.
> 
> If that's the reason, perhaps we want a single-use check as well.
> It's possible that OP1 is used elsewhere in the loop body, in a
> context that would prefer a different else value.

Would something like the following on top work?

-  /* If possible try to create an IFN_COND_ADD instead of a COND_EXPR and
- a PLUS_EXPR.  Don't do this if the reduction def operand itself is
+  /* If possible create a COND_OP instead of a COND_EXPR and an OP_EXPR.
+ The COND_OP will have a neutral_op else value.
+
+ This allows re-using the mask directly in a masked reduction instead
+ of creating a vector merge (or similar) and then an unmasked reduction.
+
+ Don't do this if the reduction def operand itself is
  a vectorizable call as we can create a COND version of it directly.  */

   if (ifn != IFN_LAST
   && vectorized_internal_fn_supported_p (ifn, TREE_TYPE (lhs))
-  && try_cond_op && !swap)
+  && use_cond_op && !swap && has_single_use (op1))

Regards
 Robin



Re: PR111648: Fix wrong code-gen due to incorrect VEC_PERM_EXPR folding

2023-10-09 Thread Richard Sandiford
Prathamesh Kulkarni  writes:
> Hi,
> The attached patch attempts to fix PR111648.
> As mentioned in PR, the issue is when a1 is a multiple of vector
> length, we end up creating following encoding in result: { base_elem,
> arg[0], arg[1], ... } (assuming S = 1),
> where arg is chosen input vector, which is incorrect, since the
> encoding originally in arg would be: { arg[0], arg[1], arg[2], ... }
>
> For the test-case mentioned in PR, vectorizer pass creates
> VEC_PERM_EXPR where:
> arg0: { -16, -9, -10, -11 }
> arg1: { -12, -5, -6, -7 }
> sel = { 3, 4, 5, 6 }
>
> arg0, arg1 and sel are encoded with npatterns = 1 and nelts_per_pattern = 3.
> Since a1 = 4 and arg_len = 4, it ended up creating the result with
> following encoding:
> res = { arg0[3], arg1[0], arg1[1] } // npatterns = 1, nelts_per_pattern = 3
>   = { -11, -12, -5 }
>
> So for res[3], it used S = (-5) - (-12) = 7
> And hence computed it as -5 + 7 = 2.
> instead of selecting arg1[2], ie, -6.
>
> The patch tweaks valid_mask_for_fold_vec_perm_cst_p to punt if a1 is a 
> multiple
> of the vector length, so that a1 ... ae select elements only from the 
> stepped part of the pattern
> of the input vector, and returns false for this case.
>
> Since the vectors are VLS, fold_vec_perm_cst then sets:
> res_npatterns = res_nelts
> res_nelts_per_pattern  = 1
> which seems to fix the issue by encoding all the elements.
>
> The patch resulted in Case 4 and Case 5 failing from test_nunits_min_2 because
> they used sel = { 0, 0, 1, ... } and {len, 0, 1, ... } respectively,
> which used a1 = 0, and thus selected arg1[0].
>
> I removed Case 4 because it was already covered in test_nunits_min_4,
> and moved Case 5 to test_nunits_min_4, with sel = { len, 1, 2, ... }
> and added a new Case 9 to test for this issue.
>
> Passes bootstrap+test on aarch64-linux-gnu with and without SVE,
> and on x86_64-linux-gnu.
> Does the patch look OK ?
>
> Thanks,
> Prathamesh
>
> [PR111648] Fix wrong code-gen due to incorrect VEC_PERM_EXPR folding.
>
> gcc/ChangeLog:
>   PR tree-optimization/111648
>   * fold-const.cc (valid_mask_for_fold_vec_perm_cst_p): Punt if a1
>   is a multiple of vector length.
>   (test_nunits_min_2): Remove Case 4 and move Case 5 to ...
>   (test_nunits_min_4): ... here and rename case numbers. Also add
>   Case 9.
>
> gcc/testsuite/ChangeLog:
>   PR tree-optimization/111648
>   * gcc.dg/vect/pr111648.c: New test.
>
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 4f8561509ff..c5f421d6b76 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10682,8 +10682,8 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree 
> arg1,
> return false;
>   }
>  
> -  /* Ensure that the stepped sequence always selects from the same
> -  input pattern.  */
> +  /* Ensure that the stepped sequence always selects from the stepped
> +  part of same input pattern.  */
>unsigned arg_npatterns
>   = ((q1 & 1) == 0) ? VECTOR_CST_NPATTERNS (arg0)
> : VECTOR_CST_NPATTERNS (arg1);
> @@ -10694,6 +10694,20 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree 
> arg1,
>   *reason = "step is not multiple of npatterns";
> return false;
>   }
> +
> +  /* If a1 is a multiple of len, it will select base element of input
> +  vector resulting in following encoding:
> +  { base_elem, arg[0], arg[1], ... } where arg is the chosen input
> +  vector. This encoding is not originally present in arg, since it's
> +  defined as:
> +  { arg[0], arg[1], arg[2], ... }.  */
> +
> +  if (multiple_p (a1, arg_len))
> + {
> +   if (reason)
> + *reason = "selecting base element of input vector";
> +   return false;
> + }

That wouldn't catch (for example) cases where a1 == arg_len + 1 and the
second argument has 2 stepped patterns.

The equivalent condition that handles multiple patterns would
probably be to reject q1 < arg_npatterns.  But that's only necessary if:

(1) the argument has three elements per pattern (i.e. has a stepped
sequence) and

(2) element 2 - element 1 != element 1 - element 0

I think we should check those to avoid pessimising VLA cases.

Thanks,
Richard

>  }
>  
>return true;
> @@ -17425,47 +17439,6 @@ test_nunits_min_2 (machine_mode vmode)
>   tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
>   validate_res (2, 2, res, expected_res);
>}
> -
> -  /* Case 4: mask = {0, 0, 1, ...} // (1, 3)
> -  Test that the stepped sequence of the pattern selects from
> -  same input pattern. Since input vectors have npatterns = 2,
> -  and step (a2 - a1) = 1, step is not a multiple of npatterns
> -  in input vector. So return NULL_TREE.  */
> -  {
> - tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> - tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> - poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> -
> - vec_perm_builder buil

Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Richard Biener
On Mon, Oct 9, 2023 at 12:17 PM Richard Sandiford
 wrote:
>
> Tamar Christina  writes:
> >> -Original Message-
> >> From: Richard Sandiford 
> >> Sent: Monday, October 9, 2023 10:56 AM
> >> To: Tamar Christina 
> >> Cc: Richard Biener ; gcc-patches@gcc.gnu.org;
> >> nd ; Richard Earnshaw ;
> >> Marcus Shawcroft ; Kyrylo Tkachov
> >> 
> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >>
> >> Tamar Christina  writes:
> >> >> -Original Message-
> >> >> From: Richard Sandiford 
> >> >> Sent: Saturday, October 7, 2023 10:58 AM
> >> >> To: Richard Biener 
> >> >> Cc: Tamar Christina ;
> >> >> gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> >> ; Marcus Shawcroft
> >> >> ; Kyrylo Tkachov
> >> 
> >> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >> >>
> >> >> Richard Biener  writes:
> >> >> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
> >> >>  wrote:
> >> >> >>
> >> >> >> > -Original Message-
> >> >> >> > From: Richard Sandiford 
> >> >> >> > Sent: Thursday, October 5, 2023 9:26 PM
> >> >> >> > To: Tamar Christina 
> >> >> >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> >> >> >> > ; Marcus Shawcroft
> >> >> >> > ; Kyrylo Tkachov
> >> >> 
> >> >> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
> >> >> cond_copysign.
> >> >> >> >
> >> >> >> > Tamar Christina  writes:
> >> >> >> > >> -Original Message-
> >> >> >> > >> From: Richard Sandiford 
> >> >> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
> >> >> >> > >> To: Tamar Christina 
> >> >> >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard
> >> >> >> > >> Earnshaw ; Marcus Shawcroft
> >> >> >> > >> ; Kyrylo Tkachov
> >> >> >> > 
> >> >> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
> >> >> cond_copysign.
> >> >> >> > >>
> >> >> >> > >> Tamar Christina  writes:
> >> >> >> > >> > Hi All,
> >> >> >> > >> >
> >> >> >> > >> > This adds an implementation for masked copysign along with
> >> >> >> > >> > an optimized pattern for masked copysign (x, -1).
> >> >> >> > >>
> >> >> >> > >> It feels like we're ending up with a lot of AArch64-specific
> >> >> >> > >> code that just hard-codes the observation that changing the
> >> >> >> > >> sign is equivalent to changing the top bit.  We then need to
> >> >> >> > >> make sure that we choose the best way of changing the top bit
> >> >> >> > >> for any
> >> >> given situation.
> >> >> >> > >>
> >> >> >> > >> Hard-coding the -1/negative case is one instance of that.
> >> >> >> > >> But it looks like we also fail to use the best sequence for 
> >> >> >> > >> SVE2.  E.g.
> >> >> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
> >> >> >> > >>
> >> >> >> > >> #include 
> >> >> >> > >>
> >> >> >> > >> void f(double *restrict a, double *restrict b) {
> >> >> >> > >> for (int i = 0; i < 100; ++i)
> >> >> >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
> >> >> >> > >>
> >> >> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
> >> >> >> > >> for (int i = 0; i < 100; ++i)
> >> >> >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
> >> >> >> > >>
> >> >> >> > >> gives:
> >> >> >> > >>
> >> >> >> > >> f:
> >> >> >> > >> mov x2, 0
> >> >> >> > >> mov w3, 100
> >> >> >> > >> whilelo p7.d, wzr, w3
> >> >> >> > >> .L2:
> >> >> >> > >> ld1dz30.d, p7/z, [x0, x2, lsl 3]
> >> >> >> > >> ld1dz31.d, p7/z, [x1, x2, lsl 3]
> >> >> >> > >> and z30.d, z30.d, #0x7fff
> >> >> >> > >> and z31.d, z31.d, #0x8000
> >> >> >> > >> orr z31.d, z31.d, z30.d
> >> >> >> > >> st1dz31.d, p7, [x0, x2, lsl 3]
> >> >> >> > >> incdx2
> >> >> >> > >> whilelo p7.d, w2, w3
> >> >> >> > >> b.any   .L2
> >> >> >> > >> ret
> >> >> >> > >> g:
> >> >> >> > >> mov x3, 0
> >> >> >> > >> mov w4, 100
> >> >> >> > >> mov z29.d, x2
> >> >> >> > >> whilelo p7.d, wzr, w4
> >> >> >> > >> .L6:
> >> >> >> > >> ld1dz30.d, p7/z, [x0, x3, lsl 3]
> >> >> >> > >> ld1dz31.d, p7/z, [x1, x3, lsl 3]
> >> >> >> > >> bsl z31.d, z31.d, z30.d, z29.d
> >> >> >> > >> st1dz31.d, p7, [x0, x3, lsl 3]
> >> >> >> > >> incdx3
> >> >> >> > >> whilelo p7.d, w3, w4
> >> >> >> > >> b.any   .L6
> >> >> >> > >> ret
> >> >> >> > >>
> >> >> >> > >> I saw that you originally tried to do this in match.pd and
> >> >> >> > >> that the decision was to fold to copysign instead.  But
> >> >> >> > >> perhaps there's a compromise where isel does something with
> >> >> >> > >> the (new) copysign canonical
> >> >> >> > form?
> >> >> >> > >> I.e. could we go with your new version of the match.pd patch,
> >> >> >> > >> and add some isel stuff as a follow-on?
> >>
> >> [A]
> >>
> >> >> >> > >>
> >> >> >> > >
> >> >> >> > > Sure, if that's what's desired.  But..
> >> >> >> > >
> >> >>

Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV

2023-10-09 Thread Richard Biener
On Mon, 9 Oct 2023, Robin Dapp wrote:

> On 10/9/23 09:32, Andreas Schwab wrote:
> > On Okt 09 2023, juzhe.zh...@rivai.ai wrote:
> > 
> >> Turns out COND(_LEN)?_ADD can't work.
> > 
> > It should work though.  Tcl regexps are a superset of POSIX EREs.
> > 
> 
> The problem is that COND(_LEN)?_ADD matches twice against
> COND_LEN_ADD, so a scan-tree-dump-times count of 1 will fail.  For
> those checks in vect-cond-arith-6.c we either need to switch to
> scan-tree-dump or change the pattern to "\.(?:COND|COND_LEN)_ADD".
> 
> Juzhe, something like the attached works for me.

LGTM.

Richard.

> Regards
>  Robin
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c 
> b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
> index 1af0fe642a0..7d26dbedc5e 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
> @@ -52,8 +52,8 @@ main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
> vect_double_cond_arith } } } */
>  /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
> vect_double_cond_arith } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c 
> b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
> index ec3d9db4202..f7daa13685c 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
> @@ -54,8 +54,8 @@ main (void)
>return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> -/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
> { vect_double_cond_arith && vect_masked_store } } } } */
>  /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target { 
> vect_double_cond_arith && vect_masked_store } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c 
> b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
> index 2aeebd44f83..a80c30a50b2 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
> @@ -56,8 +56,8 @@ main (void)
>  }
>  
>  /* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 4 "vect" 
> { target vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump-times { = \.COND_ADD} 1 "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump-times { = \.COND_SUB} 1 "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump-times { = \.COND_MUL} 1 "optimized" { target 
> vect_double_cond_arith } } } */
> -/* { dg-final { scan-tree-dump-times { = \.COND_RDIV} 1 "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
> vect_double_cond_arith } } } */
> +/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
> vect_double_cond_arith } } } */
>  /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
> vect_double_cond_arith } } } */
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany

[PATCH] tree-optimization/111715 - improve TBAA for access paths with pun

2023-10-09 Thread Richard Biener
The following improves basic TBAA for access paths formed by
C++ abstraction, where we are able to combine a path from an
address-taking operation with a path based on that access, using
a pun to avoid memory access semantics on the address-taking part.

The trick is to identify the point where the semantic memory access
path starts, which allows us to use the alias set of the outermost
access instead of only that of the base of this path.

Bootstrapped and tested on x86_64-unknown-linux-gnu for all languages
with a slightly different variant; re-bootstrapping/testing now
(doing the extra walk just for AGGREGATE_TYPE_P).

PR tree-optimization/111715
* alias.cc (reference_alias_ptr_type_1): When we have
a type-punning ref at the base search for the access
path part that's still semantically valid.

* gcc.dg/tree-ssa/ssa-fre-102.c: New testcase.
---
 gcc/alias.cc                                | 20 +++++++++++++++++++-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c | 32 ++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c

diff --git a/gcc/alias.cc b/gcc/alias.cc
index 7c1af1fe96e..4060ff72949 100644
--- a/gcc/alias.cc
+++ b/gcc/alias.cc
@@ -774,7 +774,25 @@ reference_alias_ptr_type_1 (tree *t)
       && (TYPE_MAIN_VARIANT (TREE_TYPE (inner))
	   != TYPE_MAIN_VARIANT
		(TREE_TYPE (TREE_TYPE (TREE_OPERAND (inner, 1))))))
-    return TREE_TYPE (TREE_OPERAND (inner, 1));
+    {
+      tree alias_ptrtype = TREE_TYPE (TREE_OPERAND (inner, 1));
+      /* Unless we have the (aggregate) effective type of the access
+         somewhere on the access path.  If we have for example
+         (&a->elts[i])->l.len exposed by abstraction we'd see
+         MEM <B> [(B *)a].elts[i].l.len and we can use the alias set
+         of 'len' when typeof (MEM <B> [(B *)a].elts[i]) == B for
+         example.  See PR111715.  */
+      if (AGGREGATE_TYPE_P (TREE_TYPE (alias_ptrtype)))
+        {
+          tree inner = *t;
+          while (handled_component_p (inner)
+                 && (TYPE_MAIN_VARIANT (TREE_TYPE (inner))
+                     != TYPE_MAIN_VARIANT (TREE_TYPE (alias_ptrtype))))
+            inner = TREE_OPERAND (inner, 0);
+        }
+      if (TREE_CODE (inner) == MEM_REF)
+        return alias_ptrtype;
+    }
 
   /* Otherwise, pick up the outermost object that we could have
  a pointer to.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c
new file mode 100644
index 000..afd48050819
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-102.c
@@ -0,0 +1,32 @@
+/* PR/111715 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-fre1" } */
+
+struct B {
+   struct { int len; } l;
+   long n;
+};
+struct A {
+   struct B elts[8];
+};
+
+static void
+set_len (struct B *b, int len)
+{
+  b->l.len = len;
+}
+
+static int
+get_len (struct B *b)
+{
+  return b->l.len;
+}
+
+int foo (struct A *a, int i, long *q)
+{
+  set_len (&a->elts[i], 1);
+  *q = 2;
+  return get_len (&a->elts[i]);
+}
+
+/* { dg-final { scan-tree-dump "return 1;" "fre1" } } */
-- 
2.35.3


RE: [PATCH v1] RISC-V: Refine bswap16 auto vectorization code gen

2023-10-09 Thread Li, Pan2
Sure thing, will send V2 for this change.

Pan

From: juzhe.zh...@rivai.ai 
Sent: Monday, October 9, 2023 5:04 PM
To: Li, Pan2 ; gcc-patches 
Cc: Li, Pan2 ; Wang, Yanzhang ; 
kito.cheng 
Subject: Re: [PATCH v1] RISC-V: Refine bswap16 auto vectorization code gen

Remove these functions:


+static void
+emit_vec_sll_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx sll_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (ASHIFT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, sll_ops);
+}
+
+static void
+emit_vec_srl_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx srl_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (LSHIFTRT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, srl_ops);
+}
+
+static void
+emit_vec_or (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx or_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred (IOR, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, or_ops);
+}
+

Instead,

For sll, you should use:

  rtx tmp
    = expand_binop (Pmode, ashl_optab, op_1,
                    gen_int_mode (8, Pmode), NULL_RTX, 0,
                    OPTAB_DIRECT);

For srl, you should use:

  rtx tmp
    = expand_binop (Pmode, lshiftrt_optab, op_1,
                    gen_int_mode (8, Pmode), NULL_RTX, 0,
                    OPTAB_DIRECT);

For or, you should use:

  expand_binop (Pmode, ior_optab, tmp, dest, NULL_RTX, 0,
                OPTAB_DIRECT);
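
Putting the three suggestions together, the replacement would look
roughly like the sketch below.  This is only my illustration of the
suggested direction, not the actual V2 patch: the helper name and the
dest/src operands are made up, and the handling of the shift-count
mode is an assumption.

/* Sketch only: byte-swap each 16-bit element of SRC into DEST via
   shift left 8, logical shift right 8 and an ior, using the generic
   expand_binop interface instead of bespoke helpers.  */
static void
expand_bswap16_sketch (rtx dest, rtx src, machine_mode mode)
{
  rtx hi = expand_binop (mode, ashl_optab, src,
                         gen_int_mode (8, Pmode), NULL_RTX, 0,
                         OPTAB_DIRECT);
  rtx lo = expand_binop (mode, lshiftrt_optab, src,
                         gen_int_mode (8, Pmode), NULL_RTX, 0,
                         OPTAB_DIRECT);
  rtx res = expand_binop (mode, ior_optab, hi, lo, NULL_RTX, 0,
                          OPTAB_DIRECT);
  if (res != dest)
    emit_move_insn (dest, res);
}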


juzhe.zh...@rivai.ai

From: pan2.li
Date: 2023-10-09 16:51
To: gcc-patches
CC: juzhe.zhong; 
pan2.li; 
yanzhang.wang; 
kito.cheng
Subject: [PATCH v1] RISC-V: Refine bswap16 auto vectorization code gen
From: Pan Li <pan2...@intel.com>

This patch refines the code gen for bswap16.

Invoking __builtin_bswap results in a VEC_PERM_EXPR that is expanded
at RTL expansion time, generating about 9 instructions in the loop as
below, no matter whether it is bswap16, bswap32 or bswap64.

  .L2:
1 vle16.v v4,0(a0)
2 vmv.v.x v2,a7
3 vand.vv v2,v6,v2
4 slli a2,a5,1
5 vrgatherei16.vv v1,v4,v2
6 sub a4,a4,a5
7 vse16.v v1,0(a3)
8 add a0,a0,a2
9 add a3,a3,a2
  bne a4,zero,.L2

But for bswap16 we can generate even simpler code, with
only 7 instructions in the loop as below.

  .L5:
1 vle8.v  v2,0(a5)
2 addi    a5,a5,32
3 vsrl.vi v4,v2,8
4 vsll.vi v2,v2,8
5 vor.vv  v4,v4,v2
6 vse8.v  v4,0(a4)
7 addi    a4,a4,32
  bne a5,a6,.L5

Unfortunately, the same approach would grow the loop to 13 and 24
insns for bswap32 and bswap64 respectively. Thus, we refine the code
gen for bswap16 only, and leave bswap32 and bswap64 as they are.
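
To make this concrete, here is a scalar model of what the
7-instruction sequence computes per 16-bit element (my illustration,
not part of the patch; the names bswap16_model/bswap16_loop are made
up):

#include <stdint.h>

/* Scalar model of the refined bswap16 code gen: one logical shift
   right, one shift left and an or per 16-bit element -- no index
   vector or vrgatherei16 needed.  */
static inline uint16_t
bswap16_model (uint16_t x)
{
  return (uint16_t) ((x >> 8) | (x << 8));
}

void
bswap16_loop (uint16_t *dst, const uint16_t *src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = bswap16_model (src[i]); /* becomes vsrl.vi/vsll.vi/vor.vv */
}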

gcc/ChangeLog:

* config/riscv/riscv-v.cc (emit_vec_sll_scalar): New helper to
emit vsll.vi/vsll.vx.
(emit_vec_srl_scalar): Likewise for vsrl.vi/vsrl.vx.
(emit_vec_or): Likewise for vor.vv.
(shuffle_bswap_pattern): New function implementing the bswap
shuffle.
(expand_vec_perm_const_1): Add the bswap shuffle pattern.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Adjust checker.
* gcc.target/riscv/rvv/autovec/unop/bswap16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/bswap16-0.c: New test.

Signed-off-by: Pan Li <pan2...@intel.com>
---
gcc/config/riscv/riscv-v.cc   | 117 ++
.../riscv/rvv/autovec/unop/bswap16-0.c|  17 +++
.../riscv/rvv/autovec/unop/bswap16-run-0.c|  44 +++
.../riscv/rvv/autovec/vls/bswap16-0.c |  34 +
.../gcc.target/riscv/rvv/autovec/vls/perm-4.c |   4 +-
5 files changed, 214 insertions(+), 2 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/bswap16-run-0.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/bswap16-0.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 23633a2a74d..3e3b5f2e797 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -878,6 +878,33 @@ emit_vlmax_decompress_insn (rtx target, rtx op0, rtx op1, 
rtx mask)
   emit_vlmax_masked_gather_mu_insn (target, op1, sel, mask);
}
+static void
+emit_vec_sll_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx sll_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (ASHIFT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, sll_ops);
+}
+
+static void
+emit_vec_srl_scalar (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx srl_ops[] = {op_0, op_1, op_2};
+  insn_code icode = code_for_pred_scalar (LSHIFTRT, vec_mode);
+
+  emit_vlmax_insn (icode, BINARY_OP, srl_ops);
+}
+
+static void
+emit_vec_or (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
+{
+  rtx or_ops[]

Re: Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV

2023-10-09 Thread juzhe.zh...@rivai.ai
Thanks Robin.  Could you send V3 to Richi?  And commit it if Richi is
OK with that.



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-10-09 18:26
To: Andreas Schwab; juzhe.zhong
CC: rdapp.gcc; gcc-patches; rguenther; jeffreyalaw
Subject: Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV
On 10/9/23 09:32, Andreas Schwab wrote:
> On Okt 09 2023, juzhe.zh...@rivai.ai wrote:
> 
>> Turns out COND(_LEN)?_ADD can't work.
> 
> It should work though.  Tcl regexps are a superset of POSIX EREs.
> 
 
The problem is that COND(_LEN)?_ADD matches twice against
COND_LEN_ADD, so a scan-tree-dump-times count of 1 will fail.  For
those checks in vect-cond-arith-6.c we either need to switch to
scan-tree-dump or change the pattern to "\.(?:COND|COND_LEN)_ADD".
 
Juzhe, something like the attached works for me.
 
Regards
Robin
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
index 1af0fe642a0..7d26dbedc5e 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
@@ -52,8 +52,8 @@ main (void)
   return 0;
}
-/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
vect_double_cond_arith } } } */
/* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
vect_double_cond_arith } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
index ec3d9db4202..f7daa13685c 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
@@ -54,8 +54,8 @@ main (void)
   return 0;
}
-/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
/* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
index 2aeebd44f83..a80c30a50b2 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
@@ -56,8 +56,8 @@ main (void)
}
/* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 4 "vect" { 
target vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_ADD} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_SUB} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_MUL} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_RDIV} 1 "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
vect_double_cond_arith } } } */
/* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
vect_double_cond_arith } } } */
 


Re: [PATCH V2] TEST: Fix vect_cond_arith_* dump checks for RVV

2023-10-09 Thread Robin Dapp
On 10/9/23 09:32, Andreas Schwab wrote:
> On Okt 09 2023, juzhe.zh...@rivai.ai wrote:
> 
>> Turns out COND(_LEN)?_ADD can't work.
> 
> It should work though.  Tcl regexps are a superset of POSIX EREs.
> 

The problem is that COND(_LEN)?_ADD matches twice against
COND_LEN_ADD, so a scan-tree-dump-times count of 1 will fail.  For
those checks in vect-cond-arith-6.c we either need to switch to
scan-tree-dump or change the pattern to "\.(?:COND|COND_LEN)_ADD".

Juzhe, something like the attached works for me.

Regards
 Robin

diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
index 1af0fe642a0..7d26dbedc5e 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-4.c
@@ -52,8 +52,8 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
vect_double_cond_arith } } } */
 /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
vect_double_cond_arith } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
index ec3d9db4202..f7daa13685c 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-5.c
@@ -54,8 +54,8 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump { = \.COND_ADD} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_SUB} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_MUL} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
-/* { dg-final { scan-tree-dump { = \.COND_RDIV} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
 /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target { 
vect_double_cond_arith && vect_masked_store } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c 
b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
index 2aeebd44f83..a80c30a50b2 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-cond-arith-6.c
@@ -56,8 +56,8 @@ main (void)
 }
 
 /* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 4 "vect" { 
target vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_ADD} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_SUB} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_MUL} 1 "optimized" { target 
vect_double_cond_arith } } } */
-/* { dg-final { scan-tree-dump-times { = \.COND_RDIV} 1 "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?ADD} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?SUB} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?MUL} "optimized" { target 
vect_double_cond_arith } } } */
+/* { dg-final { scan-tree-dump { = \.COND_(LEN_)?RDIV} "optimized" { target 
vect_double_cond_arith } } } */
 /* { dg-final { scan-tree-dump-not {VEC_COND_EXPR} "optimized" { target 
vect_double_cond_arith } } } */


Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.

2023-10-09 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Monday, October 9, 2023 10:56 AM
>> To: Tamar Christina 
>> Cc: Richard Biener ; gcc-patches@gcc.gnu.org;
>> nd ; Richard Earnshaw ;
>> Marcus Shawcroft ; Kyrylo Tkachov
>> 
>> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> 
>> Tamar Christina  writes:
>> >> -Original Message-
>> >> From: Richard Sandiford 
>> >> Sent: Saturday, October 7, 2023 10:58 AM
>> >> To: Richard Biener 
>> >> Cc: Tamar Christina ;
>> >> gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> ; Marcus Shawcroft
>> >> ; Kyrylo Tkachov
>> 
>> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>> >>
>> >> Richard Biener  writes:
>> >> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
>> >>  wrote:
>> >> >>
>> >> >> > -Original Message-
>> >> >> > From: Richard Sandiford 
>> >> >> > Sent: Thursday, October 5, 2023 9:26 PM
>> >> >> > To: Tamar Christina 
>> >> >> > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> >> >> > ; Marcus Shawcroft
>> >> >> > ; Kyrylo Tkachov
>> >> 
>> >> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> >> cond_copysign.
>> >> >> >
>> >> >> > Tamar Christina  writes:
>> >> >> > >> -Original Message-
>> >> >> > >> From: Richard Sandiford 
>> >> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
>> >> >> > >> To: Tamar Christina 
>> >> >> > >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard
>> >> >> > >> Earnshaw ; Marcus Shawcroft
>> >> >> > >> ; Kyrylo Tkachov
>> >> >> > 
>> >> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
>> >> cond_copysign.
>> >> >> > >>
>> >> >> > >> Tamar Christina  writes:
>> >> >> > >> > Hi All,
>> >> >> > >> >
>> >> >> > >> > This adds an implementation for masked copysign along with
>> >> >> > >> > an optimized pattern for masked copysign (x, -1).
>> >> >> > >>
>> >> >> > >> It feels like we're ending up with a lot of AArch64-specific
>> >> >> > >> code that just hard- codes the observation that changing the
>> >> >> > >> sign is equivalent to changing the top bit.  We then need to
>> >> >> > >> make sure that we choose the best way of changing the top bit
>> >> >> > >> for any
>> >> given situation.
>> >> >> > >>
>> >> >> > >> Hard-coding the -1/negative case is one instance of that.
>> >> >> > >> But it looks like we also fail to use the best sequence for SVE2. 
>> >> >> > >>  E.g.
>> >> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
>> >> >> > >>
>> >> >> > >> #include 
>> >> >> > >>
>> >> >> > >> void f(double *restrict a, double *restrict b) {
>> >> >> > >> for (int i = 0; i < 100; ++i)
>> >> >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
>> >> >> > >>
>> >> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
>> >> >> > >> for (int i = 0; i < 100; ++i)
>> >> >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
>> >> >> > >>
>> >> >> > >> gives:
>> >> >> > >>
>> >> >> > >> f:
>> >> >> > >> mov x2, 0
>> >> >> > >> mov w3, 100
>> >> >> > >> whilelo p7.d, wzr, w3
>> >> >> > >> .L2:
>> >> >> > >> ld1dz30.d, p7/z, [x0, x2, lsl 3]
>> >> >> > >> ld1dz31.d, p7/z, [x1, x2, lsl 3]
>> >> >> > >> and z30.d, z30.d, #0x7fff
>> >> >> > >> and z31.d, z31.d, #0x8000
>> >> >> > >> orr z31.d, z31.d, z30.d
>> >> >> > >> st1dz31.d, p7, [x0, x2, lsl 3]
>> >> >> > >> incdx2
>> >> >> > >> whilelo p7.d, w2, w3
>> >> >> > >> b.any   .L2
>> >> >> > >> ret
>> >> >> > >> g:
>> >> >> > >> mov x3, 0
>> >> >> > >> mov w4, 100
>> >> >> > >> mov z29.d, x2
>> >> >> > >> whilelo p7.d, wzr, w4
>> >> >> > >> .L6:
>> >> >> > >> ld1dz30.d, p7/z, [x0, x3, lsl 3]
>> >> >> > >> ld1dz31.d, p7/z, [x1, x3, lsl 3]
>> >> >> > >> bsl z31.d, z31.d, z30.d, z29.d
>> >> >> > >> st1dz31.d, p7, [x0, x3, lsl 3]
>> >> >> > >> incdx3
>> >> >> > >> whilelo p7.d, w3, w4
>> >> >> > >> b.any   .L6
>> >> >> > >> ret
>> >> >> > >>
>> >> >> > >> I saw that you originally tried to do this in match.pd and
>> >> >> > >> that the decision was to fold to copysign instead.  But
>> >> >> > >> perhaps there's a compromise where isel does something with
>> >> >> > >> the (new) copysign canonical
>> >> >> > form?
>> >> >> > >> I.e. could we go with your new version of the match.pd patch,
>> >> >> > >> and add some isel stuff as a follow-on?
>> 
>> [A]
>> 
>> >> >> > >>
>> >> >> > >
>> >> >> > > Sure, if that's what's desired.  But..
>> >> >> > >
>> >> >> > > The example you posted above is for instance worse for x86
>> >> >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has
>> >> >> > > a dependency chain of 2 and the latter of 3.  It's likely any
>> >> >> > > open coding of this
>> >> >> > operation is going to hurt a target.
>> >> >> 
