Re: [PATCH v2] RISC-V: Introduce -mrvv-allow-misalign.

2024-05-24 Thread Robin Dapp
> * -mstrict-align: Both scalar and vector misaligned accesses are
> unsupported (-mrvv-allow-misalign doesn't matter).  I'm not sure if
> there's hardware there, but given we have systems that don't support
> scalar misaligned accesses it seems reasonable to assume they'll also
> not support vector misaligned accesses.

As a data point, and contrary to what I said/hoped before:  There are
examples where -mstrict-align and -mrvv-allow-misalign vectorizes
code and produces unaligned vector accesses.  I haven't looked into
that area of the vectorizer for a while but it doesn't appear as
if we regard STRICT_ALIGNMENT there at all.
We keep track of the known misalignments (via peeling etc.) and either
handle them via movmisalign or give up.  Same for unknown misalignment
but all unaffected by -mstrict-align.

We could have -mrvv-allow-misalign have an "| STRICT_ALIGNMENT" to
get to the behavior you described but right now it's not like that.
And AFAICT -mstrict-align behaves the same way for other targets,
regardless if they support unaligned vector accesses or not.

So, right now, I'd tend towards describing that both flags are
independent and affect either only scalar or only vector code.
Maybe we should rename the whole thing to -mrvv-strict-align?
Might make it even more confusing, though. 

Regards
 Robin


[PATCH v2] RISC-V: Introduce -mrvv-allow-misalign.

2024-05-24 Thread Robin Dapp
> We should have something in doc/invoke too, this one is going to be
> tricky for users.  We'll also have to define how this interacts with
> the existing -mstrict-align.

Addressed the rest in the attached v2 which also fixes tests.
I'm really not sure about -mstrict-align.  I would have hoped that using
-mstrict-align we'd never run into any movmisalign situation but that
might be wishful thinking.  Do we need to specify an
interaction, though?  For now the new options disables movmisalign so
if we hit that despite -mstrict-align we'd still not vectorize it.

Regtested on rv64gcv_zvfh_zvbb.

Regards
 Robin

This patch changes the default from always enabling movmisalign to
not enabling it.  It adds an option to override the default and adds
generic-ooo to the uarchs that support misaligned vector access.

It also adds a check_effective_target_riscv_v_misalign_ok to the
testsuite which enables or disables the vector misalignment tests
depending on whether the target under test can execute a misaligned
vle32.

gcc/ChangeLog:

* config/riscv/riscv-opts.h (TARGET_VECTOR_MISALIGN_SUPPORTED):
Move from here...
* config/riscv/riscv.h (TARGET_VECTOR_MISALIGN_SUPPORTED):
...to here and make dependent on uarch and rvv_allow_misalign.
* config/riscv/riscv.opt: Add -mrvv-allow-unaligned.

gcc/testsuite/ChangeLog:

* lib/target-supports.exp: Add
check_effective_target_riscv_v_misalign_ok.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-7.c: Add
-mrvv-allow-misalign.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-10.c: Ditto.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-11.c: Ditto.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-12.c: Ditto.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-8.c: Ditto.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-9.c: Ditto.
* gcc.target/riscv/rvv/autovec/vls/misalign-1.c:
---
 gcc/config/riscv/riscv-opts.h |  3 --
 gcc/config/riscv/riscv.cc | 18 ++
 gcc/config/riscv/riscv.h  |  6 
 gcc/config/riscv/riscv.opt|  5 +++
 gcc/doc/invoke.texi   |  5 +++
 .../costmodel/riscv/rvv/dynamic-lmul2-7.c |  2 +-
 .../vect/costmodel/riscv/rvv/vla_vs_vls-10.c  |  2 +-
 .../vect/costmodel/riscv/rvv/vla_vs_vls-11.c  |  2 +-
 .../vect/costmodel/riscv/rvv/vla_vs_vls-12.c  |  2 +-
 .../vect/costmodel/riscv/rvv/vla_vs_vls-8.c   |  2 +-
 .../vect/costmodel/riscv/rvv/vla_vs_vls-9.c   |  2 +-
 .../riscv/rvv/autovec/vls/misalign-1.c|  2 +-
 gcc/testsuite/lib/target-supports.exp | 34 +--
 13 files changed, 73 insertions(+), 12 deletions(-)

diff --git a/gcc/config/riscv/riscv-opts.h b/gcc/config/riscv/riscv-opts.h
index 1b2dd5757a8..f58a07abffc 100644
--- a/gcc/config/riscv/riscv-opts.h
+++ b/gcc/config/riscv/riscv-opts.h
@@ -147,9 +147,6 @@ enum rvv_vector_bits_enum {
  ? 0   
\
  : 32 << (__builtin_popcount (opts->x_riscv_zvl_flags) - 1))
 
-/* TODO: Enable RVV movmisalign by default for now.  */
-#define TARGET_VECTOR_MISALIGN_SUPPORTED 1
-
 /* The maximmum LMUL according to user configuration.  */
 #define TARGET_MAX_LMUL
\
   (int) (rvv_max_lmul == RVV_DYNAMIC ? RVV_M8 : rvv_max_lmul)
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 85df5b7ab49..cfdeb56559f 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -287,6 +287,7 @@ struct riscv_tune_param
   unsigned short memory_cost;
   unsigned short fmv_cost;
   bool slow_unaligned_access;
+  bool rvv_unaligned_access;
   bool use_divmod_expansion;
   bool overlap_op_by_pieces;
   unsigned int fusible_ops;
@@ -299,6 +300,10 @@ struct riscv_tune_param
 /* Whether unaligned accesses execute very slowly.  */
 bool riscv_slow_unaligned_access_p;
 
+/* Whether misaligned vector accesses are supported (i.e. do not
+   throw an exception).  */
+bool riscv_rvv_unaligned_access_p;
+
 /* Whether user explicitly passed -mstrict-align.  */
 bool riscv_user_wants_strict_align;
 
@@ -441,6 +446,7 @@ static const struct riscv_tune_param rocket_tune_info = {
   5,   /* memory_cost */
   8,   /* fmv_cost */
   true,/* 
slow_unaligned_access */
+  false,   /* rvv_unaligned_access */
   false,   /* use_divmod_expansion */
   false,   /* overlap_op_by_pieces */
   RISCV_FUSE_NOTHING,   /* fusible_ops */
@@ -459,6 +465,7 @@ static const struct riscv_tune_param sifive_7_tune_info = {
   3,   /* memory_cost */
   8,   

[PATCH] RISC-V: Introduce -mrvv-allow-misalign.

2024-05-24 Thread Robin Dapp
Hi,

this patch changes the default from always enabling movmisalign to
disabling it.  It adds an option to override the default and adds
generic-ooo to the uarchs that support misaligned vector access.

It also adds a check_effective_target_riscv_v_misalign_ok to the
testsuite which enables or disables the vector misalignment tests
depending on whether the target under test can execute a misaligned
vle32.  I haven't actually tested it on a target that does not
support misaligned vector loads, though.

Regtested on rv64gcv_zvfh_zvbb.  There are a few additional
failures in the rvv testsuite.  They are caused by us overwriting
the default vectorizer flags rather than appending.  I'm going to
fix them in a subsequent patch but for now I'd rather get things
rolling.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/riscv-opts.h (TARGET_VECTOR_MISALIGN_SUPPORTED):
Move from here...
* config/riscv/riscv.h (TARGET_VECTOR_MISALIGN_SUPPORTED):
...to here and make dependent on uarch and rvv_allow_misalign.
* config/riscv/riscv.opt: Add -mrvv-allow-unaligned.

gcc/testsuite/ChangeLog:

* lib/target-supports.exp: Add
check_effective_target_riscv_v_misalign_ok.
---
 gcc/config/riscv/riscv-opts.h |  3 ---
 gcc/config/riscv/riscv.h  |  5 
 gcc/config/riscv/riscv.opt|  5 
 gcc/testsuite/lib/target-supports.exp | 34 +--
 4 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/gcc/config/riscv/riscv-opts.h b/gcc/config/riscv/riscv-opts.h
index 1b2dd5757a8..f58a07abffc 100644
--- a/gcc/config/riscv/riscv-opts.h
+++ b/gcc/config/riscv/riscv-opts.h
@@ -147,9 +147,6 @@ enum rvv_vector_bits_enum {
  ? 0   
\
  : 32 << (__builtin_popcount (opts->x_riscv_zvl_flags) - 1))
 
-/* TODO: Enable RVV movmisalign by default for now.  */
-#define TARGET_VECTOR_MISALIGN_SUPPORTED 1
-
 /* The maximmum LMUL according to user configuration.  */
 #define TARGET_MAX_LMUL
\
   (int) (rvv_max_lmul == RVV_DYNAMIC ? RVV_M8 : rvv_max_lmul)
diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
index d6b14c4d620..8434e5677b6 100644
--- a/gcc/config/riscv/riscv.h
+++ b/gcc/config/riscv/riscv.h
@@ -934,6 +934,11 @@ extern enum riscv_cc get_riscv_cc (const rtx use);
   || (riscv_microarchitecture == sifive_p400) \
   || (riscv_microarchitecture == sifive_p600))
 
+/* True if the target supports misaligned vector loads and stores.  */
+#define TARGET_VECTOR_MISALIGN_SUPPORTED \
+  (rvv_allow_misalign == 1 \
+   || riscv_microarchitecture == generic_ooo)
+
 #define LOGICAL_OP_NON_SHORT_CIRCUIT 0
 
 /* Control the assembler format that we output.  */
diff --git a/gcc/config/riscv/riscv.opt b/gcc/config/riscv/riscv.opt
index 87f58332016..cff34eee8c9 100644
--- a/gcc/config/riscv/riscv.opt
+++ b/gcc/config/riscv/riscv.opt
@@ -628,3 +628,8 @@ Specify TLS dialect.
 mfence-tso
 Target Var(TARGET_FENCE_TSO) Init(1)
 Specifies whether the fence.tso instruction should be used.
+
+mrvv-allow-misalign
+Target Var(rvv_allow_misalign) Init(0)
+Allow the creation of misaligned vector loads and stores irrespective of the
+current uarch. The default is off.
diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index f0f6da52275..ebb908f5c8f 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -2034,7 +2034,7 @@ proc check_effective_target_riscv_zvfh_ok { } {
 # check if we can execute vector insns with the given hardware or
 # simulator
 set gcc_march [regsub {[[:alnum:]]*} [riscv_get_arch] ]
-if { [check_runtime ${gcc_march}_exec {
+if { [check_runtime ${gcc_march}_zvfh_exec {
int main()
{
asm ("vsetivli zero,8,e16,m1,ta,ma");
@@ -2047,6 +2047,29 @@ proc check_effective_target_riscv_zvfh_ok { } {
 return 0
 }
 
+# Return 1 if we can load a vector from a 1-byte aligned address.
+
+proc check_effective_target_riscv_v_misalign_ok { } {
+
+if { ![check_effective_target_riscv_v_ok] } {
+   return 0
+}
+
+set gcc_march [riscv_get_arch]
+if { [check_runtime ${gcc_march}_misalign_exec {
+ int main() {
+ unsigned char a[16]
+   = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
+ asm ("vsetivli zero,7,e8,m1,ta,ma");
+ asm ("addi a7,%0,1" : : "r" (a) : "a7" );
+ asm ("vle8.v v8,0(a7)" : : : "v8");
+ return 0; } } "-march=${gcc_march}"] } {
+   return 1
+}
+
+return 0
+}
+
 proc riscv_get_arch { } {
 set gcc_march ""
 # ??? do we neeed to add more extensions to the list below?
@@ -8139,7 +8162,6 @@ proc check_effective_target_vect_hw_misalign { } {
 || ([istarget mips*-*-*] && [et-is-effective-target mips_msa])
 || ([istarget s390*-*-*]
   

Re: [PATCH] RISC-V: Enable vectorization for vect-early-break_124-pr114403.c

2024-05-21 Thread Robin Dapp
The patch is OK from the riscv side.  generic-ooo includes fast unaligned
access.

Regards
 Robin


[gcc r15-639] RISC-V: Add initial cost handling for segment loads/stores.

2024-05-17 Thread Robin Dapp via Gcc-cvs
https://gcc.gnu.org/g:e0b9c8ad7098fb08a25a61fe17d4274dd73e5145

commit r15-639-ge0b9c8ad7098fb08a25a61fe17d4274dd73e5145
Author: Robin Dapp 
Date:   Mon Feb 26 13:09:15 2024 +0100

RISC-V: Add initial cost handling for segment loads/stores.

This patch makes segment loads and stores more expensive.  It adds
segment_permute_2 as well as 3 to 8 cost fields to the common vector
costs and adds handling to adjust_stmt_cost.

gcc/ChangeLog:

* config/riscv/riscv-protos.h (struct common_vector_cost): Add
segment_permute cost.
* config/riscv/riscv-vector-costs.cc (costs::adjust_stmt_cost):
Handle segment loads/stores.
* config/riscv/riscv.cc: Initialize segment_permute_[2-8] to 1.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/costmodel/riscv/rvv/pr113112-4.c: Adjust test.

Diff:
---
 gcc/config/riscv/riscv-protos.h|   9 ++
 gcc/config/riscv/riscv-vector-costs.cc | 163 +++--
 gcc/config/riscv/riscv.cc  |  14 ++
 .../gcc.dg/vect/costmodel/riscv/rvv/pr113112-4.c   |   4 +-
 4 files changed, 146 insertions(+), 44 deletions(-)

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 565ead1382a7..004ceb1031b8 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -222,6 +222,15 @@ struct common_vector_cost
   const int gather_load_cost;
   const int scatter_store_cost;
 
+  /* Segment load/store permute cost.  */
+  const int segment_permute_2;
+  const int segment_permute_3;
+  const int segment_permute_4;
+  const int segment_permute_5;
+  const int segment_permute_6;
+  const int segment_permute_7;
+  const int segment_permute_8;
+
   /* Cost of a vector-to-scalar operation.  */
   const int vec_to_scalar_cost;
 
diff --git a/gcc/config/riscv/riscv-vector-costs.cc 
b/gcc/config/riscv/riscv-vector-costs.cc
index 4582b0db4250..0a88e142a934 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -1052,6 +1052,25 @@ costs::better_main_loop_than_p (const vector_costs 
*uncast_other) const
   return vector_costs::better_main_loop_than_p (other);
 }
 
+/* Returns the group size i.e. the number of vectors to be loaded by a
+   segmented load/store instruction.  Return 0 if it is no segmented
+   load/store.  */
+static int
+segment_loadstore_group_size (enum vect_cost_for_stmt kind,
+ stmt_vec_info stmt_info)
+{
+  if (stmt_info
+  && (kind == vector_load || kind == vector_store)
+  && STMT_VINFO_DATA_REF (stmt_info))
+{
+  stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
+  if (stmt_info
+ && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_LOAD_STORE_LANES)
+   return DR_GROUP_SIZE (stmt_info);
+}
+  return 0;
+}
+
 /* Adjust vectorization cost after calling riscv_builtin_vectorization_cost.
For some statement, we would like to further fine-grain tweak the cost on
top of riscv_builtin_vectorization_cost handling which doesn't have any
@@ -1076,55 +1095,115 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, 
loop_vec_info loop,
 case vector_load:
 case vector_store:
{
- /* Unit-stride vector loads and stores do not have offset addressing
-as opposed to scalar loads and stores.
-If the address depends on a variable we need an additional
-add/sub for each load/store in the worst case.  */
- if (stmt_info && stmt_info->stmt)
+ if (stmt_info && stmt_info->stmt && STMT_VINFO_DATA_REF (stmt_info))
{
- data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
- class loop *father = stmt_info->stmt->bb->loop_father;
- if (!loop && father && !father->inner && father->superloops)
+ /* Segment loads and stores.  When the group size is > 1
+the vectorizer will add a vector load/store statement for
+each vector in the group.  Here we additionally add permute
+costs for each.  */
+ /* TODO: Indexed and ordered/unordered cost.  */
+ int group_size = segment_loadstore_group_size (kind, stmt_info);
+ if (group_size > 1)
+   {
+ switch (group_size)
+   {
+   case 2:
+ if (riscv_v_ext_vector_mode_p (loop->vector_mode))
+   stmt_cost += costs->vla->segment_permute_2;
+ else
+   stmt_cost += costs->vls->segment_permute_2;
+ break;
+   case 3:
+ if (riscv_v_ext_vector_mode_p (loop->vector_mode))
+   stmt_cost += costs-&

[gcc r15-638] internal-fn: Do not force vcond_mask operands to reg.

2024-05-17 Thread Robin Dapp via Gcc-cvs
https://gcc.gnu.org/g:7ca35f2e430081d6ec91e910002f92d9713350fa

commit r15-638-g7ca35f2e430081d6ec91e910002f92d9713350fa
Author: Robin Dapp 
Date:   Fri May 10 12:44:44 2024 +0200

internal-fn: Do not force vcond_mask operands to reg.

In order to directly use constants this patch removes force_regs
in the vcond_mask expander.

gcc/ChangeLog:

PR middle-end/113474

* internal-fn.cc (expand_vec_cond_mask_optab_fn):  Remove
force_regs.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr113474.c: New test.

Diff:
---
 gcc/internal-fn.cc|  3 ---
 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c | 13 +
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 73045ca8c8c1..9c09026793fa 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3165,9 +3165,6 @@ expand_vec_cond_mask_optab_fn (internal_fn, gcall *stmt, 
convert_optab optab)
   rtx_op1 = expand_normal (op1);
   rtx_op2 = expand_normal (op2);
 
-  mask = force_reg (mask_mode, mask);
-  rtx_op1 = force_reg (mode, rtx_op1);
-
   rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand ([0], target, mode);
   create_input_operand ([1], rtx_op1, mode);
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c
new file mode 100644
index ..0364bf9f5e38
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target riscv_v } }  */
+/* { dg-additional-options "-std=c99" }  */
+
+void
+foo (int n, int **a)
+{
+  int b;
+  for (b = 0; b < n; b++)
+for (long e = 8; e > 0; e--)
+  a[b][e] = a[b][e] == 15;
+}
+
+/* { dg-final { scan-assembler "vmerge.vim" } }  */


Re: [PATCH v6] RISC-V: Implement IFN SAT_ADD for both the scalar and vector

2024-05-17 Thread Robin Dapp
Hi Pan,

all in all LGTM.  Just insignificant nits.

> +void
> +expand_vec_usadd (rtx op_0, rtx op_1, rtx op_2, machine_mode vec_mode)
> +{
> +  emit_vec_saddu (op_0, op_1, op_2, BINARY_OP, vec_mode);
> +}
> +

Do we really need this function?  Or do you want it to be a dispatcher
for later?  If it should do more than just a call, please document.

> +  /* Step-1: sum = x + y  */
> +  if (mode == SImode && mode != Xmode)
> +{ /* Take addw to avoid the sum truncate.  */
> +  rtx simode_sum = gen_reg_rtx (SImode);
> +  riscv_emit_binary (PLUS, simode_sum, x, y);
> +  emit_move_insn (xmode_sum, gen_lowpart (Xmode, simode_sum));
> +}
> +  else
> +riscv_emit_binary (PLUS, xmode_sum, xmode_x, xmode_y);

I would add a top-level comment that the emulation is just
sum = x + y;
if (sum < x)
  sum = TYPE_MAX;
and we can implement the if/then by sltu and or.

No need for another revision, though.

Regards
 Robin


[PATCH] RISC-V: Remove dead perm series code and document.

2024-05-17 Thread Robin Dapp
Hi,

with the introduction of shuffle_series_patterns the explicit handler
code for a perm series is dead.  This patch removes it and also adds
a function-level comment to shuffle_series_patterns.

Regtested on rv64gcv_zvfh_zvbb.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/riscv-v.cc (expand_const_vector): Document.
(shuffle_extract_and_slide1up_patterns): Remove.
---
 gcc/config/riscv/riscv-v.cc | 26 --
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 8b41b9c7774..93c2dcd04e4 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -1485,28 +1485,6 @@ expand_const_vector (rtx target, rtx src)
  emit_vlmax_insn (code_for_pred_merge (mode), MERGE_OP, ops);
}
}
-  else if (npatterns == 1 && nelts_per_pattern == 3)
-   {
- /* Generate the following CONST_VECTOR:
-{ base0, base1, base1 + step, base1 + step * 2, ... }  */
- rtx base0 = builder.elt (0);
- rtx base1 = builder.elt (1);
- rtx base2 = builder.elt (2);
-
- rtx step = simplify_binary_operation (MINUS, builder.inner_mode (),
-   base2, base1);
-
- /* Step 1 - { base1, base1 + step, base1 + step * 2, ... }  */
- rtx tmp = gen_reg_rtx (mode);
- expand_vec_series (tmp, base1, step);
- /* Step 2 - { base0, base1, base1 + step, base1 + step * 2, ... }  */
- if (!rtx_equal_p (base0, const0_rtx))
-   base0 = force_reg (builder.inner_mode (), base0);
-
- insn_code icode = optab_handler (vec_shl_insert_optab, mode);
- gcc_assert (icode != CODE_FOR_nothing);
- emit_insn (GEN_FCN (icode) (target, tmp, base0));
-   }
   else
/* TODO: We will enable more variable-length vector in the future.  */
gcc_unreachable ();
@@ -3580,6 +3558,10 @@ shuffle_extract_and_slide1up_patterns (struct 
expand_vec_perm_d *d)
   return true;
 }
 
+/* This looks for a series pattern in the provided vector permute structure D.
+   If successful it emits a series insn as well as a gather to implement it.
+   Return true if successful, false otherwise.  */
+
 static bool
 shuffle_series_patterns (struct expand_vec_perm_d *d)
 {
-- 
2.45.0


[PATCH] RISC-V: Add vector popcount, clz, ctz.

2024-05-17 Thread Robin Dapp
Hi,

this patch adds the zvbb vcpop, vclz and vctz to the autovec machinery
as well as tests for them.  It also changes several non-VLS iterators
to V_VLS iterators for consistency.

Regtested on rv64gcv_zvfh_zvbb.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/autovec.md (ctz2): New expander.
(clz2): Ditto.
* config/riscv/generic-vector-ooo.md: Add bitmanip ops to insn
reservation.
* config/riscv/vector-crypto.md: Add VLS modes to insns.
* config/riscv/vector.md: Add bitmanip ops to mode_idx and other
attributes.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/unop/popcount-1.c: Adjust check
for zvbb.
* gcc.target/riscv/rvv/autovec/unop/popcount-run-1.c: Ditto.
* gcc.target/riscv/rvv/autovec/unop/popcount-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/unop/popcount-3.c: New test.
* gcc.target/riscv/rvv/autovec/unop/popcount-template.h: New test.
* gcc.target/riscv/rvv/autovec/unop/clz-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/clz-run.c: New test.
* gcc.target/riscv/rvv/autovec/unop/clz-template.h: New test.
* gcc.target/riscv/rvv/autovec/unop/ctz-1.c: New test.
* gcc.target/riscv/rvv/autovec/unop/ctz-run.c: New test.
* gcc.target/riscv/rvv/autovec/unop/ctz-template.h: New test.
---
 gcc/config/riscv/autovec.md   | 30 +-
 gcc/config/riscv/generic-vector-ooo.md|  2 +-
 gcc/config/riscv/vector-crypto.md | 93 ++-
 gcc/config/riscv/vector.md| 14 +--
 .../gcc.target/riscv/rvv/autovec/unop/clz-1.c |  8 ++
 .../riscv/rvv/autovec/unop/clz-run.c  | 36 +++
 .../riscv/rvv/autovec/unop/clz-template.h | 21 +
 .../gcc.target/riscv/rvv/autovec/unop/ctz-1.c |  8 ++
 .../riscv/rvv/autovec/unop/ctz-run.c  | 36 +++
 .../riscv/rvv/autovec/unop/ctz-template.h | 21 +
 .../riscv/rvv/autovec/unop/popcount-1.c   |  4 +-
 .../riscv/rvv/autovec/unop/popcount-2.c   |  4 +-
 .../riscv/rvv/autovec/unop/popcount-3.c   |  8 ++
 .../riscv/rvv/autovec/unop/popcount-run-1.c   |  3 +-
 .../rvv/autovec/unop/popcount-template.h  | 21 +
 15 files changed, 250 insertions(+), 59 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/clz-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/clz-run.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/clz-template.h
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/ctz-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/ctz-run.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/ctz-template.h
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-3.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/popcount-template.h

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index aa1ae0fe075..a9391ed146c 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1566,7 +1566,7 @@ (define_expand "xorsign3"
 })
 
 ;; 
---
-;; - [INT] POPCOUNT.
+;; - [INT] POPCOUNT, CTZ and CLZ.
 ;; 
---
 
 (define_expand "popcount2"
@@ -1574,10 +1574,36 @@ (define_expand "popcount2"
(match_operand:V_VLSI 1 "register_operand")]
   "TARGET_VECTOR"
 {
-  riscv_vector::expand_popcount (operands);
+  if (!TARGET_ZVBB)
+riscv_vector::expand_popcount (operands);
+  else
+{
+  riscv_vector::emit_vlmax_insn (code_for_pred_v (POPCOUNT, mode),
+riscv_vector::CPOP_OP, operands);
+}
   DONE;
 })
 
+(define_expand "ctz2"
+  [(match_operand:V_VLSI 0 "register_operand")
+   (match_operand:V_VLSI 1 "register_operand")]
+  "TARGET_ZVBB"
+  {
+riscv_vector::emit_vlmax_insn (code_for_pred_v (CTZ, mode),
+  riscv_vector::CPOP_OP, operands);
+DONE;
+})
+
+(define_expand "clz2"
+  [(match_operand:V_VLSI 0 "register_operand")
+   (match_operand:V_VLSI 1 "register_operand")]
+  "TARGET_ZVBB"
+  {
+riscv_vector::emit_vlmax_insn (code_for_pred_v (CLZ, mode),
+  riscv_vector::CPOP_OP, operands);
+DONE;
+})
+
 
 ;; -
 ;;  [INT] Highpart multiplication
diff --git a/gcc/config/riscv/generic-vector-ooo.md 
b/gcc/config/riscv/generic-vector-ooo.md
index 96cb1a0be29..5e933c83841 100644
--- a/gcc/config/riscv/generic-vector-ooo.md
+++ b/gcc/config/riscv/generic-vector-ooo.md
@@ -74,7 +74,7 @@ (define_insn_reservation "vec_fmul" 6
 
 ;; Vector crypto, assumed to be a generic operation for now.
 (define_insn_reservation "vec_crypto" 4
-  (eq_attr "type" "crypto")
+  (eq_attr "type" 

[PATCH] RISC-V: Add vandn combine helper.

2024-05-17 Thread Robin Dapp
Hi,

this patch adds a combine pattern for vandn as well as tests for it.

Regtested on rv64gcv_zvfh_zvbb.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vandn_): New pattern.
* config/riscv/vector.md: Add vandn to mode_idx.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/vandn-1.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vandn-run.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vandn-template.h: New test.
---
 gcc/config/riscv/autovec-opt.md   | 18 +++
 gcc/config/riscv/vector.md|  2 +-
 .../riscv/rvv/autovec/binop/vandn-1.c |  8 +++
 .../riscv/rvv/autovec/binop/vandn-run.c   | 54 +++
 .../riscv/rvv/autovec/binop/vandn-template.h  | 38 +
 5 files changed, 119 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vandn-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vandn-run.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vandn-template.h

diff --git a/gcc/config/riscv/autovec-opt.md b/gcc/config/riscv/autovec-opt.md
index 06438f9e2f7..07372d965b0 100644
--- a/gcc/config/riscv/autovec-opt.md
+++ b/gcc/config/riscv/autovec-opt.md
@@ -1559,3 +1559,21 @@ (define_insn_and_split "*vwsll_zext1_trunc_scalar_"
 DONE;
   }
   [(set_attr "type" "vwsll")])
+
+;; vnot + vand = vandn.
+(define_insn_and_split "*vandn_"
+ [(set (match_operand:V_VLSI 0 "register_operand" "=vr")
+   (and:V_VLSI
+(not:V_VLSI
+  (match_operand:V_VLSI  2 "register_operand"  "vr"))
+(match_operand:V_VLSI1 "register_operand"  "vr")))]
+  "TARGET_ZVBB && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+insn_code icode = code_for_pred_vandn (mode);
+riscv_vector::emit_vlmax_insn (icode, riscv_vector::BINARY_OP, operands);
+DONE;
+  }
+  [(set_attr "type" "vandn")])
diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index c6a3845dc13..dafcd7d9bf9 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -743,7 +743,7 @@ (define_attr "mode_idx" ""
vfcmp,vfminmax,vfsgnj,vfclass,vfmerge,vfmov,\

vfcvtitof,vfncvtitof,vfncvtftoi,vfncvtftof,vmalu,vmiota,vmidx,\

vimovxv,vfmovfv,vslideup,vslidedown,vislide1up,vislide1down,vfslide1up,vfslide1down,\
-   vgather,vcompress,vmov,vnclip,vnshift")
+   vgather,vcompress,vmov,vnclip,vnshift,vandn")
   (const_int 0)
 
   (eq_attr "type" "vimovvx,vfmovvf")
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vandn-1.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vandn-1.c
new file mode 100644
index 000..3bb5bf8dd5b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vandn-1.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-add-options "riscv_v" } */
+/* { dg-add-options "riscv_zvbb" } */
+/* { dg-additional-options "-std=c99 -fno-vect-cost-model" } */
+
+#include "vandn-template.h"
+
+/* { dg-final { scan-assembler-times {\tvandn\.vv} 8 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vandn-run.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vandn-run.c
new file mode 100644
index 000..243c5975068
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vandn-run.c
@@ -0,0 +1,54 @@
+/* { dg-do run } */
+/* { dg-require-effective-target "riscv_zvbb_ok" } */
+/* { dg-add-options "riscv_v" } */
+/* { dg-add-options "riscv_zvbb" } */
+/* { dg-additional-options "-std=c99 -fno-vect-cost-model" } */
+
+#include "vandn-template.h"
+
+#include 
+
+#define SZ 512
+
+#define RUN(TYPE, VAL) 
\
+  TYPE a##TYPE[SZ];
\
+  TYPE b##TYPE[SZ];
\
+  for (int i = 0; i < SZ; i++) 
\
+{  
\
+  a##TYPE[i] = 123;
\
+  b##TYPE[i] = VAL;
\
+}  
\
+  vandn_##TYPE (a##TYPE, a##TYPE, b##TYPE, SZ);
\
+  for (int i = 0; i < SZ; i++) 
\
+assert (a##TYPE[i] == (TYPE) (123 & ~VAL));
+
+#define RUN2(TYPE, VAL)
\
+  TYPE as##TYPE[SZ];   
\
+  for (int i = 0; i < SZ; i++) 
\
+as##TYPE[i] = 123;  

[PATCH] RISC-V: Use widening shift for scatter/gather if applicable.

2024-05-17 Thread Robin Dapp
Hi,

with the zvbb extension we can emit a widening shift for scatter/gather
index preparation in case we need to multiply by 2 and zero extend.

The patch also adds vwsll to the mode_idx attribute and removes the
mode from shift-count operand of the insn pattern.

Regtested on rv64gcv_zvfh_zvbb.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/riscv-v.cc (expand_gather_scatter): Use vwsll if
applicable.
* config/riscv/vector-crypto.md: Remove mode from vwsll shift
count operator.
* config/riscv/vector.md: Add vwsll to mode iterator.

gcc/testsuite/ChangeLog:

* lib/target-supports.exp: Add zvbb.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_64-12-zvbb.c: 
New test.
---
 gcc/config/riscv/riscv-v.cc   |  42 +--
 gcc/config/riscv/vector-crypto.md |   4 +-
 gcc/config/riscv/vector.md|   4 +-
 .../gather-scatter/gather_load_64-12-zvbb.c   | 113 ++
 gcc/testsuite/lib/target-supports.exp |  48 +++-
 5 files changed, 193 insertions(+), 18 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_64-12-zvbb.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 814c5febabe..8b41b9c7774 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -4016,7 +4016,7 @@ expand_gather_scatter (rtx *ops, bool is_load)
 {
   rtx ptr, vec_offset, vec_reg;
   bool zero_extend_p;
-  int scale_log2;
+  int shift;
   rtx mask = ops[5];
   rtx len = ops[6];
   if (is_load)
@@ -4025,7 +4025,7 @@ expand_gather_scatter (rtx *ops, bool is_load)
   ptr = ops[1];
   vec_offset = ops[2];
   zero_extend_p = INTVAL (ops[3]);
-  scale_log2 = exact_log2 (INTVAL (ops[4]));
+  shift = exact_log2 (INTVAL (ops[4]));
 }
   else
 {
@@ -4033,7 +4033,7 @@ expand_gather_scatter (rtx *ops, bool is_load)
   ptr = ops[0];
   vec_offset = ops[1];
   zero_extend_p = INTVAL (ops[2]);
-  scale_log2 = exact_log2 (INTVAL (ops[3]));
+  shift = exact_log2 (INTVAL (ops[3]));
 }
 
   machine_mode vec_mode = GET_MODE (vec_reg);
@@ -4043,9 +4043,12 @@ expand_gather_scatter (rtx *ops, bool is_load)
   poly_int64 nunits = GET_MODE_NUNITS (vec_mode);
   bool is_vlmax = is_vlmax_len_p (vec_mode, len);
 
+  bool use_widening_shift = false;
+
   /* Extend the offset element to address width.  */
   if (inner_offsize < BITS_PER_WORD)
 {
+  use_widening_shift = TARGET_ZVBB && zero_extend_p && shift == 1;
   /* 7.2. Vector Load/Store Addressing Modes.
 If the vector offset elements are narrower than XLEN, they are
 zero-extended to XLEN before adding to the ptr effective address. If
@@ -4054,8 +4057,8 @@ expand_gather_scatter (rtx *ops, bool is_load)
 raise an illegal instruction exception if the EEW is not supported for
 offset elements.
 
-RVV spec only refers to the scale_log == 0 case.  */
-  if (!zero_extend_p || scale_log2 != 0)
+RVV spec only refers to the shift == 0 case.  */
+  if (!zero_extend_p || shift)
{
  if (zero_extend_p)
inner_idx_mode
@@ -4064,19 +4067,32 @@ expand_gather_scatter (rtx *ops, bool is_load)
inner_idx_mode = int_mode_for_size (BITS_PER_WORD, 0).require ();
  machine_mode new_idx_mode
= get_vector_mode (inner_idx_mode, nunits).require ();
- rtx tmp = gen_reg_rtx (new_idx_mode);
- emit_insn (gen_extend_insn (tmp, vec_offset, new_idx_mode, idx_mode,
- zero_extend_p ? true : false));
- vec_offset = tmp;
+ if (!use_widening_shift)
+   {
+ rtx tmp = gen_reg_rtx (new_idx_mode);
+ emit_insn (gen_extend_insn (tmp, vec_offset, new_idx_mode, 
idx_mode,
+ zero_extend_p ? true : false));
+ vec_offset = tmp;
+   }
  idx_mode = new_idx_mode;
}
 }
 
-  if (scale_log2 != 0)
+  if (shift)
 {
-  rtx tmp = expand_binop (idx_mode, ashl_optab, vec_offset,
- gen_int_mode (scale_log2, Pmode), NULL_RTX, 0,
- OPTAB_DIRECT);
+  rtx tmp;
+  if (!use_widening_shift)
+   tmp = expand_binop (idx_mode, ashl_optab, vec_offset,
+   gen_int_mode (shift, Pmode), NULL_RTX, 0,
+   OPTAB_DIRECT);
+  else
+   {
+ tmp = gen_reg_rtx (idx_mode);
+ insn_code icode = code_for_pred_vwsll_scalar (idx_mode);
+ rtx ops[] = {tmp, vec_offset, const1_rtx};
+ emit_vlmax_insn (icode, BINARY_OP, ops);
+   }
+
   vec_offset = tmp;
 }
 
diff --git a/gcc/config/riscv/vector-crypto.md 
b/gcc/config/riscv/vector-crypto.md
index 24822e2712c..0ddc2f3f3c6 100755
--- a/gcc/config/riscv/vector-crypto.md
+++ b/gcc/config/riscv/vector-crypto.md
@@ 

[PATCH] RISC-V: Add vwsll combine helpers.

2024-05-17 Thread Robin Dapp
Hi,

this patch enables the usage of vwsll in autovec context by adding the
necessary combine patterns and tests.

Regtested on rv64gcv_zvfh_zvbb.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vwsll_zext1_): New
pattern.
(*vwsll_zext2_): Ditto.
(*vwsll_zext1_scalar_): Ditto.
(*vwsll_zext1_trunc_): Ditto.
(*vwsll_zext2_trunc_): Ditto.
(*vwsll_zext1_trunc_scalar_): Ditto.
* config/riscv/vector-crypto.md: Make pattern similar to other
narrowing/widening patterns.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/vwsll-1.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vwsll-run.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vwsll-template.h: New test.
---
 gcc/config/riscv/autovec-opt.md   | 123 ++
 gcc/config/riscv/vector-crypto.md |   2 +-
 .../riscv/rvv/autovec/binop/vwsll-1.c |  10 ++
 .../riscv/rvv/autovec/binop/vwsll-run.c   |  67 ++
 .../riscv/rvv/autovec/binop/vwsll-template.h  |  49 +++
 5 files changed, 250 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vwsll-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vwsll-run.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vwsll-template.h

diff --git a/gcc/config/riscv/autovec-opt.md b/gcc/config/riscv/autovec-opt.md
index 645dc53d868..06438f9e2f7 100644
--- a/gcc/config/riscv/autovec-opt.md
+++ b/gcc/config/riscv/autovec-opt.md
@@ -1436,3 +1436,126 @@ (define_insn_and_split "*n"
 DONE;
   }
   [(set_attr "type" "vmalu")])
+
+;; vzext.vf2 + vsll = vwsll.
+(define_insn_and_split "*vwsll_zext1_"
+  [(set (match_operand:VWEXTI 0"register_operand" "=vr 
")
+  (ashift:VWEXTI
+   (zero_extend:VWEXTI
+ (match_operand: 1 "register_operand" " vr "))
+ (match_operand: 2 "vector_shift_operand" "vrvk")))]
+  "TARGET_ZVBB && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+insn_code icode = code_for_pred_vwsll (mode);
+riscv_vector::emit_vlmax_insn (icode, riscv_vector::BINARY_OP, operands);
+DONE;
+  }
+  [(set_attr "type" "vwsll")])
+
+(define_insn_and_split "*vwsll_zext2_"
+  [(set (match_operand:VWEXTI 0"register_operand" "=vr 
")
+  (ashift:VWEXTI
+   (zero_extend:VWEXTI
+ (match_operand: 1 "register_operand" " vr "))
+   (zero_extend:VWEXTI
+ (match_operand: 2 "vector_shift_operand" "vrvk"]
+  "TARGET_ZVBB && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+insn_code icode = code_for_pred_vwsll (mode);
+riscv_vector::emit_vlmax_insn (icode, riscv_vector::BINARY_OP, operands);
+DONE;
+  }
+  [(set_attr "type" "vwsll")])
+
+
+(define_insn_and_split "*vwsll_zext1_scalar_"
+  [(set (match_operand:VWEXTI 0"register_operand"  
  "=vr")
+  (ashift:VWEXTI
+   (zero_extend:VWEXTI
+ (match_operand: 1 "register_operand"" 
vr"))
+ (match_operand:2 "vector_scalar_shift_operand" " 
rK")))]
+  "TARGET_ZVBB && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+if (GET_CODE (operands[2]) == SUBREG)
+  operands[2] = SUBREG_REG (operands[2]);
+insn_code icode = code_for_pred_vwsll_scalar (mode);
+riscv_vector::emit_vlmax_insn (icode, riscv_vector::BINARY_OP, operands);
+DONE;
+  }
+  [(set_attr "type" "vwsll")])
+
+;; For
+;;   uint16_t dst;
+;;   uint8_t a, b;
+;;   dst = vwsll (a, b)
+;; we seem to create
+;;   aa = (int) a;
+;;   bb = (int) b;
+;;   dst = (short) vwsll (aa, bb);
+;; The following patterns help to combine this idiom into one vwsll.
+
+(define_insn_and_split "*vwsll_zext1_trunc_"
+  [(set (match_operand: 0   "register_operand""=vr ")
+(truncate:
+  (ashift:VQEXTI
+   (zero_extend:VQEXTI
+ (match_operand: 1   "register_operand" " vr "))
+   (match_operand:VQEXTI   2   "vector_shift_operand" "vrvk"]
+  "TARGET_ZVBB && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+insn_code icode = code_for_pred_vwsll (mode);
+riscv_vector::emit_vlmax_insn (icode, riscv_vector::BINARY_OP, operands);
+DONE;
+  }
+  [(set_attr "type" "vwsll")])
+
+(define_insn_and_split "*vwsll_zext2_trunc_"
+  [(set (match_operand: 0   "register_operand""=vr ")
+(truncate:
+  (ashift:VQEXTI
+   (zero_extend:VQEXTI
+ (match_operand: 1   "register_operand" " vr "))
+   (zero_extend:VQEXTI
+ (match_operand: 2   "vector_shift_operand" "vrvk")]
+  "TARGET_ZVBB && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(const_int 0)]
+  {
+insn_code icode = code_for_pred_vwsll (mode);
+riscv_vector::emit_vlmax_insn (icode, riscv_vector::BINARY_OP, operands);
+DONE;
+  }
+  [(set_attr "type" "vwsll")])
+

[PATCH] RISC-V: Split vwadd.wx and vwsub.wx and add helpers.

2024-05-17 Thread Robin Dapp
Hi,

vwadd.wx and vwsub.wx have the same problem vfwadd.wf had.  This patch
splits the insn pattern in the same way vfwadd.wf was split.

It also adds two patterns to recognize extended scalars.  In practice
those do not provide a lot of improvement over what we already have but
in some instances we can get rid of redundant extensions.  If somebody
considers the patterns excessive, I'd be open to not add them.

Regtested on rv64gcv_zvfh_zvbb.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/vector.md: Split vwadd.wx/vwsub.wx pattern and
add extended_scalar patterns.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/pr115068.c: Add vwadd.wx/vwsub.wx
tests.
* gcc.target/riscv/rvv/base/pr115068-run.c: Include pr115068.c.
* gcc.target/riscv/rvv/base/vwaddsub-1.c: New test.
---
 gcc/config/riscv/vector.md| 62 ---
 .../gcc.target/riscv/rvv/base/pr115068-run.c  | 24 +--
 .../gcc.target/riscv/rvv/base/pr115068.c  | 26 
 .../gcc.target/riscv/rvv/base/vwaddsub-1.c| 47 ++
 4 files changed, 127 insertions(+), 32 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/vwaddsub-1.c

diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index 107914afa3a..248461302dd 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -3900,27 +3900,71 @@ (define_insn 
"@pred_single_widen_add"
(set_attr "mode" "")])
 
 (define_insn 
"@pred_single_widen__scalar"
-  [(set (match_operand:VWEXTI 0 "register_operand"   "=vr,   
vr")
+  [(set (match_operand:VWEXTI 0 "register_operand" "=vd,vd, 
vr, vr")
(if_then_else:VWEXTI
  (unspec:
-   [(match_operand: 1 "vector_mask_operand"   
"vmWc1,vmWc1")
-(match_operand 5 "vector_length_operand"  "   rK,   
rK")
-(match_operand 6 "const_int_operand"  "i,
i")
-(match_operand 7 "const_int_operand"  "i,
i")
-(match_operand 8 "const_int_operand"  "i,
i")
+   [(match_operand: 1 "vector_mask_operand"   " 
vm,vm,Wc1,Wc1")
+(match_operand 5 "vector_length_operand"  " rK,rK, rK, 
rK")
+(match_operand 6 "const_int_operand"  "  i, i,  i, 
 i")
+(match_operand 7 "const_int_operand"  "  i, i,  i, 
 i")
+(match_operand 8 "const_int_operand"  "  i, i,  i, 
 i")
 (reg:SI VL_REGNUM)
 (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
  (plus_minus:VWEXTI
-   (match_operand:VWEXTI 3 "register_operand" "   vr,   
vr")
+   (match_operand:VWEXTI 3 "register_operand" " vr,vr, vr, 
vr")
(any_extend:VWEXTI
  (vec_duplicate:
-   (match_operand: 4 "reg_or_0_operand"   "   rJ,   
rJ"
- (match_operand:VWEXTI 2 "vector_merge_operand"   "   vu,
0")))]
+   (match_operand: 4 "reg_or_0_operand"   " rJ,rJ, rJ, 
rJ"
+ (match_operand:VWEXTI 2 "vector_merge_operand"   " vu, 0, vu, 
 0")))]
   "TARGET_VECTOR"
   "vw.wx\t%0,%3,%z4%p1"
   [(set_attr "type" "vi")
(set_attr "mode" "")])
 
+(define_insn "@pred_single_widen_add_extended_scalar"
+  [(set (match_operand:VWEXTI 0 "register_operand" "=vd,vd, 
vr, vr")
+   (if_then_else:VWEXTI
+ (unspec:
+   [(match_operand: 1 "vector_mask_operand"   " 
vm,vm,Wc1,Wc1")
+(match_operand 5 "vector_length_operand"  " rK,rK, rK, 
rK")
+(match_operand 6 "const_int_operand"  "  i, i,  i, 
 i")
+(match_operand 7 "const_int_operand"  "  i, i,  i, 
 i")
+(match_operand 8 "const_int_operand"  "  i, i,  i, 
 i")
+(reg:SI VL_REGNUM)
+(reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
+ (plus:VWEXTI
+   (vec_duplicate:VWEXTI
+ (any_extend:
+   (match_operand: 4 "reg_or_0_operand"   " rJ,rJ, rJ, 
rJ")))
+   (match_operand:VWEXTI 3 "register_operand" " vr,vr, vr, 
vr"))
+ (match_operand:VWEXTI 2 "vector_merge_operand"   " vu, 0, vu, 
 0")))]
+  "TARGET_VECTOR"
+  "vwadd.wx\t%0,%3,%z4%p1"
+  [(set_attr "type" "viwalu")
+   (set_attr "mode" "")])
+
+(define_insn "@pred_single_widen_sub_extended_scalar"
+  [(set (match_operand:VWEXTI 0 "register_operand" "=vd,vd, 
vr, vr")
+   (if_then_else:VWEXTI
+ (unspec:
+   [(match_operand: 1 "vector_mask_operand"   " 
vm,vm,Wc1,Wc1")
+(match_operand 5 "vector_length_operand"  " rK,rK, rK, 
rK")
+(match_operand 6 "const_int_operand"  "  i, i,  i, 
 i")
+(match_operand 7 

Re: [PATCH] internal-fn: Do not force vcond operand to reg.

2024-05-17 Thread Robin Dapp
> OK if that pre-commit CI works out.

The CI didn't pick it up, guess it needs to be a bit more explicit.
In the meanwhile, however, I managed to catch a short window with
> 10G free on gcc185 =>  Bootstrap and regtest successful on aarch64.
Going to push the patch later today.

Regards
 Robin


Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-05-16 Thread Robin Dapp
> Can eqne pattern removal patches be committed firstly?

Please first make sure you test with corner cases, NaNs in
particular.  I'm pretty sure we don't have any test cases for
those.

Regards
 Robin


Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-05-15 Thread Robin Dapp
Hi Demin,

are you still going to continue with this?

Regards
 Robin


Re: [PATCH] RISC-V: Do not allow v0 as dest when merging [PR115068].

2024-05-15 Thread Robin Dapp
> I saw vwadd/vwsub.wx have same issue. Could you change them and add test too ?

Yes, will do.  At first I didn't manage to reproduce it because we
seem to be lacking a combine-opt pattern for it.  I'm going to post
it separately.

Regards
 Robin



Re: [PATCH v1 2/3] RISC-V: Implement vectorizable early exit with vcond_mask_len

2024-05-13 Thread Robin Dapp
Hi Pan,

thanks for working on this.

In general the patch looks reasonable to me but I'd rather
have some more comments about the high-level idea.
E.g. cbranch is implemented like aarch64 by xor'ing the
bitmasks and comparing the result against zero (so we branch
based on mask equality).

> +;; vcond_mask_len

High-level description here instead please.

> +(define_insn_and_split "vcond_mask_len_"
> +  [(set (match_operand:VB 0 "register_operand")

> +(unspec: VB [
> + (match_operand:VB 1 "register_operand")
> + (match_operand:VB 2 "const_1_operand")

I guess it works like that because operand[2] is just implicitly
used anyway but shouldn't that rather be an all_ones_operand?

> +   && riscv_vector::get_vector_mode (Pmode, GET_MODE_NUNITS 
> (mode)).exists ()"

Seems a bit odd on first sight.  If all we want to do is to
select between two masks why do we need a large Pmode mode?

> +rtx ops[] = {operands[0], operands[1], operands[1], cmp, reg, 
> operands[4]};

So that's basically a mask-move with length?  Can't this be done
differently?  If not, please describe, maybe this is already
the shortest way.

Regards
 Robin



[PATCH] RISC-V: Do not allow v0 as dest when merging [PR115068].

2024-05-13 Thread Robin Dapp
Hi,

this patch splits the vfw...wf pattern so we do not emit
e.g. vfwadd.wf v0,v8,fa5,v0.t anymore.

Regtested on rv64gcv_zvfh.

Regards
 Robin

gcc/ChangeLog:

PR target/115068

* config/riscv/vector.md:  Split vfw.wf pattern.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/pr115068-run.c: New test.
* gcc.target/riscv/rvv/base/pr115068.c: New test.
---
 gcc/config/riscv/vector.md| 20 ++---
 .../gcc.target/riscv/rvv/base/pr115068-run.c  | 28 ++
 .../gcc.target/riscv/rvv/base/pr115068.c  | 29 +++
 3 files changed, 67 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr115068-run.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr115068.c

diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index 2a54f78df8e..e408baa809c 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -7178,24 +7178,24 @@ (define_insn "@pred_single_widen_sub"
(symbol_ref "riscv_vector::get_frm_mode (operands[9])"))])
 
 (define_insn "@pred_single_widen__scalar"
-  [(set (match_operand:VWEXTF 0 "register_operand"   "=vr,   
vr")
+  [(set (match_operand:VWEXTF 0 "register_operand""=vd, vd, 
vr, vr")
(if_then_else:VWEXTF
  (unspec:
-   [(match_operand: 1 "vector_mask_operand"   
"vmWc1,vmWc1")
-(match_operand 5 "vector_length_operand"  "   rK,   
rK")
-(match_operand 6 "const_int_operand"  "i,
i")
-(match_operand 7 "const_int_operand"  "i,
i")
-(match_operand 8 "const_int_operand"  "i,
i")
-(match_operand 9 "const_int_operand"  "i,
i")
+   [(match_operand: 1 "vector_mask_operand"  " vm, 
vm,Wc1,Wc1")
+(match_operand 5 "vector_length_operand" " rK, rK, rK, 
rK")
+(match_operand 6 "const_int_operand" "  i,  i,  i, 
 i")
+(match_operand 7 "const_int_operand" "  i,  i,  i, 
 i")
+(match_operand 8 "const_int_operand" "  i,  i,  i, 
 i")
+(match_operand 9 "const_int_operand" "  i,  i,  i, 
 i")
 (reg:SI VL_REGNUM)
 (reg:SI VTYPE_REGNUM)
 (reg:SI FRM_REGNUM)] UNSPEC_VPREDICATE)
  (plus_minus:VWEXTF
-   (match_operand:VWEXTF 3 "register_operand" "   vr,   
vr")
+   (match_operand:VWEXTF 3 "register_operand"" vr, vr, vr, 
vr")
(float_extend:VWEXTF
  (vec_duplicate:
-   (match_operand: 4 "register_operand"   "f,
f"
- (match_operand:VWEXTF 2 "vector_merge_operand"   "   vu,
0")))]
+   (match_operand: 4 "register_operand"  "  f,  f,  f, 
 f"
+ (match_operand:VWEXTF 2 "vector_merge_operand"  " vu,  0, vu, 
 0")))]
   "TARGET_VECTOR"
   "vfw.wf\t%0,%3,%4%p1"
   [(set_attr "type" "vf")
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/pr115068-run.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/pr115068-run.c
new file mode 100644
index 000..95ec8e06021
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/pr115068-run.c
@@ -0,0 +1,28 @@
+/* { dg-do run } */
+/* { dg-require-effective-target riscv_v_ok } */
+/* { dg-add-options riscv_v } */
+/* { dg-additional-options "-std=gnu99" } */
+
+#include 
+#include 
+
+vfloat64m8_t
+test_vfwadd_wf_f64m8_m (vbool8_t vm, vfloat64m8_t vs2, float rs1, size_t vl)
+{
+  return __riscv_vfwadd_wf_f64m8_m (vm, vs2, rs1, vl);
+}
+
+char global_memory[1024];
+void *fake_memory = (void *) global_memory;
+
+int
+main ()
+{
+  asm volatile ("fence" ::: "memory");
+  vfloat64m8_t vfwadd_wf_f64m8_m_vd = test_vfwadd_wf_f64m8_m (
+__riscv_vreinterpret_v_i8m1_b8 (__riscv_vundefined_i8m1 ()),
+__riscv_vundefined_f64m8 (), 1.0, __riscv_vsetvlmax_e64m8 ());
+  asm volatile ("" ::"vr"(vfwadd_wf_f64m8_m_vd) : "memory");
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/pr115068.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/pr115068.c
new file mode 100644
index 000..6d680037aa1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/pr115068.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-add-options riscv_v } */
+/* { dg-additional-options "-std=gnu99" } */
+
+#include 
+#include 
+
+vfloat64m8_t
+test_vfwadd_wf_f64m8_m (vbool8_t vm, vfloat64m8_t vs2, float rs1, size_t vl)
+{
+  return __riscv_vfwadd_wf_f64m8_m (vm, vs2, rs1, vl);
+}
+
+char global_memory[1024];
+void *fake_memory = (void *) global_memory;
+
+int
+main ()
+{
+  asm volatile ("fence" ::: "memory");
+  vfloat64m8_t vfwadd_wf_f64m8_m_vd = test_vfwadd_wf_f64m8_m (
+__riscv_vreinterpret_v_i8m1_b8 (__riscv_vundefined_i8m1 ()),
+

Re: [PATCH v1 3/3] RISC-V: Enable vectorizable early exit test

2024-05-13 Thread Robin Dapp
Hi Pan,

>  
> @@ -4114,6 +4115,7 @@ proc check_effective_target_vect_early_break_hw { } {
>   || [check_effective_target_arm_v8_neon_hw]
>   || [check_sse4_hw_available]
>   || [istarget amdgcn-*-*]
> + || [check_effective_target_riscv_v]
>   }}]
>  }

I believe this should be riscv_v_ok.  riscv_v only checks if we can
compile.  OK with that changed after 2/3 is in.

Regards
 Robin


Re: [PATCH] internal-fn: Do not force vcond operand to reg.

2024-05-13 Thread Robin Dapp
> What happens if we simply remove all of the force_reg here?

On x86 I bootstrapped and tested the attached without fallout
(gcc188, so it's no avx512-native machine and therefore limited
coverage).  riscv regtest is unchanged.
For aarch64 I would to rely on the pre-commit CI to pick it
up (does that work on sub-threads?).

Regards
 Robin


gcc/ChangeLog:

PR middle-end/113474

* internal-fn.cc (expand_vec_cond_mask_optab_fn):  Remove
force_regs.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr113474.c: New test.
---
 gcc/internal-fn.cc  |  3 ---
 .../gcc.target/riscv/rvv/autovec/pr113474.c | 13 +
 2 files changed, 13 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 2c764441cde..4d226c478b4 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3163,9 +3163,6 @@ expand_vec_cond_mask_optab_fn (internal_fn, gcall *stmt, 
convert_optab optab)
   rtx_op1 = expand_normal (op1);
   rtx_op2 = expand_normal (op2);
 
-  mask = force_reg (mask_mode, mask);
-  rtx_op1 = force_reg (mode, rtx_op1);
-
   rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand ([0], target, mode);
   create_input_operand ([1], rtx_op1, mode);
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c
new file mode 100644
index 000..0364bf9f5e3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target riscv_v } }  */
+/* { dg-additional-options "-std=c99" }  */
+
+void
+foo (int n, int **a)
+{
+  int b;
+  for (b = 0; b < n; b++)
+for (long e = 8; e > 0; e--)
+  a[b][e] = a[b][e] == 15;
+}
+
+/* { dg-final { scan-assembler "vmerge.vim" } }  */
-- 
2.45.0



Re: [PATCH] internal-fn: Do not force vcond operand to reg.

2024-05-13 Thread Robin Dapp
> How does this make a difference in the end?  I'd expect say forwprop to
> fix things?

In general we try to only add the masking "boilerplate" of our
instructions at split time so fwprop, combine et al. can do their
work uninhibited of it (and we don't need numerous
(if_then_else ... (if_then_else) ...) combinations in our patterns).
A vec constant we expand directly to a masked representation, though
which makes further simplification difficult.  I can experiment with
changing that if preferred.

My thinking was, however, that for other operations like binops we
directly emit the right variant via expand_operands without
forcing to a reg and don't even need to fwprop so I wanted to
imitate that.

Regards
 Robin



[PATCH] internal-fn: Do not force vcond operand to reg.

2024-05-10 Thread Robin Dapp
Hi,

this only forces the first comparison operator into a register if it is
not already suitable.

Bootstrap and regtest is running on x86 and aarch64, successful on p10.
Regtested on riscv.

gcc/ChangeLog:

PR middle-end/113474

* internal-fn.cc (expand_vec_cond_mask_optab_fn):  Only force
op1 to reg if necessary.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr113474.c: New test.

Regards
 Robin

---
 gcc/internal-fn.cc  |  3 ++-
 .../gcc.target/riscv/rvv/autovec/pr113474.c | 13 +
 2 files changed, 15 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 2c764441cde..72cc6b7a1f7 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3164,7 +3164,8 @@ expand_vec_cond_mask_optab_fn (internal_fn, gcall *stmt, 
convert_optab optab)
   rtx_op2 = expand_normal (op2);
 
   mask = force_reg (mask_mode, mask);
-  rtx_op1 = force_reg (mode, rtx_op1);
+  if (!insn_operand_matches (icode, 1, rtx_op1))
+rtx_op1 = force_reg (mode, rtx_op1);
 
   rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand ([0], target, mode);
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c
new file mode 100644
index 000..0364bf9f5e3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113474.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target riscv_v } }  */
+/* { dg-additional-options "-std=c99" }  */
+
+void
+foo (int n, int **a)
+{
+  int b;
+  for (b = 0; b < n; b++)
+for (long e = 8; e > 0; e--)
+  a[b][e] = a[b][e] == 15;
+}
+
+/* { dg-final { scan-assembler "vmerge.vim" } }  */
-- 
2.45.0


[PATCH] RISC-V: Add testcase for PR114749.

2024-04-25 Thread Robin Dapp
Hi,

this adds a test case for PR114749.
Going to commit as obvious unless somebody complains.

Regards
 Robin

gcc/testsuite/ChangeLog:

PR tree-optimization/114749

* gcc.target/riscv/rvv/autovec/pr114749.c: New test.
---
 .../gcc.target/riscv/rvv/autovec/pr114749.c   | 15 +++
 1 file changed, 15 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114749.c

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114749.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114749.c
new file mode 100644
index 000..6733b0481a6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114749.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv_zvl128b -mabi=lp64d -fwhole-program -O3 
-mrvv-vector-bits=zvl" } */
+
+extern int a[];
+extern char b[];
+int c = 24;
+_Bool d[24][24][24];
+_Bool (*e)[24][24] = d;
+int main() {
+  for (short f = 0; f < 24; f += 3)
+for (unsigned g = 0; g < (char)c; g += 2) {
+  a[f] = 0;
+  b[g] |= ({ e[f][f][f]; });
+}
+}
-- 
2.44.0


Re: State of risc-v port in the current merge, revert, rinse-repeat commotion

2024-04-24 Thread Robin Dapp
Thanks Vineet!

> The dynamic icounts looks sane (vs. Apr 10 snapshot) except for a
> regression in x264 which is likely independent of the chaos going on.
> 
>  Apr 10 | Apr 23  |
>   109f1b28fc94  |  6f0a646dd2fc   |
> +-+
> 276,584,692,883 | 277,816,987,018 |  -0.45%
> 913,452,236,000 | 927,291,935,180 |  -1.52%
> 903,916,092,805 | 915,364,006,176 |  -1.27%

x264 uses widening arithmetic so it could be the reverts.
Can you compare the hot functions (e.g. x264_pixel_sad_16x16)
if anything stands out surrounding the vwadd.wv for example?

Regards
 Robin



Re: [PATCH v1] Revert "RISC-V: Support highpart register overlap for vwcvt"

2024-04-24 Thread Robin Dapp
>  (define_insn "@pred_vwsll_scalar"
> - [(set (match_operand:VWEXTI 0 "register_operand" "=vd, vr, vd, vr, vd, vr, 
> vd, vr, vd, vr, vd, vr, ?, ?")
> + [(set (match_operand:VWEXTI 0 "register_operand" "=vr, vr")

Just noticed, not a problem of the revert but wasn't that wrong before
without the early-clobber?
vwsll.vx  v0, v0, a0 for LMUL = 2 would be allowed but should not?
We probably don't have tests for this, in particular runtime?

Regards
 Robin


Re: [PATCH] RISC-V: Add xfail test case for wv insn highest overlap

2024-04-20 Thread Robin Dapp
LGTM.

Regards
 Robin



Re: [PATCH v1] RISC-V: Revert RVV wv instructions overlap and xfail tests

2024-04-19 Thread Robin Dapp
Hi Pan,

> The RVV register overlap requires both the dest, and src operands.
> Thus the rigister filter in constraint cannot cover the fully sematics
> of the vector register overlap.

I'm not sure I'm following.  Did we miss something that should have been
covered?  Like only an overlap on the srcs but not the dest?
Are there testcases that fail?  If so we should definitely have one.

If something is broken then indeed we should revert it.

But...

> Thus, revert these overlap patches list and xfail the related test
> cases.  This patch would like to revert *b3b2799b872*, and the full
> picture of related series are listed as below.

... why not just revert everything and xfail all the tests in a
follow up?  Your patch is essentially a revert but doesn't look like
it.  I'd rather we let a revert be a revert and adjust the tests
separately so it becomes clear. 

Regards
 Robin



[gcc r14-9972] RISC-V: Add VLS to mask vec_extract [PR114668].

2024-04-15 Thread Robin Dapp via Gcc-cvs
https://gcc.gnu.org/g:02cc8f3e68f9af96d484d9946ceaa9e3eed38151

commit r14-9972-g02cc8f3e68f9af96d484d9946ceaa9e3eed38151
Author: Robin Dapp 
Date:   Mon Apr 15 12:44:56 2024 +0200

RISC-V: Add VLS to mask vec_extract [PR114668].

This adds the missing VLS modes to the mask extract expanders.

gcc/ChangeLog:

PR target/114668

* config/riscv/autovec.md: Add VLS.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr114668.c: New test.

Diff:
---
 gcc/config/riscv/autovec.md|  4 +--
 .../gcc.target/riscv/rvv/autovec/pr114668.c| 35 ++
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index 3b32369f68c..aa1ae0fe075 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1427,7 +1427,7 @@
 (define_expand "vec_extractqi"
   [(set (match_operand:QI0 "register_operand")
  (vec_select:QI
-   (match_operand:VB 1 "register_operand")
+   (match_operand:VB_VLS 1 "register_operand")
(parallel
 [(match_operand  2 "nonmemory_operand")])))]
   "TARGET_VECTOR"
@@ -1453,7 +1453,7 @@
 (define_expand "vec_extractbi"
   [(set (match_operand:QI0 "register_operand")
  (vec_select:QI
-   (match_operand:VB 1 "register_operand")
+   (match_operand:VB_VLS 1 "register_operand")
(parallel
 [(match_operand  2 "nonmemory_operand")])))]
   "TARGET_VECTOR"
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114668.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114668.c
new file mode 100644
index 000..3a13c3c0012
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114668.c
@@ -0,0 +1,35 @@
+/* { dg-do run } */
+/* { dg-require-effective-target riscv_v } */
+/* { dg-options { -O3 -fno-vect-cost-model -march=rv64gcv -mabi=lp64d  } } */
+
+char a;
+int b;
+short e[14];
+char f[4][12544];
+_Bool c[4][5];
+
+__attribute__ ((noipa))
+void foo (int a)
+{
+  if (a != 1)
+__builtin_abort ();
+}
+
+int main ()
+{
+  for (int i = 0; i < 4; ++i)
+for (int l = 0; l < 15; ++l)
+  for (int m = 0; m < 15; ++m)
+   f[i][l * m] = 3;
+  for (int j = 0; j < 4; j += 1)
+for (int k = 3; k < 13; k += 3)
+  for (_Bool l = 0; l < 1; l = 1)
+   for (int m = 0; m < 4; m += 1)
+ {
+   a = 0;
+   b -= e[k];
+   c[j][m] = f[j][6];
+ }
+  for (long i = 2; i < 4; ++i)
+foo (c[3][3]);
+}


[PATCH] RISC-V: Add VLS to mask vec_extract [PR114668].

2024-04-15 Thread Robin Dapp
Hi,

this adds the missing VLS modes to the mask extract expanders.
I found a dump scan difficult to create reliably so I just
kept the PR's run test case.

Regtested on rv64gcv. 

Regards
 Robin

gcc/ChangeLog:

PR target/114668

* config/riscv/autovec.md: Add VLS.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr114668.c: New test.
---
 gcc/config/riscv/autovec.md   |  4 +--
 .../gcc.target/riscv/rvv/autovec/pr114668.c   | 35 +++
 2 files changed, 37 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114668.c

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index 3b32369f68c..aa1ae0fe075 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1427,7 +1427,7 @@ (define_expand "vec_extract"
 (define_expand "vec_extractqi"
   [(set (match_operand:QI0 "register_operand")
  (vec_select:QI
-   (match_operand:VB 1 "register_operand")
+   (match_operand:VB_VLS 1 "register_operand")
(parallel
 [(match_operand  2 "nonmemory_operand")])))]
   "TARGET_VECTOR"
@@ -1453,7 +1453,7 @@ (define_expand "vec_extractqi"
 (define_expand "vec_extractbi"
   [(set (match_operand:QI0 "register_operand")
  (vec_select:QI
-   (match_operand:VB 1 "register_operand")
+   (match_operand:VB_VLS 1 "register_operand")
(parallel
 [(match_operand  2 "nonmemory_operand")])))]
   "TARGET_VECTOR"
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114668.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114668.c
new file mode 100644
index 000..3a13c3c0012
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114668.c
@@ -0,0 +1,35 @@
+/* { dg-do run } */
+/* { dg-require-effective-target riscv_v } */
+/* { dg-options { -O3 -fno-vect-cost-model -march=rv64gcv -mabi=lp64d  } } */
+
+char a;
+int b;
+short e[14];
+char f[4][12544];
+_Bool c[4][5];
+
+__attribute__ ((noipa))
+void foo (int a)
+{
+  if (a != 1)
+__builtin_abort ();
+}
+
+int main ()
+{
+  for (int i = 0; i < 4; ++i)
+for (int l = 0; l < 15; ++l)
+  for (int m = 0; m < 15; ++m)
+   f[i][l * m] = 3;
+  for (int j = 0; j < 4; j += 1)
+for (int k = 3; k < 13; k += 3)
+  for (_Bool l = 0; l < 1; l = 1)
+   for (int m = 0; m < 4; m += 1)
+ {
+   a = 0;
+   b -= e[k];
+   c[j][m] = f[j][6];
+ }
+  for (long i = 2; i < 4; ++i)
+foo (c[3][3]);
+}
-- 
2.44.0


Re: [PATCH] RISC-V: Add initial cost handling for segment loads/stores.

2024-03-25 Thread Robin Dapp
> So where do we stand with this?  Juzhe asked it to be rebased, but I
> don't see a rebased version in my inbox and I don't see anything that
> looks like this on the trunk.

I missed this one and figured as we're pretty late in the cycle it can
wait until GCC 15.  Therefore let's call it "deferred".

Regards
 Robin


Re: [PATCH v2] RISC-V: Introduce option -mrvv-max-lmul for RVV autovec

2024-03-18 Thread Robin Dapp
LGTM as well.

Regards
 Robin



Re: [PATCH] RISC-V: Introduce option -mrvv-autovec-max-lmul for RVV autovec

2024-03-14 Thread Robin Dapp
Should it really be called autovec-max-lmul?  We also use TARGET_MAX_LMUL
for builtins etc.  Or are we just following LLVM's naming here?
Isn't -mrvv-max-lmul sufficient?

> PR target/112648 

This PR is not really resolved or affected by the patch if I'm not
mistaken.  We still have code paths that will generate a larger LMUL
(also in vsetvl last I checked, but that was a while ago).

Regards
 Robin


[gcc r14-9366] vect: Do not peel epilogue for partial vectors.

2024-03-07 Thread Robin Dapp via Gcc-cvs
https://gcc.gnu.org/g:226043a4d8fb23c7fe7bf16e485b3cfaa094db21

commit r14-9366-g226043a4d8fb23c7fe7bf16e485b3cfaa094db21
Author: Robin Dapp 
Date:   Wed Mar 6 16:54:35 2024 +0100

vect: Do not peel epilogue for partial vectors.

r14-7036-gcbf569486b2dec added an epilogue vectorization guard for early
break but PR114196 shows that we also run into the problem without early
break.  Therefore merge the condition into the topmost vectorization
guard.

gcc/ChangeLog:

PR middle-end/114196

* tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Merge
vectorization guards.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr114196.c: New test.
* gcc.target/riscv/rvv/autovec/pr114196.c: New test.

Diff:
---
 gcc/testsuite/gcc.target/aarch64/pr114196.c| 19 ++
 .../gcc.target/riscv/rvv/autovec/pr114196.c| 19 ++
 gcc/tree-vect-loop-manip.cc| 30 +-
 3 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/pr114196.c 
b/gcc/testsuite/gcc.target/aarch64/pr114196.c
new file mode 100644
index 000..15e4b0e31b8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr114196.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options { -O3 -fno-vect-cost-model -march=armv9-a 
-msve-vector-bits=256 } } */
+
+unsigned a;
+int b;
+long *c;
+
+int
+main ()
+{
+  for (int d = 0; d < 22; d += 4) {
+  b = ({
+   int e = c[d];
+   e;
+   })
+  ? 0 : -c[d];
+  a *= 3;
+  }
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c
new file mode 100644
index 000..7ba9cbbed70
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options { -O3 -fno-vect-cost-model -march=rv64gcv_zvl256b -mabi=lp64d 
-mrvv-vector-bits=zvl } } */
+
+unsigned a;
+int b;
+long *c;
+
+int
+main ()
+{
+  for (int d = 0; d < 22; d += 4) {
+  b = ({
+   int e = c[d];
+   e;
+   })
+  ? 0 : -c[d];
+  a *= 3;
+  }
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index f72da915103..56a6d8e4a8d 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -2129,16 +2129,19 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
  For mult, don't known how to generate
  init_expr * pow (step, niters) for variable niters.
  For neg, it should be ok, since niters of vectorized main loop
- will always be multiple of 2.  */
-  if ((!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-   || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
+ will always be multiple of 2.
+ See also PR113163 and PR114196.  */
+  if ((!LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
+   || LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+   || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
   && induction_type != vect_step_op_neg)
 {
   if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "Peeling for epilogue is not supported"
 " for nonlinear induction except neg"
-" when iteration count is unknown.\n");
+" when iteration count is unknown or"
+" when using partial vectorization.\n");
   return false;
 }
 
@@ -2178,25 +2181,6 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
   return false;
 }
 
-  /* We can't support partial vectors and early breaks with an induction
- type other than add or neg since we require the epilog and can't
- perform the peeling.  The below condition mirrors that of
- vect_gen_vector_loop_niters  where niters_vector_mult_vf_var then sets
- step_vector to VF rather than 1.  This is what creates the nonlinear
- IV.  PR113163.  */
-  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
-  && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
-  && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
-  && induction_type != vect_step_op_neg)
-{
-  if (dump_enabled_p ())
-   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-"Peeling for epilogue is not supported"
-" for nonlinear induction except neg"
-" when VF is known and early breaks.\n");
-  return false;
-}
-
   return true;
 }


Re: [PATCH] vect: Do not peel epilogue for partial vectors [PR114196].

2024-03-07 Thread Robin Dapp
Attached v2 combines the checks.

Bootstrapped and regtested on x86 an power10, aarch64 still running.
Regtested on riscv64.

Regards
 Robin


Subject: [PATCH v2] vect: Do not peel epilogue for partial vectors.

r14-7036-gcbf569486b2dec added an epilogue vectorization guard for early
break but PR114196 shows that we also run into the problem without early
break.  Therefore merge the condition into the topmost vectorization
guard.

gcc/ChangeLog:

PR middle-end/114196

* tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Merge
vectorization guards.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr114196.c: New test.
* gcc.target/riscv/rvv/autovec/pr114196.c: New test.
---
 gcc/testsuite/gcc.target/aarch64/pr114196.c   | 19 
 .../gcc.target/riscv/rvv/autovec/pr114196.c   | 19 
 gcc/tree-vect-loop-manip.cc   | 30 +--
 3 files changed, 45 insertions(+), 23 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr114196.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c

diff --git a/gcc/testsuite/gcc.target/aarch64/pr114196.c 
b/gcc/testsuite/gcc.target/aarch64/pr114196.c
new file mode 100644
index 000..15e4b0e31b8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr114196.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options { -O3 -fno-vect-cost-model -march=armv9-a 
-msve-vector-bits=256 } } */
+
+unsigned a;
+int b;
+long *c;
+
+int
+main ()
+{
+  for (int d = 0; d < 22; d += 4) {
+  b = ({
+   int e = c[d];
+   e;
+   })
+  ? 0 : -c[d];
+  a *= 3;
+  }
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c
new file mode 100644
index 000..7ba9cbbed70
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options { -O3 -fno-vect-cost-model -march=rv64gcv_zvl256b -mabi=lp64d 
-mrvv-vector-bits=zvl } } */
+
+unsigned a;
+int b;
+long *c;
+
+int
+main ()
+{
+  for (int d = 0; d < 22; d += 4) {
+  b = ({
+   int e = c[d];
+   e;
+   })
+  ? 0 : -c[d];
+  a *= 3;
+  }
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index f72da915103..56a6d8e4a8d 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -2129,16 +2129,19 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
  For mult, don't known how to generate
  init_expr * pow (step, niters) for variable niters.
  For neg, it should be ok, since niters of vectorized main loop
- will always be multiple of 2.  */
-  if ((!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-   || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
+ will always be multiple of 2.
+ See also PR113163 and PR114196.  */
+  if ((!LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
+   || LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+   || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
   && induction_type != vect_step_op_neg)
 {
   if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "Peeling for epilogue is not supported"
 " for nonlinear induction except neg"
-" when iteration count is unknown.\n");
+" when iteration count is unknown or"
+" when using partial vectorization.\n");
   return false;
 }
 
@@ -2178,25 +2181,6 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
   return false;
 }
 
-  /* We can't support partial vectors and early breaks with an induction
- type other than add or neg since we require the epilog and can't
- perform the peeling.  The below condition mirrors that of
- vect_gen_vector_loop_niters  where niters_vector_mult_vf_var then sets
- step_vector to VF rather than 1.  This is what creates the nonlinear
- IV.  PR113163.  */
-  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
-  && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
-  && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
-  && induction_type != vect_step_op_neg)
-{
-  if (dump_enabled_p ())
-   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-"Peeling for epilogue is not supported"
-" for nonlinear induction except neg"
-" when VF is known and early breaks.\n");
-  return false;
-}
-
   return true;
 }
 
-- 
2.43.2



Re: [PATCH] vect: Do not peel epilogue for partial vectors [PR114196].

2024-03-07 Thread Robin Dapp
> r14-7036-gcbf569486b2dec added an epilogue vectorization guard for early
> break but PR114196 shows that we also run into the problem without early
> break.  Therefore remove early break from the conditions.

Forgot:

Bootstrapped and regtested on x86, aarch64 and power10.  Regtested on
riscv64.

Regards
 Robin


[PATCH] vect: Do not peel epilogue for partial vectors [PR114196].

2024-03-06 Thread Robin Dapp
Hi,

r14-7036-gcbf569486b2dec added an epilogue vectorization guard for early
break but PR114196 shows that we also run into the problem without early
break.  Therefore remove early break from the conditions.

gcc/ChangeLog:

PR middle-end/114196

* tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Remove
early break check from guards.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr114196.c: New test.
* gcc.target/riscv/rvv/autovec/pr114196.c: New test.
---
 gcc/testsuite/gcc.target/aarch64/pr114196.c   | 19 +++
 .../gcc.target/riscv/rvv/autovec/pr114196.c   | 19 +++
 gcc/tree-vect-loop-manip.cc   |  6 +++---
 3 files changed, 41 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr114196.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c

diff --git a/gcc/testsuite/gcc.target/aarch64/pr114196.c 
b/gcc/testsuite/gcc.target/aarch64/pr114196.c
new file mode 100644
index 000..15e4b0e31b8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr114196.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options { -O3 -fno-vect-cost-model -march=armv9-a 
-msve-vector-bits=256 } } */
+
+unsigned a;
+int b;
+long *c;
+
+int
+main ()
+{
+  for (int d = 0; d < 22; d += 4) {
+  b = ({
+   int e = c[d];
+   e;
+   })
+  ? 0 : -c[d];
+  a *= 3;
+  }
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c
new file mode 100644
index 000..7ba9cbbed70
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options { -O3 -fno-vect-cost-model -march=rv64gcv_zvl256b -mabi=lp64d 
-mrvv-vector-bits=zvl } } */
+
+unsigned a;
+int b;
+long *c;
+
+int
+main ()
+{
+  for (int d = 0; d < 22; d += 4) {
+  b = ({
+   int e = c[d];
+   e;
+   })
+  ? 0 : -c[d];
+  a *= 3;
+  }
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index f72da915103..c3cd20eef70 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -2183,9 +2183,9 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
  perform the peeling.  The below condition mirrors that of
  vect_gen_vector_loop_niters  where niters_vector_mult_vf_var then sets
  step_vector to VF rather than 1.  This is what creates the nonlinear
- IV.  PR113163.  */
-  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
-  && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
+ IV.  PR113163.
+ This also happens without early breaks, see PR114196.  */
+  if (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
   && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
   && induction_type != vect_step_op_neg)
 {
-- 
2.43.2


[gcc r14-9345] RISC-V: Use vmv1r.v instead of vmv.v.v for fma output reloads [PR114200].

2024-03-06 Thread Robin Dapp via Gcc-cvs
https://gcc.gnu.org/g:59554a50be8ebbd52e8a6348a92110af182e1874

commit r14-9345-g59554a50be8ebbd52e8a6348a92110af182e1874
Author: Robin Dapp 
Date:   Wed Mar 6 12:15:40 2024 +0100

RISC-V: Use vmv1r.v instead of vmv.v.v for fma output reloads [PR114200].

Three-operand instructions like vmacc are modeled with an implicit
output reload when the output does not match one of the operands.  For
this we use vmv.v.v which is subject to length masking.

In a situation where the current vl is less than the full vlenb
and the fma's result value is used as input for a vector reduction
(which is never length masked) we effectively only reduce vl
elements.  The masked-out elements are relevant for the
reduction, though, leading to a wrong result.

This patch replaces the vmv reloads by full-register reloads.

gcc/ChangeLog:

PR target/114200
PR target/114202

* config/riscv/vector.md: Use vmv[1248]r.v instead of vmv.v.v.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr114200.c: New test.
* gcc.target/riscv/rvv/autovec/pr114202.c: New test.

Diff:
---
 gcc/config/riscv/vector.md | 96 +++---
 .../gcc.target/riscv/rvv/autovec/pr114200.c| 18 
 .../gcc.target/riscv/rvv/autovec/pr114202.c| 20 +
 3 files changed, 86 insertions(+), 48 deletions(-)

diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index f89f9c2fa86..8b1c24c5d79 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -5351,10 +5351,10 @@
   "@
vmadd.vv\t%0,%4,%5%p1
vmacc.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%4\;vmacc.vv\t%0,%3,%4%p1
+   vmv%m4r.v\t%0,%4\;vmacc.vv\t%0,%3,%4%p1
vmadd.vv\t%0,%4,%5%p1
vmacc.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%5\;vmacc.vv\t%0,%3,%4%p1"
+   vmv%m5r.v\t%0,%5\;vmacc.vv\t%0,%3,%4%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")])
 
@@ -5378,9 +5378,9 @@
   "TARGET_VECTOR"
   "@
vmadd.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%2\;vmadd.vv\t%0,%3,%4%p1
+   vmv%m2r.v\t%0,%2\;vmadd.vv\t%0,%3,%4%p1
vmadd.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%2\;vmadd.vv\t%0,%3,%4%p1"
+   vmv%m2r.v\t%0,%2\;vmadd.vv\t%0,%3,%4%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "2")
@@ -5409,9 +5409,9 @@
   "TARGET_VECTOR"
   "@
vmacc.vv\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vv\t%0,%2,%3%p1
+   vmv%m4r.v\t%0,%4;vmacc.vv\t%0,%2,%3%p1
vmacc.vv\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vv\t%0,%2,%3%p1"
+   vmv%m4r.v\t%0,%4\;vmacc.vv\t%0,%2,%3%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "4")
@@ -5462,9 +5462,9 @@
   "TARGET_VECTOR"
   "@
vmadd.vx\t%0,%2,%4%p1
-   vmv.v.v\t%0,%3\;vmadd.vx\t%0,%2,%4%p1
+   vmv%m3r.v\t%0,%3\;vmadd.vx\t%0,%2,%4%p1
vmadd.vx\t%0,%2,%4%p1
-   vmv.v.v\t%0,%3\;vmadd.vx\t%0,%2,%4%p1"
+   vmv%m3r.v\t%0,%3\;vmadd.vx\t%0,%2,%4%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "3")
@@ -5494,9 +5494,9 @@
   "TARGET_VECTOR"
   "@
vmacc.vx\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1
+   vmv%m4r.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1
vmacc.vx\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1"
+   vmv%m4r.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "4")
@@ -5562,9 +5562,9 @@
   "TARGET_VECTOR && !TARGET_64BIT"
   "@
vmadd.vx\t%0,%2,%4%p1
-   vmv.v.v\t%0,%2\;vmadd.vx\t%0,%2,%4%p1
+   vmv%m2r.v\t%0,%2\;vmadd.vx\t%0,%2,%4%p1
vmadd.vx\t%0,%2,%4%p1
-   vmv.v.v\t%0,%2\;vmadd.vx\t%0,%2,%4%p1"
+   vmv%m2r.v\t%0,%2\;vmadd.vx\t%0,%2,%4%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "3")
@@ -5595,9 +5595,9 @@
   "TARGET_VECTOR && !TARGET_64BIT"
   "@
vmacc.vx\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1
+   vmv%m4r.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1
vmacc.vx\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1"
+   vmv%m4r.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "4")
@@ -5649,10 +5649,10 @@
   "@
vnmsub.vv\t%0,%4,%5%p1
vnmsac.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%3\;vnmsub.vv\t%0,%4,%5%p1
+   vmv%m3r.v\t%0,%3\;vnmsub.vv\t%0,%4,%5%p1
vnmsu

[gcc r14-9344] RISC-V: Adjust vec unit-stride load/store costs.

2024-03-06 Thread Robin Dapp via Gcc-cvs
https://gcc.gnu.org/g:9ae83078fe45d093bbaa02b8348f2407fe0c62d6

commit r14-9344-g9ae83078fe45d093bbaa02b8348f2407fe0c62d6
Author: Robin Dapp 
Date:   Mon Jan 15 17:34:58 2024 +0100

RISC-V: Adjust vec unit-stride load/store costs.

Scalar loads provide offset addressing while unit-stride vector
instructions cannot.  The offset must be loaded into a general-purpose
register before it can be used.  In order to account for this, this
patch adds an address arithmetic heuristic that keeps track of data
reference operands.  If we haven't seen the operand before we add the
cost of a scalar statement.

This helps to get rid of an lbm regression when vectorizing (roughly
0.5% fewer dynamic instructions).  gcc5 improves by 0.2% and deepsjeng
by 0.25%.  wrf and nab degrade by 0.1%.  This is because before we now
adjust the cost of SLP as well as loop-vectorized instructions whereas
we would only adjust loop-vectorized instructions before.
Considering higher scalar_to_vec costs (3 vs 1) for all vectorization
types causes some snippets not to get vectorized anymore.  Given these
costs the decision looks correct but appears worse when just counting
dynamic instructions.

In total SPECint 2017 has 4 bln dynamic instructions less and SPECfp 0.7
bln.

gcc/ChangeLog:

* config/riscv/riscv-vector-costs.cc (adjust_stmt_cost): Move...
(costs::adjust_stmt_cost): ... to here and add vec_load/vec_store
offset handling.
(costs::add_stmt_cost): Also adjust cost for statements without
stmt_info.
* config/riscv/riscv-vector-costs.h: Define zero constant.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/costmodel/riscv/rvv/vse-slp-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vse-slp-2.c: New test.

Diff:
---
 gcc/config/riscv/riscv-vector-costs.cc | 86 +++---
 gcc/config/riscv/riscv-vector-costs.h  | 10 +++
 .../gcc.dg/vect/costmodel/riscv/rvv/vse-slp-1.c| 51 +
 .../gcc.dg/vect/costmodel/riscv/rvv/vse-slp-2.c| 51 +
 4 files changed, 188 insertions(+), 10 deletions(-)

diff --git a/gcc/config/riscv/riscv-vector-costs.cc 
b/gcc/config/riscv/riscv-vector-costs.cc
index 7c9840df4e9..adf9c197df5 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -42,6 +42,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "backend.h"
 #include "tree-data-ref.h"
 #include "tree-ssa-loop-niter.h"
+#include "tree-hash-traits.h"
 
 /* This file should be included last.  */
 #include "riscv-vector-costs.h"
@@ -1047,18 +1048,81 @@ costs::better_main_loop_than_p (const vector_costs 
*uncast_other) const
top of riscv_builtin_vectorization_cost handling which doesn't have any
information on statement operation codes etc.  */
 
-static unsigned
-adjust_stmt_cost (enum vect_cost_for_stmt kind, tree vectype, int stmt_cost)
+unsigned
+costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, loop_vec_info loop,
+stmt_vec_info stmt_info,
+slp_tree, tree vectype, int stmt_cost)
 {
   const cpu_vector_cost *costs = get_vector_costs ();
   switch (kind)
 {
 case scalar_to_vec:
-  return stmt_cost += (FLOAT_TYPE_P (vectype) ? costs->regmove->FR2VR
- : costs->regmove->GR2VR);
+  stmt_cost += (FLOAT_TYPE_P (vectype) ? costs->regmove->FR2VR
+   : costs->regmove->GR2VR);
+  break;
 case vec_to_scalar:
-  return stmt_cost += (FLOAT_TYPE_P (vectype) ? costs->regmove->VR2FR
- : costs->regmove->VR2GR);
+  stmt_cost += (FLOAT_TYPE_P (vectype) ? costs->regmove->VR2FR
+   : costs->regmove->VR2GR);
+  break;
+case vector_load:
+case vector_store:
+   {
+ /* Unit-stride vector loads and stores do not have offset addressing
+as opposed to scalar loads and stores.
+If the address depends on a variable we need an additional
+add/sub for each load/store in the worst case.  */
+ if (stmt_info && stmt_info->stmt)
+   {
+ data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+ class loop *father = stmt_info->stmt->bb->loop_father;
+ if (!loop && father && !father->inner && father->superloops)
+   {
+ tree ref;
+ if (TREE_CODE (dr->ref) != MEM_REF
+ || !(ref = TREE_OPERAND (dr->ref, 0))
+ || TREE_CODE (ref) != SSA_NAME)
+   break;
+
+ if (SSA_NAME_IS_DEFAULT_DEF (ref))

[PATCH] RISC-V: Use vmv1r.v instead of vmv.v.v for fma output reloads [PR114200].

2024-03-06 Thread Robin Dapp
Hi,

three-operand instructions like vmacc are modeled with an implicit
output reload when the output does not match one of the operands.  For
this we use vmv.v.v which is subject to length masking.

In a situation where the current vl is less than the full vlenb
and the fma's result value is used as input for a vector reduction
(which is never length masked) we effectively only reduce vl
elements.  The masked-out elements are relevant for the
reduction, though, leading to a wrong result.

This patch replaces the vmv reloads by full-register reloads.

Regtested on rv64, rv32 is running.

Regards
 Robin

gcc/ChangeLog:

PR target/114200
PR target/114202

* config/riscv/vector.md: Use vmv[1248]r.v instead of vmv.v.v.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr114200.c: New test.
* gcc.target/riscv/rvv/autovec/pr114202.c: New test.
---
 gcc/config/riscv/vector.md| 96 +--
 .../gcc.target/riscv/rvv/autovec/pr114200.c   | 18 
 .../gcc.target/riscv/rvv/autovec/pr114202.c   | 20 
 3 files changed, 86 insertions(+), 48 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114200.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114202.c

diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index f89f9c2fa86..8b1c24c5d79 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -5351,10 +5351,10 @@ (define_insn "*pred_mul_plus_undef"
   "@
vmadd.vv\t%0,%4,%5%p1
vmacc.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%4\;vmacc.vv\t%0,%3,%4%p1
+   vmv%m4r.v\t%0,%4\;vmacc.vv\t%0,%3,%4%p1
vmadd.vv\t%0,%4,%5%p1
vmacc.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%5\;vmacc.vv\t%0,%3,%4%p1"
+   vmv%m5r.v\t%0,%5\;vmacc.vv\t%0,%3,%4%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")])
 
@@ -5378,9 +5378,9 @@ (define_insn "*pred_madd"
   "TARGET_VECTOR"
   "@
vmadd.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%2\;vmadd.vv\t%0,%3,%4%p1
+   vmv%m2r.v\t%0,%2\;vmadd.vv\t%0,%3,%4%p1
vmadd.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%2\;vmadd.vv\t%0,%3,%4%p1"
+   vmv%m2r.v\t%0,%2\;vmadd.vv\t%0,%3,%4%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "2")
@@ -5409,9 +5409,9 @@ (define_insn "*pred_macc"
   "TARGET_VECTOR"
   "@
vmacc.vv\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vv\t%0,%2,%3%p1
+   vmv%m4r.v\t%0,%4;vmacc.vv\t%0,%2,%3%p1
vmacc.vv\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vv\t%0,%2,%3%p1"
+   vmv%m4r.v\t%0,%4\;vmacc.vv\t%0,%2,%3%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "4")
@@ -5462,9 +5462,9 @@ (define_insn "*pred_madd_scalar"
   "TARGET_VECTOR"
   "@
vmadd.vx\t%0,%2,%4%p1
-   vmv.v.v\t%0,%3\;vmadd.vx\t%0,%2,%4%p1
+   vmv%m3r.v\t%0,%3\;vmadd.vx\t%0,%2,%4%p1
vmadd.vx\t%0,%2,%4%p1
-   vmv.v.v\t%0,%3\;vmadd.vx\t%0,%2,%4%p1"
+   vmv%m3r.v\t%0,%3\;vmadd.vx\t%0,%2,%4%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "3")
@@ -5494,9 +5494,9 @@ (define_insn "*pred_macc_scalar"
   "TARGET_VECTOR"
   "@
vmacc.vx\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1
+   vmv%m4r.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1
vmacc.vx\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1"
+   vmv%m4r.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "4")
@@ -5562,9 +5562,9 @@ (define_insn "*pred_madd_extended_scalar"
   "TARGET_VECTOR && !TARGET_64BIT"
   "@
vmadd.vx\t%0,%2,%4%p1
-   vmv.v.v\t%0,%2\;vmadd.vx\t%0,%2,%4%p1
+   vmv%m2r.v\t%0,%2\;vmadd.vx\t%0,%2,%4%p1
vmadd.vx\t%0,%2,%4%p1
-   vmv.v.v\t%0,%2\;vmadd.vx\t%0,%2,%4%p1"
+   vmv%m2r.v\t%0,%2\;vmadd.vx\t%0,%2,%4%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "3")
@@ -5595,9 +5595,9 @@ (define_insn "*pred_macc_extended_scalar"
   "TARGET_VECTOR && !TARGET_64BIT"
   "@
vmacc.vx\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1
+   vmv%m4r.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1
vmacc.vx\t%0,%2,%3%p1
-   vmv.v.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1"
+   vmv%m4r.v\t%0,%4\;vmacc.vx\t%0,%2,%3%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")
(set_attr "merge_op_idx" "4")
@@ -5649,10 +5649,10 @@ (define_insn "*pred_minus_mul_undef"
   "@
vnmsub.vv\t%0,%4,%5%p1
vnmsac.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%3\;vnmsub.vv\t%0,%4,%5%p1
+   vmv%m3r.v\t%0,%3\;vnmsub.vv\t%0,%4,%5%p1
vnmsub.vv\t%0,%4,%5%p1
vnmsac.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%3\;vnmsub.vv\t%0,%4,%5%p1"
+   vmv%m3r.v\t%0,%3\;vnmsub.vv\t%0,%4,%5%p1"
   [(set_attr "type" "vimuladd")
(set_attr "mode" "")])
 
@@ -5676,9 +5676,9 @@ (define_insn "*pred_nmsub"
   "TARGET_VECTOR"
   "@
vnmsub.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%2\;vnmsub.vv\t%0,%3,%4%p1
+   vmv%m2r.v\t%0,%2\;vnmsub.vv\t%0,%3,%4%p1
vnmsub.vv\t%0,%3,%4%p1
-   vmv.v.v\t%0,%2\;vnmsub.vv\t%0,%3,%4%p1"
+   

Re: [PATCH] RISC-V: Add initial cost handling for segment loads/stores.

2024-03-01 Thread Robin Dapp
> +  /* Segment load/store permute cost.  */
> +  const int segment_permute_2;
> +  const int segment_permute_4;
> +  const int segment_permute_8;
> 
> Why do we only have 2/4/8, I think we should have 2/3/4/5/6/7/8

No idea why I posted that (wrong) version, I used it for
some testing locally.  Attached is the proper version, still
called it v3...

Regards
 Robin

Subject: [PATCH v3] RISC-V: Add initial cost handling for segment
 loads/stores.

This patch makes segment loads and stores more expensive.  It adds
segment_permute_2 as well as 3 to 8 cost fields to the common vector
costs and adds handling to adjust_stmt_cost.

gcc/ChangeLog:

* config/riscv/riscv-protos.h (struct common_vector_cost): Add
segment_permute cost.
* config/riscv/riscv-vector-costs.cc (costs::adjust_stmt_cost):
Handle segment loads/stores.
* config/riscv/riscv.cc: Initialize segment_permute_[2-8] to 1.
---
 gcc/config/riscv/riscv-protos.h|   9 ++
 gcc/config/riscv/riscv-vector-costs.cc | 163 ++---
 gcc/config/riscv/riscv.cc  |  14 +++
 3 files changed, 144 insertions(+), 42 deletions(-)

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 80efdf2b7e5..90d1fcbb3b1 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -218,6 +218,15 @@ struct common_vector_cost
   const int gather_load_cost;
   const int scatter_store_cost;
 
+  /* Segment load/store permute cost.  */
+  const int segment_permute_2;
+  const int segment_permute_3;
+  const int segment_permute_4;
+  const int segment_permute_5;
+  const int segment_permute_6;
+  const int segment_permute_7;
+  const int segment_permute_8;
+
   /* Cost of a vector-to-scalar operation.  */
   const int vec_to_scalar_cost;
 
diff --git a/gcc/config/riscv/riscv-vector-costs.cc 
b/gcc/config/riscv/riscv-vector-costs.cc
index adf9c197df5..f4da213fe14 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -1043,6 +1043,25 @@ costs::better_main_loop_than_p (const vector_costs 
*uncast_other) const
   return vector_costs::better_main_loop_than_p (other);
 }
 
+/* Returns the group size i.e. the number of vectors to be loaded by a
+   segmented load/store instruction.  Return 0 if it is no segmented
+   load/store.  */
+static int
+segment_loadstore_group_size (enum vect_cost_for_stmt kind,
+ stmt_vec_info stmt_info)
+{
+  if (stmt_info
+  && (kind == vector_load || kind == vector_store)
+  && STMT_VINFO_DATA_REF (stmt_info))
+{
+  stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
+  if (stmt_info
+ && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_LOAD_STORE_LANES)
+   return DR_GROUP_SIZE (stmt_info);
+}
+  return 0;
+}
+
 /* Adjust vectorization cost after calling riscv_builtin_vectorization_cost.
For some statement, we would like to further fine-grain tweak the cost on
top of riscv_builtin_vectorization_cost handling which doesn't have any
@@ -1067,55 +1086,115 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, 
loop_vec_info loop,
 case vector_load:
 case vector_store:
{
- /* Unit-stride vector loads and stores do not have offset addressing
-as opposed to scalar loads and stores.
-If the address depends on a variable we need an additional
-add/sub for each load/store in the worst case.  */
- if (stmt_info && stmt_info->stmt)
+ if (stmt_info && stmt_info->stmt && STMT_VINFO_DATA_REF (stmt_info))
{
- data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
- class loop *father = stmt_info->stmt->bb->loop_father;
- if (!loop && father && !father->inner && father->superloops)
+ /* Segment loads and stores.  When the group size is > 1
+the vectorizer will add a vector load/store statement for
+each vector in the group.  Here we additionally add permute
+costs for each.  */
+ /* TODO: Indexed and ordered/unordered cost.  */
+ int group_size = segment_loadstore_group_size (kind, stmt_info);
+ if (group_size > 1)
+   {
+ switch (group_size)
+   {
+   case 2:
+ if (riscv_v_ext_vector_mode_p (loop->vector_mode))
+   stmt_cost += costs->vla->segment_permute_2;
+ else
+   stmt_cost += costs->vls->segment_permute_2;
+ break;
+   case 3:
+ if (riscv_v_ext_vector_mode_p (loop->vector_mode))
+   stmt_cost += costs->vla->segment_permute_3;
+ else
+   stmt_cost += costs->vls->segment_permute_3;
+ break;
+   case 4:
+ if 

Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-03-01 Thread Robin Dapp
> 2.  When I write if (a == 2) and if (2 == a), the results are
> same
> 
> 3.  The vec_duplicate operand  is the 5th operand in both cmp and
> eqne patterns. I think they are equal.

A comparison with a constant is always canonicalized to have the
constant second, that's why you won't see a difference.
A vector constant follows the same rule because
swap_commutative_operands_p will place it second.

I'm not sure whether we need the vec_duplicate first, honestly.
I don't remember a canonicalization rule that puts it there.
We do have something for constants and vec_merge.  As long as
things come from expand I think a constant will always be
second and this patch removes the patterns where the duplicate
is first.

Generally with fast math we could invert the condition so
a comparison should be "commutative".  With NaNs I think we
also allow it if the unordered comparisons are supported.
But I'm not even certain that we try something like that with
vectors.  On the other hand - as there is no canonical order
nothing would prevent it from being first in the future?

Will need to think about it some more (and try with NaNs) but
we could give try removing the patterns with GCC15 I suppose.

The rest should still be handled in a more generic fashion.

Regards
 Robin



Re: [PATCH 5/5] RISC-V: Support vmsxx.vx for autovec comparison of vec and imm

2024-03-01 Thread Robin Dapp
Hi Han,

in addition to what Juzhe mentioned (and that late-combine is going
to handle such cases) it should be noted that register pressure
should not be the only consideration here.  Many uarchs have a higher
latency for register-file-crossing moves.  At least without spilling
the vv variant is preferable, with spilling it very much depends.

Regards
 Robin



Re: [PATCH v2] DSE: Bugfix ICE after allow vector type in get_stored_val

2024-02-29 Thread Robin Dapp
On 2/29/24 02:38, Li, Pan2 wrote:
>> So it's going to check if V2SF can be tied to DI and V4QI with SI.  I 
>> suspect those are going to fail for RISC-V as those aren't tieable.
> 
> Yes, you are right. Different REG_CLASS are not allowed to be tieable in 
> RISC-V.
> 
> static bool
> riscv_modes_tieable_p (machine_mode mode1, machine_mode mode2)
> {
>   /* We don't allow different REG_CLASS modes tieable since it
>  will cause ICE in register allocation (RA).
>  E.g. V2SI and DI are not tieable.  */
>   if (riscv_v_ext_mode_p (mode1) != riscv_v_ext_mode_p (mode2))
> return false;
>   return (mode1 == mode2
>   || !(GET_MODE_CLASS (mode1) == MODE_FLOAT
>&& GET_MODE_CLASS (mode2) == MODE_FLOAT));
> }

Yes, but what we set tieable is e.g. V4QI and V2SF.

I suggested a target band-aid before:

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 799d7919a4a..982ca1a4250 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -8208,6 +8208,11 @@ riscv_modes_tieable_p (machine_mode mode1, machine_mode 
mode2)
  E.g. V2SI and DI are not tieable.  */
   if (riscv_v_ext_mode_p (mode1) != riscv_v_ext_mode_p (mode2))
 return false;
+  if (GET_MODE_CLASS (GET_MODE_INNER (mode1)) == MODE_INT
+  && GET_MODE_CLASS (GET_MODE_INNER (mode2)) == MODE_FLOAT
+  && GET_MODE_SIZE (GET_MODE_INNER (mode1))
+   != GET_MODE_SIZE (GET_MODE_INNER (mode2)))
+return false;
   return (mode1 == mode2
  || !(GET_MODE_CLASS (mode1) == MODE_FLOAT
   && GET_MODE_CLASS (mode2) == MODE_FLOAT));

but I don't like that as it just works around something
that I didn't even understand fully...

Regards
 Robin



Re: [PATCH v3] RISC-V: Introduce gcc option mrvv-vector-bits for RVV

2024-02-29 Thread Robin Dapp
> I think it makes more sense to remove the whole
> --param=riscv-autovec-preference since we should use 
> -fno-tree-vectorize instead of --param=riscv-autovec-preference=none
> which is more reasonable compile option for users.
> 
> --param is just a internal testing option that we added before,
> ideally we should remove them.
Yes, I agree with that.  At least the "none" part doesn't seem
necessary.

Regards
 Robin


Re: [PATCH] RISC-V: Update test expectancies with recent scheduler change

2024-02-28 Thread Robin Dapp
> I suggest specify -fno-schedule-insns to force tests assembler never
> change for any scheduling model.

We already do that and that's the point - as I mentioned before, no
scheduling is worse than default scheduling here (for some definition
of worse).  The way to reduce the number of vsetvls is to set the
load latency to a low value.

Regards
 Robin



Re: [PATCH] RISC-V: Add initial cost handling for segment loads/stores.

2024-02-27 Thread Robin Dapp
> This patch looks odd to me.
> I don't see memrefs in the trunk code.

It's on top of the vle/vse offset handling patch from
a while back that I haven't committed yet.

> Also, I prefer list all cost in cost tune info for NF = 2 ~ 8 like ARM SVE 
> does:
I don't mind having separate costs for each but I figured they
scale anyway with the number of vectors already.  Attached v2
is more similar to aarch64.

Regards
 Robin

Subject: [PATCH v2] RISC-V: Add initial cost handling for segment
 loads/stores.

This patch makes segment loads and stores more expensive.  It adds
segment_permute_2 (as well as 4 and 8) cost fields to the common vector
costs and adds handling to adjust_stmt_cost.

gcc/ChangeLog:

* config/riscv/riscv-protos.h (struct common_vector_cost): Add
segment_permute cost.
* config/riscv/riscv-vector-costs.cc (costs::adjust_stmt_cost):
Handle segment loads/stores.
* config/riscv/riscv.cc: Initialize segment_permute_[248] to 1.
---
 gcc/config/riscv/riscv-protos.h|   5 +
 gcc/config/riscv/riscv-vector-costs.cc | 139 +
 gcc/config/riscv/riscv.cc  |   6 ++
 3 files changed, 108 insertions(+), 42 deletions(-)

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 80efdf2b7e5..9b737aca1a3 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -218,6 +218,11 @@ struct common_vector_cost
   const int gather_load_cost;
   const int scatter_store_cost;
 
+  /* Segment load/store permute cost.  */
+  const int segment_permute_2;
+  const int segment_permute_4;
+  const int segment_permute_8;
+
   /* Cost of a vector-to-scalar operation.  */
   const int vec_to_scalar_cost;
 
diff --git a/gcc/config/riscv/riscv-vector-costs.cc 
b/gcc/config/riscv/riscv-vector-costs.cc
index adf9c197df5..c8178d71101 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -1043,6 +1043,25 @@ costs::better_main_loop_than_p (const vector_costs 
*uncast_other) const
   return vector_costs::better_main_loop_than_p (other);
 }
 
+/* Returns the group size i.e. the number of vectors to be loaded by a
+   segmented load/store instruction.  Return 0 if it is no segmented
+   load/store.  */
+static int
+segment_loadstore_group_size (enum vect_cost_for_stmt kind,
+ stmt_vec_info stmt_info)
+{
+  if (stmt_info
+  && (kind == vector_load || kind == vector_store)
+  && STMT_VINFO_DATA_REF (stmt_info))
+{
+  stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
+  if (stmt_info
+ && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_LOAD_STORE_LANES)
+   return DR_GROUP_SIZE (stmt_info);
+}
+  return 0;
+}
+
 /* Adjust vectorization cost after calling riscv_builtin_vectorization_cost.
For some statement, we would like to further fine-grain tweak the cost on
top of riscv_builtin_vectorization_cost handling which doesn't have any
@@ -1067,55 +1086,91 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, 
loop_vec_info loop,
 case vector_load:
 case vector_store:
{
- /* Unit-stride vector loads and stores do not have offset addressing
-as opposed to scalar loads and stores.
-If the address depends on a variable we need an additional
-add/sub for each load/store in the worst case.  */
- if (stmt_info && stmt_info->stmt)
+ if (stmt_info && stmt_info->stmt && STMT_VINFO_DATA_REF (stmt_info))
{
- data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
- class loop *father = stmt_info->stmt->bb->loop_father;
- if (!loop && father && !father->inner && father->superloops)
+ /* Segment loads and stores.  When the group size is > 1
+the vectorizer will add a vector load/store statement for
+each vector in the group.  Here we additionally add permute
+costs for each.  */
+ /* TODO: Indexed and ordered/unordered cost.  */
+ int group_size = segment_loadstore_group_size (kind, stmt_info);
+ if (group_size > 1)
+   {
+ switch (group_size)
+   {
+   case 2:
+ if (riscv_v_ext_vector_mode_p (loop->vector_mode))
+   stmt_cost += costs->vla->segment_permute_2;
+ else
+   stmt_cost += costs->vls->segment_permute_2;
+ break;
+   case 4:
+ if (riscv_v_ext_vector_mode_p (loop->vector_mode))
+   stmt_cost += costs->vla->segment_permute_4;
+ else
+   stmt_cost += costs->vls->segment_permute_4;
+ break;
+   case 8:
+ if (riscv_v_ext_vector_mode_p (loop->vector_mode))
+   

[PATCH] RISC-V: Add initial cost handling for segment loads/stores.

2024-02-26 Thread Robin Dapp
Hi,

This has been sitting on my local tree - I've been wanting to post it
for a while but somehow forgot.

This patch makes segment loads and stores more expensive.  It adds
segment_load and segment_store cost fields to the common vector costs
and adds handling to adjust_stmt_cost.  In the future we could handle
this in a more fine-grained manner but let's start somehow.

Regtested on rv64.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/riscv-protos.h (struct common_vector_cost): Add
segment_[load/store]_cost.
* config/riscv/riscv-vector-costs.cc (costs::adjust_stmt_cost):
Handle segment loads/stores.
* config/riscv/riscv.cc: Initialize segment_[load/store]_cost
to 1.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/costmodel/riscv/rvv/pr113112-4.c: Expect m4
instead of m2.
---
 gcc/config/riscv/riscv-protos.h   |   4 +
 gcc/config/riscv/riscv-vector-costs.cc| 127 --
 gcc/config/riscv/riscv.cc |   4 +
 .../vect/costmodel/riscv/rvv/pr113112-4.c |   4 +-
 4 files changed, 95 insertions(+), 44 deletions(-)

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 80efdf2b7e5..2e8ab9990a8 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -218,6 +218,10 @@ struct common_vector_cost
   const int gather_load_cost;
   const int scatter_store_cost;
 
+  /* Segment load/store cost.  */
+  const int segment_load_cost;
+  const int segment_store_cost;
+
   /* Cost of a vector-to-scalar operation.  */
   const int vec_to_scalar_cost;
 
diff --git a/gcc/config/riscv/riscv-vector-costs.cc 
b/gcc/config/riscv/riscv-vector-costs.cc
index adf9c197df5..d3c12444773 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -1043,6 +1043,24 @@ costs::better_main_loop_than_p (const vector_costs 
*uncast_other) const
   return vector_costs::better_main_loop_than_p (other);
 }
 
+/* Returns the group size i.e. the number of vectors to be loaded by a
+   segmented load/store instruction.  Return 0 if it is no segmented
+   load/store.  */
+static int
+segment_loadstore_group_size (enum vect_cost_for_stmt kind,
+ stmt_vec_info stmt_info)
+{
+  if ((kind == vector_load || kind == vector_store)
+  && STMT_VINFO_DATA_REF (stmt_info))
+{
+  stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
+  if (stmt_info
+ && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_LOAD_STORE_LANES)
+   return DR_GROUP_SIZE (stmt_info);
+}
+  return 0;
+}
+
 /* Adjust vectorization cost after calling riscv_builtin_vectorization_cost.
For some statement, we would like to further fine-grain tweak the cost on
top of riscv_builtin_vectorization_cost handling which doesn't have any
@@ -1067,55 +1085,80 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, 
loop_vec_info loop,
 case vector_load:
 case vector_store:
{
- /* Unit-stride vector loads and stores do not have offset addressing
-as opposed to scalar loads and stores.
-If the address depends on a variable we need an additional
-add/sub for each load/store in the worst case.  */
- if (stmt_info && stmt_info->stmt)
+ if (stmt_info && stmt_info->stmt && STMT_VINFO_DATA_REF (stmt_info))
{
- data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
- class loop *father = stmt_info->stmt->bb->loop_father;
- if (!loop && father && !father->inner && father->superloops)
+ int group_size;
+ if ((group_size
+  = segment_loadstore_group_size (kind, stmt_info)) > 1)
{
- tree ref;
- if (TREE_CODE (dr->ref) != MEM_REF
- || !(ref = TREE_OPERAND (dr->ref, 0))
- || TREE_CODE (ref) != SSA_NAME)
-   break;
+ /* Segment loads and stores.  When the group size is > 1
+the vectorizer will add a vector load/store statement for
+each vector in the group.  Note that STMT_COST is
+overwritten here rather than adjusted.  */
+ if (riscv_v_ext_vector_mode_p (loop->vector_mode))
+   stmt_cost
+ = (DR_IS_READ (STMT_VINFO_DATA_REF (stmt_info))
+? costs->vla->segment_load_cost
+: costs->vla->segment_store_cost);
+ else
+   stmt_cost
+ = (DR_IS_READ (STMT_VINFO_DATA_REF (stmt_info))
+? costs->vls->segment_load_cost
+: costs->vls->segment_store_cost);
+ break;
+ /* TODO: Indexed and ordered/unordered cost.  */
+   }
+ else
+   {
+ /* 

Re: [PATCH] RISC-V: Update test expectancies with recent scheduler change

2024-02-26 Thread Robin Dapp
On 2/24/24 00:10, Edwin Lu wrote:
> Given the recent change with adding the scheduler pipeline descriptions,
> many scan-dump failures emerged. Relax the expected assembler output
> conditions on the affected tests to reduce noise.

I'm not entirely sure yet about relaxing the scans like this.
There seem to be uarchs that want to minimize vsetvls under all
circumstances while others don't seem to care all that much.  We could
(not must) assume that the tests that now regress have been written
with this minimization aspect in mind and that we'd want to be sure
that we still manage to emit the minimal number of vsetvls.

Why is the new upper bound acceptable?  What if a vector_load cost
of 12 (or so) causes even more vsetvls?  The 6 in generic_ooo is more
or less arbitrary chosen.

My suggestion before was to create another sched model that has
load costs like before and run the regressing tests with that
model.  That's of course also not really ideal and actually
shoehorned a bit, in particular as no scheduling also increases
the number of vsetvls.

Juzhe: What's your intention with those tests?  I'd suppose you
want the vsetvl number to be minimal here and not higher?  Did you
plan to add a particular scheduling model or are you happy with
the default (all 1) latencies?

Regards
 Robin


Re: [PATCH] RISC-V: Fix vec_init for simple sequences [PR114028].

2024-02-23 Thread Robin Dapp
> +/* { dg-final { scan-assembler-times "vmv\.v\.i\tv\[0-9\],0" 0 } } */
> 
> I think you should use "scan-assembler-not"

Thanks, going to commit with that change.

Regards
 Robin


[PATCH] RISC-V: Fix vec_init for simple sequences [PR114028].

2024-02-22 Thread Robin Dapp
Hi,

for a vec_init (_a, _a, _a, _a) with _a of mode DImode we try to
construct a "superword" of two "_a"s.  This only works for modes < Pmode
when we can "shift and or" two halves into one Pmode register.
This patch disallows the optimization for inner_mode == Pmode and emits
a simple broadcast in such a case.

The test is not a run test because it requires vlen=256 in qemu.
I can adjust that still of course.

Regtested on rv64, rv32 still running.

Regards
 Robin

gcc/ChangeLog:

PR target/114028

* config/riscv/riscv-v.cc 
(rvv_builder::can_duplicate_repeating_sequence_p):
Return false if inner mode is already Pmode.
(rvv_builder::is_all_same_sequence): New function.
(expand_vec_init): Emit broadcast if sequence is all same.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr114028.c: New test.
---
 gcc/config/riscv/riscv-v.cc   | 25 ++-
 .../gcc.target/riscv/rvv/autovec/pr114028.c   | 25 +++
 2 files changed, 49 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114028.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 0cfbd21ce6f..29d58deb995 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -443,6 +443,7 @@ public:
   }
 
   bool can_duplicate_repeating_sequence_p ();
+  bool is_repeating_sequence ();
   rtx get_merged_repeating_sequence ();
 
   bool repeating_sequence_use_merge_profitable_p ();
@@ -483,7 +484,8 @@ rvv_builder::can_duplicate_repeating_sequence_p ()
 {
   poly_uint64 new_size = exact_div (full_nelts (), npatterns ());
   unsigned int new_inner_size = m_inner_bits_size * npatterns ();
-  if (!int_mode_for_size (new_inner_size, 0).exists (_new_inner_mode)
+  if (m_inner_mode == Pmode
+  || !int_mode_for_size (new_inner_size, 0).exists (_new_inner_mode)
   || GET_MODE_SIZE (m_new_inner_mode) > UNITS_PER_WORD
   || !get_vector_mode (m_new_inner_mode, new_size).exists (_new_mode))
 return false;
@@ -492,6 +494,18 @@ rvv_builder::can_duplicate_repeating_sequence_p ()
   return nelts_per_pattern () == 1;
 }
 
+/* Return true if the vector is a simple sequence with one pattern and all
+   elements the same.  */
+bool
+rvv_builder::is_repeating_sequence ()
+{
+  if (npatterns () > 1)
+return false;
+  if (full_nelts ().is_constant ())
+return repeating_sequence_p (0, full_nelts ().to_constant (), 1);
+  return nelts_per_pattern () == 1;
+}
+
 /* Return true if it is a repeating sequence that using
merge approach has better codegen than using default
approach (slide1down).
@@ -2544,6 +2558,15 @@ expand_vec_init (rtx target, rtx vals)
 v.quick_push (XVECEXP (vals, 0, i));
   v.finalize ();
 
+  /* If the sequence is v = { a, a, a, a } just broadcast an element.  */
+  if (v.is_repeating_sequence ())
+{
+  machine_mode mode = GET_MODE (target);
+  rtx dup = expand_vector_broadcast (mode, v.elt (0));
+  emit_move_insn (target, dup);
+  return;
+}
+
   if (nelts > 3)
 {
   /* Case 1: Convert v = { a, b, a, b } into v = { ab, ab }.  */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114028.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114028.c
new file mode 100644
index 000..a451d85e3fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114028.c
@@ -0,0 +1,25 @@
+/* { dg-do compile }  */
+/* { dg-options "-march=rv64gcv_zvl256b -O3" } */
+
+int a, d = 55003;
+long c = 0, h;
+long e = 1;
+short i;
+
+int
+main ()
+{
+  for (int g = 0; g < 16; g++)
+{
+  d |= c;
+  short l = d;
+  i = l < 0 || a >> 4 ? d : a;
+  h = i - 8L;
+  e &= h;
+}
+
+  if (e != 1)
+__builtin_abort ();
+}
+
+/* { dg-final { scan-assembler-times "vmv\.v\.i\tv\[0-9\],0" 0 } } */
-- 
2.43.2


Re: [PATCH V4 4/5] RISC-V: Quick and simple fixes to testcases that break due to reordering

2024-02-21 Thread Robin Dapp


> For calling-convention-*.c, LGTM but one nit about change log. Take
> **Update** here may make others not easy to learn what you did about
> the file. You can say similar to "Rearrange and adjust the
> asm-checker times" or likewise. Of course, you can refine the
> changelog when commit.
>> * gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c: update
> 

Yes, agreed,  changes LGTM but please refine the commit message
slightly.  The first letter should also be capitalized I believe.

The rest of the is already ACK'ed so I believe it's good to go now.
I didn't pay a lot of attention to the other commit messages.
In case they need refining you can do that still.  

Regards
 Robin



Re: [PATCH V4 1/5] RISC-V: Add non-vector types to dfa pipelines

2024-02-21 Thread Robin Dapp
OK.

Regards
 Robin



Re: [PATCH] RISC-V: Set require-effective-target rv64 for PR113742

2024-02-15 Thread Robin Dapp
> Ah oops I glanced over the /* { dg-do compile } */part. It should be
> fine to add '-march=rv64gc' instead then?

Hmm it's a bit tricky.  So generally -mcpu=sifive-p670 includes rv64
but it does not override a previously specified -march=rv32 (that might
have been added by the test harness or the test target).  It looks
like it does override a (build option and thus not directly specified
when compiling) --with-arch=rv32.

For now I'd stick with something like -march=rv64gc -mtune=sifive-p670
(but please check if the original problem does occur with this).
While you're at it you could delete the redundant '/' in the first
line.

In general it's a bit counterintuitive a test specifying a
particular CPU (that supports several extensions) might have
those overridden when e.g. testing on a rv32 target not supporting
those.  We also do not support cpu names in the march string
so there is no nice way of overriding previously specified marchs.

Kito: Any idea regarding this?  I read in your commit message that
mcpu has lower precedence than march.  Right now that allows us to
somewhat silently remove architecture options that are specified
last on the command line.

aarch64 warns in case something is in conflict, maybe we should do
that as well?

At least I find it a bit annoying that we don't have a way of
saying:
"This test always needs to be compiled with all arch features of
cpu = ..." and rather need to specify -march=rv64gcv_z..._z...

Without having this thought through, can't mcpu be of kind of
similar precedence to march and we'd let the one specified last
"win" in case of conflicts?  Possibly with an exception for
the 32/64 bit.  Does LLVM not have this problem?

Regards
 Robin



Re: [PATCH] RISC-V: Set require-effective-target rv64 for PR113742

2024-02-14 Thread Robin Dapp
On 2/14/24 20:46, Edwin Lu wrote:
> The testcase pr113742.c is failing for 32 bit targets due to the following cc1
> error:
> cc1: error: ABI requries '-march=rv64'

I think we usually just add exactly this to the test options (so
it is always run rather than just on a 64-bit target.

Regards
 Robin



[PATCH] RISC-V: Adjust vec unit-stride load/store costs.

2024-02-13 Thread Robin Dapp
Hi,

scalar loads provide offset addressing while unit-stride vector
instructions cannot.  The offset must be loaded into a general-purpose
register before it can be used.  In order to account for this, this
patch adds an address arithmetic heuristic that keeps track of data
reference operands.  If we haven't seen the operand before we add the
cost of a scalar statement.

This helps to get rid of an lbm regression when vectorizing (roughly
0.5% fewer dynamic instructions).  gcc5 improves by 0.2% and deepsjeng
by 0.25%.  wrf and nab degrade by 0.1%.  This is because before we now
adjust the cost of SLP as well as loop-vectorized instructions whereas
we would only adjust loop-vectorized instructions before.
Considering higher scalar_to_vec costs (3 vs 1) for all vectorization
types causes some snippets not to get vectorized anymore.  Given these
costs the decisions look correct but appear worse when just counting
dynamic instructions.

In total SPECint 2017 has 4 bn dynamic instructions less and SPECfp 0.7
bn less so not a whole lot.

Regtested on riscv64.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/riscv-vector-costs.cc (adjust_stmt_cost): Move...
(costs::adjust_stmt_cost): ... to here and add vec_load/vec_store
offset handling.
(costs::add_stmt_cost): Also adjust cost for statements without
stmt_info.
* config/riscv/riscv-vector-costs.h: Define zero constant.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/costmodel/riscv/rvv/vse-slp-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vse-slp-2.c: New test.
---
 gcc/config/riscv/riscv-vector-costs.cc| 86 ---
 gcc/config/riscv/riscv-vector-costs.h | 10 +++
 .../vect/costmodel/riscv/rvv/vse-slp-1.c  | 51 +++
 .../vect/costmodel/riscv/rvv/vse-slp-2.c  | 53 
 4 files changed, 190 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/vse-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/vse-slp-2.c

diff --git a/gcc/config/riscv/riscv-vector-costs.cc 
b/gcc/config/riscv/riscv-vector-costs.cc
index 7c9840df4e9..adf9c197df5 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -42,6 +42,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "backend.h"
 #include "tree-data-ref.h"
 #include "tree-ssa-loop-niter.h"
+#include "tree-hash-traits.h"
 
 /* This file should be included last.  */
 #include "riscv-vector-costs.h"
@@ -1047,18 +1048,81 @@ costs::better_main_loop_than_p (const vector_costs 
*uncast_other) const
top of riscv_builtin_vectorization_cost handling which doesn't have any
information on statement operation codes etc.  */
 
-static unsigned
-adjust_stmt_cost (enum vect_cost_for_stmt kind, tree vectype, int stmt_cost)
+unsigned
+costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, loop_vec_info loop,
+stmt_vec_info stmt_info,
+slp_tree, tree vectype, int stmt_cost)
 {
   const cpu_vector_cost *costs = get_vector_costs ();
   switch (kind)
 {
 case scalar_to_vec:
-  return stmt_cost += (FLOAT_TYPE_P (vectype) ? costs->regmove->FR2VR
- : costs->regmove->GR2VR);
+  stmt_cost += (FLOAT_TYPE_P (vectype) ? costs->regmove->FR2VR
+   : costs->regmove->GR2VR);
+  break;
 case vec_to_scalar:
-  return stmt_cost += (FLOAT_TYPE_P (vectype) ? costs->regmove->VR2FR
- : costs->regmove->VR2GR);
+  stmt_cost += (FLOAT_TYPE_P (vectype) ? costs->regmove->VR2FR
+   : costs->regmove->VR2GR);
+  break;
+case vector_load:
+case vector_store:
+   {
+ /* Unit-stride vector loads and stores do not have offset addressing
+as opposed to scalar loads and stores.
+If the address depends on a variable we need an additional
+add/sub for each load/store in the worst case.  */
+ if (stmt_info && stmt_info->stmt)
+   {
+ data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+ class loop *father = stmt_info->stmt->bb->loop_father;
+ if (!loop && father && !father->inner && father->superloops)
+   {
+ tree ref;
+ if (TREE_CODE (dr->ref) != MEM_REF
+ || !(ref = TREE_OPERAND (dr->ref, 0))
+ || TREE_CODE (ref) != SSA_NAME)
+   break;
+
+ if (SSA_NAME_IS_DEFAULT_DEF (ref))
+   break;
+
+ if (memrefs.contains ({ref, cst0}))
+   break;
+
+ memrefs.add ({ref, cst0});
+
+ /* In case we have not seen REF before and the base address
+is a pointer operation try a bit harder.  */
+ tree base = DR_BASE_ADDRESS 

Re: [PATCH v1] RISC-V: Fix misspelled term args in error_at message

2024-02-12 Thread Robin Dapp
OK.

Regards
 Robin



Re: [PATCH] RISC-V: Allow LICM hoist POLY_INT configuration code sequence

2024-02-06 Thread Robin Dapp
> The root cause is this following RTL pattern, after fwprop1:
> 
> (insn 82 78 84 9 (set (reg:DI 230)
>         (sign_extend:DI (minus:SI (subreg/s/v:SI (reg:DI 150 [ niters.10 ]) 0)
>                 (subreg:SI (reg:DI 221) 0 13 {subsi3_extended}
>      (expr_list:REG_EQUAL (sign_extend:DI (plus:SI (subreg/s/v:SI (reg:DI 150 
> [ niters.10 ]) 0)
>                 *(const_poly_int:SI [-16, -16])*))
>         (nil)))
> 
> The highlight *(const_poly_int:SI [-16, -16])*
> causes ICE.
> 
> This RTL is because:
> (insn 69 68 71 8 (set (reg:DI 221)
>         (const_poly_int:DI [16, 16])) 208 {*movdi_64bit}
>      (nil))
> (insn 82 78 84 9 (set (reg:DI 230)
>         (sign_extend:DI (minus:SI (subreg/s/v:SI (reg:DI 150 [ niters.10 ]) 0)
>                 (subreg:SI (reg:DI 221) 0 13 {subsi3_extended}            
>                               > (subreg:SI (const_poly_int:SI [-16, 
> -16])) fwprop1 add  (const_poly_int:SI [-16, -16]) reg_equal
>      (expr_list:REG_EQUAL (sign_extend:DI (plus:SI (subreg/s/v:SI (reg:DI 150 
> [ niters.10 ]) 0)
>                 (const_poly_int:SI [-16, -16])))
>         (nil)))

I'm seeing a slightly different pattern but that doesn't change
the problem.

> (set (reg:SI)  (subreg:SI (DI: poly value))) but it causes ICE that I
> mentioned above.

That's indeed a bit more idiomatic and I wouldn't oppose that.

The problem causing the ICE is that we want to simplify a PLUS
with (const_poly_int:SI [16, 16]) and (const_int 0) but the mode
is DImode.  My suspicion is that this is caused by our
addsi3_extended pattern and we fail to deduce the proper mode
for analysis.

I'm just speculating but maybe that's because we assert that a
plus is of the form simple_reg_p (op0) && CONSTANT_P (op1).
Usually, constants don't have a mode and can just be used.
poly_int_csts do have one and need to be explicitly converted
(kind of).

We can only analyze this zero_extended plus at all since Jeff
added the addsi3_extended handling for loop-iv.   Maybe we could
punt like

diff --git a/gcc/loop-iv.cc b/gcc/loop-iv.cc
index eb7e923a38b..796413c25a3 100644
--- a/gcc/loop-iv.cc
+++ b/gcc/loop-iv.cc
@@ -714,6 +714,9 @@ get_biv_step_1 (df_ref def, scalar_int_mode outer_mode, rtx 
reg,
  if (!simple_reg_p (op0) || !CONSTANT_P (op1))
return false;
 
+ if (CONST_POLY_INT_P (op1) && GET_MODE (op1) != outer_mode)
+   return false;
+

This helps for your test case but I haven't done any further
testing.  I'd think this is relatively safe because it's only
a missed analysis/optimization in the worst case.
Still, generally, I don't see a reason why we wouldn't be able
to analyze this?

Regards
 Robin



Re: [PATCH] RISC-V: Fix infinite compilation of VSETVL PASS

2024-02-06 Thread Robin Dapp
> Testing is running. Ok for trunk if I passed the testing with no
> regression ?
OK.

Regards
 Robin



Re: [PATCH] RISC-V: Remove vsetvl_pre bogus instructions in VSETVL PASS

2024-02-01 Thread Robin Dapp
> +static bool
> +vsetvl_pre_insn_p (rtx_insn *rinsn)
> +{
> +  return recog_memoized (rinsn) >= 0
> +  && get_attr_type (rinsn) == TYPE_VSETVL_PRE;
> +}

Indent looks off on my screen.  Can you check?

Apart from that LGTM (no need for v2 of course).

Regards
 Robin



Re: [PATCH V4 2/4] RISC-V: Add vector related pipelines

2024-01-31 Thread Robin Dapp
LGTM, thanks.

Regards
 Robin


Re: [PATCH] RISC-V: Support scheduling for sifive p600 series

2024-01-31 Thread Robin Dapp
> +  NULL,  /* vector cost */
> +};

Does the P600 series include a vector unit?  From what I found on
the web it looks like it.  If so I would suggest specifying at least
the default (generic) vector cost model here.  We fall back to the
default one for NULL but I find it more explicit to specify one. 

> +;; The Sifive 8 has six pipelines:

P600?  Is 8 the generation and P600 the official name?

> +(define_insn_reservation "sifive_p600_div" 33
> +  (and (eq_attr "tune" "sifive_p600")
> +   (eq_attr "type" "idiv"))
> +  "sifive_p600_M, sifive_p600_idiv*32")
> +

> +(define_insn_reservation "sifive_p600_fdiv_s" 18
> +  (and (eq_attr "tune" "sifive_p600")
> +   (eq_attr "type" "fdiv,fsqrt")
> +   (eq_attr "mode" "SF"))
> +  "sifive_p600_FM, sifive_p600_fdiv*17")
> +
> +(define_insn_reservation "sifive_p600_fdiv_d" 31
> +  (and (eq_attr "tune" "sifive_p600")
> +   (eq_attr "type" "fdiv,fsqrt")
> +   (eq_attr "mode" "DF"))
> +  "sifive_p600_FM, sifive_p600_fdiv*30")

I would suggest not to block the units for that long.  It will
needlessly increase the automata's complexity causing longer build
times.  Even if you want to keep the latency high (doubtful if
that's beneficial in terms of spilling) you could just block the
unit for maybe 3-5 cycles.  Up to you in the end, though and not
a blocker.

Regards
 Robin



[PATCH] match: Fix vcond into conditional op folding [PR113607].

2024-01-31 Thread Robin Dapp
Hi,

in PR113607 we see an invalid fold of

  _429 = .COND_SHL (mask_patt_205.47_276, vect_cst__262, vect_cst__262, { 0, 
... });
  vect_prephitmp_129.51_282 = _429;
  vect_iftmp.55_287 = VEC_COND_EXPR ;

to

  Applying pattern match.pd:9607, gimple-match-10.cc:3817
  gimple_simplified to vect_iftmp.55_287 = .COND_SHL (mask_patt_205.47_276, 
vect_cst__262, vect_cst__262, { 0, ... });

where we essentially use COND_SHL's else instead of VEC_COND_EXPR's.

This patch adjusts the corresponding match.pd pattern and makes it only
match when the else values are the same.

That, however, causes the exact test case to fail which this pattern
was introduced for.  XFAIL it for now.

Bootstrapped and regtested on x86. Regtested on riscv.  aarch64
is still running.

Regards
 Robin


gcc/ChangeLog:

PR middle-end/113607

* match.pd: Make sure else values match when folding a
vec_cond into a conditional operation.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/pre_cond_share_1.c: XFAIL.
* gcc.target/riscv/rvv/autovec/pr113607-run.c: New test.
* gcc.target/riscv/rvv/autovec/pr113607.c: New test.
---
 gcc/match.pd  |  8 +--
 .../gcc.target/aarch64/sve/pre_cond_share_1.c |  2 +-
 .../riscv/rvv/autovec/pr113607-run.c  |  4 ++
 .../gcc.target/riscv/rvv/autovec/pr113607.c   | 49 +++
 4 files changed, 58 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113607-run.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113607.c

diff --git a/gcc/match.pd b/gcc/match.pd
index e42ecaf9ec7..7c391a8fe20 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -9592,18 +9592,18 @@ and,
 
 /* Detect simplification for vector condition folding where
 
-  c = mask1 ? (masked_op mask2 a b) : b
+  c = mask1 ? (masked_op mask2 a b els) : els
 
   into
 
-  c = masked_op (mask1 & mask2) a b
+  c = masked_op (mask1 & mask2) a b els
 
   where the operation can be partially applied to one operand. */
 
 (for cond_op (COND_BINARY)
  (simplify
   (vec_cond @0
-   (cond_op:s @1 @2 @3 @4) @3)
+   (cond_op:s @1 @2 @3 @4) @4)
   (cond_op (bit_and @1 @0) @2 @3 @4)))
 
 /* And same for ternary expressions.  */
@@ -9611,7 +9611,7 @@ and,
 (for cond_op (COND_TERNARY)
  (simplify
   (vec_cond @0
-   (cond_op:s @1 @2 @3 @4 @5) @4)
+   (cond_op:s @1 @2 @3 @4 @5) @5)
   (cond_op (bit_and @1 @0) @2 @3 @4 @5)))
 
 /* For pointers @0 and @2 and nonnegative constant offset @1, look for
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pre_cond_share_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pre_cond_share_1.c
index b51d0f298ea..e4f754d739c 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/pre_cond_share_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pre_cond_share_1.c
@@ -129,4 +129,4 @@ fasten_main(size_t group, size_t ntypes, size_t nposes, 
size_t natlig, size_t na
 }
 
 /* { dg-final { scan-tree-dump-times {\.COND_MUL} 1 "optimized" } } */
-/* { dg-final { scan-tree-dump-times {\.VCOND} 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times {\.VCOND} 1 "optimized" { xfail *-*-* } } 
} */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113607-run.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113607-run.c
new file mode 100644
index 000..06074767ce5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113607-run.c
@@ -0,0 +1,4 @@
+/* { dg-do run { target { riscv_v && rv64 } } } */
+/* { dg-options "-O3 -march=rv64gcv -mabi=lp64d -fdump-tree-optimized" } */
+
+#include "pr113607.c"
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113607.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113607.c
new file mode 100644
index 000..70a93665497
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113607.c
@@ -0,0 +1,49 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=rv64gcv -mabi=lp64d -fdump-tree-optimized" } */
+
+struct {
+  signed b;
+} c, d = {6};
+
+short e, f;
+int g[1000];
+signed char h;
+int i, j;
+long k, l;
+
+long m(long n, long o) {
+  if (n < 1 && o == 0)
+return 0;
+  return n;
+}
+
+static int p() {
+  long q = 0;
+  int a = 0;
+  for (; e < 2; e += 1)
+g[e * 7 + 1] = -1;
+  for (; h < 1; h += 1) {
+k = g[8] || f;
+l = m(g[f * 7 + 1], k);
+a = l;
+j = a < 0 || g[f * 7 + 1] < 0 || g[f * 7 + 1] >= 32 ? a : a << g[f * 7 + 
1];
+if (j)
+  ++q;
+  }
+  if (q)
+c = d;
+  return i;
+}
+
+int main() {
+  p();
+  if (c.b != 6)
+__builtin_abort ();
+}
+
+/* We must not fold VEC_COND_EXPR into COND_SHL.
+   Therefore, make sure that we still have 2/4 VCOND_MASKs with real else
+   value.  */
+
+/* { dg-final { scan-tree-dump-times { = \.VCOND_MASK.\([a-z0-9\._]+, 
[a-z0-9\._\{\}, ]+, [0-9\.\{\},]+\);} 0 "optimized" } } */
+/* { dg-final { scan-tree-dump-times { = \.VCOND_MASK.\([a-z0-9\._]+, 
[a-z0-9\._\{\}, ]+, [a-z0-9\._]+\);} 4 "optimized" } } */
-- 
2.43.0


Re: [PATCH] RISC-V: Fix VSETLV PASS compile-time issue

2024-01-30 Thread Robin Dapp
LGTM.

Regards
 Robin



Re: [PATCH V3 2/4] RISC-V: Add vector related pipelines

2024-01-30 Thread Robin Dapp
> I think removing the is_inorder attribute should be ok. I added it
> because I wanted to avoid having two matching insn reservations
> defined since matching solely on the type attribute should also match
> on all subsets as well (i.e. if eventually we add an insn reservation
> checking for type "vlde" and tune "generic-ooo", any "vlde" insn
> would map to both reservations)
Ah, I see.  Yes we should prevent that from happening and in case we
have two (or more) similarly named reservations that would both match
such an attribute would make sense I guess.  My preference would just
be to not just add it yet before we know what else we'll be needing.
Chance is, not a whole lot will change until the release ;)

> For now I should just remove the is_inorder attribute. We will update
> the latencies and add new reservations after we know what they should
> be. Is that correct?

Yes, that should work.

Regards
 Robin


[PATCH] genopinit: Split init_all_optabs [PR113575]

2024-01-26 Thread Robin Dapp
Hi,

init_all_optabs initializes > 1 patterns for riscv targets.  This
leads to pathological situations in dataflow analysis (which can occur
with many adjacent stores).
To alleviate this this patch makes genopinit split the init_all_optabs
function into several init_optabs_xx functions that each initialize 1000
patterns.

With this change insn-opinit.cc's compilation time is reduced from 4+
minutes to 1:30 and memory consumption decreases from 1.2G to 630M.

Bootstrapped and regtested on x86 and aarch64 (where we do split) and
on power10 (where we don't).  Regtested on riscv.

Regards
 Robin

gcc/ChangeLog:

PR other/113575

* genopinit.cc (main): Split init_all_optabs into functions
of 1000 patterns each.
---
 gcc/genopinit.cc | 43 ++-
 1 file changed, 38 insertions(+), 5 deletions(-)

diff --git a/gcc/genopinit.cc b/gcc/genopinit.cc
index 88ccafa5b2c..d8682b2a9ad 100644
--- a/gcc/genopinit.cc
+++ b/gcc/genopinit.cc
@@ -367,11 +367,44 @@ main (int argc, const char **argv)
 fprintf (s_file, "  { %#08x, CODE_FOR_%s },\n", p->sort_num, p->name);
   fprintf (s_file, "};\n\n");
 
-  fprintf (s_file, "void\ninit_all_optabs (struct target_optabs 
*optabs)\n{\n");
-  fprintf (s_file, "  bool *ena = optabs->pat_enable;\n");
-  for (i = 0; patterns.iterate (i, ); ++i)
-fprintf (s_file, "  ena[%u] = HAVE_%s;\n", i, p->name);
-  fprintf (s_file, "}\n\n");
+  /* Some targets like riscv have a large number of patterns.  In order to
+ prevent pathological situations in dataflow analysis split the init
+ function into separate ones that initialize 1000 patterns each.  */
+
+  const int patterns_per_function = 1000;
+
+  if (patterns.length () > patterns_per_function)
+{
+  unsigned num_init_functions
+   = patterns.length () / patterns_per_function + 1;
+  for (i = 0; i < num_init_functions; i++)
+   {
+ fprintf (s_file, "static void\ninit_optabs_%02d "
+  "(struct target_optabs *optabs)\n{\n", i);
+ fprintf (s_file, "  bool *ena = optabs->pat_enable;\n");
+ unsigned start = i * patterns_per_function;
+ unsigned end = MIN (patterns.length (),
+ (i + 1) * patterns_per_function);
+ for (j = start; j < end; ++j)
+   fprintf (s_file, "  ena[%u] = HAVE_%s;\n", j, patterns[j].name);
+ fprintf (s_file, "}\n\n");
+   }
+
+  fprintf (s_file, "void\ninit_all_optabs "
+  "(struct target_optabs *optabs)\n{\n");
+  for (i = 0; i < num_init_functions; ++i)
+   fprintf (s_file, "  init_optabs_%02d (optabs);\n", i);
+  fprintf (s_file, "}\n\n");
+}
+  else
+{
+  fprintf (s_file, "void\ninit_all_optabs "
+  "(struct target_optabs *optabs)\n{\n");
+  fprintf (s_file, "  bool *ena = optabs->pat_enable;\n");
+  for (i = 0; patterns.iterate (i, ); ++i)
+   fprintf (s_file, "  ena[%u] = HAVE_%s;\n", i, p->name);
+  fprintf (s_file, "}\n\n");
+}
 
   fprintf (s_file,
   "/* Returns TRUE if the target supports any of the partial vector\n"
-- 
2.43.0


Re: [PATCH V3 4/4] RISC-V: Enable assert for insn_has_dfa_reservation

2024-01-25 Thread Robin Dapp
>/* If we ever encounter an insn without an insn reservation, trip
>   an assert so we can find and fix this problem.  */
> -#if 0
> +  if (! insn_has_dfa_reservation_p (insn)) {
> +print_rtl(stderr, insn);
> +fprintf(stderr, "%d", get_attr_type (insn));
> +  }
>gcc_assert (insn_has_dfa_reservation_p (insn));
> -#endif
>  
>return more - 1;
>  }

I was thinking about make the gcc_assert a gcc_checking_assert so,
in case we accidentally forget something at any point, it would
only gracefully degrade in a release build.  As we already have
a hard assert for the type the patch (and not many test with
enable checking anyway) this is OK IMHO.

I suppose you tested with all available -mtune options?

Regards
 Robin



Re: [PATCH V3 3/4] RISC-V: Use default cost model for insn scheduling

2024-01-25 Thread Robin Dapp
> Use default cost model scheduling on these test cases. All these tests
> introduce scan dump failures with -mtune generic-ooo. Since the vector
> cost models are the same across all three tunes, some of the tests
> in PR113249 will be fixed with this patch series.

This is OK, thanks.

> 39 additional unique testsuite failures (scan dumps) will still be present.
> I don't know how optimal the new output is compared to the old. Should I 
> update
> the testcase expected output to match the new scan dumps?

Currently, without vector op latency, the output should come close
to what's normally considered "good" (i.e. minimal number of vsetvls
and so on).  Therefore I'd suggest not to change the scan dumps to
much except when there is a real problem.  If you have a specific
example that you're unsure about we can discuss this on or off list.

Regards
 Robin



Re: [PATCH V3 2/4] RISC-V: Add vector related pipelines

2024-01-25 Thread Robin Dapp
Thanks, that looks better IMHO.

> +;; Copyright (C) 2011-2024 Free Software Foundation, Inc.
> +;; Contributed by Andrew Waterman (and...@sifive.com).
> +;; Based on MIPS target for GNU compiler.

You might want to change that, as well as the date.  While at
it you can also fix the broken date in my original file ;)

> +(define_insn_reservation "vec_load" 6
> +  (and (eq_attr "is_inorder" "no")
> +   (eq_attr "type" "vlde,vldm,vlds,vldux,vldox,vldff,vldr"))
> +  "vxu_ooo_issue,vxu_ooo_alu")

I would rather ditch the is_inorder attribute for now and define
"low" latencies as well as reservations explicitly once we're
sure rather than falling back to scheduler defaults. 

OK with those changes.

Regards
 Robin


Re: [PATCH V3 1/4] RISC-V: Add non-vector types to dfa pipelines

2024-01-25 Thread Robin Dapp
LGTM, thanks.

Regards
 Robin


Re: [PATCH] RISC-V: Fix incorrect LCM delete bug [VSETVL PASS]

2024-01-25 Thread Robin Dapp
The non-test parts are OK IMHO.

Regards
 Robin


[PATCH] testsuite/vect: Add target checks to refined patterns [PR113558]

2024-01-24 Thread Robin Dapp
Hi,

on Solaris/SPARC several vector tests appeared to be regressing.  They
were never vectorized but the checks before r14-3612-ge40edf64995769
would match regardless if a loop was actually vectorized or not.
The refined checks only match a successful vectorization attempt
but are run unconditionally.  This patch adds target checks to them.

Bootstrapped (unnecessarily) and regtested on x86, aarch64 and
power10.  Regtested on riscv and (the previous version that 
missed vect-reduc-pattern-2a.c) on Solaris/SPARC by Rainer Orth.

Is this OK if Rainer's second run is successful?

Regards
 Robin

gcc/testsuite/ChangeLog:

PR testsuite/113558

* gcc.dg/vect/no-scevccp-outer-7.c: Add target check.
* gcc.dg/vect/vect-outer-4c-big-array.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-s16a.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-s8a.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-s8b.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-u16b.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-u8a.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-u8b.c: Ditto.
* gcc.dg/vect/vect-reduc-pattern-1a.c: Ditto.
* gcc.dg/vect/vect-reduc-pattern-1b-big-array.c: Ditto.
* gcc.dg/vect/vect-reduc-pattern-1c-big-array.c: Ditto.
* gcc.dg/vect/vect-reduc-pattern-2a.c: Ditto.
* gcc.dg/vect/vect-reduc-pattern-2b-big-array.c: Ditto.
* gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c: Ditto.
---
 gcc/testsuite/gcc.dg/vect/no-scevccp-outer-7.c  | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-outer-4c-big-array.c | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16a.c | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8a.c  | 4 ++--
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8b.c  | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-u16b.c | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-u8a.c  | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-u8b.c  | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-1a.c   | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-1b-big-array.c | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-1c-big-array.c | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-2a.c   | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-2b-big-array.c | 2 +-
 gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c| 4 ++--
 14 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-7.c 
b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-7.c
index 058d1d2db2d..87048422013 100644
--- a/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-7.c
+++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-7.c
@@ -77,4 +77,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED." 1 "vect" { 
target vect_widen_mult_hi_to_si } } } */
-/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" { target 
vect_widen_mult_hi_to_si } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-4c-big-array.c 
b/gcc/testsuite/gcc.dg/vect/vect-outer-4c-big-array.c
index 5c3eea95476..4aaf2932006 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-outer-4c-big-array.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-outer-4c-big-array.c
@@ -24,4 +24,4 @@ foo (){
 }
 
 /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { target 
{ vect_short_mult && { ! vect_no_align } } } } } */
-/* { dg-final { scan-tree-dump-times "zero step in outer 
loop.(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "zero step in outer 
loop.(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" { target { 
vect_short_mult && { ! vect_no_align } } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16a.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16a.c
index d826828e3d6..86fdcf37df8 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16a.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16a.c
@@ -51,7 +51,7 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" { target { 
vect_sdot_hi || vect_widen_mult_hi_to_si } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target 
vect_sdot_hi } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target 
vect_widen_mult_hi_to_si } } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8a.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8a.c
index 4e1e0b234f4..99c53d0ff02 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8a.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8a.c

Re: [PATCH] RISC-V: Fix large memory usage of VSETVL PASS [PR113495]

2024-01-23 Thread Robin Dapp
> SPEC 2017 wrf benchmark expose unreasonble memory usage of VSETVL PASS
> that is, VSETVL PASS consume over 33 GB memory which make use impossible
> to compile SPEC 2017 wrf in a laptop.
> 
> The root cause is wasting-memory variables:

LGTM.   The new code matches compute_lcm_local_properties more
closely which makes sense to me.

One separate thing, nothing to do with this patch - I find
bitmap_union_of_preds_with_entry not wrong but weirdly written.
Probably because it was copied from somewhere and slightly
adjusted?  If you touch more code anyway, would you mind fixing it?

  for (ix = 0; ix < EDGE_COUNT (b->preds); ix++)
{
  e = EDGE_PRED (b, ix);
  bitmap_copy (dst, src[e->src->index]);
  break;
}
  if (ix == EDGE_COUNT (b->preds))
bitmap_clear (dst);

The whole idea seems to _not_ skip the entry block.  So something
like if (EDGE_COUNT () == 0) {...} else { bitmap_copy (...)) should
be sufficient?  If the input is assumed to be empty we could even
skip the copy.

> -/* { dg-options "--param=riscv-autovec-preference=scalable -march=rv32gcv 
> -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2 -fno-tree-vectorize" } */
> +/* { dg-options "--param=riscv-autovec-preference=scalable -march=rv32gcv 
> -mabi=ilp32 -fno-tree-vectorize" } */

Why that change?  Was no-schedule necessary before and is not anymore?
Is it a result from the changes?  I'd hope not.

Regards
 Robin


Re: [PATCH] RISC-V: Lower vmv.v.x (avl = 1) into vmv.s.x

2024-01-22 Thread Robin Dapp
LGTM.

Regards
 Robin


Re: [PATCH] RISC-V: Fix regressions due to 86de9b66480b710202a2898cf513db105d8c432f

2024-01-22 Thread Robin Dapp
> No, we didn't undo the optimization.
> 
> We just disallow move pattern for (set (reg) (VL_REGNUM)).

Ah, what I referred to was the opposite direction.  We allow
(subreg:V8QI (reg:DI ...)) which is not touched by this patch.

Then it is OK.

Regards
 Robin


Re: [PATCH] RISC-V: Fix regressions due to 86de9b66480b710202a2898cf513db105d8c432f

2024-01-22 Thread Robin Dapp


Hi Juzhe,

in principle this seems ok to me but I wonder about:

> We shouldn't worry about subreg:...VL_REGNUM since it's impossible
> that we can have such situation,

I think we allow this in legitimize_move for situations like
(subreg:SI (reg:V4QI)).  That was not added for correctness but
optimization - are we sure we don't undo this optimization with
that change?

Regards
 Robin



Re: [PATCH V2] RISC-V: Fix RVV_VLMAX

2024-01-19 Thread Robin Dapp
Ah, interesting that this was it.  Thanks for fixing and also
thanks to Andrew for suggesting that fix.

Regards
 Robin


Re: [PATCH V2] RISC-V: Add has compatible check for conflict vsetvl fusion

2024-01-17 Thread Robin Dapp
OK.

Regards
 Robin



Re: [PATCH] RISC-V: Add has compatible check for conflict vsetvl fusion

2024-01-17 Thread Robin Dapp
Hi Juzhe,

the change itself is OK but I don't think we should add binary
files like this.  Even if not ideal, if you want to go forward
IMHO let's skip the test for now and add it at a (not much) later
time.

> diff --git 
> a/gcc/testsuite/gcc.target/riscv/rvv/fortran/spec2017_cam4/ppgrid.mod 
> b/gcc/testsuite/gcc.target/riscv/rvv/fortran/spec2017_cam4/ppgrid.mod
> new file mode 100644
> index 
> ..cb021390ccd758e75c3ad11b33da93e5fba9dd25
> GIT binary patch
> literal 296
> zcmV+@0oVQ?iwFP!01J#p3PlGTRhVT6q@2zl{=@`s-tfVf)tq@kHSA}j8Hz3S;
> z@Yh?$U?jR7j0cyt>GwAM+UHHbPVT~3#av=jq`S4ohpx6+k%JCBiloxd?>fb@DmEy~
> zRh6Yz%d*TqwV7`iu`C;Z(McQF#DqT#2lPd+lGk1SMnM}A6Hp9cSqmNq{B|nvAn#@P
> zCGE%0NIhX~f}
> z%QY-XvEF`Xig?UtLYWiJK(`p`0sj4z|eo?bN0F=wJgq8=k>8<#-V;obgE;
> u

Re: [PATCH] fold-const: Handle AND, IOR, XOR with stepped vectors [PR112971].

2024-01-15 Thread Robin Dapp
I gave it another shot now by introducing a separate function as
Richard suggested.  It's probably not at the location he intended.

The way I read the discussion there hasn't been any consensus
on how (or rather where) to properly tackle the problem.  Any
other ideas still?

Regards
 Robin


Found in PR112971 this patch adds folding support for bitwise operations
of const duplicate zero/one vectors with stepped vectors.
On riscv we have the situation that a folding would perpetually continue
without simplifying because e.g. {0, 0, 0, ...} & {7, 6, 5, ...} would
not be folded to {0, 0, 0, ...}.

gcc/ChangeLog:

PR middle-end/112971

* fold-const.cc (simplify_const_binop): New function for binop
simplification of two constant vectors when element-wise
handling is not necessary.
(const_binop): Call new function.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr112971.c: New test.
---
 gcc/fold-const.cc | 31 +++
 .../gcc.target/riscv/rvv/autovec/pr112971.c   | 18 +++
 2 files changed, 49 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112971.c

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 385e4a69ab3..2ef425aec0f 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -1343,6 +1343,29 @@ distributes_over_addition_p (tree_code op, int opno)
 }
 }
 
+/* OP is the INDEXth operand to CODE (counting from zero) and OTHER_OP
+   is the other operand.  Try to use the value of OP to simplify the
+   operation in one step, without having to process individual elements.  */
+static tree
+simplify_const_binop (tree_code code, tree op, tree other_op,
+ int index ATTRIBUTE_UNUSED)
+{
+  /* AND, IOR as well as XOR with a zerop can be simplified directly.  */
+  if (TREE_CODE (op) == VECTOR_CST && TREE_CODE (other_op) == VECTOR_CST)
+{
+  if (integer_zerop (other_op))
+   {
+ if (code == BIT_IOR_EXPR || code == BIT_XOR_EXPR)
+   return op;
+ else if (code == BIT_AND_EXPR)
+   return other_op;
+   }
+}
+
+  return NULL_TREE;
+}
+
+
 /* Combine two constants ARG1 and ARG2 under operation CODE to produce a new
constant.  We assume ARG1 and ARG2 have the same data type, or at least
are the same kind of constant and the same machine mode.  Return zero if
@@ -1646,6 +1669,14 @@ const_binop (enum tree_code code, tree arg1, tree arg2)
return build_complex (type, real, imag);
 }
 
+  tree simplified;
+  if ((simplified = simplify_const_binop (code, arg1, arg2, 0)))
+return simplified;
+
+  if (commutative_tree_code (code)
+  && (simplified = simplify_const_binop (code, arg2, arg1, 1)))
+return simplified;
+
   if (TREE_CODE (arg1) == VECTOR_CST
   && TREE_CODE (arg2) == VECTOR_CST
   && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)),
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112971.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112971.c
new file mode 100644
index 000..816ebd3c493
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112971.c
@@ -0,0 +1,18 @@
+/* { dg-do compile }  */
+/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -fno-vect-cost-model" 
}  */
+
+int a;
+short b[9];
+char c, d;
+void e() {
+  d = 0;
+  for (;; d++) {
+if (b[d])
+  break;
+a = 8;
+for (; a >= 0; a--) {
+  char *f = 
+  *f &= d == (a & d);
+}
+  }
+}
-- 
2.43.0




Re: [PATCH] RISC-V: Adjust loop len by costing 1 when NITER < VF

2024-01-15 Thread Robin Dapp
LGTM.

Regards
 Robin



Re: [PATCH] RISC-V: Fix regression (GCC-14 compare with GCC-13.2) of SHA256 from coremark-pro

2024-01-15 Thread Robin Dapp
OK, thanks.

Regards
 Robin



Re: [PATCH V3] RISC-V: Adjust scalar_to_vec cost

2024-01-12 Thread Robin Dapp
> Tested on both RV32/RV64 no regression, Ok for trunk ?

Yes, thanks!

Btw out of curiosity, did you see why we actually fail to
optimize away the VLA loop?  We should open a bug for that
I suppose.

Regards
 Robin



Re: [PATCH V2] RISC-V: Adjust scalar_to_vec cost accurately

2024-01-11 Thread Robin Dapp
> 1. This patch set scalar_to_vec cost as 2 instead 1 since scalar move
>instruction is slightly more costly than normal rvv instructions (e.g. 
> vadd.vv).

We can go with 2 or 3 (if needed) for now but should later
really incorporate reg-move costs in this IMHO.  Just like e.g.

static const struct cpu_regmove_cost cortexa57_regmove_cost =
{
  1, /* GP2GP  */
  /* Avoid the use of slow int<->fp moves for spilling by setting
 their cost higher than memmov_cost.  */
  5, /* GP2FP  */
  ...
};

we can add V2FP, V2GP and the reverse.  Then add those to
scalar_to_vec (later vec_to_scalar as well) in adjust_stmt_cost
according to the mode.

> 2. Adjust scalar_to_vec cost accurately according to the splat value, for 
> example,
>a value like 32872, needs 2 more scalar instructions:
>so the cost = 2 (scalar instructions) + 2 (scalar move).

>We adjust the cost like this since it doesn need such many instructions in 
> vectorized codes,
>wheras they are not needed in scalar codes.

I'm afraid the issue I mentioned (we don't count the constant
synthesis for scalar but would for vector with the change) is
still present.
Even if it does not cause any regressions or problems now it
certainly might in the future, especially with complex constants.
Basically we would not vectorize something containing several
synthesized constants (like popcount) anymore.
Therefore I would advise against it even though the given
example cannot be "solved" unconditionally then.

Regards
 Robin


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
> 32872 spends 2 scalar instructions + 1 scalar_to_vec cost:
> 
> lia4,-32768
> addiwa4,a4,104
> vmv.v.xv16,a4
> 
> It seems reasonable but only can fix test with -march=rv64gcv_zvl256b but 
> failed on -march=rv64gcv_zvl4096b.
The scalar version also needs both instructions:

li  a0,32768
addiw   a0,a0,104

Therefore I don't think we should just add them to the
vectorization costs.  That would only be necessary if we needed
to synthesize a different constant (e.g. if a scalar constant
cannot be used directly in a vector setting).

Currently, scalar_outside_cost = 0 so we don't model it on the
scalar side either.

With scalar_to_vec = 2 we first try RVVMF2QI, vf = 8 at zvl256b:

a.4_25 = PHI <1(2), _4(11)> 1 times vector_stmt costs 1 in body
a.4_25 = PHI <1(2), _4(11)> 2 times scalar_to_vec costs 4 in prologue
(unsigned short) a.4_25 1 times vector_stmt costs 1 in body
MIN_EXPR  1 times scalar_to_vec costs 2 in prologue
MIN_EXPR  1 times vector_stmt costs 1 in body
32872 >> patt_26 1 times scalar_to_vec costs 2 in prologue
32872 >> patt_26 1 times vector_stmt costs 1 in body
 1 times scalar_stmt costs 1 in prologue
 1 times scalar_stmt costs 1 in body

  Vector inside of loop cost: 5
  Scalar iteration cost: 1 (shouldn't that be 2? but anyway)

So one vector iteration costs 5 right now regardless of
scalar_to_vec because there are 5 vector operations (phi,
promote, min, shift, vsetvl/len adjustment).
The scalar_to_vec costs are added to the prologue because it is
assumed that broadcasts are hoisted out of the loop.

Then vectorization is supposed to be profitable if
#iterations = 18 > (body_cost * min_iters)
   + vector_outside_cost - scalar_outside_cost + 1 = 15.

If we really don't want to vectorize, then we can either
further increase the prologue cost or the body itself.  The
body statements are all vector_stmts, though.  For the
prologue we need a good argument why to increase scalar_to_vec
to beyond, say 2.

> Is it reasonable ? IMHO, scalar move (vmv.v.x or vfmv.v.f) should be
> more costly than normal vadd.vv since it is transferring data between
> different pipeline/register class.

We want it to be more expensive, yes.  In one of the last
messages I explained how I would model it using either
register_move_cost or using (tune-specific) costs directly.
I would start with scalar_to_vec = 1 and add 1 or 2 depending
on the uarch/tune-specific reg-move costs.

Regards
 Robin



Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
> I think we shouldn't vectorize it with any vlen, since the non-vectorized 
> codegen is much better.
> And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it.
> -zvl65536b, RVV Clang also doesn't vectorize it.

Of course I agree that optimizing everything to return 0 is
what should happen (tree-ssa-dom or vrp do that).  Unfortunately
they don't anymore after vectorizing the loop.

My point is cost comparison only has the scalar loop to compare
against which is:

li  a5,1
li  a3,19
.L2:
mv  a4,a5
addiw   a5,a5,1
bne a5,a3,.L2

That's effectively 2 * 18 instructions and more than what we get
when vectorizing - therefore it's kind totally outrageous to
vectorize here and we need to make sure not to go overboard with
costing just for this example.

How does aarch64's cost comparison look like?  What's, comparatively,
more expensive with their tuning?  I've seen scalar_to_vec = 4 and
vec_to_scalar = 4 but a regular operation is 2 already.   This
would equal scalar_to_vec = 2 for us (and is not sufficient) so
something else must come into play still.

Regards
 Robin


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
On 1/11/24 11:20, juzhe.zh...@rivai.ai wrote:
> Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the 
> loop we have these 2 scalar_to_vec:
> 
> 1. MIN_EXPR  1 times scalar_to_vec costs 1 in prologue
> 
>    This scalar_to_vec cost should be 0 or 1 since it only generate single 
> instructions: vmv.v.iv16,15
> 
> 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue
> 
>    This cost should be higher since it cost 3 instructions:
>     lia4,-32768
>     addiwa4,a4,104
>     vmv.v.xv16,a4
> 
> Am I correct ?
> 
> I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be 
> good.

That would be the general idea, yes.  As Richard mentioned, it doesn't
always work well but for this case here it could help a bit.
(My question whether why we shouldn't vectorize this at 256b
and above still stands, though)

As mentioned before, the other thing that needs to be considered
is register-move costs (or the respective cost structure).  On
some uarchs the vmv.v.f might be more expensive than vmv.v.x and
so on - in addition to the instructions needed to synthesize the
constant.

Regards
 Robin



Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
>  Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN 
> size,
> that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8...
> 
> I am confused now how to fix this case.

4 is definitely too high compared to a regular instruction.
vmv.vx could even be zero-cost for constants.

To catch constants we could add handling in add_stmt_cost, inspecting
the stmt directly.

Regards
 Robin


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
>> The slidedown/vmv.x.s part is of course vec_extract but we indeed
>> don't seem to cost it as vec_to_scalar here.
> 
> It looks like a vectorized live operation as it's not in the loop body
> (and thus really irrelevant for costing in practice).  This has
> 
>   /* ???  Enable for loop costing as well.  */
>   if (!loop_vinfo)
> record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
>   0, vect_epilogue);
> 
> so live ops are not costed at all.  I would suggest to try unconditionally
> enabling this?
> 

IMHO this example is not really ideal to start with anyway so we should maybe
try another one first.  I'd still argue that one or two iterations vs.
potentially 16+ scalar ones is not necessarily bad.

That said, we also don't really cost all our vsetvls yet (difficult...).

Regards
 Robin


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote:
> Oh. I see I think I have done wrong here.
> 
> I should adjust cost for VEC_EXTRACT not VEC_SET.
> 
> But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec
> cost in vect.dump.

The slidedown/vmv.x.s part is of course vec_extract but we indeed
don't seem to cost it as vec_to_scalar here.

vmv.vx correspond to scalar_to_vec and I'd say 3 seems a
bit high when a regular vector instruction is "1".
It should rather be dependent on the latency between register
files.  We can't really say in general but I'd say "2" is not so bad.

I would suggest adding special handling in builtin_vectorization_cost
like:

/* Add register-register latency.  */
case scalar_to_vec:
  return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...)

and adjust register_move_cost accordingly.  Instead of using
register_move_cost we could also use a cost structure directly.
(E.g. like aarch64's regmove tuning structures.  Those don't
contain VRs but for us it could make sense to add them).

> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize 
> -fdump-tree-vect-details" } */
With a cost of "3" we still vectorize for zvl512b and larger.
Is that intended?  I don't really see why 512 should vectorized
but 256 not.  Disregarding that everything should be optimized
away, 2 iterations for the whole loop with 256 bits doesn't
seem that bad.

Regards
 Robin



Re: [PATCH v5] RISC-V: Fix register overlap issue for some xtheadvector instructions

2024-01-11 Thread Robin Dapp
LGTM now, thanks.  I find it much more readable that way.

Regards
 Robin


Re: [PATCH V2 2/4][RFC] RISC-V: Add vector related reservations

2024-01-10 Thread Robin Dapp
> Since all the pipelines should be tuned to their cost model, they
> would be different anyway. If it would be simpler for now, I could
> separate the files out.
> I think I'm getting a bit confused. Is there a reason why we would
> want to exchange scheduler descriptions like the example you
> provided? I'm just thinking why a in-order model would want to use an
> ooo vector model and vice versa. Please correct me if I got the wrong
> idea.

Yeah, the confusion is understandable as it's all in flow and several
things I mentioned are artifacts of us not yet being stabilized (or
actually having hard data to base our decisions on).

Usually, once a uarch has settled there is no reason to exchange
anything, just smaller tweaks might be done.  I was more thinking of
the near to mid-term future where larger changes like ripping out
one thing and using another one altogether might still happen.

Regarding out of order vs in order - for in-order pipelines we will
always want to get latencies right.  For out of order it is a balancing
act (proper latencies often mean more spilling and the processor will
reorder correctly anyway).

So you're mostly right that the argument is not very strong as soon
as we really know what to do and not to do.

> I also want to double check, isn't forcing all typed instructions to
> be part of a dfa pipeline in effect removing a situation where a tune
> model does not specify a "vector tune model"? At least from my
> testing with the assert statement, I get ICEs when trying to run the
> testsuite without the vector tune model even on gc.

There are (at least) three parts of the "tune model":
 - vector cost model, specifying the cost of generic vector operations,
   not necessarily corresponding to an insn
 - insn cost, specifying the cost of an individual insn, usually close
   to latency but sometimes also "complexity" or other things.
 - insn latency and other hardware scheduler properties.

We can leave out any of those which will make us fall back to default
values.  Even if we forced a scheduler description we could still have
the default fallback for the other two and generate unfavorable code
as a result.

However, this is of course not desirable and we will soon have a
reasonable vector cost model that corresponds to the non-uarch
specific properties of the vector spec.  Once this is in place
we will also want a somewhat generic vector scheduler description
that goes hand in hand with that.  Despite the name, the vector
part of generic-ooo could be used for in-order vector uarchs and
we might want to define a different description for out-of-order
uarchs.  That's a separate discussion but at least for that
contingency it would make sense to easily interchange the scheduler
description ;)

Regards
 Robin


Re: [PATCH V2 2/4][RFC] RISC-V: Add vector related reservations

2024-01-10 Thread Robin Dapp
Hi Edwin,

> This patch copies the vector reservations from generic-ooo.md and
> inserts them into generic.md and sifive.md. Creates new vector crypto related
> insn reservations.

In principle, the changes look good to me but I wonder if we could
split off the vector parts from generic-ooo into their own md file
(generic-vector-ooo or so?) and include this in the others?  Or is
there a reason why you decided against this?

A recurring question in vector cost model discussions seems to be how
to handle the situation when a tune model does not specify a "vector tune
model".  The problem exists for the scheduler descriptions and the
normal vector cost model (and possibly insn_costs as well).

Juzhe just implemented a fallback so we always use the "generic rvv" cost
model.  Your changes would be in the same vein and if we could split
them off then we'd be able to easier exchange one scheduler descriptions
for another one (say if one tune model wants to use an in-order vector
model).

There is also still the question of whether to set all latencies
to 1 for an OOO core but this question should be settled separately
as soon as we have proper hardware benchmark results.  If so we
would probably rename generic-vector-ooo into
generic-vector-in-order ;)

Regards
 Robin



Re: [PATCH V2] RISC-V: Switch RVV cost model.

2024-01-10 Thread Robin Dapp
LGTM.

Regards
 Robin



Re: [PATCH] RISC-V: Switch RVV cost model to generic vector cost model

2024-01-10 Thread Robin Dapp
> Current generic cost model makes dynamic-lmul2-7.c generate inferior codegen.
> 
> I found if I tweak the cost a little bit then dynamic-lmul2-7.c codegen can 
> be recovered.
> However, it makes other tests failed
> It's complicated story

Ok, makes sense.  So the plan seems to be:

 (1) Fall back to the generic cost model if the tune model didn't
 specify one, i.e. make sure we always use the generic cost
 model rather than the default one.
 (2) Change this generic (fallback) cost model so we don't have
 regressions on the current trunk, as it's now always used.
 (3) Adjust it piece by piece.

Sure this makes sense and is also what I had in mind.

> It's true that: we can keep current cost model 
> default_builtin_vectorization_cost
> And tweak generic cost model, for exampl, add testcase for SHA256 and add 
> -mtune=generic-ooo to test it.

> But the question, how do you know whether there is a regression on current 
> testsuite with -mtune=generic-ooo ?

That's a valid question and not easily solved.  Ideally the
generic model is generic enough to be a good base for most
uarchs.  Then the uarchs would only do minor adjustments and
have their own tests for that while the bulk of the generic
tests would still pass.

Generally, normal tests should be pretty independent of the
cost model with the exception of checking instruction sequences.
Those that are not should either specify their own -mtune and/or
disable scheduling.  Of course that's easier said than done...

Back to the patch:

I would suggest either renaming generic_vl[sa]_vector_cost to
rvv_vl[sa]_vector_cost (I find generic a bit too close to default)
and/or add comments that those are supposed to be the vector cost models
used by default if no other cost model was specified.

After understanding (2) of the plan the patch is OK to me with
that changed.

Regards
 Robin



Re: [PATCH] RISC-V: Switch RVV cost model to generic vector cost model

2024-01-10 Thread Robin Dapp
On 1/10/24 15:40, 钟居哲 wrote:
> I need to add these costs for segment load/stores:
> 
> /* Generic costs for VLA vector operations.  */
> static const scalable_vector_cost generic_vla_vector_cost = {
>   {
>     1,/* int_stmt_cost  */
>     1,/* fp_stmt_cost  */
>     1,/* gather_load_cost  */
>     1,/* scatter_store_cost  */
>     1,/* vec_to_scalar_cost  */
>     1,/* scalar_to_vec_cost  */
>     1,/* permute_cost  */
>     1,/* align_load_cost  */
>     1,/* align_store_cost  */
>     2,/* unalign_load_cost  */
>     2,/* unalign_store_cost  */
>   },
>   2,/* vlseg2_vsseg2_permute_cost  */
>   2,/* vlseg3_vsseg3_permute_cost  */
>   3,/* vlseg4_vsseg4_permute_cost  */
>   3,/* vlseg5_vsseg5_permute_cost  */
>   4,/* vlseg6_vsseg6_permute_cost  */
>   4,/* vlseg7_vsseg7_permute_cost  */
>   4,/* vlseg8_vsseg8_permute_cost  */
> };
> 
> to fix the SLP issues in the following patches.
> 
> If you don't allow me to switch to generic vector cost model and tune it.
> How can I fix the FAILs of slp-*.c cases ?
> 
> Currently, l let all slp-*.c tests all XFAIL which definitely incorrect.

Of course we don't want those XFAILs.  It's not a matter of "allowing"
or not but rather that I'd like to understand the reasoning.  The patch
itself seems reasonable to me apart from not really getting the
intention.

Your main point seems to be

> +  const cpu_vector_cost *costs = tune_param->vec_costs;
> +  if (!costs)
> +return _vector_cost
and that is fine.  What's not clear is whether changing the actual
costs is a temporary thing or whether it is supposed to be another
fallback.  If they are going to be changed anyway, why do we need
to revert to the default model now?  As discussed yesterday
increased permute costs and vec_to_scalar costs make sense, to first
order.  Is that because of dynamic-lmul2-7.c?

Generally we need to make the costs dependent on the
type or mode of course, just as we started to do with the latencies.
Permute is particularly sensitive as you already gathered.

Regards
 Robin



Re: [PATCH V2] RISC-V: Minor tweak dynamic cost model

2024-01-10 Thread Robin Dapp
LGTM.

Regards
 Robin


Re: [PATCH] RISC-V: Switch RVV cost model to generic vector cost model

2024-01-10 Thread Robin Dapp
Hi Juzhe,

> The reason we want to switch to generic vector cost model is the default
> cost model generates inferior codegen for various benchmarks.
> 
> For example, PR113247, we have performance bug that we end up having over 70%
> performance drop of SHA256.  Currently, no matter how we adapt cost model,
> we are not able to fix the performance bug since we always use default cost 
> model by default.
> 
> Also, tweak the generic cost model back to default cost model since we have 
> some FAILs in
> current tests.

So to recap:

 - Our current default tune model is rocket which does not have a vector
   cost model.  No other tune model except generic-ooo has one.

 - We want tune models with no vector cost model to fall back to the
   default vector cost model for now, later possibly the generic RVV
   cost model.

 - You're seeing inferior codegen for dynamic-lmul2-7.c with our generic
   RVV (not default) vector cost model (built with -mtune=generic-ooo?).

Therefore the suggestions is to start over freshly with the default
vector cost model?

>  /* Generic costs for VLA vector operations.  */
> @@ -374,13 +374,13 @@ static const scalable_vector_cost 
> generic_vla_vector_cost = {
>  1, /* fp_stmt_cost  */
>  1, /* gather_load_cost  */
>  1, /* scatter_store_cost  */
> -2, /* vec_to_scalar_cost  */
> +1, /* vec_to_scalar_cost  */
>  1, /* scalar_to_vec_cost  */
> -2, /* permute_cost  */
> +1, /* permute_cost  */
>  1, /* align_load_cost  */
>  1, /* align_store_cost  */
> -1, /* unalign_load_cost  */
> -1, /* unalign_store_cost  */
> +2, /* unalign_load_cost  */
> +2, /* unalign_store_cost  */
>},
>  };

So is the idea here to just revert the values to the defaults for now
and change them again soon?  And not to keep this as another default
and add others?

I'm a bit confused here :)  How does this help?  Can't we continue to
fall back to the default vector cost model when a tune model does not
specify a vector cost model?  If generic-ooo using the generic vector
cost model is the problem, then let's just change it to NULL for now?

I suppose at some point we will not want to fall back to the default
vector cost model anymore but always use the generic RVV cost model.
Once we reach the costing part we need to fall back to something
if nothing was defined and generic RVV is supposed to always be better 
than default.

Regards
 Robin



Re: [PATCH v5] RISC-V: Fix register overlap issue for some xtheadvector instructions

2024-01-10 Thread Robin Dapp
Hi Joshua,

> For th.vmadc/th.vmsbc as well as narrowing arithmetic instructions
> and floating-point compare instructions, an illegal instruction
> exception will be raised if the destination vector register overlaps
> a source vector register group.
> 
> To handle this issue, we use "group_overlap" and "enabled" attribute
> to disable some alternatives for xtheadvector.

>  ;; Widening instructions have group-overlap constraints.  Those are only
>  ;; valid for certain register-group sizes.  This attribute marks the
>  ;; alternatives not matching the required register-group size as disabled.
> -(define_attr "group_overlap" "none,W21,W42,W84,W43,W86,W87,W0"
> +(define_attr "group_overlap" 
> "none,W21,W42,W84,W43,W86,W87,W0,thv_disabled,rvv_disabled"
>(const_string "none"))

I realize there have been some discussions before but I find the naming
misleading.  The group_overlap attribute is supposed to specify whether
groups overlap (and mark the respective alternatives accepting
only this overlap).
Then we check if the groups overlap and disable all non-matching
alternatives.  "none" i.e. "no overlap" always matches.

Your first goal seems to be to disable existing non-early-clobber
alternatives for thv.  For this, maybe "full", "same" (or "any"?) would
work?  Please also add a comment in group_overlap_valid then that we
need not actually check for register equality.

For the other insns, I wonder if we could get away with not really
disabling the newly added early-clobber alternatives for RVV but
just disparaging ("?") them?  That way we could re-use "full" for
the thv-disabled alternatives and "none" for the newly added ones.
("none" will still be misleading then, though :/)

If this doesn't work or others feel the separation is not strict
enough, I'd prefer a separate attribute rather than overloading
group_overlap.  Maybe something like "spec_restriction" or similar
with two values "rvv" and "thv"?

Regards
 Robin



Re: [PATCH] RISC-V: Fix loop invariant check

2024-01-09 Thread Robin Dapp
OK.  I'm still a bit unsure about whether SSA_NAME_IS_DEFAULT_DEF
can really occur but if it does it's not wrong to treat it as
loop invariant.

Regards
 Robin


  1   2   3   4   5   6   7   8   9   10   >