Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization

2023-12-15 Thread Robin Dapp
On 12/15/23 13:52, juzhe.zh...@rivai.ai wrote:
> Do you mean :
> 
>   /* We need to use precomputed mask for such situation and such mask
>      can only be computed in compile-time known size modes.  */
>   bool indices_fit_selector_p
>     = GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, 
> 256);
>   if (!indices_fit_selector_p && !vec_len.is_constant ())
>     return false;

Yes and then reuse this in the if.

Regards
 Robin


Re: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization

2023-12-15 Thread juzhe.zh...@rivai.ai
Do you mean :

  /* We need to use precomputed mask for such situation and such mask
 can only be computed in compile-time known size modes.  */
  bool indices_fit_selector_p
= GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, 256);
  if (!indices_fit_selector_p && !vec_len.is_constant ())
return false;



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-12-15 20:44
To: juzhe.zh...@rivai.ai; gcc-patches
CC: rdapp.gcc; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm 
vectorization
> Oh. I think it should be renamed into not_fit.
> 
> Is this following make sense to you ?
> 
>   /* We need to use precomputed mask for such situation and such mask
>  can only be computed in compile-time known size modes.  */
>   bool indices_not_fit_selector_p
> = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode)));
>   if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8
>   && indices_not_fit_selector_p
>   && !vec_len.is_constant ())
> return false;
 
Mhm, right, I don't think this makes it nicer overall.  Maybe just like
the following then:
 
bool ..._p = GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt 
(vec_len, 256);
 
if (!..._p && !vec_len.is_constant ())
 
then later
 
if (..._p)
...
else
...
 
Regards
Robin
 
 


Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization

2023-12-15 Thread Robin Dapp
> Oh. I think it should be renamed into not_fit.
> 
> Is this following make sense to you ?
> 
>   /* We need to use precomputed mask for such situation and such mask
>      can only be computed in compile-time known size modes.  */
>   bool indices_not_fit_selector_p
>     = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode)));
>   if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8
>       && indices_not_fit_selector_p
>       && !vec_len.is_constant ())
>     return false;

Mhm, right, I don't think this makes it nicer overall.  Maybe just like
the following then:

bool ..._p = GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt 
(vec_len, 256);

if (!..._p && !vec_len.is_constant ())

then later

if (..._p)
...
else
...

Regards
 Robin



Re: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization

2023-12-15 Thread juzhe.zh...@rivai.ai
Oh. I think it should be renamed into not_fit.

Is this following make sense to you ?

  /* We need to use precomputed mask for such situation and such mask
 can only be computed in compile-time known size modes.  */
  bool indices_not_fit_selector_p
= maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode)));
  if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8
  && indices_not_fit_selector_p
  && !vec_len.is_constant ())
return false;



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-12-15 20:25
To: juzhe.zh...@rivai.ai; gcc-patches
CC: rdapp.gcc; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm 
vectorization
On 12/15/23 13:16, juzhe.zh...@rivai.ai wrote:
> 
>>> bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE 
>>> (GET_MODE_INNER (vmode)));
> No, I think it will make us miss some optimization.
> 
> For example, for poly value [16,16]  maybe_ge ([16,16], 65536) which makes us 
> missed merge optimization but
> we definitely can do merge optimization.
 
I didn't mean to skip the && !vec_len.is_constant (), that should
stay.  Just the first part of condition that can be re-used in the
if as well (inverted).
 
Regards
Robin
 


Re: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization

2023-12-15 Thread juzhe.zh...@rivai.ai
Do you mean like this ?

  /* We need to use precomputed mask for such situation and such mask
 can only be computed in compile-time known size modes.  */
  bool indices_fit_selector_p
= maybe_ge (vec_len, 2 << GET_MODE_BITSIZE (GET_MODE_INNER (vmode)));
  if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8、
  && indices_fit_selector_p
  && !vec_len.is_constant ())
return false;



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-12-15 20:25
To: juzhe.zh...@rivai.ai; gcc-patches
CC: rdapp.gcc; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm 
vectorization
On 12/15/23 13:16, juzhe.zh...@rivai.ai wrote:
> 
>>> bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE 
>>> (GET_MODE_INNER (vmode)));
> No, I think it will make us miss some optimization.
> 
> For example, for poly value [16,16]  maybe_ge ([16,16], 65536) which makes us 
> missed merge optimization but
> we definitely can do merge optimization.
 
I didn't mean to skip the && !vec_len.is_constant (), that should
stay.  Just the first part of condition that can be re-used in the
if as well (inverted).
 
Regards
Robin
 


Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization

2023-12-15 Thread Robin Dapp
On 12/15/23 13:16, juzhe.zh...@rivai.ai wrote:
> 
>>> bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE 
>>> (GET_MODE_INNER (vmode)));
> No, I think it will make us miss some optimization.
> 
> For example, for poly value [16,16]  maybe_ge ([16,16], 65536) which makes us 
> missed merge optimization but
> we definitely can do merge optimization.

I didn't mean to skip the && !vec_len.is_constant (), that should
stay.  Just the first part of condition that can be re-used in the
if as well (inverted).

Regards
 Robin


Re: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization

2023-12-15 Thread juzhe.zh...@rivai.ai

>> bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE 
>> (GET_MODE_INNER (vmode)));
No, I think it will make us miss some optimization.

For example, for poly value [16,16]  maybe_ge ([16,16], 65536) which makes us 
missed merge optimization but
we definitely can do merge optimization.

>> Also add a comment that the non-constant case is handled by
>> shuffle_decompress_patterns in case we have a HImode vector twice the
>> size that can hold our indices.

Ok.


>> Comment should go inside the if branch.
Ok.


>>Also add this as comment:

>>As the mask is a simple {0, 1, ...} pattern and the length is known we can
>>store it in a scalar register and broadcast it to a mask register.

Ok.

>>I would have hoped that a simple

>>  v.quick_push (gen_int_mode (0b01010101, QImode));

>>suffices but that will probably clash if there are more than
>>two npatterns.

No, we definitely can not use this. more details you can see the current test 
vmerge-*.c .
We have various patterns:

E.g.
0, nunits + 1, nunits+ 2, ... it is 011

nunits, 1, 2 it 100.


Many different kinds of patterns can be used vmerge optimization.



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-12-15 19:14
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm 
vectorization
Hi Juzhe,
 
in general looks OK.
 
> +  /* We need to use precomputed mask for such situation and such mask
> + can only be computed in compile-time known size modes.  */
> +  if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8 && maybe_ge (vec_len, 
> 256)
> +  && !vec_len.is_constant ())
> +return false;
> +
 
We could make this a separate variable like:
bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE 
(GET_MODE_INNER (vmode)));
 
Also add a comment that the non-constant case is handled by
shuffle_decompress_patterns in case we have a HImode vector twice the
size that can hold our indices.
 
>/* MASK = SELECTOR < NUNTIS ? 1 : 0.  */
 
Comment should go inside the if branch.
 
> +  if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, 
> 256))
 
> +}
> +  else
> +{
> +  /* For EEW8 and NUNITS may be larger than 255, we can't use vmsltu
> + directly to generate the selector mask, instead, we can only use
> + precomputed mask.
 
I find that comment a bit misleading as it's not the vmsltu itself but
rather that the indices cannot be held.
 
> +
> + E.g. selector = <0, 257, 2, 259> for EEW8 vector with NUNITS = 256, we
> + don't have a QImode scalar register to hold larger than 255.  */
 
We also cannot hold that in a vector QImode register, and, since there
is no larger HI mode vector we cannot create a larger selector.
 
Also add this as comment:
 
As the mask is a simple {0, 1, ...} pattern and the length is known we can
store it in a scalar register and broadcast it to a mask register.
 
> +  gcc_assert (vec_len.is_constant ());
> +  int size = CEIL (GET_MODE_NUNITS (mask_mode).to_constant (), 8);
> +  machine_mode mode = get_vector_mode (QImode, size).require ();
> +  rtx tmp = gen_reg_rtx (mode);
> +  rvv_builder v (mode, 1, size);
> +  for (int i = 0; i < vec_len.to_constant () / 8; i++)
> + {
> +   uint8_t value = 0;
> +   for (int j = 0; j < 8; j++)
> + {
> +   int index = i * 8 + j;
> +   if (known_lt (d->perm[index], 256))
> + value |= 1 << j;
> + }
 
I would have hoped that a simple
 
  v.quick_push (gen_int_mode (0b01010101, QImode));
 
suffices but that will probably clash if there are more than
two npatterns.
 
Regards
Robin
 
 


Re: [PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization

2023-12-15 Thread Robin Dapp
Hi Juzhe,

in general looks OK.

> +  /* We need to use precomputed mask for such situation and such mask
> + can only be computed in compile-time known size modes.  */
> +  if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8 && maybe_ge (vec_len, 
> 256)
> +  && !vec_len.is_constant ())
> +return false;
> +

We could make this a separate variable like:
bool indices_fit_selector = maybe_ge (vec_len, 2 << GET_MODE_BITSIZE 
(GET_MODE_INNER (vmode)));

Also add a comment that the non-constant case is handled by
shuffle_decompress_patterns in case we have a HImode vector twice the
size that can hold our indices.

>/* MASK = SELECTOR < NUNTIS ? 1 : 0.  */

Comment should go inside the if branch.

> +  if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, 
> 256))

> +}
> +  else
> +{
> +  /* For EEW8 and NUNITS may be larger than 255, we can't use vmsltu
> +  directly to generate the selector mask, instead, we can only use
> +  precomputed mask.

I find that comment a bit misleading as it's not the vmsltu itself but
rather that the indices cannot be held.

> +
> +  E.g. selector = <0, 257, 2, 259> for EEW8 vector with NUNITS = 256, we
> +  don't have a QImode scalar register to hold larger than 255.  */

We also cannot hold that in a vector QImode register, and, since there
is no larger HI mode vector we cannot create a larger selector.

Also add this as comment:

As the mask is a simple {0, 1, ...} pattern and the length is known we can
store it in a scalar register and broadcast it to a mask register.

> +  gcc_assert (vec_len.is_constant ());
> +  int size = CEIL (GET_MODE_NUNITS (mask_mode).to_constant (), 8);
> +  machine_mode mode = get_vector_mode (QImode, size).require ();
> +  rtx tmp = gen_reg_rtx (mode);
> +  rvv_builder v (mode, 1, size);
> +  for (int i = 0; i < vec_len.to_constant () / 8; i++)
> + {
> +   uint8_t value = 0;
> +   for (int j = 0; j < 8; j++)
> + {
> +   int index = i * 8 + j;
> +   if (known_lt (d->perm[index], 256))
> + value |= 1 << j;
> + }

I would have hoped that a simple

  v.quick_push (gen_int_mode (0b01010101, QImode));

suffices but that will probably clash if there are more than
two npatterns.

Regards
 Robin



[PATCH] RISC-V: Fix vmerge optimization bug in vec_perm vectorization

2023-12-14 Thread Juzhe-Zhong
This patch fixes the following FAILs in "full coverage" testing:

Running target 
riscv-sim/-march=rv64gcv_zvl256b/-mabi=lp64d/-mcmodel=medlow/--param=riscv-autovec-lmul=m8/--param=riscv-autovec-preference=fixed-vlmax
FAIL: gcc.dg/vect/vect-strided-mult-char-ls.c -flto -ffat-lto-objects execution 
test
FAIL: gcc.dg/vect/vect-strided-mult-char-ls.c execution test
FAIL: gcc.dg/vect/vect-strided-u8-i2.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/vect-strided-u8-i2.c execution test

The root cause is vmerge optimization on this following IR:

_45 = VEC_PERM_EXPR ;

It's obvious we have many index > 255 in shuffle indice. Here we use vmerge 
optimizaiton which is available but incorrect codgen cause run fail.

The bug codegen:
vsetvli zero,a4,e8,m8,ta,ma
vmsltu.vi   v0,v0,0 -> it should be 256 instead of 0, but since 
it is EEW8 vector, 256 is not a available value that 8bit register can hold it.
vmerge.vvm  v8,v8,v16,v0

After this patch:
vmv.v.x v0,a6
vmerge.vvm  v8,v8,v16,v0

gcc/ChangeLog:

* config/riscv/riscv-v.cc (shuffle_merge_patterns): Fix bug.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/bug-1.c: New test.

---
 gcc/config/riscv/riscv-v.cc   | 49 ---
 .../gcc.target/riscv/rvv/autovec/bug-1.c  | 39 +++
 2 files changed, 81 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/bug-1.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 680e2a0e03a..b8ba597682b 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -2987,20 +2987,55 @@ shuffle_merge_patterns (struct expand_vec_perm_d *d)
&& !d->perm.series_p (i, n_patterns, vec_len + i, n_patterns))
   return false;
 
+  /* We need to use precomputed mask for such situation and such mask
+ can only be computed in compile-time known size modes.  */
+  if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) == 8 && maybe_ge (vec_len, 256)
+  && !vec_len.is_constant ())
+return false;
+
   if (d->testing_p)
 return true;
 
   machine_mode mask_mode = get_mask_mode (vmode);
   rtx mask = gen_reg_rtx (mask_mode);
 
-  rtx sel = vec_perm_indices_to_rtx (sel_mode, d->perm);
-
   /* MASK = SELECTOR < NUNTIS ? 1 : 0.  */
-  rtx x = gen_int_mode (vec_len, GET_MODE_INNER (sel_mode));
-  insn_code icode = code_for_pred_cmp_scalar (sel_mode);
-  rtx cmp = gen_rtx_fmt_ee (LTU, mask_mode, sel, x);
-  rtx ops[] = {mask, cmp, sel, x};
-  emit_vlmax_insn (icode, COMPARE_OP, ops);
+  if (GET_MODE_BITSIZE (GET_MODE_INNER (vmode)) > 8 || known_lt (vec_len, 256))
+{
+  rtx sel = vec_perm_indices_to_rtx (sel_mode, d->perm);
+  rtx x = gen_int_mode (vec_len, GET_MODE_INNER (sel_mode));
+  insn_code icode = code_for_pred_cmp_scalar (sel_mode);
+  rtx cmp = gen_rtx_fmt_ee (LTU, mask_mode, sel, x);
+  rtx ops[] = {mask, cmp, sel, x};
+  emit_vlmax_insn (icode, COMPARE_OP, ops);
+}
+  else
+{
+  /* For EEW8 and NUNITS may be larger than 255, we can't use vmsltu
+directly to generate the selector mask, instead, we can only use
+precomputed mask.
+
+E.g. selector = <0, 257, 2, 259> for EEW8 vector with NUNITS = 256, we
+don't have a QImode scalar register to hold larger than 255.  */
+  gcc_assert (vec_len.is_constant ());
+  int size = CEIL (GET_MODE_NUNITS (mask_mode).to_constant (), 8);
+  machine_mode mode = get_vector_mode (QImode, size).require ();
+  rtx tmp = gen_reg_rtx (mode);
+  rvv_builder v (mode, 1, size);
+  for (int i = 0; i < vec_len.to_constant () / 8; i++)
+   {
+ uint8_t value = 0;
+ for (int j = 0; j < 8; j++)
+   {
+ int index = i * 8 + j;
+ if (known_lt (d->perm[index], 256))
+   value |= 1 << j;
+   }
+ v.quick_push (gen_int_mode (value, QImode));
+   }
+  emit_move_insn (tmp, v.build ());
+  emit_move_insn (mask, gen_lowpart (mask_mode, tmp));
+}
 
   /* TARGET = MASK ? OP0 : OP1.  */
   /* swap op0 and op1 since the order is opposite to pred_merge.  */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/bug-1.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/bug-1.c
new file mode 100644
index 000..88059971503
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/bug-1.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d 
--param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax  
-fno-vect-cost-model -O3 -fdump-tree-optimized" } */
+
+#include 
+
+#define N 64
+
+typedef struct
+{
+  unsigned char a;
+  unsigned char b;
+} s;
+
+int
+main1 (s *arr)
+{
+  s *ptr = arr;
+  s res[N];
+  int i;
+
+  for (i = 0; i < N; i++)
+{
+  res[i].a = ptr->b - ptr->a;
+  res[i].b = ptr->b + ptr->a;
+  ptr++;
+}
+/* check results:  */