RE: [PATCH] RISC-V: Use merge approach to optimize vector permutation

2023-06-14 Thread Li, Pan2 via Gcc-patches
Addressed the comments in PATCH v2 as below.

https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621789.html

Pan

-Original Message-
From: Gcc-patches  On Behalf 
Of Jeff Law via Gcc-patches
Sent: Thursday, June 15, 2023 3:11 AM
To: Robin Dapp ; juzhe.zh...@rivai.ai; 
gcc-patches@gcc.gnu.org
Cc: kito.ch...@gmail.com; kito.ch...@sifive.com; pal...@dabbelt.com; 
pal...@rivosinc.com
Subject: Re: [PATCH] RISC-V: Use merge approach to optimize vector permutation



On 6/14/23 09:00, Robin Dapp wrote:
> Hi Juzhe,
> 
> the general method seems sane and useful (it's not very complicated).
> I was just distracted by
> 
>> Selector = { 0, 17, 2, 19, 4, 21, 6, 23, 8, 9, 10, 27, 12, 29, 14, 31 }, the 
>> common expression:
>> { 0, nunits + 1, 1, nunits + 2, 2, nunits + 3, ...  }
>>
>> For this selector, we can use vmsltu + vmerge to optimize the codegen.
> 
> because it's actually { 0, nunits + 1, 2, nunits + 3, ... } or maybe
> { 0, nunits, 0, nunits, ... } + { 0, 1, 2, 3, ..., nunits - 1 }.
> 
> Because of the ascending/monotonic? selector structure we can use vmerge
> instead of vrgather.
> 
>> +/* Recognize the patterns that we can use merge operation to shuffle the
>> +   vectors. The value of Each element (index i) in selector can only be
>> +   either i or nunits + i.
>> +
>> +   E.g.
>> +   v = VEC_PERM_EXPR (v0, v1, selector),
>> +   selector = { 0, nunits + 1, 1, nunits + 2, 2, nunits + 3, ...  }
> 
> Same.
> 
>> +
>> +   We can transform such pattern into:
>> +
>> +   v = vcond_mask (v0, v1, mask),
>> +   mask = { 0, 1, 0, 1, 0, 1, ... }.  */
>> +
>> +static bool
>> +shuffle_merge_patterns (struct expand_vec_perm_d *d)
>> +{
>> +  machine_mode vmode = d->vmode;
>> +  machine_mode sel_mode = related_int_vector_mode (vmode).require ();
>> +  int n_patterns = d->perm.encoding ().npatterns ();
>> +  poly_int64 vec_len = d->perm.length ();
>> +
>> +  for (int i = 0; i < n_patterns; ++i)
>> +if (!known_eq (d->perm[i], i) && !known_eq (d->perm[i], vec_len + i))
>> +  return false;
>> +
>> +  for (int i = n_patterns; i < n_patterns * 2; i++)
>> +if (!d->perm.series_p (i, n_patterns, i, n_patterns)
>> +&& !d->perm.series_p (i, n_patterns, vec_len + i, n_patterns))
>> +  return false;
> 
> Maybe add a comment that we check that the pattern is actually monotonic
> or however you prefet to call it?
> 
> I didn't go through all tests in detail but skimmed several.  All in all
> looks good to me.
So I think that means we want a V2 for the comment updates.  But I think 
we can go ahead and consider V2 pre-approved.

jeff


Re: [PATCH] RISC-V: Use merge approach to optimize vector permutation

2023-06-14 Thread Jeff Law via Gcc-patches




On 6/14/23 09:00, Robin Dapp wrote:

Hi Juzhe,

the general method seems sane and useful (it's not very complicated).
I was just distracted by


Selector = { 0, 17, 2, 19, 4, 21, 6, 23, 8, 9, 10, 27, 12, 29, 14, 31 }, the 
common expression:
{ 0, nunits + 1, 1, nunits + 2, 2, nunits + 3, ...  }

For this selector, we can use vmsltu + vmerge to optimize the codegen.


because it's actually { 0, nunits + 1, 2, nunits + 3, ... } or maybe
{ 0, nunits, 0, nunits, ... } + { 0, 1, 2, 3, ..., nunits - 1 }.

Because of the ascending/monotonic? selector structure we can use vmerge
instead of vrgather.


+/* Recognize the patterns that we can use merge operation to shuffle the
+   vectors. The value of Each element (index i) in selector can only be
+   either i or nunits + i.
+
+   E.g.
+   v = VEC_PERM_EXPR (v0, v1, selector),
+   selector = { 0, nunits + 1, 1, nunits + 2, 2, nunits + 3, ...  }


Same.


+
+   We can transform such pattern into:
+
+   v = vcond_mask (v0, v1, mask),
+   mask = { 0, 1, 0, 1, 0, 1, ... }.  */
+
+static bool
+shuffle_merge_patterns (struct expand_vec_perm_d *d)
+{
+  machine_mode vmode = d->vmode;
+  machine_mode sel_mode = related_int_vector_mode (vmode).require ();
+  int n_patterns = d->perm.encoding ().npatterns ();
+  poly_int64 vec_len = d->perm.length ();
+
+  for (int i = 0; i < n_patterns; ++i)
+if (!known_eq (d->perm[i], i) && !known_eq (d->perm[i], vec_len + i))
+  return false;
+
+  for (int i = n_patterns; i < n_patterns * 2; i++)
+if (!d->perm.series_p (i, n_patterns, i, n_patterns)
+   && !d->perm.series_p (i, n_patterns, vec_len + i, n_patterns))
+  return false;


Maybe add a comment that we check that the pattern is actually monotonic
or however you prefet to call it?

I didn't go through all tests in detail but skimmed several.  All in all
looks good to me.
So I think that means we want a V2 for the comment updates.  But I think 
we can go ahead and consider V2 pre-approved.


jeff


Re: [PATCH] RISC-V: Use merge approach to optimize vector permutation

2023-06-14 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

the general method seems sane and useful (it's not very complicated).
I was just distracted by

> Selector = { 0, 17, 2, 19, 4, 21, 6, 23, 8, 9, 10, 27, 12, 29, 14, 31 }, the 
> common expression:
> { 0, nunits + 1, 1, nunits + 2, 2, nunits + 3, ...  }
> 
> For this selector, we can use vmsltu + vmerge to optimize the codegen.

because it's actually { 0, nunits + 1, 2, nunits + 3, ... } or maybe
{ 0, nunits, 0, nunits, ... } + { 0, 1, 2, 3, ..., nunits - 1 }.

Because of the ascending/monotonic? selector structure we can use vmerge
instead of vrgather.

> +/* Recognize the patterns that we can use merge operation to shuffle the
> +   vectors. The value of Each element (index i) in selector can only be
> +   either i or nunits + i.
> +
> +   E.g.
> +   v = VEC_PERM_EXPR (v0, v1, selector),
> +   selector = { 0, nunits + 1, 1, nunits + 2, 2, nunits + 3, ...  }

Same.

> +
> +   We can transform such pattern into:
> +
> +   v = vcond_mask (v0, v1, mask),
> +   mask = { 0, 1, 0, 1, 0, 1, ... }.  */
> +
> +static bool
> +shuffle_merge_patterns (struct expand_vec_perm_d *d)
> +{
> +  machine_mode vmode = d->vmode;
> +  machine_mode sel_mode = related_int_vector_mode (vmode).require ();
> +  int n_patterns = d->perm.encoding ().npatterns ();
> +  poly_int64 vec_len = d->perm.length ();
> +
> +  for (int i = 0; i < n_patterns; ++i)
> +if (!known_eq (d->perm[i], i) && !known_eq (d->perm[i], vec_len + i))
> +  return false;
> +
> +  for (int i = n_patterns; i < n_patterns * 2; i++)
> +if (!d->perm.series_p (i, n_patterns, i, n_patterns)
> + && !d->perm.series_p (i, n_patterns, vec_len + i, n_patterns))
> +  return false;

Maybe add a comment that we check that the pattern is actually monotonic
or however you prefet to call it?

I didn't go through all tests in detail but skimmed several.  All in all
looks good to me.

Regards
 Robin



[PATCH] RISC-V: Use merge approach to optimize vector permutation

2023-06-13 Thread juzhe . zhong
From: Juzhe-Zhong 

This patch is to optimize the permuation case that is suiteable use
merge approach.

Consider this following case:
typedef int8_t vnx16qi __attribute__((vector_size (16)));

#define MASK_16 0, 17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 
31

void __attribute__ ((noipa))
merge0 (vnx16qi x, vnx16qi y, vnx16qi *out)
{
  vnx16qi v = __builtin_shufflevector ((vnx16qi) x, (vnx16qi) y, MASK_16);
  *(vnx16qi*)out = v;
} 

The gimple IR:
v_3 = VEC_PERM_EXPR ;

Selector = { 0, 17, 2, 19, 4, 21, 6, 23, 8, 9, 10, 27, 12, 29, 14, 31 }, the 
common expression:
{ 0, nunits + 1, 1, nunits + 2, 2, nunits + 3, ...  }

For this selector, we can use vmsltu + vmerge to optimize the codegen.

Before this patch:
merge0:
addia5,sp,16
vl1re8.vv3,0(a5)
li  a5,31
vsetivlizero,16,e8,m1,ta,mu
vmv.v.x v2,a5
lui a5,%hi(.LANCHOR0)
addia5,a5,%lo(.LANCHOR0)
vl1re8.vv1,0(a5)
vl1re8.vv4,0(sp)
vand.vv v1,v1,v2
vmsgeu.vi   v0,v1,16
vrgather.vv v2,v4,v1
vadd.vi v1,v1,-16
vrgather.vv v2,v3,v1,v0.t
vs1r.v  v2,0(a0)
ret

After this patch:
merge0:
addia5,sp,16
vl1re8.vv1,0(a5)
lui a5,%hi(.LANCHOR0)
addia5,a5,%lo(.LANCHOR0)
vsetivlizero,16,e8,m1,ta,ma
vl1re8.vv0,0(a5)
vl1re8.vv2,0(sp)
vmsltu.vi   v0,v0,16
vmerge.vvm  v1,v1,v2,v0
vs1r.v  v1,0(a0)
ret

The key of this optimization is that:
1. mask = vmsltu (selector, nunits)
2. result = vmerge (op0, op1, mask)

gcc/ChangeLog:

* config/riscv/riscv-v.cc (shuffle_merge_patterns): New pattern.
(expand_vec_perm_const_1): Add merge optmization.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-6.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-7.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-6.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-7.c: New test.

---
 gcc/config/riscv/riscv-v.cc   |  52 +
 .../riscv/rvv/autovec/vls-vlmax/merge-1.c | 101 +
 .../riscv/rvv/autovec/vls-vlmax/merge-2.c | 103 +
 .../riscv/rvv/autovec/vls-vlmax/merge-3.c | 109 +
 .../riscv/rvv/autovec/vls-vlmax/merge-4.c | 122 ++
 .../riscv/rvv/autovec/vls-vlmax/merge-5.c |  76 +++
 .../riscv/rvv/autovec/vls-vlmax/merge-6.c |  51 +
 .../riscv/rvv/autovec/vls-vlmax/merge-7.c |  25 +++
 .../riscv/rvv/autovec/vls-vlmax/merge_run-1.c | 119 ++
 .../riscv/rvv/autovec/vls-vlmax/merge_run-2.c | 121 ++
 .../riscv/rvv/autovec/vls-vlmax/merge_run-3.c | 150 +
 .../riscv/rvv/autovec/vls-vlmax/merge_run-4.c | 210 ++
 .../riscv/rvv/autovec/vls-vlmax/merge_run-5.c |  89 
 .../riscv/rvv/autovec/vls-vlmax/merge_run-6.c |  59 +
 .../riscv/rvv/autovec/vls-vlmax/merge_run-7.c |  29 +++
 15 files changed, 1416 insertions(+)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge-1.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge-2.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge-3.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge-4.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge-5.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge-6.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge-7.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-1.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-2.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-3.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-4.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-5.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-6.c
 create mode 100644 
g