Re: Re: [PATCH V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization

2023-10-26 Thread juzhe.zh...@rivai.ai
Oh. It's surprising.
I think current RVV GCC is not stable and buggy so that different FAILs in 
different machines.

Currently, we have 2 middle-end bugs:

1. COND_LEN_XXX: 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111760 
2. Gather load bug:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111970 

I guess they are related to make RVV GCC unstable, so testing various in 
different machines.



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-10-26 16:34
To: juzhe.zh...@rivai.ai; Kito.cheng
CC: rdapp.gcc; gcc-patches; kito.cheng; jeffreyalaw; Patrick O'Neill
Subject: Re: [PATCH V2] RISC-V: Add AVL propagation PASS for RVV 
auto-vectorization
> I have sent V3 with adapting testcases (2 additional dump FAILs detected by 
> both Pan Li and Patrick).
> No need to review.
> 
> I will wait for patrick is ok to ignore popcount FAILs for now then commit it.
 
Just to confirm:  I can now also reproduce the popcount fail on my machine
without your patch.
 
Regards
Robin
 


Re: [PATCH V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization

2023-10-26 Thread Robin Dapp
> I have sent V3 with adapting testcases (2 additional dump FAILs detected by 
> both Pan Li and Patrick).
> No need to review.
> 
> I will wait for patrick is ok to ignore popcount FAILs for now then commit it.

Just to confirm:  I can now also reproduce the popcount fail on my machine
without your patch.

Regards
 Robin


Re: Re: [PATCH V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization

2023-10-26 Thread juzhe.zh...@rivai.ai
Thanks Kito.

I have sent V3 with adapting testcases (2 additional dump FAILs detected by 
both Pan Li and Patrick).
No need to review.

I will wait for patrick is ok to ignore popcount FAILs for now then commit it.



juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-10-26 15:51
To: Juzhe-Zhong
CC: gcc-patches; kito.cheng; jeffreyalaw; rdapp.gcc; Patrick O'Neill
Subject: Re: [PATCH V2] RISC-V: Add AVL propagation PASS for RVV 
auto-vectorization
LGTM, Thanks, it's really awesome - the implementation is simpler than
I expected, it's another great improvement for RISC-V GCC!
 
Just make sure Patrick gives a green light on the testing before
committing the patch :)
 
 
 
 
On Wed, Oct 25, 2023 at 8:05 PM Juzhe-Zhong  wrote:
>
> This patch addresses the redundant AVL/VL toggling in RVV partial 
> auto-vectorization
> which is a known issue for a long time and I finally find the time to address 
> it.
>
> Consider a simple vector addition operation:
>
> https://godbolt.org/z/7hfGfEjW3
>
> void
> foo (int *__restrict a,
>  int *__restrict b,
>  int *__restrict n)
> {
>   for (int i = 0; i < n; i++)
>   a[i] = a[i] + b[i];
> }
>
> Optimized IR:
>
> Loop body:
>   _38 = .SELECT_VL (ivtmp_36, POLY_INT_CST [4, 4]);  
> -> vsetvli a5,a2,e8,mf4,ta,ma
>   ...
>   vect__4.8_27 = .MASK_LEN_LOAD (vectp_a.6_29, 32B, { -1, ... }, _38, 0);
> -> vle32.v v2,0(a0)
>   vect__6.11_20 = .MASK_LEN_LOAD (vectp_b.9_25, 32B, { -1, ... }, _38, 0);   
> -> vle32.v v1,0(a1)
>   vect__7.12_19 = vect__6.11_20 + vect__4.8_27;  
> -> vsetvli a6,zero,e32,m1,ta,ma + vadd.vv v1,v1,v2
>   .MASK_LEN_STORE (vectp_a.13_11, 32B, { -1, ... }, _38, 0, vect__7.12_19);  
> -> vsetvli zero,a5,e32,m1,ta,ma + vse32.v v1,0(a4)
>
> We can see 2 redundant vsetvls inside the loop body due to AVL/VL toggling.
> The AVL/VL toggling is because we are missing LEN information in simple 
> PLUS_EXPR GIMPLE assignment:
>
> vect__7.12_19 = vect__6.11_20 + vect__4.8_27;
>
> GCC apply partial predicate load/store and un-predicated full vector 
> operation on partial vectorization.
> Such flow are used by all other targets like ARM SVE (RVV also uses such 
> flow):
>
> ARM SVE:
>
> .L3:
> ld1wz30.s, p7/z, [x0, x3, lsl 2]   -> predicated load
> ld1wz31.s, p7/z, [x1, x3, lsl 2]   -> predicated load
> add z31.s, z31.s, z30.s-> un-predicated add
> st1wz31.s, p7, [x0, x3, lsl 2] -> predicated store
>
> Such vectorization flow causes AVL/VL toggling on RVV so we need AVL 
> propagation PASS for it.
>
> Also, It's very unlikely that we can apply predicated operations on all 
> vectorization for following reasons:
>
> 1. It's very heavy workload to support them on all vectorization and we don't 
> see any benefits if we can handle that on targets backend.
> 2. Changing Loop vectorizer for it will make code base ugly and hard to 
> maintain.
> 3. We will need so many patterns for all operations. Not only COND_LEN_ADD, 
> COND_LEN_SUB, 
>We also need COND_LEN_EXTEND, , COND_LEN_CEIL, ... .. over 100+ 
> patterns, unreasonable number of patterns.
>
> To conclude, we prefer un-predicated operations here, and design a nice and 
> clean AVL propagation PASS for it to elide the redundant vsetvls
> due to AVL/VL toggling.
>
> The second question is that why we separate a PASS called AVL propagation. 
> Why not optimize it in VSETVL PASS (We definitetly can optimize AVL in VSETVL 
> PASS)
>
> Frankly, I was planning to address such issue in VSETVL PASS that's why we 
> recently refactored VSETVL PASS. However, I changed my mind recently after 
> several
> experiments and tries.
>
> The reasons as follows:
>
> 1. For code base management and maintainience. Current VSETVL PASS is 
> complicated enough and aleady has enough aggressive and fancy optimizations 
> which
>turns out it can always generate optimal codegen in most of the cases. 
> It's not a good idea keep adding more features into VSETVL PASS to make VSETVL
>  PASS become heavy and heavy again, then we will need to refactor it 
> again in the future.
>  Actuall, the VSETVL PASS is very stable and optimal after the recent 
> refactoring. Hopefully, we should not change VSETVL PASS any more except the 
> minor
>  fixes.
>
> 2. vsetvl insertion (VSETVL PASS does this thing) and AVL propagation are 2 
> different things,  I don't think we should fuse them into same PASS.
>
> 3. VSETVL PASS is an post-RA PASS, wheras AVL propagtion should be done 
> before RA which can reduce register allocatio

Re: [PATCH V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization

2023-10-26 Thread Kito Cheng
LGTM, Thanks, it's really awesome - the implementation is simpler than
I expected, it's another great improvement for RISC-V GCC!

Just make sure Patrick gives a green light on the testing before
committing the patch :)




On Wed, Oct 25, 2023 at 8:05 PM Juzhe-Zhong  wrote:
>
> This patch addresses the redundant AVL/VL toggling in RVV partial 
> auto-vectorization
> which is a known issue for a long time and I finally find the time to address 
> it.
>
> Consider a simple vector addition operation:
>
> https://godbolt.org/z/7hfGfEjW3
>
> void
> foo (int *__restrict a,
>  int *__restrict b,
>  int *__restrict n)
> {
>   for (int i = 0; i < n; i++)
>   a[i] = a[i] + b[i];
> }
>
> Optimized IR:
>
> Loop body:
>   _38 = .SELECT_VL (ivtmp_36, POLY_INT_CST [4, 4]);  
> -> vsetvli a5,a2,e8,mf4,ta,ma
>   ...
>   vect__4.8_27 = .MASK_LEN_LOAD (vectp_a.6_29, 32B, { -1, ... }, _38, 0);
> -> vle32.v v2,0(a0)
>   vect__6.11_20 = .MASK_LEN_LOAD (vectp_b.9_25, 32B, { -1, ... }, _38, 0);   
> -> vle32.v v1,0(a1)
>   vect__7.12_19 = vect__6.11_20 + vect__4.8_27;  
> -> vsetvli a6,zero,e32,m1,ta,ma + vadd.vv v1,v1,v2
>   .MASK_LEN_STORE (vectp_a.13_11, 32B, { -1, ... }, _38, 0, vect__7.12_19);  
> -> vsetvli zero,a5,e32,m1,ta,ma + vse32.v v1,0(a4)
>
> We can see 2 redundant vsetvls inside the loop body due to AVL/VL toggling.
> The AVL/VL toggling is because we are missing LEN information in simple 
> PLUS_EXPR GIMPLE assignment:
>
> vect__7.12_19 = vect__6.11_20 + vect__4.8_27;
>
> GCC apply partial predicate load/store and un-predicated full vector 
> operation on partial vectorization.
> Such flow are used by all other targets like ARM SVE (RVV also uses such 
> flow):
>
> ARM SVE:
>
> .L3:
> ld1wz30.s, p7/z, [x0, x3, lsl 2]   -> predicated load
> ld1wz31.s, p7/z, [x1, x3, lsl 2]   -> predicated load
> add z31.s, z31.s, z30.s-> un-predicated add
> st1wz31.s, p7, [x0, x3, lsl 2] -> predicated store
>
> Such vectorization flow causes AVL/VL toggling on RVV so we need AVL 
> propagation PASS for it.
>
> Also, It's very unlikely that we can apply predicated operations on all 
> vectorization for following reasons:
>
> 1. It's very heavy workload to support them on all vectorization and we don't 
> see any benefits if we can handle that on targets backend.
> 2. Changing Loop vectorizer for it will make code base ugly and hard to 
> maintain.
> 3. We will need so many patterns for all operations. Not only COND_LEN_ADD, 
> COND_LEN_SUB, 
>We also need COND_LEN_EXTEND, , COND_LEN_CEIL, ... .. over 100+ 
> patterns, unreasonable number of patterns.
>
> To conclude, we prefer un-predicated operations here, and design a nice and 
> clean AVL propagation PASS for it to elide the redundant vsetvls
> due to AVL/VL toggling.
>
> The second question is that why we separate a PASS called AVL propagation. 
> Why not optimize it in VSETVL PASS (We definitetly can optimize AVL in VSETVL 
> PASS)
>
> Frankly, I was planning to address such issue in VSETVL PASS that's why we 
> recently refactored VSETVL PASS. However, I changed my mind recently after 
> several
> experiments and tries.
>
> The reasons as follows:
>
> 1. For code base management and maintainience. Current VSETVL PASS is 
> complicated enough and aleady has enough aggressive and fancy optimizations 
> which
>turns out it can always generate optimal codegen in most of the cases. 
> It's not a good idea keep adding more features into VSETVL PASS to make VSETVL
>  PASS become heavy and heavy again, then we will need to refactor it 
> again in the future.
>  Actuall, the VSETVL PASS is very stable and optimal after the recent 
> refactoring. Hopefully, we should not change VSETVL PASS any more except the 
> minor
>  fixes.
>
> 2. vsetvl insertion (VSETVL PASS does this thing) and AVL propagation are 2 
> different things,  I don't think we should fuse them into same PASS.
>
> 3. VSETVL PASS is an post-RA PASS, wheras AVL propagtion should be done 
> before RA which can reduce register allocation.
>
> 4. This patch's AVL propagation PASS only does AVL propagation for RVV 
> partial auto-vectorization situations.
>This patch's codes are only hundreds lines which is very managable and can 
> be very easily extended features and enhancements.
>  We can easily extend and enhance more AVL propagation in a clean and 
> separate PASS in the future. (If we do it on VSETVL PASS, we will complicate
>  VSETVL PASS again which is already so complicated.)
>
> Here is an example to demonstrate more:
>
> https://godbolt.org/z/bE86sv3q5
>
> void foo2 (int *__restrict a,
>   int *__restrict b,
>   int *__restrict c,
>   int *__restrict a2,
>   int *__restrict b2,
>   int *__restrict c2,
>   int *__restrict a3,
>   int *__restrict b3,
>   int *__re

[PATCH V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization

2023-10-25 Thread Juzhe-Zhong
This patch addresses the redundant AVL/VL toggling in RVV partial 
auto-vectorization
which is a known issue for a long time and I finally find the time to address 
it.

Consider a simple vector addition operation:

https://godbolt.org/z/7hfGfEjW3

void
foo (int *__restrict a,
 int *__restrict b,
 int *__restrict n)
{
  for (int i = 0; i < n; i++)
  a[i] = a[i] + b[i];
}

Optimized IR:

Loop body:
  _38 = .SELECT_VL (ivtmp_36, POLY_INT_CST [4, 4]);  -> 
vsetvli a5,a2,e8,mf4,ta,ma
  ...
  vect__4.8_27 = .MASK_LEN_LOAD (vectp_a.6_29, 32B, { -1, ... }, _38, 0);-> 
vle32.v v2,0(a0)
  vect__6.11_20 = .MASK_LEN_LOAD (vectp_b.9_25, 32B, { -1, ... }, _38, 0);   -> 
vle32.v v1,0(a1)
  vect__7.12_19 = vect__6.11_20 + vect__4.8_27;  -> 
vsetvli a6,zero,e32,m1,ta,ma + vadd.vv v1,v1,v2
  .MASK_LEN_STORE (vectp_a.13_11, 32B, { -1, ... }, _38, 0, vect__7.12_19);  -> 
vsetvli zero,a5,e32,m1,ta,ma + vse32.v v1,0(a4)

We can see 2 redundant vsetvls inside the loop body due to AVL/VL toggling.
The AVL/VL toggling is because we are missing LEN information in simple 
PLUS_EXPR GIMPLE assignment:

vect__7.12_19 = vect__6.11_20 + vect__4.8_27;

GCC apply partial predicate load/store and un-predicated full vector operation 
on partial vectorization.
Such flow are used by all other targets like ARM SVE (RVV also uses such flow):

ARM SVE:
   
.L3:
ld1wz30.s, p7/z, [x0, x3, lsl 2]   -> predicated load
ld1wz31.s, p7/z, [x1, x3, lsl 2]   -> predicated load
add z31.s, z31.s, z30.s-> un-predicated add
st1wz31.s, p7, [x0, x3, lsl 2] -> predicated store

Such vectorization flow causes AVL/VL toggling on RVV so we need AVL 
propagation PASS for it.

Also, It's very unlikely that we can apply predicated operations on all 
vectorization for following reasons:

1. It's very heavy workload to support them on all vectorization and we don't 
see any benefits if we can handle that on targets backend.
2. Changing Loop vectorizer for it will make code base ugly and hard to 
maintain.
3. We will need so many patterns for all operations. Not only COND_LEN_ADD, 
COND_LEN_SUB, 
   We also need COND_LEN_EXTEND, , COND_LEN_CEIL, ... .. over 100+ 
patterns, unreasonable number of patterns.

To conclude, we prefer un-predicated operations here, and design a nice and 
clean AVL propagation PASS for it to elide the redundant vsetvls
due to AVL/VL toggling.

The second question is that why we separate a PASS called AVL propagation. Why 
not optimize it in VSETVL PASS (We definitetly can optimize AVL in VSETVL PASS)

Frankly, I was planning to address such issue in VSETVL PASS that's why we 
recently refactored VSETVL PASS. However, I changed my mind recently after 
several
experiments and tries.

The reasons as follows:

1. For code base management and maintainience. Current VSETVL PASS is 
complicated enough and aleady has enough aggressive and fancy optimizations 
which
   turns out it can always generate optimal codegen in most of the cases. It's 
not a good idea keep adding more features into VSETVL PASS to make VSETVL
 PASS become heavy and heavy again, then we will need to refactor it 
again in the future.
 Actuall, the VSETVL PASS is very stable and optimal after the recent 
refactoring. Hopefully, we should not change VSETVL PASS any more except the 
minor
 fixes.

2. vsetvl insertion (VSETVL PASS does this thing) and AVL propagation are 2 
different things,  I don't think we should fuse them into same PASS.

3. VSETVL PASS is an post-RA PASS, wheras AVL propagtion should be done before 
RA which can reduce register allocation.

4. This patch's AVL propagation PASS only does AVL propagation for RVV partial 
auto-vectorization situations.
   This patch's codes are only hundreds lines which is very managable and can 
be very easily extended features and enhancements.
 We can easily extend and enhance more AVL propagation in a clean and 
separate PASS in the future. (If we do it on VSETVL PASS, we will complicate 
 VSETVL PASS again which is already so complicated.) 

Here is an example to demonstrate more:

https://godbolt.org/z/bE86sv3q5

void foo2 (int *__restrict a,
  int *__restrict b,
  int *__restrict c,
  int *__restrict a2,
  int *__restrict b2,
  int *__restrict c2,
  int *__restrict a3,
  int *__restrict b3,
  int *__restrict c3,
  int *__restrict a4,
  int *__restrict b4,
  int *__restrict c4,
  int *__restrict a5,
  int *__restrict b5,
  int *__restrict c5,
  int n)
{
for (int i = 0; i < n; i++){
  a[i] = b[i] + c[i];
  b5[i] = b[i] + c[i];
  a2[i] = b2[i] + c2[i];
  a3[i] = b3[i] + c3[i];
  a4[i] = b4[i] + c4[i];
  a5[i] = a[i] + a4[i];
  a[i] = a5[i] + b5[i]+ a[i];

  a[i] = a[i] + c[i];
  b5[i] = a[i] +