[Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c

2023-10-26 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #9 from JuzheZhong  ---
(In reply to Maciej W. Rozycki from comment #7)
> Thank you for all your explanations.  I think I'm still missing something
> here, so I'll write it differently (and let's ignore the tail-agnostic vs
> tail-undisturbed choice for the purpose of this consideration).
> 
> Why is the `vl' value determined by hardware from `avl' by an explicit
> request (!) of the programmer who inserted the vsetvl intrinsics ignored?
> Is the compiler able to prove the use of `avl' in place of `vl' does not
> affect the operation of the VLE32.V and VSE32.V instructions in any way?
> What is the purpose of these intrinsics if they can be freely ignored?
> 
> Please forgive me if my questions seem to you obvious to answer or
> irrelevant, I'm still rather new to this RVV stuff.

As long as a user vsetvl intrinsic has the same SEW/LMUL ratio as the RVV
instruction that follows it, the compiler is free to optimize it.

For example:

vl = __riscv_vsetvl_e32m1 (avl)
__riscv_vadd_vv_i32m1 (...,vl)

A naive way to insert vsetvl:

vsetvl VL, AVL e32 m1
vsetvl zero, VL e32 m1
vadd.vv

However, since they have the same ratio, we can do this:

vsetvl zero, AVL e32 m1
vadd.vv

This is absolutely correct, independent of the hardware.

However, with different ratios:

vl = __riscv_vsetvl_e32m1 (avl)
__riscv_vadd_vv_i64m1 (...,vl)

vsetvl VL, AVL e32 m1
vsetvl zero, VL e64 m1
vadd.vv

We can't optimize this; it is the only correct codegen.
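
For reference, the "ratio" here is SEW/LMUL: e32,m1 gives 32/1 = 32 and
e16,mf2 gives 16/(1/2) = 32, while e64,m1 gives 64/1 = 64.  A minimal
self-check of the same-ratio claim might look like this (a hypothetical
snippet assuming an RVV target, not code from the testsuite):

```c
#include <assert.h>
#include "riscv_vector.h"

/* Hypothetical self-check (not from the testsuite): e32,m1 and e16,mf2
   share the ratio SEW/LMUL = 32, so they have the same VLMAX and both
   vsetvls must return the same vl for any avl.  */
void
check_same_ratio (size_t avl)
{
  size_t vl_e32m1 = __riscv_vsetvl_e32m1 (avl);
  size_t vl_e16mf2 = __riscv_vsetvl_e16mf2 (avl);
  assert (vl_e32m1 == vl_e16mf2); /* same ratio => same vl */
}
```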

Thanks.

[Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c

2023-10-26 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #8 from JuzheZhong  ---
(In reply to Maciej W. Rozycki from comment #7)
> Thank you for all your explanations.  I think I'm still missing something
> here, so I'll write it differently (and let's ignore the tail-agnostic vs
> tail-undisturbed choice for the purpose of this consideration).
> 
> Let me paste the whole assembly code produced here (sans decorations):
> 
>   beq a5,zero,.L2
>   vsetvli zero,a6,e32,m1,tu,ma
> .L3:
>   beq a4,zero,.L7
>   li  a5,0
> .L5:
>   vle32.v v1,0(a0)
>   vle32.v v1,0(a1)
>   vle32.v v1,0(a2)
>   vse32.v v1,0(a3)
>   addi    a5,a5,1
>   bne a4,a5,.L5
> .L7:
>   ret
> .L2:
>   vsetvli zero,a6,e32,m1,tu,ma
>   j   .L3
> 
> This seems to me to correspond to this source code:
> 
>   if (cond)
> __riscv_vsetvl_e32m1(avl);
>   else
> __riscv_vsetvl_e16mf2(avl);
>   for (size_t i = 0; i < n; i += 1) {
> vint32m1_t a = __riscv_vle32_v_i32m1(in1, avl);
> vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, avl);
> vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, avl);
> __riscv_vse32_v_i32m1(out, c, avl);
>   }
> 
> And in that case I'd expect the conditional to be optimised away, as its
> result is ignored (along with the intrinsics) and does not affect actual
> code executed except for the different execution path, i.e.:
> 
>   beq a4,zero,.L7
>   vsetvli zero,a6,e32,m1,tu,ma
>   li  a5,0
> .L5:
>   vle32.v v1,0(a0)
>   vle32.v v1,0(a1)
>   vle32.v v1,0(a2)
>   vse32.v v1,0(a3)
>   addi    a5,a5,1
>   bne a4,a5,.L5
> .L7:
>   ret
> 

Good catch! I think we have a missed optimization here, and I agree this code
is correct and optimal codegen for this case.
For now we have close-to-optimal (but not fully optimal) codegen.

And this optimization should not be done by the VSETVL PASS.

After VSETVL PASS fusion, both the e16mf2 and the e32m1 user vsetvl intrinsics
are fused into e32m1, tu. They are exactly the same, so it is meaningless to
separate them into different blocks (they should end up in a single block).

The reason we miss this optimization is that we expand the user vsetvls
__riscv_vsetvl_e32m1 and __riscv_vsetvl_e16mf2 into 2 different RTL
expressions. The passes that run before VSETVL don't know they are
equivalent, so they keep them in different blocks.

If you change the code as follows:
  if (cond)
vl = __riscv_vsetvl_e32m1(avl);
  else
vl = __riscv_vsetvl_e32m1(avl);

I am sure the codegen will then be as you said above (a single vsetvl e32m1 tu
in a single block).

To optimize it, one alternative approach is to expand all user vsetvl
intrinsics into the same RTL expression (as long as they have the same ratio).


Meaning, expand

__riscv_vsetvl_e64m2
__riscv_vsetvl_e32m1
__riscv_vsetvl_e16mf2
__riscv_vsetvl_e8mf4

(all with ratio SEW/LMUL = 32) into the same RTL expression, since their VL
outputs are definitely the same.

I don't see that this would cause any problems.

But vtypes with different ratios, like e32m1 (ratio 32) and e32mf2 (ratio 64),
should remain different RTL expressions.
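
To make the grouping concrete, here is a hedged illustration (a hypothetical
helper, not GCC code) of how the ratio partitions the vtype settings:

```c
/* Hypothetical helper, not GCC code: the ratio is SEW/LMUL, and only
   same-ratio vtypes could share one RTL expression for their vsetvl.  */
static inline int
vtype_ratio (int sew_bits, double lmul)
{
  return (int) (sew_bits / lmul);
}

/* vtype_ratio (32, 1.0) == 32   e32,m1   -- same group, fusable
   vtype_ratio (16, 0.5) == 32   e16,mf2  -- same group, fusable
   vtype_ratio (32, 0.5) == 64   e32,mf2  -- different group */
```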

I am not sure Kito agrees with this idea.


Another alternative approach is to enhance the bb_reorder PASS.
The VSETVL PASS runs before the bb_reorder PASS, and the current bb_reorder
PASS is unable to fuse these 2 "vsetvl e32m1 tu" instructions into the same
block, because by then we have split them into "real" vsetvls, i.e. RTL
patterns with side effects.

The "real" vsetvl patterns which generate assembly should have side effects
since vsetvl does change global VL/VTYPE status and also set a general
register.

No matter which approach we take to optimize it, I won't do it in GCC-14,
since stage 1 is about to close.  We have a few more features (which are much
more important) that we are planning and working to support in GCC-14.
I am confident that our current RVV GCC VSETVL PASS is already quite optimal
and capable.

After stage 1 closes, we won't do any more optimizations; we will only run
full-coverage testing (for example, running the whole gcc testsuite with
different LMULs and different -march options) and fix bugs.

[Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c

2023-10-26 Thread macro at orcam dot me.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #7 from Maciej W. Rozycki  ---
Thank you for all your explanations.  I think I'm still missing something
here, so I'll write it differently (and let's ignore the tail-agnostic vs
tail-undisturbed choice for the purpose of this consideration).

Let me paste the whole assembly code produced here (sans decorations):

beq a5,zero,.L2
vsetvli zero,a6,e32,m1,tu,ma
.L3:
beq a4,zero,.L7
li  a5,0
.L5:
vle32.v v1,0(a0)
vle32.v v1,0(a1)
vle32.v v1,0(a2)
vse32.v v1,0(a3)
addi    a5,a5,1
bne a4,a5,.L5
.L7:
ret
.L2:
vsetvli zero,a6,e32,m1,tu,ma
j   .L3

This seems to me to correspond to this source code:

  if (cond)
__riscv_vsetvl_e32m1(avl);
  else
__riscv_vsetvl_e16mf2(avl);
  for (size_t i = 0; i < n; i += 1) {
vint32m1_t a = __riscv_vle32_v_i32m1(in1, avl);
vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, avl);
vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, avl);
__riscv_vse32_v_i32m1(out, c, avl);
  }

And in that case I'd expect the conditional to be optimised away, as its
result is ignored (along with the intrinsics) and does not affect actual
code executed except for the different execution path, i.e.:

beq a4,zero,.L7
vsetvli zero,a6,e32,m1,tu,ma
li  a5,0
.L5:
vle32.v v1,0(a0)
vle32.v v1,0(a1)
vle32.v v1,0(a2)
vse32.v v1,0(a3)
addi    a5,a5,1
bne a4,a5,.L5
.L7:
ret

However, the actual source code is as follows:

  size_t vl;
  if (cond)
vl = __riscv_vsetvl_e32m1(avl);
  else
vl = __riscv_vsetvl_e16mf2(avl);
  for (size_t i = 0; i < n; i += 1) {
vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);
vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, vl);
vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, vl);
__riscv_vse32_v_i32m1(out, c, vl);
  }

Based on what you write I'd expect code like this instead:

beq a5,zero,.L2
vsetvli a6,a6,e16,mf2,ta,ma
.L3:
beq a4,zero,.L7
vsetvli zero,a6,e32,m1,tu,ma
li  a5,0
.L5:
vle32.v v1,0(a0)
vle32.v v1,0(a1)
vle32.v v1,0(a2)
vse32.v v1,0(a3)
addi    a5,a5,1
bne a4,a5,.L5
.L7:
ret
.L2:
vsetvli a6,a6,e32,m1,ta,ma
j   .L3

which is roughly what you say LLVM produces.

Why is the `vl' value determined by hardware from `avl' by an explicit
request (!) of the programmer who inserted the vsetvl intrinsics ignored?
Is the compiler able to prove the use of `avl' in place of `vl' does not
affect the operation of the VLE32.V and VSE32.V instructions in any way?
What is the purpose of these intrinsics if they can be freely ignored?

Please forgive me if my questions seem to you obvious to answer or
irrelevant, I'm still rather new to this RVV stuff.

[Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c

2023-10-26 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #6 from JuzheZhong  ---

> I have trouble chasing one down and the source code is so
> convoluted with macros I can't even find the implementation.

I am sorry for causing confusion here.

But because the RVV fusion rules are so complicated, we define them in
riscv-vsetvl.def. To understand the code, I suggest you read
riscv-vsetvl.def directly.

We define all the compatibility, fusion, and availability rules there.

For example, vle16.v (e16, mf2) is compatible with vadd.vv (e32, m1): the
load's EEW of 16 is encoded statically in the instruction and both have the
ratio SEW/LMUL = 32, so these 2 adjacent instructions can share the same
vsetvl (vsetvl e32m1).

Whereas vsub.vv (e16, m1) and vadd.vv (e32, mf2) are not compatible.
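
As a hedged sketch of such a mixed-width pair (a hypothetical function, not
taken from riscv-vsetvl.def or the testsuite), a single "vsetvli e32,m1" can
serve an e16 load plus e32,m1 arithmetic, because the load's EEW is static
and both have ratio SEW/LMUL = 32:

```c
#include "riscv_vector.h"

/* Hypothetical example: vle16.v keeps its static EEW of 16 regardless of
   the vtype SEW, and vsext.vf2 reads an EEW=16 source under vtype e32,m1,
   so a single "vsetvli e32,m1" can serve all three instructions (every
   operand here has the ratio SEW/LMUL = 32).  */
vint32m1_t
load_widen_add (const int16_t *in, vint32m1_t acc, size_t vl)
{
  vint16mf2_t v = __riscv_vle16_v_i16mf2 (in, vl); /* EEW=16, EMUL=mf2 */
  vint32m1_t w = __riscv_vsext_vf2_i32m1 (v, vl);  /* e32,m1: sign-extend */
  return __riscv_vadd_vv_i32m1 (acc, w, vl);       /* e32,m1 */
}
```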

[Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c

2023-10-25 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #5 from JuzheZhong  ---
Yes, I agree that some archs prefer agnostic over undisturbed even with more
vsetvls. That's why I have posted an issue asking whether we can have an
option like

-mprefer-agnostic.

https://github.com/riscv-non-isa/riscv-toolchain-conventions/issues/37


But I think Maciej is asking why GCC fuses vsetvls, changing an
e16mf2 vsetvl into e32m1.


For example:

https://godbolt.org/z/6G9G7Pbe9

No 'TU' is involved here.
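
For context, a source shape consistent with the assembly below might look
like this (a hypothetical reconstruction; the exact source is behind the
godbolt link):

```c
#include "riscv_vector.h"

/* Hypothetical reconstruction of the linked example, matching the
   branches and the vle32/vadd/vse32 loop in the assembly below.  */
void foo(int32_t *in1, int32_t *in2, int32_t *in3, int32_t *out,
         size_t n, int cond, int avl) {
    size_t vl;
    if (cond)
        vl = __riscv_vsetvl_e32m1(avl);
    else
        vl = __riscv_vsetvl_e16mf2(avl >> 2);
    for (size_t i = 0; i < n; i += 1) {
        vint32m1_t v = __riscv_vle32_v_i32m1(in1, vl);
        v = __riscv_vadd_vv_i32m1(v, v, vl);
        __riscv_vse32_v_i32m1(out, v, vl);
    }
}
```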

I think the LLVM codegen looks more reasonable:

beqz    a5, .LBB0_4
vsetvli a1, a6, e32, m1, ta, ma
beqz    a4, .LBB0_3
.LBB0_2:        # =>This Inner Loop Header: Depth=1
vsetvli zero, a1, e32, m1, ta, ma
vle32.v v8, (a0)
vadd.vv v8, v8, v8
addi    a4, a4, -1
vse32.v v8, (a3)
bnez    a4, .LBB0_2
.LBB0_3:
ret
.LBB0_4:
srai    a1, a6, 2
vsetvli a1, a1, e16, mf2, ta, ma
bnez    a4, .LBB0_2
j       .LBB0_3

But GCC is also correct, with more optimization applied:

foo(int*, int*, int*, int*, unsigned long, int, int):
beq a5,zero,.L2
vsetvli a5,a6,e32,m1,ta,ma
.L3:
beq a4,zero,.L10
li  a2,0
.L5:
vle32.v v1,0(a0)
addi    a2,a2,1
vadd.vv v1,v1,v1
vse32.v v1,0(a3)
bne a4,a2,.L5
.L10:
ret
.L2:
sraiw   a5,a6,2
vsetvli zero,a5,e32,m1,ta,ma
j   .L3

[Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c

2023-10-25 Thread kito at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

Kito Cheng  changed:

   What|Removed |Added

 CC||kito at gcc dot gnu.org

--- Comment #4 from Kito Cheng  ---
The testcase itself looks tricky but is right;
it can typically be used to optimize mixed-width (mixed-SEW) operations.

You can refer to the EEW stuff in the v-spec [1]: most loads/stores encode a
static EEW, which is what enables this kind of vsetvli fusion optimization.

[1]
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#52-vector-operands

Here is a (more) practical example:

```c
#include "riscv_vector.h"

void foo(int32_t *in1, int16_t *in2, int16_t *in3, int32_t *out,
         size_t n, int cond, int avl) {
    size_t vl = __riscv_vsetvl_e16mf2(avl);              /* ratio 16/(1/2) = 32 */
    vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);       /* static EEW = 32 */
    vint16mf2_t b = __riscv_vle16_v_i16mf2(in2, vl);     /* static EEW = 16 */
    vint16mf2_t c = __riscv_vle16_v_i16mf2(in3, vl);     /* static EEW = 16 */
    vint32m1_t x = __riscv_vwmacc_vv_i32m1(a, b, c, vl); /* widening e16 -> e32 */
    __riscv_vse32_v_i32m1(out, x, vl);                   /* static EEW = 32 */
}
```

> Is it guaranteed by the RVV specification that the value of `vl' produced
> (which is then supplied as an argument to `__riscv_vle32_v_i32m1', etc.;
> I presume implicitly via the VL CSR as I can't see it in actual assembly
> produced) is going to be the same for all microarchitectures for both:
>
>   vsetvli zero,a6,e32,m1,tu,ma
>
> and:
>
>   vsetvli zero,a6,e16,mf2,ta,ma

This is another trick in this case: tail agnostic vs tail undisturbed

Tail undisturbed has stronger semantics than tail agnostic, so using tail
undisturbed where agnostic was requested is always safe and satisfies the
semantics; the same holds for mask agnostic vs mask undisturbed.

But performance is another story. As far as I know, some uArches implement
agnostic as undisturbed, in which case there is not much difference between
agnostic and undisturbed, so fusing those two vsetvlis becomes a kind of
optimization.

However, as you can imagine, that also means some uArch may implement
agnostic another way: agnostic MAY have better performance than undisturbed,
and we should not fuse those vsetvlis IF we are targeting such a uArch.
Anyway, our cost model for RVV is still in an initial state, so personally I
am fine with this for now, but I guess we need to add some more stuff to
-mtune to handle those differences.

[Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c

2023-10-25 Thread macro at orcam dot me.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #3 from Maciej W. Rozycki  ---
Maybe I'm missing something, but the RVV spec has this for VSETVLI:

"The application specifies the total number of elements to be processed
(the application vector length or AVL) as a candidate value for vl, and
the hardware responds via a general-purpose register with the (frequently
smaller) number of elements that the hardware will handle per iteration
(stored in vl), based on the microarchitectural implementation and the
vtype setting."

Is it guaranteed by the RVV specification that the value of `vl' produced
(which is then supplied as an argument to `__riscv_vle32_v_i32m1', etc.;
I presume implicitly via the VL CSR as I can't see it in actual assembly
produced) is going to be the same for all microarchitectures for both:

vsetvli zero,a6,e32,m1,tu,ma

and:

vsetvli zero,a6,e16,mf2,ta,ma

?

If it is, then the code is still awkward and the conditional ought to be
removed and the code paths merged, as both legs execute the same
instruction.

What is the definition of the `vl' parameter to `__riscv_vle32_v_i32m1',
etc. anyway?  I have trouble chasing one down and the source code is so
convoluted with macros I can't even find the implementation.

[Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c

2023-10-25 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #2 from JuzheZhong  ---
To demonstrate the idea, here is a simple example to make it easier to
understand:

https://godbolt.org/z/Gxzjv48Ec

#include "riscv_vector.h"

void foo(int32_t *in1, int32_t *in2, int32_t *in3, int32_t *out, size_t n, int
cond, int avl) {
size_t vl = __riscv_vsetvl_e16mf2(avl >> 2);
vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);
vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, vl);
vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, vl);
__riscv_vse32_v_i32m1(out, c, vl);
}

LLVM:

srai    a4, a6, 2
vsetvli zero, a4, e16, mf2, ta, ma
vle32.v v8, (a0)
vsetvli zero, zero, e32, m1, tu, ma
vle32.v v8, (a1)
vle32.v v8, (a2)
vse32.v v8, (a3)
ret

LLVM generates the naive code, following the intrinsics literally:
as you said, the first vsetvli keeps e16mf2 unchanged.

Here is the codegen of GCC:

srai    a6,a6,2
vsetvli a6,a6,e32,m1,tu,ma
vle32.v v1,0(a0)
vle32.v v1,0(a1)
vle32.v v1,0(a2)
vse32.v v1,0(a3)
ret

Since e16,mf2 has the same ratio as e32,m1, we change the first vsetvl from
e16,mf2 into e32,m1 with tu.

Then we can eliminate the second vsetvl.

That is what we call "local fusion" here.

The case you mentioned is "global fusion", but they are the same thing:
fuse vsetvls according to the RVV ISA.

So, for the example you mentioned, GCC is generating correct code.

[Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c

2023-10-25 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

JuzheZhong  changed:

   What|Removed |Added

 CC||juzhe.zhong at rivai dot ai

--- Comment #1 from JuzheZhong  ---
No, it is correct. It's a fancy optimization we have done in the VSETVL PASS.

e16mf2 has the same ratio as e32m1 (SEW/LMUL = 32).

The loop that follows demands e32m1 and TU, so we fuse that demand into the
e16mf2 vsetvl (__riscv_vsetvl_e16mf2(avl)), changing it into e32m1 with TU.

This is a valid optimization.

If you change e16mf2 into e16m1, I am sure the fusion will be blocked.
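
For instance (a hedged sketch, not vsetvl-11.c itself), with e16m1 the ratio
becomes SEW/LMUL = 16, which no longer matches the e32,m1 body, so the two
vsetvls must stay separate:

```c
#include "riscv_vector.h"

/* Hypothetical variant: e16,m1 has ratio 16 while the e32,m1 memory ops
   below have ratio 32, so the fusion described above is blocked and the
   compiler must emit separate vsetvls.  */
void foo_blocked(int32_t *in1, int32_t *in2, int32_t *out, size_t avl) {
    size_t vl = __riscv_vsetvl_e16m1(avl);           /* ratio 16 */
    vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);   /* ratio 32 */
    vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, vl);
    __riscv_vse32_v_i32m1(out, b, vl);
}
```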