[Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #9 from JuzheZhong ---
(In reply to Maciej W. Rozycki from comment #7)
> Thank you for all your explanations.  I think I'm still missing something
> here, so I'll write it differently (and let's ignore the tail-agnostic vs
> tail-undisturbed choice for the purpose of this consideration).
>
> Why is the `vl' value determined by hardware from `avl' by an explicit
> request (!) of the programmer who inserted the vsetvl intrinsics ignored?
> Is the compiler able to prove the use of `avl' in place of `vl' does not
> affect the operation of the VLE32.V and VSE32.V instructions in any way?
> What is the purpose of these intrinsics if they can be freely ignored?
>
> Please forgive me if my questions seem to you obvious to answer or
> irrelevant, I'm still rather new to this RVV stuff.

As long as the user vsetvl intrinsic has the same SEW/LMUL ratio as the RVV
instructions that follow it, the compiler is free to optimize it.  For
example:

  vl = __riscv_vsetvl_e32m1 (avl);
  __riscv_vadd_vv_i32m1 (..., vl);

A naive way to insert the vsetvls:

  vsetvl VL, AVL, e32, m1
  vsetvl zero, VL, e32, m1
  vadd.vv

However, since they have the same ratio, we can instead emit:

  vsetvl zero, AVL, e32, m1
  vadd.vv

This is absolutely correct, independent of the hardware.  With different
ratios, however:

  vl = __riscv_vsetvl_e32m1 (avl);
  __riscv_vadd_vv_i64m1 (..., vl);

  vsetvl VL, AVL, e32, m1
  vsetvl zero, VL, e64, m1
  vadd.vv

we can't optimize it; this is the only correct codegen.

Thanks.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #8 from JuzheZhong ---
(In reply to Maciej W. Rozycki from comment #7)
> Thank you for all your explanations.  I think I'm still missing something
> here, so I'll write it differently (and let's ignore the tail-agnostic vs
> tail-undisturbed choice for the purpose of this consideration).
>
> Let me paste the whole assembly code produced here (sans decorations):
>
>         beq     a5,zero,.L2
>         vsetvli zero,a6,e32,m1,tu,ma
> .L3:
>         beq     a4,zero,.L7
>         li      a5,0
> .L5:
>         vle32.v v1,0(a0)
>         vle32.v v1,0(a1)
>         vle32.v v1,0(a2)
>         vse32.v v1,0(a3)
>         addi    a5,a5,1
>         bne     a4,a5,.L5
> .L7:
>         ret
> .L2:
>         vsetvli zero,a6,e32,m1,tu,ma
>         j       .L3
>
> This seems to me to correspond to this source code:
>
>   if (cond)
>     __riscv_vsetvl_e32m1(avl);
>   else
>     __riscv_vsetvl_e16mf2(avl);
>   for (size_t i = 0; i < n; i += 1) {
>     vint32m1_t a = __riscv_vle32_v_i32m1(in1, avl);
>     vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, avl);
>     vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, avl);
>     __riscv_vse32_v_i32m1(out, c, avl);
>   }
>
> And in that case I'd expect the conditional to be optimised away, as its
> result is ignored (along with the intrinsics) and does not affect actual
> code executed except for the different execution path, i.e.:
>
>         beq     a4,zero,.L7
>         vsetvli zero,a6,e32,m1,tu,ma
>         li      a5,0
> .L5:
>         vle32.v v1,0(a0)
>         vle32.v v1,0(a1)
>         vle32.v v1,0(a2)
>         vse32.v v1,0(a3)
>         addi    a5,a5,1
>         bne     a4,a5,.L5
> .L7:
>         ret

Good catch!  I think we have a missed optimization here, and I agree the
code you show is the correct and optimal codegen for this case.  What we
generate now is close to optimal, but not optimal enough.  This
optimization should not be done by the VSETVL pass, though: after VSETVL
pass fusion, both the e16mf2 and e32m1 user vsetvl intrinsics are fused
into e32m1, tu.  They are then exactly the same, so it is pointless to keep
them in separate blocks (they should be merged into a single block).
The reason we miss this optimization is that we expand the user vsetvl
intrinsics __riscv_vsetvl_e32m1 and __riscv_vsetvl_e16mf2 into two
different RTL expressions.  The passes that run before VSETVL don't know
they are equivalent, so they keep them in separate blocks.  If you change
the code as follows:

  if (cond)
    vl = __riscv_vsetvl_e32m1(avl);
  else
    vl = __riscv_vsetvl_e32m1(avl);

I am sure the codegen will be as you said above (a single vsetvl e32m1 tu
in a single block).

To optimize this, one alternative approach is to expand all user vsetvl
intrinsics into the same RTL expression as long as they have the same
SEW/LMUL ratio.  That means expanding __riscv_vsetvl_e64m1,
__riscv_vsetvl_e32m1, __riscv_vsetvl_e16mf2, and __riscv_vsetvl_e8mf8 into
the same RTL expression whenever they share the same ratio, since their VL
outputs are then guaranteed to be identical.  I don't see how that would
cause any problems.  But intrinsics with different ratios, like e32m1 and
e32mf2, must remain different RTL expressions.  I am not sure Kito agrees
with this idea.

Another alternative approach is to enhance the bb_reorder pass.  The VSETVL
pass runs before bb_reorder, and the current bb_reorder pass is unable to
fuse these two e32m1 tu vsetvls into the same block, because by then we
have split them into "real" vsetvl patterns, which are RTL patterns with
side effects.  The "real" vsetvl patterns that generate the assembly must
have side effects, since vsetvl changes the global VL/VTYPE status and also
sets a general register.

No matter which approach we take, I won't do it in GCC 14, since stage 1 is
soon to close.  We have a few more features (which are much more important)
that we are planning and working to support in GCC 14.  I am confident that
our current RVV VSETVL pass is optimal and fancy enough.  After stage 1
closes, we won't do any more optimization work; we will only run
full-coverage testing (for example, running the whole GCC testsuite with
different LMUL settings and different -march options) and fix bugs.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #7 from Maciej W. Rozycki ---
Thank you for all your explanations.  I think I'm still missing something
here, so I'll write it differently (and let's ignore the tail-agnostic vs
tail-undisturbed choice for the purpose of this consideration).

Let me paste the whole assembly code produced here (sans decorations):

        beq     a5,zero,.L2
        vsetvli zero,a6,e32,m1,tu,ma
.L3:
        beq     a4,zero,.L7
        li      a5,0
.L5:
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        addi    a5,a5,1
        bne     a4,a5,.L5
.L7:
        ret
.L2:
        vsetvli zero,a6,e32,m1,tu,ma
        j       .L3

This seems to me to correspond to this source code:

  if (cond)
    __riscv_vsetvl_e32m1(avl);
  else
    __riscv_vsetvl_e16mf2(avl);
  for (size_t i = 0; i < n; i += 1) {
    vint32m1_t a = __riscv_vle32_v_i32m1(in1, avl);
    vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, avl);
    vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, avl);
    __riscv_vse32_v_i32m1(out, c, avl);
  }

And in that case I'd expect the conditional to be optimised away, as its
result is ignored (along with the intrinsics) and does not affect actual
code executed except for the different execution path, i.e.:

        beq     a4,zero,.L7
        vsetvli zero,a6,e32,m1,tu,ma
        li      a5,0
.L5:
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        addi    a5,a5,1
        bne     a4,a5,.L5
.L7:
        ret

However, the actual source code is as follows:

  size_t vl;
  if (cond)
    vl = __riscv_vsetvl_e32m1(avl);
  else
    vl = __riscv_vsetvl_e16mf2(avl);
  for (size_t i = 0; i < n; i += 1) {
    vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);
    vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, vl);
    vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, vl);
    __riscv_vse32_v_i32m1(out, c, vl);
  }

Based on what you write I'd expect code like this instead:

        beq     a5,zero,.L2
        vsetvli a6,a6,e16,mf2,ta,ma
.L3:
        beq     a4,zero,.L7
        vsetvli zero,a6,e32,m1,tu,ma
        li      a5,0
.L5:
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        addi    a5,a5,1
        bne     a4,a5,.L5
.L7:
        ret
.L2:
        vsetvli a6,a6,e32,m1,ta,ma
        j       .L3

which is roughly what you say LLVM produces.

Why is the `vl' value determined by hardware from `avl' by an explicit
request (!) of the programmer who inserted the vsetvl intrinsics ignored?
Is the compiler able to prove the use of `avl' in place of `vl' does not
affect the operation of the VLE32.V and VSE32.V instructions in any way?
What is the purpose of these intrinsics if they can be freely ignored?

Please forgive me if my questions seem to you obvious to answer or
irrelevant, I'm still rather new to this RVV stuff.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #6 from JuzheZhong ---
> I have trouble chasing one down and the source code is so
> convoluted with macros I can't even find the implementation.

I am sorry for causing you confusion here.  Because the RVV fusion rules
are so complicated, we define them in riscv-vsetvl.def.  To understand the
code, I suggest you read riscv-vsetvl.def directly; we define all the
compatibility, fusion, and availability rules there.

For example, vle16.v (e16,m1) is compatible with vadd.vv (e32,mf2); in
this case, the two adjacent instructions vle16.v (e16,m1) and vadd.vv
(e32,mf2) can share the same vsetvl (vsetvl e32,mf2).  Whereas vsub.vv
(e16,m1) and vadd.vv (e32,mf2) are not compatible.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #5 from JuzheZhong ---
Yes, I agree that some archs prefer agnostic over undisturbed even at the
cost of more vsetvls.  That's why I posted an issue asking whether we can
have an option like -mprefer-agnostic:

https://github.com/riscv-non-isa/riscv-toolchain-conventions/issues/37

But I think Maciej is worried about why GCC fuses vsetvls and changes the
e16mf2 vsetvl into e32m1.  For example (no 'TU' involved):

https://godbolt.org/z/6G9G7Pbe9

I think the LLVM codegen looks more reasonable:

        beqz    a5, .LBB0_4
        vsetvli a1, a6, e32, m1, ta, ma
        beqz    a4, .LBB0_3
.LBB0_2:                       # =>This Inner Loop Header: Depth=1
        vsetvli zero, a1, e32, m1, ta, ma
        vle32.v v8, (a0)
        vadd.vv v8, v8, v8
        addi    a4, a4, -1
        vse32.v v8, (a3)
        bnez    a4, .LBB0_2
.LBB0_3:
        ret
.LBB0_4:
        srai    a1, a6, 2
        vsetvli a1, a1, e16, mf2, ta, ma
        bnez    a4, .LBB0_2
        j       .LBB0_3

But GCC is correct, with optimizations:

foo(int*, int*, int*, int*, unsigned long, int, int):
        beq     a5,zero,.L2
        vsetvli a5,a6,e32,m1,ta,ma
.L3:
        beq     a4,zero,.L10
        li      a2,0
.L5:
        vle32.v v1,0(a0)
        addi    a2,a2,1
        vadd.vv v1,v1,v1
        vse32.v v1,0(a3)
        bne     a4,a2,.L5
.L10:
        ret
.L2:
        sraiw   a5,a6,2
        vsetvli zero,a5,e32,m1,ta,ma
        j       .L3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

Kito Cheng changed:

           What    |Removed                     |Added
                 CC|                            |kito at gcc dot gnu.org

--- Comment #4 from Kito Cheng ---
The testcase itself looks tricky but is right; it is the kind of thing that
can typically be used to optimize mixed-width (mixed-SEW) operations.  You
can refer to the EEW material in the V spec [1]: most loads and stores
encode a static EEW, so this kind of vsetvli fusion optimization can be
applied.

[1]
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#52-vector-operands

To give a (more) practical example here:

```c
#include "riscv_vector.h"
void foo(int32_t *in1, int16_t *in2, int16_t *in3, int32_t *out, size_t n, int
cond, int avl) {

  size_t vl = __riscv_vsetvl_e16mf2(avl);

  vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);
  vint16mf2_t b = __riscv_vle16_v_i16mf2(in2, vl);
  vint16mf2_t c = __riscv_vle16_v_i16mf2(in3, vl);
  vint32m1_t x = __riscv_vwmacc_vv_i32m1(a, b, c, vl);
  __riscv_vse32_v_i32m1(out, x, vl);
}
```

> Is it guaranteed by the RVV specification that the value of `vl' produced
> (which is then supplied as an argument to `__riscv_vle32_v_i32m1', etc.;
> I presume implicitly via the VL CSR as I can't see it in actual assembly
> produced) is going to be the same for all microarchitectures for both:
>
>   vsetvli zero,a6,e32,m1,tu,ma
>
> and:
>
>   vsetvli zero,a6,e16,mf2,ta,ma

This is another trick in this case: tail agnostic vs tail undisturbed.
Tail undisturbed has stronger semantics than tail agnostic, so using tail
undisturbed where agnostic was requested is always safe and satisfies the
semantics; the same holds for mask agnostic vs mask undisturbed.

But performance is another story.  As far as I know, some uArches implement
agnostic as undisturbed, which means agnostic and undisturbed make little
difference there, so fusing those two vsetvlis becomes a kind of
optimization.  However, as you can imagine, that also means some uArches
implement agnostic another way: agnostic MAY have better performance than
undisturbed, and we should not fuse those vsetvlis if we are targeting such
a uArch.  Anyway, our cost model for RVV is still in an initial state, so
personally I am fine with this for now, but I guess we need to add some
more stuff to -mtune to handle those differences.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #3 from Maciej W. Rozycki ---
Maybe I'm missing something, but the RVV spec has this for VSETVLI:

"The application specifies the total number of elements to be processed
(the application vector length or AVL) as a candidate value for vl, and the
hardware responds via a general-purpose register with the (frequently
smaller) number of elements that the hardware will handle per iteration
(stored in vl), based on the microarchitectural implementation and the
vtype setting."

Is it guaranteed by the RVV specification that the value of `vl' produced
(which is then supplied as an argument to `__riscv_vle32_v_i32m1', etc.;
I presume implicitly via the VL CSR as I can't see it in actual assembly
produced) is going to be the same for all microarchitectures for both:

        vsetvli zero,a6,e32,m1,tu,ma

and:

        vsetvli zero,a6,e16,mf2,ta,ma

?  If it is, then still the code is awkward and the conditional ought to be
removed and the code paths merged, as both legs execute the same
instruction.

What is the definition of the `vl' parameter to `__riscv_vle32_v_i32m1',
etc. anyway?  I have trouble chasing one down and the source code is so
convoluted with macros I can't even find the implementation.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #2 from JuzheZhong ---
To demonstrate the idea, here is a simple example to make it easier to
understand: https://godbolt.org/z/Gxzjv48Ec

  #include "riscv_vector.h"
  void foo(int32_t *in1, int32_t *in2, int32_t *in3, int32_t *out,
           size_t n, int cond, int avl) {

    size_t vl = __riscv_vsetvl_e16mf2(avl >> 2);

    vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);
    vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, vl);
    vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, vl);
    __riscv_vse32_v_i32m1(out, c, vl);
  }

LLVM:

        srai    a4, a6, 2
        vsetvli zero, a4, e16, mf2, ta, ma
        vle32.v v8, (a0)
        vsetvli zero, zero, e32, m1, tu, ma
        vle32.v v8, (a1)
        vle32.v v8, (a2)
        vse32.v v8, (a3)
        ret

LLVM is generating the naive code, directly following the intrinsics: as
you said, the first vsetvli keeps e16,mf2 unchanged.

Here is the codegen of GCC:

        srai    a6,a6,2
        vsetvli a6,a6,e32,m1,tu,ma
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        ret

Since e16,mf2 has the same ratio as e32,m1, we change the first vsetvl from
e16,mf2 into e32,m1 TU, and then we can eliminate the second vsetvl.  That
is what we call "local fusion" here.  The case you mentioned is "global
fusion", but they are the same thing: fusing vsetvls according to the RVV
ISA.  So, for the example you mention, GCC is generating correct code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

JuzheZhong changed:

           What    |Removed                     |Added
                 CC|                            |juzhe.zhong at rivai dot ai

--- Comment #1 from JuzheZhong ---
No, it is correct.  It's the fancy optimization we have done in the VSETVL
pass.  e16mf2 has the same ratio as e32m1.  The later loop demands e32m1
and TU, so we fuse that demand into the e16mf2 user vsetvl
(__riscv_vsetvl_e16mf2(avl)), changing it into e32m1 and TU.  This is a
valid optimization.  If you change e16mf2 into e16m1, I am sure the fusion
will be blocked.