Also, I have investigated power's testcase on RVV:

#include <stdint.h>

#define TEST_ALL(T) \
  T (int8_t) \
  T (uint8_t) \
  T (int16_t) \
  T (uint16_t) \
  T (int32_t) \
  T (uint32_t) \
  T (int64_t) \
  T (uint64_t) \
  T (float) \
  T (double)

#define N 64
#define START 1
#define END 59

#define test(TYPE) \
  TYPE x_##TYPE[N] __attribute__((aligned(16))); \
  void __attribute__((noinline, noclone)) test_npeel_##TYPE() { \
    TYPE v = 0; \
    for (unsigned int i = START; i < END; i++) { \
      x_##TYPE[i] = v; \
      v += 1; \
    } \
  }

TEST_ALL (test)

RVV compile options:
-march=rv64gcv_zba_zbb_zbc_zbs_zvl256b -O2 -ftree-vectorize
-fno-vect-cost-model -fno-unroll-loops -ffast-math
--param=riscv-autovec-preference=fixed-vlmax -S -fdump-tree-optimized

Before this patch:

void test_npeel_int16_t ()
{
  unsigned long ivtmp.39;
  vector(16) short int vect_vec_iv_.33;
  void * _2;
  vector(16) short int * _8;
  vector(16) short int _10;
  unsigned long loop_len_19;
  unsigned long ivtmp_21;
  unsigned long ivtmp_22;

  <bb 2> [local count: 18146240]:
  ivtmp.39_13 = (unsigned long) &MEM <int16_t[64]> [(void *)&x_int16_t + 2B];

  <bb 3> [local count: 72584963]:
  # vect_vec_iv_.33_12 = PHI <_10(3), { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }(2)>
  # ivtmp_21 = PHI <ivtmp_22(3), 58(2)>
  # ivtmp.39_5 = PHI <ivtmp.39_14(3), ivtmp.39_13(2)>
  loop_len_19 = MIN_EXPR <ivtmp_21, 16>;
  _10 = vect_vec_iv_.33_12 + { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 };
  _2 = (void *) ivtmp.39_5;
  _8 = &MEM <vector(16) short int> [(short int *)_2];
  .LEN_STORE (_8, 16B, loop_len_19, vect_vec_iv_.33_12, 0);
  ivtmp_22 = ivtmp_21 - loop_len_19;
  ivtmp.39_14 = ivtmp.39_5 + 32;
  if (ivtmp_22 != 0)
    goto <bb 3>; [75.00%]
  else
    goto <bb 4>; [25.00%]

  <bb 4> [local count: 18146240]:
  return;
}

After this patch:

void test_npeel_int16_t ()
{
  <bb 2> [local count: 18146240]:
  .LEN_STORE (&MEM <int16_t[64]> [(void *)&x_int16_t + 2B], 16B, 32, { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }, 0);
  .LEN_STORE (&MEM <int16_t[64]> [(void *)&x_int16_t + 66B], 16B, 26, { 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63 }, 0); [tail call]
  return;
}

It seems this patch fixes power's issue now.

So, my conclusion:
1. This patch does produce one more redundant 'mv' instruction in some cases
   (not all cases). But this can be partially solved by the select_vl pattern.
   And even if we can't fix this issue, one more 'mv' instruction is not a big
   deal for RVV.
2. This patch can solve power's issue.

Thanks.

juzhe.zh...@rivai.ai

From: Richard Biener
Date: 2023-05-30 20:33
To: juzhe.zhong
CC: Richard Sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV

On Tue, 30 May 2023, juzhe.zhong wrote:

> This patch will generate the number of rgroup "mov" instructions inside the
> loop. This is unacceptable. For example, if number of rgroups=3, will be 3 more
> instruction in loop. If this patch is necessary, I think I should find a way
> to fix it.

That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.

> ---- Replied Message ----
> From: Richard Sandiford<richard.sandif...@arm.com>
> Date: 05/30/2023 19:41
> To: juzhe.zh...@rivai.ai<juzhe.zh...@rivai.ai>
> Cc: gcc-patches<gcc-patches@gcc.gnu.org>,
>     rguenther<rguent...@suse.de>,
>     linkw<li...@linux.ibm.com>
> Subject: Re: [PATCH] VECT: Change flow of decrement IV
>
> "juzhe.zh...@rivai.ai" <juzhe.zh...@rivai.ai> writes:
> > Before this patch:
> > foo:
> >         ble     a2,zero,.L5
> >         csrr    a3,vlenb
> >         srli    a4,a3,2
> > .L3:
> >         minu    a5,a2,a4
> >         vsetvli zero,a5,e32,m1,ta,ma
> >         vle32.v v2,0(a1)
> >         vle32.v v1,0(a0)
> >         vsetvli t1,zero,e32,m1,ta,ma
> >         vadd.vv v1,v1,v2
> >         vsetvli zero,a5,e32,m1,ta,ma
> >         vse32.v v1,0(a0)
> >         add     a1,a1,a3
> >         add     a0,a0,a3
> >         sub     a2,a2,a5
> >         bne     a2,zero,.L3
> > .L5:
> >         ret
> >
> > After this patch:
> >
> > foo:
> >         ble     a2,zero,.L5
> >         csrr    a3,vlenb
> >         srli    a4,a3,2
> >         neg     a7,a4           -->>>additional instruction
> > .L3:
> >         minu    a5,a2,a4
> >         vsetvli zero,a5,e32,m1,ta,ma
> >         vle32.v v2,0(a1)
> >         vle32.v v1,0(a0)
> >         vsetvli t1,zero,e32,m1,ta,ma
> >         mv      a6,a2           -->>>additional instruction
> >         vadd.vv v1,v1,v2
> >         vsetvli zero,a5,e32,m1,ta,ma
> >         vse32.v v1,0(a0)
> >         add     a1,a1,a3
> >         add     a0,a0,a3
> >         add     a2,a2,a7
> >         bgtu    a6,a4,.L3
> > .L5:
> >         ret
> >
> > There is 1 more instruction in preheader and 1 more instruction in loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> > this issue will gone.
>
> But what about cases where you won't be using SELECT_VL, such as SLP?
>
> Richard
>

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
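
For reference, the two loop shapes discussed in this thread can be modelled in scalar C. This is only a sketch based on my reading of the assembly quoted above, not the vectorizer's code: the function names (trip_lengths_old, trip_lengths_new) and the lens[] output array are hypothetical. The old flow subtracts the per-iteration length and exits when the remaining count reaches zero; the new flow subtracts the full vector factor and tests the value from before the decrement, which is where the extra 'mv' comes from.

```c
#include <assert.h>

/* Old flow: remain -= MIN (remain, vf); loop while remain != 0.
   Mirrors 'minu a5,a2,a4 ... sub a2,a2,a5; bne a2,zero,.L3'.
   Stores each iteration's length in lens[] and returns the trip count.  */
static unsigned
trip_lengths_old (unsigned long n, unsigned long vf, unsigned long *lens)
{
  assert (n > 0 && vf > 0);   /* the 'ble a2,zero,.L5' guard */
  unsigned iters = 0;
  unsigned long remain = n;
  do
    {
      unsigned long len = remain < vf ? remain : vf;   /* MIN_EXPR */
      lens[iters++] = len;
      remain -= len;
    }
  while (remain != 0);
  return iters;
}

/* New flow: remain -= vf (may wrap on the last iteration); loop while
   the value before the decrement was greater than vf.
   Mirrors 'mv a6,a2 ... add a2,a2,a7; bgtu a6,a4,.L3' with a7 = -vf.  */
static unsigned
trip_lengths_new (unsigned long n, unsigned long vf, unsigned long *lens)
{
  assert (n > 0 && vf > 0);
  unsigned iters = 0;
  unsigned long remain = n;
  unsigned long prev;
  do
    {
      unsigned long len = remain < vf ? remain : vf;   /* MIN_EXPR */
      lens[iters++] = len;
      prev = remain;    /* the additional 'mv' */
      remain -= vf;     /* unsigned wrap is harmless: only prev is tested */
    }
  while (prev > vf);
  return iters;
}
```

With n = 58 (END - START in the testcase) and vf = 32, the new flow yields two iterations with lengths 32 and 26, which matches the two .LEN_STORE calls in the "After this patch" dump. Because the trip count no longer depends on a MIN_EXPR result, it is presumably easier for later passes to see it is constant and unroll the loop completely.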