I have also investigated the Power testcase on RVV:

#include <stdint.h>

#define TEST_ALL(T)                                                            \
  T (int8_t)                                                                   \
  T (uint8_t)                                                                  \
  T (int16_t)                                                                  \
  T (uint16_t)                                                                 \
  T (int32_t)                                                                  \
  T (uint32_t)                                                                 \
  T (int64_t)                                                                  \
  T (uint64_t)                                                                 \
  T (float)                                                                    \
  T (double)
  
#define N 64
#define START 1
#define END 59

#define test(TYPE)                                                             \
  TYPE x_##TYPE[N] __attribute__((aligned(16)));                               \
  void __attribute__((noinline, noclone)) test_npeel_##TYPE() {                \
    TYPE v = 0;                                                                \
    for (unsigned int i = START; i < END; i++) {                               \
      x_##TYPE[i] = v;                                                         \
      v += 1;                                                                  \
    }                                                                          \
  }

TEST_ALL (test)

RVV compile options:
-march=rv64gcv_zba_zbb_zbc_zbs_zvl256b -O2 -ftree-vectorize 
-fno-vect-cost-model -fno-unroll-loops -ffast-math 
--param=riscv-autovec-preference=fixed-vlmax -S -fdump-tree-optimized

Before this patch:
void test_npeel_int16_t ()
{
  unsigned long ivtmp.39;
  vector(16) short int vect_vec_iv_.33;
  void * _2;
  vector(16) short int * _8;
  vector(16) short int _10;
  unsigned long loop_len_19;
  unsigned long ivtmp_21;
  unsigned long ivtmp_22;

  <bb 2> [local count: 18146240]:
  ivtmp.39_13 = (unsigned long) &MEM <int16_t[64]> [(void *)&x_int16_t + 2B];

  <bb 3> [local count: 72584963]:
  # vect_vec_iv_.33_12 = PHI <_10(3), { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }(2)>
  # ivtmp_21 = PHI <ivtmp_22(3), 58(2)>
  # ivtmp.39_5 = PHI <ivtmp.39_14(3), ivtmp.39_13(2)>
  loop_len_19 = MIN_EXPR <ivtmp_21, 16>;
  _10 = vect_vec_iv_.33_12 + { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 };
  _2 = (void *) ivtmp.39_5;
  _8 = &MEM <vector(16) short int> [(short int *)_2];
  .LEN_STORE (_8, 16B, loop_len_19, vect_vec_iv_.33_12, 0);
  ivtmp_22 = ivtmp_21 - loop_len_19;
  ivtmp.39_14 = ivtmp.39_5 + 32;
  if (ivtmp_22 != 0)
    goto <bb 3>; [75.00%]
  else
    goto <bb 4>; [25.00%]

  <bb 4> [local count: 18146240]:
  return;

}

After this patch:
void test_npeel_int16_t ()
{
  <bb 2> [local count: 18146240]:
  .LEN_STORE (&MEM <int16_t[64]> [(void *)&x_int16_t + 2B], 16B, 32, { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }, 0);
  .LEN_STORE (&MEM <int16_t[64]> [(void *)&x_int16_t + 66B], 16B, 26, { 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63 }, 0); [tail call]
  return;

}

This patch appears to fix the Power issue.
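
As a sanity check on the two stores in the dump above, here is a small sketch of
the chunk arithmetic. It is not GCC code: chunk, chunk_count, chunk_at and
verify_against_dump are hypothetical names, and it assumes 32 int16_t elements
per store (read off the 32-element constant vectors) and that the length
operand counts elements, which fits 32 + 26 = 58 = END - START.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define START 1
#define END 59

/* Hypothetical description of one length-controlled store. */
typedef struct {
  size_t byte_offset; /* offset from &x_TYPE[0] in bytes */
  size_t len;         /* number of elements actually stored */
  size_t first_value; /* value of v written to the first element */
} chunk;

/* Number of stores needed to cover [START, END) with vf elements each.
   Assumes vf > 0.  */
static size_t chunk_count(size_t vf) {
  size_t total = END - START;
  return (total + vf - 1) / vf;
}

/* The k-th store, for k < chunk_count (vf).  */
static chunk chunk_at(size_t vf, size_t elem_size, size_t k) {
  size_t total = END - START;
  size_t done = k * vf;
  chunk c;
  c.first_value = done;                       /* v starts at 0, steps by 1 */
  c.byte_offset = (START + done) * elem_size; /* loop starts at index START */
  c.len = total - done < vf ? total - done : vf;
  return c;
}

/* Compare against the offsets and lengths shown in the dump.  */
static void verify_against_dump(void) {
  assert(chunk_count(32) == 2);
  assert(chunk_at(32, sizeof(int16_t), 0).byte_offset == 2);
  assert(chunk_at(32, sizeof(int16_t), 0).len == 32);
  assert(chunk_at(32, sizeof(int16_t), 1).byte_offset == 66);
  assert(chunk_at(32, sizeof(int16_t), 1).len == 26);
}
```

Calling verify_against_dump() from a test main reproduces the "+ 2B, 32" and
"+ 66B, 26" operands, so the two stores cover exactly the 58 elements written
by the scalar loop.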

So, my conclusions:
1. This patch does produce one redundant 'mv' instruction in some cases (not
    all). This can be partially solved by the select_vl pattern, and even if we
    can't fix it, one extra 'mv' instruction is not a big deal for RVV.
2. This patch solves the Power issue.
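
To make the source of the extra 'mv' concrete, here is a scalar model of the two
loop-control flows, written from the RVV assembly quoted further down in this
mail. It is only a sketch, not vectorizer code: loop_stats, run_old_flow,
run_new_flow and check_flows_agree are hypothetical names, and vf is assumed to
be nonzero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one run of the modelled loop. */
typedef struct {
  size_t iterations;
  size_t elements;
} loop_stats;

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Old flow: subtract the per-iteration length, exit when nothing remains. */
static loop_stats run_old_flow(size_t n, size_t vf) {
  loop_stats s = {0, 0};
  size_t remain = n;
  if (remain == 0)
    return s;                        /* models the guard before .L3 */
  do {
    size_t len = min_sz(remain, vf); /* minu a5,a2,a4 */
    s.elements += len;
    s.iterations++;
    remain -= len;                   /* sub a2,a2,a5 */
  } while (remain != 0);             /* bne a2,zero,.L3 */
  return s;
}

/* New flow: always subtract vf, and test the value from before the
   subtraction.  Keeping that old value live is what costs the 'mv'.  */
static loop_stats run_new_flow(size_t n, size_t vf) {
  loop_stats s = {0, 0};
  size_t remain = n;
  size_t old;
  if (remain == 0)
    return s;
  do {
    size_t len = min_sz(remain, vf); /* minu a5,a2,a4 */
    s.elements += len;
    s.iterations++;
    old = remain;                    /* mv a6,a2 */
    remain -= vf;                    /* add a2,a2,a7 with a7 = -vf; the
                                        unsigned wrap on the last iteration
                                        is harmless, the value is unused */
  } while (old > vf);                /* bgtu a6,a4,.L3 */
  return s;
}

/* Both flows should visit the same elements in the same number of trips. */
static void check_flows_agree(size_t max_n, size_t vf) {
  for (size_t n = 0; n <= max_n; n++) {
    loop_stats a = run_old_flow(n, vf);
    loop_stats b = run_new_flow(n, vf);
    assert(a.iterations == b.iterations);
    assert(a.elements == b.elements && a.elements == n);
  }
}
```

For example, check_flows_agree(1000, 16) passes, and with n = 58 and vf = 16
both flows take 4 iterations (16, 16, 16, 10 elements). The only difference is
the copy of the old counter, which matches the single extra 'mv' in the loop.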

Thanks. 


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-05-30 20:33
To: juzhe.zhong
CC: Richard Sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
On Tue, 30 May 2023, juzhe.zhong wrote:
 
> This patch will generate as many 'mov' instructions inside the loop as there
> are rgroups. This is unacceptable. For example, if the number of rgroups is 3,
> there will be 3 more instructions in the loop. If this patch is necessary, I
> think I should find a way to fix it.
 
That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.
 
> ---- Replied Message ----
> From
> Richard Sandiford<richard.sandif...@arm.com>
> Date
> 05/30/2023 19:41
> To
> juzhe.zh...@rivai.ai<juzhe.zh...@rivai.ai>
> Cc
> gcc-patches<gcc-patches@gcc.gnu.org>,
> rguenther<rguent...@suse.de>,
> linkw<li...@linux.ibm.com>
> Subject
> Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zh...@rivai.ai" <juzhe.zh...@rivai.ai> writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > sub a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in the preheader and 1 more instruction in the
> > loop. But I think it's OK for RVV since we will definitely be using
> > SELECT_VL, so this issue will be gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
> 
 
-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
