from:"钟居哲"

Re: Re: [PATCH] RISC-V: Basic VLS code gen for RISC-V

2023-05-30 Thread 钟居哲

Hi, Kito.

After consideration, I think extending VLS modes into the VLA patterns is not a 
wise choice now.
I would prefer to get everything right the first time (otherwise I will have to 
rework the whole thing in the future, which wastes time). 
So I have the following suggestions:

First, add a new avl_type here:
enum avl_type
{
  NONVLMAX,
  VLMAX,
+ VLS_AVL,
};

Second, define SEW && VLMUL && RATIO for VLS modes:
(define_attr "sew" ""
  (cond [(eq_attr "mode" "V16QI")
         (const_int 8)
         (eq_attr "mode" "V8HI")
         (const_int 16)
         (eq_attr "mode" "V4SI")
         (const_int 32)
         (eq_attr "mode" "V2DI")
         (const_int 64)]
        (const_int INVALID_ATTRIBUTE)))
(define_attr "vlmul" ""
  (cond [(eq_attr "mode" "V16QI")
         (symbol_ref "riscv_vector::get_vlmul(E_V16QImode)")
         (eq_attr "mode" "V8HI")
         (symbol_ref "riscv_vector::get_vlmul(E_V8HImode)")
         (eq_attr "mode" "V4SI")
         (symbol_ref "riscv_vector::get_vlmul(E_V4SImode)")
         (eq_attr "mode" "V2DI")
         (symbol_ref "riscv_vector::get_vlmul(E_V2DImode)")



For "get_vlmul", we should be careful:
V16QI should have LMUL = 1 when TARGET_MIN_VLEN == 128,
LMUL = 1/2 when TARGET_MIN_VLEN == 256,
and so on.
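
A minimal sketch of the arithmetic behind that rule, assuming LMUL for a VLS mode is simply the mode size in bits divided by TARGET_MIN_VLEN. The names vls_lmul and struct lmul_frac are hypothetical and are not the real riscv_vector::get_vlmul; the sketch also does not check that the result lies in the legal 1/8 to 8 range.

```c
#include <assert.h>

/* Hypothetical helper type: LMUL expressed as the fraction num/den.  */
struct lmul_frac
{
  unsigned num;
  unsigned den;
};

/* Greatest common divisor, used to reduce the fraction.  */
static unsigned
gcd_u (unsigned a, unsigned b)
{
  while (b != 0)
    {
      unsigned t = a % b;
      a = b;
      b = t;
    }
  return a;
}

/* Assumption: LMUL = (nunits * sew_bits) / min_vlen for a fixed-size
   (VLS) mode with NUNITS elements of SEW_BITS bits each.  */
static struct lmul_frac
vls_lmul (unsigned nunits, unsigned sew_bits, unsigned min_vlen)
{
  assert (nunits > 0 && sew_bits > 0 && min_vlen > 0);
  unsigned mode_bits = nunits * sew_bits;
  unsigned g = gcd_u (mode_bits, min_vlen);
  struct lmul_frac r = { mode_bits / g, min_vlen / g };
  return r;
}
```

With this model, V16QI (16 elements of 8 bits) gives 1/1 for a minimum VLEN of 128 and 1/2 for 256, matching the two cases above.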

Third, I think for VLS modes, you can define VLS pattern like this:

For GET_MODE_NUNITS (mode).to_constant () < 32:
+(define_insn "3"
+  [(set (match_operand:VLS 0 "register_operand" "=vr")
+   (any_int_binop_no_shift:VLS
+ (match_operand:VLS 1 "register_operand" "vr")
+ (match_operand:VLS 2 "register_operand" "vr")))]
+  "TARGET_VECTOR"
+  "v.vv\t%0,%1,%2"

+   [(set_attr "type" "")
+(set_attr "mode" "")
+(set_attr "merge_op_idx" const_int INVALID_ATTRIBUTE)
+(set_attr "vl_op_idx" const_int INVALID_ATTRIBUTE)
+(set (attr "ta") (symbol_ref "riscv_vector::TAIL_ANY"))
+(set (attr "ma") (symbol_ref "riscv_vector::MASK_ANY"))
+   (set (attr "avl_type") (symbol_ref "riscv_vector::VLS_AVL"))])

For GET_MODE_NUNITS (mode).to_constant () >= 32:

+(define_insn "3"
+  [(set (match_operand:VLS 0 "register_operand" "=vr")
+   (any_int_binop_no_shift:VLS
+ (match_operand:VLS 1 "register_operand" "vr")
+ (match_operand:VLS 2 "register_operand" "vr")))
+   (clobber (match_operand:SI 2 ))]
+  "TARGET_VECTOR"
+  "v.vv\t%0,%1,%2"

+   [(set_attr "type" "")
+(set_attr "mode" "")
+(set_attr "merge_op_idx" const_int INVALID_ATTRIBUTE)
+(set_attr "vl_op_idx" const_int 2)
+(set (attr "ta") (symbol_ref "riscv_vector::TAIL_ANY"))
+(set (attr "ma") (symbol_ref "riscv_vector::MASK_ANY"))
+   (set (attr "avl_type") (symbol_ref "riscv_vector::VLS_AVL"))])

Then, with some minor tricks in the VSETVL pass (in the "parse_insn" function), 
I think this should work, and it is the truly optimal solution for
VLS-mode auto-vectorization.

Thanks.


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-30 23:45
To: juzhe.zh...@rivai.ai
CC: Richard Biener; Robin Dapp; Kito.cheng; gcc-patches; palmer; jeffreyalaw; 
pan2.li
Subject: Re: Re: [PATCH] RISC-V: Basic VLS code gen for RISC-V
It's a long mail, but I think this should explain most of the high-level
concepts behind why I did this:
 
I guess I skipped too much story about the VLS-mode support; VLS-mode
support can be split into the middle-end and back-end.
 
# Middle-end
As Richard mentioned, those VLS types can be held by VLA-modes; for
example, int32x4_t can be held by VNx4SI mode, so IMO there are three
different options here: 1) use VLS type with VLS mode in middle-end,
2) use VLS type with VLA mode in middle-end 3) use VLA type with VLA
mode.
 
Option 2 might be weird and not natural to implement in GCC, so let me
ignore that.
 
Option 3 is a possible way, and actually, I did that in our downstream
compiler, and then... we found that it is not friendly to
optimization. To give a practical example: with a VLA type it is hard to
represent a vector constructor other than a step or splat/duplicated
value, so we need to push those values into memory first and then load
them with len_load. As a result, constant propagation and folding can't
work well here, since it's hard to evaluate them with an unknown vector length.
 
And it is also not friendly to alias analysis: because the length is
unknown, GCC must be conservative, which will block some
optimizations due to AA issues.
 
So IMO using the VLS type with VLS mode is the best way in the middle-end.
 
# Back-end
OK, it's back-end time; we have two options in the back-end to support
the VLS-type: support that with VLS mode or VLA mode.
 
What's the meaning of supporting it with VLA mode? It means converting
VLS-type stuff into VLA mode patterns and giving the right length
information; then everything works.
 
But what is wrong with this path? Again, similar issues in the
back-end: the propagation and folding of constant vectors will be
limited when we hold them in VLA types. We can't hold a const_vector other
than a splat/duplicated value or step value; it can't even be held
during the combine process. To give an example here, we have a = {1, 2,
3, 4} and b = {4,

Re: Re: [PATCH] VECT: Change flow of decrement IV

2023-05-30 Thread 钟居哲

Hi, Richi.

>> As I said in the PR with the proposed scheme you get a loop around copy of 
>> the IV since both the pre and the post decrement values are live at the same 
>> time.  
>> If the CPU has a underflow bit set from the subtraction and a branch on that 
>> test using that could avoid the copy need.

The RISC-V port doesn't have such instructions, so the copy is needed on 
RISC-V.
But as I said, such a copy is very cheap.

So I wonder whether you would consider taking and reviewing this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620086.html 

Or do you have another plan?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-05-31 00:40
To: 钟居哲
CC: richard.sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV


On 30.05.2023 at 14:38, 钟居哲 wrote:

 

>> That's odd, you only need to adjust the IV which is used in the exit test,
>> not all the others.
Sorry for my incorrect information. I checked the codegen of both single-rgroup 
and multi-rgroup.
Their codegen behaves the same: after this patch, there will be 1 more neg 
instruction in the preheader
and 1 more mv instruction inside the loop.

As I said in the PR with the proposed scheme you get a loop around copy of the 
IV since both the pre and the post decrement values are live at the same time.  
If the CPU has a underflow bit set from the subtraction and a branch on that 
test using that could avoid the copy need.



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-05-30 20:33
To: juzhe.zhong
CC: Richard Sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
On Tue, 30 May 2023, juzhe.zhong wrote:
 
> This patch will generate one 'mov' instruction per rgroup inside the
> loop. This is unacceptable. For example, if the number of rgroups is 3, there
> will be 3 more instructions in the loop. If this patch is necessary, I think I
> should find a way to fix it.
 
That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.
 
>  Replied Message 
> From: Richard Sandiford
> Date: 05/30/2023 19:41
> To: juzhe.zh...@rivai.ai
> Cc: gcc-patches, rguenther, linkw
> Subject: Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zh...@rivai.ai"  writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> >   sub   a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in the preheader and 1 more instruction in the loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL, so
> > this issue will be gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
> 
 
-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

Re: Re: [PATCH] RISC-V: Synthesize power-of-two constants.

2023-05-30 Thread 钟居哲

OK. I prefer to just keep scalar load + vmv.v.x by default, since I believe 
most machines prefer this way.



juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-05-31 06:09
To: 钟居哲; andrew; rdapp.gcc
CC: gcc-patches; kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Synthesize power-of-two constants.
 
 
On 5/30/23 16:01, 钟居哲 wrote:
> I agree with Andrew.
> 
> And I don't think this patch is appropriate for the following reasons:
> 1. This patch increases the vector workload in the machine since
>    it converts scalar load + vmv.v.x into vmv.v.i + vsll.vi.
This is probably uarch dependent.  I can probably construct cases where 
the first will be better and I can probably construct cases where the 
latter will be better.  In fact the recommendation from our uarch team 
is to generally do this stuff on the vector side.
 
 
 
> 2. For a multi-issue OoO machine, scalar instructions are very cheap
>    when they are mixed into vector code. For example, a sequence
>    like this:
>      scalar insn
>      scalar insn
>      vector insn
>      scalar insn
>      vector insn
>
>    In such a situation, we can issue multiple instructions simultaneously,
>    and the latency of the scalar instructions will be hidden, so the scalar
>    instructions are cheap. Whereas this patch, by increasing the vector
>    pipeline workload, is not friendly to the OoO machines I mentioned above.
I probably need to be careful what I say here :-)  I'll go with mixing 
vector/scalar code may incur certain penalties on some 
microarchitectures depending on the exact code sequences involved.
 
 
> 3. The only benefit of this patch I can imagine is that we can reduce
>    scalar register pressure in some extreme circumstances. However, I don't
>    think this benefit is "real", since GCC should schedule the instruction
>    sequence well once we have tuned the vector instruction scheduling
>    model and cost model, making such register live ranges very short when
>    the scalar register pressure is very high.
> 
> Overall, I disagree with this patch.
What I think this all argues is that it'll likely need to be uarch 
dependent.  I'm not yet sure how to describe the properties of the 
uarch in a concise manner to put into our costing structure though.
 
jeff

Re: Re: [PATCH] RISC-V: Synthesize power-of-two constants.

2023-05-30 Thread 钟居哲

I agree with Andrew.

And I don't think this patch is appropriate for the following reasons:
1. This patch increases the vector workload in the machine since
   it converts scalar load + vmv.v.x into vmv.v.i + vsll.vi.
2. For a multi-issue OoO machine, scalar instructions are very cheap
   when they are mixed into vector code. For example, a sequence
   like this:
     scalar insn
     scalar insn
     vector insn
     scalar insn
     vector insn

   In such a situation, we can issue multiple instructions simultaneously,
   and the latency of the scalar instructions will be hidden, so the scalar
   instructions are cheap. Whereas this patch, by increasing the vector
   pipeline workload, is not friendly to the OoO machines I mentioned above.
3. The only benefit of this patch I can imagine is that we can reduce
   scalar register pressure in some extreme circumstances. However, I don't
   think this benefit is "real", since GCC should schedule the instruction
   sequence well once we have tuned the vector instruction scheduling
   model and cost model, making such register live ranges very short when
   the scalar register pressure is very high.

Overall, I disagree with this patch.

Thanks.

juzhe.zh...@rivai.ai

From: Andrew Waterman
Date: 2023-05-31 04:18
To: Robin Dapp
CC: gcc-patches; Kito Cheng; palmer; juzhe.zh...@rivai.ai; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Synthesize power-of-two constants.
This turns out to be a de-optimization for implementations with any
amount of temporal execution (which is most machines with LMUL > 1 and
even some machines with LMUL <= 1).  Scalar instructions are generally
cheaper than multi-cycle-occupancy vector operations, so reducing
scalar work by increasing vector work is normally not a good tradeoff.
(And even if the vector instruction has unit occupancy, it likely
burns a bit more energy.)  The best generic scheme to load 143 into
all elements of a vector register is to first load 143 into a scalar
register, then use vmv.v.x.  If the proposed scheme is profitable on
some implementations in some circumstances, it should probably be
enabled only when tuning for that implementation.
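
To make the trade-off above concrete, here is a minimal sketch of how a per-tuning choice between the three broadcast strategies could look. Everything in it is an assumption for illustration: the enum, the function choose_splat and the flag prefer_vector_synth are hypothetical names, not part of the patch or of GCC. It only uses the facts discussed in this thread: vmv.v.i covers -16 to 15, the proposed scheme applies to powers of two, and everything else goes through a scalar register plus vmv.v.x. Shift-amount and element-width limits are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of how to broadcast a constant.  */
enum splat_strategy
{
  SPLAT_VMV_V_I,        /* single vmv.v.i, 5-bit signed immediate */
  SPLAT_SHIFT_SYNTH,    /* vmv.v.i + vsll.vi (the proposed scheme) */
  SPLAT_SCALAR_VMV_V_X  /* li into a scalar register, then vmv.v.x */
};

/* True for positive powers of two.  */
static int
is_pow2 (int64_t v)
{
  return v > 0 && (v & (v - 1)) == 0;
}

/* PREFER_VECTOR_SYNTH stands in for a per-uarch tuning decision
   (hypothetical); when it is 0 the generic scheme is used.  */
static enum splat_strategy
choose_splat (int64_t val, int prefer_vector_synth)
{
  assert (prefer_vector_synth == 0 || prefer_vector_synth == 1);
  if (val >= -16 && val <= 15)
    return SPLAT_VMV_V_I;
  if (prefer_vector_synth && is_pow2 (val))
    return SPLAT_SHIFT_SYNTH;
  return SPLAT_SCALAR_VMV_V_X;
}
```

Under this model 143 always takes the scalar path, and 16 takes the shift path only when the tuning flag asks for it.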

On Tue, May 30, 2023 at 12:14 PM Robin Dapp via Gcc-patches
 wrote:
>
> Hi,
>
> I figured I'd send this patch that I quickly hacked together some
> days back.  It's likely going to be controversial because we don't
> have vector costs in place at all yet and even with costs it's
> probably debatable as the emitted sequence is longer :)
> I'm willing to defer or ditch it altogether but as it's small and
> localized why not at least discuss it quickly.
>
> For immediates that are powers of two, instead of loading them into a
> GPR and then broadcasting (incurring the scalar-vector latency) we
> can synthesize them with a vmv.v.i and a vsll.vi.  Depending on actual
> costs we could also add more complicated synthesis patterns in the
> future.
>
> Regards
>  Robin
>
> gcc/ChangeLog:
>
> * config/riscv/riscv-selftests.cc (run_const_vector_selftests):
> Adjust expectation.
> * config/riscv/riscv-v.cc (expand_const_vector): Synthesize
> power-of-two constants.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c: Adjust test
> expectation.
> * gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c: Ditto.
> * gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c: Ditto.
> * gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c: Ditto.
> ---
>  gcc/config/riscv/riscv-selftests.cc   |  9 +-
>  gcc/config/riscv/riscv-v.cc   | 31 +++
>  .../riscv/rvv/autovec/vmv-imm-fixed-rv32.c|  5 +--
>  .../riscv/rvv/autovec/vmv-imm-fixed-rv64.c|  5 +--
>  .../riscv/rvv/autovec/vmv-imm-rv32.c  |  5 +--
>  .../riscv/rvv/autovec/vmv-imm-rv64.c  |  5 +--
>  6 files changed, 51 insertions(+), 9 deletions(-)
>
> diff --git a/gcc/config/riscv/riscv-selftests.cc 
> b/gcc/config/riscv/riscv-selftests.cc
> index 1bf1a648fa1..21fa460bb1f 100644
> --- a/gcc/config/riscv/riscv-selftests.cc
> +++ b/gcc/config/riscv/riscv-selftests.cc
> @@ -259,9 +259,16 @@ run_const_vector_selftests (void)
>   rtx_insn *insn = get_last_insn ();
>   rtx src = XEXP (SET_SRC (PATTERN (insn)), 1);
>   /* 1. Should be vmv.v.i for in rang of -16 ~ 15.
> -2. Should be vmv.v.x for exceed -16 ~ 15.  */
> +2. For 16 (and appropriate higher powers of two)
> +   expect a shift because we emit a
> +   vmv.v.i v1, 8 and a
> +   vsll.v.i v1, v1, 1.
> +3. Should be vmv.v.x for everything else.  */
>   if (IN_RANGE (val, -16, 15))
> ASSERT_TRUE (rtx_equal_p (src, dup));
> + else if (IN_RANGE (val, 16, 16))
> +   ASSERT_TRUE (GET_CODE (src) == ASHIFT
> +&& INTVAL (XEXP (src, 1)) == 1);
>

Re: Re: [PATCH] VECT: Change flow of decrement IV

2023-05-30 Thread 钟居哲

More information on Power's testcase:

Before this patch:
test_npeel_int16_t:
lui a4,%hi(.LANCHOR0+130)
lui a3,%hi(.LANCHOR1)
addi a3,a3,%lo(.LANCHOR1)
addi a4,a4,%lo(.LANCHOR0+130)
li a5,58
li a2,16
vsetivli zero,16,e16,m1,ta,ma
vl1re16.v v3,0(a3)
vid.v v1
.L5:
minu a3,a5,a2
vsetvli zero,a3,e16,m1,ta,ma
sub a5,a5,a3
vse16.v v1,0(a4)
vsetivli zero,16,e16,m1,ta,ma
addi a4,a4,32
vadd.vv v1,v1,v3
bne a5,zero,.L5
ret

After this patch:
test_npeel_int16_t:
lui a5,%hi(.LANCHOR0)
addi a5,a5,%lo(.LANCHOR0)
li a1,16
vsetivli zero,16,e16,m1,ta,ma
addi a2,a5,130
vid.v v1
addi a3,a5,162
vadd.vx v4,v1,a1
addi a4,a5,194
li a1,32
vadd.vx v3,v1,a1
vse16.v v1,0(a2)
vse16.v v4,0(a3)
vse16.v v3,0(a4)
addi a5,a5,226
li a1,48
vadd.vx v2,v1,a1
vsetivli zero,10,e16,m1,ta,ma
vse16.v v2,0(a5)
ret

Clearly, Power's testcase previously could not be unrolled on the RVV side, 
but after this patch it can.
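
As a small worked example of where the four stores in the unrolled version come from: the loop covers 58 elements (START = 1, END = 59) and each vector holds 16 int16_t elements, so the per-iteration lengths are min(remaining, 16). The helper below is a hypothetical sketch of that arithmetic only, not code from the vectorizer.

```c
#include <assert.h>
#include <stddef.h>

/* Splits N elements into per-iteration lengths min(remaining, VF).
   Stores at most CAP lengths into OUT and returns the number of
   iterations that are needed.  */
static size_t
split_trip_count (size_t n, size_t vf, size_t *out, size_t cap)
{
  assert (vf > 0);
  size_t iters = 0;
  while (n > 0)
    {
      size_t len = n < vf ? n : vf;
      if (iters < cap)
        out[iters] = len;
      iters++;
      n -= len;
    }
  return iters;
}
```

For n = 58 and vf = 16 this gives 16, 16, 16, 10, which matches the three full-length vse16.v stores and the final store under vsetivli zero,10 in the assembly above.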


juzhe.zh...@rivai.ai
 

Re: Re: [PATCH] VECT: Change flow of decrement IV

2023-05-30 Thread 钟居哲

Also, I have investigated Power's testcase on RVV:

#include <stdint.h>

#define TEST_ALL(T)\
  T (int8_t)   \
  T (uint8_t)  \
  T (int16_t)  \
  T (uint16_t) \
  T (int32_t)  \
  T (uint32_t) \
  T (int64_t)  \
  T (uint64_t) \
  T (float)\
  T (double)
  
#define N 64
#define START 1
#define END 59

#define test(TYPE)                                                \
  TYPE x_##TYPE[N] __attribute__((aligned(16)));                  \
  void __attribute__((noinline, noclone)) test_npeel_##TYPE() {   \
    TYPE v = 0;                                                   \
    for (unsigned int i = START; i < END; i++) {                  \
      x_##TYPE[i] = v;                                            \
      v += 1;                                                     \
    }                                                             \
  }

TEST_ALL (test)

RVV compile option:
-march=rv64gcv_zba_zbb_zbc_zbs_zvl256b -O2 -ftree-vectorize 
-fno-vect-cost-model -fno-unroll-loops -ffast-math 
--param=riscv-autovec-preference=fixed-vlmax -S -fdump-tree-optimized

Before this patch:
void test_npeel_int16_t ()
{
  unsigned long ivtmp.39;
  vector(16) short int vect_vec_iv_.33;
  void * _2;
  vector(16) short int * _8;
  vector(16) short int _10;
  unsigned long loop_len_19;
  unsigned long ivtmp_21;
  unsigned long ivtmp_22;

   [local count: 18146240]:
  ivtmp.39_13 = (unsigned long)   [(void *)_int16_t + 2B];

   [local count: 72584963]:
  # vect_vec_iv_.33_12 = PHI <_10(3), { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
12, 13, 14, 15 }(2)>
  # ivtmp_21 = PHI 
  # ivtmp.39_5 = PHI 
  loop_len_19 = MIN_EXPR ;
  _10 = vect_vec_iv_.33_12 + { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16, 16 };
  _2 = (void *) ivtmp.39_5;
  _8 =   [(short int *)_2];
  .LEN_STORE (_8, 16B, loop_len_19, vect_vec_iv_.33_12, 0);
  ivtmp_22 = ivtmp_21 - loop_len_19;
  ivtmp.39_14 = ivtmp.39_5 + 32;
  if (ivtmp_22 != 0)
goto ; [75.00%]
  else
goto ; [25.00%]

   [local count: 18146240]:
  return;

}

After this patch:
void test_npeel_int16_t ()
{
   [local count: 18146240]:
  .LEN_STORE (  [(void *)_int16_t + 2B], 16B, 32, { 0, 1, 2, 
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 
24, 25, 26, 27, 28, 29, 30, 31 }, 0);
  .LEN_STORE (  [(void *)_int16_t + 66B], 16B, 26, { 32, 33, 
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 
54, 55, 56, 57, 58, 59, 60, 61, 62, 63 }, 0); [tail call]
  return;

}

It seems this patch fixes Power's issue.

So, my conclusions:
1. This patch does produce 1 more redundant 'mv' instruction in some cases 
(not all cases), but this can be partially solved by the select_vl
pattern. And even if we can't fix this issue, one more 'mv' instruction is not 
a big deal for RVV.
2. This patch can solve Power's issue.

Thanks. 


juzhe.zh...@rivai.ai
 

Re: Re: [PATCH] VECT: Change flow of decrement IV

2023-05-30 Thread 钟居哲

Hi, all. After several investigations, here are my experiments:

#include <stdint.h>

void
single_rgroup (int32_t *__restrict a, int32_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] + a[i];
}

void
mutiple_rgroup (float *__restrict f, double *__restrict d, int n)
{
  for (int i = 0; i < n; ++i)
    {
      f[i * 2 + 0] = 1;
      f[i * 2 + 1] = 2;
      d[i] = 3;
    }
}


single_rgroup:
ble a2,zero,.L5
li a4,4
.L3:
minu a5,a2,a4
vsetvli zero,a5,e32,m1,ta,ma
vle32.v v1,0(a0)
vle32.v v2,0(a1)
vsetivli zero,4,e32,m1,ta,ma
mv a3,a2   -> 1 more "mv" instruction
vadd.vv v1,v1,v2
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a0)
addi a1,a1,16
addi a0,a0,16
addi a2,a2,-4
bgtu a3,a4,.L3
.L5:
ret
.size single_rgroup, .-single_rgroup
.align 1
.globl foo5
.type foo5, @function
mutiple_rgroup :
ble a2,zero,.L11
lui a5,%hi(.LANCHOR0)
addi a5,a5,%lo(.LANCHOR0)
vl1re32.v v2,0(a5)
lui a5,%hi(.LANCHOR0+16)
addi a5,a5,%lo(.LANCHOR0+16)
slli a2,a2,1
li a3,8
li a7,4
vl1re64.v v1,0(a5)
.L9:
minu a5,a2,a3
minu a4,a5,a7
sub a5,a5,a4
addi a6,a0,16
vsetvli zero,a4,e32,m1,ta,ma
vse32.v v2,0(a0)
srli a4,a4,1
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v2,0(a6)
srli a5,a5,1
vsetvli zero,a4,e64,m1,ta,ma
addi a6,a1,16
vse64.v v1,0(a1)
mv a4,a2-> 1 more "mv" instruction
vsetvli zero,a5,e64,m1,ta,ma
vse64.v v1,0(a6)
addi a0,a0,32
addi a1,a1,32
addi a2,a2,-8
bgtu a4,a3,.L9
.L11:
ret

These are examples; I have tried a good number of cases. This is the worst 
case after this patch for RVV:
whether single-rgroup or multiple-rgroup, we end up with 1 more "mv" 
instruction inside the loop.
There are also some examples I have tried with no extra instructions (it seems 
IVOPTS has done some optimization in those cases).

From my side (RVV), I think one more "mv" instruction is not a big deal if 
this patch (step by vf and check the exit condition with remain > vf)
can help IBM. 

For single-rgroup, this 'mv' instruction will be gone when we use SELECT_VL. 
For multiple-rgroup, the 'mv' instruction remains,
but as I said, it is not a big deal.
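
For readers following along, a minimal sketch of the two loop-control schemes as scalar C, assuming the shapes seen in the assembly above: the old scheme subtracts the processed length and exits on zero, the new one always steps by vf and tests the pre-decrement value (which is why the copy is live across the decrement). The function names are hypothetical; only the iteration counts are modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Old scheme: remain -= min (remain, vf); loop while remain != 0.  */
static size_t
iters_sub_len (size_t n, size_t vf)
{
  assert (vf > 0);
  size_t iters = 0;
  size_t remain = n;
  if (remain == 0)
    return 0;                                 /* guard before the loop */
  do
    {
      size_t len = remain < vf ? remain : vf; /* minu */
      remain -= len;                          /* sub */
      iters++;
    }
  while (remain != 0);                        /* bne */
  return iters;
}

/* New scheme: remain -= vf; loop while the old value was > vf.  */
static size_t
iters_sub_vf (size_t n, size_t vf)
{
  assert (vf > 0);
  size_t iters = 0;
  size_t remain = n;
  size_t prev;
  if (remain == 0)
    return 0;
  do
    {
      prev = remain;   /* the extra mv */
      remain -= vf;    /* may wrap on the last iteration; value unused */
      iters++;
    }
  while (prev > vf);   /* bgtu */
  return iters;
}

/* Returns 1 if both schemes run the same number of iterations for
   every n in [0, max_n] and every vf in [1, max_vf].  */
static int
schemes_agree (size_t max_n, size_t max_vf)
{
  for (size_t vf = 1; vf <= max_vf; vf++)
    for (size_t n = 0; n <= max_n; n++)
      if (iters_sub_len (n, vf) != iters_sub_vf (n, vf))
        return 0;
  return 1;
}
```

Both schemes iterate ceil(n / vf) times; the only difference in this model is the extra copy kept for the exit test.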

If this patch's approach is approved, I will rebase and resend the SELECT_VL 
patch on top of this patch.

Looking forward to your suggestions.

Thanks.


juzhe.zh...@rivai.ai
 

Re: Re: [PATCH] VECT: Change flow of decrement IV

2023-05-30 Thread 钟居哲


>> That's odd, you only need to adjust the IV which is used in the exit test,
>> not all the others.
Sorry for my incorrect information. I checked the codegen of both single-rgroup 
and multi-rgroup.
Their codegen behaves the same: after this patch, there will be 1 more neg 
instruction in the preheader
and 1 more mv instruction inside the loop.



juzhe.zh...@rivai.ai
 

Re: Re: [PATCH] RISC-V: Add RVV FMA auto-vectorization support

2023-05-26 Thread 钟居哲


Hi, Robin.

>> Can you explain these two points (3 and 4, maybe 2) a bit in the comments?
>> I.e. what makes fma different from a normal insn?
You can take a look at vector.md. The ternary instruction pattern has 
operands[0], operands[1], operands[2], operands[3], operands[4] and operands[5]:

operands[0] = operands[1] ? operands[2] * operands[3] + operands[4] : 
operands[5]
These operands are not necessarily the same RTX, but we should make them overlap.
Why have operands[5]? Because we will have len_cond_fma.
So I want to lower the simple fma pattern into the patterns I define in vector.md.
operands[5] should be operands[1] if operands[1] overlaps operands[0] ---> vmacc
or operands[3] if operands[3] overlaps operands[0] ---> vmadd
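
As a reference point for the vmacc/vmadd distinction, here is a minimal element-wise model in C. It follows the RVV definitions (vmacc.vv vd, vs1, vs2 computes vd = vs1 * vs2 + vd; vmadd.vv vd, vs1, vs2 computes vd = vs1 * vd + vs2) and assumes inactive lanes keep the old destination value. The function names and the byte-per-lane mask are hypothetical simplifications, and the sketch does not model the operand numbering of vector.md.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* vmacc.vv vd, vs1, vs2: the destination is the addend.
   Lanes with mask[i] == 0 are left unchanged (assumption).  */
static void
model_vmacc (uint32_t *vd, const uint32_t *vs1, const uint32_t *vs2,
             const uint8_t *mask, size_t vl)
{
  assert (vd && vs1 && vs2 && mask);
  for (size_t i = 0; i < vl; i++)
    if (mask[i])
      vd[i] = vs1[i] * vs2[i] + vd[i];
}

/* vmadd.vv vd, vs1, vs2: the destination is a multiplicand.
   Lanes with mask[i] == 0 are left unchanged (assumption).  */
static void
model_vmadd (uint32_t *vd, const uint32_t *vs1, const uint32_t *vs2,
             const uint8_t *mask, size_t vl)
{
  assert (vd && vs1 && vs2 && mask);
  for (size_t i = 0; i < vl; i++)
    if (mask[i])
      vd[i] = vs1[i] * vd[i] + vs2[i];
}
```

In both models the destination register is also a source, which is the overlap requirement discussed above: which source ends up tied to the destination decides between the two instructions.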

>>We only have three alternatives here.
Addressed in V2.

>>We have a bit of naming overlap between "insn" and "op" already.  I would go
>>with just ternary_insn or tern_insn here.  That the insn_types have OP in
>>their name is unfortunate but let's keep that for now.
Ok


>>Can we call data_mode dest_mode here?  data_mode imho only makes sense in
>>the context of conditionals where we have a comparison mode and a data mode.
>>I mean you could argue we always have a data mode and a mask mode so the
>>naming makes sense again but then we should get rid of dest_mode.

ok

>> __restrict vs restrict.

ok

>>Why the difference here?  Why do we need to restrict the optimization here
>>anyway?
Ok


>>Btw. any reason why you don't include fms, vnmsac in the patch?  Wouldn't the
>>patterns be really similar or do you have other plans for those?  Not needed
>>for this patch, just curious.
I want to keep the patch small and simple enough to review. After this patch 
is merged,
I will post fms.

Thanks.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-05-26 18:16
To: juzhe.zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw; pan2.li
Subject: Re: [PATCH] RISC-V: Add RVV FMA auto-vectorization support
Hi Juzhe,
 
> +;; We can't expand FMA for the following reasons:
 
But we do :)  We just haven't selected the proper alternative yet.
 
> +;; 1. Before RA, we don't know which multiply-add instruction is the ideal one.
> +;;    The vmacc is the ideal instruction when operands[3] overlaps operands[0].
> +;;    The vmadd is the ideal instruction when operands[1|2] overlaps operands[0].
> +;; 2. According to vector.md, the multiply-add patterns have a 'merge' operand which
> +;;    is the operands[5]. Since operands[5] should overlap operands[0], this operand
> +;;    should be allocated the same regno as operands[1|2|3].
> +;; 3. The 'merge' operand is always a real merge operand and we don't allow undefined
> +;;    operand.
> +;; 4. The operation of FMA pattern needs VLMAX vsetvli which needs a VL operand.
 
Can you explain these two points (3 and 4, maybe 2) a bit in the comments?
I.e. what makes fma different from a normal insn?
 
> +(define_insn_and_split "*fma"
> +  [(set (match_operand:VI 0 "register_operand" "=vr, vr, ?")
> + (plus:VI
> +   (mult:VI
> + (match_operand:VI 1 "register_operand" " %0, vr,   vr")
> + (match_operand:VI 2 "register_operand" " vr, vr,   vr"))
> +   (match_operand:VI 3 "register_operand"   " vr,  0,   vr")))
> +   (clobber (match_scratch:SI 4 "=r,r,r"))]
> +  "TARGET_VECTOR"
> +  "#"
> +  "&& reload_completed"
> +  [(const_int 0)]
> +  {
> +PUT_MODE (operands[4], Pmode);
> +riscv_vector::emit_vlmax_vsetvl (mode, operands[4]);
> +if (which_alternative == 3)
 
We only have three alternatives here.
 
> +  emit_insn (gen_rtx_SET (operands[0], operands[3]));
> +rtx ops[] = {operands[0], operands[1], operands[2], operands[3], operands[0]};
> +riscv_vector::emit_vlmax_ternop_insn (code_for_pred_mul_plus (mode),
> +   riscv_vector::RVV_TERNOP, ops, operands[4]);
> +DONE;
> +  }
> +  [(set_attr "type" "vimuladd")
> +   (set_attr "mode" "")])
> diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
> index 36419c95bbd..86b2798fb5e 100644
> --- a/gcc/config/riscv/riscv-protos.h
> +++ b/gcc/config/riscv/riscv-protos.h
> @@ -140,6 +140,7 @@ enum insn_type
>RVV_MERGE_OP = 4,
>RVV_CMP_OP = 4,
>RVV_CMP_MU_OP = RVV_CMP_OP + 2, /* +2 means mask and maskoff operand.  */
> +  RVV_TERNOP = 5,
>  };
 
> +emit_vlmax_ternop_insn (unsigned icode, int op_num, rtx *ops, rtx vl)
 
We have a bit of naming overlap between "insn" and "op" already.  I would go
with just ternary_insn or tern_insn here.  That the insn_types have OP in
their name is unfortunate but let's keep that for now. 
 
> +  machine_mode data_mode = GET_MODE (ops[0]);
> +  machine_mode mask_mode = get_mask_mode (data_mode).require ();
> +  /* We have a maximum of 11 operands for RVV instruction patterns according to
> +   * vector.md.  */
> +  insn_expander<11> e (/*OP_NUM*/ op_num, /*HAS_DEST_P*/ true,
> +/*FULLY_UNMASKED_P*/ true,
> +/*USE_REAL_MERGE_P*/ true, /*HAS_AVL_P*/ true,
> +

Re: Re: [PATCH v2] RISC-V: Implement autovec abs, vneg, vnot.

2023-05-25 Thread 钟居哲

LGTM this patch. Let's wait for kito's final approval.
Thanks.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-05-25 22:43
To: 钟居哲; gcc-patches; kito.cheng; palmer; Jeff Law
CC: rdapp.gcc
Subject: Re: [PATCH v2] RISC-V: Implement autovec abs, vneg, vnot.
> Besides, the V2 patch should change this:
> emit_vlmax_masked_insn (unsigned icode, int op_num, rtx *ops)
> 
> change it into emit_vlmax_masked_mu_insn .
 
V3 is inline with these changes.
 
This patch implements abs2, vneg2 and vnot2 expanders
for integer vector registers and adds tests for them.
 
gcc/ChangeLog:
 
* config/riscv/autovec.md (2): Add vneg/vnot.
(abs2): Add.
* config/riscv/riscv-protos.h (emit_vlmax_masked_mu_insn):
Declare.
* config/riscv/riscv-v.cc (emit_vlmax_masked_mu_insn): New
function.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/rvv.exp: Add unop tests.
* gcc.target/riscv/rvv/autovec/unop/abs-run.c: New test.
* gcc.target/riscv/rvv/autovec/unop/abs-rv32gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/abs-rv64gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/abs-template.h: New test.
* gcc.target/riscv/rvv/autovec/unop/vneg-run.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vneg-rv32gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vneg-rv64gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vneg-template.h: New test.
* gcc.target/riscv/rvv/autovec/unop/vnot-run.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vnot-rv32gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vnot-rv64gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vnot-template.h: New test.
---
gcc/config/riscv/autovec.md   | 43 ++-
gcc/config/riscv/riscv-protos.h   |  2 +
gcc/config/riscv/riscv-v.cc   | 16 +++
.../riscv/rvv/autovec/unop/abs-run.c  | 39 +
.../riscv/rvv/autovec/unop/abs-rv32gcv.c  |  8 
.../riscv/rvv/autovec/unop/abs-rv64gcv.c  |  8 
.../riscv/rvv/autovec/unop/abs-template.h | 26 +++
.../riscv/rvv/autovec/unop/vneg-run.c | 29 +
.../riscv/rvv/autovec/unop/vneg-rv32gcv.c |  6 +++
.../riscv/rvv/autovec/unop/vneg-rv64gcv.c |  6 +++
.../riscv/rvv/autovec/unop/vneg-template.h| 18 
.../riscv/rvv/autovec/unop/vnot-run.c | 43 +++
.../riscv/rvv/autovec/unop/vnot-rv32gcv.c |  6 +++
.../riscv/rvv/autovec/unop/vnot-rv64gcv.c |  6 +++
.../riscv/rvv/autovec/unop/vnot-template.h| 22 ++
gcc/testsuite/gcc.target/riscv/rvv/rvv.exp|  2 +
16 files changed, 279 insertions(+), 1 deletion(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/abs-run.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/abs-rv32gcv.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/abs-rv64gcv.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/abs-template.h
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vneg-run.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vneg-rv32gcv.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vneg-rv64gcv.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vneg-template.h
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vnot-run.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vnot-rv32gcv.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vnot-rv64gcv.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/vnot-template.h
 
diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index 7fe4d94de39..38216d9812f 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -145,7 +145,7 @@ (define_expand "3"
})
;; -
-;;  [INT] Binary shifts by scalar.
+;;  [INT] Binary shifts by vector.
;; -
;; Includes:
;; - vsll.vv/vsra.vv/vsrl.vv
@@ -373,3 +373,44 @@ (define_expand "vcondu"
 DONE;
   }
)
+
+;; =
+;; == Unary arithmetic
+;; =
+
+;; ---
+;;  [INT] Unary operations
+;; ---
+;; Includes:
+;; - vneg.v/vnot.v
+;; ---
+(define_expand "2"
+  [(set (match_operand:VI 0 "register_operand")
+(any_int_unop:VI
+ (match_operand:VI 1 "register_operand")))]
+  "TARGET_VECTOR"
+{
+  insn_code icode = code_for_pred (, mode);
+  riscv_ve

decrement IV patch creates failures on PowerPC

2023-05-25 Thread 钟居哲

Yesterday's patch has been approved (decrement IV support):
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619663.html 

However, it causes failures on PowerPC:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109971 

I am really sorry for causing inconvenience.

As we discussed, I wonder about this code:
+  /* If we're vectorizing a loop that uses length "controls" and
+ can iterate more than once, we apply decrementing IV approach
+ in loop control.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+  && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
+  && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
+  && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+  && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
+   LOOP_VINFO_VECT_FACTOR (loop_vinfo))))
+LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true;
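
As a rough reading aid, the condition above can be modelled in plain C. This is only a sketch: struct loop_model and its fields are hypothetical stand-ins for the loop_vinfo accessors, and it assumes a constant vectorization factor so that known_le reduces to an ordinary comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the loop_vinfo fields the condition reads.  */
struct loop_model
{
  bool can_use_partial_vectors; /* LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P */
  unsigned num_lens;            /* number of entries in LOOP_VINFO_LENS */
  int load_store_bias;          /* LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS */
  bool niters_known;            /* LOOP_VINFO_NITERS_KNOWN_P */
  uint64_t niters;              /* LOOP_VINFO_INT_NITERS */
  uint64_t vf;                  /* LOOP_VINFO_VECT_FACTOR, assumed constant */
};

/* Return true when the quoted condition would select the decrementing IV.  */
bool
model_use_decrementing_iv (const struct loop_model *l)
{
  assert (l != NULL);
  /* A loop that is known to run at most one vector iteration is excluded.  */
  bool single_iteration = l->niters_known && l->niters <= l->vf;
  return l->can_use_partial_vectors
	 && l->num_lens != 0
	 && l->load_store_bias == 0
	 && !single_iteration;
}
```

In this model a length-based target with a zero bias always takes the decrementing-IV path unless the loop is known to iterate once; nothing in the condition is target-specific.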

This condition cannot disable decrement IV on PowerPC.
Should I add a target hook for it?
I could only bootstrap and regression-test the patch on X86.
I don't have an environment to test PowerPC. I am really sorry.

Thanks.


juzhe.zh...@rivai.ai

Re: Re: [PATCH v2] RISC-V: Implement autovec abs, vneg, vnot.

2023-05-25 Thread 钟居哲

>> Yes, this is the emitted sequence, but the vsetvli mask is indeed
>> wrong.  Just got lucky there.  Or what else did you mean with
>> logically incorrect?
Oh, sorry. I didn't mean this patch is logically incorrect.
I mean the MASK_ANY is logically incorrect.
This patch is ok to me as long as you change MASK_ANY into MASK_UNDISTURBED.

Besides, the V2 patch should change this:
emit_vlmax_masked_insn (unsigned icode, int op_num, rtx *ops)

change it into emit_vlmax_masked_mu_insn.


Thanks.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-05-25 20:32
To: juzhe.zh...@rivai.ai; gcc-patches; kito.cheng; palmer; jeffreyalaw
CC: rdapp.gcc
Subject: Re: [PATCH v2] RISC-V: Implement autovec abs, vneg, vnot.
> I think it's logically incorrect.  For ABS, you want:
> 
> operands[0] = operands[1] > 0 ? operands[1] :  (-operands[1])
> So you should do this following sequence:
> 
> vmslt v0,v1,0
> vneg v1,v1,v0.t (should use Mask undisturbed)
 
Yes, this is the emitted sequence, but the vsetvli mask is indeed
wrong.  Just got lucky there.  Or what else did you mean with
logically incorrect?
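
For readers following along, here is a small scalar sketch of why the mask policy matters for this sequence. It assumes the RVV rule that a mask-agnostic operation may either leave inactive elements alone or overwrite them with all ones; the second function models the all-ones outcome. Function names are hypothetical and not part of GCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Negate with wrap-around so that INT32_MIN does not invoke undefined
   behaviour in this model.  */
static int32_t
model_neg (int32_t x)
{
  return (int32_t) (0u - (uint32_t) x);
}

/* vmslt v0,v1,0 followed by vneg v1,v1,v0.t with mask-undisturbed:
   inactive elements keep their previous (non-negative) value.  */
void
model_abs_mask_undisturbed (int32_t *v1, size_t vl)
{
  assert (v1 != NULL);
  for (size_t i = 0; i < vl; i++)
    {
      int active = v1[i] < 0;	/* v0[i] */
      if (active)
	v1[i] = model_neg (v1[i]);
    }
}

/* Same sequence under mask-agnostic, modelling the permitted outcome in
   which inactive elements are overwritten with all ones.  */
void
model_abs_mask_agnostic_ones (int32_t *v1, size_t vl)
{
  assert (v1 != NULL);
  for (size_t i = 0; i < vl; i++)
    {
      int active = v1[i] < 0;	/* v0[i] */
      v1[i] = active ? model_neg (v1[i]) : -1;
    }
}
```

With the input {-3, 4, 0, -7} the mask-undisturbed model yields {3, 4, 0, 7}, while the all-ones model yields {3, -1, -1, 7}. Hardware that happens to leave inactive elements alone hides the problem, which matches the "just got lucky" observation.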
 
> Here I see you set:
> e.set_policy (MASK_ANY); which is incorrect.
> You should use e.set_policy (MASK_UNDISTURBED); instead.
>
> Your testcases fail to catch this issue (you should create a testcase
> to catch this bug with this patch implementation.)
 
Added a regex to look for "ta,mu".
 
> You should not use RVV_UNOP+2. Instead, you should add an enum call
> RVV_UNOP_MU and replace it.
 
I was a bit wary of adding yet another, would rather have that
unified somehow, but well ;) Another time.  Adjusted locally.

Re: Re: [PATCH V16] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-25 Thread 钟居哲

Thanks Richard so much.
I have sent V17 patch for commit (fix format as you suggested).
You don't need to reply to that.

I am waiting for Richi's final approval.

Thanks.

juzhe.zh...@rivai.ai

From: Richard Sandiford
Date: 2023-05-25 20:36
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V16] VECT: Add decrement IV iteration loop control by 
variable amount support
This looks good to me.  Just a couple of very minor cosmetic things:

juzhe.zh...@rivai.ai writes:
> @@ -753,17 +846,35 @@ vect_set_loop_condition_partial_vectors (class loop 
> *loop,
>continue;
>}
>  
> - /* See whether zero-based IV would ever generate all-false masks
> -or zero length before wrapping around.  */
> - bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
> -
> - /* Set up all controls for this group.  */
> - test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
> -  _seq,
> -  _seq,
> -  loop_cond_gsi, rgc,
> -  niters, niters_skip,
> -  might_wrap_p);
> + if (!LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) || !iv_rgc
> + || (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
> + != rgc->max_nscalars_per_iter * rgc->factor))

Coding style is to put each subcondition on a separate line when the
whole condition doesn't fit on a single line.  So:

if (!LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
|| !iv_rgc
|| (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
!= rgc->max_nscalars_per_iter * rgc->factor))

> @@ -2725,6 +2726,17 @@ start_over:
>&& !vect_verify_loop_lens (loop_vinfo))
>  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>  
> +  /* If we're vectorizing an loop that uses length "controls" and

s/an loop/a loop/ (Sorry for not noticing earlier.)

OK for trunk from my POV with those changes; no need to repost unless
your policies require it.  Please give Richi a chance to comment too
though.

Thanks for your patience with the review process.  The final result
seems pretty clean to me.

Richard

Re: Re: [PATCH V15] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-25 Thread 钟居哲

Thank you so much for your patience.
Could you take a look at V16 patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619652.html 
whether it is ok for trunk ?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-25 18:19
To: juzhe.zhong@rivai.ai
CC: gcc-patches; rguenther
Subject: Re: [PATCH V15] VECT: Add decrement IV iteration loop control by 
variable amount support
"juzhe.zh...@rivai.ai"  writes:
> Hi, Richard. Thanks for the comments.
>
>>> if (!LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
>>>     || !iv_rgc
>>>     || (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
>>>         != rgc->max_nscalars_per_iter * rgc->factor))
>>>   {
>>>     /* See whether zero-based IV would ever generate all-false masks
>>>        or zero length before wrapping around.  */
>>>     bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>>>
>>>     /* Set up all controls for this group.  */
>>>     test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
>>>                                                  _seq,
>>>                                                  _seq,
>>>                                                  loop_cond_gsi, rgc,
>>>                                                  niters, niters_skip,
>>>                                                  might_wrap_p);
>>>
>>>     iv_rgc = rgc;
>>>   }
>
> Could you tell me why you add:
>>> (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
>>>  != rgc->max_nscalars_per_iter * rgc->factor) ?
 
The patch creates IVs with the following step:
 
  gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
index_before_incr,
nitems_step));
 
If nitems_step is the same for two IVs, those IVs will always be equal.
 
So having multiple IVs with the same nitems_step is redundant.
 
nitems_step is calculated as follows:
 
  unsigned int nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
  ...
  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
  ...
 
  if (nitems_per_iter != 1)
{
  ...
  tree iv_factor = build_int_cst (iv_type, nitems_per_iter);
  ...
  nitems_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
  nitems_step, iv_factor);
  ...
}
 
so nitems_step is equal to:
 
  rgc->max_nscalars_per_iter * rgc->factor * VF
 
VF is fixed for a loop, so nitems_step is equal for two different
rgroup_controls if:
 
  rgc->max_nscalars_per_iter * rgc->factor
 
is the same for those rgroup_controls.
 
Please try the example I posted earlier today. I think you'll see that,
without the:
 
  (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
   != rgc->max_nscalars_per_iter * rgc->factor)
 
you'll have two IVs with the same step (because their MIN_EXPRs have
the same bound).
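
The reuse test described above can be sketched in isolation as follows. This is a simplified model, not the vectorizer code: struct rgroup_model keeps only the two fields that enter the product, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of rgroup_controls to the fields used here.  */
struct rgroup_model
{
  unsigned max_nscalars_per_iter;
  unsigned factor;
};

/* Items counted per scalar iteration; VF is the same for every rgroup in
   a loop, so it does not affect the comparison.  */
static unsigned
model_nitems_per_iter (const struct rgroup_model *rgc)
{
  return rgc->max_nscalars_per_iter * rgc->factor;
}

/* Return true when a fresh IV has to be created for RGC, false when the
   IV already created for IV_RGC would have an identical step.  */
bool
model_needs_new_iv (bool using_decrementing_iv,
		    const struct rgroup_model *iv_rgc,
		    const struct rgroup_model *rgc)
{
  assert (rgc != NULL);
  return !using_decrementing_iv
	 || iv_rgc == NULL
	 || model_nitems_per_iter (iv_rgc) != model_nitems_per_iter (rgc);
}
```

Two rgroups such as {2, 1} and {1, 2} give the same product and can share one IV, whereas {2, 1} and {4, 1} cannot.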
 
Thanks,
Richard

Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Hi, Richard. After fixing the code, I think the IR is now correct:

loop_len_34 = MIN_EXPR ;
  _74 = loop_len_34 * 2;
  loop_len_48 = MIN_EXPR <_74, 4>;
  _75 = _74 - loop_len_48;
  loop_len_49 = MIN_EXPR <_75, 4>;
  _76 = _75 - loop_len_49;
  loop_len_50 = MIN_EXPR <_76, 4>;
  loop_len_51 = _76 - loop_len_50;
  ...
  vect__1.8_33 = .LEN_LOAD (_17, 16B, loop_len_34, 0);
...
  .LEN_STORE (_17, 16B, loop_len_34, vect__4.11_21, 0);
...

  vect__10.16_52 = .LEN_LOAD (_31, 32B, loop_len_48, 0);
...
  vect__10.17_54 = .LEN_LOAD (_29, 32B, loop_len_49, 0);
...
  vect__10.18_56 = .LEN_LOAD (_25, 32B, loop_len_50, 0);
...
  vect__10.19_58 = .LEN_LOAD (_80, 32B, loop_len_51, 0);


For this case:

uint64_t x2[100];
uint16_t y2[200];

void f2(int n) {
  for (int i = 0, j = 0; i < n; i += 2, j += 4) {
x2[i + 0] += 1;
x2[i + 1] += 2;
y2[j + 0] += 1;
y2[j + 1] += 2;
y2[j + 2] += 3;
y2[j + 3] += 4;
  }
}

The IR is like this:

  loop_len_56 = MIN_EXPR ;
  _66 = loop_len_56 * 4;
  loop_len_43 = _66 + 18446744073709551614;
  ...
  vect__1.44_44 = .LEN_LOAD (_6, 64B, 2, 0);
  ...
  vect__1.45_46 = .LEN_LOAD (_14, 64B, loop_len_43, 0);
  vect__2.46_47 = vect__1.44_44 + { 1, 2 };
  vect__2.46_48 = vect__1.45_46 + { 1, 2 };
  .LEN_STORE (_6, 64B, 2, vect__2.46_47, 0);
  .LEN_STORE (_14, 64B, loop_len_43, vect__2.46_48, 0);
  ...
  vect__6.51_57 = .LEN_LOAD (_10, 16B, loop_len_56, 0);

  vect__7.52_58 = vect__6.51_57 + { 1, 2, 3, 4, 1, 2, 3, 4 };
  .LEN_STORE (_10, 16B, loop_len_56, vect__7.52_58, 0);

It seems correct too ?

>> What gives the best code in these cases?  Is emitting a multiplication
>> better?  Or is using a new IV better?
Could you give me more detailed information about the "fresh IV" approach?
I'd like to try that.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-25 00:00
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Oh. I see. Thank you so much for pointing this out.
> Could you tell me what I should do in the code?
> It seems that I should adjust it in vect_adjust_loop_lens_control
>
> Multiply by some factor? Is it correct to multiply by max_nscalars_per_iter?
 
max_nscalars_per_iter * factor rather than just max_nscalars_per_iter
 
Note that it's possible for later max_nscalars_per_iter * factor to
be smaller, so a division might be needed in rare cases.  E.g.:
 
uint64_t x[100];
uint16_t y[200];
 
void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}
 
where y has a single-control rgroup with max_nscalars_per_iter == 4
and x has a 2-control rgroup with max_nscalars_per_iter == 2
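
A minimal sketch of the rescaling this implies, assuming the two products (max_nscalars_per_iter * factor) divide one another exactly, as they would for power-of-two values. The helper name and signature are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert LEN, counted in units of FROM_NITEMS items per scalar iteration,
   into units of TO_NITEMS items per scalar iteration.  Multiplies when the
   target rgroup counts more items, divides when it counts fewer.  */
uint64_t
model_rescale_len (uint64_t len, unsigned from_nitems, unsigned to_nitems)
{
  assert (from_nitems != 0 && to_nitems != 0);
  if (to_nitems >= from_nitems)
    {
      assert (to_nitems % from_nitems == 0);
      return len * (to_nitems / from_nitems);
    }
  assert (from_nitems % to_nitems == 0);
  return len / (from_nitems / to_nitems);
}
```

For the example above, an IV length of 8 taken from the y rgroup (4 items per scalar iteration) becomes 4 for the x rgroup (2 items per scalar iteration); in the opposite direction a length of 8 becomes 16.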
 
What gives the best code in these cases?  Is emitting a multiplication
better?  Or is using a new IV better?
 
Thanks,
Richard

Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Hi, for the first piece of code, I tried:
  unsigned int nitems_per_iter
= dest_rgm->max_nscalars_per_iter * dest_rgm->factor;
  step = gimple_build (seq, MULT_EXPR, iv_type, step,
   build_int_cst (iv_type, nitems_per_iter));

Then the optimized IR is:
loop_len_34 = MIN_EXPR ;
  _74 = loop_len_34 * 4;
  loop_len_51 = _74 + 18446744073709551604;

  _16 = (void *) ivtmp.27_41;
  _17 =   [(short int *)_16];

  vect__1.7_33 = .LEN_LOAD (_17, 16B, loop_len_34, 0);

  vect__2.8_23 = VIEW_CONVERT_EXPR(vect__1.7_33);
  vect__3.9_22 = vect__2.8_23 + { 1, 2, 1, 2, 1, 2, 1, 2 };
  vect__4.10_21 = VIEW_CONVERT_EXPR(vect__3.9_22);
  .LEN_STORE (_17, 16B, loop_len_34, vect__4.10_21, 0);
  _20 = (void *) ivtmp.28_1;
  _31 =   [(int *)_20];

  vect__10.15_52 = .LEN_LOAD (_31, 32B, 4, 0);

  _30 = (void *) ivtmp.31_4;
  _29 =   [(int *)_30];

  vect__10.16_54 = .LEN_LOAD (_29, 32B, 4, 0);

  _26 = (void *) ivtmp.32_8;
  _25 =   [(int *)_26];

  vect__10.17_56 = .LEN_LOAD (_25, 32B, 4, 0);

  _79 = (void *) ivtmp.33_12;
  _80 =   [(int *)_79];

  vect__10.18_58 = .LEN_LOAD (_80, 32B, loop_len_51, 0);

Is it correct? It looks weird.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-25 00:00
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Oh. I see. Thank you so much for pointing this out.
> Could you tell me what I should do in the code?
> It seems that I should adjust it in vect_adjust_loop_lens_control
>
> Multiply by some factor? Is it correct to multiply by max_nscalars_per_iter?
 
max_nscalars_per_iter * factor rather than just max_nscalars_per_iter
 
Note that it's possible for later max_nscalars_per_iter * factor to
be smaller, so a division might be needed in rare cases.  E.g.:
 
uint64_t x[100];
uint16_t y[200];
 
void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}
 
where y has a single-control rgroup with max_nscalars_per_iter == 4
and x has a 2-control rgroup with max_nscalars_per_iter == 2
 
What gives the best code in these cases?  Is emitting a multiplication
better?  Or is using a new IV better?
 
Thanks,
Richard

Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Oh. I see. Thank you so much for pointing this out.
Could you tell me what I should do in the code?
It seems that I should adjust it in vect_adjust_loop_lens_control

Multiply by some factor? Is it correct to multiply by max_nscalars_per_iter?
Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 23:47
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Hi, Richard. I still don't understand it. Sorry about that.
>
>>>   loop_len_48 = MIN_EXPR ;
>>>   _74 = loop_len_34 * 2 - loop_len_48;
>
> I have already run the tests.
> We have a MIN_EXPR to calculate the total elements:
> loop_len_34 = MIN_EXPR ;
> I think "8" is already multiplied by 2?
>
> Why do we need loop_len_34 * 2 ?
> Could you give me more information? The similar tests you presented already
> have execution checks and pass. I am not sure whether this patch has an issue
> that I didn't notice.
 
Think about the maximum values of each SSA name:
 
   loop_len_34 = MIN_EXPR ;   // MAX 8
   loop_len_48 = MIN_EXPR ;// MAX 4
   _74 = loop_len_34 - loop_len_48;// MAX 4
   loop_len_49 = MIN_EXPR <_74, 4>;// MAX 4 (always == _74)
   _75 = _74 - loop_len_49;// 0
   loop_len_50 = MIN_EXPR <_75, 4>;// 0
   loop_len_51 = _75 - loop_len_50;// 0
 
So the final two y vectors will always have 0 controls.
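
The same bookkeeping can be checked with a short scalar model. It mirrors the MIN_EXPR/subtract chain shown in the dumps in this thread, with the x length optionally scaled before it is split across the four y controls; the function name and the scale parameter are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t
model_min (uint64_t a, uint64_t b)
{
  return a < b ? a : b;
}

/* Split X_LEN * SCALE items across four controls, each capped to CAP,
   in the same way as the chain of MIN_EXPRs and subtractions.  */
void
model_y_lens (uint64_t x_len, uint64_t scale, uint64_t cap, uint64_t lens[4])
{
  assert (cap > 0);
  uint64_t remaining = x_len * scale;
  for (int i = 0; i < 3; i++)
    {
      lens[i] = model_min (remaining, cap);
      remaining -= lens[i];
    }
  /* The last control takes whatever is left, as loop_len_51 does.  */
  lens[3] = remaining;
}
```

With x_len = 8, cap = 4 and scale = 1 the controls come out as {4, 4, 0, 0}, so the last two y vectors never process anything. With scale = 2 they come out as {4, 4, 4, 4}, covering all 16 y elements.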
 
Thanks,
Richard

Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Hi, Richard. I still don't understand it. Sorry about that.

>>   loop_len_48 = MIN_EXPR ;
>>   _74 = loop_len_34 * 2 - loop_len_48;

I have already run the tests.
We have a MIN_EXPR to calculate the total elements:
loop_len_34 = MIN_EXPR ;
I think "8" is already multiplied by 2?

Why do we need loop_len_34 * 2 ?
Could you give me more information? The similar tests you presented already
have execution checks and pass. I am not sure whether this patch has an issue
that I didn't notice.

Thanks.

juzhe.zh...@rivai.ai

From: Richard Sandiford
Date: 2023-05-24 23:31
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Hi, the .optimized dump is like this:
>
>[local count: 21045336]:
>   ivtmp.26_36 = (unsigned long) 
>   ivtmp.27_3 = (unsigned long) 
>   ivtmp.30_6 = (unsigned long)   [(void *) + 16B];
>   ivtmp.31_10 = (unsigned long)   [(void *) + 32B];
>   ivtmp.32_14 = (unsigned long)   [(void *) + 48B];
>
>[local count: 273589366]:
>   # ivtmp_72 = PHI 
>   # ivtmp.26_41 = PHI 
>   # ivtmp.27_1 = PHI 
>   # ivtmp.30_4 = PHI 
>   # ivtmp.31_8 = PHI 
>   # ivtmp.32_12 = PHI 
>   loop_len_34 = MIN_EXPR ;
>   loop_len_48 = MIN_EXPR ;
>   _74 = loop_len_34 - loop_len_48;

Yeah, I think this needs to be:

  loop_len_48 = MIN_EXPR ;
  _74 = loop_len_34 * 2 - loop_len_48;

(as valid gimple).  The point is that...

>   loop_len_49 = MIN_EXPR <_74, 4>;
>   _75 = _74 - loop_len_49;
>   loop_len_50 = MIN_EXPR <_75, 4>;
>   loop_len_51 = _75 - loop_len_50;

...there are 4 lengths capped to 4, for a total element count of 16.
But loop_len_34 is never greater than 8.

So for this case we either need to multiply, or we need to create
a fresh IV for the second rgroup.  Both approaches are fine.

Thanks,
Richard

Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Hi, Richard.

After analyzing it, I think it can work.
Let's take a look at the code:

void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}

For "x", each scalar iteration calculates 2 elements (x[i + 0] and x[i + 1]).
For "y", each scalar iteration calculates 4 elements (y[j + 0], y[j + 1],
y[j + 2] and y[j + 3]).
With this patch:

loop_len_34 = MIN_EXPR ;
The total number of "x" elements per vector iteration is at most 8, which is
128 bits (8 16-bit elements).
So the vector can process 4 scalar iterations (x[i + 0] and x[i + 1] each).
So there is a len_load: vect__1.6_33 = .LEN_LOAD (_17, 16B, loop_len_34, 0);

Since the 16-bit "x" covers 4 scalar iterations, the 32-bit "y" also covers 4
scalar iterations, each processing 4 scalar elements (y[j + 0], y[j + 1],
y[j + 2] and y[j + 3]).

So you can see 4 vector operations of y:
 vect__11.18_59 = vect__10.14_52 + { 1, 2, 3, 4 };
  vect__11.18_60 = vect__10.15_54 + { 1, 2, 3, 4 };
  vect__11.18_61 = vect__10.16_56 + { 1, 2, 3, 4 };
  vect__11.18_62 = vect__10.17_58 + { 1, 2, 3, 4 };
  .LEN_STORE (_31, 32B, loop_len_48, vect__11.18_59, 0);
  .LEN_STORE (_29, 32B, loop_len_49, vect__11.18_60, 0);
  .LEN_STORE (_25, 32B, loop_len_50, vect__11.18_61, 0);
  .LEN_STORE (_79, 32B, loop_len_51, vect__11.18_62, 0);

So each vector loop iteration has 1 group of "x" (4 * 2 = 8 elements) and
4 groups of "y" (4 * 4 = 16 elements).

And we adjust the loop length for each control of y:
loop_len_34 = MIN_EXPR ;
  loop_len_48 = MIN_EXPR ;
  _74 = loop_len_34 - loop_len_48;
  loop_len_49 = MIN_EXPR <_74, 4>;
  _75 = _74 - loop_len_49;
  loop_len_50 = MIN_EXPR <_75, 4>;
  loop_len_51 = _75 - loop_len_50;

It seems to work. I wonder why we need the multiplication?

Thanks.


juzhe.zh...@rivai.ai
 
From: 钟居哲
Date: 2023-05-24 23:13
To: richard.sandiford
CC: gcc-patches; rguenther
Subject: Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
Hi, the .optimized dump is like this:

   [local count: 21045336]:
  ivtmp.26_36 = (unsigned long) 
  ivtmp.27_3 = (unsigned long) 
  ivtmp.30_6 = (unsigned long)   [(void *) + 16B];
  ivtmp.31_10 = (unsigned long)   [(void *) + 32B];
  ivtmp.32_14 = (unsigned long)   [(void *) + 48B];

   [local count: 273589366]:
  # ivtmp_72 = PHI 
  # ivtmp.26_41 = PHI 
  # ivtmp.27_1 = PHI 
  # ivtmp.30_4 = PHI 
  # ivtmp.31_8 = PHI 
  # ivtmp.32_12 = PHI 
  loop_len_34 = MIN_EXPR ;
  loop_len_48 = MIN_EXPR ;
  _74 = loop_len_34 - loop_len_48;
  loop_len_49 = MIN_EXPR <_74, 4>;
  _75 = _74 - loop_len_49;
  loop_len_50 = MIN_EXPR <_75, 4>;
  loop_len_51 = _75 - loop_len_50;
  _16 = (void *) ivtmp.26_41;
  _17 =   [(short int *)_16];
  vect__1.6_33 = .LEN_LOAD (_17, 16B, loop_len_34, 0);
  vect__2.7_23 = VIEW_CONVERT_EXPR(vect__1.6_33);
  vect__3.8_22 = vect__2.7_23 + { 1, 2, 1, 2, 1, 2, 1, 2 };
  vect__4.9_21 = VIEW_CONVERT_EXPR(vect__3.8_22);
  .LEN_STORE (_17, 16B, loop_len_34, vect__4.9_21, 0);
  _20 = (void *) ivtmp.27_1;
  _31 =   [(int *)_20];
  vect__10.14_52 = .LEN_LOAD (_31, 32B, loop_len_48, 0);
  _30 = (void *) ivtmp.30_4;
  _29 =   [(int *)_30];
  vect__10.15_54 = .LEN_LOAD (_29, 32B, loop_len_49, 0);
  _26 = (void *) ivtmp.31_8;
  _25 =   [(int *)_26];
  vect__10.16_56 = .LEN_LOAD (_25, 32B, loop_len_50, 0);
  _78 = (void *) ivtmp.32_12;
  _79 =   [(int *)_78];
  vect__10.17_58 = .LEN_LOAD (_79, 32B, loop_len_51, 0);
  vect__11.18_59 = vect__10.14_52 + { 1, 2, 3, 4 };
  vect__11.18_60 = vect__10.15_54 + { 1, 2, 3, 4 };
  vect__11.18_61 = vect__10.16_56 + { 1, 2, 3, 4 };
  vect__11.18_62 = vect__10.17_58 + { 1, 2, 3, 4 };
  .LEN_STORE (_31, 32B, loop_len_48, vect__11.18_59, 0);
  .LEN_STORE (_29, 32B, loop_len_49, vect__11.18_60, 0);
  .LEN_STORE (_25, 32B, loop_len_50, vect__11.18_61, 0);
  .LEN_STORE (_79, 32B, loop_len_51, vect__11.18_62, 0);
  ivtmp_73 = ivtmp_72 - loop_len_34;
  ivtmp.26_37 = ivtmp.26_41 + 16;
  ivtmp.27_2 = ivtmp.27_1 + 64;
  ivtmp.30_5 = ivtmp.30_4 + 64;
  ivtmp.31_9 = ivtmp.31_8 + 64;
  ivtmp.32_13 = ivtmp.32_12 + 64;
  if (ivtmp_73 != 0)
goto ; [92.31%]
  else
goto ; [7.69%]

I am still checking it, but I am sending it to you early.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 23:07
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
Thanks for trying it.  I'm still surprised that no multiplication
is needed though.  Does the patch work for:
 
short x[100];
int y[200];
 
void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}
 
?  Here, there should be a single-control rgroup for x, counting
2 units per scalar iteration.  I'd expect the IV to

Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Hi, the .optimized dump is like this:

   [local count: 21045336]:
  ivtmp.26_36 = (unsigned long) 
  ivtmp.27_3 = (unsigned long) 
  ivtmp.30_6 = (unsigned long)   [(void *) + 16B];
  ivtmp.31_10 = (unsigned long)   [(void *) + 32B];
  ivtmp.32_14 = (unsigned long)   [(void *) + 48B];

   [local count: 273589366]:
  # ivtmp_72 = PHI 
  # ivtmp.26_41 = PHI 
  # ivtmp.27_1 = PHI 
  # ivtmp.30_4 = PHI 
  # ivtmp.31_8 = PHI 
  # ivtmp.32_12 = PHI 
  loop_len_34 = MIN_EXPR ;
  loop_len_48 = MIN_EXPR ;
  _74 = loop_len_34 - loop_len_48;
  loop_len_49 = MIN_EXPR <_74, 4>;
  _75 = _74 - loop_len_49;
  loop_len_50 = MIN_EXPR <_75, 4>;
  loop_len_51 = _75 - loop_len_50;
  _16 = (void *) ivtmp.26_41;
  _17 =   [(short int *)_16];
  vect__1.6_33 = .LEN_LOAD (_17, 16B, loop_len_34, 0);
  vect__2.7_23 = VIEW_CONVERT_EXPR(vect__1.6_33);
  vect__3.8_22 = vect__2.7_23 + { 1, 2, 1, 2, 1, 2, 1, 2 };
  vect__4.9_21 = VIEW_CONVERT_EXPR(vect__3.8_22);
  .LEN_STORE (_17, 16B, loop_len_34, vect__4.9_21, 0);
  _20 = (void *) ivtmp.27_1;
  _31 =   [(int *)_20];
  vect__10.14_52 = .LEN_LOAD (_31, 32B, loop_len_48, 0);
  _30 = (void *) ivtmp.30_4;
  _29 =   [(int *)_30];
  vect__10.15_54 = .LEN_LOAD (_29, 32B, loop_len_49, 0);
  _26 = (void *) ivtmp.31_8;
  _25 =   [(int *)_26];
  vect__10.16_56 = .LEN_LOAD (_25, 32B, loop_len_50, 0);
  _78 = (void *) ivtmp.32_12;
  _79 =   [(int *)_78];
  vect__10.17_58 = .LEN_LOAD (_79, 32B, loop_len_51, 0);
  vect__11.18_59 = vect__10.14_52 + { 1, 2, 3, 4 };
  vect__11.18_60 = vect__10.15_54 + { 1, 2, 3, 4 };
  vect__11.18_61 = vect__10.16_56 + { 1, 2, 3, 4 };
  vect__11.18_62 = vect__10.17_58 + { 1, 2, 3, 4 };
  .LEN_STORE (_31, 32B, loop_len_48, vect__11.18_59, 0);
  .LEN_STORE (_29, 32B, loop_len_49, vect__11.18_60, 0);
  .LEN_STORE (_25, 32B, loop_len_50, vect__11.18_61, 0);
  .LEN_STORE (_79, 32B, loop_len_51, vect__11.18_62, 0);
  ivtmp_73 = ivtmp_72 - loop_len_34;
  ivtmp.26_37 = ivtmp.26_41 + 16;
  ivtmp.27_2 = ivtmp.27_1 + 64;
  ivtmp.30_5 = ivtmp.30_4 + 64;
  ivtmp.31_9 = ivtmp.31_8 + 64;
  ivtmp.32_13 = ivtmp.32_12 + 64;
  if (ivtmp_73 != 0)
goto ; [92.31%]
  else
goto ; [7.69%]

I am still checking it, but I am sending it to you early.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 23:07
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
Thanks for trying it.  I'm still surprised that no multiplication
is needed though.  Does the patch work for:
 
short x[100];
int y[200];
 
void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}
 
?  Here, there should be a single-control rgroup for x, counting
2 units per scalar iteration.  I'd expect the IV to use this scale.
 
There should also be a 4-control rgroup for y, counting 4 units per
scalar iteration.  So I think the IV would need to be multiplied by 2
before being used for the y rgroup.
 
Thanks,
Richard
 
juzhe.zh...@rivai.ai writes:
> From: Ju-Zhe Zhong 
>
> This patch is supporting decrement IV by following the flow designed by 
> Richard:
>
> (1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
> call vect_set_loop_controls_directly.
>
> (2) vect_set_loop_controls_directly calculates "step" as in your patch.
> If rgc has 1 control, this step is the SSA name created for that control.
> Otherwise the step is a fresh SSA name, as in your patch.
>
> (3) vect_set_loop_controls_directly stores this step somewhere for later
> use, probably in LOOP_VINFO.  Let's use "S" to refer to this stored step.
>
> (4) After the vect_set_loop_controls_directly call above, and outside
> the "if" statement that now contains vect_set_loop_controls_directly,
> check whether rgc->controls.length () > 1.  If so, use
> vect_adjust_loop_lens_control to set the controls based on S.
>
> Then the only caller of vect_adjust_loop_lens_control is
> vect_set_loop_condition_partial_vectors.  And the starting
> step for vect_adjust_loop_lens_control is always S.
>
> This patch has been well tested for single-rgroup and multiple-rgroup (SLP)
> loops and passes all testcases in the RISC-V port.
>
> It also passes tests for multiple-rgroup (non-SLP) loops, tested with vec_pack_trunc.
>
> ---
>  gcc/tree-vect-loop-manip.cc | 178 +---
>  gcc/tree-vect-loop.cc   |  13 +++
>  gcc/tree-vectorizer.h   |  12 +++
>  3 files changed, 192 insertions(+), 11 deletions(-)
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index ff6159e08d5..578ac5b783e 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -468,6 +468,38 @@ vect_set_loop_controls_directly (class loop *loop, 
> loop_vec_info loop_vinfo,
>gimple_stmt_iterator incr_gsi;
>bool insert_after;
>standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> +  if

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Yeah, thanks. I have sent V14:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619478.html 
While working on it, I found that there is no distinction between SLP and non-SLP.

Could you review it? I think it's more reasonable now.

Thanks.



juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 22:57
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
>>> Both approaches are fine.  I'm not against one or the other.
>
>>> What I didn't understand was why your patch only reuses existing IVs
>>> for max_nscalars_per_iter == 1.  Was it to avoid having to do a
>>> multiplication (well, really a shift left) when moving from one
>>> rgroup to another?  E.g. if one rgroup had:
>
>>>   nscalars_per_iter == 2 && factor == 1
>
>>> and another had:
>
>>>   nscalars_per_iter == 4 && factor == 1
>
>>> then we would need to multiply by 2 when going from the first rgroup
>>> to the second.
>
>>> If so, avoiding a multiplication seems like a good reason for the choice
>>> you were making in the patch.  But we then need to check
>>> max_nscalars_per_iter == 1 for both the source rgroup and the
>>> destination rgroup, not just the destination.  And I think the
>>> condition for “no multiplication needed” should be that:
>
> Oh, I didn't realize the problem was so complicated. Frankly, I didn't
> understand rgroups well. Sorry about that :).
>
> I just remember that last time you said I need to handle multiple rgroups
> not only for SLP but also for non-SLP (the vec_pack_trunc case that I tested).
> Then I asked you how to identify non-SLP, and you said max_nscalars_per_iter == 1.
 
Yeah, max_nscalars_per_iter == 1 is the right way of checking for non-SLP.
 
But I've never been convinced that SLP vs. non-SLP is a meaningful
distinction for this patch (that is, the parts that don't use
SELECT_VL).
 
SLP vs. non-SLP matters for SELECT_VL.  But the rgroup abstraction
should mean that SLP vs. non-SLP doesn't matter otherwise.
 
Thanks,
Richard

Re: [PATCH V13] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Please ignore V13 and review V14 directly:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619478.html 

Thanks.



juzhe.zh...@rivai.ai
 
From: juzhe.zhong
Date: 2023-05-24 22:29
To: gcc-patches
CC: richard.sandiford; rguenther; Ju-Zhe Zhong
Subject: [PATCH V13] VECT: Add decrement IV iteration loop control by variable 
amount support
From: Ju-Zhe Zhong 
 
This patch adds support for a decrementing IV, following the flow designed by Richard:
 
(1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
call vect_set_loop_controls_directly.
 
(2) vect_set_loop_controls_directly calculates "step" as in your patch.
If rgc has 1 control, this step is the SSA name created for that control.
Otherwise the step is a fresh SSA name, as in your patch.
 
(3) vect_set_loop_controls_directly stores this step somewhere for later
use, probably in LOOP_VINFO.  Let's use "S" to refer to this stored step.
 
(4) After the vect_set_loop_controls_directly call above, and outside
the "if" statement that now contains vect_set_loop_controls_directly,
check whether rgc->controls.length () > 1.  If so, use
vect_adjust_loop_lens_control to set the controls based on S.
 
Then the only caller of vect_adjust_loop_lens_control is
vect_set_loop_condition_partial_vectors.  And the starting
step for vect_adjust_loop_lens_control is always S.
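 
The decrementing IV that steps (1) to (3) set up can be modelled in scalar code roughly as follows. This is only a sketch under stated assumptions: decrementing_iv_steps is a hypothetical helper, vf stands for the maximum number of items one vector iteration can process, and a zero trip count is assumed to be guarded outside the loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Scalar model of the loop control: the IV starts at the total item count,
// each iteration processes step = MIN (iv, vf) items, and the IV is
// decremented by that step until it reaches zero.
std::vector<uint64_t> decrementing_iv_steps (uint64_t count, uint64_t vf)
{
  assert (vf > 0);
  std::vector<uint64_t> steps;
  uint64_t iv = count;
  if (iv == 0)
    return steps;  // Assumption: the loop is skipped when there is no work.
  do
    {
      uint64_t step = std::min (iv, vf);  // Plays the role of "S".
      steps.push_back (step);             // The length used by this iteration.
      iv -= step;                         // Decrement by a variable amount.
    }
  while (iv != 0);
  return steps;
}
```

For example, 10 items with vf == 4 give the steps 4, 4 and 2, so only the final iteration is partial.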
 
This patch has been well tested for single-rgroup and multiple-rgroup (SLP)
loops and passes all testcases in the RISC-V port.
 
It also passes tests for multiple-rgroup (non-SLP) loops, tested with vec_pack_trunc.
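 
For step (4), the way the stored step S could be divided among the controls of a multi-control rgroup can be sketched as below. This is an illustration rather than the patch itself: split_step and limit are hypothetical names, with limit standing for the per-control cap (VF/N), and the sketch assumes S never exceeds limit times the number of controls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Split STEP into NCONTROLS lengths.  Every length except the last is
// MIN (remaining, LIMIT); the last length is whatever remains.
std::vector<uint64_t> split_step (uint64_t step, uint64_t limit,
                                  unsigned ncontrols)
{
  assert (ncontrols >= 1 && step <= limit * ncontrols);
  std::vector<uint64_t> lens;
  uint64_t remain = step;
  for (unsigned i = 0; i < ncontrols; ++i)
    {
      bool last = (i + 1 == ncontrols);
      uint64_t len = last ? remain : std::min (remain, limit);
      lens.push_back (len);
      remain -= len;
    }
  return lens;
}
```

With a step of 10, a limit of 4 and four controls, the lengths come out as 4, 4, 2 and 0, and they always sum to the step.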
 
 
gcc/ChangeLog:
 
* tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Add 
decrement IV support.
(vect_adjust_loop_lens_control): Ditto.
(vect_set_loop_condition_partial_vectors): Ditto.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): New variable.
* tree-vectorizer.h (LOOP_VINFO_USING_DECREMENTING_IV_P): New macro.
(LOOP_VINFO_DECREMENTING_IV_STEP): New macro.
 
---
gcc/tree-vect-loop-manip.cc | 179 +---
gcc/tree-vect-loop.cc   |  13 +++
gcc/tree-vectorizer.h   |  12 +++
3 files changed, 193 insertions(+), 11 deletions(-)
 
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index ff6159e08d5..3a872668f89 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -468,6 +468,38 @@ vect_set_loop_controls_directly (class loop *loop, 
loop_vec_info loop_vinfo,
   gimple_stmt_iterator incr_gsi;
   bool insert_after;
   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
+{
+  /* single rgroup:
+ ...
+ _10 = (unsigned long) count_12(D);
+ ...
+ # ivtmp_9 = PHI 
+ _36 = MIN_EXPR ;
+ ...
+ vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
+ ...
+ ivtmp_35 = ivtmp_9 - _36;
+ ...
+ if (ivtmp_35 != 0)
+goto ; [83.33%]
+ else
+goto ; [16.67%]
+  */
+  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
+  tree step = rgc->controls.length () == 1 ? rgc->controls[0]
+: make_ssa_name (iv_type);
+  /* Create decrement IV.  */
+  create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, &incr_gsi,
+ insert_after, &index_before_incr, &index_after_incr);
+  gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
+ index_before_incr,
+ nitems_step));
+  LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo) = step;
+  return index_after_incr;
+}
+
+  /* Create increment IV.  */
   create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
 loop, &incr_gsi, insert_after, &index_before_incr,
 &index_after_incr);
@@ -683,6 +715,63 @@ vect_set_loop_controls_directly (class loop *loop, 
loop_vec_info loop_vinfo,
   return next_ctrl;
}
+/* Try to use adjust loop lens for multiple-rgroups.
+
+ _36 = MIN_EXPR ;
+
+ First length (MIN (X, VF/N)):
+   loop_len_15 = MIN_EXPR <_36, VF/N>;
+
+ Second length:
+   tmp = _36 - loop_len_15;
+   loop_len_16 = MIN (tmp, VF/N);
+
+ Third length:
+   tmp2 = tmp - loop_len_16;
+   loop_len_17 = MIN (tmp2, VF/N);
+
+ Last length:
+   loop_len_18 = tmp2 - loop_len_17;
+*/
+
+static void
+vect_adjust_loop_lens_control (tree iv_type, gimple_seq *seq,
+rgroup_controls *dest_rgm, tree step)
+{
+  tree ctrl_type = dest_rgm->type;
+  poly_uint64 nitems_per_ctrl
+= TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
+  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
+
+  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
+{
+  tree ctrl = dest_rgm->controls[i];
+  if (i == 0)
+ {
+   /* First iteration: MIN (X, VF/N) capped to the range [0, VF/N].  */
+   gassign *assign
+ = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
+   gimple_seq_add_stmt (seq, assign);
+ }
+  else if (i == dest_rgm->controls.length () - 1)
+ {
+   /* Last iteration: Remain capped to the range [0, VF/N].  */
+   gassign *assign =

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Hi, Richard. I have sent V13:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619475.html 
It looks more reasonable now.
Could you review it again?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 22:01
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
>>> In other words, why is this different from what
>>>vect_set_loop_controls_directly would do?
> Oh, I see.  You are asking why I do not do the multiple-rgroup vec_pack_trunc
> handling inside "vect_set_loop_controls_directly".
>
> Well, frankly, I just replicated the handling for ARM SVE:
> unsigned int nmasks = i + 1;
> if (use_masks_p && (nmasks & 1) == 0)
>   {
> rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> if (!half_rgc->controls.is_empty ()
> && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
>   continue;
>   }
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>in SRC_RGM, given that the former has twice as many masks as the
>latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>   && (icode1 = optab_handler (vec_unpacku_hi_optab,
>   src_mode)) != CODE_FOR_nothing
>   && (icode2 = optab_handler (vec_unpacku_lo_optab,
>   src_mode)) != CODE_FOR_nothing)
> {
>   /* Unpacking the source masks gives at least as many mask bits as
>  we need.  We can then VIEW_CONVERT any excess bits away.  */
>   machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>   gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>   tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> ? VEC_UNPACK_HI_EXPR
> : VEC_UNPACK_LO_EXPR);
>   gassign *stmt;
>   if (dest_masktype == unpack_masktype)
> stmt = gimple_build_assign (dest, code, src);
>   else
> {
>   tree temp = make_ssa_name (unpack_masktype);
>   stmt = gimple_build_assign (temp, code, src);
>   gimple_seq_add_stmt (seq, stmt);
>   stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
>   build1 (VIEW_CONVERT_EXPR,
>   dest_masktype, temp));
> }
>   gimple_seq_add_stmt (seq, stmt);
> }
>   return true;
> }
>   vec_perm_indices indices[2];
>   if (dest_masktype == src_masktype
>   && interleave_supported_p (&indices[0], src_masktype, 0)
>   && interleave_supported_p (&indices[1], src_masktype, 1))
> {
>   /* The destination requires twice as many mask bits as the source, so
>  we can use interleaving permutes to double up the number of bits.  */
>   tree masks[2];
>   for (unsigned int i = 0; i < 2; ++i)
> masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
>   src, src, masks[i & 1]);
>   gimple_seq_add_stmt (seq, stmt);
> }
>   return true;
> }
>   return false;
> }
>
> I know this is just an optimization for ARM SVE for the case where sub_rgc
> (int16) is half the size of rgc (int8).
> But when I copied the code from ARM SVE and generalized it to all cases
> (int8 <-> int64),
> they all worked well and the codegen was good.
>
> If you don't like this approach, would you mind giving me some suggestions?
 
It's not a case of disliking one approach or disliking another.
There are two separ

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Oh, I just realized that the flow you designed works well for vec_pack_trunc too.
I will send a V13 patch soon.

Thanks.



juzhe.zh...@rivai.ai
 
From: 钟居哲
Date: 2023-05-24 22:10
To: richard.sandiford
CC: gcc-patches; rguenther
Subject: Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
>> Both approaches are fine.  I'm not against one or the other.

>> What I didn't understand was why your patch only reuses existing IVs
>> for max_nscalars_per_iter == 1.  Was it to avoid having to do a
>> multiplication (well, really a shift left) when moving from one
>> rgroup to another?  E.g. if one rgroup had:

>>   nscalars_per_iter == 2 && factor == 1

>> and another had:

>>   nscalars_per_iter == 4 && factor == 1

>> then we would need to multiply by 2 when going from the first rgroup
>> to the second.

>> If so, avoiding a multiplication seems like a good reason for the choice
>> you were making in the patch.  But we then need to check
>> max_nscalars_per_iter == 1 for both the source rgroup and the
>> destination rgroup, not just the destination.  And I think the
>> condition for “no multiplication needed” should be that:

Oh, I didn't realize the problem was so complicated. Frankly, I didn't
understand rgroups well. Sorry about that :).

I just remember that last time you said I need to handle multiple rgroups
not only for SLP but also for non-SLP (the vec_pack_trunc case that I tested).
Then I asked you how to identify non-SLP, and you said max_nscalars_per_iter == 1.
So I used max_nscalars_per_iter == 1 here (I didn't really understand it very
well; I just added it as you said).

Actually, I just want to handle multiple rgroups for non-SLP here. I am trying to
avoid multiplication and I think
scalar multiplication (which does not cost too much) is fine on modern CPUs.

So, how do you suggest I handle multiple rgroups for non-SLP?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 22:01
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
>>> In other words, why is this different from what
>>>vect_set_loop_controls_directly would do?
> Oh, I see.  You are asking why I do not do the multiple-rgroup vec_pack_trunc
> handling inside "vect_set_loop_controls_directly".
>
> Well, frankly, I just replicated the handling for ARM SVE:
> unsigned int nmasks = i + 1;
> if (use_masks_p && (nmasks & 1) == 0)
>   {
> rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> if (!half_rgc->controls.is_empty ()
> && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
>   continue;
>   }
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>in SRC_RGM, given that the former has twice as many masks as the
>latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>   && (icode1 = optab_handler (vec_unpacku_hi_optab,
>   src_mode)) != CODE_FOR_nothing
>   && (icode2 = optab_handler (vec_unpacku_lo_optab,
>   src_mode)) != CODE_FOR_nothing)
> {
>   /* Unpacking the source masks gives at least as many mask bits as
>  we need.  We can then VIEW_CONVERT any excess bits away.  */
>   machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>   gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>   tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> ? VEC_UNPACK_HI_EXPR
> : VEC_UNPACK_LO_EXPR);
>   gassign *stmt;
>   if (dest_masktype == unpack_masktype)
> stmt = gimple_build_assign (dest, code, src);
>   else
> {
>   tree temp = make_ssa_name (unpack_masktype);
>   stmt = gimple_build_assign (temp, code, src);
>   gimple_seq_add_stmt

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

>> Actually, I just want to handle multiple rgroups for non-SLP here. I am trying
>> to avoid multiplication and I think
>> scalar multiplication (which does not cost too much) is fine on modern CPUs.
Sorry for the mistake above. I am not trying to avoid multiplication; I think
multiplication is fine.


juzhe.zh...@rivai.ai
 
From: 钟居哲
Date: 2023-05-24 22:10
To: richard.sandiford
CC: gcc-patches; rguenther
Subject: Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
>> Both approaches are fine.  I'm not against one or the other.

>> What I didn't understand was why your patch only reuses existing IVs
>> for max_nscalars_per_iter == 1.  Was it to avoid having to do a
>> multiplication (well, really a shift left) when moving from one
>> rgroup to another?  E.g. if one rgroup had:

>>   nscalars_per_iter == 2 && factor == 1

>> and another had:

>>   nscalars_per_iter == 4 && factor == 1

>> then we would need to multiply by 2 when going from the first rgroup
>> to the second.

>> If so, avoiding a multiplication seems like a good reason for the choice
>> you were making in the patch.  But we then need to check
>> max_nscalars_per_iter == 1 for both the source rgroup and the
>> destination rgroup, not just the destination.  And I think the
>> condition for “no multiplication needed” should be that:

Oh, I didn't realize the problem was so complicated. Frankly, I didn't
understand rgroups well. Sorry about that :).

I just remember that last time you said I need to handle multiple rgroups
not only for SLP but also for non-SLP (the vec_pack_trunc case that I tested).
Then I asked you how to identify non-SLP, and you said max_nscalars_per_iter == 1.
So I used max_nscalars_per_iter == 1 here (I didn't really understand it very
well; I just added it as you said).

Actually, I just want to handle multiple rgroups for non-SLP here. I am trying to
avoid multiplication and I think
scalar multiplication (which does not cost too much) is fine on modern CPUs.

So, how do you suggest I handle multiple rgroups for non-SLP?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 22:01
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
>>> In other words, why is this different from what
>>>vect_set_loop_controls_directly would do?
> Oh, I see.  You are asking why I do not do the multiple-rgroup vec_pack_trunc
> handling inside "vect_set_loop_controls_directly".
>
> Well, frankly, I just replicated the handling for ARM SVE:
> unsigned int nmasks = i + 1;
> if (use_masks_p && (nmasks & 1) == 0)
>   {
> rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> if (!half_rgc->controls.is_empty ()
> && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
>   continue;
>   }
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>in SRC_RGM, given that the former has twice as many masks as the
>latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>   && (icode1 = optab_handler (vec_unpacku_hi_optab,
>   src_mode)) != CODE_FOR_nothing
>   && (icode2 = optab_handler (vec_unpacku_lo_optab,
>   src_mode)) != CODE_FOR_nothing)
> {
>   /* Unpacking the source masks gives at least as many mask bits as
>  we need.  We can then VIEW_CONVERT any excess bits away.  */
>   machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>   gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>   tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> ? VEC_UNPACK_HI_EXPR
> : VEC_UNPACK_LO_EXPR);
>   gassign *stmt;
>   if (dest_masktype == unpack_masktype)
> stmt = gimple_build_assig

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

>> Both approaches are fine.  I'm not against one or the other.

>> What I didn't understand was why your patch only reuses existing IVs
>> for max_nscalars_per_iter == 1.  Was it to avoid having to do a
>> multiplication (well, really a shift left) when moving from one
>> rgroup to another?  E.g. if one rgroup had:

>>   nscalars_per_iter == 2 && factor == 1

>> and another had:

>>   nscalars_per_iter == 4 && factor == 1

>> then we would need to multiply by 2 when going from the first rgroup
>> to the second.

>> If so, avoiding a multiplication seems like a good reason for the choice
>> you were making in the patch.  But we then need to check
>> max_nscalars_per_iter == 1 for both the source rgroup and the
>> destination rgroup, not just the destination.  And I think the
>> condition for “no multiplication needed” should be that:

Oh, I didn't realize the problem was so complicated. Frankly, I didn't
understand rgroups well. Sorry about that :).

I just remember that last time you said I need to handle multiple rgroups
not only for SLP but also for non-SLP (the vec_pack_trunc case that I tested).
Then I asked you how to identify non-SLP, and you said max_nscalars_per_iter == 1.
So I used max_nscalars_per_iter == 1 here (I didn't really understand it very
well; I just added it as you said).

Actually, I just want to handle multiple rgroups for non-SLP here. I am trying to
avoid multiplication and I think
scalar multiplication (which does not cost too much) is fine on modern CPUs.

So, how do you suggest I handle multiple rgroups for non-SLP?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 22:01
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
>>> In other words, why is this different from what
>>>vect_set_loop_controls_directly would do?
> Oh, I see.  You are asking why I do not do the multiple-rgroup vec_pack_trunc
> handling inside "vect_set_loop_controls_directly".
>
> Well, frankly, I just replicated the handling for ARM SVE:
> unsigned int nmasks = i + 1;
> if (use_masks_p && (nmasks & 1) == 0)
>   {
> rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> if (!half_rgc->controls.is_empty ()
> && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
>   continue;
>   }
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>in SRC_RGM, given that the former has twice as many masks as the
>latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>   && (icode1 = optab_handler (vec_unpacku_hi_optab,
>   src_mode)) != CODE_FOR_nothing
>   && (icode2 = optab_handler (vec_unpacku_lo_optab,
>   src_mode)) != CODE_FOR_nothing)
> {
>   /* Unpacking the source masks gives at least as many mask bits as
>  we need.  We can then VIEW_CONVERT any excess bits away.  */
>   machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>   gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>   tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> ? VEC_UNPACK_HI_EXPR
> : VEC_UNPACK_LO_EXPR);
>   gassign *stmt;
>   if (dest_masktype == unpack_masktype)
> stmt = gimple_build_assign (dest, code, src);
>   else
> {
>   tree temp = make_ssa_name (unpack_masktype);
>   stmt = gimple_build_assign (temp, code, src);
>   gimple_seq_add_stmt (seq, stmt);
>   stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
>   build1 (VIEW_CONVERT_EXPR,
>   dest_masktype, temp));
> }
>   gimple_seq_add_stmt (seq, stmt);
> }
>   retu

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

OK, thanks. I am going to refine the patch following Richard's idea and test it.
Thanks to both Richard and Richi.

juzhe.zh...@rivai.ai

From: Richard Biener
Date: 2023-05-24 20:51
To: Richard Sandiford
CC: 钟居哲; gcc-patches
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
On Wed, 24 May 2023, Richard Sandiford wrote:

> Sorry, I realised later that I had an implicit assumption here:
> if there are multiple rgroups, it's better to have a single IV
> for the smallest rgroup and scale that up to bigger rgroups.
> 
> E.g. if the loop control IV is taken from an N-control rgroup
> and has a step S, an N*M-control rgroup would be based on M*S.
> 
> Of course, it's also OK to create multiple IVs if you prefer.
> It's just a question of which approach gives the best output
> in practice.

One thing to check is whether IVOPTs is ever able to eliminate
one such IV using another.  You can then also check whether
when presented with a single IV it already considers the
others you can create as candidates so you get the optimal
selection in the end.

Richard.

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

>> In other words, why is this different from what
>>vect_set_loop_controls_directly would do?
Oh, I see.  You are asking why I do not do the multiple-rgroup vec_pack_trunc
handling inside "vect_set_loop_controls_directly".

Well, frankly, I just replicated the handling for ARM SVE:
unsigned int nmasks = i + 1;
if (use_masks_p && (nmasks & 1) == 0)
  {
    rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
    if (!half_rgc->controls.is_empty ()
        && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
      continue;
  }

/* Try to use permutes to define the masks in DEST_RGM using the masks
   in SRC_RGM, given that the former has twice as many masks as the
   latter.  Return true on success, adding any new statements to SEQ.  */

static bool
vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
                               rgroup_controls *src_rgm)
{
  tree src_masktype = src_rgm->type;
  tree dest_masktype = dest_rgm->type;
  machine_mode src_mode = TYPE_MODE (src_masktype);
  insn_code icode1, icode2;
  if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
      && (icode1 = optab_handler (vec_unpacku_hi_optab,
                                  src_mode)) != CODE_FOR_nothing
      && (icode2 = optab_handler (vec_unpacku_lo_optab,
                                  src_mode)) != CODE_FOR_nothing)
    {
      /* Unpacking the source masks gives at least as many mask bits as
         we need.  We can then VIEW_CONVERT any excess bits away.  */
      machine_mode dest_mode = insn_data[icode1].operand[0].mode;
      gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
      tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
      for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
        {
          tree src = src_rgm->controls[i / 2];
          tree dest = dest_rgm->controls[i];
          tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
                            ? VEC_UNPACK_HI_EXPR
                            : VEC_UNPACK_LO_EXPR);
          gassign *stmt;
          if (dest_masktype == unpack_masktype)
            stmt = gimple_build_assign (dest, code, src);
          else
            {
              tree temp = make_ssa_name (unpack_masktype);
              stmt = gimple_build_assign (temp, code, src);
              gimple_seq_add_stmt (seq, stmt);
              stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
                                          build1 (VIEW_CONVERT_EXPR,
                                                  dest_masktype, temp));
            }
          gimple_seq_add_stmt (seq, stmt);
        }
      return true;
    }
  vec_perm_indices indices[2];
  if (dest_masktype == src_masktype
      && interleave_supported_p (&indices[0], src_masktype, 0)
      && interleave_supported_p (&indices[1], src_masktype, 1))
    {
      /* The destination requires twice as many mask bits as the source, so
         we can use interleaving permutes to double up the number of bits.  */
      tree masks[2];
      for (unsigned int i = 0; i < 2; ++i)
        masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
      for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
        {
          tree src = src_rgm->controls[i / 2];
          tree dest = dest_rgm->controls[i];
          gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
                                              src, src, masks[i & 1]);
          gimple_seq_add_stmt (seq, stmt);
        }
      return true;
    }
  return false;
}

I know this is just an optimization for ARM SVE for the case where sub_rgc
(int16) is half the size of rgc (int8).
But when I copied the code from ARM SVE and generalized it to all cases
(int8 <-> int64),
they all worked well and the codegen was good.

If you don't like this approach, would you mind giving me some suggestions?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 20:41
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
Sorry, I realised later that I had an implicit assumption here:
if there are multiple rgroups, it's better to have a single IV
for the smallest rgroup and scale that up to bigger rgroups.
 
E.g. if the loop control IV is taken from an N-control rgroup
and has a step S, an N*M-control rgroup would be based on M*S.
 
Of course, it's also OK to create multiple IVs if you prefer.
It's just a question of which approach gives the best output
in practice.
 
Another way of going from an N-control rgroup ("G1") to an N*M-control
rgroup ("G2") would be to reuse all N controls from G1.  E.g. the
first M controls in G2 would come from G1[0], the next M from
G1[1], etc.  That might low

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Hi, Richard.
For step 1, I have written this patch. Could you take a look at it?

Thanks.



juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 19:23
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
Sorry for the slow review.  I needed some time to go through this
patch and surrounding code to understand it, and to understand
why it wasn't structured the way I was expecting.
 
I've got some specific comments below, and then a general comment
about how I think we should structure this.
 
juzhe.zh...@rivai.ai writes:
> From: Ju-Zhe Zhong 
>
> gcc/ChangeLog:
>
> * tree-vect-loop-manip.cc (vect_adjust_loop_lens_control): New 
> function.
> (vect_set_loop_controls_directly): Add decrement IV support.
> (vect_set_loop_condition_partial_vectors): Ditto.
> * tree-vect-loop.cc: Ditto.
> * tree-vectorizer.h (LOOP_VINFO_USING_DECREMENTING_IV_P): New macro.
>
> ---
>  gcc/tree-vect-loop-manip.cc | 184 +++-
>  gcc/tree-vect-loop.cc   |  10 ++
>  gcc/tree-vectorizer.h   |   8 ++
>  3 files changed, 199 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index ff6159e08d5..94b38d1e0fb 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -385,6 +385,66 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, 
> rgroup_controls *dest_rgm,
>return false;
>  }
>  
> +/* Try to use adjust loop lens for non-SLP multiple-rgroups.
> +
> + _36 = MIN_EXPR ;
> +
> + First length (MIN (X, VF/N)):
> +   loop_len_15 = MIN_EXPR <_36, VF/N>;
> +
> + Second length:
> +   tmp = _36 - loop_len_15;
> +   loop_len_16 = MIN (tmp, VF/N);
> +
> + Third length:
> +   tmp2 = tmp - loop_len_16;
> +   loop_len_17 = MIN (tmp2, VF/N);
> +
> + Last length:
> +   loop_len_18 = tmp2 - loop_len_17;
> +*/
> +
> +static void
> +vect_adjust_loop_lens_control (tree iv_type, gimple_seq *seq,
> +rgroup_controls *dest_rgm,
> +rgroup_controls *src_rgm, tree step)
> +{
> +  tree ctrl_type = dest_rgm->type;
> +  poly_uint64 nitems_per_ctrl
> += TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
> +  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
> +
> +  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> +{
> +  if (!step)
> + step = src_rgm->controls[i / dest_rgm->controls.length ()];
 
Could you explain this index?  It looks like it will always be 0
due to the range of i.
 
Since this is the only use of src_rgm, it might be cleaner to drop
src_rgm and only pass the step.
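
A minimal sketch of the observation about the index, assuming only that the loop runs i from 0 to dest_rgm->controls.length () - 1 as in the quoted hunk. The helper name source_indices and the parameter dest_len are hypothetical and are not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for each destination control i, return the source
// index selected by the quoted expression `i / dest_len`.
std::vector<std::size_t> source_indices(std::size_t dest_len) {
  assert(dest_len > 0);
  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < dest_len; ++i) {
    // i is always smaller than dest_len, so unsigned division truncates to 0.
    result.push_back(i / dest_len);
  }
  return result;
}
```

Every entry is 0 for any nonzero length, which is why passing only the step would carry the same information as passing src_rgm.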
 
> +  tree ctrl = dest_rgm->controls[i];
> +  if (i == 0)
> + {
> +   /* First iteration: MIN (X, VF/N) capped to the range [0, VF/N].  */
> +   gassign *assign
> + = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
> +   gimple_seq_add_stmt (seq, assign);
> + }
> +  else if (i == dest_rgm->controls.length () - 1)
> + {
> +   /* Last iteration: Remain capped to the range [0, VF/N].  */
> +   gassign *assign = gimple_build_assign (ctrl, MINUS_EXPR, step,
> + dest_rgm->controls[i - 1]);
> +   gimple_seq_add_stmt (seq, assign);
> + }
> +  else
> + {
> +   /* (MIN (remain, VF*I/N)) capped to the range [0, VF/N].  */
> +   step = gimple_build (seq, MINUS_EXPR, iv_type, step,
> +dest_rgm->controls[i - 1]);
> +   gassign *assign
> + = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
> +   gimple_seq_add_stmt (seq, assign);
> + }
> +}
> +}
 
Not your fault, but the structure seems kind-of awkward, since
it's really a MINUS_EXPR for i != 0 followed by a MIN_EXPR for i != last.
But I agree that it probably has to be written as above given that
the final destination is fixed in advance.
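
A minimal sketch of the arithmetic described above, written on plain integers rather than gimple. It assumes the total number of active items never exceeds limit * n_controls; the names split_lengths, total and limit are hypothetical and only model the quoted function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split `total` active items across `n_controls` length controls, each of
// which can cover at most `limit` items.
// Assumption: total <= limit * n_controls.
std::vector<std::uint64_t> split_lengths(std::uint64_t total,
                                         std::uint64_t limit,
                                         std::size_t n_controls) {
  assert(n_controls > 0);
  std::vector<std::uint64_t> lens(n_controls);
  std::uint64_t remaining = total;
  for (std::size_t i = 0; i < n_controls; ++i) {
    if (i != 0)
      remaining -= lens[i - 1];                // MINUS_EXPR for i != 0
    if (i != n_controls - 1)
      lens[i] = std::min(remaining, limit);    // MIN_EXPR for i != last
    else
      lens[i] = remaining;                     // last control takes the rest
  }
  return lens;
}
```

For example, 10 items with a limit of 4 per control and 4 controls gives lengths 4, 4, 2 and 0.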
 
> +
>  /* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
> for all the rgroup controls in RGC and return a control that is nonzero
> when the loop needs to iterate.  Add any new preheader statements to
> @@ -468,9 +528,78 @@ vect_set_loop_controls_directly (class loop *loop, 
> loop_vec_info loop_vinfo,
>gimple_stmt_iterator incr_gsi;
>bool insert_after;
>standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> -  create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
> -  loop, &incr_gsi, insert_after, &index_before_incr,
> -  &index_after_incr);
> +  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
> +{
> +  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
> +  tree step = make_ssa_name (iv_type);
> +  /* Create decrement IV.  */
> +  create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, &incr_gsi,
> + insert_after, &index_before_incr, &index_after_incr);
> +  tree temp = gimple_build (header_seq, MIN_EXPR, iv_type,
> + index_before_incr, nitems_step);
> +  gimple_seq_add_stmt (header_seq,

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲

Hi, Richard.  This is quite complicated for me, and I am not sure whether I can
keep up with you.
So I would rather split the work and implement the decrement IV step by step.

For the first step you mentioned:

>> (1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
>>
>>   FOR_EACH_VEC_ELT (*controls, i, rgc)
>>     if (!rgc->controls.is_empty ())
>>
>> call vect_set_loop_controls_directly.  That is:
>>
>>     /* See whether zero-based IV would ever generate all-false masks
>>        or zero length before wrapping around.  */
>>     bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>>
>>     /* Set up all controls for this group.  */
>>     test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
>>                                                  &preheader_seq,
>>                                                  &header_seq,
>>                                                  loop_cond_gsi, rgc,
>>                                                  niters, niters_skip,
>>                                                  might_wrap_p);
>>
>> needs to be an "if" that (for LOOP_VINFO_USING_DECREMENTING_IV_P)
>> is only executed on the first iteration.

Is it correct like this?

  FOR_EACH_VEC_ELT (*controls, i, rgc)
    if (!rgc->controls.is_empty ())
      {
        /* First try using permutes.  This adds a single vector
           instruction to the loop for each mask, but needs no extra
           loop invariants or IVs.  */
        unsigned int nmasks = i + 1;
        if (use_masks_p && (nmasks & 1) == 0)
          {
            rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
            if (!half_rgc->controls.is_empty ()
                && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
              continue;
          }

        /* See whether zero-based IV would ever generate all-false masks
           or zero length before wrapping around.  */
        bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);

        /* Set up all controls for this group.  */
        test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
                                                     &preheader_seq,
                                                     &header_seq,
                                                     loop_cond_gsi, rgc,
                                                     niters, niters_skip,
                                                     might_wrap_p);

        /* With a decrementing IV, run vect_set_loop_controls_directly
           only once.  */
        if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
          break;
      }

Thanks.


juzhe.zh...@rivai.ai
 

Re: Re: [PATCH V2] RISC-V: Add RVV comparison autovectorization

2023-05-23 Thread 钟居哲

Hi, Robin.

>> Don't you want to use your shiny new operand passing style here as
>> with the other expanders?
Hmm, I did this simply to follow the ARM code style.
As you can see, I pass rtx[] for expand_vcond and rtx, rtx, rtx for
expand_vec_cmp.
I am just following the ARM SVE implementation (you can check aarch64-sve.md; we
do the same)  :)
If you don't like it, could you give me more details? Then I will change it.

>> I don't think we need the same comment in each of these.  Same for
>> /*DEST_MODE*/ and /*MASK_MODE*/ which would be redundant if data_mode
>> were called dest_mode.
Ok

>> Swap lt and gt here for consistency's sake.
Ok.

I have fixed these as you suggested.
Would you mind reviewing the V3 patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619324.html 

Thanks.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-05-23 22:12
To: juzhe.zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw; Richard 
Sandiford
Subject: Re: [PATCH V2] RISC-V: Add RVV comparison autovectorization
> +(define_expand "vec_cmp"
> +  [(set (match_operand: 0 "register_operand")
> + (match_operator: 1 "comparison_operator"
> +   [(match_operand:VI 2 "register_operand")
> +(match_operand:VI 3 "register_operand")]))]
> +  "TARGET_VECTOR"
> +  {
> +riscv_vector::expand_vec_cmp (operands[0], GET_CODE (operands[1]),
> +   operands[2], operands[3]);
> +DONE;
> +  }
> +)
> +
> +(define_expand "vec_cmpu"
> +  [(set (match_operand: 0 "register_operand")
> + (match_operator: 1 "comparison_operator"
> +   [(match_operand:VI 2 "register_operand")
> +(match_operand:VI 3 "register_operand")]))]
> +  "TARGET_VECTOR"
> +  {
> +riscv_vector::expand_vec_cmp (operands[0], GET_CODE (operands[1]),
> +   operands[2], operands[3]);
> +DONE;
> +  }
> +)
> +
> +(define_expand "vec_cmp"
> +  [(set (match_operand: 0 "register_operand")
> + (match_operator: 1 "comparison_operator"
> +   [(match_operand:VF 2 "register_operand")
> +(match_operand:VF 3 "register_operand")]))]
> +  "TARGET_VECTOR"
> +  {
> +riscv_vector::expand_vec_cmp_float (operands[0], GET_CODE (operands[1]),
> + operands[2], operands[3], false);
> +DONE;
> +  }
> +)
 
Don't you want to use your shiny new operand passing style here as
with the other expanders?
 
> +  /* We have a maximum of 11 operands for RVV instruction patterns according to
> +   * vector.md.  */
> +  insn_expander<11> e (/*OP_NUM*/ op_num, /*HAS_DEST_P*/ true,
> +/*FULLY_UNMASKED_P*/ false,
> +/*USE_REAL_MERGE_P*/ false, /*HAS_AVL_P*/ true,
> +/*VLMAX_P*/ true,
> +/*DEST_MODE*/ data_mode, /*MASK_MODE*/ mask_mode);
> +  e.set_policy (TAIL_ANY);
> +  e.emit_insn ((enum insn_code) icode, ops);
> +}
 
I don't think we need the same comment in each of these.  Same for
/*DEST_MODE*/ and /*MASK_MODE*/ which would be redundant if data_mode
were called dest_mode.
> +/* Expand an RVV comparison.  */
> +
> +void
> +expand_vec_cmp (rtx target, rtx_code code, rtx op0, rtx op1)
> +{
> +  machine_mode mask_mode = GET_MODE (target);
> +  machine_mode data_mode = GET_MODE (op0);
> +  insn_code icode = get_cmp_insn_code (code, data_mode);
> +
> +  if (code == LTGT)
> +{
> +  rtx gt = gen_reg_rtx (mask_mode);
> +  rtx lt = gen_reg_rtx (mask_mode);
> +  expand_vec_cmp (gt, GT, op0, op1);
> +  expand_vec_cmp (lt, LT, op0, op1);
> +  icode = code_for_pred (IOR, mask_mode);
> +  rtx ops[3] = {target, gt, lt};
> +  emit_vlmax_insn (icode, riscv_vector::RVV_BINOP, ops);
> +  return;
> +}
 
Swap lt and gt here for consistency's sake.
 
Regards
Robin

Re: Re: [PATCH] RISC-V: Implement autovec abs, vneg, vnot.

2023-05-19 Thread 钟居哲

>> What about the rest of the changes? It's not all typos but I tried
>> to unify the mask/policy handling a bit.
Oh, I see.  You renamed get_prefer to get_preferred.
This makes perfect sense to me.




juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-05-19 20:07
To: 钟居哲; gcc-patches; kito.cheng; palmer; Michael Collison; Jeff Law
CC: rdapp.gcc
Subject: Re: [PATCH] RISC-V: Implement autovec abs, vneg, vnot.
>>> +  TAIL_UNDEFINED = -1,
>>> +  MASK_UNDEFINED = -1,
> Why did you add this?
> 
>>> +  void add_policy_operands (enum tail_policy vta = TAIL_UNDEFINED,
>>> + enum mask_policy vma = MASK_UNDEFINED)
> No, you should just specify TAIL_ANY or MASK_ANY as the default value.
 
That's the value I intended for "unspecified" i.e. the caller
didn't specify and then set it to the default.  _ANY can work as
well I guess.
 
> 
>>>const_vlmax_p (machine_mode mode)
>>>{
>>>-  poly_uint64 nuints = GET_MODE_NUNITS (mode);
>>>+  poly_uint64 nunits = GET_MODE_NUNITS (mode);
>>>-  return nuints.is_constant ()
>>>+  return nunits.is_constant ()
>>> /* The vsetivli can only hold register 0~31.  */
>>>-? (IN_RANGE (nuints.to_constant (), 0, 31))
>>>+? (IN_RANGE (nunits.to_constant (), 0, 31))
>>> /* Only allowed in VLS-VLMAX mode.  */
>>> : false;
>>>}
> Is this a meaningless change?
 
Typo.
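
A minimal sketch of the check performed by the quoted const_vlmax_p, assuming the AVL immediate of vsetivli is an unsigned 5-bit field (0 to 31). The name fits_vsetivli_imm is hypothetical, and std::optional stands in for a poly_uint64 that may or may not be a compile-time constant.

```cpp
#include <cstdint>
#include <optional>

// `nunits` is std::nullopt when the element count is not a compile-time
// constant (the variable-length case).  Only constant counts that fit the
// 5-bit immediate can be encoded directly in vsetivli.
bool fits_vsetivli_imm(std::optional<std::uint64_t> nunits) {
  return nunits.has_value() && *nunits <= 31;
}
```

Under this model a 16-element mode qualifies, while a 32-element mode or a length-agnostic mode does not.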
 
> 
>>>/* For the instruction that doesn't require TA, we still need a default 
>>> value
>>>  to emit vsetvl. We pick up the default value according to prefer 
>>> policy. */
>>>-  return (bool) (get_prefer_tail_policy () & 0x1
>>>- || (get_prefer_tail_policy () >> 1 & 0x1));
>>>+  return (bool) (get_preferred_tail_policy () & 0x1
>>>+ || (get_preferred_tail_policy () >> 1 & 0x1));
>>>}
>>>/* Get default mask policy.  */
>>>@@ -576,8 +576,8 @@ get_default_ma ()
>>>{
>>>   /* For the instruction that doesn't require MA, we still need a 
>>> default value
>>>  to emit vsetvl. We pick up the default value according to prefer 
>>> policy. */
>>>-  return (bool) (get_prefer_mask_policy () & 0x1
>>>- || (get_prefer_mask_policy () >> 1 & 0x1));
>>>+  return (bool) (get_preferred_mask_policy () & 0x1
>>>+ || (get_preferred_mask_policy () >> 1 & 0x1));
> Why did you change it?
 
Typo/grammar imho.
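
A minimal sketch of what the quoted return expression evaluates to. It assumes the policy encoding UNDISTURBED = 0, AGNOSTIC = 1, ANY = 2; that encoding is an assumption here and is not shown in the quoted hunk. The name default_is_agnostic is hypothetical.

```cpp
// Assumed encoding (not shown in the quoted diff).
enum Policy { kUndisturbed = 0, kAgnostic = 1, kAny = 2 };

// Mirrors `(p & 0x1 || (p >> 1 & 0x1))`.  In C++, `>>` binds tighter than
// `&`, so `p >> 1 & 0x1` means `(p >> 1) & 0x1`.
bool default_is_agnostic(int policy) {
  return (policy & 0x1) || ((policy >> 1) & 0x1);
}
```

With the assumed encoding, both AGNOSTIC and ANY map to true, and only UNDISTURBED maps to false.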
 
What about the rest of the changes? It's not all typos but I tried
to unify the mask/policy handling a bit. 
 
> You are using a comparison helper; I added a similar one in my downstream tree
> while working on the comparison autovec patterns.
> 
> I think you can unify my code with yours:
 
I wasn't aware that I'm only using one of several helpers, I just refactored
what is upstream.  Yes, your code looks reasonable and it surely works
with the patch without much rework.
 
> I am almost done with all the comparison autovec patterns and will send them
> soon, after testing.
 
Good, looking forward to it.
 
Regards
Robin

Re: [PATCH] RISC-V: Implement autovec abs, vneg, vnot.

2023-05-19 Thread 钟居哲

>> +  TAIL_UNDEFINED = -1,
>> +  MASK_UNDEFINED = -1,
Why did you add this?

>> +  void add_policy_operands (enum tail_policy vta = TAIL_UNDEFINED,
>> + enum mask_policy vma = MASK_UNDEFINED)
No, you should just specify TAIL_ANY or MASK_ANY as the default value.

>>const_vlmax_p (machine_mode mode)
>>{
>>-  poly_uint64 nuints = GET_MODE_NUNITS (mode);
>>+  poly_uint64 nunits = GET_MODE_NUNITS (mode);
>>-  return nuints.is_constant ()
>>+  return nunits.is_constant ()
>> /* The vsetivli can only hold register 0~31.  */
>>-? (IN_RANGE (nuints.to_constant (), 0, 31))
>>+? (IN_RANGE (nunits.to_constant (), 0, 31))
>> /* Only allowed in VLS-VLMAX mode.  */
>> : false;
>>}
Is this a meaningless change?

>>/* For the instruction that doesn't require TA, we still need a default 
>> value
>>  to emit vsetvl. We pick up the default value according to prefer 
>> policy. */
>>-  return (bool) (get_prefer_tail_policy () & 0x1
>>- || (get_prefer_tail_policy () >> 1 & 0x1));
>>+  return (bool) (get_preferred_tail_policy () & 0x1
>>+ || (get_preferred_tail_policy () >> 1 & 0x1));
>>}
>>/* Get default mask policy.  */
>>@@ -576,8 +576,8 @@ get_default_ma ()
>>{
>>   /* For the instruction that doesn't require MA, we still need a 
>> default value
>>  to emit vsetvl. We pick up the default value according to prefer 
>> policy. */
>>-  return (bool) (get_prefer_mask_policy () & 0x1
>>- || (get_prefer_mask_policy () >> 1 & 0x1));
>>+  return (bool) (get_preferred_mask_policy () & 0x1
>>+ || (get_preferred_mask_policy () >> 1 & 0x1));
Why did you change it?

>>   +/* Emit an RVV comparison.  */
>>   +static void
>>   +emit_pred_cmp (unsigned icode, rtx mask, rtx dest, rtx cmp,
>>   +rtx src1, rtx src2,
>>   +rtx len, machine_mode mask_mode)
>>   +{
>>   +  insn_expander<9> e;
>>   +
>>   +  e.set_dest_and_mask (dest, mask, mask_mode);
>>   +
>>   +  e.add_input_operand (cmp, GET_MODE (cmp));
>>   +
>>   +  e.add_source_operand (src1, GET_MODE (src1));
>>   +  e.add_source_operand (src2, GET_MODE (src2));

You are using a comparison helper; I added a similar one in my downstream tree
while working on the comparison autovec patterns.

I think you can unify my code with yours:

/* Emit an RVV comparison.  If one of SRC1 and SRC2 is a scalar operand, its
   data_mode is specified using SCALAR_MODE.  */
static void
emit_pred_comparison (unsigned icode, rtx_code rcode, rtx mask, rtx dest,
  rtx src1, rtx src2, rtx len, machine_mode mask_mode,
  machine_mode scalar_mode = VOIDmode)
{
  insn_expander<9> e;
  e.set_dest_and_mask (mask, dest, mask_mode);
  machine_mode data_mode = GET_MODE (src1);

  gcc_assert (VECTOR_MODE_P (GET_MODE (src1))
|| VECTOR_MODE_P (GET_MODE (src2)));

  if (!insn_operand_matches ((enum insn_code) icode, e.opno () + 1, src1))
src1 = force_reg (data_mode, src1);
  if (!insn_operand_matches ((enum insn_code) icode, e.opno () + 2, src2))
{
  if (VECTOR_MODE_P (GET_MODE (src2)))
  src2 = force_reg (data_mode, src2);
  else
  src2 = force_reg (scalar_mode, src2);
}
  rtx comparison = gen_rtx_fmt_ee (rcode, mask_mode, src1, src2);
  if (!VECTOR_MODE_P (GET_MODE (src2)))
comparison = gen_rtx_fmt_ee (rcode, mask_mode, src1,
 gen_rtx_VEC_DUPLICATE (data_mode, src2));
  e.add_fixed_operand (comparison);

  e.add_fixed_operand (src1);
  if (CONST_INT_P (src2))
e.add_integer_operand (src2);
  else
e.add_fixed_operand (src2);

  e.set_len_and_policy (len, true, false, true);

  e.expand ((enum insn_code) icode, false);
}

static void
emit_len_comparison (unsigned icode, rtx_code rcode, rtx dest, rtx src1,
 rtx src2, rtx len, machine_mode mask_mode,
 machine_mode scalar_mode)
{
  emit_pred_comparison (icode, rcode, NULL_RTX, dest, src1, src2, len,
  mask_mode, scalar_mode);
}

/* Expand an RVV integer comparison using the RVV equivalent of:

 (set TARGET (CODE OP0 OP1)).  */

void
expand_vec_cmp_int (rtx target, rtx_code code, rtx op0, rtx op1)
{
  machine_mode mask_mode = GET_MODE (target);
  machine_mode data_mode = GET_MODE (op0);
  insn_code icode;
  bool scalar_p = false;

  if (CONST_VECTOR_P (op1))
{
  rtx elt;
  if (const_vec_duplicate_p (op1, &elt))
  op1 = elt;
  scalar_p = true;
}

  switch (code)
{
case LE:
case LEU:
case GT:
case GTU:
  if (scalar_p)
  icode = code_for_pred_cmp_scalar (data_mode);
  else
  icode = code_for_pred_cmp (data_mode);
  break;
case EQ:
case NE:
  if (scalar_p)
  icode = code_for_pred_eqne_scalar (data_mode);
  else
  icode = code_for_pred_cmp (data_mode);
  break;
case LT:
case LTU:
  if (scalar_p)
  icode = code_for_pred_cmp_scalar (data_mode);
  else
  icode = code_for_pred_ltge (data_mode);
  break;
case GE:
case GEU:
  if (scalar_p)
  icode = code_for_pred_ge_scalar (data_mode);
  else
  icode =

Re: Re: [PATCH] RISC-V: Add mode switching target hook to insert rounding mode config for fixed-point instructions

2023-05-17 Thread 钟居哲

Hi, Kito. The intrinsic doc has updated the fixed-point enum.
This patch (which you have already LGTM'd) should be merged after this one:

https://patchwork.sourceware.org/project/gcc/patch/20230517052521.405836-1-juzhe.zh...@rivai.ai/
 
Can you respond to that patch?

Thanks.


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-17 18:05
To: juzhe.zhong
CC: gcc-patches; kito.cheng; palmer; palmer; jeffreyalaw; rdapp.gcc
Subject: Re: [PATCH] RISC-V: Add mode switching target hook to insert rounding 
mode config for fixed-point instructions
LGTM, it's really awesome, I know it's kind of blocking due to enum
stuff, so feel free to commit this once it unblock :)
 
On Wed, May 17, 2023 at 5:58 PM  wrote:
>
> From: Juzhe-Zhong 
>
> Hi, this patch supports the upcoming fixed-point intrinsics:
> https://github.com/riscv-non-isa/rvv-intrinsic-doc/pull/222
>
> It inserts the fixed-point rounding mode configuration through the mode
> switching target hooks.
>
> The mode switching pass is implemented using LCM (Lazy Code Motion),
>
> so its performance and correctness can be trusted.
>
> Here is the example:
>
> void f (void * in, void *out, int32_t x, int n, int m)
> {
>   for (int i = 0; i < n; i++) {
> vint32m1_t v = __riscv_vle32_v_i32m1 (in + i, 4);
> vint32m1_t v2 = __riscv_vle32_v_i32m1_tu (v, in + 100 + i, 4);
> vint32m1_t v3 = __riscv_vaadd_vx_i32m1 (v2, 0, VXRM_RDN, 4);
> v3 = __riscv_vaadd_vx_i32m1 (v3, 3, VXRM_RDN, 4);
> __riscv_vse32_v_i32m1 (out + 100 + i, v3, 4);
>   }
>
>   for (int i = 0; i < n; i++) {
> vint32m1_t v = __riscv_vle32_v_i32m1 (in + i + 1000, 4);
> vint32m1_t v2 = __riscv_vle32_v_i32m1_tu (v, in + 100 + i + 1000, 4);
> vint32m1_t v3 = __riscv_vaadd_vx_i32m1 (v2, 0, VXRM_RDN, 4);
> v3 = __riscv_vaadd_vx_i32m1 (v3, 3, VXRM_RDN, 4);
> __riscv_vse32_v_i32m1 (out + 100 + i + 1000, v3, 4);
>   }
> }
>
> ASM:
>
> ...
> csrwi   vxrm,2
> vsetivli        zero,4,e32,m1,tu,ma
> ...
> Loop 1
> ...
> Loop 2
>
> Mode switching can recognize globally that both Loop 1 and Loop 2 use the RDN
> rounding mode, and it hoists a single "csrwi vxrm,2" so that it dominates both
> Loop 1 and Loop 2.
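
A deliberately simplified sketch of the effect described above. It is not Lazy Code Motion: it models the function as a straight-line sequence of regions, each of which either needs a vxrm value or does not care, and emits a mode set only when the needed value differs from the one currently in effect. The names Vxrm and mode_set_points are hypothetical; the numeric values follow the example, where "csrwi vxrm,2" selects RDN.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

enum class Vxrm { RNU = 0, RNE = 1, RDN = 2, ROD = 3 };

// Return the indices of the regions before which a "csrwi vxrm" would be
// emitted.  std::nullopt means the region does not use vxrm at all.
std::vector<std::size_t>
mode_set_points(const std::vector<std::optional<Vxrm>>& needs) {
  std::vector<std::size_t> points;
  std::optional<Vxrm> current;  // unknown on function entry
  for (std::size_t i = 0; i < needs.size(); ++i) {
    if (needs[i] && needs[i] != current) {
      points.push_back(i);
      current = needs[i];
    }
  }
  return points;
}
```

In this model two consecutive RDN regions need a single set before the first one, which matches the shape of the assembly in the example. The real pass works on the control-flow graph and also decides how far a set can be hoisted.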
>
> In addition, I have added correctness sanity tests in this patch.
>
> Ok for trunk ?
>
> gcc/ChangeLog:
>
> * config/riscv/riscv-opts.h (enum riscv_entity): New enum.
> * config/riscv/riscv.cc (riscv_emit_mode_set): New function.
> (riscv_mode_needed): Ditto.
> (riscv_mode_after): Ditto.
> (riscv_mode_entry): Ditto.
> (riscv_mode_exit): Ditto.
> (riscv_mode_priority): Ditto.
> (TARGET_MODE_EMIT): New target hook.
> (TARGET_MODE_NEEDED): Ditto.
> (TARGET_MODE_AFTER): Ditto.
> (TARGET_MODE_ENTRY): Ditto.
> (TARGET_MODE_EXIT): Ditto.
> (TARGET_MODE_PRIORITY): Ditto.
> * config/riscv/riscv.h (OPTIMIZE_MODE_SWITCHING): Ditto.
> (NUM_MODES_FOR_MODE_SWITCHING): Ditto.
> * config/riscv/riscv.md: Add csrwvxrm.
> * config/riscv/vector.md (rnu,rne,rdn,rod,none): New attribute.
> (vxrmsi): New pattern.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/base/vxrm-10.c: New test.
> * gcc.target/riscv/rvv/base/vxrm-6.c: New test.
> * gcc.target/riscv/rvv/base/vxrm-7.c: New test.
> * gcc.target/riscv/rvv/base/vxrm-8.c: New test.
> * gcc.target/riscv/rvv/base/vxrm-9.c: New test.
>
> ---
>  gcc/config/riscv/riscv-opts.h |   8 ++
>  gcc/config/riscv/riscv.cc | 104 ++
>  gcc/config/riscv/riscv.h  |   6 +-
>  gcc/config/riscv/riscv.md |   3 +-
>  gcc/config/riscv/vector.md|  29 +
>  .../gcc.target/riscv/rvv/base/vxrm-10.c   |  26 +
>  .../gcc.target/riscv/rvv/base/vxrm-6.c|  15 +++
>  .../gcc.target/riscv/rvv/base/vxrm-7.c|  16 +++
>  .../gcc.target/riscv/rvv/base/vxrm-8.c|  18 +++
>  .../gcc.target/riscv/rvv/base/vxrm-9.c|  26 +
>  10 files changed, 249 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/vxrm-10.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/vxrm-6.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/vxrm-7.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/vxrm-8.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/vxrm-9.c
>
> diff --git a/gcc/config/riscv/riscv-opts.h b/gcc/config/riscv/riscv-opts.h
> index 1b2e6de5e1b..2a16402265a 100644
> --- a/gcc/config/riscv/riscv-opts.h
> +++ b/gcc/config/riscv/riscv-opts.h
> @@ -91,6 +91,14 @@ enum riscv_multilib_select_kind {
>select_by_abi,
>  };
>
> +/* ENTITIES in mode switching.  */
> +enum riscv_entity
> +{
> +  RISCV_VXRM = 0,
> +  RISCV_FRM,
> +  MAX_RISCV_ENTITIES
> +};
> +
>  #define MASK_ZICSR    (1 << 0)
>  #define MASK_ZIFENCEI (1 << 1)
>
> diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> index

Re: Re: [PATCH V7] VECT: Add decrement IV support in Loop Vectorizer

2023-05-15 Thread 钟居哲

Hi, Richard.

>> Easier to read as:

 >>  _41 = _40 - 16

>> (which might not be valid gimple, but pseudocode is good enough).

OK.


>> The difficulty with this is that the len_load* and len_store*
>>optabs currently say that the behaviour is undefined if the
>>length argument is greater than the length of a vector.
>>So I think using these values of _47 and _44 in the .LEN_STOREs
>>is relying on undefined behaviour.

>>Haven't had time to think about the consequences of that yet,
>>but wanted to send something out sooner rather than later.

Yes, we have the tail-agnostic (TA) policy in vsetvli, which leaves the tail
elements with undefined values. The current optabs behavior matches the RVV
specification. I think we can leave this to be solved carefully in the future.
So far, I have not seen any issue.

>>It would be better to use known_le here, without checking whether the
>>VF is constant.
Ok

Thank you so much for your patience in helping with this patch.
I have sent the V8 patch with the fixes you suggested:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618638.html 

Can I merge this patch?

I am gonna post the next patch with select_vl included.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-16 03:44
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V7] VECT: Add decrement IV support in Loop Vectorizer
juzhe.zh...@rivai.ai writes:
> From: Juzhe-Zhong 
>
> This patch implements a decrementing IV for the length approach in loop control.
>
> It addresses the comment from kewen by incorporating the implementation into
> "vect_set_loop_controls_directly" instead of a standalone function.
>
> It addresses the comment from Richard by using MIN_EXPR to handle the following
> 3 cases:
> 1. single rgroup.
> 2. multiple rgroup for SLP.
> 3. multiple rgroup for non-SLP (tested on vec_pack_trunc).
 
Thanks, this looks pretty reasonable to me FWIW, but some comments below:
 
> Bootstrapped and regression-tested on x86.
>
> Ok for trunk ?
>
> gcc/ChangeLog:
>
> * tree-vect-loop-manip.cc (vect_adjust_loop_lens): New function.
> (vect_set_loop_controls_directly): Add decrement IV support.
> (vect_set_loop_condition_partial_vectors): Ditto.
> * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Add a new 
> variable.
> (vect_get_loop_len): Add decrement IV support.
> * tree-vect-stmts.cc (vectorizable_store): Ditto.
> (vectorizable_load): Ditto.
> * tree-vectorizer.h (LOOP_VINFO_USING_DECREMENTING_IV_P): New macro.
> (vect_get_loop_len): Add decrement IV support.
>
> ---
>  gcc/tree-vect-loop-manip.cc | 177 +++-
>  gcc/tree-vect-loop.cc   |  38 +++-
>  gcc/tree-vect-stmts.cc  |   9 +-
>  gcc/tree-vectorizer.h   |  13 ++-
>  4 files changed, 224 insertions(+), 13 deletions(-)
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index ff6159e08d5..1baac7b1b52 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -385,6 +385,58 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, 
> rgroup_controls *dest_rgm,
>return false;
>  }
>  
> +/* Try to use adjust loop lens for non-SLP multiple-rgroups.
> +
> + _36 = MIN_EXPR ;
> +
> + First length (MIN (X, VF/N)):
> +   loop_len_15 = MIN_EXPR <_36, VF/N>;
> +
> + Second length:
> +   tmp = _36 - loop_len_15;
> +   loop_len_16 = MIN (tmp, VF/N);
> +
> + Third length:
> +   tmp2 = tmp - loop_len_16;
> +   loop_len_17 = MIN (tmp2, VF/N);
> +
> + Fourth length:
> +   tmp3 = tmp2 - loop_len_17;
> +   loop_len_18 = MIN (tmp3, VF/N);  */
> +
> +static void
> +vect_adjust_loop_lens (tree iv_type, gimple_seq *seq, rgroup_controls 
> *dest_rgm,
> +rgroup_controls *src_rgm)
> +{
> +  tree ctrl_type = dest_rgm->type;
> +  poly_uint64 nitems_per_ctrl
> += TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
> +
> +  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> +{
> +  tree src = src_rgm->controls[i / dest_rgm->controls.length ()];
> +  tree dest = dest_rgm->controls[i];
> +  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
> +  gassign *stmt;
> +  if (i == 0)
> + {
> +   /* MIN (X, VF*I/N) capped to the range [0, VF/N].  */
> +   stmt = gimple_build_assign (dest, MIN_EXPR, src, length_limit);
> +   gimple_seq_add_stmt (seq, stmt);
> + }
> +  else
> + {
> +   /* (MIN (remain, VF*I/N)) capped to the range [0, VF/N].  */
> +   tree temp = make_ssa_name (iv_type);
> +   stmt = gimple_build_assign (temp, MINUS_EXPR, src,
> +   dest_rgm->controls[i - 1]);
> +   gimple_seq_add_stmt (seq, stmt);
> +   stmt = gimple_build_assign (dest, MIN_EXPR, temp, length_limit);
> +   gimple_seq_add_stmt (seq, stmt);
> + }
> +}
> +}
> +
>  /* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
> for all the rgroup controls in RGC and return a control that is nonzero
> when the loop needs to

Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point instructions

2023-05-15 Thread 钟居哲

Thanks.
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618614.html 
Here is the V2 patch.
I have described which instructions get FRM and which do not.
Would you mind checking it again now?



juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-15 22:41
To: 钟居哲
CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
instructions
all sign injection operations (vfsgnjn/ vfsgnj/vfsgnjx and its
friends) didn't involve rounding in the operation, so vfneg.v and
vfabs.v don't need FRM.
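
A minimal scalar sketch of why sign injection involves no rounding: each operation only combines the magnitude bits of one operand with a sign bit derived from the other, so the result is exact. The sketch assumes a 32-bit IEEE 754 float; the function names are hypothetical scalar stand-ins for the vector instructions named above.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

constexpr std::uint32_t kSignBit = 0x80000000u;

static std::uint32_t to_bits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

static float from_bits(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof f);
  return f;
}

// Magnitude of a, sign of b.
float fsgnj(float a, float b) {
  return from_bits((to_bits(a) & ~kSignBit) | (to_bits(b) & kSignBit));
}

// Magnitude of a, inverted sign of b.
float fsgnjn(float a, float b) {
  return from_bits((to_bits(a) & ~kSignBit) | (~to_bits(b) & kSignBit));
}

// Magnitude of a, sign of a XOR sign of b.
float fsgnjx(float a, float b) {
  return from_bits(to_bits(a) ^ (to_bits(b) & kSignBit));
}

// vfneg.v vd,vs = vfsgnjn.vv vd,vs,vs and vfabs.v vd,vs = vfsgnjx.vv vd,vs,vs
float vfneg_elem(float x) { return fsgnjn(x, x); }
float vfabs_elem(float x) { return fsgnjx(x, x); }
```

Because only the sign bit changes, no rounding mode can influence the result.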
 
On Mon, May 15, 2023 at 10:38 PM 钟居哲  wrote:
>
> And what about vfabs? I guess it also needs FRM?
> vfneg/vfabs/vfsgnj/vfsgnj/vfsgnjx
> vfneg.v vd,vs = vfsgnjn.vv vd,vs,vs
> vfabs.v vd,vs = vfsgnjx.vv vd,vs,vs
>
> Those are all the questions I have; please double-check for me.
> Thanks.
>
>
> juzhe.zh...@rivai.ai
>
> From: Kito Cheng
> Date: 2023-05-15 22:22
> To: 钟居哲
> CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
> Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
> instructions
> > Oh, do you mean vfsqrt7/vfrec7 don't have frm, but vfsqrt/vfneg should
> > have frm?
> > Is that right? If yes, I am going to send a patch to fix it immediately.
>
> Yes, and I also double checked spike implementation :P
>
> and it seems like you're not committed yet, so let's send V2 :)
>
> On Mon, May 15, 2023 at 10:12 PM 钟居哲  wrote:
> >
> > Oh, do you mean vfsqrt7/vfrec7 don't have frm, but vfsqrt/vfneg should
> > have frm?
> > Is that right? If yes, I am going to send a patch to fix it immediately.
> >
> >
> >
> > juzhe.zh...@rivai.ai
> >
> > From: Kito Cheng
> > Date: 2023-05-15 22:07
> > To: 钟居哲
> > CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
> > Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating 
> > point instructions
> > Oh, Craig says vfrsqrt7.v does not have frm but vfsqrt.v does, and
> > I checked that spike matches that.
> >
> > On Mon, May 15, 2023 at 9:55 PM 钟居哲  wrote:
> > >
> > > I don't know why we should not add frm vfsqrt.v since I saw topper (LLVM 
> > > maintainer) said we should
> > > not add frm into vsqrt.v. Maybe kito knows the reason ?
> > >
> > > https://github.com/riscv-non-isa/rvv-intrinsic-doc/pull/226
> > >
> > >
> > >
> > >
> > > juzhe.zh...@rivai.ai
> > >
> > > From: Jeff Law
> > > Date: 2023-05-15 21:52
> > > To: juzhe.zhong; gcc-patches
> > > CC: kito.cheng; kito.cheng; palmer; palmer; rdapp.gcc
> > > Subject: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
> > > instructions
> > >
> > >
> > > On 5/15/23 05:49, juzhe.zh...@rivai.ai wrote:
> > > > From: Juzhe-Zhong 
> > > >
> > > > This patch is adding rounding mode operand and FRM_REGNUM dependency
> > > > into floating-point instructions.
> > > >
> > > > The floating-point instructions we added FRM and rounding mode operand:
> > > > 1. vfadd/vfsub
> > > > 2. vfwadd/vfwsub
> > > > 3. vfmul
> > > > 4. vfdiv
> > > > 5. vfwmul
> > > > 6. vfwmacc/vfwnmacc/vfwmsac/vfwnmsac
> > > > 7. vfsqrt7/vfrec7
> > > > 8. floating-point conversions.
> > > > 9. floating-point reductions.
> > > >
> > > > The floating-point instructions we did NOT add FRM and rounding mode 
> > > > operand:
> > > > 1. vfsqrt/vfneg
> > > Assuming vfsqrt is actually an estimator the best place to handle
> > > rounding modes is at the last step(s) after N-R or Goldschmidt
> > > refinement steps.  I haven't paid too much attention to FP yet, but this
> > > is an area I've got fairly extensive experience.
> > >
> > > Sadly RISC-V's estimator is fairly poor and the single instance FMACs
> > > are going to result in an implementation that may not actually be any
> > > better than what glibc can do.
> > >
> > > Jeff
> > >
> >
>

Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point instructions

2023-05-15 Thread 钟居哲

And what about vfabs? I guess it also needs FRM?
vfneg/vfabs/vfsgnj/vfsgnjn/vfsgnjx
vfneg.v vd,vs = vfsgnjn.vv vd,vs,vs
vfabs.v vd,vs = vfsgnjx.vv vd,vs,vs

Those are all the questions I have; please double-check for me.
Thanks.


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-15 22:22
To: 钟居哲
CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
instructions
> Oh, do you mean vfsqrt7/vfrec7 doesn't have frm, but vfsqrt/vfneg should have 
> frm.
> Is that rigth? If yes, I am gonna send a patch to fix it immediately.
 
Yes, and I also double checked spike implementation :P
 
and it seems like you're not committed yet, so let's send V2 :)
 
On Mon, May 15, 2023 at 10:12 PM 钟居哲  wrote:
>
> Oh, do you mean vfsqrt7/vfrec7 doesn't have frm, but vfsqrt/vfneg should have 
> frm.
> Is that rigth? If yes, I am gonna send a patch to fix it immediately.
>
>
>
> juzhe.zh...@rivai.ai
>
> From: Kito Cheng
> Date: 2023-05-15 22:07
> To: 钟居哲
> CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
> Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
> instructions
> Oh, Craig says vfrsqrt7.v not have frm but vsqrt.v have frm, and
> checked spike that match that.
>
> On Mon, May 15, 2023 at 9:55 PM 钟居哲  wrote:
> >
> > I don't know why we should not add frm vfsqrt.v since I saw topper (LLVM 
> > maintainer) said we should
> > not add frm into vsqrt.v. Maybe kito knows the reason ?
> >
> > https://github.com/riscv-non-isa/rvv-intrinsic-doc/pull/226
> >
> >
> >
> >
> > juzhe.zh...@rivai.ai
> >
> > From: Jeff Law
> > Date: 2023-05-15 21:52
> > To: juzhe.zhong; gcc-patches
> > CC: kito.cheng; kito.cheng; palmer; palmer; rdapp.gcc
> > Subject: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
> > instructions
> >
> >
> > On 5/15/23 05:49, juzhe.zh...@rivai.ai wrote:
> > > From: Juzhe-Zhong 
> > >
> > > This patch is adding rounding mode operand and FRM_REGNUM dependency
> > > into floating-point instructions.
> > >
> > > The floating-point instructions we added FRM and rounding mode operand:
> > > 1. vfadd/vfsub
> > > 2. vfwadd/vfwsub
> > > 3. vfmul
> > > 4. vfdiv
> > > 5. vfwmul
> > > 6. vfwmacc/vfwnmacc/vfwmsac/vfwnmsac
> > > 7. vfsqrt7/vfrec7
> > > 8. floating-point conversions.
> > > 9. floating-point reductions.
> > >
> > > The floating-point instructions we did NOT add FRM and rounding mode 
> > > operand:
> > > 1. vfsqrt/vfneg
> > Assuming vfsqrt is actually an estimator the best place to handle
> > rounding modes is at the last step(s) after N-R or Goldschmidt
> > refinement steps.  I haven't paid too much attention to FP yet, but this
> > is an area I've got fairly extensive experience.
> >
> > Sadly RISC-V's estimator is fairly poor and the single instance FMACs
> > are going to result in an implementation that may not actually be any
> > better than what glibc can do.
> >
> > Jeff
> >
>

Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point instructions

2023-05-15 Thread 钟居哲

The reason I ask about vfsgnjn is that, according to the RVV ISA:
vfneg.v vd,vs = vfsgnjn.vv vd,vs,vs.

It's really confusing that the document has FRM for vfneg but no FRM for
vfsgnjn.
It's really odd.


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-15 22:22
To: 钟居哲
CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
instructions
> Oh, do you mean vfsqrt7/vfrec7 doesn't have frm, but vfsqrt/vfneg should have 
> frm.
> Is that rigth? If yes, I am gonna send a patch to fix it immediately.
 
Yes, and I also double checked spike implementation :P
 
and it seems like you're not committed yet, so let's send V2 :)
 
On Mon, May 15, 2023 at 10:12 PM 钟居哲  wrote:
>
> Oh, do you mean vfsqrt7/vfrec7 doesn't have frm, but vfsqrt/vfneg should have 
> frm.
> Is that rigth? If yes, I am gonna send a patch to fix it immediately.
>
>
>
> juzhe.zh...@rivai.ai
>
> From: Kito Cheng
> Date: 2023-05-15 22:07
> To: 钟居哲
> CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
> Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
> instructions
> Oh, Craig says vfrsqrt7.v not have frm but vsqrt.v have frm, and
> checked spike that match that.
>
> On Mon, May 15, 2023 at 9:55 PM 钟居哲  wrote:
> >
> > I don't know why we should not add frm vfsqrt.v since I saw topper (LLVM 
> > maintainer) said we should
> > not add frm into vsqrt.v. Maybe kito knows the reason ?
> >
> > https://github.com/riscv-non-isa/rvv-intrinsic-doc/pull/226
> >
> >
> >
> >
> > juzhe.zh...@rivai.ai
> >
> > From: Jeff Law
> > Date: 2023-05-15 21:52
> > To: juzhe.zhong; gcc-patches
> > CC: kito.cheng; kito.cheng; palmer; palmer; rdapp.gcc
> > Subject: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
> > instructions
> >
> >
> > On 5/15/23 05:49, juzhe.zh...@rivai.ai wrote:
> > > From: Juzhe-Zhong 
> > >
> > > This patch is adding rounding mode operand and FRM_REGNUM dependency
> > > into floating-point instructions.
> > >
> > > The floating-point instructions we added FRM and rounding mode operand:
> > > 1. vfadd/vfsub
> > > 2. vfwadd/vfwsub
> > > 3. vfmul
> > > 4. vfdiv
> > > 5. vfwmul
> > > 6. vfwmacc/vfwnmacc/vfwmsac/vfwnmsac
> > > 7. vfsqrt7/vfrec7
> > > 8. floating-point conversions.
> > > 9. floating-point reductions.
> > >
> > > The floating-point instructions we did NOT add FRM and rounding mode 
> > > operand:
> > > 1. vfsqrt/vfneg
> > Assuming vfsqrt is actually an estimator the best place to handle
> > rounding modes is at the last step(s) after N-R or Goldschmidt
> > refinement steps.  I haven't paid too much attention to FP yet, but this
> > is an area I've got fairly extensive experience.
> >
> > Sadly RISC-V's estimator is fairly poor and the single instance FMACs
> > are going to result in an implementation that may not actually be any
> > better than what glibc can do.
> >
> > Jeff
> >
>

Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point instructions

2023-05-15 Thread 钟居哲

What about vfsgnjn? Does it have FRM? I want to double-check since I don't 
trust the document.



juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-15 22:22
To: 钟居哲
CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
instructions
> Oh, do you mean vfsqrt7/vfrec7 doesn't have frm, but vfsqrt/vfneg should have 
> frm.
> Is that rigth? If yes, I am gonna send a patch to fix it immediately.
 
Yes, and I also double checked spike implementation :P
 
and it seems like you're not committed yet, so let's send V2 :)
 
On Mon, May 15, 2023 at 10:12 PM 钟居哲  wrote:
>
> Oh, do you mean vfsqrt7/vfrec7 doesn't have frm, but vfsqrt/vfneg should have 
> frm.
> Is that rigth? If yes, I am gonna send a patch to fix it immediately.
>
>
>
> juzhe.zh...@rivai.ai
>
> From: Kito Cheng
> Date: 2023-05-15 22:07
> To: 钟居哲
> CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
> Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
> instructions
> Oh, Craig says vfrsqrt7.v not have frm but vsqrt.v have frm, and
> checked spike that match that.
>
> On Mon, May 15, 2023 at 9:55 PM 钟居哲  wrote:
> >
> > I don't know why we should not add frm vfsqrt.v since I saw topper (LLVM 
> > maintainer) said we should
> > not add frm into vsqrt.v. Maybe kito knows the reason ?
> >
> > https://github.com/riscv-non-isa/rvv-intrinsic-doc/pull/226
> >
> >
> >
> >
> > juzhe.zh...@rivai.ai
> >
> > From: Jeff Law
> > Date: 2023-05-15 21:52
> > To: juzhe.zhong; gcc-patches
> > CC: kito.cheng; kito.cheng; palmer; palmer; rdapp.gcc
> > Subject: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
> > instructions
> >
> >
> > On 5/15/23 05:49, juzhe.zh...@rivai.ai wrote:
> > > From: Juzhe-Zhong 
> > >
> > > This patch is adding rounding mode operand and FRM_REGNUM dependency
> > > into floating-point instructions.
> > >
> > > The floating-point instructions we added FRM and rounding mode operand:
> > > 1. vfadd/vfsub
> > > 2. vfwadd/vfwsub
> > > 3. vfmul
> > > 4. vfdiv
> > > 5. vfwmul
> > > 6. vfwmacc/vfwnmacc/vfwmsac/vfwnmsac
> > > 7. vfsqrt7/vfrec7
> > > 8. floating-point conversions.
> > > 9. floating-point reductions.
> > >
> > > The floating-point instructions we did NOT add FRM and rounding mode 
> > > operand:
> > > 1. vfsqrt/vfneg
> > Assuming vfsqrt is actually an estimator the best place to handle
> > rounding modes is at the last step(s) after N-R or Goldschmidt
> > refinement steps.  I haven't paid too much attention to FP yet, but this
> > is an area I've got fairly extensive experience.
> >
> > Sadly RISC-V's estimator is fairly poor and the single instance FMACs
> > are going to result in an implementation that may not actually be any
> > better than what glibc can do.
> >
> > Jeff
> >
>

Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point instructions

2023-05-15 Thread 钟居哲

Oh, do you mean vfrsqrt7/vfrec7 don't have frm, but vfsqrt/vfneg should have 
frm?
Is that right? If yes, I'll send a patch to fix it immediately.



juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-15 22:07
To: 钟居哲
CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
instructions
Oh, Craig says vfrsqrt7.v does not have frm but vfsqrt.v does, and
I checked that spike matches that.
 
On Mon, May 15, 2023 at 9:55 PM 钟居哲  wrote:
>
> I don't know why we should not add frm vfsqrt.v since I saw topper (LLVM 
> maintainer) said we should
> not add frm into vsqrt.v. Maybe kito knows the reason ?
>
> https://github.com/riscv-non-isa/rvv-intrinsic-doc/pull/226
>
>
>
>
> juzhe.zh...@rivai.ai
>
> From: Jeff Law
> Date: 2023-05-15 21:52
> To: juzhe.zhong; gcc-patches
> CC: kito.cheng; kito.cheng; palmer; palmer; rdapp.gcc
> Subject: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
> instructions
>
>
> On 5/15/23 05:49, juzhe.zh...@rivai.ai wrote:
> > From: Juzhe-Zhong 
> >
> > This patch is adding rounding mode operand and FRM_REGNUM dependency
> > into floating-point instructions.
> >
> > The floating-point instructions we added FRM and rounding mode operand:
> > 1. vfadd/vfsub
> > 2. vfwadd/vfwsub
> > 3. vfmul
> > 4. vfdiv
> > 5. vfwmul
> > 6. vfwmacc/vfwnmacc/vfwmsac/vfwnmsac
> > 7. vfsqrt7/vfrec7
> > 8. floating-point conversions.
> > 9. floating-point reductions.
> >
> > The floating-point instructions we did NOT add FRM and rounding mode 
> > operand:
> > 1. vfsqrt/vfneg
> Assuming vfsqrt is actually an estimator the best place to handle
> rounding modes is at the last step(s) after N-R or Goldschmidt
> refinement steps.  I haven't paid too much attention to FP yet, but this
> is an area I've got fairly extensive experience.
>
> Sadly RISC-V's estimator is fairly poor and the single instance FMACs
> are going to result in an implementation that may not actually be any
> better than what glibc can do.
>
> Jeff
>

Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point instructions

2023-05-15 Thread 钟居哲

So you mean I also need to add frm to vfsqrt?
If yes, I'll send another patch to add it.



juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-15 22:07
To: 钟居哲
CC: Jeff Law; gcc-patches; kito.cheng; palmer; palmer; rdapp.gcc
Subject: Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
instructions
Oh, Craig says vfrsqrt7.v does not have frm but vfsqrt.v does, and
I checked that spike matches that.
 
On Mon, May 15, 2023 at 9:55 PM 钟居哲  wrote:
>
> I don't know why we should not add frm vfsqrt.v since I saw topper (LLVM 
> maintainer) said we should
> not add frm into vsqrt.v. Maybe kito knows the reason ?
>
> https://github.com/riscv-non-isa/rvv-intrinsic-doc/pull/226
>
>
>
>
> juzhe.zh...@rivai.ai
>
> From: Jeff Law
> Date: 2023-05-15 21:52
> To: juzhe.zhong; gcc-patches
> CC: kito.cheng; kito.cheng; palmer; palmer; rdapp.gcc
> Subject: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
> instructions
>
>
> On 5/15/23 05:49, juzhe.zh...@rivai.ai wrote:
> > From: Juzhe-Zhong 
> >
> > This patch is adding rounding mode operand and FRM_REGNUM dependency
> > into floating-point instructions.
> >
> > The floating-point instructions we added FRM and rounding mode operand:
> > 1. vfadd/vfsub
> > 2. vfwadd/vfwsub
> > 3. vfmul
> > 4. vfdiv
> > 5. vfwmul
> > 6. vfwmacc/vfwnmacc/vfwmsac/vfwnmsac
> > 7. vfsqrt7/vfrec7
> > 8. floating-point conversions.
> > 9. floating-point reductions.
> >
> > The floating-point instructions we did NOT add FRM and rounding mode 
> > operand:
> > 1. vfsqrt/vfneg
> Assuming vfsqrt is actually an estimator the best place to handle
> rounding modes is at the last step(s) after N-R or Goldschmidt
> refinement steps.  I haven't paid too much attention to FP yet, but this
> is an area I've got fairly extensive experience.
>
> Sadly RISC-V's estimator is fairly poor and the single instance FMACs
> are going to result in an implementation that may not actually be any
> better than what glibc can do.
>
> Jeff
>

Re: Re: [PATCH] RISC-V: Add rounding mode operand for floating point instructions

2023-05-15 Thread 钟居哲

I don't know why we should not add frm to vfsqrt.v; I only saw topper (LLVM 
maintainer) say that we should
not add frm to vfsqrt.v. Maybe Kito knows the reason?

https://github.com/riscv-non-isa/rvv-intrinsic-doc/pull/226 




juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-05-15 21:52
To: juzhe.zhong; gcc-patches
CC: kito.cheng; kito.cheng; palmer; palmer; rdapp.gcc
Subject: Re: [PATCH] RISC-V: Add rounding mode operand for floating point 
instructions
 
 
On 5/15/23 05:49, juzhe.zh...@rivai.ai wrote:
> From: Juzhe-Zhong 
> 
> This patch is adding rounding mode operand and FRM_REGNUM dependency
> into floating-point instructions.
> 
> The floating-point instructions we added FRM and rounding mode operand:
> 1. vfadd/vfsub
> 2. vfwadd/vfwsub
> 3. vfmul
> 4. vfdiv
> 5. vfwmul
> 6. vfwmacc/vfwnmacc/vfwmsac/vfwnmsac
> 7. vfsqrt7/vfrec7
> 8. floating-point conversions.
> 9. floating-point reductions.
> 
> The floating-point instructions we did NOT add FRM and rounding mode operand:
> 1. vfsqrt/vfneg
Assuming vfsqrt is actually an estimator the best place to handle 
rounding modes is at the last step(s) after N-R or Goldschmidt 
refinement steps.  I haven't paid too much attention to FP yet, but this 
is an area I've got fairly extensive experience.
 
Sadly RISC-V's estimator is fairly poor and the single instance FMACs 
are going to result in an implementation that may not actually be any 
better than what glibc can do.
 
Jeff

Re: [PATCH V5] RISC-V: Using merge approach to optimize repeating sequence in vec_init

2023-05-14 Thread 钟居哲

Ping. Ok for trunk ?



juzhe.zh...@rivai.ai
 
From: juzhe.zhong
Date: 2023-05-13 08:20
To: gcc-patches
CC: kito.cheng; palmer; jeffreyalaw; Juzhe-Zhong
Subject: [PATCH V5] RISC-V: Using merge approach to optimize repeating sequence 
in vec_init
From: Juzhe-Zhong 
 
1. Remove magic number of V4
2. Remove unnecessary gcc_assert
 
Consider the following case:
typedef int64_t vnx32di __attribute__ ((vector_size (256)));
 
 
__attribute__ ((noipa)) void
f_vnx32di (int64_t a, int64_t b, int64_t *out)
{
  vnx32di v
= {a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, 
a, b, a, b, a, b, a, b};
  *(vnx32di *) out = v;
}
 
Since we don't have SEW = 128 in vec_duplicate, we can't combine a and b into a 
single SEW = 128 element and then
broadcast this big element.
 
This patch optimizes the case above.
 
-march=rv64gcv_zvl256b --param riscv-autovec-preference=fixed-vlmax
 
Before this patch:
 
..
vslide1down.vx (repeated 31 times)
..
 
After this patch:
li a5,-1431654400
addi a5,a5,-1365
li a3,-1431654400
addi a3,a3,-1366
slli a5,a5,32
add a5,a5,a3
vsetvli a4,zero,e64,m8,ta,ma
vmv.v.x v8,a0
vmv.s.x v0,a5
vmerge.vxm v8,v8,a1,v0
vs8r.v v8,0(a2)
ret
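
The li/addi/slli/add sequence above works out to the 64-bit constant 0xAAAAAAAAAAAAAAAA, i.e. a mask with every odd bit set, so vmerge.vxm writes b (a1) into the odd elements of a vector pre-filled with a. A minimal C++ sketch of how such a mask can be computed for a repeating sequence follows. The function merge_mask_bitfield and its parameters are hypothetical stand-ins written for this illustration, not the body of rvv_builder::get_merge_mask_bitfield from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-alone mask computation.
// `npatterns` is the length of the repeating sequence (2 for {a, b, a, b, ...});
// `index` selects which element of the sequence the mask should cover.
uint64_t merge_mask_bitfield(unsigned index, unsigned npatterns)
{
  constexpr unsigned kMaskBits = 64; // number of bits in uint64_t
  assert(npatterns > 0 && npatterns <= kMaskBits && index < npatterns);

  uint64_t mask = 0;
  // Set bit `index`, then every npatterns-th bit after it.
  for (unsigned pos = index; pos < kMaskBits; pos += npatterns)
    mask |= uint64_t{1} << pos;
  return mask;
}
```

For the {a, b, a, b, ...} case, merge_mask_bitfield(1, 2) selects the positions of b and merge_mask_bitfield(0, 2) selects the positions of a.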
 
gcc/ChangeLog:
 
* config/riscv/riscv-v.cc 
(rvv_builder::can_duplicate_repeating_sequence_p): New function.
(rvv_builder::get_merged_repeating_sequence): Ditto.
(rvv_builder::repeating_sequence_use_merge_profitable_p): Ditto.
(rvv_builder::get_merge_mask_bitfield): Ditto.
(emit_scalar_move_op): Ditto.
(emit_merge_op): Ditto.
(expand_vector_init_merge_repeating_sequence): Ditto.
(expand_vec_init): Add merge approach for repeating sequence.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-10.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-11.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-7.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-8.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-9.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/repeat_run-11.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/repeat_run-7.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/repeat_run-8.c: New test.
 
---
gcc/config/riscv/riscv-v.cc   | 243 --
.../riscv/rvv/autovec/vls-vlmax/repeat-10.c   |  19 ++
.../riscv/rvv/autovec/vls-vlmax/repeat-11.c   |  25 ++
.../riscv/rvv/autovec/vls-vlmax/repeat-7.c|  25 ++
.../riscv/rvv/autovec/vls-vlmax/repeat-8.c|  15 ++
.../riscv/rvv/autovec/vls-vlmax/repeat-9.c|  16 ++
.../rvv/autovec/vls-vlmax/repeat_run-11.c |  45 
.../rvv/autovec/vls-vlmax/repeat_run-7.c  |  45 
.../rvv/autovec/vls-vlmax/repeat_run-8.c  |  41 +++
9 files changed, 451 insertions(+), 23 deletions(-)
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-10.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-11.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-7.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-8.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat-9.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat_run-11.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat_run-7.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/repeat_run-8.c
 
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index b8dc333f54e..d844c305320 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -72,11 +72,14 @@ public:
   {
 add_input_operand (RVV_VUNDEF (mode), mode);
   }
-  void add_policy_operand (enum tail_policy vta, enum mask_policy vma)
+  void add_policy_operand (enum tail_policy vta)
   {
 rtx tail_policy_rtx = gen_int_mode (vta, Pmode);
-rtx mask_policy_rtx = gen_int_mode (vma, Pmode);
 add_input_operand (tail_policy_rtx, Pmode);
+  }
+  void add_policy_operand (enum mask_policy vma)
+  {
+rtx mask_policy_rtx = gen_int_mode (vma, Pmode);
 add_input_operand (mask_policy_rtx, Pmode);
   }
   void add_avl_type_operand (avl_type type)
@@ -99,25 +102,36 @@ public:
 add_vundef_operand (dest_mode);
   }
-  void set_len_and_policy (rtx len, bool force_vlmax = false)
-{
-  bool vlmax_p = force_vlmax;
-  gcc_assert (has_dest);
+  void set_dest_merge (rtx dest)
+  {
+dest_mode = GET_MODE (dest);
+has_dest = true;
+add_output_operand (dest, dest_mode);
+add_vundef_operand (dest_mode);
+  }
-  if (!len)
- {
-   vlmax_p = true;
-   len = gen_reg_rtx (Pmode);
-   emit_vlmax_vsetvl (dest_mode, len);
- }
+  void set_len_and_policy (rtx len, bool force_vlmax = false, bool ta_p = true,
+bool ma_p = true)
+  {
+bool vlmax_p = force_vlmax;
+gcc_assert (has_dest);
-

Re: Re: [PATCH V4] RISC-V: Using merge approach to optimize repeating sequence in vec_init

2023-05-12 Thread 钟居哲

Address comments.
V5 patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618398.html 

Thanks.


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-13 00:16
To: juzhe.zhong
CC: gcc-patches; palmer; jeffreyalaw
Subject: Re: [PATCH V4] RISC-V: Using merge approach to optimize repeating 
sequence in vec_init
> +/* Get the mask for merge approach.
> +
> + Consider such following case:
> +   {a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b}
> + To merge "a", the mask should be 1010
> + To merge "b", the mask should be 0101
> +*/
> +rtx
> +rvv_builder::get_merge_mask_bitfield (unsigned int index) const
> +{
> +  uint64_t base_mask = (1ULL << index);
> +  uint64_t mask = 0;
> +  for (unsigned int i = 0; i < (64 / npatterns ()); i++)
 
What the magic 64 means?
...
 
> +static void
> +expand_vector_init_merge_repeating_sequence (rtx target,
> +const rvv_builder &builder)
> +{
> +  machine_mode mask_mode;
> +  gcc_assert (get_mask_mode (builder.mode ()).exists (&mask_mode));
> +
> +  machine_mode dup_mode = builder.mode ();
> +  if (known_gt (GET_MODE_SIZE (dup_mode), BYTES_PER_RISCV_VECTOR))
> +{
> +  poly_uint64 nunits
> +   = exact_div (BYTES_PER_RISCV_VECTOR, builder.inner_units ());
> +  gcc_assert (
> +   get_vector_mode (builder.inner_int_mode (), nunits).exists (&dup_mode));
 
gcc_assert will removed at release mode, so it's not you want I guess?
 
> +}
> +  else
> +{
> +  if (FLOAT_MODE_P (dup_mode))
> +   gcc_assert (get_vector_mode (builder.inner_int_mode (),
> +GET_MODE_NUNITS (dup_mode))
> + .exists (&dup_mode));
 
Same issue
 
> +}
> +
> +  machine_mode dup_mask_mode;
> +  gcc_assert (get_mask_mode (dup_mode).exists (&dup_mask_mode));
 
Same issue

Re: Re: [PATCH V4] RISC-V: Using merge approach to optimize repeating sequence in vec_init

2023-05-12 Thread 钟居哲

>> What the magic 64 means?
uint64_t mask = 0;
64 = the number of bits in uint64_t (sizeof (uint64_t) * 8)

>> gcc_assert will removed at release mode, so it's not you want I guess?
Do you mean I need to remove it?
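
On the gcc_assert point, the concern in the review is that an assertion is removed in release mode, so a call that exists only inside the assertion is never performed and its out-parameter is never written. A minimal sketch of the usual workaround follows, using the standard assert macro (disabled by NDEBUG) as a stand-in for gcc_assert. The names find_width and width_or_abort are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical lookup with an out-parameter, standing in for calls of the
// form get_mask_mode (...).exists (&mode).
bool find_width(const std::map<std::string, int> &table,
                const std::string &name, int *out)
{
  auto it = table.find(name);
  if (it == table.end())
    return false;
  *out = it->second;
  return true;
}

int width_or_abort(const std::map<std::string, int> &table,
                   const std::string &name)
{
  int width = 0;
  // Risky form: assert (find_width (table, name, &width));
  // With assertions compiled out, the call vanishes and `width` stays 0.
  bool found = find_width(table, name, &width); // always evaluated
  assert(found);                                // checked only when assertions are enabled
  (void) found;                                 // silence unused warning when disabled
  return width;
}
```

The point is to keep the side effect outside the assertion and assert only on the stored result, rather than to delete the check.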


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-13 00:16
To: juzhe.zhong
CC: gcc-patches; palmer; jeffreyalaw
Subject: Re: [PATCH V4] RISC-V: Using merge approach to optimize repeating 
sequence in vec_init
> +/* Get the mask for merge approach.
> +
> + Consider such following case:
> +   {a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b}
> + To merge "a", the mask should be 1010
> + To merge "b", the mask should be 0101
> +*/
> +rtx
> +rvv_builder::get_merge_mask_bitfield (unsigned int index) const
> +{
> +  uint64_t base_mask = (1ULL << index);
> +  uint64_t mask = 0;
> +  for (unsigned int i = 0; i < (64 / npatterns ()); i++)
 
What the magic 64 means?
...
 
> +static void
> +expand_vector_init_merge_repeating_sequence (rtx target,
> +const rvv_builder &builder)
> +{
> +  machine_mode mask_mode;
> +  gcc_assert (get_mask_mode (builder.mode ()).exists (&mask_mode));
> +
> +  machine_mode dup_mode = builder.mode ();
> +  if (known_gt (GET_MODE_SIZE (dup_mode), BYTES_PER_RISCV_VECTOR))
> +{
> +  poly_uint64 nunits
> +   = exact_div (BYTES_PER_RISCV_VECTOR, builder.inner_units ());
> +  gcc_assert (
> +   get_vector_mode (builder.inner_int_mode (), nunits).exists (&dup_mode));
 
gcc_assert will removed at release mode, so it's not you want I guess?
 
> +}
> +  else
> +{
> +  if (FLOAT_MODE_P (dup_mode))
> +   gcc_assert (get_vector_mode (builder.inner_int_mode (),
> +GET_MODE_NUNITS (dup_mode))
> + .exists (&dup_mode));
 
Same issue
 
> +}
> +
> +  machine_mode dup_mask_mode;
> +  gcc_assert (get_mask_mode (dup_mode).exists (&dup_mask_mode));
 
Same issue

Re: Re: [PATCH V2] RISC-V: Using merge approach to optimize repeating sequence in vec_init

2023-05-12 Thread 钟居哲

Address comments.
V4 patch: 
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618375.html 
Regression PASSED.

Thanks.


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-12 23:19
To: juzhe.zhong
CC: gcc-patches; palmer; rdapp.gcc; jeffreyalaw
Subject: Re: [PATCH V2] RISC-V: Using merge approach to optimize repeating 
sequence in vec_init
two minor comments:
 
> +  void add_ta_policy_operand (enum tail_policy vta)
> +  void add_ma_policy_operand (enum mask_policy vma)
 
You could just name both add_policy_operand, since the argument type is
already sufficient to distinguish them.
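
A minimal C++ sketch of this suggestion: two overloads with the same name, selected purely by the enum type of the argument. The operand_list struct and its (kind, value) recording are hypothetical simplifications made for this example; the enums are reduced versions of tail_policy and mask_policy and are not copied from the GCC headers.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Simplified, hypothetical policy enums.
enum tail_policy { TAIL_UNDISTURBED = 0, TAIL_AGNOSTIC = 1 };
enum mask_policy { MASK_UNDISTURBED = 0, MASK_AGNOSTIC = 1 };

// Hypothetical operand recorder: stores a kind tag plus the policy value.
struct operand_list
{
  std::vector<std::pair<char, int>> ops;

  // Same name for both; overload resolution picks by argument type.
  void add_policy_operand(tail_policy vta) { ops.emplace_back('t', static_cast<int>(vta)); }
  void add_policy_operand(mask_policy vma) { ops.emplace_back('m', static_cast<int>(vma)); }
};
```

A call such as list.add_policy_operand (TAIL_AGNOSTIC) resolves to the tail overload without any hint in the function name, which is the reason separate add_ta_/add_ma_ names are not needed.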
 
> @@ -84,40 +92,52 @@ public:
>  add_input_operand (gen_int_mode (type, Pmode), Pmode);
>}
>
> -  void set_dest_and_mask (rtx mask, rtx dest, machine_mode mask_mode)
> +  void set_dest_and_mask (rtx mask, rtx dest, machine_mode mask_mode,
> + bool scalar_move = false, bool merge_op = false)
>{
>  dest_mode = GET_MODE (dest);
>  has_dest = true;
>
>  add_output_operand (dest, dest_mode);
>
> -if (mask)
> -  add_input_operand (mask, GET_MODE (mask));
> -else
> -  add_all_one_mask_operand (mask_mode);
> +if (!merge_op)
> +  {
> +   if (mask)
> + add_input_operand (mask, GET_MODE (mask));
> +   else
> + {
> +   if (scalar_move)
> + add_scalar_move_mask_operand (mask_mode);
> +   else
> + add_all_one_mask_operand (mask_mode);
> + }
> +  }
 
I would like to have a separate set_dest_and_mask_merge_op to reduce the
complexity of set_dest_and_mask.
From the code, what I understand is:
- mask and merge_op are mutually exclusive.
- scalar_move becomes meaningless if merge_op is true.
 
So I think mixing them together is not a good idea.

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread 钟居哲

Hi, Richards.

Could you take a look at this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618241.html 

Thanks


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-11 20:42
To: juzhe.zhong\@rivai.ai
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
"juzhe.zh...@rivai.ai"  writes:
> Thanks. I have read the rgroup descriptions again.
> I still don't fully understand them, so bear with me :)
>
> I don't know how to differentiate Case 2 and Case 3.
>
> Case 2 is multiple rgroups for SLP.
> Case 3 is multiple rgroups for non-SLP (VEC_PACK_TRUNC)
>
> Is this correct:
> case 2: rgc->max_nscalars_per_iter != 1
 
Yes.
 
> Case 3: rgc->max_nscalars_per_iter == 1 but rgc->factor != 1?
 
For case 3 it's:
 
rgc->max_nscalars_per_iter == 1 && rgc != &LOOP_VINFO_LENS (loop_vinfo)[0]
 
rgc->factor is controlled by the target and just says what units
IFN_LOAD_LEN works in.  E.g. if we're loading 16-bit elements,
but the underlying instruction measures bytes, the factor would be 2.
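
The two conditions above can be sketched as a small classifier. This is a simplified C++ model and not vectorizer code: the rgroup struct, the rgroup_kind enum, and both function names are hypothetical, and the vector `lens` plays the role of the rgroup array whose first entry is compared against. The "other" result covers whatever does not match case 2 or case 3.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, reduced model of an rgroup control record.
struct rgroup
{
  unsigned max_nscalars_per_iter; // scalars handled per scalar iteration
  unsigned factor;                // target-chosen unit multiplier for lengths
};

enum class rgroup_kind { slp, non_slp_secondary, other };

// Case 2: max_nscalars_per_iter != 1.
// Case 3: max_nscalars_per_iter == 1 and rgc is not the first rgroup.
rgroup_kind classify(const std::vector<rgroup> &lens, const rgroup *rgc)
{
  assert(!lens.empty());
  if (rgc->max_nscalars_per_iter != 1)
    return rgroup_kind::slp;
  if (rgc != &lens[0])
    return rgroup_kind::non_slp_secondary;
  return rgroup_kind::other;
}

// The factor only converts an item count into the units the target's
// length operand uses; it does not take part in the classification.
unsigned length_in_target_units(unsigned nitems, const rgroup &rgc)
{
  return nitems * rgc.factor;
}
```

With a factor of 2, for instance, 8 items correspond to a length of 16 in the target's units.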
 
Thanks,
Richard

Re: Re: [PATCH] RISC-V: Add v_uimm_operand

2023-05-11 Thread 钟居哲

LGTM



juzhe.zh...@rivai.ai
 
From: Palmer Dabbelt
Date: 2023-05-12 06:31
To: juzhe.zhong
CC: gcc-patches; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Add v_uimm_operand
On Thu, 11 May 2023 15:00:48 PDT (-0700), juzhe.zh...@rivai.ai wrote:
>>> ;; V has 32-bit unsigned immediates.  This happens to be the same constraint as
>>> ;; the csr_operand, but it's not CSR related.
> It should be 5-bit unsigned immediates.
>>> (define_predicate "v_uimm_operand"
>>>   (match_operand 0 "csr_operand"))
> To make the name consistent, it should start with "vector_", so I suggest
> "vector_scalar_shift_operand".
 
Makes sense, I sent a v2.

[PATCH] RISC-V: Add v_uimm_operand

2023-05-11 Thread 钟居哲

>> ;; V has 32-bit unsigned immediates.  This happens to be the same constraint as
>> ;; the csr_operand, but it's not CSR related.
It should be 5-bit unsigned immediates.
>> (define_predicate "v_uimm_operand"
>>   (match_operand 0 "csr_operand"))
To make the name consistent, it should start with "vector_", so I suggest
"vector_scalar_shift_operand".

Thanks.


juzhe.zh...@rivai.ai

Re: Re: [PATCH] riscv: Split off shift patterns for autovectorization.

2023-05-10 Thread 钟居哲

>> I don't think VEL is _wrong_ here, as it's an integer type that's big
>> enough to hold the shift amount, but we might get some odd generated
>> code for the QI and HI flavors as we frequently don't handle the shorter
>> types well.

This implementation has been proven to work well in both my downstream GCC and 
"rvv-next".
 
>> "csr_operand" does seem wrong, though, as that just accepts constants.
>> Maybe "arith_operand" is the way to go?  I haven't looked at the
>> V immediates though.

"arith_operand" is not correct: it accepts SMALL_OPERAND, i.e. a 12-bit immediate.
For shifts, the V immediate should be in the range 0 ~ 31, which matches csr_operand perfectly.
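
The difference between the two immediate ranges can be sketched as follows. This is a plain C++ illustration with hypothetical helper names (fits_uimm5, fits_simm12). It models only the constant ranges discussed here, assuming a 5-bit unsigned field for the vector shift immediate and a signed 12-bit field for SMALL_OPERAND, and does not reproduce the GCC predicates themselves.

```cpp
#include <cassert>
#include <cstdint>

// 5-bit unsigned immediate: 0 .. 31 (the range wanted for vector shifts).
bool fits_uimm5(int64_t value)
{
  return value >= 0 && value <= 31;
}

// Signed 12-bit immediate: -2048 .. 2047 (assumed SMALL_OPERAND range).
bool fits_simm12(int64_t value)
{
  return value >= -2048 && value <= 2047;
}
```

A shift count such as 100 passes the 12-bit check but cannot be encoded as a vector shift immediate, which is why the wider predicate would accept constants the instruction cannot take.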



juzhe.zh...@rivai.ai
 
From: Palmer Dabbelt
Date: 2023-05-11 03:19
To: rdapp.gcc
CC: gcc-patches; juzhe.zhong; Kito Cheng; collison; jeffreyalaw; rdapp.gcc
Subject: Re: [PATCH] riscv: Split off shift patterns for autovectorization.
On Wed, 10 May 2023 08:24:50 PDT (-0700), rdapp@gmail.com wrote:
> Hi,
>
> this patch splits off the shift patterns of the binop patterns.
> This is necessary as the scalar shifts require a Pmode operand
> as shift count.  To this end, a new iterator any_int_binop_no_shift
> is introduced.  At a later point when the binops are split up
> further in commutative and non-commutative patterns (which both
> do not include the shift patterns) we might not need this anymore.
>
> Bootstrapped and regtested.
>
> Regards
>  Robin
>
> --
>
> gcc/ChangeLog:
>
> * config/riscv/autovec.md (3): Add scalar shift
> pattern.
> (v3): Add vector shift pattern.
> * config/riscv/vector-iterators.md: New iterator.
> ---
>  gcc/config/riscv/autovec.md  | 40 +++-
>  gcc/config/riscv/vector-iterators.md |  4 +++
>  2 files changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
> index 8347e42bb9c..2da4fc67d51 100644
> --- a/gcc/config/riscv/autovec.md
> +++ b/gcc/config/riscv/autovec.md
> @@ -65,7 +65,7 @@ (define_expand "movmisalign"
>
>  (define_expand "3"
>[(set (match_operand:VI 0 "register_operand")
> -(any_int_binop:VI
> +(any_int_binop_no_shift:VI
>   (match_operand:VI 1 "")
>   (match_operand:VI 2 "")))]
>"TARGET_VECTOR"
> @@ -91,3 +91,41 @@ (define_expand "3"
>NULL_RTX, mode);
>DONE;
>  })
> +
> +;; =
> +;; == Binary integer shifts by scalar.
> +;; =
> +
> +(define_expand "3"
> +  [(set (match_operand:VI 0 "register_operand")
> +(any_shift:VI
> + (match_operand:VI 1 "register_operand")
> + (match_operand: 2 "csr_operand")))]
 
I don't think VEL is _wrong_ here, as it's an integer type that's big
enough to hold the shift amount, but we might get some odd generated
code for the QI and HI flavors as we frequently don't handle the shorter
types well.
 
"csr_operand" does seem wrong, though, as that just accepts constants.
Maybe "arith_operand" is the way to go?  I haven't looked at the
V immediates though.
 
> +  "TARGET_VECTOR"
> +{
> +  if (!CONST_SCALAR_INT_P (operands[2]))
> +  operands[2] = gen_lowpart (Pmode, operands[2]);
> +  riscv_vector::emit_len_binop (code_for_pred_scalar
> + (, mode),
> + operands[0], operands[1], operands[2],
> + NULL_RTX, mode, Pmode);
> +  DONE;
> +})
> +
> +;; =
> +;; == Binary integer shifts by vector.
> +;; =
> +
> +(define_expand "v3"
> +  [(set (match_operand:VI 0 "register_operand")
> +(any_shift:VI
> + (match_operand:VI 1 "register_operand")
> + (match_operand:VI 2 "vector_shift_operand")))]
> +  "TARGET_VECTOR"
> +{
> +  riscv_vector::emit_len_binop (code_for_pred
> + (, mode),
> + operands[0], operands[1], operands[2],
> + NULL_RTX, mode);
> +  DONE;
> +})
> diff --git a/gcc/config/riscv/vector-iterators.md 
> b/gcc/config/riscv/vector-iterators.md
> index 42848627c8c..fdb0bfbe3b1 100644
> --- a/gcc/config/riscv/vector-iterators.md
> +++ b/gcc/config/riscv/vector-iterators.md
> @@ -1429,6 +1429,10 @@ (define_code_iterator any_commutative_binop [plus and 
> ior xor
>
>  (define_code_iterator any_non_commutative_binop [minus div udiv mod umod])
>
> +(define_code_iterator any_int_binop_no_shift
> + [plus minus and ior xor smax umax smin umin mult div udiv mod umod
> +])
> +
>  (define_code_iterator any_immediate_binop [plus minus and ior xor])
>
>  (define_code_iterator any_sat_int_binop [ss_plus ss_minus us_plus us_minus])
> --
> 2.40.0
 
It'd be great to have test cases for the patterns we're adding, at least
for some of the stickier ones.

Re: Re: [PATCH] riscv: Add vectorized binops and insn_expander helpers.

2023-05-10 Thread 钟居哲

>> This I added in order to match the scalar variants like
 
 >>  [(set (match_operand:VI_QHS 0 "register_operand"  "=vd,vd, vr, vr")
>> (if_then_else:VI_QHS
>>   (unspec:
>> [(match_operand: 1 "vector_mask_operand" "vm,vm,Wc1,Wc1")
 >> (match_operand 5 "vector_length_operand""rK,rK, rK, rK")
>>  (match_operand 6 "const_int_operand"" i, i,  i,  i")
>>  (match_operand 7 "const_int_operand"" i, i,  i,  i")
 >> (match_operand 8 "const_int_operand"" i, i,  i,  i")
 >> (reg:SI VL_REGNUM)
 >> (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
 >>  (any_commutative_binop:VI_QHS
 >>(vec_duplicate:VI_QHS
 >>  (match_operand: 4 "reg_or_0_operand"  "rJ,rJ, rJ, rJ"))
 
>> Any other way to get there?

No, you don't need to worry about that.
The intrinsic patterns are well designed; just use "GET_MODE_INNER", which
handles that well.

 >> Hmm I see, the VOIDmode being abused as default might be confusing here.
 >> Would an additional parameter like "bool set_op2_mode" make it clearer?
 >> Another option is to separate this into another function altogether like
 >> emit_len_binop_scalar or so.

No, just use the op2mode that you pass through.




juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-05-11 02:02
To: 钟居哲; gcc-patches; kito.cheng; Michael Collison; palmer; Jeff Law
Subject: Re: [PATCH] riscv: Add vectorized binops and insn_expander helpers.
> +  machine_mode op2mode = Pmode;
> +  if (inner == E_QImode || inner == E_HImode || inner == E_SImode)
> + op2mode = inner;
 
This I added in order to match the scalar variants like
 
  [(set (match_operand:VI_QHS 0 "register_operand"  "=vd,vd, vr, vr")
(if_then_else:VI_QHS
  (unspec:
[(match_operand: 1 "vector_mask_operand" "vm,vm,Wc1,Wc1")
 (match_operand 5 "vector_length_operand""rK,rK, rK, rK")
 (match_operand 6 "const_int_operand"" i, i,  i,  i")
 (match_operand 7 "const_int_operand"" i, i,  i,  i")
 (match_operand 8 "const_int_operand"" i, i,  i,  i")
 (reg:SI VL_REGNUM)
 (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
  (any_commutative_binop:VI_QHS
(vec_duplicate:VI_QHS
  (match_operand: 4 "reg_or_0_operand"  "rJ,rJ, rJ, rJ"))
 
Any other way to get there?
 
> + e.add_input_operand (src2, op2mode == VOIDmode ? GET_MODE (src2) : op2mode);
> Very confusing here.
 
Hmm I see, the VOIDmode being abused as default might be confusing here.
Would an additional parameter like "bool set_op2_mode" make it clearer?
Another option is to separate this into another function altogether like
emit_len_binop_scalar or so.
 
> +  
> change it into 
 
Done and removed the rest.
 
Thanks.

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread 钟居哲

I am sorry, but I am still confused about this.

Is this what you want?

  bool use_minus_p
    = TREE_CODE (step) == INTEGER_CST
      && ((TYPE_UNSIGNED (TREE_TYPE (step))
           && tree_int_cst_lt (step1, step))
          || (!TYPE_UNSIGNED (TREE_TYPE (step))
              && !tree_expr_nonnegative_warnv_p (step, &ovf)
              && may_negate_without_overflow_p (step)));

  /* For easier readability of the created code, produce MINUS_EXPRs
 when suitable.  */
  if (TREE_CODE (step) == INTEGER_CST)
{
  if (TYPE_UNSIGNED (TREE_TYPE (step)))
{
  step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
  if (tree_int_cst_lt (step1, step))
{
  incr_op = MINUS_EXPR; /* Remove it.  */
  step = step1;
}
}
  else
{
  bool ovf;

  if (!tree_expr_nonnegative_warnv_p (step, &ovf)
  && may_negate_without_overflow_p (step))
{
  incr_op = MINUS_EXPR; /* Remove it.  */
  step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
}
}
}
  if (POINTER_TYPE_P (TREE_TYPE (base)))
{
  if (TREE_CODE (base) == ADDR_EXPR)
mark_addressable (TREE_OPERAND (base, 0));
  step = convert_to_ptrofftype (step);
  if (incr_op == MINUS_EXPR) /* Change it into if (use_minus_p)  */
step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
  incr_op = POINTER_PLUS_EXPR; /* Remove it.  */
}
  /* Gimplify the step if necessary.  We put the computations in front of the
 loop (i.e. the step should be loop invariant).  */
  step = force_gimple_operand (step, &stmts, true, NULL_TREE);
  if (stmts)
gsi_insert_seq_on_edge_immediate (pe, stmts);

  if (POINTER_TYPE_P (TREE_TYPE (base)))
stmt = gimple_build_assign (va, POINTER_PLUS_EXPR, vb, step);
  else if (use_minus_p)
stmt = gimple_build_assign (va, MINUS_EXPR, vb, step);
  else
stmt = gimple_build_assign (va, incr_op, vb, step);
...

I ask because I have no idea how to make the stmts flip between PLUS_EXPR and
MINUS_EXPR.

Thanks.

juzhe.zh...@rivai.ai

From: Richard Sandiford
Date: 2023-05-11 05:28
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Thanks Richard.
> I am planning to split this into a separate patch containing only the
> create_iv changes.
>
> Are you suggesting that I remove "tree_code incr_op = code;"
> and use the argument directly?
>
> I saw this code:
>
>   /* For easier readability of the created code, produce MINUS_EXPRs
>  when suitable.  */
>   if (TREE_CODE (step) == INTEGER_CST)
> {
>   if (TYPE_UNSIGNED (TREE_TYPE (step)))
> {
>   step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   if (tree_int_cst_lt (step1, step))
> {
>   incr_op = MINUS_EXPR;
>   step = step1;
> }
> }
>   else
> {
>   bool ovf;
>
>   if (!tree_expr_nonnegative_warnv_p (step, &ovf)
>   && may_negate_without_overflow_p (step))
> {
>   incr_op = MINUS_EXPR;
>   step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
> }
> }
> }
>   if (POINTER_TYPE_P (TREE_TYPE (base)))
> {
>   if (TREE_CODE (base) == ADDR_EXPR)
> mark_addressable (TREE_OPERAND (base, 0));
>   step = convert_to_ptrofftype (step);
>   if (incr_op == MINUS_EXPR)
> step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   incr_op = POINTER_PLUS_EXPR;
> }
>   /* Gimplify the step if necessary.  We put the computations in front of the
>  loop (i.e. the step should be loop invariant).  */
>   step = force_gimple_operand (step, &stmts, true, NULL_TREE);
>   if (stmts)
> gsi_insert_seq_on_edge_immediate (pe, stmts);
>
>   stmt = gimple_build_assign (va, incr_op, vb, step);
> ...
>
> The conditions that change the value of the variable "incr_op" here seem
> complicated.
> That's why I define a temporary variable "tree_code incr_op = code;" and
> let the code that follows change the value of "incr_op".
>
> Could you give me some hints on how to rework this piece of code to get rid of
> "tree_code incr_op = code;"?

Yeah, but like I said in the review, those later:
  incr_op = MINUS_EXPR;
stmts need to be updated to something that flips between PLUS_EXPR
and MINUS_EXPR (with updates to the comments).  Just leaving them
as-is is incorrect (in cases where the caller passed MINUS_EXPR
rather than PLUS_EXPR).

The POINTER_PLUS_EXPR handling is fine due to the conditional
negate beforehand.
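
As a rough illustration of the "flip" described above, here is a minimal
standalone sketch in C. It is not GCC code: the enum, flip_code,
canonicalize_step and iv_value are hypothetical stand-ins that model trees
with plain integers. The only point is that negating a negative constant
step has to flip whichever code the caller passed, instead of always
assigning the MINUS variant.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-ins for the two codes a caller may pass.  */
enum iv_code { IV_PLUS, IV_MINUS };

/* Flip between the two codes instead of hard-coding IV_MINUS.  */
static enum iv_code
flip_code (enum iv_code code)
{
  assert (code == IV_PLUS || code == IV_MINUS);
  return code == IV_PLUS ? IV_MINUS : IV_PLUS;
}

/* If *STEP is negative and can be negated without overflow, negate it
   and flip *CODE, so that "BASE <code> STEP * i" keeps its value.  */
static void
canonicalize_step (enum iv_code *code, long *step)
{
  if (*step < 0 && *step != LONG_MIN)
    {
      *step = -*step;
      *code = flip_code (*code);
    }
}

/* Value of the modelled induction variable after ITER iterations.  */
static long
iv_value (enum iv_code code, long base, long step, long iter)
{
  return code == IV_PLUS ? base + step * iter : base - step * iter;
}
```

With a caller that passes the MINUS code and a step of -4, the sketch ends up
with the PLUS code and a step of 4, and the induction variable takes the same
values as before. Unconditionally assigning the MINUS code in that situation
would change the sequence.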

Thanks,
Richard

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread 钟居哲

Thanks Richard.
I am planning to split this into a separate patch containing only the
create_iv changes.

Are you suggesting that I remove "tree_code incr_op = code;"
and use the argument directly?

I saw this code:

  /* For easier readability of the created code, produce MINUS_EXPRs
 when suitable.  */
  if (TREE_CODE (step) == INTEGER_CST)
{
  if (TYPE_UNSIGNED (TREE_TYPE (step)))
{
  step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
  if (tree_int_cst_lt (step1, step))
{
  incr_op = MINUS_EXPR;
  step = step1;
}
}
  else
{
  bool ovf;

  if (!tree_expr_nonnegative_warnv_p (step, &ovf)
  && may_negate_without_overflow_p (step))
{
  incr_op = MINUS_EXPR;
  step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
}
}
}
  if (POINTER_TYPE_P (TREE_TYPE (base)))
{
  if (TREE_CODE (base) == ADDR_EXPR)
mark_addressable (TREE_OPERAND (base, 0));
  step = convert_to_ptrofftype (step);
  if (incr_op == MINUS_EXPR)
step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
  incr_op = POINTER_PLUS_EXPR;
}
  /* Gimplify the step if necessary.  We put the computations in front of the
 loop (i.e. the step should be loop invariant).  */
  step = force_gimple_operand (step, &stmts, true, NULL_TREE);
  if (stmts)
gsi_insert_seq_on_edge_immediate (pe, stmts);

  stmt = gimple_build_assign (va, incr_op, vb, step);
...

The conditions that change the value of the variable "incr_op" here seem
complicated.
That's why I define a temporary variable "tree_code incr_op = code;" and
let the code that follows change the value of "incr_op".

Could you give me some hints on how to rework this piece of code to get rid of
"tree_code incr_op = code;"?

Thanks.

juzhe.zh...@rivai.ai

From: Richard Sandiford
Date: 2023-05-11 00:45
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
In addition to Jeff's comments:

juzhe.zh...@rivai.ai writes:
> [...]
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>  
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated 
> value.
> +operand 1 is the total elements need to be updated value.
> +operand 2 is the vectorization factor.
> +The value of operand 0 is target dependent and flexible in each iteration.
> +The operation of this pattern can be:
> +
> +@smallexample
> +Case 1:
> +operand0 = MIN (operand1, operand2);
> +operand2 can be const_poly_int or poly_int related to vector mode size.
> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE 
> SIZE) so
> +that we can reduce a use of general purpose register.
> +
> +In this case, only the last iteration of the loop is partial iteration.
> +@end smallexample
> +
> +@smallexample
> +Case 2:
> +if (operand1 <= operand2)
> +  operand0 = operand1;
> +else if (operand1 < 2 * operand2)
> +  operand0 = IN_RANGE (ceil (operand1 / 2), operand2);

GCC's IN_RANGE is a predicate, so it would be best to avoid that here.
Why isn't it simply ceil (operand1 / 2), which must be <= operand2?
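
To make the question concrete, here is a minimal sketch in plain C of "Case 2"
with IN_RANGE replaced by a plain ceil. It is only a model under assumptions:
select_vl_case2 and count_iterations are hypothetical names, "remaining"
plays the role of operand1 and "vf" the role of operand2, and both are
ordinary integers rather than poly_ints.

```c
#include <assert.h>

/* Hypothetical model of "Case 2" with ceil (remaining / 2) in place of
   IN_RANGE.  The return value plays the role of operand0.  */
static unsigned long
select_vl_case2 (unsigned long remaining, unsigned long vf)
{
  assert (vf > 0);
  if (remaining <= vf)
    return remaining;
  if (remaining < 2 * vf)
    return (remaining + 1) / 2;	/* ceil (remaining / 2), always <= vf.  */
  return vf;
}

/* Strip-mine REMAINING elements and return the number of iterations,
   checking that no iteration processes more than VF elements.  */
static unsigned long
count_iterations (unsigned long remaining, unsigned long vf)
{
  unsigned long iters = 0;
  while (remaining != 0)
    {
      unsigned long vl = select_vl_case2 (remaining, vf);
      assert (vl > 0 && vl <= vf);
      remaining -= vl;
      iters++;
    }
  return iters;
}
```

In this model, 20 elements with vf = 8 are processed as 8, 6 and 6 elements,
which is the even distribution over the last two iterations that the text
describes. Since remaining < 2 * vf implies ceil (remaining / 2) <= vf, no
extra clamping is needed.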

> +else
> +  operand0 = operand2;
> +
> +This case will evenly distribute work over the last 2 iterations of a 
> stripmine loop.
> +@end smallexample
> +
> +The output of this pattern is not only used as IV of loop control counter, 
> but also
> +is used as the IV of address calculation with multiply/shift operation. This 
> allow
> +us dynamic adjust the number of elements is processed in each iteration of 
> the loop.
> +
>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>  @item @samp{check_raw_ptrs@var{m}}
>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> [...]
> diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
> index 909b705d00d..5abca64379e 100644
> --- a/gcc/tree-ssa-loop-manip.cc
> +++ b/gcc/tree-ssa-loop-manip.cc
> @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
> so that we can free them all at once.  */
>  static bitmap_obstack loop_renamer_obstack;
>  
> -/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
> +/* Creates an induction variable with value BASE (+/-) STEP * iteration in 
> LOOP.
> +   If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
> +   If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration.
> It is expected that neither BASE nor STEP are shared with other 
> expressions
> (unless the sharing rules allow this).  Use VAR as a base var_decl for it
> (if NULL, a new temporary will be created).  The increment will occur at
> @@ -57,8 +59,8 @@ static bitmap_obstack

Re: [PATCH] riscv: Add vectorized binops and insn_expander helpers.

2023-05-10 Thread 钟居哲

Thanks Robin.

A couple comments here:
+  machine_mode op2mode = Pmode;
+  if (inner == E_QImode || inner == E_HImode || inner == E_SImode)
+ op2mode = inner;

Remove it.

+  
change it into 

+ e.add_input_operand (src2, op2mode == VOIDmode ? GET_MODE (src2) : op2mode);
Very confusing here.

+(define_code_attr BINOP_TO_UPPERCASE [
+(plus "PLUS")
+(minus "MINUS")
+(and "AND")
+(ior "IOR")
+(xor "XOR")
+(ashift "ASHIFT")
+(ashiftrt "ASHIFTRT")
+(lshiftrt "LSHIFTRT")
+(smax "SMAX")
+(umax "UMAX")
+(smin "SMIN")
+(umin "UMIN")
+(mult "MULT")
+(div "DIV")
+(udiv "UDIV")
+(mod "MOD")
+(umod "UMOD")
+])

Remove it.

Thanks.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-05-10 23:24
To: gcc-patches; juzhe.zh...@rivai.ai; Kito Cheng; Michael Collison; palmer; 
jeffreyalaw
CC: rdapp.gcc
Subject: [PATCH] riscv: Add vectorized binops and insn_expander helpers.
Hi,
 
this patch adds basic binary integer operations support.  It is based
on Michael Collison's work and makes use of the existing helpers in
riscv-c.cc.  It introduces emit_nonvlmax_binop which, in turn, uses
emit_pred_binop.  Setting the destination as well as the mask and the
length is factored out into separate functions.
 
There are several things still missing, most notably the scalar variants
(.vx) as well as multiplication variants and more.
 
Bootstrapped and regtested.
 
Regards
Robin
 
--
 
gcc/ChangeLog:
 
* config/riscv/autovec.md (3): Add integer binops.
* config/riscv/riscv-protos.h (emit_nonvlmax_binop): Declare.
* config/riscv/riscv-v.cc (emit_pred_op): New function.
(set_expander_dest_and_mask): New function.
(emit_pred_binop): New function.
(emit_nonvlmax_binop): New function.
* config/riscv/vector-iterators.md: Add new code attribute.
---
gcc/config/riscv/autovec.md  | 33 ++
gcc/config/riscv/riscv-protos.h  |  2 +
gcc/config/riscv/riscv-v.cc  | 98 ++--
gcc/config/riscv/vector-iterators.md | 22 +++
4 files changed, 136 insertions(+), 19 deletions(-)
 
diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index f1c5ff5951b..15f8d007e07 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -58,3 +58,36 @@ (define_expand "movmisalign"
 DONE;
   }
)
+
+;; =
+;; == Binary integer operations
+;; =
+
+(define_expand "3"
+  [(set (match_operand:VI 0 "register_operand")
+(any_int_binop:VI
+ (match_operand:VI 1 "")
+ (match_operand:VI 2 "")))]
+  "TARGET_VECTOR"
+{
+  if (!register_operand (operands[2], mode))
+{
+  rtx cst;
+  gcc_assert (const_vec_duplicate_p(operands[2], &cst));
+  machine_mode inner = mode;
+  machine_mode op2mode = Pmode;
+  if (inner == E_QImode || inner == E_HImode || inner == E_SImode)
+ op2mode = inner;
+
+  riscv_vector::emit_nonvlmax_binop (code_for_pred_scalar
+ (, mode),
+ operands[0], operands[1], cst,
+ NULL_RTX, mode, op2mode);
+}
+  else
+riscv_vector::emit_nonvlmax_binop (code_for_pred
+(, mode),
+operands[0], operands[1], operands[2],
+NULL_RTX, mode);
+  DONE;
+})
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index c0293a306f9..75cdb90b9c9 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -169,6 +169,8 @@ void emit_hard_vlmax_vsetvl (machine_mode, rtx);
void emit_vlmax_op (unsigned, rtx, rtx, machine_mode);
void emit_vlmax_op (unsigned, rtx, rtx, rtx, machine_mode);
void emit_nonvlmax_op (unsigned, rtx, rtx, rtx, machine_mode);
+void emit_nonvlmax_binop (unsigned, rtx, rtx, rtx, rtx, machine_mode,
+   machine_mode = VOIDmode);
enum vlmul_type get_vlmul (machine_mode);
unsigned int get_ratio (machine_mode);
unsigned int get_nf (machine_mode);
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 7ca49ca67c1..3c43dfc5eea 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -53,7 +53,7 @@ namespace riscv_vector {
template  class insn_expander
{
public:
-  insn_expander () : m_opno (0) {}
+  insn_expander () : m_opno (0), has_dest(false) {}
   void add_output_operand (rtx x, machine_mode mode)
   {
 create_output_operand (_ops[m_opno++], x, mode);
@@ -84,6 +84,44 @@ public:
 add_input_operand (gen_int_mode (type, Pmode), Pmode);
   }
+  void set_dest_and_mask (rtx mask, rtx dest, machine_mode mask_mode)
+  {
+dest_mode = GET_MODE (dest);
+has_dest = true;
+
+add_output_operand (dest, dest_mode);
+
+if (mask)
+  add_input_operand (mask, GET_MODE (mask));
+else
+  add_all_one_mask_operand (mask_mode);
+
+add_vundef_operand (dest_mode);
+  }
+
+  void set_len_and_policy (rtx len, bool vlmax_p)
+{
+  gcc_assert (has_dest);
+  gcc_assert (len || vlmax_p);
+
+  if

Re: Re: [PATCH V2] RISC-V: Fix incorrect implementation of TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT

2023-05-09 Thread 钟居哲

No, I don't think so. The reason I added -fno-vect-cost-model to some testcases
is that we don't have enough patterns to enable some auto-vectorizations.
I add -fno-vect-cost-model to force auto-vectorization for such cases for
testing.



juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-09 22:36
To: juzhe.zhong
CC: gcc-patches@gcc.gnu.org; pal...@dabbelt.com; jeffreya...@gmail.com; 
rdapp@gmail.com
Subject: Re: [PATCH V2] RISC-V: Fix incorrect implementation of 
TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT
One more question from me: should we just add  -fno-vect-cost-model to
AUTOVEC_TEST_OPTS?
 
On Tue, May 9, 2023 at 10:29 PM Kito Cheng  wrote:
>
> Oh, I checked default_builtin_support_vector_misalignment and realized
> we can just remove riscv_support_vector_misalignment altogether...
>
>
> On Tue, May 9, 2023 at 10:18 PM juzhe.zhong  wrote:
> >
> > The riscv_support_vector_misalignment update makes some of the testcase
> > checks fail. I have checked those failures, and they are reasonable. So I
> > include the testcase adaptations in this patch.
> > ---- Replied Message ----
> > From: Kito Cheng
> > Date: 05/09/2023 21:54
> > To: juzhe.zh...@rivai.ai
> > CC: gcc-patc...@gcc.gnu.org, pal...@dabbelt.com, jeffreya...@gmail.com,
> > rdapp@gmail.com
> > Subject: Re: [PATCH V2] RISC-V: Fix incorrect implementation of 
> > TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT
> > I am ok with both changes but I tried to build some test cases, and it
> > seems the changes are caused by the options update, not by the
> > riscv_support_vector_misalignment update? So I would like to see the
> > testcase changes split out into a separate patch.
> >
> > > +/* Return true if the vector misalignment factor is supported by the
> > > +   target.  */
> > >  bool
> > >  riscv_support_vector_misalignment (machine_mode mode,
> > >const_tree type ATTRIBUTE_UNUSED,
> > >int misalignment,
> > >bool is_packed ATTRIBUTE_UNUSED)
> > >  {
> > > -  if (TARGET_VECTOR)
> > > -{
> > > -  if (STRICT_ALIGNMENT)
> > > -   {
> > > - /* Return if movmisalign pattern is not supported for this 
> > > mode.  */
> > > - if (optab_handler (movmisalign_optab, mode) == CODE_FOR_nothing)
> > > -   return false;
> > > -
> > > - /* Misalignment factor is unknown at compile time.  */
> > > - if (misalignment == -1)
> > > -   return false;
> > > -   }
> > > -  return true;
> > > -}
> > > +  /* TODO: For RVV scalable vector auto-vectorization, we should allow
> > > + movmisalign pattern to handle misalign data movement to 
> > > unblock
> > > + possible auto-vectorization.
> > >
> > > + RVV VLS auto-vectorization or SIMD auto-vectorization can be 
> > > supported here
> > > + in the future.  */
> > >return default_builtin_support_vector_misalignment (mode, type, 
> > > misalignment,
> > >   is_packed);
> > >  }
> >
> > Should we have some corresponding change on autovec.md like this?
> >
> > diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
> > index f1c5ff5951bf..c2873201d82e 100644
> > --- a/gcc/config/riscv/autovec.md
> > +++ b/gcc/config/riscv/autovec.md
> > @@ -51,7 +51,7 @@
> > (define_expand "movmisalign"
> >  [(set (match_operand:V 0 "nonimmediate_operand")
> >   (match_operand:V 1 "general_operand"))]
> > -  "TARGET_VECTOR"
> > +  "TARGET_VECTOR && !STRICT_ALIGNMENT"
> >  {
> >/* Equivalent to a normal move for our purpooses.  */
> >emit_move_insn (operands[0], operands[1]);

Re: Re: [PATCH V2] RISC-V: Fix incorrect implementation of TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT

2023-05-09 Thread 钟居哲

Yes, we can remove it, but I am keeping it here and adding a TODO comment,
since we may want to support it for VLS modes. AArch64, for example, has
Advanced SIMD modes (128-bit VLS modes) alongside SVE:
/* Return true if the vector misalignment factor is supported by the
   target.  */
static bool
aarch64_builtin_support_vector_misalignment (machine_mode mode,
   const_tree type, int misalignment,
   bool is_packed)
{
  if (TARGET_SIMD && STRICT_ALIGNMENT)
{
  /* Return if movmisalign pattern is not supported for this mode.  */
  if (optab_handler (movmisalign_optab, mode) == CODE_FOR_nothing)
return false;

  /* Misalignment factor is unknown at compile time.  */
  if (misalignment == -1)
  return false;
}
  return default_builtin_support_vector_misalignment (mode, type, misalignment,
  is_packed);
}

This is the AArch64 implementation; TARGET_SIMD is for Advanced SIMD.

juzhe.zh...@rivai.ai

From: Kito Cheng
Date: 2023-05-09 22:29
To: juzhe.zhong
CC: gcc-patches@gcc.gnu.org; pal...@dabbelt.com; jeffreya...@gmail.com; 
rdapp@gmail.com
Subject: Re: [PATCH V2] RISC-V: Fix incorrect implementation of 
TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT
Oh, I checked default_builtin_support_vector_misalignment and realized
we can just remove riscv_support_vector_misalignment altogether...

On Tue, May 9, 2023 at 10:18 PM juzhe.zhong  wrote:
>
> The riscv_support_vector_misalignment update makes some of the testcase
> checks fail. I have checked those failures, and they are reasonable. So I
> include the testcase adaptations in this patch.
> ---- Replied Message ----
> From: Kito Cheng
> Date: 05/09/2023 21:54
> To: juzhe.zh...@rivai.ai
> CC: gcc-patc...@gcc.gnu.org, pal...@dabbelt.com, jeffreya...@gmail.com,
> rdapp@gmail.com
> Subject: Re: [PATCH V2] RISC-V: Fix incorrect implementation of 
> TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT
> I am ok with both changes but I tried to build some test cases, and it
> seems the changes are caused by the options update, not by the
> riscv_support_vector_misalignment update? So I would like to see the
> testcase changes split out into a separate patch.
>
> > +/* Return true if the vector misalignment factor is supported by the
> > +   target.  */
> >  bool
> >  riscv_support_vector_misalignment (machine_mode mode,
> >const_tree type ATTRIBUTE_UNUSED,
> >int misalignment,
> >bool is_packed ATTRIBUTE_UNUSED)
> >  {
> > -  if (TARGET_VECTOR)
> > -{
> > -  if (STRICT_ALIGNMENT)
> > -   {
> > - /* Return if movmisalign pattern is not supported for this mode.  
> > */
> > - if (optab_handler (movmisalign_optab, mode) == CODE_FOR_nothing)
> > -   return false;
> > -
> > - /* Misalignment factor is unknown at compile time.  */
> > - if (misalignment == -1)
> > -   return false;
> > -   }
> > -  return true;
> > -}
> > +  /* TODO: For RVV scalable vector auto-vectorization, we should allow
> > + movmisalign pattern to handle misalign data movement to unblock
> > + possible auto-vectorization.
> >
> > + RVV VLS auto-vectorization or SIMD auto-vectorization can be 
> > supported here
> > + in the future.  */
> >return default_builtin_support_vector_misalignment (mode, type, 
> > misalignment,
> >   is_packed);
> >  }
>
> Should we have some corresponding change on autovec.md like this?
>
> diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
> index f1c5ff5951bf..c2873201d82e 100644
> --- a/gcc/config/riscv/autovec.md
> +++ b/gcc/config/riscv/autovec.md
> @@ -51,7 +51,7 @@
> (define_expand "movmisalign"
>  [(set (match_operand:V 0 "nonimmediate_operand")
>   (match_operand:V 1 "general_operand"))]
> -  "TARGET_VECTOR"
> +  "TARGET_VECTOR && !STRICT_ALIGNMENT"
>  {
>/* Equivalent to a normal move for our purpooses.  */
>emit_move_insn (operands[0], operands[1]);

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-09 Thread 钟居哲

;
+   _6 = _5 * 2;
+   _7 = f_15(D) + _6;
+   .LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
+   vectp_f.8_56 = vectp_f.8_51 + 16;
+   .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
+   _8 = (long unsigned int) i_23;
+   _9 = _8 * 4;
+   _10 = d_18(D) + _9;
+   _61 = _75 / 2;
+   .LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
+   vectp_d.10_63 = vectp_d.10_59 + 16;
+   _64 = _72 / 2;
+   .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
+   i_20 = i_23 + 1;
+   vectp_f.8_52 = vectp_f.8_56 + 16;
+   vectp_d.10_60 = vectp_d.10_63 + 16;
+   ivtmp_74 = ivtmp_73 - _75;
+   ivtmp_71 = ivtmp_70 - _72;
+   if (ivtmp_74 != 0)
+ goto ; [83.33%]
+   else
+ goto ; [16.67%]
+
+   Note: We DO NOT use .SELECT_VL in SLP auto-vectorization for multiple
+   rgroups. Instead, we use MIN_EXPR to guarantee we always use VF as the
+   iteration amount for multiple rgroups.
+
+   The analysis of the flow of multiple rgroups:
+   _72 = MIN_EXPR ;
+   _75 = MIN_EXPR ;
+   ...
+   .LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
+   vectp_f.8_56 = vectp_f.8_51 + 16;
+   .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
+   ...
+   _61 = _75 / 2;
+   .LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
+   vectp_d.10_63 = vectp_d.10_59 + 16;
+   _64 = _72 / 2;
+   .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);

Here, if we used SELECT_VL instead of MIN_EXPR, the outcome of SELECT_VL could
be any number in a non-final iteration, so it would not be easy to adjust the
address pointer IV (vectp_f.8_56 = vectp_f.8_51 + 16;) and the next length
(_61 = _75 / 2;).

For case 3:

+  3. Multiple rgroups for non-SLP auto-vectorization.
+
+ # ivtmp_26 = PHI 
+ # ivtmp.35_10 = PHI 
+ # ivtmp.36_2 = PHI 
+ _28 = MIN_EXPR ;
+ loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
+ loop_len_16 = _28 - loop_len_15;
+ _29 = (void *) ivtmp.35_10;
+ _7 =   [(int *)_29];
+ vect__1.25_17 = .LEN_LOAD (_7, 128B, loop_len_15, 0);
+ _33 = _29 + POLY_INT_CST [16, 16];
+ _34 =   [(int *)_33];
+ vect__1.26_19 = .LEN_LOAD (_34, 128B, loop_len_16, 0);
+ vect__2.27_20 = VEC_PACK_TRUNC_EXPR ;
+ _30 = (void *) ivtmp.36_2;
+ _31 =   [(short int *)_30];
+ .LEN_STORE (_31, 128B, _28, vect__2.27_20, 0);
+ ivtmp_27 = ivtmp_26 - _28;
+ ivtmp.35_11 = ivtmp.35_10 + POLY_INT_CST [32, 32];
+ ivtmp.36_8 = ivtmp.36_2 + POLY_INT_CST [16, 16];
+ if (ivtmp_27 != 0)
+   goto ; [83.33%]
+ else
+   goto ; [16.67%]
+
+ The total length: _28 = MIN_EXPR ;
+
+ The length of first half vector:
+   loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
+
+ The length of second half vector:
+   loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
+   loop_len_16 = _28 - loop_len_15;
+
+ 1). _28 always <= POLY_INT_CST [8, 8].
+ 2). When _28 <= POLY_INT_CST [4, 4], second half vector is not processed.
+ 3). When _28 > POLY_INT_CST [4, 4], second half vector is processed.

We know that in Case 3 we have to deal with 2 vectors:
vect__2.27_20 = VEC_PACK_TRUNC_EXPR ;

First, we use "_28 = MIN_EXPR ;" to generate the number of elements to be
processed for these 2 vectors.
Second, we use "loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;".
"loop_len_15" is the number of elements to be processed for the first vector.
Then, "loop_len_16 = _28 - loop_len_15;". "loop_len_16" is the number of
elements to be processed for the second vector.

I think "loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;" is very similar to
unpacklo in ARM SVE, and "loop_len_16 = _28 - loop_len_15;" is very similar to
unpackhi in ARM SVE.
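
The length split described above can be sketched in plain C as follows. This
is only a model under assumptions: struct rgroup_lens, min_ul and
split_lengths are hypothetical names, "nunits" stands in for
POLY_INT_CST [4, 4] as an ordinary constant, and the total is assumed to be
the minimum of the remaining element count and 2 * nunits, which matches the
statement that _28 is always <= POLY_INT_CST [8, 8].

```c
#include <assert.h>

/* Hypothetical model of the lengths for two input vectors that are
   later packed into one result vector.  */
struct rgroup_lens
{
  unsigned long total;	/* Models _28.  */
  unsigned long lo;	/* Models loop_len_15 (first vector).  */
  unsigned long hi;	/* Models loop_len_16 (second vector).  */
};

static unsigned long
min_ul (unsigned long a, unsigned long b)
{
  return a < b ? a : b;
}

/* Split the elements handled in one iteration between the two vectors.
   NUNITS is the element count of one input vector.  */
static struct rgroup_lens
split_lengths (unsigned long remaining, unsigned long nunits)
{
  struct rgroup_lens l;
  assert (nunits > 0);
  l.total = min_ul (remaining, 2 * nunits);	/* Assumed form of _28.  */
  l.lo = min_ul (l.total, nunits);
  l.hi = l.total - l.lo;
  return l;
}
```

With nunits = 4, a total of 3 gives lengths (3, 0), so the second vector is not
processed, while a total of 6 gives (4, 2). This corresponds to points 2) and
3) in the quoted comment.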
>> It's up to you.  If you don't think select_vl is worth it then it would
>>obviously make the vectoriser changes a bit simpler.
>>But making the vectoriser simpler isn't IMO the goal here.  SELECT_VL
>>seems like a perfectly reasonable construct to add to target-independent
>>code.  We just need to work out some of the details.

Ok, I also prefer keeping select_vl.

>>FWIW, I share Kewen's concern about duplicating too much logic between
>>masks, current lengths, and SELECT_VL lengths.  But I haven't looked at
>>the patch yet and so I don't know how easy it would be to avoid that.

I understand the concern. The current implementation is in the isolated
function "vect_set_loop_controls_by_select_vl",
which makes it easier to review.
Maybe we can first get the whole implementation in
"vect_set_loop_controls_by_select_vl" stable through review,
and then try to incorporate the code of
"vect_set_loop_controls_by_select_vl" into "ve

Re: Re: [PATCH] RISC-V: Optimize vsetvli of LCM INSERTED edge for user vsetvli [PR 109743]

2023-05-08 Thread 钟居哲

OK. I have addressed the comment; here is the V2 patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/617821.html 

Thanks.


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-08 17:53
To: juzhe.zh...@rivai.ai
CC: gcc-patches
Subject: Re: [PATCH] RISC-V: Optimize vsetvli of LCM INSERTED edge for user 
vsetvli [PR 109743]
I am wondering if it is possible to do this on
local_eliminate_vsetvl_insn? I feel this is sort of local elimination,
so putting them together would be better than handling that in many
different places.
 
On Mon, May 8, 2023 at 9:35 AM juzhe.zh...@rivai.ai
 wrote:
>
> Gentle ping this patch.
>
> Is this Ok for trunk? Thanks.
>
>
> juzhe.zh...@rivai.ai
>
> From: juzhe.zhong
> Date: 2023-05-06 19:14
> To: gcc-patches
> CC: kito.cheng; Juzhe-Zhong
> Subject: [PATCH] RISC-V: Optimize vsetvli of LCM INSERTED edge for user 
> vsetvli [PR 109743]
> From: Juzhe-Zhong 
>
> This patch fixes: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109743.
>
> This issue happens because we are currently very conservative in optimizing
> user vsetvli.
>
> Consider the following case:
>
> bb 1:
>   vsetvli a5,a4... (demand AVL = a4).
> bb 2:
>   RVV insn use a5 (demand AVL = a5).
>
> LCM will hoist the vsetvl of bb 2 into bb 1.
> We don't do AVL propagation for this situation, since it is complicated: we
> would have to analyze the code sequence between the vsetvli in bb 1 and the
> RVV insn in bb 2.
> They are not necessarily consecutive blocks.
>
> This patch does the optimization after LCM: we check and eliminate the
> vsetvli on an LCM inserted edge if such a vsetvli is redundant. This approach
> is much simpler and safer.
>
> code:
> void
> foo2 (int32_t *a, int32_t *b, int n)
> {
>   if (n <= 0)
>     return;
>   int i = n;
>   size_t vl = __riscv_vsetvl_e32m1 (i);
>
>   for (; i >= 0; i--)
>     {
>       vint32m1_t v = __riscv_vle32_v_i32m1 (a, vl);
>       __riscv_vse32_v_i32m1 (b, v, vl);
>
>       if (i >= vl)
>         continue;
>
>       if (i == 0)
>         return;
>
>       vl = __riscv_vsetvl_e32m1 (i);
>     }
> }
>
> Before this patch:
> foo2:
> .LFB2:
> .cfi_startproc
> ble     a2,zero,.L1
> mv      a4,a2
> li      a3,-1
> vsetvli a5,a2,e32,m1,ta,mu
> vsetvli zero,a5,e32,m1,ta,ma  <- can be eliminated.
> .L5:
> vle32.v v1,0(a0)
> vse32.v v1,0(a1)
> bgeu    a4,a5,.L3
> .L10:
> beq     a2,zero,.L1
> vsetvli a5,a4,e32,m1,ta,mu
> addi    a4,a4,-1
> vsetvli zero,a5,e32,m1,ta,ma  <- can be eliminated.
> vle32.v v1,0(a0)
> vse32.v v1,0(a1)
> addiw   a2,a2,-1
> bltu    a4,a5,.L10
> .L3:
> addiw   a2,a2,-1
> addi    a4,a4,-1
> bne     a2,a3,.L5
> .L1:
> ret
>
> After this patch:
> f:
> ble     a2,zero,.L1
> mv      a4,a2
> li      a3,-1
> vsetvli a5,a2,e32,m1,ta,ma
> .L5:
> vle32.v v1,0(a0)
> vse32.v v1,0(a1)
> bgeu    a4,a5,.L3
> .L10:
> beq     a2,zero,.L1
> vsetvli a5,a4,e32,m1,ta,ma
> addi    a4,a4,-1
> vle32.v v1,0(a0)
> vse32.v v1,0(a1)
> addiw   a2,a2,-1
> bltu    a4,a5,.L10
> .L3:
> addiw   a2,a2,-1
> addi    a4,a4,-1
> bne     a2,a3,.L5
> .L1:
> ret
>
> PR target/109743
>
> gcc/ChangeLog:
>
> * config/riscv/riscv-vsetvl.cc (pass_vsetvl::commit_vsetvls): Add 
> optimization for LCM inserted edge.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/vsetvl/pr109743-1.c: New test.
> * gcc.target/riscv/rvv/vsetvl/pr109743-2.c: New test.
> * gcc.target/riscv/rvv/vsetvl/pr109743-3.c: New test.
> * gcc.target/riscv/rvv/vsetvl/pr109743-4.c: New test.
>
> ---
> gcc/config/riscv/riscv-vsetvl.cc  | 42 +++
> .../gcc.target/riscv/rvv/vsetvl/pr109743-1.c  | 26 
> .../gcc.target/riscv/rvv/vsetvl/pr109743-2.c  | 27 
> .../gcc.target/riscv/rvv/vsetvl/pr109743-3.c  | 28 +
> .../gcc.target/riscv/rvv/vsetvl/pr109743-4.c  | 28 +
> 5 files changed, 151 insertions(+)
> create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr109743-1.c
> create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr109743-2.c
> create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr109743-3.c
> create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr109743-4.c
>
> diff --git a/gcc/config/riscv/riscv-vsetvl.cc 
> b/gcc/config/riscv/riscv-vsetvl.cc
> index f55907a410e..fcee7fdf323 100644
> --- a/gcc/config/riscv/riscv-vsetvl.cc
> +++ b/gcc/config/riscv/riscv-vsetvl.cc
> @@ -3834,6 +3834,48 @@ pass_vsetvl::commit_vsetvls (void)
>   const vector_insn_info *require
> = m_vector_manager->vector_exprs[i];
>   gcc_assert (require->valid_or_dirty_p ());
> +
> +   /* Here we optimize the VSETVL is hoisted by LCM:
> +
> + Before LCM:
> +bb 1:
> +  vsetvli a5,a2,e32,m1,ta,mu
> +bb 2:
> +

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-07 Thread 钟居哲

>> It's been pretty standard to stick with just PLUS_EXPR for this stuff
>> and instead negate the constant to produce the same effect as
>> MINUS_EXPR.  Is there a reason we're not continuing that practice?
>> Sorry if you've answered this already -- if you have, you can just point
>> me at the prior discussion and I'll read it.

Richard (Sandiford) has answered this question:
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616745.html 
And as Richard (Biener) said, it will make IVOPTs fail if we have variable
IVs.
However, we have already implemented variable IVs in downstream RVV GCC; it
works fine and I have not seen any bad codegen so far. So I think it may not
be a serious issue for RVV.

Besides, this implementation is not my own idea: it follows the guidance of
the RVV ISA spec and is also inspired by the LLVM implementation.
References:
1). RVV ISA: 
https://github.com/riscv/riscv-v-spec/blob/master/example/vvaddint32.s 
2). LLVM vector length implementation (note that the "get_vector_length"
pattern in LLVM does exactly the same thing as the "select_vl" pattern in
GCC here):
https://reviews.llvm.org/D99750 

vvaddint32:
vsetvli t0, a0, e32, ta, ma  # Set vector length based on 32-bit vectors
vle32.v v0, (a1)             # Get first vector
  sub a0, a0, t0             # Decrement number done
  slli t0, t0, 2             # Multiply number done by 4 bytes
  add a1, a1, t0             # Bump pointer
vle32.v v1, (a2)             # Get second vector
  add a2, a2, t0             # Bump pointer
vadd.vv v2, v0, v1           # Sum vectors
vse32.v v2, (a3)             # Store result
  add a3, a3, t0             # Bump pointer
  bnez a0, vvaddint32        # Loop back
  ret                        # Finished

Notice "sub a0, a0, t0": "t0" is the variable produced by "vsetvli t0, a0,
e32, ta, ma", which is generated by "get_vector_length" in LLVM and,
similarly, by "select_vl" in GCC.
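
To make the decrementing IV concrete, here is a minimal scalar sketch in C of
the vvaddint32 loop above. It is only a model under stated assumptions:
"select_vl_model" and "vvadd_model" are hypothetical names, and the model
simply returns min (n, vf), whereas a real target may return any legal length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for SELECT_VL / "vsetvli t0, a0, ...".
   This model always returns min (n, vf); that is an assumption made
   for illustration, not the behavior of any particular target.  */
static size_t
select_vl_model (size_t n, size_t vf)
{
  return n < vf ? n : vf;
}

/* dest[i] = a[i] + b[i], controlled by a decrementing IV.
   Returns the number of loop iterations executed.  */
static size_t
vvadd_model (int32_t *dest, const int32_t *a, const int32_t *b,
             size_t n, size_t vf)
{
  size_t iterations = 0;

  assert (vf > 0);
  while (n != 0)
    {
      size_t vl = select_vl_model (n, vf); /* vsetvli t0, a0, ...  */

      for (size_t i = 0; i < vl; i++)      /* vle32.v / vadd.vv / vse32.v  */
        dest[i] = a[i] + b[i];

      n -= vl;                             /* sub a0, a0, t0  */
      a += vl;                             /* bump pointers by vl elements  */
      b += vl;
      dest += vl;
      iterations++;
    }
  return iterations;
}
```

With n = 10 and vf = 4 the lengths are 4, 4 and 2, so the counter reaches
exactly zero after three iterations without ever going below it.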

Jeff's other comments will be addressed in the next patch (V5). I will wait
for comments from the Richards (both Sandiford and Biener, who are the experts
on the loop vectorizer) and then send a V5 patch that addresses all comments
from Jeff and the Richards.

Thanks.


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-05-07 23:19
To: juzhe.zhong; gcc-patches
CC: richard.sandiford; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
 
 
On 5/4/23 07:25, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> This patch is fixing V3 patch:
> https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/
> 
> Fix issues according to Richard Sandiford && Richard Biener.
> 
> 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
> 2. Support multiple-rgroup for non-SLP auto-vectorization.
> 
> For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the 
> total length:
> 
>   _36 = MIN_EXPR ;
> 
>   First length (MIN (X, VF/N)):
> loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
> 
>   Second length (X - MIN (X, 1 * VF/N)):
> loop_len_16 = _36 - loop_len_15;
> 
>   Third length (X - MIN (X, 2 * VF/N)):
> _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
> loop_len_17 = _36 - _38;
> 
>   Fourth length (X - MIN (X, 3 * VF/N)):
> _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
> loop_len_18 = _36 - _39;
> 
> The reason I use MIN_EXPR instead of SELECT_VL to calculate the total length
> is that using SELECT_VL to adapt the induction IV consumes more instructions
> than just using MIN_EXPR. Also, during testing, I found it hard to adjust
> the lengths correctly according to SELECT_VL.
> 
> So, in this patch we only use SELECT_VL for single-rgroup with single length 
> control.
> 
> 3. Fix document of select_vl for Richard Biener (remove mode N).
> 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard 
> Biener.
> 5. Keep loop_vinfo as first parameter for "vect_get_loop_len".
> 6. Move the requirement of get_while_len_data_ref_ptr outside, so that it is
> gated at the caller site.
> 
> More comments from Richard Biener:
>>> So it's not actually saturating.  The saturating operation is done by 
>>> .WHILE_LEN?
> I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) as IN_RANGE (0, min (n,
> vf)), which ensures that the loop control counter never underflows zero.
> 
>>> I see.  I wonder if it makes sense to leave .WHILE_LEN aside for a start,
>>> the above scheme should also work for single rgroups, no?
>>> As said, it _looks_ like you can progress without .WHILE_LEN and using
>>> .WHILE_LEN is a pure optimization?
> Yes, SELECT_VL (WHILE_LEN) is a pure optimization for single-rgroup; it allows
> the target to choose any length = INRANGE (0, min (n, vf)) in each iteration.
> 
> Let me know if I missed something for the V3 patch.
So at a high level this is pretty good.  I think there's some

Re: Re: [PATCH] riscv: Allow vector constants in riscv_const_insns.

2023-05-06 Thread 钟居哲

OK, you can go ahead and commit the patch.
I am going to send another patch to fix this.

Besides, I saw that you have committed some redundant, incorrect code; I will
clean it up in another patch.



juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-05-07 04:11
To: juzhe.zh...@rivai.ai; Robin Dapp; gcc-patches; kito.cheng; Kito.cheng; 
palmer; collison
Subject: Re: [PATCH] riscv: Allow vector constants in riscv_const_insns.
 
 
On 5/3/23 23:07, juzhe.zh...@rivai.ai wrote:
> The idea of this patch looks good to me.
> But I think this patch should be able to handle more cases (not only -16 
> ~ 15) in case of CONST_VECTOR initialization.
> 
> Case 1 (Other constant value that is not -16 ~ 15):
> void vmv_m##VAL (TYPE dst[], int n) \
> { \
>  for (int i = 0; i < n; i++) \
>dst[i] = 100; \
>}
> 
> I guess the codegen for const_vector:100 is not optimal so far; I think 
> you may try it (and add testcases).
> Such code can be:
> 
> Codegen 1:
> li a5,100
> vmv.v.x v1, a5
> 
> Codegen 2:
> vlse.v v24, (a5), zero ;; the memory at address a5 holds the value 100.
> 
> I am not sure which of codegen 1 and codegen 2 is better. I think you 
> can decide it.
> But my idea is that this patch should not only handle the constant 
> values -16 ~ 15; other constant values should also be handled and 
> tested in this patch.
> 
> Case 2 (Constant value *within 32bit* for INT64 in *RV32* system):
> 
> This is a special case:
> 
> void vmv_i64 (TYPE dst[], int n)
> {
>  for (int i = 0; i < n; i++)
>dst[i] = *0x*;
>   }
> 
> In this case, the codegen should be similar to Case 1 since each 
> scalar register can hold the whole constant value.
> 
> 
> Case 3 (Constant value over* 32bit* for INT64 in *RV32* system):
> 
> This is a special case:
> 
> void vmv_i64 (TYPE dst[], int n)
> {
>  for (int i = 0; i < n; i++)
>dst[i] = *0xA*;
>   }
> 
> In this case, each scalar register can only hold a 32-bit value, which 
> is not the whole constant value (*0xA)*.
> I think in this case, we can only use vlse.v...
> 
> Would you refine this patch more? Thanks.
I think we can add those as distinct patches.  The [-16..15] change is 
simple, stands on its own and I don't see any strong reason to make it 
wait for handling additional cases.
 
Remember, there are multiple engineers working in this space now.  So 
things which are clearly correct should move forward quickly so that we 
don't end up duplicating work.
 
Handling the additional cases can be handled as a distinct patch on its 
own.
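
As a rough illustration of the three cases discussed above, the following C
sketch classifies a splat constant by how it could be materialized. It is not
what the patch does (the patch only handles [-16..15]); the enum, the function
name and the choice of codegen 1 for mid-sized constants are assumptions made
for this example. The 5-bit signed range of vmv.v.i, and the sign extension of
the scalar by vmv.v.x when XLEN is smaller than SEW, are taken from the RVV
spec.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only.  */
enum splat_strategy
{
  SPLAT_VMV_V_I, /* immediate fits in simm5: vmv.v.i  */
  SPLAT_VMV_V_X, /* li into a scalar register, then vmv.v.x  */
  SPLAT_VLSE     /* load from memory with a zero-stride vlse  */
};

/* Pick a strategy for splatting VALUE into 64-bit elements on a target
   whose scalar registers are XLEN bits wide (32 or 64).  */
static enum splat_strategy
choose_splat_strategy (int64_t value, int xlen)
{
  assert (xlen == 32 || xlen == 64);

  /* The [-16..15] case handled by the patch.  */
  if (value >= -16 && value <= 15)
    return SPLAT_VMV_V_I;

  /* Cases 1 and 2: one scalar register holds the whole value.  On RV32
     this requires VALUE to be a sign-extended 32-bit quantity.  */
  if (xlen == 64 || (value >= INT32_MIN && value <= INT32_MAX))
    return SPLAT_VMV_V_X;

  /* Case 3: the value does not fit in one scalar register on RV32.  */
  return SPLAT_VLSE;
}
```

Whether codegen 1 or codegen 2 is preferable for the middle case is left open
in the discussion above; the sketch picks codegen 1 arbitrarily.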
 
Jeff

Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-05 Thread 钟居哲

Hi, Richards. I would like to give more information about this patch to make
it easier for you to review.

Currently, I see 3 situations that we need to handle for the loop control IV
in auto-vectorization:
1. Single rgroup loop control (ncopies == 1 && vec_num == 1 so loop_len.length 
() == 1 or rgc->length () == 1)
2. Multiple rgroup for SLP.
3. Multiple rgroup for non-SLP, which Richard Sandiford pointed out previously 
(For example, VEC_PACK_TRUNC).

Before talking about this patch, let me first describe the RVV LLVM
implementation that inspired me to send it:
https://reviews.llvm.org/D99750 

The LLVM implementation adds a middle-end IR called "get_vector_length", which
has exactly the same functionality as "select_vl" in this patch (I called it
"while_len" previously; I have now renamed it to "select_vl" following
Richard's suggestion).

The LLVM implementation only lets "get_vector_length" calculate the number of 
elements in a single rgroup loop.
For multi rgroup, let's take a look at it:
https://godbolt.org/z/3GP78efTY 

void
foo1 (short *__restrict f, int *__restrict d, int n)
{
  for (int i = 0; i < n; ++i)
    {
      f[i * 2 + 0] = 1;
      f[i * 2 + 1] = 2;
      d[i] = 3;
    }
}

RISC-V Clang:
foo1:                                   # @foo1
# %bb.0:
blez    a2, .LBB0_8
# %bb.1:
li      a3, 16
bgeu    a2, a3, .LBB0_3
# %bb.2:
li      a6, 0
j       .LBB0_6
.LBB0_3:
andi    a6, a2, -16
lui     a3, 32
addiw   a3, a3, 1
vsetivli zero, 8, e32, m2, ta, ma
vmv.v.x v8, a3
vmv.v.i v10, 3
mv      a4, a6
mv      a5, a1
mv      a3, a0
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
addi    a7, a5, 32
addi    t0, a3, 32
vsetivli zero, 16, e16, m2, ta, ma
vse16.v v8, (a3)
vse16.v v8, (t0)
vsetivli zero, 8, e32, m2, ta, ma
vse32.v v10, (a5)
vse32.v v10, (a7)
addi    a3, a3, 64
addi    a4, a4, -16
addi    a5, a5, 64
bnez    a4, .LBB0_4
# %bb.5:
beq     a6, a2, .LBB0_8
.LBB0_6:
slli    a3, a6, 2
add     a0, a0, a3
addi    a0, a0, 2
add     a1, a1, a3
sub     a2, a2, a6
li      a3, 1
li      a4, 2
li      a5, 3
.LBB0_7:                                # =>This Inner Loop Header: Depth=1
sh      a3, -2(a0)
sh      a4, 0(a0)
sw      a5, 0(a1)
addi    a0, a0, 4
addi    a2, a2, -1
addi    a1, a1, 4
bnez    a2, .LBB0_7
.LBB0_8:
ret

ARM GCC:
foo1:
cmp     w2, 0
ble     .L1
addvl   x4, x0, #1
mov     x3, 0
cntb    x7
cntb    x6, all, mul #2
sbfiz   x2, x2, 1, 32
ptrue   p0.b, all
mov     x5, x2
adrp    x8, .LC0
uqdech  x5
add     x8, x8, :lo12:.LC0
whilelo p1.h, xzr, x5
ld1rw   z1.s, p0/z, [x8]
mov     z0.s, #3
whilelo p0.h, xzr, x2
.L3:
st1h    z1.h, p0, [x0, x3, lsl 1]
st1h    z1.h, p1, [x4, x3, lsl 1]
st1w    z0.s, p1, [x1, #1, mul vl]
add     x3, x3, x7
whilelo p1.h, x3, x5
st1w    z0.s, p0, [x1]
add     x1, x1, x6
whilelo p0.h, x3, x2
b.any   .L3
.L1:
ret

It's obvious that ARM GCC has much better codegen, since RVV LLVM just uses
SIMD style to handle multi-rgroup SLP auto-vectorization.

I totally agree that we should add length support in auto-vectorization
not only for single rgroup but also for multiple rgroup.
However, when I tried to implement and test multiple-rgroup lengths for both
SLP and non-SLP, it turned out to be hard to use select_vl: since the
"select_vl" pattern allows a flexible, non-VF length (length <= min (remain,
VF)) in any iteration, it takes many more operations to adjust the loop
control IV and the data reference address pointer IV than just using
"MIN_EXPR".

So for Case 2 && Case 3, I just use MIN_EXPR directly instead of SELECT_VL 
after several rounds of internal testing.

Based on these situations, we now only have "select_vl" for single-rgroup;
for multiple-rgroup (both SLP and non-SLP), we just use MIN_EXPR.

Would it be more appropriate to remove "select_vl" and just use MIN_EXPR to
force VF elements in each non-final iteration for single rgroup as well?

Like the codegen according to the RVV ISA example (shown as RVV LLVM):
https://repo.hca.bsc.es/epic/z/oynhzP 

ASM:
vec_add:                                # @vec_add
blez    a3, .LBB0_3
li      a4, 0
.LBB0_2:                                # %vector.body
sub     a5, a3, a4
vsetvli a6, a5, e64, m1, ta, mu  ==> change it into a6 = min (a5, VF) 
&& vsetvli zero, a6, e64, m1, ta, mu
slli    a7, a4, 3
add     a5, a1, a7
vle64.v v8, (a5)
add a5,

Re: Re: [GCC14 QUEUE PATCH] RISC-V: Optimize fault only first load

2023-04-23 Thread 钟居哲

Hi, Jeff.
I have fixed patches as you suggested:
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616515.html 
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616518.html 
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616516.html 

Can you merge these patches?

juzhe.zh...@rivai.ai

From: Jeff Law
Date: 2023-04-22 11:18
To: juzhe.zhong; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [GCC14 QUEUE PATCH] RISC-V: Optimize fault only first load

On 3/29/23 19:28, juzhe.zh...@rivai.ai wrote:
> From: Juzhe-Zhong 
> 
> gcc/ChangeLog:
> 
>  * config/riscv/riscv-vsetvl.cc (pass_vsetvl::cleanup_insns): Adapt 
> PASS.
This doesn't provide any useful information as far as I can tell. 
Perhaps something like:
Erase AVL from instructions with the fault first load property.

OK with a better ChangeLog entry.

Related.  As a separate patch, can you add a function comment to 
cleanup_insns?  It doesn't have one and it should.

Thanks,
jeff

Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization

2023-04-19 Thread 钟居哲

Hi, Richards.
GCC 14 is now open, and this patch has been bootstrapped and tested on x86.
Is this patch, which adds support for variable IVs, OK for trunk?

Thanks


juzhe.zh...@rivai.ai
 
From: juzhe.zhong
Date: 2023-04-07 09:47
To: gcc-patches
CC: richard.sandiford; rguenther; jeffreyalaw; Juzhe-Zhong
Subject: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for 
auto-vectorization
From: Juzhe-Zhong 
 
This patch adds the WHILE_LEN pattern.
It's inspired by the simple "vvaddint32.s" example in the RVV ISA:
https://github.com/riscv/riscv-v-spec/blob/master/example/vvaddint32.s
 
More details are in "vect_set_loop_controls_by_while_len" implementation
and comments.
 
Consider the following case:
#define N 16
int src[N];
int dest[N];
 
void
foo (int n)
{
  for (int i = 0; i < n; i++)
dest[i] = src[i];
}
 
-march=rv64gcv -O3 --param riscv-autovec-preference=scalable 
-fno-vect-cost-model -fno-tree-loop-distribute-patterns:
 
foo:
ble     a0,zero,.L1
lui     a4,%hi(.LANCHOR0)
addi    a4,a4,%lo(.LANCHOR0)
addi    a3,a4,64
csrr    a2,vlenb
.L3:
vsetvli a5,a0,e32,m1,ta,ma
vle32.v v1,0(a4)
sub     a0,a0,a5
vse32.v v1,0(a3)
add     a4,a4,a2
add     a3,a3,a2
bne     a0,zero,.L3
.L1:
ret
 
gcc/ChangeLog:
 
* doc/md.texi: Add WHILE_LEN support.
* internal-fn.cc (while_len_direct): Ditto.
(expand_while_len_optab_fn): Ditto.
(direct_while_len_optab_supported_p): Ditto.
* internal-fn.def (WHILE_LEN): Ditto.
* optabs.def (OPTAB_D): Ditto.
* tree-ssa-loop-manip.cc (create_iv): Ditto.
* tree-ssa-loop-manip.h (create_iv): Ditto.
* tree-vect-loop-manip.cc (vect_set_loop_controls_by_while_len): Ditto.
(vect_set_loop_condition_partial_vectors): Ditto.
* tree-vect-loop.cc (vect_get_loop_len): Ditto.
* tree-vect-stmts.cc (vectorizable_store): Ditto.
(vectorizable_load): Ditto.
* tree-vectorizer.h (vect_get_loop_len): Ditto.
 
---
gcc/doc/md.texi |  14 +++
gcc/internal-fn.cc  |  29 ++
gcc/internal-fn.def |   1 +
gcc/optabs.def  |   1 +
gcc/tree-ssa-loop-manip.cc  |   4 +-
gcc/tree-ssa-loop-manip.h   |   2 +-
gcc/tree-vect-loop-manip.cc | 186 ++--
gcc/tree-vect-loop.cc   |  35 +--
gcc/tree-vect-stmts.cc  |   9 +-
gcc/tree-vectorizer.h   |   4 +-
10 files changed, 264 insertions(+), 21 deletions(-)
 
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 8e3113599fd..72178ab014c 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4965,6 +4965,20 @@ for (i = 1; i < operand3; i++)
   operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
@end smallexample
+@cindex @code{while_len@var{m}@var{n}} instruction pattern
+@item @code{while_len@var{m}@var{n}}
+Set operand 0 to the number of active elements in vector will be updated value.
+operand 1 is the total elements need to be updated value.
+operand 2 is the vectorization factor.
+The operation is equivalent to:
+
+@smallexample
+operand0 = MIN (operand1, operand2);
+operand2 can be const_poly_int or poly_int related to vector mode size.
+Some target like RISC-V has a standalone instruction to get MIN (n, MODE SIZE) 
so
+that we can reduce a use of general purpose register.
+@end smallexample
+
@cindex @code{check_raw_ptrs@var{m}} instruction pattern
@item @samp{check_raw_ptrs@var{m}}
Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 6e81dc05e0e..5f44def90d3 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -127,6 +127,7 @@ init_internal_fns ()
#define cond_binary_direct { 1, 1, true }
#define cond_ternary_direct { 1, 1, true }
#define while_direct { 0, 2, false }
+#define while_len_direct { 0, 0, false }
#define fold_extract_direct { 2, 2, false }
#define fold_left_direct { 1, 1, false }
#define mask_fold_left_direct { 1, 1, false }
@@ -3702,6 +3703,33 @@ expand_while_optab_fn (internal_fn, gcall *stmt, 
convert_optab optab)
 emit_move_insn (lhs_rtx, ops[0].value);
}
+/* Expand WHILE_LEN call STMT using optab OPTAB.  */
+static void
+expand_while_len_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+{
+  expand_operand ops[3];
+  tree rhs_type[2];
+
+  tree lhs = gimple_call_lhs (stmt);
+  tree lhs_type = TREE_TYPE (lhs);
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
+
+  for (unsigned int i = 0; i < gimple_call_num_args (stmt); ++i)
+{
+  tree rhs = gimple_call_arg (stmt, i);
+  rhs_type[i] = TREE_TYPE (rhs);
+  rtx rhs_rtx = expand_normal (rhs);
+  create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type[i]));
+}
+
+  insn_code icode = direct_optab_handler (optab, TYPE_MODE (rhs_type[0]));
+
+  expand_insn (icode, 3, ops);
+  if (!rtx_equal_p (lhs_rtx,

Re: [PATCH 0/3] RISC-V: Basic enable RVV auto-vectorizaiton

2023-04-19 Thread 钟居哲

Sorry for sending messy patches.
Please ignore them; the following are the real patches:
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616222.html 
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616225.html 
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616223.html 
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616224.html 

Thanks.


juzhe.zh...@rivai.ai
 
From: juzhe.zhong
Date: 2023-04-20 00:36
To: gcc-patches
CC: kito.cheng; palmer; jeffreyalaw; Ju-Zhe Zhong
Subject: [PATCH 0/3] RISC-V: Basic enable RVV auto-vectorizaiton
From: Ju-Zhe Zhong 
 
PATCH 1: Add compile option for RVV auto-vectorization.
PATCH 2: Enable basic RVV auto-vectorization.
PATCH 3: Add sanity testcases.
 
*** BLURB HERE ***
 
Ju-Zhe Zhong (3):
  RISC-V: Add auto-vectorization compile option for RVV
  RISC-V: Enable basic auto-vectorization for RVV
  RISC-V: Add sanity testcases for RVV auto-vectorization
 
gcc/config/riscv/autovec.md   |  49 
gcc/config/riscv/riscv-opts.h |  15 +++
gcc/config/riscv/riscv-protos.h   |   1 +
gcc/config/riscv/riscv-v.cc   |  53 +
gcc/config/riscv/riscv.cc |  24 +++-
gcc/config/riscv/riscv.opt|  37 ++
gcc/config/riscv/vector.md|   4 +-
.../rvv/autovec/partial/single_rgroup-1.c |   8 ++
.../rvv/autovec/partial/single_rgroup-1.h | 106 ++
.../rvv/autovec/partial/single_rgroup_run-1.c |  19 
.../gcc.target/riscv/rvv/autovec/template-1.h |  68 +++
.../gcc.target/riscv/rvv/autovec/v-1.c|   4 +
.../gcc.target/riscv/rvv/autovec/v-2.c|   6 +
.../gcc.target/riscv/rvv/autovec/zve32f-1.c   |   4 +
.../gcc.target/riscv/rvv/autovec/zve32f-2.c   |   5 +
.../gcc.target/riscv/rvv/autovec/zve32f-3.c   |   6 +
.../riscv/rvv/autovec/zve32f_zvl128b-1.c  |   4 +
.../riscv/rvv/autovec/zve32f_zvl128b-2.c  |   6 +
.../gcc.target/riscv/rvv/autovec/zve32x-1.c   |   4 +
.../gcc.target/riscv/rvv/autovec/zve32x-2.c   |   6 +
.../gcc.target/riscv/rvv/autovec/zve32x-3.c   |   6 +
.../riscv/rvv/autovec/zve32x_zvl128b-1.c  |   5 +
.../riscv/rvv/autovec/zve32x_zvl128b-2.c  |   6 +
.../gcc.target/riscv/rvv/autovec/zve64d-1.c   |   4 +
.../gcc.target/riscv/rvv/autovec/zve64d-2.c   |   4 +
.../gcc.target/riscv/rvv/autovec/zve64d-3.c   |   6 +
.../riscv/rvv/autovec/zve64d_zvl128b-1.c  |   4 +
.../riscv/rvv/autovec/zve64d_zvl128b-2.c  |   6 +
.../gcc.target/riscv/rvv/autovec/zve64f-1.c   |   4 +
.../gcc.target/riscv/rvv/autovec/zve64f-2.c   |   4 +
.../gcc.target/riscv/rvv/autovec/zve64f-3.c   |   6 +
.../riscv/rvv/autovec/zve64f_zvl128b-1.c  |   4 +
.../riscv/rvv/autovec/zve64f_zvl128b-2.c  |   6 +
.../gcc.target/riscv/rvv/autovec/zve64x-1.c   |   4 +
.../gcc.target/riscv/rvv/autovec/zve64x-2.c   |   4 +
.../gcc.target/riscv/rvv/autovec/zve64x-3.c   |   6 +
.../riscv/rvv/autovec/zve64x_zvl128b-1.c  |   4 +
.../riscv/rvv/autovec/zve64x_zvl128b-2.c  |   6 +
gcc/testsuite/gcc.target/riscv/rvv/rvv.exp|  16 +++
39 files changed, 532 insertions(+), 2 deletions(-)
create mode 100644 gcc/config/riscv/autovec.md
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/single_rgroup-1.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/single_rgroup-1.h
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/single_rgroup_run-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/template-1.h
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/v-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/v-2.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve32f-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve32f-2.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve32f-3.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve32f_zvl128b-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve32f_zvl128b-2.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve32x-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve32x-2.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve32x-3.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve32x_zvl128b-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve32x_zvl128b-2.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve64d-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve64d-2.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve64d-3.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve64d_zvl128b-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve64d_zvl128b-2.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve64f-1.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/zve64f-2.c
create mode 100644

Re: [wwwdocs] gcc-13: Add release note for RISC-V

2023-04-19 Thread 钟居哲

LGTM.



juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-04-19 21:53
To: gcc-patches; kito.cheng; palmer; juzhe.zhong; jeffreyalaw
CC: Kito Cheng
Subject: [wwwdocs] gcc-13: Add release note for RISC-V
---
htdocs/gcc-13/changes.html | 31 ++-
1 file changed, 30 insertions(+), 1 deletion(-)
 
diff --git a/htdocs/gcc-13/changes.html b/htdocs/gcc-13/changes.html
index f6941534..5427f805 100644
--- a/htdocs/gcc-13/changes.html
+++ b/htdocs/gcc-13/changes.html
@@ -636,9 +636,32 @@ a work-in-progress.
RISC-V

-New ISA extension support for zawrs.
+Supports vector intrinsics as specified in https://github.com/riscv-non-isa/rvv-intrinsic-doc/tree/v0.11.x;>
+ version 0.11 of the RISC-V vector intrinsic specification,
+ thanks Ju-Zhe Zhong from https://rivai-ic.com.cn/;>RiVAI
+ for contributing most of implementation.
+
 Support for the following vendor extensions has been added:
   
+ Zawrs
+ Zicbom
+ Zicboz
+ Zicbop
+ Zfh
+ Zfhmin
+ Zmmul
+ Zdinx
+ Zfinx
+ Zhinx
+ Zhinxmin
+ Zksh
+ Zksed
+ Zknd
+ Zkne
+ Zbkb
+ Zbkc
+ Zbkx
 XTheadBa
 XTheadBb
 XTheadBs
@@ -657,8 +680,14 @@ a work-in-progress.
   option (GCC identifiers in parentheses).
   
 T-Head's XuanTie C906 (thead-c906).
+Ventana's VT1 (ventana-vt1).
   
 
+Improves the multi-lib selection mechanism for the bare-metal toolchain
+ (riscv*-elf*). GCC will now automatically select the best-fit multi-lib
+ candidate instead of requiring all possible reuse rules to be listed at
+ build time.
+


-- 
2.39.2

Re: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization

2023-04-13 Thread 钟居哲

Thanks Kewen.

The current flow in this patch is, as you said:

len = WHILE_LEN (n,vf);
...
v = len_load (addr,len);
..
addr = addr + vf (in byte align);

This patch simply keeps advancing the address by the vectorization factor
(converted to bytes).
For example, if your vector length is 512 bits, then this patch just updates
the address as
addr = addr + 64;

However, after reading the RVV ISA more carefully today, I think it would be
more appropriate to update the address as addr = addr + (len * 4), if len is
the number of INT32 elements,
where len is the result calculated by WHILE_LEN.

I assume that for the IBM targets it's better to update the address by
directly adding the whole register byte size to the address IV, since I think
the second way (address = addr + (len * 4)) is too RVV specific and won't be
suitable for IBM. Is that right?
If so, I will keep the flow of this patch (and won't change it to address =
addr + (len * 4)) and see what else I need to do for IBM.
I would rather do that in the RISC-V backend port.
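
The difference between the two address updates can be sketched in C as
follows. This is only a model: the function names are hypothetical and
WHILE_LEN is assumed to return exactly MIN (n, vf). Under that assumption the
two schemes differ only in the final iteration, where the address is no longer
used. If a target returned a shorter length in an earlier iteration, advancing
by the whole vector would skip elements, which is why the second form looks
more natural for RVV.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of WHILE_LEN (n, vf) as MIN (n, vf).  */
static size_t
while_len_model (size_t n, size_t vf)
{
  return n < vf ? n : vf;
}

/* Final byte offset when the address IV advances by the whole vector
   (vf INT32 elements) in every iteration: addr = addr + vf * 4.  */
static size_t
final_offset_by_vf (size_t n, size_t vf)
{
  size_t offset = 0;

  assert (vf > 0);
  while (n != 0)
    {
      size_t len = while_len_model (n, vf);
      offset += vf * sizeof (int32_t);
      n -= len;
    }
  return offset;
}

/* Final byte offset when the address IV advances by the elements
   actually processed: addr = addr + len * 4.  */
static size_t
final_offset_by_len (size_t n, size_t vf)
{
  size_t offset = 0;

  assert (vf > 0);
  while (n != 0)
    {
      size_t len = while_len_model (n, vf);
      offset += len * sizeof (int32_t);
      n -= len;
    }
  return offset;
}
```

For n = 40 and vf = 16 (a 512-bit vector of INT32), the first scheme ends at
offset 192 and the second at offset 160; for n = 32 both end at 128.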

>> I tried
>>to compile the above source files on Power, the former can adopt doloop
>>optimization but the latter fails to. 
You mean GCC cannot do hardware loop optimization when the loop control IV is 
variable?

juzhe.zh...@rivai.ai

From: Kewen.Lin
Date: 2023-04-13 15:29
To: 钟居哲
CC: gcc-patches; Jeff Law; rdapp; richard.sandiford; rguenther
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for 
auto-vectorization
Hi Juzhe,

on 2023/4/12 21:22, 钟居哲 wrote:
> Thanks Kewen. 
> 
> It seems that this proposal WHILE_LEN can help s390 when using --param 
> vect-partial-vector-usage=2 compile option.
> 

Yeah, IMHO, the previous sequence vs. the proposed sequence are like:

int
foo (int *__restrict a, int *__restrict b, int n)
{
  if (n <= 0)
return 0;

  int iv = 0;
  int len = MIN (n, 16);
  int sum = 0;
  do
{
  sum += a[len] + b[len];
  iv += 16;
  int n1 = MIN (n, iv);   // line A
  int n2 = n - n1;
  len = MIN (n2, 16);
}
  while (n > iv);

  return sum;
}

vs.

int
foo (int *__restrict a, int *__restrict b, int n)
{
  if (n <= 0)
return 0;

  int len;
  int sum = 0;
  do
{
  len = MIN (n, 16);
  sum += a[len] + b[len];
  n -= len;
}
  while (n > 0);

  return sum;
}

it at least saves one MIN (at line A) and one length preparation in the
last iteration (it's useless since loop ends).  But I think the concern
that this proposed IV isn't recognized as simple iv may stay.  I tried
to compile the above source files on Power, the former can adopt doloop
optimization but the latter fails to.

> Would you mind applying this patch, supporting WHILE_LEN in the s390 backend,
> and testing it to see the overall benefits for s390
> as well as the correctness of this sequence? 

Sure, if all of you think this approach and this revision is good enough to go 
forward for this kind of evaluation,
I'm happy to give it a shot, but only for rs6000. ;-)  I noticed that there are 
some discussions on withdrawing this
WHILE_LEN by using MIN_EXPR instead, I'll stay tuned.

btw, now we only adopt vector with length on the epilogues rather than the main 
vectorized loops, because of the
non-trivial extra costs for length preparation than just using the normal 
vector load/store (all lanes), so we don't
care about the performance with --param vect-partial-vector-usage=2 much.  Even 
if this new proposal can optimize
the length preparation for --param vect-partial-vector-usage=2, the extra costs 
for length preparation is still
unavoidable (MIN, shifting, one more GPR used), we would still stay with 
default --param vect-partial-vector-usage=1
(which can't benefit from this new proposal).

BR,
Kewen

Re: Re: [PATCH] machine_mode type size: Extend enum size from 8-bit to 16-bit

2023-04-12 Thread 钟居哲

Yeah, like Kito said.
It turns out that the tuple type model in ARM SVE is the optimal solution for
RVV, and we like the ARM SVE style implementation.

Now we see that swapping rtx_code and mode in rtx_def keeps rtx_def overall
within 64 bits.
But it seems that there is still a problem in tree_type_common and 
tree_decl_common, is that right?

After several tries (removing all redundant TI/TF vector modes and FP16 vector 
modes), there are now 252 modes
in the RISC-V port. Basically, I can keep supporting the new RVV intrinsic
features for now.
However, we can't support more in the future, for example, FP16 vector, BF16 
vector, matrix modes, VLS modes, etc.

From the RVV side, I think extending the machine mode by 1 more bit should be
enough for RVV (512 modes overall).
Is it possible to make that happen in tree_type_common and tree_decl_common, 
Richards?

Thank you so much for all comments.


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-04-12 17:31
To: Richard Biener
CC: juzhe.zh...@rivai.ai; richard.sandiford; jeffreyalaw; gcc-patches; palmer; 
jakub
Subject: Re: Re: [PATCH] machine_mode type size: Extend enum size from 8-bit to 
16-bit
> > The concept of fractional LMUL is the same as the concept of AArch64's
> > partial SVE vectors,
> > so they can only access the lowest part, like SVE's partial vector.
> >
> > We want to spill/restore the exact size of those modes (1/2, 1/4,
> > 1/8), so adding dedicated modes for those partial vector modes should
> > be unavoidable IMO.
> >
> > And even if we use sub-vector, we still need to define those partial
> > vector types.
>
> Could you use integer modes for the fractional vectors?
 
You mean using the scalar integer mode like using (subreg:SI
(reg:VNx4SI) 0) to represent
LMUL=1/4?
(Assume VNx4SI is mode for M1)
 
If so I think it might not be able to model that right - it seems like
we are using 32-bits
but actually we are using poly_int16(1, 1) * 32 bits.
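
The size mismatch can be sketched with a tiny stand-in for a two-coefficient
polynomial size. PolySketch below is a made-up type for illustration, not
GCC's poly_int; it only assumes that a value of the form (c0, c1) means
c0 + c1 * x, where x is a non-negative quantity known only at run time.

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-in for a two-coefficient poly_int: value = c0 + c1 * x,
// where x >= 0 is fixed by the hardware and unknown at compile time.
struct PolySketch {
  std::int64_t c0;
  std::int64_t c1;
  // Evaluate the size for one concrete runtime value of x.
  std::int64_t at(std::int64_t x) const { return c0 + c1 * x; }
  // A size is a compile-time constant only if it does not depend on x.
  bool is_constant() const { return c1 == 0; }
};

// Multiply a polynomial size by a compile-time constant.
inline PolySketch scale(PolySketch p, std::int64_t k) {
  return PolySketch{p.c0 * k, p.c1 * k};
}

// A scalar SImode value is always exactly 32 bits.
inline PolySketch si_bits() { return PolySketch{32, 0}; }

// The fractional vector is (1 + x) * 32 bits, as described above.
inline PolySketch fractional_bits() { return scale(PolySketch{1, 1}, 32); }
```

The two sizes agree only for x == 0, so a fixed 32-bit subreg cannot stand in
for the fractional vector in general.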
 
> For computation you can always appropriately limit the LEN?
 
RVV provides the zvl*b extensions, i.e. zvl<N>b (e.g. zvl128b or zvl256b),
to guarantee the vector length is at least N bits, but that is
just a guarantee of the minimal length, like SVE guarantees that the minimal
vector length is 128 bits.

Re: Re: [PATCH] RISC-V: Fix incorrect condition of EEW = 64 mode

2023-04-12 Thread 钟居哲

Yeah. But this patch is not appropriate now since it conflicts with 
upstream GCC.
I am going to re-check the current upstream GCC and the patch queue for GCC 14.
If there are any conflicts, I will resend the patches.

Thanks


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-04-13 07:00
To: juzhe.zhong; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Fix incorrect condition of EEW = 64 mode
 
 
On 4/6/23 19:11, juzhe.zh...@rivai.ai wrote:
> From: Juzhe-Zhong 
> 
> This patch should be merged before this patch:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-March/614935.html
> 
> According to RVV ISA, the EEW = 64 is enabled only when -march=*zve64*.
> The current condition is incorrect, since -march=*zve32*_zvl64b will enable EEW = 
> 64, which
> is incorrect.
> 
> gcc/ChangeLog:
> 
>  * config/riscv/riscv-vector-switch.def (ENTRY): Change to 
> TARGET_VECTOR_ELEN_64.
Just to be clear, this was for gcc-14, right?  I don't see these modes 
in the current trunk.
 
jeff

Re: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization

2023-04-12 Thread 钟居哲

>> It's not so much that we need to do that.  But normally it's only worth
>> adding internal functions if they do something that is too complicated
>> to express in simple gimple arithmetic.  The UQDEC case I mentioned:

>>z = MAX (x, y) - y

>> fell into the "simple arithmetic" category for me.  We could have added
>> an ifn for unsigned saturating decrement, but it didn't seem complicated
>> enough to merit its own ifn.

Ah, I understand your concern. I should admit that WHILE_LEN is a simple arithmetic 
operation
which just takes the result of

min (remain,vf).

The possible solution is to just use MIN_EXPR (remain,vf),
then add special handling in the umin_optab pattern to recognize "vf" in the 
backend,
and finally generate vsetvl in the RISC-V backend.

"vf" would be recognized when the operand of umin is a 
const_int/const_poly_int operand.
Otherwise, just generate the scalar umin instruction.

However, there is a case where I can't tell whether umin should generate vsetvl or 
umin. It is the following case:
int32_t foo (int32_t a)
{
  return min (a, 4);
}

In this case I should generate:
li a1,4
umin a1,a0,a1

instead of generating vsetvl

However, in this case:

void foo (int32_t *a...)
for (int i = 0; i < n; i++)
  a[i] = b[i] + c[i];

with -mriscv-vector-bits=128 (which means each vector can handle 4 INT32 elements),
the VF will be 4 too. If we also use MIN_EXPR instead of WHILE_LEN:

...
len = MIN_EXPR (n,4)
v = len_load (len)

...

In this case, MIN_EXPR should emit vsetvl.
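
For reference, the loop shape being discussed can be modelled in plain C++.
This is only a scalar sketch of the "len = min (remain, vf)" definition; the
function name and the use of std::vector are assumptions for illustration,
and the inner loop merely stands in for len_load / add / len_store.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of a length-controlled loop with a decrementing IV.
// Preconditions: vf > 0, and a and c hold at least b.size() elements.
// Returns the length used in each iteration so it can be inspected.
std::vector<std::size_t> add_with_length_control(const std::vector<int>& b,
                                                 const std::vector<int>& c,
                                                 std::vector<int>& a,
                                                 std::size_t vf) {
  std::vector<std::size_t> lens;
  std::size_t remain = b.size();  // elements still to be processed
  std::size_t pos = 0;            // start of the current chunk
  while (remain > 0) {
    // This is the value WHILE_LEN (or MIN_EXPR) would compute.
    const std::size_t len = std::min(remain, vf);
    // Stand-in for: v = len_load (len); add; len_store (len).
    for (std::size_t i = 0; i < len; ++i)
      a[pos + i] = b[pos + i] + c[pos + i];
    pos += len;
    remain -= len;  // the IV steps by a variable amount
    lens.push_back(len);
  }
  return lens;
}
```

With 10 elements and vf = 4 the per-iteration lengths are 4, 4 and 2; the
same MIN appears in both foo examples, which is exactly why the backend
cannot tell them apart from the operands alone.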

It's hard for me to tell the difference between these 2 cases...

CC RISC-V port backend maintainer: Kito.

juzhe.zh...@rivai.ai

From: Richard Sandiford
Date: 2023-04-12 20:24
To: juzhe.zhong\@rivai.ai
CC: rguenther; gcc-patches; jeffreyalaw; rdapp; linkw
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for 
auto-vectorization
"juzhe.zh...@rivai.ai"  writes:
>>> I think that already works for them (could be misremembering).
>>> However, IIUC, they have no special instruction to calculate the
>>> length (unlike for RVV), and so it's open-coded using vect_get_len.
>
> Yeah, the current flow using min, sub, and then min in vect_get_len
> is working for IBM. But I wonder whether switching the current flow of
> length-loop-control into the WHILE_LEN pattern that this patch can improve
> their performance.
>
>>> (1) How easy would it be to express WHILE_LEN in normal gimple?
>>> I haven't thought about this at all, so the answer might be
>>> "very hard".  But it reminds me a little of UQDEC on AArch64,
>>> which we open-code using MAX_EXPR and MINUS_EXPR (see
>  >>vect_set_loop_controls_directly).
>
>   >>   I'm not saying WHILE_LEN is the same operation, just that it seems
>   >>   like it might be open-codeable in a similar way.
>
>  >>Even if we can open-code it, we'd still need some way for the
>   >>   target to select the "RVV way" from the "s390/PowerPC way".
>
> WHILE_LEN in doc I define is
> operand0 = MIN (operand1, operand2)
> operand1 is the residual number of scalar elements need to be updated.
> operand2 is vectorization factor (vf) for single rgroup.
> if multiple rgroup operand2 = vf * nitems_per_ctrl.
> You mean such pattern is not well expressed so we need to replace it with normal
> tree code (MIN OR MAX). And let RISC-V backend to optimize them into vsetvl ?
> Sorry, maybe I am not on the same page.

It's not so much that we need to do that.  But normally it's only worth
adding internal functions if they do something that is too complicated
to express in simple gimple arithmetic.  The UQDEC case I mentioned:

   z = MAX (x, y) - y

fell into the "simple arithmetic" category for me.  We could have added
an ifn for unsigned saturating decrement, but it didn't seem complicated
enough to merit its own ifn.
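
As a quick check of the identity above, a minimal C++ sketch (the name
sat_dec is made up for illustration; it is not a GCC function):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Unsigned saturating decrement written as z = MAX (x, y) - y.
// If x >= y this is x - y; otherwise MAX picks y and the result is 0,
// so the subtraction can never wrap around.
inline std::uint32_t sat_dec(std::uint32_t x, std::uint32_t y) {
  return std::max(x, y) - y;
}
```

For example, sat_dec (10, 4) gives 6, while sat_dec (3, 4) gives 0 instead of
wrapping to a large unsigned value.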

>>> (2) What effect does using a variable IV step (the result of
>>> the WHILE_LEN) have on ivopts?  I remember experimenting with
>>> something similar once (can't remember the context) and not
>>> having a constant step prevented ivopts from making good
>>> addressing-mode choices.
>
> Thank you so much for pointing this out. Currently, a variable IV step and 
> decreasing n down to 0 
> work fine for RISC-V downstream GCC and we didn't find issues related to 
> addressing-mode selection.

OK, that's good.  Sounds like it isn't a problem then.

> I think I must have missed something, would you mind giving me some hints so that 
> I can study ivopts
> to find out which cases may generate inferior codegen for a variable IV step?

I think AArch64 was sensitive to this because (a) the vectoriser creates
separate IVs for each base address and (b) for SVE, we instead want
invariant base addresses that are indexed by the loop control IV.
Like Richard says, if the loop control IV isn't a SCEV, ivopts isn't
able to use it and so (b) fails.

Thanks,
Richard

Re: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization

2023-04-12 Thread 钟居哲

Thanks Kewen. 

It seems that this WHILE_LEN proposal can help s390 when using the --param 
vect-partial-vector-usage=2 compile option.

Would you mind applying this patch, supporting WHILE_LEN in the s390 backend, and testing 
it to see the overall benefits for s390
as well as the correctness of this sequence? 
If it creates any correctness issues for s390 or rs6000 (I saw 
len_load/len_store in rs6000 too), I can fix this patch for you.

I hope both RVV and IBM targets can gain benefits from this patch.

Thanks.

juzhe.zh...@rivai.ai

From: Kewen.Lin
Date: 2023-04-12 20:56
To: juzhe.zh...@rivai.ai; richard.sandiford; rguenther
CC: gcc-patches; jeffreyalaw; rdapp
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for 
auto-vectorization
Hi!

on 2023/4/12 19:37, juzhe.zh...@rivai.ai wrote:
> Thank you. Richard.
> 
> 
>>> I think that already works for them (could be misremembering).
>>> However, IIUC, they have no special instruction to calculate the
>>> length (unlike for RVV), and so it's open-coded using vect_get_len.
> 

Yeah, Richard is right, we don't have some special hardware instruction
for efficient length calculation.

> Yeah, the current flow using min, sub, and then min in vect_get_len
> is working for IBM. But I wonder whether switching the current flow of
> length-loop-control into the WHILE_LEN pattern that this patch can improve
> their performance.

Based on some cons for the vector load/store with length in bytes on Power
(like we need one extra GPR holding the length, the length needs to be the
most significant 8 bits requiring an extra shifting etc.), we use normal
vector load/store in main loop and only use vector load/store with length
for the epilogue.  For the epilogue, the remaining length is known less
than the whole vector length, so the related sequence can be optimized.
I just had a check on s390 code, which also enables it only for the
epilogue.  From this perspective, this WHILE_LEN proposal may not give us
more.  But for the case of vect-partial-vector-usage=2 (fully adopting
vector with length on the main loop), I think the proposed sequence looks
better to me.
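
The "length only in the epilogue" arrangement described above can be
sketched as follows. This is a scalar model under assumptions: the function
name is hypothetical, and the inner loops stand in for normal vector
load/store (main loop) and vector load/store with length (epilogue).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Main loop uses whole vectors only; the length-controlled form is used
// once, for the tail. Preconditions: vf > 0, and a and c hold at least
// b.size() elements. Returns the tail length handled by the epilogue.
std::size_t add_epilogue_only(const std::vector<int>& b,
                              const std::vector<int>& c,
                              std::vector<int>& a,
                              std::size_t vf) {
  const std::size_t n = b.size();
  std::size_t pos = 0;
  // Main loop: full vectors, no length operand and no extra GPR needed.
  for (; pos + vf <= n; pos += vf)
    for (std::size_t i = 0; i < vf; ++i)
      a[pos + i] = b[pos + i] + c[pos + i];
  // Epilogue: the remaining length is known to be smaller than vf.
  const std::size_t tail = n - pos;
  for (std::size_t i = 0; i < tail; ++i)
    a[pos + i] = b[pos + i] + c[pos + i];
  return tail;
}
```

Because the tail is always shorter than one vector, the length computation in
the epilogue needs no MIN at all, which matches the remark that the proposal
mainly matters for vect-partial-vector-usage=2.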

BR,
Kewen

Re: Re: [PATCH] RISC-V: Fix ICE of visiting non-existing block in CFG.

2022-12-28 Thread 钟居哲

Yeah, I agree with you that it makes the pass look confusing if we are 
mixing FOR_EACH_BB and for (const bb_info *bb...
But Jeff is happier if I use FOR_EACH_BB, so I sent a patch to change the 
iterator form where the order doesn't matter.
In this patch, both FOR_EACH_BB and for (const bb_info *bb... are OK, so I 
changed it as Jeff suggested.

However, in other places of this pass, for example the 
compute_global_backward_infos function, I want to iterate over blocks in reverse 
order and I must use 
"for (const bb_info *bb : crtl->ssa->reverse_bbs ())", which allows me to do 
the backward information propagation thoroughly
so that I can do the aggressive and fancy optimization.

Based on these situations, the pass will be mixing FOR_EACH_BB and for (const bb_info 
*bb..., which may make the pass
a little bit confusing.

juzhe.zh...@rivai.ai

From: Richard Sandiford
Date: 2022-12-28 19:47
To: Jeff Law via Gcc-patches
CC: juzhe.zhong; Jeff Law; kito.cheng\@gmail.com; palmer\@dabbelt.com
Subject: Re: [PATCH] RISC-V: Fix ICE of visiting non-existing block in CFG.
Jeff Law via Gcc-patches  writes:
> On 12/27/22 16:11, juzhe.zhong wrote:
>> Do you mean I should only change to the form you suggested in this patch? In 
>> all other places of this pass, I use the RTL_SSA framework to iterate over 
>> instructions and blocks. I use the RTL_SSA framework to iterate over blocks here 
>> to keep the code consistent, even though the two forms are equivalent here.
> The FOR_EACH_BB is used far more widely than the C++ style found in 
> RTL-SSA so I'd slightly prefer that style.

I can see where you're coming from, but what the patch does is preferred
for RTL-SSA passes.  There is some additional information in
rtl_ssa::bb_info compared to the underlying basic_block, and even if
this particular loop doesn't use that information, IMO it would be
better to avoid mixing styles within a pass.

Also, the list that the patch iterates over is in reverse postorder,
whereas FOR_EACH_BB doesn't guarantee a particular order.  Again,
that might not be important here, but it seems better to stick to the
“native” RTL-SSA approach.
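
To illustrate why the iteration order can matter, here is a self-contained
sketch that computes a reverse postorder over a tiny CFG. It does not use any
GCC or RTL-SSA API; the Graph type and function names are made up for the
example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Adjacency list: g[n] holds the successors of block n.
using Graph = std::vector<std::vector<int>>;

// Depth-first walk that records each block after all of its successors.
static void dfs_postorder(const Graph& g, int node, std::vector<bool>& seen,
                          std::vector<int>& post) {
  seen[node] = true;
  for (int succ : g[node])
    if (!seen[succ])
      dfs_postorder(g, succ, seen, post);
  post.push_back(node);
}

// Reverse postorder: in an acyclic CFG every block appears before its
// successors, which suits forward data-flow propagation.
std::vector<int> reverse_postorder(const Graph& g, int entry) {
  std::vector<bool> seen(g.size(), false);
  std::vector<int> post;
  dfs_postorder(g, entry, seen, post);
  std::reverse(post.begin(), post.end());
  return post;
}
```

For the diamond 0 -> {1, 2} -> 3 this yields 0, 2, 1, 3. Walking the same
list backwards visits block 3 first, which is the kind of order a backward
propagation wants; an iterator with no ordering guarantee gives neither
property.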

Thanks,
Richard

Re: Re: [PATCH] RISC-V: Fix ICE of visiting non-existing block in CFG.

2022-12-27 Thread 钟居哲

Hi, I fixed that form like you said:
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/609217.html 

juzhe.zh...@rivai.ai

From: Jeff Law
Date: 2022-12-28 09:11
To: 钟居哲
CC: gcc-patches; kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Fix ICE of visiting non-existing block in CFG.

On 12/27/22 17:24, 钟居哲 wrote:
> OK, I will change that after I finish my current work.
Sounds good.  Thanks.

Jeff

Re: Re: [PATCH] RISC-V: Fix ICE of visiting non-existing block in CFG.

2022-12-27 Thread 钟居哲

OK, I will change that after I finish my current work.

juzhe.zh...@rivai.ai

From: Jeff Law
Date: 2022-12-28 08:06
To: juzhe.zhong
CC: gcc-patches@gcc.gnu.org; kito.ch...@gmail.com; pal...@dabbelt.com
Subject: Re: [PATCH] RISC-V: Fix ICE of visiting non-existing block in CFG.

On 12/27/22 16:11, juzhe.zhong wrote:
> Do you mean I should only change to the form you suggested in this patch? In 
> all other places of this pass, I use the RTL_SSA framework to iterate over 
> instructions and blocks. I use the RTL_SSA framework to iterate over blocks here 
> to keep the code consistent, even though the two forms are equivalent here.
The FOR_EACH_BB is used far more widely than the C++ style found in 
RTL-SSA so I'd slightly prefer that style.

jeff

Re: Re: [PATCH] RISC-V: Add testcases for VSETVL PASS

2022-12-26 Thread 钟居哲

This is another issue and I have no idea. I think Palmer or Kito may know 
how to solve it. 

It seems this patch 
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/609045.html 
fixed the previous issue, which is more important. I think it's time to merge it.




juzhe.zh...@rivai.ai
 
From: Andreas Schwab
Date: 2022-12-26 17:20
To: juzhe.zhong
CC: gcc-patches; kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Add testcases for VSETVL PASS
FAIL: gcc.target/riscv/rvv/vsetvl/dump-1.c   -O0  (test for excess errors)
Excess errors:
/usr/include/gnu/stubs.h:8:11: fatal error: gnu/stubs-ilp32.h: No such file or 
directory
compilation terminated.
 
-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Re: Re: [PATCH] RISC-V: Fix ICE for avl_info deprecated copy and pp_print error.

2022-12-24 Thread 钟居哲

I just want to make sure: do you mean the bootstrap passes now with this patch?
If yes, please merge this patch. Thank you so much.

juzhe.zh...@rivai.ai

From: Andreas Schwab
Date: 2022-12-25 00:58
To: juzhe.zhong
CC: gcc-patches
Subject: Re: [PATCH] RISC-V: Fix ICE for avl_info deprecated copy and pp_print 
error.
On Dez 23 2022, juzhe.zh...@rivai.ai wrote:

> * config/riscv/riscv-vsetvl.cc (change_insn): Remove pp_print.
> (avl_info::avl_info): Add copy function.
> (vector_insn_info::dump): Remove pp_print.
> * config/riscv/riscv-vsetvl.h: Add copy function.
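
For readers hitting the same -Wdeprecated-copy error: it fires when a class
has a user-provided copy assignment operator but relies on the implicitly
declared copy constructor. The ChangeLog above only says a copy function was
added; the sketch below is an assumption about the general shape of such a
fix, using made-up class names rather than the real riscv_vector::avl_info.

```cpp
#include <cassert>

// Stand-in for a class like avl_info: it provides its own operator=.
class avl_info_sketch {
public:
  explicit avl_info_sketch(int value) : m_value(value) {}

  // Declaring the copy constructor explicitly means the compiler no longer
  // has to generate the deprecated implicit one.
  avl_info_sketch(const avl_info_sketch& other) = default;

  // User-provided copy assignment, as in the diagnostic's "note" line.
  avl_info_sketch& operator=(const avl_info_sketch& other) {
    m_value = other.m_value;
    return *this;
  }

  int value() const { return m_value; }

private:
  int m_value;
};

// Stand-in for a class like vl_vtype_info that copies the member both in
// its constructor and when returning it by value.
class holder_sketch {
public:
  explicit holder_sketch(const avl_info_sketch& avl) : m_avl(avl) {}
  avl_info_sketch get_avl_info() const { return m_avl; }

private:
  avl_info_sketch m_avl;
};
```

Both copies flagged in the log (the member initializer and the by-value
return) go through the explicitly declared copy constructor here.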

Survived bootstrap so far.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Re: Re: [PATCH] RISC-V: Support VSETVL PASS for RVV support

2022-12-23 Thread 钟居哲

Thank you. Would you mind testing this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/609045.html 
to see whether the issue is fixed ?
Thanks



juzhe.zh...@rivai.ai
 
From: Andreas Schwab
Date: 2022-12-23 22:54
To: 钟居哲
CC: gcc-patches; kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Support VSETVL PASS for RVV support
On Dez 23 2022, 钟居哲 wrote:
 
> Would you mind telling me how you reproduce these errors ?
 
make bootstrap
 
-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Re: Re: [PATCH] RISC-V: Support VSETVL PASS for RVV support

2022-12-23 Thread 钟居哲

Hi, Andreas. Thank you for reporting this.
Even though I didn't reproduce this error, I have an idea to fix it:
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/609045.html 
Would you mind testing this patch for me before merging it?
Thanks.


juzhe.zh...@rivai.ai
 
From: Andreas Schwab
Date: 2022-12-23 18:53
To: juzhe.zhong
CC: gcc-patches; kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Support VSETVL PASS for RVV support
How has this been tested?
 
In file included from ../../gcc/config/riscv/riscv-vsetvl.cc:89:
../../gcc/config/riscv/riscv-vsetvl.h: In member function 
'riscv_vector::avl_info riscv_vector::vl_vtype_info::get_avl_info() const':
../../gcc/config/riscv/riscv-vsetvl.h:175:43: error: implicitly-declared 
'constexpr riscv_vector::avl_info::avl_info(const riscv_vector::avl_info&)' is 
deprecated [-Werror=deprecated-copy]
  175 |   avl_info get_avl_info () const { return m_avl; }
  |   ^
../../gcc/config/riscv/riscv-vsetvl.h:131:13: note: because 
'riscv_vector::avl_info' has user-provided 'riscv_vector::avl_info& 
riscv_vector::avl_info::operator=(const riscv_vector::avl_info&)'
  131 |   avl_info = (const avl_info &);
  | ^~~~
../../gcc/config/riscv/riscv-vsetvl.cc: In function 'bool 
change_insn(rtl_ssa::function_info*, rtl_ssa::insn_change, rtl_ssa::insn_info*, 
rtx)':
../../gcc/config/riscv/riscv-vsetvl.cc:823:27: error: unquoted whitespace 
character '\x0a' in format [-Werror=format-diag]
  823 |   pp_printf (, "\n");
  |   ^~~~
../../gcc/config/riscv/riscv-vsetvl.cc:847:27: error: unquoted whitespace 
character '\x0a' in format [-Werror=format-diag]
  847 |   pp_printf (, "\n");
  |   ^~~~
../../gcc/config/riscv/riscv-vsetvl.cc: In constructor 
'riscv_vector::vl_vtype_info::vl_vtype_info(riscv_vector::avl_info, uint8_t, 
riscv_vector::vlmul_type, uint8_t, bool, bool)':
../../gcc/config/riscv/riscv-vsetvl.cc:905:5: error: implicitly-declared 
'constexpr riscv_vector::avl_info::avl_info(const riscv_vector::avl_info&)' is 
deprecated [-Werror=deprecated-copy]
  905 |   : m_avl (avl_in), m_sew (sew_in), m_vlmul (vlmul_in), m_ratio 
(ratio_in),
  | ^~
../../gcc/config/riscv/riscv-vsetvl.cc:859:1: note: because 
'riscv_vector::avl_info' has user-provided 'riscv_vector::avl_info& 
riscv_vector::avl_info::operator=(const riscv_vector::avl_info&)'
  859 | avl_info::operator= (const avl_info )
  | ^~~~
../../gcc/config/riscv/riscv-vsetvl.cc: In member function 'void 
riscv_vector::vector_insn_info::dump(FILE*) const':
../../gcc/config/riscv/riscv-vsetvl.cc:1366:27: error: unquoted whitespace 
character '\x0a' in format [-Werror=format-diag]
1366 |   pp_printf (, "\n");
  |   ^~~~
cc1plus: all warnings being treated as errors
make[3]: *** [../../gcc/config/riscv/t-riscv:59: riscv-vsetvl.o] Error 1
 
-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Re: Re: [PATCH] RISC-V: Support VSETVL PASS for RVV support

2022-12-23 Thread 钟居哲

Would you mind telling me how you reproduced these errors?
I failed to reproduce them. Thanks



juzhe.zh...@rivai.ai
 
From: Andreas Schwab
Date: 2022-12-23 18:53
To: juzhe.zhong
CC: gcc-patches; kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Support VSETVL PASS for RVV support
How has this been tested?
 
In file included from ../../gcc/config/riscv/riscv-vsetvl.cc:89:
../../gcc/config/riscv/riscv-vsetvl.h: In member function 
'riscv_vector::avl_info riscv_vector::vl_vtype_info::get_avl_info() const':
../../gcc/config/riscv/riscv-vsetvl.h:175:43: error: implicitly-declared 
'constexpr riscv_vector::avl_info::avl_info(const riscv_vector::avl_info&)' is 
deprecated [-Werror=deprecated-copy]
  175 |   avl_info get_avl_info () const { return m_avl; }
  |   ^
../../gcc/config/riscv/riscv-vsetvl.h:131:13: note: because 
'riscv_vector::avl_info' has user-provided 'riscv_vector::avl_info& 
riscv_vector::avl_info::operator=(const riscv_vector::avl_info&)'
  131 |   avl_info = (const avl_info &);
  | ^~~~
../../gcc/config/riscv/riscv-vsetvl.cc: In function 'bool 
change_insn(rtl_ssa::function_info*, rtl_ssa::insn_change, rtl_ssa::insn_info*, 
rtx)':
../../gcc/config/riscv/riscv-vsetvl.cc:823:27: error: unquoted whitespace 
character '\x0a' in format [-Werror=format-diag]
  823 |   pp_printf (, "\n");
  |   ^~~~
../../gcc/config/riscv/riscv-vsetvl.cc:847:27: error: unquoted whitespace 
character '\x0a' in format [-Werror=format-diag]
  847 |   pp_printf (, "\n");
  |   ^~~~
../../gcc/config/riscv/riscv-vsetvl.cc: In constructor 
'riscv_vector::vl_vtype_info::vl_vtype_info(riscv_vector::avl_info, uint8_t, 
riscv_vector::vlmul_type, uint8_t, bool, bool)':
../../gcc/config/riscv/riscv-vsetvl.cc:905:5: error: implicitly-declared 
'constexpr riscv_vector::avl_info::avl_info(const riscv_vector::avl_info&)' is 
deprecated [-Werror=deprecated-copy]
  905 |   : m_avl (avl_in), m_sew (sew_in), m_vlmul (vlmul_in), m_ratio 
(ratio_in),
  | ^~
../../gcc/config/riscv/riscv-vsetvl.cc:859:1: note: because 
'riscv_vector::avl_info' has user-provided 'riscv_vector::avl_info& 
riscv_vector::avl_info::operator=(const riscv_vector::avl_info&)'
  859 | avl_info::operator= (const avl_info )
  | ^~~~
../../gcc/config/riscv/riscv-vsetvl.cc: In member function 'void 
riscv_vector::vector_insn_info::dump(FILE*) const':
../../gcc/config/riscv/riscv-vsetvl.cc:1366:27: error: unquoted whitespace 
character '\x0a' in format [-Werror=format-diag]
1366 |   pp_printf (, "\n");
  |   ^~~~
cc1plus: all warnings being treated as errors
make[3]: *** [../../gcc/config/riscv/t-riscv:59: riscv-vsetvl.o] Error 1
 
-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Re: [PATCH] RISC-V: Support vle.v/vse.v intrinsics

2022-12-22 Thread 钟居哲

This patch adds the minimum intrinsics support that the VSETVL PASS needs to support the AVL model.
The corresponding unit tests for vle.v/vse.v should be added after I support the AVL 
model 
and the VSETVL PASS patch is well tested.

juzhe.zh...@rivai.ai

From: juzhe.zhong
Date: 2022-12-23 08:52
To: gcc-patches
CC: kito.cheng; palmer; Ju-Zhe Zhong
Subject: [PATCH] RISC-V: Support vle.v/vse.v intrinsics
From: Ju-Zhe Zhong 

gcc/ChangeLog:

* config/riscv/riscv-protos.h (get_avl_type_rtx): New function.
* config/riscv/riscv-v.cc (get_avl_type_rtx): Ditto.
* config/riscv/riscv-vector-builtins-bases.cc (class loadstore): New 
class.
(BASE): Ditto.
* config/riscv/riscv-vector-builtins-bases.h: Ditto.  
* config/riscv/riscv-vector-builtins-functions.def (vle): Ditto.
(vse): Ditto.
* config/riscv/riscv-vector-builtins-shapes.cc (build_one): Ditto.
(struct loadstore_def): Ditto.
(SHAPE): Ditto.
* config/riscv/riscv-vector-builtins-shapes.h: Ditto.
* config/riscv/riscv-vector-builtins-types.def (DEF_RVV_U_OPS): New 
macro.
(DEF_RVV_F_OPS): Ditto.
(vuint8mf8_t): Add corresponding mask type.
(vuint8mf4_t): Ditto.
(vuint8mf2_t): Ditto.
(vuint8m1_t): Ditto.
(vuint8m2_t): Ditto.
(vuint8m4_t): Ditto.
(vuint8m8_t): Ditto.
(vuint16mf4_t): Ditto.
(vuint16mf2_t): Ditto.
(vuint16m1_t): Ditto.
(vuint16m2_t): Ditto.
(vuint16m4_t): Ditto.
(vuint16m8_t): Ditto.
(vuint32mf2_t): Ditto.
(vuint32m1_t): Ditto.
(vuint32m2_t): Ditto.
(vuint32m4_t): Ditto.
(vuint32m8_t): Ditto.
(vuint64m1_t): Ditto.
(vuint64m2_t): Ditto.
(vuint64m4_t): Ditto.
(vuint64m8_t): Ditto.
(vfloat32mf2_t): Ditto.
(vfloat32m1_t): Ditto.
(vfloat32m2_t): Ditto.
(vfloat32m4_t): Ditto.
(vfloat32m8_t): Ditto.
(vfloat64m1_t): Ditto.
(vfloat64m2_t): Ditto.
(vfloat64m4_t): Ditto.
(vfloat64m8_t): Ditto.
* config/riscv/riscv-vector-builtins.cc (DEF_RVV_TYPE): Adjust for new 
macro.
(DEF_RVV_I_OPS): Ditto.
(DEF_RVV_U_OPS): New macro.
(DEF_RVV_F_OPS): New macro.
(use_real_mask_p): New function.
(use_real_merge_p): Ditto.
(get_tail_policy_for_pred): Ditto.
(get_mask_policy_for_pred): Ditto.
(function_builder::apply_predication): Ditto.
(function_builder::append_base_name): Ditto.
(function_builder::append_sew): Ditto.
(function_expander::add_vundef_operand): Ditto.
(function_expander::add_mem_operand): Ditto.
(function_expander::use_contiguous_load_insn): Ditto.
(function_expander::use_contiguous_store_insn): Ditto.
* config/riscv/riscv-vector-builtins.def (DEF_RVV_TYPE): Adjust for 
adding mask type.
(vbool64_t): Ditto.
(vbool32_t): Ditto.
(vbool16_t): Ditto.
(vbool8_t): Ditto.
(vbool4_t): Ditto.
(vbool2_t): Ditto.
(vbool1_t): Ditto.
(vint8mf8_t): Ditto.
(vint8mf4_t): Ditto.
(vint8mf2_t): Ditto.
(vint8m1_t): Ditto.
(vint8m2_t): Ditto.
(vint8m4_t): Ditto.
(vint8m8_t): Ditto.
(vint16mf4_t): Ditto.
(vint16mf2_t): Ditto.
(vint16m1_t): Ditto.
(vint16m2_t): Ditto.
(vint16m4_t): Ditto.
(vint16m8_t): Ditto.
(vint32mf2_t): Ditto.
(vint32m1_t): Ditto.
(vint32m2_t): Ditto.
(vint32m4_t): Ditto.
(vint32m8_t): Ditto.
(vint64m1_t): Ditto.
(vint64m2_t): Ditto.
(vint64m4_t): Ditto.
(vint64m8_t): Ditto.
(vfloat32mf2_t): Ditto.
(vfloat32m1_t): Ditto.
(vfloat32m2_t): Ditto.
(vfloat32m4_t): Ditto.
(vfloat32m8_t): Ditto.
(vfloat64m1_t): Ditto.
(vfloat64m4_t): Ditto.
* config/riscv/riscv-vector-builtins.h 
(function_expander::add_output_operand): New function.
(function_expander::add_all_one_mask_operand): Ditto.
(function_expander::add_fixed_operand): Ditto.
(function_expander::vector_mode): Ditto.
(function_base::apply_vl_p): Ditto.
(function_base::can_be_overloaded_p): Ditto.
* config/riscv/riscv-vsetvl.cc (get_vl): Remove restrict of supporting 
AVL is not VLMAX.
* config/riscv/t-riscv: Add include file.

---
gcc/config/riscv/riscv-protos.h   |   1 +
gcc/config/riscv/riscv-v.cc   |  10 +-
.../riscv/riscv-vector-builtins-bases.cc  |  49 +++-
.../riscv/riscv-vector-builtins-bases.h   |   2 +
.../riscv/riscv-vector-builtins-functions.def |   3 +
.../riscv/riscv-vector-builtins-shapes.cc |  38 ++-
.../riscv/riscv-vector-builtins-shapes.h  |   1 +
.../riscv/riscv-vector-builtins-types.def |  49 +++-
gcc/config/riscv/riscv-vector-builtins.cc | 236

Re: Re: [PATCH] RISC-V: Fix incorrect annotation

2022-12-20 Thread 钟居哲

Thanks. I received an email from sourceware:
"You should now have write access to the source control repository for your 
project."
It seems that I can merge code now? However, I still don't know how to do it.

juzhe.zh...@rivai.ai

From: Jeff Law
Date: 2022-12-21 00:02
To: juzhe.zhong
CC: gcc-patches@gcc.gnu.org; kito.ch...@gmail.com; pal...@dabbelt.com
Subject: Re: [PATCH] RISC-V: Fix incorrect annotation

On 12/19/22 17:38, juzhe.zhong wrote:
> Would you mind merging it for me? I can't merge code.
Do you mean you do not have write access to the repository?  If so, that 
can be easily fixed.

https://sourceware.org/cgi-bin/pdw/ps_form.cgi

List me as your sponsor.

jeff

Re: Re: [PATCH] RISC-V: Support VSETVL PASS for RVV support

2022-12-19 Thread 钟居哲

>> ISTM that if you want to run before sched2, then
>> you'd need to introduce dependencies between the vsetvl instructions and
>> the vector instructions that utilize those settings?

Yes, I want to run before sched2 so that the instructions can still be 
scheduled by sched2. I have already introduced dependencies in the
vector instructions so that it won't produce any issues.

>> Formatting note.  For a multi-line conditional, go ahead and use an open
>> paren and the usual indention style.

>>return (INSN_CODE (rinsn) == CODE_FOR_vsetvldi
>>|| INSN_CODE (rinsn) == CODE_FOR_vsetvlsi);

>>  There's other examples in the new file.

>> s/shoule/should/
>> s/propagete/propagate/
>> s/optimzal/optimal/
>> s/PASSes/passes/
>> s/intrinsiscs/intrinsics/
>> s/instrinsics/intrinsics/
>> s/acrocss/across/

Addressed the comments.

>> It'd probably be better to move this into rtl.cc with a prototype in
>> rtl.h rather than have duplicate definitions in gcse.c and the RISC-V
>> backend.  I'm not even entirely sure why we really need it here.
Maybe we can do that when GCC 14 opens?

>> These need function comments.  What isn't clear to me is why we don't
>> just call validate_change?  Is it just so we get the dump info?
Yes, since it's called more than once and I want to dump details in the dump file.
Such dump information is important for debugging.


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2022-12-19 23:44
To: juzhe.zhong; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Support VSETVL PASS for RVV support
I believe Kito already approved.  There's nothing here that is critical, 
just minor cleanups and I'm fine with them being cleaned up as a 
follow-up patch given Kito has already approved this patch.
 
On 12/14/22 00:13, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> This patch is to support VSETVL PASS for RVV support.
> 1.The optimization and performance are guaranteed by LCM (Lazy code motion).
> 2.Based on the RTL_SSA framework to gain better optimization chances.
> 3.Also we do VL/VTYPE, demand information backward propagation across
>blocks by RTL_SSA reverse order in CFG.
> 4.It has been well and fully tested by about 200+ testcases for VLMAX
>AVL situation (Only for VLMAX since we don't have an intrinsics to
>test non-VLMAX).
> 5.Will support AVL model in the next patch
> 
> gcc/ChangeLog:
> 
>  * config.gcc: Add riscv-vsetvl.o.
>  * config/riscv/riscv-passes.def (INSERT_PASS_BEFORE): Add VSETVL 
> PASS location.
>  * config/riscv/riscv-protos.h (make_pass_vsetvl): New function.
>  (enum avl_type): New enum.
>  (get_ta): New function.
>  (get_ma): Ditto.
>  (get_avl_type): Ditto.
>  (calculate_ratio): Ditto.
>  (enum tail_policy): New enum.
>  (enum mask_policy): Ditto.
>  * config/riscv/riscv-v.cc (calculate_ratio): New function.
>  (emit_pred_op): change the VLMAX mov codegen.
>  (get_ta): New function.
>  (get_ma): Ditto.
>  (enum tail_policy): Change enum.
>  (get_prefer_tail_policy): New function.
>  (enum mask_policy): Change enum.
>  (get_prefer_mask_policy): New function.
>  * config/riscv/t-riscv: Add riscv-vsetvl.o
>  * config/riscv/vector.md (): Adjust attribute and pattern for VSETVL 
> PASS.
>  (@vlmax_avl): Ditto.
>  (@vsetvl_no_side_effects): Delete.
>  (vsetvl_vtype_change_only): New MD pattern.
>  (@vsetvl_discard_result): Ditto.
>  * config/riscv/riscv-vsetvl.cc: New file.
>  * config/riscv/riscv-vsetvl.h: New file.
So a high level note.  Once you've inserted your vsetvl instructions, you 
can't have further code motion, correct?  So doesn't this potentially 
have a poor interaction with something like speculative code motion as 
performed by sched?   ISTM that if you want to run before sched2, then 
you'd need to introduce dependencies between the vsetvl instructions and 
the vector instructions that utilize those settings?
 
I can envision wanting to schedule the vsetvl instructions so that they 
bubble up slightly from their insertion points to avoid stalls or allow 
the vector units to start executing earlier.  Is that what's driving the 
current pass placement?  If not would it make more sense to use the 
late prologue/epilogue hooks that Richard Sandiford posted recently (I'm 
not sure they're committed yet).
 
 
 
 
 
 
> +
> +static bool
> +loop_basic_block_p (const basic_block cfg_bb)
> +{
> +  return JUMP_P (BB_END (cfg_bb)) && any_condjump_p (BB_END (cfg_bb));
> +}
The name seems poor here -- AFAICT this has nothing to do with loops. 
It's just a test that the end of a block is a conditional jump.  I'm 
pretty sure we could extract BB_END (cfg_bb) and use an existing routine 
instead of writing our own.  I'd suggest peeking at jump.cc to see if 
there's something already suitable.
 
  +
> +/* Return true if it is vsetvldi or

Re: Re: [PATCH] RISC-V: Fix RVV mask mode size

2022-12-16 Thread 钟居哲

>> More likely than not you end up loading a larger quantity with the high
>> bits zero'd.  Interesting that we're using a packed model.  I'd been
>> told it was fairly expensive to implement in hardware relative to the
>> cost of implementing the sparse model.

>> I'm a bit confused by this.  GCC can support single bit bools, though
>> ports often extend them to 8 bits or more for computational efficiency
>> purposes.  At least that's the case in general.  Is there something
>> particularly special about masks & bools that's causing problems?
I am not sure I am on the same page as you. I don't understand what the
sparse model you mentioned is. The only thing I do in this patch is change the 
BYTESIZE of the mask modes, VNx1BI for example,
to the actual byte size of VNx1BI (originally I adjusted all mask modes to the same size as 
VNx8QImode, like LLVM). 
When I print GET_MODE_SIZE (VNx1BI), the value is the same as for VNx1QImode, so I 
assume that is because GCC models 1 bool the same as 1 QI???
Actually I am not sure, but I am sure that after this patch VNx1BI is adjusted to a smaller 
size.

Adjusting mask modes as smaller size always beneficial, since we can use vlm && 
vsm in register spilling, it can reduce the memory consuming and
load store hardware bandwidth.

Unlike LLVM, LLVM make each fractional vector and mask vector same size as LMUL 
=1 so they use vl1r/vs1r to do the register spilling which is not
optimal.
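
As a rough illustration of the spill-size argument above, the following C++ sketch compares the bytes moved by a mask load/store (vlm.v/vsm.v, which transfer ceil(vl/8) bytes) with a whole-register move (vl1r.v/vs1r.v, which transfer VLEN/8 bytes). The function names and the example VLEN are assumptions made for illustration only; they do not come from the patch.

```cpp
#include <cstddef>

// Bytes transferred by vlm.v/vsm.v for a mask covering `nelts` elements:
// one bit per element, rounded up to whole bytes.
constexpr std::size_t mask_spill_bytes(std::size_t nelts)
{
  return (nelts + 7) / 8;
}

// Bytes transferred by vl1r.v/vs1r.v: the whole vector register,
// whatever the number of live mask bits.
constexpr std::size_t whole_reg_spill_bytes(std::size_t vlen_bits)
{
  return vlen_bits / 8;
}
```

With a hypothetical VLEN of 128 bits, a mask covering 4 elements would spill 1 byte with vsm.v against 16 bytes with vs1r.v.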


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2022-12-17 09:53
To: 钟居哲; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Fix RVV mask mode size
 
 
On 12/16/22 18:44, 钟居哲 wrote:
> Yes, VNx4DF only has 4 bits in its mask mode in the case of load and store.
> For example, with vlm or vsm we will load/store 8 bits? (I am not sure the 
> hardware can load/store 4 bits, but I am sure it definitely does not load/store 
> the whole register size.)
More likely than not you end up loading a larger quantity with the high 
bits zero'd.  Interesting that we're using a packed model.  I'd been 
told it was fairly expensive to implement in hardware relative to the 
cost of implementing the sparse model.
 
> So ideally it should be modeled more accurately. However, since GCC assumes 
> that 1 BOOL is 1 byte, the only thing I do is model the mask mode as 
> small as possible.
> Maybe in the future I can support 1 bit per BOOL? I am not sure, since 
> it would require changing the GCC framework.
I'm a bit confused by this.  GCC can support single bit bools, though 
ports often extend them to 8 bits or more for computational efficiency 
purposes.  At least that's the case in general.  Is there something 
particularly special about masks & bools that's causing problems?
 
Jeff

Re: Re: [PATCH] RISC-V: Fix RVV machine mode attribute configuration

2022-12-16 Thread 钟居哲

Actually, I haven't checked or tested the HF modes carefully since they are disabled.
Kito asked me to disable all HF modes since zvfhmin is not ratified and GCC
doesn't allow any unratified ISA. You can see in vector-iterators.md that the
supported RVV modes include QI, HI, SI, DI, SF, and DF but exclude HF and BF.



juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2022-12-17 09:48
To: juzhe.zhong; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Fix RVV machine mode attribute configuration
 
 
On 12/14/22 00:01, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> The attribute configuration of each machine mode was added in the previous 
> patch.
> I noticed some of them are not correct during VSETVL PASS testing.
> Correct them in this single patch now.
> 
> gcc/ChangeLog:
> 
>  * config/riscv/riscv-vector-switch.def (ENTRY): Correct attributes.
> 
 
 
 
> @@ -121,7 +121,7 @@ ENTRY (VNx2HI, true, LMUL_1, 16, LMUL_F2, 32)
>   ENTRY (VNx1HI, true, LMUL_F2, 32, LMUL_F4, 64)
>   
>   /* TODO:Disable all FP16 vector, enable them when 'zvfh' is supported.  */
> -ENTRY (VNx32HF, false, LMUL_8, 2, LMUL_RESERVED, 0)
> +ENTRY (VNx32HF, false, LMUL_RESERVED, 0, LMUL_8, 2)
Is there any value in making VNx32HF dependent on TARGET_MIN_VLEN > 32 
like we're doing for VNx32HI?   In the past I've found it useful to have 
HI, HF, BF behave identically as much as possible.
 
You call.  The patch is OK either way.
 
jeff

Re: Re: [PATCH] RISC-V: Fix RVV mask mode size

2022-12-16 Thread 钟居哲

Yes, VNx4DF only has 4 bits in its mask mode in the case of load and store.
For example, with vlm or vsm we will load/store 8 bits? (I am not sure the hardware can 
load/store 4 bits, but I am sure it definitely does not load/store the whole register 
size.)
So ideally it should be modeled more accurately. However, since GCC assumes that 1 
BOOL is 1 byte, the only thing I do is model the mask mode as small as 
possible.
Maybe in the future I can support 1 bit per BOOL? I am not sure, since it 
would require changing the GCC framework.

juzhe.zh...@rivai.ai

From: Jeff Law
Date: 2022-12-17 04:22
To: juzhe.zhong; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Fix RVV mask mode size

On 12/13/22 23:48, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> This patch fixes the RVV mask mode sizes. Mask mode sizes are currently adjusted
> to a whole RVV register (LMUL = 1), which not only ties each mask type, for
> example vbool32_t, to vint8m1_t but also increases memory consumption.
> 
> I noticed this issue during development of VSETVL PASS. Since it is not part of
> VSETVL support, I separate it into a single fix patch now.
> 
> gcc/ChangeLog:
> 
>  * config/riscv/riscv-modes.def (ADJUST_BYTESIZE): Reduce RVV mask 
> mode size.
>  * config/riscv/riscv.cc (riscv_v_adjust_bytesize): New function.
>  (riscv_modes_tieable_p): Don't tie mask modes which will create 
> issue.
>  * config/riscv/riscv.h (riscv_v_adjust_bytesize): New function.
So I haven't really studied the masking model for RVV (yet).  But 
there's two models that I'm generally aware of.

One model has a bit per element in the vector we're operating on.  So a 
V4DF will have 4 bits in the mask.  I generally call this the dense or 
packed model.

The other model has a bit for every element for the maximal number of 
elements that can ever appear in a vector.  So if we support an element 
length of 8bits and a 1kbit vector, then the sparse model would have 128 
bits regardless of the size of the object being operated on.  So we'd 
still have 128 bits for V4DF, but the vast majority would be don't cares.
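
To make the two mask models above concrete, here is a minimal C++ sketch of how many mask bits each one needs. The function names are hypothetical and chosen only for illustration; the numbers reproduce the example in the paragraph above (8-bit minimum element, 1 kbit vector, V4DF).

```cpp
#include <cstddef>

// Dense/packed model: one mask bit per element actually present
// in the vector being operated on.
constexpr std::size_t dense_mask_bits(std::size_t nelts)
{
  return nelts;
}

// Sparse model: one mask bit per element of the smallest supported
// element width, independent of the vector being operated on.
constexpr std::size_t sparse_mask_bits(std::size_t vector_bits,
                                       std::size_t min_element_bits)
{
  return vector_bits / min_element_bits;
}
```

For V4DF the dense model gives 4 bits, while the sparse model with a 1024-bit vector and 8-bit minimum element gives 128 bits.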

ISTM that you're trying to set the mode size to the smallest possible 
which would seem to argue that you want the dense/packed mask model. 
Does that actually match what the hardware does?  If not, then don't we 
need to convert back and forth?

Or maybe I'm missing something here?!?

Jeff

Re: Re: [PATCH] RISC-V: Add testcases for VSETVL PASS

2022-12-16 Thread 钟居哲

Register allocation (RA) doesn't affect the assembler checks since I relax the 
registers in the assembler checks;
each assembler check has its own goal. For example:

The code looks like this:
+void foo2 (void * restrict in, void * restrict out, int n)
+{
+  for (int i = 0; i < n; i++)
+{
+  vuint16mf4_t v = *(vuint16mf4_t*)(in + i);
+  *(vuint16mf4_t*)(out + i) = v;
+}
+}
Assembler check:
scan-assembler-times 
{vsetvli\s+(?:ra|[sgtf]p|t[0-6]|s[0-9]|s10|s11|a[0-7]),\s*zero,\s*e16,\s*mf4,\s*t[au],\s*m[au]\s+\.L[0-9]\:\s+vle16\.v\s+(?:v[0-9]|v[1-2][0-9]|v3[0-1]),0\s*\((?:ra|[sgtf]p|t[0-6]|s[0-9]|s10|s11|a[0-7])\
I don't care which vector register is used since I relax the register in the 
assembler check: (?:v[0-9]|v[1-2][0-9]|v3[0-1]), which means any vector register 
v0-v31.
I also relax the scalar register: 
(?:ra|[sgtf]p|t[0-6]|s[0-9]|s10|s11|a[0-7]), so it could be any of x0 - x31.
The only strict check makes sure the vsetvl is hoisted outside the 
loop, meaning the vsetvl is located before the label .L[0-9]:
vsetvli\s+(?:ra|[sgtf]p|t[0-6]|s[0-9]|s10|s11|a[0-7]),\s*zero,\s*e16,\s*mf4,\s*t[au],\s*m[au]\s+\.L[0-9]
You can see the end of the pattern is \s+\.L[0-9], which makes sure VSETVL PASS 
successfully does the optimization that hoists the vsetvl instruction outside the 
loop.
I tried to use check-function-bodies but it fails since it cannot recognize 
the label, which is the most important part for such cases.
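
The relaxed patterns above can be tried outside DejaGnu. The sketch below is an assumption-laden illustration: it uses C++ std::regex (ECMAScript syntax), which should behave the same as the testsuite's Tcl regular expressions for the constructs used here, and the function names are made up for this example. The register patterns and the vsetvli pattern are copied from this message.

```cpp
#include <regex>
#include <string>

// Register patterns copied from the assembler check in this message.
static const char *const VREG = R"re((?:v[0-9]|v[1-2][0-9]|v3[0-1]))re";
static const char *const XREG =
  R"re((?:ra|[sgtf]p|t[0-6]|s[0-9]|s10|s11|a[0-7]))re";

// True if NAME is accepted by the relaxed vector-register pattern.
bool matches_vreg(const std::string &name)
{
  return std::regex_match(name, std::regex(VREG));
}

// True if NAME is accepted by the relaxed scalar-register pattern.
bool matches_xreg(const std::string &name)
{
  return std::regex_match(name, std::regex(XREG));
}

// True if ASM_TEXT contains a vsetvli (e16, mf4) that is followed only by
// whitespace and then a local label, i.e. it sits just before the loop.
bool vsetvli_before_label(const std::string &asm_text)
{
  const std::string pattern = std::string(R"re(vsetvli\s+)re") + XREG
    + R"re(,\s*zero,\s*e16,\s*mf4,\s*t[au],\s*m[au]\s+\.L[0-9])re";
  return std::regex_search(asm_text, std::regex(pattern));
}
```

Any register the allocator picks from the listed names satisfies the check, while a vsetvli emitted after the label (inside the loop) does not match the last pattern.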


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2022-12-17 04:07
To: juzhe.zhong; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Add testcases for VSETVL PASS
 
 
On 12/14/22 01:09, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> gcc/testsuite/ChangeLog:
> 
>  * gcc.target/riscv/rvv/rvv.exp: Adjust to enable tests for VSETVL 
> PASS.
>  * gcc.target/riscv/rvv/vsetvl/dump-1.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-1.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-10.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-11.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-12.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-13.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-14.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-15.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-16.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-17.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-18.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-19.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-2.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-3.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-4.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-5.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-6.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-7.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-8.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_block-9.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_vtype-1.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_vtype-2.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_vtype-3.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_vtype-4.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_vtype-5.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_vtype-6.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_vtype-7.c: New test.
>  * gcc.target/riscv/rvv/vsetvl/vlmax_single_vtype-8.c: New test.
So it looks like the assembler strings you're searching for are highly 
specific (across all 5 testsuite patches).  How sensitive do we expect 
these tests to be to things like register allocation giving us different 
registers and such?  I'd hate to be in a position where we're constantly 
needing to update these tests because the output is changing in 
unimportant ways.
 
Jeff

Re: Re: [PATCH] RISC-V: Remove unit-stride store from ta attribute

2022-12-16 Thread 钟居哲

Yes, vector stores don't care about policy, whether mask or tail.
Removing it gives VSETVL PASS more optimization chances
since VSETVL PASS has backward demand fusion.

For example:
vadd tama
vse.v
VSETVL PASS will choose to set tama for vse.v

vadd tumu
vse.v
VSETVL PASS will choose to set tumu for vse.v
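
The effect of the two examples above can be sketched with a deliberately simplified model. This is not the algorithm used by VSETVL PASS: it only scans a straight-line sequence, ignores SEW, LMUL and AVL, and all names (Policy, count_vsetvl) are hypothetical. It merely shows why an instruction with no policy demand never forces an extra vsetvl.

```cpp
#include <vector>

// Tail/mask policy an instruction demands.  Any means "no demand",
// which is how this sketch models a vector store after the patch.
enum class Policy { Any, TAMA, TUMU };

// Count the vsetvl instructions needed for a straight-line sequence
// when only the policy is considered.
int count_vsetvl(const std::vector<Policy> &demands)
{
  Policy current = Policy::Any; // nothing configured yet
  int count = 0;
  for (Policy d : demands)
    {
      if (d == Policy::Any || d == current)
        continue; // compatible with what is already configured
      current = d; // a new vsetvl is required for this demand
      ++count;
    }
  return count;
}
```

In this model {TUMU, Any} (vadd tumu followed by vse.v) needs one vsetvl, whereas {TUMU, TAMA}, a store that carried its own tail-agnostic demand, would need two.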



juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2022-12-17 04:01
To: juzhe.zhong; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Remove unit-stride store from ta attribute
 
 
On 12/14/22 04:36, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> Since store instructions don't care about tail policy, we remove
> vste from "ta" attribute. Hence, we could have more fusion chances
> and better optimization.
> 
> gcc/ChangeLog:
> 
>  * config/riscv/vector.md: Remove vste.
Just to confirm that I understand the basic model.  Vector stores only 
update active elements, thus they don't care about tail policy, right?
 
Assuming that's the case, then this is OK.
 
jeff

Re: Re: [PATCH] RISC-V: Add attributes for VSETVL PASS

2022-11-28 Thread 钟居哲

Thanks. 

I think we can still continue the RVV feature review process in the GitHub branch
that we have talked about. I will still send the patches that have been reviewed
to the GCC mailing list but not merge them right now; we can wait until stage1 is 
open.

Is that a good idea? I don't want RVV support in GCC to stop here, since 
LLVM already has
all RVV support and GCC has been far behind LLVM for a long time in terms of RVV.

juzhe.zh...@rivai.ai

From: Palmer Dabbelt
Date: 2022-11-29 02:02
To: jeffreyalaw
CC: juzhe.zhong; gcc-patches; Kito Cheng
Subject: Re: [PATCH] RISC-V: Add attributes for VSETVL PASS
On Mon, 28 Nov 2022 08:44:16 PST (-0800), jeffreya...@gmail.com wrote:
>
> On 11/28/22 07:14, juzhe.zh...@rivai.ai wrote:
>> From: Ju-Zhe Zhong 
>>
>> gcc/ChangeLog:
>>
>>  * config/riscv/riscv-protos.h (enum vlmul_type): New enum.
>>  (get_vlmul): New function.
>>  (get_ratio): Ditto.
>>  * config/riscv/riscv-v.cc (struct mode_vtype_group): New struct.
>>  (ENTRY): Adapt for attributes.
>>  (enum vlmul_type): New enum.
>>  (get_vlmul): New function.
>>  (get_ratio): New function.
>>  * config/riscv/riscv-vector-switch.def (ENTRY): Adapt for 
>> attributes.
>>  * config/riscv/riscv.cc (ENTRY): Ditto.
>>  * config/riscv/vector.md (false,true): Add attributes.
>
> I'm tempted to push this into the next stage1 given its arrival after
> stage1 close, but if the wider RISC-V maintainers want to see it move
> forward, I don't object strongly.

I'm also on the fence here: the RISC-V V implementation is a huge 
feature so it's a bit awkward to land it this late in the release, but 
on the flip side it's a very important feature.  It's complicated enough 
that whatever our first release is will probably be a mess, so I'd 
prefer to just get that pain out of the way sooner rather than later.  
There's no V hardware available now and nothing concretely announced so 
any users are probably going to be pretty advanced, but having at least 
the basics of V in there will allow us to kick the tires on the rest of 
the stack a lot more easily.

There's obviously risk to taking something this late in the process.  We 
don't have anything else that triggers the vectorizer, so I think it 
should be separable enough that risk is manageable.

Not sure if Kito wants to chime in, though.

> I'm curious about the model you're using.  Is it going to be something
> similar to mode switching?  That's the first mental model that comes to
> mind.  Essentially we determine the VL needed for every chunk of code,
> then we do an LCM like algorithm to find the optimal placement points
> for VL sets to minimize the number of VL sets across all the paths
> through the CFG.  Never in a million years would I have expected we'd be
> considering reusing that code.
>
>
> Jeff

Re: Re: [PATCH] RISC-V: Add duplicate vector support.

2022-11-28 Thread 钟居哲

OK.



juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2022-11-29 00:49
To: juzhe.zhong; gcc-patches
CC: kito.cheng
Subject: Re: [PATCH] RISC-V: Add duplicate vector support.
 
On 11/25/22 09:06, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
>
> gcc/ChangeLog:
>
>  * config/riscv/constraints.md (Wdm): New constraint.
>  * config/riscv/predicates.md (direct_broadcast_operand): New 
> predicate.
>  * config/riscv/riscv-protos.h (RVV_VLMAX): New macro.
>  (emit_pred_op): Refine function.
>  * config/riscv/riscv-selftests.cc (run_const_vector_selftests): New 
> function.
>  (run_broadcast_selftests): Ditto.
>  (BROADCAST_TEST): New tests.
>  (riscv_run_selftests): More tests.
>  * config/riscv/riscv-v.cc (emit_pred_move): Refine function.
>  (emit_vlmax_vsetvl): Ditto.
>  (emit_pred_op): Ditto.
>  (expand_const_vector): New function.
>  (legitimize_move): Add constant vector support.
>  * config/riscv/riscv.cc (riscv_print_operand): New asm print rule 
> for const vector.
>  * config/riscv/riscv.h (X0_REGNUM): New macro.
>  * config/riscv/vector-iterators.md: New attribute.
>  * config/riscv/vector.md (vec_duplicate): New pattern.
>  (@pred_broadcast): New pattern.
>
> gcc/testsuite/ChangeLog:
>
>  * gcc.target/riscv/rvv/base/dup-1.c: New test.
>  * gcc.target/riscv/rvv/base/dup-2.c: New test.
 
I think this should wait for the next stage1 cycle.
 
jeff

Re: Re: [PATCH] RISC-V: Remove tail && mask policy operand for vmclr, vmset, vmld, vmst

2022-11-28 Thread 钟居哲

Yes, it's a cleanup.



juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2022-11-29 00:48
To: juzhe.zhong; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Remove tail && mask policy operand for vmclr, 
vmset, vmld, vmst
 
On 11/28/22 07:21, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
>
> Since mask instructions don't need a policy, remove it so the patterns look 
> reasonable.
> gcc/ChangeLog:
>
>  * config/riscv/vector.md: Remove TA && MA operands.
 
Does this fix a known bug or is it just a cleanup?   I think the latter, 
but I want to be sure.
 
 
 
Jeff

Re: Re: [PATCH] RISC-V: Add attributes for VSETVL PASS

2022-11-28 Thread 钟居哲

>> I'm tempted to push this into the next stage1 given its arrival after
>> stage1 close, but if the wider RISC-V maintainers want to see it move
>> forward, I don't object strongly.

OK, let's save these patches and merge them when GCC 14 stage1 is open.
Would you mind telling me when stage 1 will open?

>> I'm curious about the model you're using.  Is it going to be something
>> similar to mode switching?  That's the first mental model that comes to
>> mind.  Essentially we determine the VL needed for every chunk of code,
>> then we do an LCM like algorithm to find the optimal placement points
>> for VL sets to minimize the number of VL sets across all the paths
>> through the CFG.  Never in a million years would I have expected we'd be
>> considering reusing that code.

Yes, I implemented VSETVL PASS with the LCM algorithm and the RTL_SSA framework.
Actually, Kito and I have spent a month on VSETVL PASS and we have 
made progress. We have tested it with a lot of testcases, and it turns out that our 
implementation
of VSETVL PASS in GCC has much better codegen than the VSETVL implementation
on the LLVM side in many different situations because of LCM. I am working on 
cleaning up the code
and hopefully you will see it soon in the next patch.

Thanks

juzhe.zh...@rivai.ai

From: Jeff Law
Date: 2022-11-29 00:44
To: juzhe.zhong; gcc-patches
CC: kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Add attributes for VSETVL PASS

On 11/28/22 07:14, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
>
> gcc/ChangeLog:
>
>  * config/riscv/riscv-protos.h (enum vlmul_type): New enum.
>  (get_vlmul): New function.
>  (get_ratio): Ditto.
>  * config/riscv/riscv-v.cc (struct mode_vtype_group): New struct.
>  (ENTRY): Adapt for attributes.
>  (enum vlmul_type): New enum.
>  (get_vlmul): New function.
>  (get_ratio): New function.
>  * config/riscv/riscv-vector-switch.def (ENTRY): Adapt for attributes.
>  * config/riscv/riscv.cc (ENTRY): Ditto.
>  * config/riscv/vector.md (false,true): Add attributes.

I'm tempted to push this into the next stage1 given its arrival after 
stage1 close, but if the wider RISC-V maintainers want to see it move 
forward, I don't object strongly.

I'm curious about the model you're using.  Is it going to be something 
similar to mode switching?  That's the first mental model that comes to 
mind.  Essentially we determine the VL needed for every chunk of code, 
then we do an LCM like algorithm to find the optimal placement points 
for VL sets to minimize the number of VL sets across all the paths 
through the CFG.  Never in a million years would I have expected we'd be 
considering reusing that code.

Jeff

Re: Re: [PATCH] RISC-V: Fix RVV testcases.

2022-10-31 Thread 钟居哲

These cases actually don't care about -mabi; they just need 'v' in -march.
Can you tell me how to fix these testcases for "fails on targets without 
ilp32d"?
These failures are bogus: if you specify -mabi=ilp32d when 
using a GNU toolchain that was built with, say, "--arch=ilp32",
it will fail and report that there is no "ilp32d". So I fixed these testcases by replacing 
"ilp32d" with "ilp32".
Thank you.



juzhe.zh...@rivai.ai
 
From: Palmer Dabbelt
Date: 2022-11-01 06:30
To: gcc-patches
CC: juzhe.zhong; gcc-patches; schwab; Kito Cheng
Subject: Re: [PATCH] RISC-V: Fix RVV testcases.
On Mon, 31 Oct 2022 15:00:49 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
>
> On 10/30/22 19:40, juzhe.zh...@rivai.ai wrote:
>> From: Ju-Zhe Zhong 
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/riscv/rvv/base/abi-2.c: Change ilp32d to ilp32.
>>  * gcc.target/riscv/rvv/base/abi-3.c: Ditto.
>>  * gcc.target/riscv/rvv/base/abi-4.c: Ditto.
>>  * gcc.target/riscv/rvv/base/abi-5.c: Ditto.
>>  * gcc.target/riscv/rvv/base/abi-6.c: Ditto.
>>  * gcc.target/riscv/rvv/base/abi-7.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-1.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-10.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-11.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-12.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-13.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-2.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-3.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-4.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-5.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-6.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-7.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-8.c: Ditto.
>>  * gcc.target/riscv/rvv/base/mov-9.c: Ditto.
>>  * gcc.target/riscv/rvv/base/pragma-1.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-1.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-2.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-3.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-4.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-5.c: Ditto.
>>  * gcc.target/riscv/rvv/base/user-6.c: Ditto.
>>  * gcc.target/riscv/rvv/base/vsetvl-1.c: Ditto.
>
> I'm pretty new to the RISC-V world, but don't some of the cases
> (particularly the abi-* tests) verify that the ABI specification does
> not override the arch specification WRT availability of types?
 
I think that depends on what the ABI specification says here, as it 
could really go many ways.  Most of the RISC-V targets just use -mabi to 
control how arguments end up passed in functions, not the availability 
of types.  I can't find the ABI spec for these, though, so I'm not 
entirely sure how they're supposed to work...
 
That said, I'm not sure why we need any of these -mabi changes?  Just 
from spot checking some of the examples it doesn't look like there 
should be any functional difference between ilp32 and ilp32d here: 
-march is always specified so ilp32d looks valid.  If this is just to 
fix the "fails on targets without ilp32d" [1], then IMO it's not really 
a fix: we're essentially just changing that to "fails on targets without 
ilp32", we either need some sort of automatic march/mabi setting or a 
dependency on the available multilibs.  Some of these can probably 
avoid linking, but we'll have execution tests at some point.
 
1: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604644.html

Re: Re: [PATCH] RISC-V: Fix RVV testcases.

2022-10-31 Thread 钟居哲

These testcases do not depend on the ABI specification.
I picked the minimum ABI setting so that they won't fail.
The naming of the abi-* tests may be confusing; I can change the naming next 
time.


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2022-11-01 06:00
To: juzhe.zhong; gcc-patches
CC: schwab; kito.cheng
Subject: Re: [PATCH] RISC-V: Fix RVV testcases.
 
On 10/30/22 19:40, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
>
> gcc/testsuite/ChangeLog:
>
>  * gcc.target/riscv/rvv/base/abi-2.c: Change ilp32d to ilp32.
>  * gcc.target/riscv/rvv/base/abi-3.c: Ditto.
>  * gcc.target/riscv/rvv/base/abi-4.c: Ditto.
>  * gcc.target/riscv/rvv/base/abi-5.c: Ditto.
>  * gcc.target/riscv/rvv/base/abi-6.c: Ditto.
>  * gcc.target/riscv/rvv/base/abi-7.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-1.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-10.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-11.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-12.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-13.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-2.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-3.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-4.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-5.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-6.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-7.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-8.c: Ditto.
>  * gcc.target/riscv/rvv/base/mov-9.c: Ditto.
>  * gcc.target/riscv/rvv/base/pragma-1.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-1.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-2.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-3.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-4.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-5.c: Ditto.
>  * gcc.target/riscv/rvv/base/user-6.c: Ditto.
>  * gcc.target/riscv/rvv/base/vsetvl-1.c: Ditto.
 
I'm pretty new to the RISC-V world, but don't some of the cases 
(particularly the abi-* tests) verify that the ABI specification does 
not override the arch specification WRT availability of types?
 
 
Jeff

Re: Re: [PATCH][RFT] Vectorization of first-order recurrences

2022-10-10 Thread 钟居哲

RVV also doesn't have a two-input permutation instruction (unlike ARM SVE, which has 
tbl instructions), and 
RVV needs about 4 instructions to handle this permutation; it still improves 
performance a lot.
I think the backend should handle this, because the first-order recurrence loop 
vectorizer always generates the
special permutation index = [vl-1, vl, vl+1, ...] (this index sequence 
pattern just follows LLVM). 
If the backend doesn't want this permutation to happen, it can just recognize this index 
pattern and disable it.
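
For readers unfamiliar with the index pattern, the following C++ sketch shows what the permutation computes on plain arrays. It assumes the usual two-input permute convention in which indices select from the concatenation of the previous-iteration vector and the current vector; the function names are hypothetical and this is scalar code, not how any backend would implement it.

```cpp
#include <cstddef>
#include <vector>

// Permutation indices into the concatenation {prev, cur}:
// vl-1, vl, vl+1, ..., 2*vl-2.
std::vector<std::size_t> recurrence_indices(std::size_t vl)
{
  std::vector<std::size_t> idx(vl);
  for (std::size_t i = 0; i < vl; ++i)
    idx[i] = vl - 1 + i;
  return idx;
}

// Apply the permutation.  PREV and CUR must have the same length.
// The result is the last element of PREV followed by all but the
// last element of CUR.
std::vector<int> recurrence_permute(const std::vector<int> &prev,
                                    const std::vector<int> &cur)
{
  const std::size_t vl = cur.size();
  const std::vector<std::size_t> idx = recurrence_indices(vl);
  std::vector<int> out(vl);
  for (std::size_t i = 0; i < vl; ++i)
    out[i] = idx[i] < vl ? prev[idx[i]] : cur[idx[i] - vl];
  return out;
}
```

With prev = {1, 2, 3, 4} and cur = {5, 6, 7, 8}, the indices are {3, 4, 5, 6} and the result is {4, 5, 6, 7}: each lane sees the value of the preceding scalar iteration.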

juzhe.zh...@rivai.ai

From: Andrew Stubbs
Date: 2022-10-10 21:57
To: Richard Biener; gcc-patches@gcc.gnu.org
CC: richard.sandif...@arm.com; juzhe.zh...@rivai.ai
Subject: Re: [PATCH][RFT] Vectorization of first-order recurrences
On 10/10/2022 12:03, Richard Biener wrote:
> The following picks up the prototype by Ju-Zhe Zhong for vectorizing
> first order recurrences.  That solves two TSVC missed optimization PRs.
> 
> There's a new scalar cycle def kind, vect_first_order_recurrence
> and its handling of the backedge value vectorization is complicated
> by the fact that the vectorized value isn't the PHI but instead
> a (series of) permute(s) shifting in the recurring value from the
> previous iteration.  I've implemented this by creating both the
> single vectorized PHI and the series of permutes when vectorizing
> the scalar PHI but leave the backedge values in both unassigned.
> The backedge values are (for the testcases) computed by a load
> which is also the place after which the permutes are inserted.
> That placement also restricts the cases we can handle (without
> resorting to code motion).
> 
> I added both costing and SLP handling though SLP handling is
> restricted to the case where a single vectorized PHI is enough.
> 
> Missing is epilogue handling - while prologue peeling would
> be handled transparently by adjusting iv_phi_p the epilogue
> case doesn't work with just inserting a scalar LC PHI since
> that a) keeps the scalar load live and b) that load is the
> wrong one, it has to be the last, much like when we'd vectorize
> the LC PHI as live operation.  Unfortunately LIVE
> compute/analysis happens too early before we decide on
> peeling.  When using fully masked loop vectorization the
> vect-recurr-6.c works as expected though.
> 
> I have tested this on x86_64 for now, but since epilogue
> handling is missing there's probably no practical cases.
> My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c
> just fine but I didn't feel like running SPEC within SDE nor
> is the WHILE_ULT patch complete enough.  Builds of SPEC 2k7
> with fully masked loops succeed (minus three cases of
> PR107096, caused by my WHILE_ULT prototype).
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> Testing with SVE, GCN or RVV appreciated, ideas how to cleanly
> handle epilogues welcome.

The testcases all produce correct code on GCN and pass the execution tests.

The code isn't terribly optimal because we don't have a two-input 
permutation instruction, so we permute each half separately and 
vec_merge the results. In this case the first vector is always a no-op 
permutation so that's wasted cycles. We'd really want a vector rotate 
and write-lane (or the other way around). I think the special-case 
permutations can be recognised and coded into the backend, but I don't 
know if we can easily tell that the first vector is just a bunch of 
duplicates, when it's not constant.

Andrew

Re: [PATCH] RISC-V: move struct vector_type_info from .h to .cc.

2022-10-10 Thread 钟居哲

Please ignore this patch. It's not finished.
The correct and complete patch is here:
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603148.html 
Thanks.



juzhe.zh...@rivai.ai
 
From: juzhe.zhong
Date: 2022-10-10 21:49
To: gcc-patches
CC: kito.cheng; Ju-Zhe Zhong
Subject: [PATCH] RISC-V: move struct vector_type_info from *.h to *.cc.
From: Ju-Zhe Zhong 
 
gcc/ChangeLog:
 
* config/riscv/riscv-vector-builtins.cc (struct vector_type_info): Move 
from riscv-vector-builtins.h.
* config/riscv/riscv-vector-builtins.h (struct vector_type_info): Move 
to riscv-vector-builtins.cc.
 
---
gcc/config/riscv/riscv-vector-builtins.cc | 16 
gcc/config/riscv/riscv-vector-builtins.h  | 16 
2 files changed, 16 insertions(+), 16 deletions(-)
 
diff --git a/gcc/config/riscv/riscv-vector-builtins.cc 
b/gcc/config/riscv/riscv-vector-builtins.cc
index 0096e32f5e4..d7b567a7ba1 100644
--- a/gcc/config/riscv/riscv-vector-builtins.cc
+++ b/gcc/config/riscv/riscv-vector-builtins.cc
@@ -50,6 +50,22 @@ using namespace riscv_vector;
namespace riscv_vector {
+/* Static information about each vector type.  */
+struct vector_type_info
+{
+  /* The name of the type as declared by riscv_vector.h
+ which is recommend to use. For example: 'vint32m1_t'.  */
+  const char *name;
+
+  /* ABI name of vector type. The type is always available
+ under this name, even when riscv_vector.h isn't included.
+ For example:  '__rvv_int32m1_t'.  */
+  const char *abi_name;
+
+  /* The C++ mangling of ABI_NAME.  */
+  const char *mangled_name;
+};
+
/* Information about each RVV type.  */
static CONSTEXPR const vector_type_info vector_types[] = {
#define DEF_RVV_TYPE(USER_NAME, NCHARS, ABI_NAME, ARGS...)\
diff --git a/gcc/config/riscv/riscv-vector-builtins.h 
b/gcc/config/riscv/riscv-vector-builtins.h
index 6ca0b073964..524fb0b02c2 100644
--- a/gcc/config/riscv/riscv-vector-builtins.h
+++ b/gcc/config/riscv/riscv-vector-builtins.h
@@ -26,22 +26,6 @@ namespace riscv_vector {
/* This is for segment instructions.  */
const unsigned int MAX_TUPLE_SIZE = 8;
-/* Static information about each vector type.  */
-struct vector_type_info
-{
-  /* The name of the type as declared by riscv_vector.h
- which is recommend to use. For example: 'vint32m1_t'.  */
-  const char *user_name;
-
-  /* ABI name of vector type. The type is always available
- under this name, even when riscv_vector.h isn't included.
- For example:  '__rvv_int32m1_t'.  */
-  const char *abi_name;
-
-  /* The C++ mangling of ABI_NAME.  */
-  const char *mangled_name;
-};
-
/* Enumerates the RVV types, together called
"vector types" for brevity.  */
enum vector_type_index
-- 
2.36.1

Re: [committed] RISC-V: Add riscv_vector.h wrapper in testsuite to prevent pull in stdint.h from C library

2022-10-10 Thread 钟居哲

LGTM.

juzhe.zh...@rivai.ai

From: Kito Cheng
Date: 2022-10-10 21:14
To: gcc-patches; kito.cheng; christoph.muellner; juzhe.zhong
CC: Kito Cheng
Subject: [committed] RISC-V: Add riscv_vector.h wrapper in testsuite to prevent 
pull in stdint.h from C library
A RISC-V linux/glibc toolchain will get a header-file-not-found error when including
stdint.h if multilib is not enabled, because some header files will
try to include gnu/stubs-.h from the system; however, it is only
generated when multilib is enabled.

In order to prevent that, we introduce a wrapper for riscv_vector.h,
include stdint-gcc.h rather than the default stdint.h.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/riscv_vector.h: New.

Reported-by: Christoph Müllner 
Tested-by: Christoph Müllner 
Reviewed-by: Ju-Zhe Zhong 
---
.../gcc.target/riscv/rvv/base/riscv_vector.h  | 11 +++
1 file changed, 11 insertions(+)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/riscv_vector.h

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/riscv_vector.h 
b/gcc/testsuite/gcc.target/riscv/rvv/base/riscv_vector.h
new file mode 100644
index 000..fbb4858fc86
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/riscv_vector.h
@@ -0,0 +1,11 @@
+/* Wrapper of riscv_vector.h, prevent riscv_vector.h including stdint.h from
+   C library, that might cause problem on testing RV32 related testcase when
+   we disable multilib.  */
+#ifndef _RISCV_VECTOR_WRAP_H
+
+#define _GCC_WRAP_STDINT_H
+#include "stdint-gcc.h"
+#include_next 
+#define _RISCV_VECTOR_WRAP_H
+
+#endif
-- 
2.37.2

Re: [committed] RISC-V: Adjust testcase for rvv/base/user-1.c

2022-10-10 Thread 钟居哲

LGTM.

juzhe.zh...@rivai.ai

From: Kito Cheng
Date: 2022-10-10 21:14
To: gcc-patches; kito.cheng; christoph.muellner; juzhe.zhong
CC: Kito Cheng
Subject: [committed] RISC-V: Adjust testcase for rvv/base/user-1.c
The -march option check isn't precise enough: -march=rv*v* also matches any
zve extension.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/user-1.c: Add dg-options and drop
dg-skip-if.

Reported-by: Christoph Müllner 
Tested-by: Christoph Müllner 
Reviewed-by: Ju-Zhe Zhong 
---
gcc/testsuite/gcc.target/riscv/rvv/base/user-1.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/user-1.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/user-1.c
index fa1f0f3d4d2..00fb73f220f 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/user-1.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/user-1.c
@@ -1,5 +1,5 @@
/* { dg-do compile } */
-/* { dg-skip-if "test rvv intrinsic" { *-*-* } { "*" } { "-march=rv*v*" } } */
+/* { dg-options "-O3 -march=rv32gcv -mabi=ilp32d" } */
#include "riscv_vector.h"
-- 
2.37.2

Re: Re: [PATCH] Add first-order recurrence autovectorization

2022-10-07 Thread 钟居哲

Sorry for the late reply. I just got back from a week of vacation.
I was planning to finish this patch after the vacation, but it seems that you
have almost finished it.
That's great! Thank you so much.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2022-10-07 20:24
To: juzhe.zhong
CC: gcc-patches; richard.sandiford
Subject: Re: [PATCH] Add first-order recurrence autovectorization
On Thu, Oct 6, 2022 at 3:07 PM Richard Biener
 wrote:
>
> On Thu, Oct 6, 2022 at 2:13 PM Richard Biener
>  wrote:
> >
> > On Fri, Sep 30, 2022 at 10:00 AM  wrote:
> > >
> > > From: Ju-Zhe Zhong 
> > >
> > > > Hi, after fixing the previous ICE, I added the full implementation
> > > > (inserting a permutation to get the correct result).
> > > >
> > > > I think the gimple IR is correct now:
> > >   # t_21 = PHI <_4(6), t_12(9)>
> > >   # i_22 = PHI 
> > >   # vectp_a.6_26 = PHI 
> > >   # vect_vec_recur_.9_9 = PHI 
> > >   # vectp_b.11_7 = PHI 
> > >   # curr_cnt_36 = PHI 
> > >   # loop_len_20 = PHI 
> > >   _38 = .WHILE_LEN (loop_len_20, 32, POLY_INT_CST [4, 4]);
> > >   while_len_37 = _38;
> > >   _1 = (long unsigned int) i_22;
> > >   _2 = _1 * 4;
> > >   _3 = a_14(D) + _2;
> > >   vect__4.8_19 = .LEN_LOAD (vectp_a.6_26, 32B, loop_len_20, 0);
> > >   _4 = *_3;
> > >   _5 = b_15(D) + _2;
> > >   vect_vec_recur_.9_9 = VEC_PERM_EXPR  > > { POLY_INT_CST [3, 4], POLY_INT_CST [4, 4], POLY_INT_CST [5, 4], ... }>;
> > >
> > > > But I encountered another ICE:
> > > 0x169e0e7 process_bb
> > > ../../../riscv-gcc/gcc/tree-ssa-sccvn.cc:7498
> > > 0x16a09af do_rpo_vn(function*, edge_def*, bitmap_head*, bool, bool, 
> > > vn_lookup_kind)
> > > ../../../riscv-gcc/gcc/tree-ssa-sccvn.cc:8109
> > > 0x16a0fe7 do_rpo_vn(function*, edge_def*, bitmap_head*)
> > > ../../../riscv-gcc/gcc/tree-ssa-sccvn.cc:8205
> > > 0x179b7db execute
> > > ../../../riscv-gcc/gcc/tree-vectorizer.cc:1365
> > >
> > > > Could you help me with this? Once this ICE is fixed, I think the loop
> > > > vectorizer can run correctly. Maybe you can then test it on x86 or ARM.
> >
> > Sorry for the late reply, the issue is that we have
> >
> > vect_vec_recur_.7_7 = VEC_PERM_EXPR  > { 7, 8, 9, 10, 11, 12, 13, 14 }>;
> >
> > thus
> >
> > +  for (unsigned i = 0; i < ncopies; ++i)
> > +   {
> > + gphi *phi = as_a <gphi *> (STMT_VINFO_VEC_STMTS (def_stmt_info)[i]);
> > + tree latch = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
> > + tree recur = gimple_phi_result (phi);
> > + gassign *assign
> > +   = gimple_build_assign (recur, VEC_PERM_EXPR, recur, latch, 
> > perm);
> > + gimple_assign_set_lhs (assign, recur);
> >
> > needs to create a new SSA name for each LHS.  You shouldn't create code in
> > vect_get_vec_defs_for_operand either.
> >
> > Let me mangle the patch a bit.
> >
> > The attached is what I came up with, the permutes need to be generated when
> > the backedge PHI values are filled in.  Missing are ncopies > 1 handling, 
> > we'd
> > need to think of how the initial value and the permutes would work here, 
> > missing
> > is SLP support but more importantly handling in the epilogue (so on x86 
> > requires
> > constant loop bound)
> > I've added a testcase that triggers on x86_64.
>
> Actually I broke it, the following is more correct.
 
So let me finish the patch.  I have everything besides the epilogue
handling done,
I'll get to that somewhen next week.
 
Richard.
 
> Richard.
>
> > Richard.

Re: Re: [Unfinished PATCH] Add first-order recurrence autovectorization

2022-09-29 Thread 钟居哲

Yeah, frankly, I had already noticed this situation.
If we manually rewrite the code, GCC can resolve the data dependency in the
scalar passes by introducing a repeated statement (which removes the PHI
nodes) before the loop vectorizer runs.
Which approach wins, GCC's or LLVM's? That is not the point I care about.
My goal is to fix the cases that GCC fails to vectorize and to make the GCC
loop vectorizer more powerful so that it can vectorize more cases.
Besides, in many situations users don't want to rewrite their code, and we
can't leave the data dependency for the scalar passes to handle.

With the same example I presented to you, users who write the code in
different styles will get different vectorized codegen (after applying my
patch).
LLVM, however, cannot do that: no matter how you write the code, it always
uses the general first-order recurrence loop vectorizer.
I think this will be an advantage of GCC over LLVM once my patch is finished
and merged into upstream GCC.
Which approach is better? Leave that for the user to choose.

If you watched my presentation at GNU Cauldron 2022, I showed a comparison
between RVV LLVM and RVV GCC there.
After compiling and testing many benchmarks, I noticed that LLVM can always
vectorize more cases than GCC.
However, among the cases that both GCC and LLVM can vectorize, GCC wins in
some, and in others GCC and LLVM are the same or LLVM wins, but overall GCC
wins in most cases.
I have analyzed most of the cases that GCC misses: the cause is that GCC is
missing some general loop vectorization support, and adding it is what I want
to do (translating the LLVM loop vectorizer into GCC).

So, let me first finish this patch and test it in the downstream RVV GCC. I
can only test it in my downstream RVV GCC,
because the RISC-V backend in upstream GCC is far from ready to support
autovectorization, even though about 10 of my RVV support patches have been
merged into upstream GCC.
Then I will post the finished version of this loop vectorizer to you. Can you
help me test it on an ARM platform? Thanks.

juzhe.zh...@rivai.ai

From: Richard Sandiford
Date: 2022-09-30 00:53
To: juzhe.zhong
CC: gcc-patches
Subject: Re: [Unfinished PATCH] Add first-order recurrence autovectorization
Thanks for posting the patch.

juzhe.zh...@rivai.ai writes:
> From: Ju-Zhe Zhong 
>
> gcc/ChangeLog:
>
> * tree-vect-loop.cc (vect_phi_first_order_recurrence_p): New function.
> (vect_analyze_scalar_cycles_1): Classify first-order recurrence phi.
> (vect_analyze_loop_operations): Add first-order recurrence 
> autovectorization support.
> (vectorizable_dep_phi): New function.
> (vect_use_first_order_phi_result_p): New function.
> (vect_transform_loop): Add first-order recurrence autovectorization 
> support.
> * tree-vect-stmts.cc (vect_transform_stmt): Ditto.
> (vect_is_simple_use): Ditto.
> * tree-vectorizer.h (enum vect_def_type): New enum.
> (enum stmt_vec_info_type): Ditto.
> (vectorizable_dep_phi): New function.
>
> Hi, since Richard said I can post an unfinished patch to ask for help, I am
> posting it.
> This patch is meant to fix this issue:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409
> LLVM can vectorize this case using its first-order recurrence loop vectorizer.
> This patch is inspired by the first-order recurrence autovectorization
> support in LLVM:
> https://reviews.llvm.org/D16197
> Here is a link showing several cases that GCC fails to vectorize
> because it has no support for first-order recurrence vectorization:
> https://godbolt.org/z/nzf1Wrd6T
>
> Let's consider a simplified case:
> void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * 
> __restrict__ c, int n)
> {
>   int32_t t = *c;
>   for (int i = 0; i < n; ++i)
> {
>   b[i] = a[i] - t;
>   t = a[i];
> }
> }

One thing that I wondered about the LLVM implementation is:
does reusing the loaded value really pay for itself?  E.g. for
the un-predictive-commoned version:

void foo (int32_t * __restrict__ a, int32_t * __restrict__ b,
          int32_t * __restrict__ c, int n)
{
  b[0] = a[0] - *c;
  for (int i = 1; i < n; ++i)
    b[i] = a[i] - a[i - 1];
}
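
For reference, the recurrence form and the rewritten form of the loop can be compared directly. The following is only a minimal sketch: `foo_recurrence`, `foo_reload` and `forms_agree` are hypothetical names chosen here, and the input data is made up. Note that the rewritten form stores to b[0] unconditionally, so it assumes n >= 1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Recurrence form: t carries a[i - 1] from the previous iteration.  */
static void
foo_recurrence (const int32_t *a, int32_t *b, const int32_t *c, int n)
{
  int32_t t = *c;
  for (int i = 0; i < n; ++i)
    {
      b[i] = a[i] - t;
      t = a[i];
    }
}

/* Rewritten form: the carried value is reloaded from a[i - 1].
   It writes b[0] unconditionally, so N must be at least 1.  */
static void
foo_reload (const int32_t *a, int32_t *b, const int32_t *c, int n)
{
  assert (n >= 1);
  b[0] = a[0] - *c;
  for (int i = 1; i < n; ++i)
    b[i] = a[i] - a[i - 1];
}

/* Hypothetical helper: return 1 if both forms agree on a small input.  */
static int
forms_agree (void)
{
  const int32_t a[8] = { 5, 9, 2, 7, 7, 1, 4, 3 };
  const int32_t c = 10;
  int32_t b1[8], b2[8];
  foo_recurrence (a, b1, &c, 8);
  foo_reload (a, b2, &c, 8);
  return memcmp (b1, b2, sizeof b1) == 0;
}
```

Calling `forms_agree ()` from a test driver should return 1 for this input.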

GCC generates:

.L4:
        ldr     q0, [x6, x2]
        ldr     q1, [x0, x2]
        sub     v0.4s, v0.4s, v1.4s
        str     q0, [x5, x2]
        add     x2, x2, 16
        cmp     x2, x4
        bne     .L4

whereas LLVM (with -fno-unroll-loops) generates:

.LBB0_4:                                // %vector.body
        mov     v1.16b, v0.16b
        subs    x15, x15, #4
        ldr     q0, [x13], #16
        ext     v1.16b, v1.16b, v0.16b, #12
        sub     v1.4s, v0.4s, v1.4s
        str     q1, [x14], #16
        b.ne    .LBB0_4

Introducing the loop-carried dependency (via the ext) limits the
throughput of the loop to the latency of a permutation.
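
The loop-carried dependency can be seen in a scalar model of the LLVM loop. This is only a sketch under stated assumptions: a vector is modeled as an array of 4 int32_t lanes, `splice_last` is a hypothetical stand-in for the `ext ..., #12` instruction, `foo_spliced` is a made-up name, and n is assumed to be a multiple of 4 (no epilogue handling).

```c
#include <assert.h>
#include <stdint.h>

enum { VF = 4 };  /* lanes per 128-bit vector of int32_t */

/* Model of "ext out, prev, cur, #12": the last lane of PREV followed
   by the first VF - 1 lanes of CUR.  */
static void
splice_last (const int32_t *prev, const int32_t *cur, int32_t *out)
{
  out[0] = prev[VF - 1];
  for (int l = 1; l < VF; ++l)
    out[l] = cur[l - 1];
}

/* Chunked version of the recurrence loop; N must be a multiple of VF.  */
static void
foo_spliced (const int32_t *a, int32_t *b, const int32_t *c, int n)
{
  assert (n % VF == 0);
  int32_t prev[VF] = { 0, 0, 0, *c };  /* only the last lane matters */
  for (int i = 0; i < n; i += VF)
    {
      int32_t cur[VF], shifted[VF];
      for (int l = 0; l < VF; ++l)
        cur[l] = a[i + l];               /* one vector load */
      splice_last (prev, cur, shifted);  /* needs PREV from last iteration */
      for (int l = 0; l < VF; ++l)
        {
          b[i + l] = cur[l] - shifted[l];
          prev[l] = cur[l];              /* loop-carried vector */
        }
    }
}
```

Each iteration's `splice_last` needs `prev` from the iteration before, whereas the GCC loop issues two independent loads per iteration and carries nothing between iterations.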

But I guess which approach is better depends on the amount of work
that is repeated by GCC's approach.  For a single load it's probably
better to repeat the work, but for something more complicated the

Re: RE: [PATCH]middle-end Add optimized float addsub without needing VEC_PERM_EXPR.

2022-09-23 Thread 钟居哲

So far I haven't seen a V2DF <-> V4SF case on RISC-V.



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2022-09-23 20:54
To: Tamar Christina
CC: Richard Sandiford; Tamar Christina via Gcc-patches; nd; juzhe.zhong
Subject: RE: [PATCH]middle-end Add optimized float addsub without needing 
VEC_PERM_EXPR.
On Fri, 23 Sep 2022, Tamar Christina wrote:
 
> Hi,
> 
> Attached is the respun version of the patch,
> 
> > >>
> > >> Wouldn't a target need to re-check if lanes are NaN or denormal if
> > >> after a SFmode lane operation a DFmode lane operation follows?  IIRC
> > >> that is what usually makes punning "integer" vectors as FP vectors 
> > >> costly.
> 
> I don't believe this is a problem, since NaNs are not a single value and,
> according to the standard, the sign bit doesn't change the meaning of a NaN.
> 
> That's why, specifically for negates, generally no check is performed and it's
> assumed that if a value is a NaN going in, it's a NaN coming out, and this
> optimization doesn't change that.  Also under fast-math we don't guarantee
> a stable representation for NaN (or zeros, etc) afaik.
> 
> So if that is still a concern I could add && !HONOR_NANS () to the 
> constraints.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> * match.pd: Add fneg/fadd rule.
> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.target/aarch64/simd/addsub_1.c: New test.
> * gcc.target/aarch64/sve/addsub_1.c: New test.
> 
> --- inline version of patch ---
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 
> 1bb936fc4010f98f24bb97671350e8432c55b347..2617d56091dfbd41ae49f980ee0af3757f5ec1cf
>  100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -7916,6 +7916,59 @@ and,
>(simplify (reduc (op @0 VECTOR_CST@1))
>  (op (reduc:type @0) (reduc:type @1
>  
> +/* Simplify vector floating point operations of alternating sub/add pairs
> +   into using an fneg of a wider element type followed by a normal add.
> +   under IEEE 754 the fneg of the wider type will negate every even entry
> +   and when doing an add we get a sub of the even and add of every odd
> +   elements.  */
> +(simplify
> + (vec_perm (plus:c @0 @1) (minus @0 @1) VECTOR_CST@2)
> + (if (!VECTOR_INTEGER_TYPE_P (type) && !BYTES_BIG_ENDIAN)
 
shouldn't this be FLOAT_WORDS_BIG_ENDIAN instead?
 
I'm still concerned what
 
(neg:V2DF (subreg:V2DF (reg:V4SF) 0))
 
means for architectures like RISC-V.  Can one "reformat" FP values
in vector registers so that two floats overlap a double
(and then back)?
 
I suppose you rely on target_can_change_mode_class to tell you that.
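
The bit-level effect being discussed can be modeled in plain C. This is only a sketch under assumptions that are not part of the patch: IEEE 754 binary32 floats, a 64-bit lane whose top bit is the sign bit, and little-endian lane order (the second float of each pair sits in the high half). `float_bits`, `bits_float` and `addsub_pair` are hypothetical names, and the XOR stands in for an fneg on the wider lane. It says nothing about whether a given target can reinterpret register contents this way.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of F.  */
static uint32_t
float_bits (float f)
{
  uint32_t u;
  memcpy (&u, &f, sizeof u);
  return u;
}

/* Return the float whose raw bit pattern is U.  */
static float
bits_float (uint32_t u)
{
  float f;
  memcpy (&f, &u, sizeof f);
  return f;
}

/* Compute out[0] = x[0] + y[0] and out[1] = x[1] - y[1] by treating the
   pair y[0..1] as one 64-bit lane (y[1] in the high half) and flipping
   only that lane's sign bit before a plain add.  */
static void
addsub_pair (const float x[2], const float y[2], float out[2])
{
  assert (sizeof (float) == sizeof (uint32_t));
  uint64_t lane = ((uint64_t) float_bits (y[1]) << 32) | float_bits (y[0]);
  lane ^= UINT64_C (1) << 63;  /* models fneg on the 64-bit lane */
  float ny0 = bits_float ((uint32_t) lane);
  float ny1 = bits_float ((uint32_t) (lane >> 32));
  out[0] = x[0] + ny0;  /* y[0] unchanged: add */
  out[1] = x[1] + ny1;  /* y[1] sign-flipped: sub */
}
```

Because the lane is assembled with shifts instead of a pointer cast, the model does not depend on the host's byte order.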
 
 
> +  (with
> +   {
> + /* Build a vector of integers from the tree mask.  */
> + vec_perm_builder builder;
> + if (!tree_to_vec_perm_builder (&builder, @2))
> +   return NULL_TREE;
> +
> + /* Create a vec_perm_indices for the integer vector.  */
> + poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (type);
> + vec_perm_indices sel (builder, 2, nelts);
> +   }
> +   (if (sel.series_p (0, 2, 0, 2))
> +(with
> + {
> +   machine_mode vec_mode = TYPE_MODE (type);
> +   auto elem_mode = GET_MODE_INNER (vec_mode);
> +   auto nunits = exact_div (GET_MODE_NUNITS (vec_mode), 2);
> +   tree stype;
> +   switch (elem_mode)
> + {
> + case E_HFmode:
> +stype = float_type_node;
> +break;
> + case E_SFmode:
> +stype = double_type_node;
> +break;
> + default:
> +return NULL_TREE;
> + }
 
Can't you use GET_MODE_WIDER_MODE and double-check the
mode-size doubles?  I mean you obviously miss DFmode -> TFmode.
 
> +   tree ntype = build_vector_type (stype, nunits);
> +   if (!ntype)
 
You want to check that the above results in a vector mode.
 
> + return NULL_TREE;
> +
> +   /* The format has to have a simple sign bit.  */
> +   const struct real_format *fmt = FLOAT_MODE_FORMAT (vec_mode);
> +   if (fmt == NULL)
> + return NULL_TREE;
> + }
> + (if (fmt->signbit_rw == GET_MODE_UNIT_BITSIZE (vec_mode) - 1
 
shouldn't this be a check on the component mode?  I think you'd
want to check that the bigger format signbit_rw is equal to
the smaller format mode size plus its signbit_rw or so?
 
> +   && fmt->signbit_rw == fmt->signbit_ro
> +   && targetm.can_change_mode_class (TYPE_MODE (ntype), TYPE_MODE (type), 
> ALL_REGS)
> +   && (optimize_vectors_before_lowering_p ()
> +   || target_supports_op_p (ntype, NEGATE_EXPR, optab_vector)))
> +  (plus (view_convert:type (negate (view_convert:ntype @1))) @0)))
> +
>  (simplify
>   (vec_perm @0 @1 VECTOR_CST@2)
>   (with
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/addsub_1.c 
> b/gcc/testsuite/gcc.target/aarch64/simd/addsub_1.c
> new file mode 100644
> index 
> ..1fb91a34c421bbd2894faa0dbbf1b47ad43310c4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/addsub_1.c
> @@ -0,0 +1,56 @@
> +/* { dg-do compile } */
> +/* {

Re: Re: [PATCH] RISC-V: Add runtime invariant support

2022-08-20 Thread 钟居哲

OK. Thank you. I am going to try it again and fix this in the RISC-V port.



juzhe.zh...@rivai.ai
 
From: Andrew Pinski
Date: 2022-08-21 08:18
To: 钟居哲
CC: Andreas Schwab; gcc-patches; kito.cheng; andrew; rguenther
Subject: Re: Re: [PATCH] RISC-V: Add runtime invariant support
On Sat, Aug 20, 2022 at 5:06 PM 钟居哲  wrote:
>
> Hi, it seems that this warning is still reported even if I revert my patch.
> Am I right? Feel free to correct me. Maybe I need to try it again?
 
The warning will no longer be there. The reason is that NUM_POLY_INT_COEFFS
defaults to 1, which means vf.is_constant (&const_vf) will always
return true and will always set const_vf.
I don't know why the warning does not happen on aarch64-linux-gnu (the
other target where NUM_POLY_INT_COEFFS is set to 2) though; it just
might be slightly different IR which causes the warning mechanism not
to warn.
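
The pattern behind the warning can be shown with a toy model. This is only a sketch: `struct toy_poly`, `toy_is_constant` and `vf_or_fallback` are hypothetical names and not GCC's poly_int API; they only mimic the idea that the out-parameter is written on the "constant" path alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a two-coefficient poly_int: value = c0 + c1 * x,
   where x is only known at run time.  Not GCC's actual class.  */
struct toy_poly
{
  uint64_t c0;
  uint64_t c1;
};

/* Store the value in *OUT and return true only if P has no runtime term.
   On failure *OUT is left untouched.  */
static bool
toy_is_constant (struct toy_poly p, uint64_t *out)
{
  assert (out != NULL);
  if (p.c1 != 0)
    return false;
  *out = p.c0;
  return true;
}

/* Safe use: CONST_VF is read only on the path where it was written.
   FALLBACK is a hypothetical value for the non-constant case.  */
static uint64_t
vf_or_fallback (struct toy_poly vf, uint64_t fallback)
{
  uint64_t const_vf;
  if (toy_is_constant (vf, &const_vf))
    return const_vf;
  return fallback;
}
```

With a single coefficient there is no runtime term, so the check always succeeds and the variable is always set. With two coefficients, every later read of the variable has to be guarded by the same condition for the compiler to prove it initialized.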
 
Thanks,
Andrew Pinski
 
 
>
> 
> juzhe.zh...@rivai.ai
>
>
> From: Andrew Pinski
> Date: 2022-08-21 07:53
> To: Andreas Schwab
> CC: juzhe.zhong; gcc-patches; kito.cheng; andrew; Richard Guenther
> Subject: Re: [PATCH] RISC-V: Add runtime invariant support
> On Sat, Aug 20, 2022 at 3:34 PM Andreas Schwab  wrote:
> >
> > This breaks bootstrap:
> >
> > ../../gcc/tree-vect-loop-manip.cc: In function 'void 
> > vect_gen_vector_loop_niters(loop_vec_info, tree, tree_node**, tree_node**, 
> > bool)':
> > ../../gcc/tree-vect-loop-manip.cc:1981:26: error: 'const_vf' may be used 
> > uninitialized [-Werror=maybe-uninitialized]
> >  1981 |   unsigned HOST_WIDE_INT const_vf;
> >   |  ^~~~
> > cc1plus: all warnings being treated as errors
> > make[3]: *** [Makefile:1146: tree-vect-loop-manip.o] Error 1
> > make[2]: *** [Makefile:4977: all-stage2-gcc] Error 2
> > make[1]: *** [Makefile:30363: stage2-bubble] Error 2
> > make: *** [Makefile:1065: all] Error 2
>
>
> This looks like a real uninitialized variable issue.
> I can't even tell whether const_vf is always set on the paths that lead to
> its use, so how can we expect GCC to do the same?
> The code that uses const_vf was added with r11-5820-cdcbef3c3310,
> CCing the author there.
>
> Thanks,
> Andrew
>
> >
> > --
> > Andreas Schwab, sch...@linux-m68k.org
> > GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> > "And now for something completely different."
>
