Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread 钟居哲
Hi, Richards.

Could you take a look at this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618241.html 

Thanks


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-11 20:42
To: juzhe.zhong\@rivai.ai
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
"juzhe.zh...@rivai.ai"  writes:
> Thanks. I have read rgroup descriptions again.
> Still I am not fully understand it clearly, bear with me :)
>
> I don't known how to differentiate Case 2 and Case 3.
>
> Case 2 is multiple rgroup for SLP.
> Case 3 is multiple rgroup for non-SLP (VEC_PACK_TRUNC)
>
> Is it correct:
> case 2: rgc->max_nscalarper_iter != 1
 
Yes.
 
> Case 3 : rgc->max_nscalarper_iter == 1 but rgc->factor != 1?
 
For case 3 it's:
 
rgc->max_nscalars_per_iter == 1 && rgc != _VINFO_LENS (loop_vinfo)[0]
 
rgc->factor is controlled by the target and just says what units
IFN_LOAD_LEN works in.  E.g. if we're loading 16-byte elements,
but the underlying instruction measures bytes, the factor would be 2.
 
Thanks,
Richard
 


Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Thanks. I have read rgroup descriptions again.
> Still I am not fully understand it clearly, bear with me :)
>
> I don't known how to differentiate Case 2 and Case 3.
>
> Case 2 is multiple rgroup for SLP.
> Case 3 is multiple rgroup for non-SLP (VEC_PACK_TRUNC)
>
> Is it correct:
> case 2: rgc->max_nscalarper_iter != 1

Yes.

> Case 3 : rgc->max_nscalarper_iter == 1 but rgc->factor != 1?

For case 3 it's:

rgc->max_nscalars_per_iter == 1 && rgc != _VINFO_LENS (loop_vinfo)[0]

rgc->factor is controlled by the target and just says what units
IFN_LOAD_LEN works in.  E.g. if we're loading 16-byte elements,
but the underlying instruction measures bytes, the factor would be 2.

Thanks,
Richard


Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread juzhe.zh...@rivai.ai
Thanks. I have read rgroup descriptions again.
Still I am not fully understand it clearly, bear with me :)

I don't known how to differentiate Case 2 and Case 3.

Case 2 is multiple rgroup for SLP.
Case 3 is multiple rgroup for non-SLP (VEC_PACK_TRUNC)

Is it correct:
case 2: rgc->max_nscalarper_iter != 1
Case 3 : rgc->max_nscalarper_iter == 1 but rgc->factor != 1?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-11 19:29
To: juzhe.zhong\@rivai.ai
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
"juzhe.zh...@rivai.ai"  writes:
> Oh, I see. But I saw there is a variable using_partial_vectors_p
> in the loop data structure.
>
> Can I add a variable call using_select_vl_p ?
 
Yeah.  Please also add a wrapper macro like
LOOP_VINFO_USING_PARTIAL_VECTORS_P.  (I'm not really a fan of the
wrappers, but it's better to be consistent.)
 
> Since it may increase the size of data structure, I am not sure whether it is 
> appropriate.
 
The structure is only temporary, and very few of them exist at
a given time.  Besides, there's already a layout hole on LP64 hosts
around those booleans (between slp_unrolling_factor and scalar_loop).
So the new boolean shouldn't grow the size of the structure.
 
We can convert the booleans to bitfields if size ever becomes a problem.
 
Thanks,
Richard
 


Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Oh, I see. But I saw there is a variable using_partial_vectors_p
> in the loop data structure.
>
> Can I add a variable call using_select_vl_p ?

Yeah.  Please also add a wrapper macro like
LOOP_VINFO_USING_PARTIAL_VECTORS_P.  (I'm not really a fan of the
wrappers, but it's better to be consistent.)

> Since it may increase the size of data structure, I am not sure whether it is 
> appropriate.

The structure is only temporary, and very few of them exist at
a given time.  Besides, there's already a layout hole on LP64 hosts
around those booleans (between slp_unrolling_factor and scalar_loop).
So the new boolean shouldn't grow the size of the structure.

We can convert the booleans to bitfields if size ever becomes a problem.

Thanks,
Richard


Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread juzhe.zh...@rivai.ai
Oh, I see. But I saw there is a variable using_partial_vectors_p
in the loop data structure.

Can I add a variable call using_select_vl_p ?
Since it may increase the size of data structure, I am not sure whether it is 
appropriate.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-11 19:04
To: juzhe.zhong\@rivai.ai
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
"juzhe.zh...@rivai.ai"  writes:
> Hi, Richard.  Since create_iv has been approved and soon will be commited 
> after
> we bootstrap && regression.
>
> Now, I plan to send patch for "decrement IV".
>
> After reading your comments, I have several questions:
>
> 1. 
>>if (use_bias_adjusted_len)
>>  return rgl->bias_adjusted_ctrl;
>> +  else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
>> +OPTIMIZE_FOR_SPEED))
>> +{
>> +  tree loop_len = rgl->controls[index];
>> +  poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
>> +  poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
>> +  if (maybe_ne (nunits1, nunits2))
>> + {
>> +   /* A loop len for data type X can be reused for data type Y
>> +  if X has N times more elements than Y and if Y's elements
>> +  are N times bigger than X's.  */
>> +   gcc_assert (multiple_p (nunits1, nunits2));
>> +   unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
>> +   gimple_seq seq = NULL;
>> +   loop_len = gimple_build (, RDIV_EXPR, iv_type, loop_len,
>> +build_int_cst (iv_type, factor));
>> +   if (seq)
>> + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
>> + }
>> +  return loop_len;
>> +}
>>else
>>  return rgl->controls[index];
>>  }
>
>>  ...here.  That is, the key isn't whether SELECT_VL is available,
>>  but instead whether we've decided to use it for this loop (unless
>>  I'm missing something).
>
> Let's me clarify it again:
>
> I do this here is for Case 2 SLP:
>
> Generate for len : _61 = _75 / 2;
> I think it is similar with ARM SVE using VIEW_CONVER_EXPR to view_convert the 
> mask.
>
> You said we should not let SELECT_VL is available or not to decide it here.
> Could you teach me how to handle this code here? Should I add a target hook 
> like:
> TARGET_SLP_LOOP_LEN_RDIV_BY_FACTOR_P ?
 
No.  What I mean is: for each vectorised loop, we should make a decision,
in one place only, whether to use SELECT_VL-based control flow or
arithmetic-based control flow for that particular loop.  That decision
depends partly on direct_internal_fn_supported_p (a necessary but not
sufficient condition), partly on whether the loop contains SLP nodes, etc.
We should then record that decision in the loop_vec_info so that it is
available to whichever code needs it.
 
This is similar to LOOP_VINFO_USING_PARTIAL_VECTORS_P etc.
 
Thanks,
Richard
 


Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Hi, Richard.  Since create_iv has been approved and soon will be commited 
> after
> we bootstrap && regression.
>
> Now, I plan to send patch for "decrement IV".
>
> After reading your comments, I have several questions:
>
> 1. 
>>if (use_bias_adjusted_len)
>>  return rgl->bias_adjusted_ctrl;
>> +  else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
>> +OPTIMIZE_FOR_SPEED))
>> +{
>> +  tree loop_len = rgl->controls[index];
>> +  poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
>> +  poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
>> +  if (maybe_ne (nunits1, nunits2))
>> + {
>> +   /* A loop len for data type X can be reused for data type Y
>> +  if X has N times more elements than Y and if Y's elements
>> +  are N times bigger than X's.  */
>> +   gcc_assert (multiple_p (nunits1, nunits2));
>> +   unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
>> +   gimple_seq seq = NULL;
>> +   loop_len = gimple_build (, RDIV_EXPR, iv_type, loop_len,
>> +build_int_cst (iv_type, factor));
>> +   if (seq)
>> + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
>> + }
>> +  return loop_len;
>> +}
>>else
>>  return rgl->controls[index];
>>  }
>
>>  ...here.  That is, the key isn't whether SELECT_VL is available,
>>  but instead whether we've decided to use it for this loop (unless
>>  I'm missing something).
>
> Let's me clarify it again:
>
> I do this here is for Case 2 SLP:
>
> Generate for len : _61 = _75 / 2;
> I think it is similar with ARM SVE using VIEW_CONVER_EXPR to view_convert the 
> mask.
>
> You said we should not let SELECT_VL is available or not to decide it here.
> Could you teach me how to handle this code here? Should I add a target hook 
> like:
> TARGET_SLP_LOOP_LEN_RDIV_BY_FACTOR_P ?

No.  What I mean is: for each vectorised loop, we should make a decision,
in one place only, whether to use SELECT_VL-based control flow or
arithmetic-based control flow for that particular loop.  That decision
depends partly on direct_internal_fn_supported_p (a necessary but not
sufficient condition), partly on whether the loop contains SLP nodes, etc.
We should then record that decision in the loop_vec_info so that it is
available to whichever code needs it.

This is similar to LOOP_VINFO_USING_PARTIAL_VECTORS_P etc.

Thanks,
Richard


Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-11 Thread juzhe.zh...@rivai.ai
Hi, Richard.  Since create_iv has been approved and soon will be commited after
we bootstrap && regression.

Now, I plan to send patch for "decrement IV".

After reading your comments, I have several questions:

1. 
>if (use_bias_adjusted_len)
>  return rgl->bias_adjusted_ctrl;
> +  else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
> +OPTIMIZE_FOR_SPEED))
> +{
> +  tree loop_len = rgl->controls[index];
> +  poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
> +  poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
> +  if (maybe_ne (nunits1, nunits2))
> + {
> +   /* A loop len for data type X can be reused for data type Y
> +  if X has N times more elements than Y and if Y's elements
> +  are N times bigger than X's.  */
> +   gcc_assert (multiple_p (nunits1, nunits2));
> +   unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
> +   gimple_seq seq = NULL;
> +   loop_len = gimple_build (, RDIV_EXPR, iv_type, loop_len,
> +build_int_cst (iv_type, factor));
> +   if (seq)
> + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> + }
> +  return loop_len;
> +}
>else
>  return rgl->controls[index];
>  }

>  ...here.  That is, the key isn't whether SELECT_VL is available,
>  but instead whether we've decided to use it for this loop (unless
>  I'm missing something).

Let's me clarify it again:

I do this here is for Case 2 SLP:

Generate for len : _61 = _75 / 2;
I think it is similar with ARM SVE using VIEW_CONVER_EXPR to view_convert the 
mask.

You said we should not let SELECT_VL is available or not to decide it here.
Could you teach me how to handle this code here? Should I add a target hook 
like:
TARGET_SLP_LOOP_LEN_RDIV_BY_FACTOR_P ?

 2. 
>  _offsets);
> +   else if (loop_lens && loop_lens->length () == 1
> +&& direct_internal_fn_supported_p (
> +  IFN_SELECT_VL, LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo),
> +  OPTIMIZE_FOR_SPEED)
> +&& memory_access_type != VMAT_INVARIANT)
> + dataref_ptr
> +   = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
> + simd_lane_access_p ? loop : NULL,
> + offset, , gsi,
> + simd_lane_access_p, loop_lens,
> + dr_info, memory_access_type);

> Here too I think the check should be based on a cached record
> of whether we're using SELECT_VL for this particular loop,

Since only Case 1 is using SELECT_VL, Case 2 && 3 is using MIN_EXPR.
I have no idea how to know whether it is using SELECT_VL.
Could you teach me about this?


>> What happens for loop_lens->length () != 1?
As you can see, When we are using SELECT_VL (case 1), the length will be 
possible non-vf in non-final iteration.
So here we need to adjust data reference pointer IV by the outcome of SELECT_VL.

The gimple IR is like this:
# vectp_B.8_16 = PHI 
...
_28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);ivtmp_15 = _28 * 
4;..vect__1.10_18 = .LEN_LOAD (vectp_B.8_16, 128B, _28, 0);..
vectp_B.8_17 = vectp_B.8_16 + ivtmp_15;
This is the case loop_lens->length () == 1.
When it is loop_lens->length () != 1, it means it is Case 2 or Case 3. We 
always force MIN_EXPR using VF in non-final iteration.So the data reference IV 
is added by constant value (poly or non-poly).  Maybe the codes here is ugly 
with using loop_lens->length () == 1?Could you give me the suggestions for this?
I am gonna fix this patch by following your suggestions.Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-11 00:45
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
In addition to Jeff's comments:
 
juzhe.zh...@rivai.ai writes:
> [...]
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>  
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated 
> value.
> +operand 1 is the total elements need to be updated value.
> +operand 2 is the vectorization factor.
> +The value of operand 0 is target dependent and flexible in each iteration.
> +The operation of this pattern can be:
> +
> +@smallexample
> +Case 1:
> +operand0 = MIN (operand1, operand2);
> +operand2 can be const_poly_int or poly_int related to vector mode size.
> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE 
> SIZE) so
> +that we can reduce a use of general purpose register.
> +
> +In 

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread juzhe.zh...@rivai.ai
Thank you so much.
Can you take a look at this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618110.html 

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-11 12:50
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> I am sorry that I am still confused about that.
>
> Is this what you want ?
>
>   bool use_minus_p = TREE_CODE (step) == INTEGER_CST && ((TYPE_UNSIGNED 
> (TREE_TYPE (step)) && tree_int_cst_lt (step1, step))
>  || (!TYPE_UNSIGNED (TREE_TYPE (step)) && 
> !tree_expr_nonnegative_warnv_p (step, ) && may_negate_without_overflow_p 
> (step)));
>
>   /* For easier readability of the created code, produce MINUS_EXPRs
>  when suitable.  */
>   if (TREE_CODE (step) == INTEGER_CST)
> {
>   if (TYPE_UNSIGNED (TREE_TYPE (step)))
> {
>   step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   if (tree_int_cst_lt (step1, step))
> {
>   incr_op = MINUS_EXPR; /* Remove it.  */
>   step = step1;
> }
> }
>   else
> {
>   bool ovf;
>
>   if (!tree_expr_nonnegative_warnv_p (step, )
>   && may_negate_without_overflow_p (step))
> {
>   incr_op = MINUS_EXPR; /* Remove it.  */
>   step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
> }
> }
> }
>   if (POINTER_TYPE_P (TREE_TYPE (base)))
> {
>   if (TREE_CODE (base) == ADDR_EXPR)
> mark_addressable (TREE_OPERAND (base, 0));
>   step = convert_to_ptrofftype (step);
>   if (incr_op == MINUS_EXPR) /* Change it into if (use_minus_p)  */
> step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   incr_op = POINTER_PLUS_EXPR; /* Remove it.  */
> }
>   /* Gimplify the step if necessary.  We put the computations in front of the
>  loop (i.e. the step should be loop invariant).  */
>   step = force_gimple_operand (step, , true, NULL_TREE);
>   if (stmts)
> gsi_insert_seq_on_edge_immediate (pe, stmts);
>   
>   if (POINTER_TYPE_P (TREE_TYPE (base)))
> stmt = gimple_build_assign (va, POINTER_PLUS_EXPR, vb, step);
>   else if (use_minus_p)
> stmt = gimple_build_assign (va, MINUS_EXPR, vb, step);
>   else
> stmt = gimple_build_assign (va, incr_op, vb, step);
> ...
>
> Since I have no idea to make stmts flips between PLUS_EXPR and MINUS_EXPR.
 
No, I meant:
 
- Rename the "code" argument to "incr_op".
 
- Remove "tree_code incr_op = code;".
 
- Replace both instances of:
 
 incr_op = MINUS_EXPR;
 
  with:
 
 incr_op = (incr_op == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR);
 
The point is that the current code (rightly) assumes that incr_op
always starts out as PLUS_EXPR, i.e. that STEP starts out applying
positively.  Making STEP apply in the opposite direction is then as
simple as changing incr_op to MINUS_EXPR.  But the new interface
allows STEP to start out applying positively or negatively, and so
this code needs to cope with both cases.
 
Thanks,
Richard
 


Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread Richard Sandiford via Gcc-patches
钟居哲  writes:
> I am sorry that I am still confused about that.
>
> Is this what you want ?
>
>   bool use_minus_p = TREE_CODE (step) == INTEGER_CST && ((TYPE_UNSIGNED 
> (TREE_TYPE (step)) && tree_int_cst_lt (step1, step))
>  || (!TYPE_UNSIGNED (TREE_TYPE (step)) && 
> !tree_expr_nonnegative_warnv_p (step, ) && may_negate_without_overflow_p 
> (step)));
>
>   /* For easier readability of the created code, produce MINUS_EXPRs
>  when suitable.  */
>   if (TREE_CODE (step) == INTEGER_CST)
> {
>   if (TYPE_UNSIGNED (TREE_TYPE (step)))
> {
>   step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   if (tree_int_cst_lt (step1, step))
> {
>   incr_op = MINUS_EXPR; /* Remove it.  */
>   step = step1;
> }
> }
>   else
> {
>   bool ovf;
>
>   if (!tree_expr_nonnegative_warnv_p (step, )
>   && may_negate_without_overflow_p (step))
> {
>   incr_op = MINUS_EXPR; /* Remove it.  */
>   step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
> }
> }
> }
>   if (POINTER_TYPE_P (TREE_TYPE (base)))
> {
>   if (TREE_CODE (base) == ADDR_EXPR)
> mark_addressable (TREE_OPERAND (base, 0));
>   step = convert_to_ptrofftype (step);
>   if (incr_op == MINUS_EXPR) /* Change it into if (use_minus_p)  */
> step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   incr_op = POINTER_PLUS_EXPR; /* Remove it.  */
> }
>   /* Gimplify the step if necessary.  We put the computations in front of the
>  loop (i.e. the step should be loop invariant).  */
>   step = force_gimple_operand (step, , true, NULL_TREE);
>   if (stmts)
> gsi_insert_seq_on_edge_immediate (pe, stmts);
>   
>   if (POINTER_TYPE_P (TREE_TYPE (base)))
> stmt = gimple_build_assign (va, POINTER_PLUS_EXPR, vb, step);
>   else if (use_minus_p)
> stmt = gimple_build_assign (va, MINUS_EXPR, vb, step);
>   else
> stmt = gimple_build_assign (va, incr_op, vb, step);
> ...
>
> Since I have no idea to make stmts flips between PLUS_EXPR and MINUS_EXPR.

No, I meant:

- Rename the "code" argument to "incr_op".

- Remove "tree_code incr_op = code;".

- Replace both instances of:

 incr_op = MINUS_EXPR;

  with:

 incr_op = (incr_op == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR);

The point is that the current code (rightly) assumes that incr_op
always starts out as PLUS_EXPR, i.e. that STEP starts out applying
positively.  Making STEP apply in the opposite direction is then as
simple as changing incr_op to MINUS_EXPR.  But the new interface
allows STEP to start out applying positively or negatively, and so
this code needs to cope with both cases.

Thanks,
Richard


Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread 钟居哲
I am sorry that I am still confused about that.

Is this what you want ?

  bool use_minus_p = TREE_CODE (step) == INTEGER_CST && ((TYPE_UNSIGNED 
(TREE_TYPE (step)) && tree_int_cst_lt (step1, step))
 || (!TYPE_UNSIGNED (TREE_TYPE (step)) && 
!tree_expr_nonnegative_warnv_p (step, ) && may_negate_without_overflow_p 
(step)));

  /* For easier readability of the created code, produce MINUS_EXPRs
 when suitable.  */
  if (TREE_CODE (step) == INTEGER_CST)
{
  if (TYPE_UNSIGNED (TREE_TYPE (step)))
{
  step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
  if (tree_int_cst_lt (step1, step))
{
  incr_op = MINUS_EXPR; /* Remove it.  */
  step = step1;
}
}
  else
{
  bool ovf;

  if (!tree_expr_nonnegative_warnv_p (step, )
  && may_negate_without_overflow_p (step))
{
  incr_op = MINUS_EXPR; /* Remove it.  */
  step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
}
}
}
  if (POINTER_TYPE_P (TREE_TYPE (base)))
{
  if (TREE_CODE (base) == ADDR_EXPR)
mark_addressable (TREE_OPERAND (base, 0));
  step = convert_to_ptrofftype (step);
  if (incr_op == MINUS_EXPR) /* Change it into if (use_minus_p)  */
step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
  incr_op = POINTER_PLUS_EXPR; /* Remove it.  */
}
  /* Gimplify the step if necessary.  We put the computations in front of the
 loop (i.e. the step should be loop invariant).  */
  step = force_gimple_operand (step, , true, NULL_TREE);
  if (stmts)
gsi_insert_seq_on_edge_immediate (pe, stmts);
  
  if (POINTER_TYPE_P (TREE_TYPE (base)))
stmt = gimple_build_assign (va, POINTER_PLUS_EXPR, vb, step);
  else if (use_minus_p)
stmt = gimple_build_assign (va, MINUS_EXPR, vb, step);
  else
stmt = gimple_build_assign (va, incr_op, vb, step);
...

Since I have no idea to make stmts flips between PLUS_EXPR and MINUS_EXPR.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-11 05:28
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Thanks Richard.
> I am planning to seperate a patch with only creat_iv stuff only.
>
> Are you suggesting that I remove "tree_code incr_op = code;"
> Use the argument directly ?
>
> I saw the codes here:
>
>   /* For easier readability of the created code, produce MINUS_EXPRs
>  when suitable.  */
>   if (TREE_CODE (step) == INTEGER_CST)
> {
>   if (TYPE_UNSIGNED (TREE_TYPE (step)))
> {
>   step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   if (tree_int_cst_lt (step1, step))
> {
>   incr_op = MINUS_EXPR;
>   step = step1;
> }
> }
>   else
> {
>   bool ovf;
>
>   if (!tree_expr_nonnegative_warnv_p (step, )
>   && may_negate_without_overflow_p (step))
> {
>   incr_op = MINUS_EXPR;
>   step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
> }
> }
> }
>   if (POINTER_TYPE_P (TREE_TYPE (base)))
> {
>   if (TREE_CODE (base) == ADDR_EXPR)
> mark_addressable (TREE_OPERAND (base, 0));
>   step = convert_to_ptrofftype (step);
>   if (incr_op == MINUS_EXPR)
> step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   incr_op = POINTER_PLUS_EXPR;
> }
>   /* Gimplify the step if necessary.  We put the computations in front of the
>  loop (i.e. the step should be loop invariant).  */
>   step = force_gimple_operand (step, , true, NULL_TREE);
>   if (stmts)
> gsi_insert_seq_on_edge_immediate (pe, stmts);
>
>   stmt = gimple_build_assign (va, incr_op, vb, step);
> ...
>
> It seems that it has complicated conditions here to change value of variable 
> "incr_op".
> That's why I define a temporary variable "tree_code incr_op = code;" here and
> let the following codes change the value of "incr_op".
>
> Could you give me some hints of dealing with this piece of code to get rid of 
> "tree_code incr_op = code;" ?
 
Yeah, but like I said in the review, those later:
  incr_op = MINUS_EXPR;
stmts need to be updated to something that flips between PLUS_EXPR
and MINUS_EXPR (with updates to the comments).  Just leaving them
as-is is incorrect (in cases where the caller passed MINUS_EXPR
rather than PLUS_EXPR).
 
The POINTER_PLUS_EXPR handling is fine due to the conditional
negate beforehand.
 
Thanks,
Richard
 


Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread Richard Sandiford via Gcc-patches
钟居哲  writes:
> Thanks Richard.
> I am planning to seperate a patch with only creat_iv stuff only.
>
> Are you suggesting that I remove "tree_code incr_op = code;"
> Use the argument directly ?
>
> I saw the codes here:
>
>   /* For easier readability of the created code, produce MINUS_EXPRs
>  when suitable.  */
>   if (TREE_CODE (step) == INTEGER_CST)
> {
>   if (TYPE_UNSIGNED (TREE_TYPE (step)))
> {
>   step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   if (tree_int_cst_lt (step1, step))
> {
>   incr_op = MINUS_EXPR;
>   step = step1;
> }
> }
>   else
> {
>   bool ovf;
>
>   if (!tree_expr_nonnegative_warnv_p (step, )
>   && may_negate_without_overflow_p (step))
> {
>   incr_op = MINUS_EXPR;
>   step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
> }
> }
> }
>   if (POINTER_TYPE_P (TREE_TYPE (base)))
> {
>   if (TREE_CODE (base) == ADDR_EXPR)
> mark_addressable (TREE_OPERAND (base, 0));
>   step = convert_to_ptrofftype (step);
>   if (incr_op == MINUS_EXPR)
> step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
>   incr_op = POINTER_PLUS_EXPR;
> }
>   /* Gimplify the step if necessary.  We put the computations in front of the
>  loop (i.e. the step should be loop invariant).  */
>   step = force_gimple_operand (step, , true, NULL_TREE);
>   if (stmts)
> gsi_insert_seq_on_edge_immediate (pe, stmts);
>
>   stmt = gimple_build_assign (va, incr_op, vb, step);
> ...
>
> It seems that it has complicated conditions here to change value of variable 
> "incr_op".
> That's why I define a temporary variable "tree_code incr_op = code;" here and
> let the following codes change the value of "incr_op".
>
> Could you give me some hints of dealing with this piece of code to get rid of 
> "tree_code incr_op = code;" ?

Yeah, but like I said in the review, those later:
 
  incr_op = MINUS_EXPR;
 
stmts need to be updated to something that flips between PLUS_EXPR
and MINUS_EXPR (with updates to the comments).  Just leaving them
as-is is incorrect (in cases where the caller passed MINUS_EXPR
rather than PLUS_EXPR).

The POINTER_PLUS_EXPR handling is fine due to the conditional
negate beforehand.

Thanks,
Richard


Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread 钟居哲
Thanks Richard.
I am planning to seperate a patch with only creat_iv stuff only.

Are you suggesting that I remove "tree_code incr_op = code;"
Use the argument directly ?

I saw the codes here:

  /* For easier readability of the created code, produce MINUS_EXPRs
 when suitable.  */
  if (TREE_CODE (step) == INTEGER_CST)
{
  if (TYPE_UNSIGNED (TREE_TYPE (step)))
{
  step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
  if (tree_int_cst_lt (step1, step))
{
  incr_op = MINUS_EXPR;
  step = step1;
}
}
  else
{
  bool ovf;

  if (!tree_expr_nonnegative_warnv_p (step, )
  && may_negate_without_overflow_p (step))
{
  incr_op = MINUS_EXPR;
  step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
}
}
}
  if (POINTER_TYPE_P (TREE_TYPE (base)))
{
  if (TREE_CODE (base) == ADDR_EXPR)
mark_addressable (TREE_OPERAND (base, 0));
  step = convert_to_ptrofftype (step);
  if (incr_op == MINUS_EXPR)
step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step);
  incr_op = POINTER_PLUS_EXPR;
}
  /* Gimplify the step if necessary.  We put the computations in front of the
 loop (i.e. the step should be loop invariant).  */
  step = force_gimple_operand (step, , true, NULL_TREE);
  if (stmts)
gsi_insert_seq_on_edge_immediate (pe, stmts);

  stmt = gimple_build_assign (va, incr_op, vb, step);
...

It seems that it has complicated conditions here to change value of variable 
"incr_op".
That's why I define a temporary variable "tree_code incr_op = code;" here and
let the following codes change the value of "incr_op".

Could you give me some hints of dealing with this piece of code to get rid of 
"tree_code incr_op = code;" ?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-11 00:45
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
In addition to Jeff's comments:
 
juzhe.zh...@rivai.ai writes:
> [...]
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>  
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated 
> value.
> +operand 1 is the total elements need to be updated value.
> +operand 2 is the vectorization factor.
> +The value of operand 0 is target dependent and flexible in each iteration.
> +The operation of this pattern can be:
> +
> +@smallexample
> +Case 1:
> +operand0 = MIN (operand1, operand2);
> +operand2 can be const_poly_int or poly_int related to vector mode size.
> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE 
> SIZE) so
> +that we can reduce a use of general purpose register.
> +
> +In this case, only the last iteration of the loop is partial iteration.
> +@end smallexample
> +
> +@smallexample
> +Case 2:
> +if (operand1 <= operand2)
> +  operand0 = operand1;
> +else if (operand1 < 2 * operand2)
> +  operand0 = IN_RANGE (ceil (operand1 / 2), operand2);
 
GCC's IN_RANGE is a predicate, so it would be best to avoid that here.
Why isn't it simply ceil (operand1 / 2), which must be <= operand2?
 
> +else
> +  operand0 = operand2;
> +
> +This case will evenly distribute work over the last 2 iterations of a 
> stripmine loop.
> +@end smallexample
> +
> +The output of this pattern is not only used as IV of loop control counter, 
> but also
> +is used as the IV of address calculation with multiply/shift operation. This 
> allow
> +us dynamic adjust the number of elements is processed in each iteration of 
> the loop.
> +
>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>  @item @samp{check_raw_ptrs@var{m}}
>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> [...]
> diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
> index 909b705d00d..5abca64379e 100644
> --- a/gcc/tree-ssa-loop-manip.cc
> +++ b/gcc/tree-ssa-loop-manip.cc
> @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
> so that we can free them all at once.  */
>  static bitmap_obstack loop_renamer_obstack;
>  
> -/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
> +/* Creates an induction variable with value BASE (+/-) STEP * iteration in 
> LOOP.
> +   If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
> +   If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration.
>  

Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-10 Thread Richard Sandiford via Gcc-patches
In addition to Jeff's comments:

juzhe.zh...@rivai.ai writes:
> [...]
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>  
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated 
> value.
> +operand 1 is the total elements need to be updated value.
> +operand 2 is the vectorization factor.
> +The value of operand 0 is target dependent and flexible in each iteration.
> +The operation of this pattern can be:
> +
> +@smallexample
> +Case 1:
> +operand0 = MIN (operand1, operand2);
> +operand2 can be const_poly_int or poly_int related to vector mode size.
> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE 
> SIZE) so
> +that we can reduce a use of general purpose register.
> +
> +In this case, only the last iteration of the loop is partial iteration.
> +@end smallexample
> +
> +@smallexample
> +Case 2:
> +if (operand1 <= operand2)
> +  operand0 = operand1;
> +else if (operand1 < 2 * operand2)
> +  operand0 = IN_RANGE (ceil (operand1 / 2), operand2);

GCC's IN_RANGE is a predicate, so it would be best to avoid that here.
Why isn't it simply ceil (operand1 / 2), which must be <= operand2?

> +else
> +  operand0 = operand2;
> +
> +This case will evenly distribute work over the last 2 iterations of a 
> stripmine loop.
> +@end smallexample
> +
> +The output of this pattern is not only used as IV of loop control counter, 
> but also
> +is used as the IV of address calculation with multiply/shift operation. This 
> allow
> +us dynamic adjust the number of elements is processed in each iteration of 
> the loop.
> +
>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>  @item @samp{check_raw_ptrs@var{m}}
>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> [...]
> diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
> index 909b705d00d..5abca64379e 100644
> --- a/gcc/tree-ssa-loop-manip.cc
> +++ b/gcc/tree-ssa-loop-manip.cc
> @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
> so that we can free them all at once.  */
>  static bitmap_obstack loop_renamer_obstack;
>  
> -/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
> +/* Creates an induction variable with value BASE (+/-) STEP * iteration in 
> LOOP.
> +   If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
> +   If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration.
> It is expected that neither BASE nor STEP are shared with other 
> expressions
> (unless the sharing rules allow this).  Use VAR as a base var_decl for it
> (if NULL, a new temporary will be created).  The increment will occur at
> @@ -57,8 +59,8 @@ static bitmap_obstack loop_renamer_obstack;
> VAR_AFTER (unless they are NULL).  */
>  
>  void
> -create_iv (tree base, tree step, tree var, class loop *loop,
> -gimple_stmt_iterator *incr_pos, bool after,
> +create_iv (tree base, tree_code code, tree step, tree var,
> +class loop *loop, gimple_stmt_iterator *incr_pos, bool after,
>  tree *var_before, tree *var_after)
>  {
>gassign *stmt;
> @@ -66,7 +68,9 @@ create_iv (tree base, tree step, tree var, class loop *loop,
>tree initial, step1;
>gimple_seq stmts;
>tree vb, va;
> -  enum tree_code incr_op = PLUS_EXPR;
> +  /* The code can only be PLUS_EXPR or MINUS_EXPR.  */
> +  gcc_assert (code == PLUS_EXPR || code == MINUS_EXPR);
> +  tree_code incr_op = code;

As Richard said, we should be able to get rid of incr_op, probably
by calling the parameter incr_op.

I think the later:

  incr_op = MINUS_EXPR;

stmts need to be updated to something that flips between PLUS_EXPR
and MINUS_EXPR (with updates to the comments).

It would probably make sense to split the create_iv part out as a
separate prepatch.

>edge pe = loop_preheader_edge (loop);
>  
>if (var != NULL_TREE)
> @@ -1365,7 +1369,7 @@ tree_transform_and_unroll_loop (class loop *loop, 
> unsigned factor,
>tree ctr_before, ctr_after;
>gimple_stmt_iterator bsi = gsi_last_nondebug_bb (new_exit->src);
>exit_if = as_a  (gsi_stmt (bsi));
> -  create_iv (exit_base, exit_step, NULL_TREE, loop,
> +  create_iv (exit_base, PLUS_EXPR, exit_step, NULL_TREE, loop,
>, false, _before, _after);
>gimple_cond_set_code (exit_if, exit_cmp);
>gimple_cond_set_lhs (exit_if, ctr_after);
> @@ -1580,8 +1584,8 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, 
> bool bump_in_latch)
>  gsi = gsi_last_bb (loop->latch);
>else
>  gsi = gsi_last_nondebug_bb (loop->header);
> -  create_iv (build_int_cst_type (type, 0), 

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-09 Thread 钟居哲
More details for Case 2:
+   _72 = MIN_EXPR ;
+   _75 = MIN_EXPR ;
+   ...
+   .LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
+   vectp_f.8_56 = vectp_f.8_51 + 16;
+   .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
+   ...
+   _61 = _75 / 2;
+   .LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
+   vectp_d.10_63 = vectp_d.10_59 + 16;
+   _64 = _72 / 2;
+   .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
You may be confused that _61 = _75 / 2; and _64 = _72 / 2;
Well, this is similiar VIEW_CONVERT_EXPR of mask in ARM SVE.

Like ARM SVE:
tree
vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
unsigned int nvectors, tree vectype, unsigned int index)
{
  rgroup_controls *rgm = &(*masks)[nvectors - 1];
  tree mask_type = rgm->type;

  /* Populate the rgroup's mask array, if this is the first time we've
 used it.  */
  if (rgm->controls.is_empty ())
{
  rgm->controls.safe_grow_cleared (nvectors, true);
  for (unsigned int i = 0; i < nvectors; ++i)
{
  tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
  /* Provide a dummy definition until the real one is available.  */
  SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
  rgm->controls[i] = mask;
}
}

  tree mask = rgm->controls[index];
  if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type),
TYPE_VECTOR_SUBPARTS (vectype)))
{
  /* A loop mask for data type X can be reused for data type Y
 if X has N times more elements than Y and if Y's elements
 are N times bigger than X's.  In this case each sequence
 of N elements in the loop mask will be all-zero or all-one.
 We can then view-convert the mask so that each sequence of
 N elements is replaced by a single element.  */
  gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
  TYPE_VECTOR_SUBPARTS (vectype)));
  gimple_seq seq = NULL;
  mask_type = truth_type_for (vectype);
  mask = gimple_build (, VIEW_CONVERT_EXPR, mask_type, mask);
  if (seq)
gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
}
  return mask;
}


I am doing similiar thing for RVV:
+vect_get_loop_len (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
+  vec_loop_lens *lens, unsigned int nvectors, tree vectype,
+  unsigned int index)
 {
   rgroup_controls *rgl = &(*lens)[nvectors - 1];
   bool use_bias_adjusted_len =
 LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0;
+  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
 
   /* Populate the rgroup's len array, if this is the first time we've
  used it.  */
@@ -10400,6 +10403,27 @@ vect_get_loop_len (loop_vec_info loop_vinfo, 
vec_loop_lens *lens,
 
   if (use_bias_adjusted_len)
 return rgl->bias_adjusted_ctrl;
+  else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
+  OPTIMIZE_FOR_SPEED))
+{
+  tree loop_len = rgl->controls[index];
+  poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
+  poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
+  if (maybe_ne (nunits1, nunits2))
+   {
+ /* A loop len for data type X can be reused for data type Y
+if X has N times more elements than Y and if Y's elements
+are N times bigger than X's.  */
+ gcc_assert (multiple_p (nunits1, nunits2));
+ unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
+ gimple_seq seq = NULL;
+ loop_len = gimple_build (, RDIV_EXPR, iv_type, loop_len,
+  build_int_cst (iv_type, factor));
+ if (seq)
+   gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
+   }
+  return loop_len;
+}
   else
 return rgl->controls[index];
 }Thansk.


juzhe.zh...@rivai.ai
 
From: juzhe.zh...@rivai.ai
Date: 2023-05-09 21:27
To: richard.sandiford
CC: gcc-patches; rguenther
Subject: Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
Thanks, Richard.

>> Could you go into more details about this?  I imagined that for case 3,
>> there would be a single SELECT_VL that decides how many scalar iterations
>> will be handled by the current vector iteration, then we would "expand"
>> the result (using MIN_EXPRs) to the multi-control cases.

For case 2 , here is the example:
+   2. Multiple rgroup, the Gimple IR should be:
+
+   # i_23 = PHI 
+   # vectp_f.8_51 = PHI 
+   # vectp_d.10_59 = PHI 
+   # ivtmp_70 = PHI 
+   # ivtmp_73 = PHI 
+   _72 = MIN_EXPR ;
+   _75 = MIN_EXPR ;
+   _1 = i_23 * 2;
+   _2 = (long unsigned int) _1;
+   _3 = _2 * 2;
+   _4 = f_15(D) + _3;
+   _5 = _2 + 1

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-09 Thread juzhe.zh...@rivai.ai
ode.  We just need to work out some of the details.

Ok, I also prefer keeping select_vl.

>>FWIW, I share Kewen's concern about duplicating too much logic between
>>masks, current lengths, and SELECT_VL lengths.  But I haven't looked at
>>the patch yet and so I don't know how easy it would be to avoid that.

I understand the concern, the current implementation are in the isolated 
function "vect_set_loop_controls_by_select_vl",
it's easier to review the implementation.
Maybe we can first make the whole implementation codes in 
"vect_set_loop_controls_by_select_vl" to be stable after review,
then we can try to incorporate these codes of 
"vect_set_loop_controls_by_select_vl" into "vect_set_loop_controls_directly".


Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-09 20:59
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Hi, Richards. I would like to give more information about this patch so that 
> it will make this patch easier for you to review.
>
> Currently, I saw we have 3 situations that we need to handle in case of loop 
> control IV in auto-vectorization:
> 1. Single rgroup loop control (ncopies == 1 && vec_num == 1 so 
> loop_len.length () == 1 or rgc->lengh () == 1)
> 2. Multiple rgroup for SLP.
> 3. Multiple rgroup for non-SLP which is Richard Sandiford point out 
> previously (For example, VEC_PACK_TRUNC).
>
> To talk about this patch, let me talk about RVV LLVM implementation first 
> which inspire me to send this patch:
> https://reviews.llvm.org/D99750 
>
> According to LLVM implementation, they are adding a middle-end IR called 
> "get_vector_length" which has totally
> same functionality as "select_vl" in this patch (I call it "while_len" 
> previously, now I rename it as "select_vl" following Richard suggestion).
>
> The LLVM implementation is only let "get_vector_length" calculate the number 
> of elements in single rgroup loop.
> For multi rgroup, let's take a look at it:
> https://godbolt.org/z/3GP78efTY 
>
> void
> foo1 (short *__restrict f, int *__restrict d, int n)
> {
>   for (int i = 0; i < n; ++i)
> {
>   f[i * 2 + 0] = 1;
>   f[i * 2 + 1] = 2;
>   d[i] = 3;
> }
> } 
>
> RISC-V Clang:
> foo1:   # @foo1
> # %bb.0:
> bleza2, .LBB0_8
> # %bb.1:
> li  a3, 16
> bgeua2, a3, .LBB0_3
> # %bb.2:
> li  a6, 0
> j   .LBB0_6
> .LBB0_3:
> andia6, a2, -16
> lui a3, 32
> addiw   a3, a3, 1
> vsetivlizero, 8, e32, m2, ta, ma
> vmv.v.x v8, a3
> vmv.v.i v10, 3
> mv  a4, a6
> mv  a5, a1
> mv  a3, a0
> .LBB0_4:# =>This Inner Loop Header: Depth=1
> addia7, a5, 32
> addit0, a3, 32
> vsetivlizero, 16, e16, m2, ta, ma
> vse16.v v8, (a3)
> vse16.v v8, (t0)
> vsetivlizero, 8, e32, m2, ta, ma
> vse32.v v10, (a5)
> vse32.v v10, (a7)
> addia3, a3, 64
> addia4, a4, -16
> addia5, a5, 64
> bneza4, .LBB0_4
> # %bb.5:
> beq a6, a2, .LBB0_8
> .LBB0_6:
> sllia3, a6, 2
> add a0, a0, a3
> addia0, a0, 2
> add a1, a1, a3
> sub a2, a2, a6
> li  a3, 1
> li  a4, 2
> li  a5, 3
> .LBB0_7:# =>This Inner Loop Header: Depth=1
> sh  a3, -2(a0)
> sh  a4, 0(a0)
> sw  a5, 0(a1)
> addia0, a0, 4
> addia2, a2, -1
> addia1, a1, 4
> bneza2, .LBB0_7
> .LBB0_8:
> ret
>
> ARM GCC:
> foo1:
> cmp w2, 0
> ble .L1
> addvl   x4, x0, #1
> mov x3, 0
> cntbx7
> cntbx6, all, mul #2
> sbfiz   x2, x2, 1, 32
> ptrue   p0.b, all
> mov x5, x2
> adrpx8, .LC0
> uqdech  x5
> add x8, x8, :lo12:.LC0
> whilelo p1.h, xzr, x5
> ld1rw   z1.s, p0/z, [x8]
> mov z0.s, #3
> whilelo p0.h, xzr, x2
> .L3:
> st1hz1.h, p0, [x0, x3, lsl 1]
> st1hz1.h, p1, [x4, x3, lsl 1]
> st1wz0.s, p1, [x1, #1, mul vl]
> add x3, x3, x7
> whilelo p1.h, x3, x5
> st1wz0.s, p0, 

Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-09 Thread Richard Sandiford via Gcc-patches

Could you go into more details about this?  I imagined that for case 3,
there would be a single SELECT_VL that decides how many scalar iterations
will be handled by the current vector iteration, then we would "expand"
the result (using MIN_EXPRs) to the multi-control cases.

In a sense that replicates what the SVE code above is doing.  But for SVE,
it's possible to "optimise" the unpacking of a WHILELO result due to the
lack of implementation-defined behaviour.  So conceptually we have a
single WHILELO that is unpacked one layer to give two masks.  But in
reality we optimise that two WHILELOs.  No such optimisation is possible
with SELECT_VL, and maybe that produces poor-quality code.

That might be what you mean (haven't had chance to look at the patch
itself yet, but hope to tomorrow).

For case 2 (max_nscalars_per_iter > 1), I think it would make conceptual
sense to pass max_nscalars_per_iter to SELECT_VL or (alternatively)
multiply the result of the SELECT_VL by max_nscalars_per_iter.
But it's only worth doing one of those two things if it produces
reasonable code for RVV.

> Now base on these situations, we only have "select_vl" for single-rgroup, but 
> multiple-rgroup (both SLP and non-SLP), we just
> use MIN_EXPR.
>
> Is it more appropriate that we should remove "select_vl" and just use 
> MIN_EXPR force VF elements in each non-final iteration in single rgroup?

It's up to you.  If you don't think select_vl is worth it then it would
obviously make the vectoriser changes a bit simpler.

But making the vectoriser simpler isn't IMO the goal here.  SELECT_VL
seems like a perfectly reasonable construct to add to target-independent
code.  We just need to work out some of the details.

FWIW, I share Kewen's concern about duplicating too much logic between
masks, current lengths, and SELECT_VL lengths.  But I haven't looked at
the patch yet and so I don't know how easy it would be to avoid that.

Thanks,
Richard

> Like the codegen according to RVV ISA example (show as RVV LLVM):
> https://repo.hca.bsc.es/epic/z/oynhzP 
>
> ASM:
> vec_add:# @vec_add
> bleza3, .LBB0_3
> li  a4, 0
> .LBB0_2:# %vector.body
> sub a5, a3, a4
> vsetvli a6, a5, e64, m1, ta, mu  ==> change it into a6 = min (a5, VF) 
> && vsetvli zero, a6, e64, m1, ta, mu
> sllia7, a4, 3
> add a5, a1, a7
> vle64.v v8, (a5)
> add a5, a2, a7
> vle64.v v9, (a5)
> vfadd.vvv8, v8, v9
> add a7, a7, a0
> add a4, a4, a6
> vse64.v v8, (a7)
> bne a4, a3, .LBB0_2
> .LBB0_3:# %for.cond.cleanup
> ret
>
> So if we remove "select_vl" just use MIN_EXPR to force VF elements in 
> non-final iteration, we will end up with having like this:
>
> vec_add:# @vec_add
> bleza3, .LBB0_3
> csrrVF in bytes, vlenb
> .LBB0_2:# %vector.body
> sub a5, a3, VF in elements
> a6 = min (a5, VF in elements) 
> vsetvli zero, a6, e64, m1, ta, mu
>     add     a5, a1, VF in bytes
>     vle64.v v8, (a5)
>     add a5, a2, VF in bytes
> vle64.v v9, (a5)
> vfadd.vvv8, v8, v9
> add a7, a7, a0
> vse64.v v8, (a7)
> bne a4, a3, .LBB0_2
> .LBB0_3:# %for.cond.cleanup
> ret
>
> Both codegen are working good for RVV.
> Just only the second one can not have the RVV special optimization (even 
> distributed workload in last 2 iterations).
>
> Expecting any suggestions from you.
> Thanks.
>
>
> juzhe.zh...@rivai.ai
>  
> From: juzhe.zhong
> Date: 2023-05-04 21:25
> To: gcc-patches
> CC: richard.sandiford; rguenther; Ju-Zhe Zhong
> Subject: [PATCH V4] VECT: Add decrement IV iteration loop control by variable 
> amount support
> From: Ju-Zhe Zhong 
>  
> This patch is fixing V3 patch:
> https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/
>  
> Fix issues according to Richard Sandiford && Richard Biener.
>  
> 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
> 2. Support multiple-rgroup for non-SLP auto-vectorization.
>  
>For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the 
> total length:
>  
>  _36 = MIN_EXPR ;
>  
>  First length (MIN (X, VF/N)):
>loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
>  
>  Second length (X - MIN (X, 1 * VF/N)):
>loop_len_16 = _36 - loop_len

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-09 Thread juzhe.zh...@rivai.ai
Hi,Richards. Would you mind reviewing this patch?

Thanks.


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-05-07 23:19
To: juzhe.zhong; gcc-patches
CC: richard.sandiford; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
 
 
On 5/4/23 07:25, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> This patch is fixing V3 patch:
> https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/
> 
> Fix issues according to Richard Sandiford && Richard Biener.
> 
> 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
> 2. Support multiple-rgroup for non-SLP auto-vectorization.
> 
> For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the 
> total length:
> 
>   _36 = MIN_EXPR ;
> 
>   First length (MIN (X, VF/N)):
> loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
> 
>   Second length (X - MIN (X, 1 * VF/N)):
> loop_len_16 = _36 - loop_len_15;
> 
>   Third length (X - MIN (X, 2 * VF/N)):
> _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
> loop_len_17 = _36 - _38;
> 
>   Forth length (X - MIN (X, 3 * VF/N)):
> _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
> loop_len_18 = _36 - _39;
> 
> The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length 
> since using SELECT_VL
> to adapt induction IV consumes more instructions than just using MIN_EXPR. 
> Also, during testing,
> I found it's hard to adjust length correctly according to SELECT_VL.
> 
> So, this patch we only use SELECT_VL for single-rgroup with single length 
> control.
> 
> 3. Fix document of select_vl for Richard Biener (remove mode N).
> 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard 
> Biener.
> 5. Keep loop_vinfo as first parameter for "vect_get_loop_len".
> 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated 
> at the caller site.
> 
> More comments from Richard Biener:
>>> So it's not actually saturating.  The saturating operation is done by 
>>> .WHILE_LEN?
> I define the outcome of SELECT_VL (n, vf)  (WHILE_LEN) = IN_RANGE (0, min (n, 
> vf)) will make
> the loop control counter never underflow zero.
> 
>>> I see.  I wonder if it makes sense to leave .WHILE_LEN aside for a start,
>>> the above scheme should also work for single rgroups, no?
>>> As said, it _looks_ like you can progress without .WHILE_LEN and using
>>> .WHILE_LEN is a pure optimization?
> Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow
> target adjust any length = INRANGE (0, min (n, vf)) each iteration.
> 
> Let me known if I missed something for the V3 patch.
So at a high level this is pretty good.  I think there's some 
improvements we should make in the documentation and comments, but I'm 
comfortable with most of the implementation details.
 
 
 
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
> operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>   @end smallexample
>   
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated 
> value.
This reads rather poorly.  Is this still accurate?
 
Set operand 0 to the number of active elements in a vector to be updated 
in a loop iteration based on the total number of elements to be updated, 
the vectorization factor and vector properties of the target.
 
 
> +operand 1 is the total elements need to be updated value.
operand 1 is the total elements in the vector to be updated.
 
 
> +
> +The output of this pattern is not only used as IV of loop control counter, 
> but also
> +is used as the IV of address calculation with multiply/shift operation. This 
> allow
> +us dynamic adjust the number of elements is processed in each iteration of 
> the loop.
This allows dynamic adjustment of the number of elements processed each 
loop iteration. -- is that still accurate and does it read better?
 
 
> @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
>  so that we can free them all at once.  */
>   static bitmap_obstack loop_renamer_obstack;
>   
> -/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
> +/* Creates an induction variable with value BASE (+/-) STEP * iteration in 
> LOOP.
> +   If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
> +   If CODE is MINUS_EX

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-08 Thread juzhe.zh...@rivai.ai
Hi, Kewen.

I have tried to implement "decrement IV" feature and incorporate into 
"vect_set_loop_controls_directly".
Since the implementation is quite different from 
vect_set_loop_controls_directly, it will make  vect_set_loop_controls_directly
very complicated sometimes it makes me very hard to debug when I am testing. 

I am not sure but I can try again.

This patch isolated those implementation into a single function 
"vect_set_loop_controls_by_select_vl" which makes Richards
easier to review codes. Well, I think I can try again to incorporate those 
codes into "vect_set_loop_controls_directly" when they 
finish the review process of  "vect_set_loop_controls_by_select_vl".

Thanks.


juzhe.zh...@rivai.ai
 
From: Kewen.Lin
Date: 2023-05-08 15:55
To: juzhe.zh...@rivai.ai
CC: gcc-patches; rguenther; richard.sandiford
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
Hi Juzhe,
 
> Hi, Kewen.
> 
>>> Sorry for chiming in, I had some concern here.
>>> We already have some handlings for partial vector in length in 
>>> vect_set_loop_controls_directly
>>>(actually it deals with both mask and length), the proposed 
>>>vect_set_loop_controls_by_select_vl
>>>for select_vl looks like a variant for partial vector in length (comparing 
>>>to the normal MIN),
>>>and also adopts decrement IV.  IMHO, it seems better to modify/extend the 
>>>current handling in
>>>vect_set_loop_controls_directly for length, or factor out the handlings for 
>>>length there and
>>>extend the factored one.  Otherwise, it means we have two sets of handlings 
>>>for partial vector
>>>in lengths, it looks bad to maintain.  As the previous discussion, adopting 
>>>decrement IV is an
>>>enhancement for loop control, it's good for both cases w/ or w/o select_vl.  
>>>If the above
>>>understanding is correct, the possible steps seem to be:
>>>  - factor out the handling for length (* optional)
>>>  - modify it with decrement IV
>>>  - extend it with select_vl.
>>>In future if some RVV vendor wants to degenerate select_vl to min, it can 
>>>just adopt the same
>>>handlings with min by not defining select_vl optab.
> 
> You mean like this:
> doing this inside vect_set_loop_controls_directly ?
> if (use_while_len_p)
>   return vect_set_loop_controls_by_while_len(...)
 
No, I meant either factoring out those handlings for partial vector in length 
in function
vect_set_loop_controls_directly to one separated function like: 
vect_set_loop_controls_directly_length
and rename the existing one to vect_set_loop_controls_directly_mask, or keep 
the existing
vect_set_loop_controls_directly for both mask and length but modify/extend the 
part for length.
If there is no much code to share between mask and length, the former may be 
better.
 
BR,
Kewen
 


Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-08 Thread Kewen.Lin via Gcc-patches
Hi Juzhe,

> Hi, Kewen.
> 
>>> Sorry for chiming in, I had some concern here.
>>> We already have some handlings for partial vector in length in 
>>> vect_set_loop_controls_directly
>>>(actually it deals with both mask and length), the proposed 
>>>vect_set_loop_controls_by_select_vl
>>>for select_vl looks like a variant for partial vector in length (comparing 
>>>to the normal MIN),
>>>and also adopts decrement IV.  IMHO, it seems better to modify/extend the 
>>>current handling in
>>>vect_set_loop_controls_directly for length, or factor out the handlings for 
>>>length there and
>>>extend the factored one.  Otherwise, it means we have two sets of handlings 
>>>for partial vector
>>>in lengths, it looks bad to maintain.  As the previous discussion, adopting 
>>>decrement IV is an
>>>enhancement for loop control, it's good for both cases w/ or w/o select_vl.  
>>>If the above
>>>understanding is correct, the possible steps seem to be:
>>>  - factor out the handling for length (* optional)
>>>  - modify it with decrement IV
>>>  - extend it with select_vl.
>>>In future if some RVV vendor wants to degenerate select_vl to min, it can 
>>>just adopt the same
>>>handlings with min by not defining select_vl optab.
> 
> You mean like this:
> doing this inside vect_set_loop_controls_directly ?
>         if (use_while_len_p)
>           return vect_set_loop_controls_by_while_len(...)

No, I meant either factoring out those handlings for partial vector in length 
in function
vect_set_loop_controls_directly to one separated function like: 
vect_set_loop_controls_directly_length
and rename the existing one to vect_set_loop_controls_directly_mask, or keep 
the existing
vect_set_loop_controls_directly for both mask and length but modify/extend the 
part for length.
If there is no much code to share between mask and length, the former may be 
better.

BR,
Kewen


Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-08 Thread juzhe.zh...@rivai.ai
Hi, Kewen.>> Sorry for chiming in, I had some concern here.
>> We already have some handlings for partial vector in length in 
>> vect_set_loop_controls_directly
>>(actually it deals with both mask and length), the proposed 
>>vect_set_loop_controls_by_select_vl
>>for select_vl looks like a variant for partial vector in length (comparing to 
>>the normal MIN),
>>and also adopts decrement IV.  IMHO, it seems better to modify/extend the 
>>current handling in
>>vect_set_loop_controls_directly for length, or factor out the handlings for 
>>length there and
>>extend the factored one.  Otherwise, it means we have two sets of handlings 
>>for partial vector
>>in lengths, it looks bad to maintain.  As the previous discussion, adopting 
>>decrement IV is an
>>enhancement for loop control, it's good for both cases w/ or w/o select_vl.  
>>If the above
>>understanding is correct, the possible steps seem to be:
>>  - factor out the handling for length (* optional)
>>  - modify it with decrement IV
>>  - extend it with select_vl.
>>In future if some RVV vendor wants to degenerate select_vl to min, it can 
>>just adopt the same
>>handlings with min by not defining select_vl optab.
You mean like this:
doing this inside vect_set_loop_controls_directly ?
if (use_while_len_p)
  return vect_set_loop_controls_by_while_len (...)

Thanks.


juzhe.zh...@rivai.ai
 
From: Kewen.Lin
Date: 2023-05-08 13:35
To: juzhe.zhong; rguenther; richard.sandiford
CC: gcc-patches
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
Hi,
 
on 2023/5/4 21:25, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> This patch is fixing V3 patch:
> https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/
> 
> Fix issues according to Richard Sandiford && Richard Biener.
> 
> 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
> 2. Support multiple-rgroup for non-SLP auto-vectorization.
> 
>For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the 
> total length:
> 
>  _36 = MIN_EXPR ;
> 
>  First length (MIN (X, VF/N)):
>loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
> 
>  Second length (X - MIN (X, 1 * VF/N)):
>loop_len_16 = _36 - loop_len_15;
> 
>  Third length (X - MIN (X, 2 * VF/N)):
>_38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
>loop_len_17 = _36 - _38;
> 
>  Forth length (X - MIN (X, 3 * VF/N)):
>_39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
>loop_len_18 = _36 - _39;
> 
> The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length 
> since using SELECT_VL
> to adapt induction IV consumes more instructions than just using MIN_EXPR. 
> Also, during testing,
> I found it's hard to adjust length correctly according to SELECT_VL.
> 
> So, this patch we only use SELECT_VL for single-rgroup with single length 
> control.
> 
> 3. Fix document of select_vl for Richard Biener (remove mode N).
> 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard 
> Biener.
> 5. Keep loop_vinfo as first parameter for "vect_get_loop_len".
> 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated 
> at the caller site.
> 
> More comments from Richard Biener:
>>> So it's not actually saturating.  The saturating operation is done by 
>>> .WHILE_LEN?
> I define the outcome of SELECT_VL (n, vf)  (WHILE_LEN) = IN_RANGE (0, min (n, 
> vf)) will make 
> the loop control counter never underflow zero.
> 
>>> I see.  I wonder if it makes sense to leave .WHILE_LEN aside for a start,
>>> the above scheme should also work for single rgroups, no?
>>> As said, it _looks_ like you can progress without .WHILE_LEN and using
>>> .WHILE_LEN is a pure optimization?
> Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow
> target adjust any length = INRANGE (0, min (n, vf)) each iteration.
> 
> Let me known if I missed something for the V3 patch.
> Thanks.
> 
> ---
>  gcc/cfgloopmanip.cc|   2 +-
>  gcc/doc/md.texi|  34 +++
>  gcc/gimple-loop-interchange.cc |   2 +-
>  gcc/internal-fn.def|   1 +
>  gcc/optabs.def |   1 +
>  gcc/tree-ssa-loop-ivcanon.cc   |   2 +-
>  gcc/tree-ssa-loop-ivopts.cc|   2 +-
>  gcc/tree-ssa-loop-manip.cc |  18 +-
>  gcc/tree-ssa-loop-manip.h  |   4 +-
>  gcc/tree-vect-data-refs.cc |   8 +-
>  gcc/tree-vect-loop-manip.cc| 374 -
>  gcc/t

Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-07 Thread Kewen.Lin via Gcc-patches
Hi,

on 2023/5/4 21:25, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> This patch is fixing V3 patch:
> https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/
> 
> Fix issues according to Richard Sandiford && Richard Biener.
> 
> 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
> 2. Support multiple-rgroup for non-SLP auto-vectorization.
> 
>For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the 
> total length:
> 
>  _36 = MIN_EXPR ;
> 
>  First length (MIN (X, VF/N)):
>loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
> 
>  Second length (X - MIN (X, 1 * VF/N)):
>loop_len_16 = _36 - loop_len_15;
> 
>  Third length (X - MIN (X, 2 * VF/N)):
>_38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
>loop_len_17 = _36 - _38;
> 
>  Forth length (X - MIN (X, 3 * VF/N)):
>_39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
>loop_len_18 = _36 - _39;
> 
> The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length 
> since using SELECT_VL
> to adapt induction IV consumes more instructions than just using MIN_EXPR. 
> Also, during testing,
> I found it's hard to adjust length correctly according to SELECT_VL.
> 
> So, this patch we only use SELECT_VL for single-rgroup with single length 
> control.
> 
> 3. Fix document of select_vl for Richard Biener (remove mode N).
> 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard 
> Biener.
> 5. Keep loop_vinfo as first parameter for "vect_get_loop_len".
> 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated 
> at the caller site.
> 
> More comments from Richard Biener:
>>> So it's not actually saturating.  The saturating operation is done by 
>>> .WHILE_LEN?
> I define the outcome of SELECT_VL (n, vf)  (WHILE_LEN) = IN_RANGE (0, min (n, 
> vf)) will make 
> the loop control counter never underflow zero.
> 
>>> I see.  I wonder if it makes sense to leave .WHILE_LEN aside for a start,
>>> the above scheme should also work for single rgroups, no?
>>> As said, it _looks_ like you can progress without .WHILE_LEN and using
>>> .WHILE_LEN is a pure optimization?
> Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow
> target adjust any length = INRANGE (0, min (n, vf)) each iteration.
> 
> Let me known if I missed something for the V3 patch.
> Thanks.
> 
> ---
>  gcc/cfgloopmanip.cc|   2 +-
>  gcc/doc/md.texi|  34 +++
>  gcc/gimple-loop-interchange.cc |   2 +-
>  gcc/internal-fn.def|   1 +
>  gcc/optabs.def |   1 +
>  gcc/tree-ssa-loop-ivcanon.cc   |   2 +-
>  gcc/tree-ssa-loop-ivopts.cc|   2 +-
>  gcc/tree-ssa-loop-manip.cc |  18 +-
>  gcc/tree-ssa-loop-manip.h  |   4 +-
>  gcc/tree-vect-data-refs.cc |   8 +-
>  gcc/tree-vect-loop-manip.cc| 374 -
>  gcc/tree-vect-loop.cc  |  32 ++-
>  gcc/tree-vect-stmts.cc |  89 +++-
>  gcc/tree-vectorizer.h  |   4 +-
>  14 files changed, 535 insertions(+), 38 deletions(-)
> 
> diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
> index 0e3ad8ed742..6e09dcbb0b1 100644
> --- a/gcc/cfgloopmanip.cc
> +++ b/gcc/cfgloopmanip.cc
> @@ -826,7 +826,7 @@ create_empty_loop_on_edge (edge entry_edge,
>  }
>  
>gsi = gsi_last_bb (loop_header);
> -  create_iv (initial_value, stride, iv, loop, , false,
> +  create_iv (initial_value, PLUS_EXPR, stride, iv, loop, , false,
>iv_before, iv_after);
>  
>/* Insert loop exit condition.  */
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>  
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated 
> value.
> +operand 1 is the total elements need to be updated value.
> +operand 2 is the vectorization factor.
> +The value of operand 0 is target dependent and flexible in each iteration.
> +The operation of this pattern can be:
> +
> +@smallexample
> +Case 1:
> +operand0 = MIN (operand1, operand2);
> +operand2 can be const_poly_int or poly_int related to vector mode size.
> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE 
> SIZE) so
> +that we can reduce a use of general purpose register.
> +
> +In this case, only the last iteration of the loop is partial iteration.
> +@end smallexample
> +
> +@smallexample
> +Case 2:
> +if (operand1 <= operand2)
> +  operand0 = operand1;
> +else if (operand1 < 2 * operand2)
> +  operand0 = IN_RANGE (ceil (operand1 / 2), operand2);
> +else
> +  operand0 = operand2;
> +
> +This case will evenly distribute work over the last 2 iterations of a 
> 

Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-07 Thread 钟居哲
>> It's been pretty standard to stick with just PLUS_EXPR for this stuff
>> and instead negate the constant to produce the same effect as
>> MINUS_EXPR.  Is there a reason we're not continuing that practice?
>> Sorry if you've answered this already -- if you have, you can just point
>> me at the prior discussion and I'll read it.

Richard (Sandiford) has answered this question:
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616745.html 
And as Richard (Biener) said, it will make IVOPTs failed if we have variable 
IVs.
However, we already have implemented variable IVs in downstream RVV GCC
and works fine and I don't see any bad codegen so far. So, I think it may not
be a serious issue for RVV.

Besides, this implementation is not my idea which is just following the guide 
coming
from RVV ISA spec.  And also inspired by the implementation coming from LLVM.
Reference:
1). RVV ISA: 
https://github.com/riscv/riscv-v-spec/blob/master/example/vvaddint32.s 
2). LLVM length stuff implementation (Should note that "get_vector_length" 
pattern in LLVM is
 totally doing the same thing as "select_vl" pattern in GCC here:
 https://reviews.llvm.org/D99750 

vvaddint32:
vsetvli t0, a0, e32, ta, ma  # Set vector length based on 32-bit vectors
vle32.v v0, (a1) # Get first vector
  sub a0, a0, t0 # Decrement number done
  slli t0, t0, 2 # Multiply number done by 4 bytes
  add a1, a1, t0 # Bump pointer
vle32.v v1, (a2) # Get second vector
  add a2, a2, t0 # Bump pointer
vadd.vv v2, v0, v1   # Sum vectors
vse32.v v2, (a3) # Store result
  add a3, a3, t0 # Bump pointer
  bnez a0, vvaddint32# Loop back
  ret# Finished

Notice "sub a0, a0, t0", the "t0" is the variable coming from "vsetvli t0, a0, 
e32, ta, ma"
which is generated by "get_vector_length" in LLVM, or similiar it is also 
generatd by "select_vl" in GCC too.

Other comments from Jeff will be addressed in the next patch (V5), I will wait 
for
Richards (both Sandiford && Biener that are the experts in Loop Vectorizer) 
comments.
Then send V5 patch which is including all comments from Jeff && Richards 
(Sandiford && Biener).

Thanks.


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-05-07 23:19
To: juzhe.zhong; gcc-patches
CC: richard.sandiford; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by 
variable amount support
 
 
On 5/4/23 07:25, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> This patch is fixing V3 patch:
> https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/
> 
> Fix issues according to Richard Sandiford && Richard Biener.
> 
> 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
> 2. Support multiple-rgroup for non-SLP auto-vectorization.
> 
> For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the 
> total length:
> 
>   _36 = MIN_EXPR ;
> 
>   First length (MIN (X, VF/N)):
> loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
> 
>   Second length (X - MIN (X, 1 * VF/N)):
> loop_len_16 = _36 - loop_len_15;
> 
>   Third length (X - MIN (X, 2 * VF/N)):
> _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
> loop_len_17 = _36 - _38;
> 
>   Forth length (X - MIN (X, 3 * VF/N)):
> _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
> loop_len_18 = _36 - _39;
> 
> The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length 
> since using SELECT_VL
> to adapt induction IV consumes more instructions than just using MIN_EXPR. 
> Also, during testing,
> I found it's hard to adjust length correctly according to SELECT_VL.
> 
> So, this patch we only use SELECT_VL for single-rgroup with single length 
> control.
> 
> 3. Fix document of select_vl for Richard Biener (remove mode N).
> 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard 
> Biener.
> 5. Keep loop_vinfo as first parameter for "vect_get_loop_len".
> 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated 
> at the caller site.
> 
> More comments from Richard Biener:
>>> So it's not actually saturating.  The saturating operation is done by 
>>> .WHILE_LEN?
> I define the outcome of SELECT_VL (n, vf)  (WHILE_LEN) = IN_RANGE (0, min (n, 
> vf)) will make
> the loop control counter never underflow zero.
> 
>>> I see.  I wonder if it makes sense to leave .WHILE_LEN aside for a start,
>>> the above scheme should also work for single rgroups, no?
>>> As said, it _looks_ li

Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-07 Thread Jeff Law via Gcc-patches




On 5/4/23 07:25, juzhe.zh...@rivai.ai wrote:

From: Ju-Zhe Zhong 

This patch is fixing V3 patch:
https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/

Fix issues according to Richard Sandiford && Richard Biener.

1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
2. Support multiple-rgroup for non-SLP auto-vectorization.

For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the total 
length:

  _36 = MIN_EXPR ;

  First length (MIN (X, VF/N)):
loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;

  Second length (X - MIN (X, 1 * VF/N)):
loop_len_16 = _36 - loop_len_15;

  Third length (X - MIN (X, 2 * VF/N)):
_38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
loop_len_17 = _36 - _38;

  Forth length (X - MIN (X, 3 * VF/N)):
_39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
loop_len_18 = _36 - _39;

The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length 
since using SELECT_VL
to adapt induction IV consumes more instructions than just using MIN_EXPR. 
Also, during testing,
I found it's hard to adjust length correctly according to SELECT_VL.

So, this patch we only use SELECT_VL for single-rgroup with single length 
control.

3. Fix document of select_vl for Richard Biener (remove mode N).
4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard 
Biener.
5. Keep loop_vinfo as first parameter for "vect_get_loop_len".
6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated 
at the caller site.

More comments from Richard Biener:

So it's not actually saturating.  The saturating operation is done by 
.WHILE_LEN?

I define the outcome of SELECT_VL (n, vf)  (WHILE_LEN) = IN_RANGE (0, min (n, 
vf)) will make
the loop control counter never underflow zero.


I see.  I wonder if it makes sense to leave .WHILE_LEN aside for a start,
the above scheme should also work for single rgroups, no?
As said, it _looks_ like you can progress without .WHILE_LEN and using
.WHILE_LEN is a pure optimization?

Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow
target adjust any length = INRANGE (0, min (n, vf)) each iteration.

Let me known if I missed something for the V3 patch.
So at a high level this is pretty good.  I think there's some 
improvements we should make in the documentation and comments, but I'm 
comfortable with most of the implementation details.






diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index cc4a93a8763..99cf0cdbdca 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
  @end smallexample
  
+@cindex @code{select_vl@var{m}} instruction pattern

+@item @code{select_vl@var{m}}
+Set operand 0 to the number of active elements in vector will be updated value.

This reads rather poorly.  Is this still accurate?

Set operand 0 to the number of active elements in a vector to be updated 
in a loop iteration based on the total number of elements to be updated, 
the vectorization factor and vector properties of the target.




+operand 1 is the total elements need to be updated value.

operand 1 is the total elements in the vector to be updated.



+
+The output of this pattern is not only used as IV of loop control counter, but 
also
+is used as the IV of address calculation with multiply/shift operation. This 
allow
+us dynamic adjust the number of elements is processed in each iteration of the 
loop.
This allows dynamic adjustment of the number of elements processed each 
loop iteration. -- is that still accurate and does it read better?




@@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
 so that we can free them all at once.  */
  static bitmap_obstack loop_renamer_obstack;
  
-/* Creates an induction variable with value BASE + STEP * iteration in LOOP.

+/* Creates an induction variable with value BASE (+/-) STEP * iteration in 
LOOP.
+   If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
+   If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration.
 It is expected that neither BASE nor STEP are shared with other expressions
 (unless the sharing rules allow this).  Use VAR as a base var_decl for it
 (if NULL, a new temporary will be created).  The increment will occur at
It's been pretty standard to stick with just PLUS_EXPR for this stuff 
and instead negate the constant to produce the same effect as 
MINUS_EXPR.  Is there a reason we're not continuing that practice? 
Sorry if you've answered this already -- if you have, you can just point 
me at the prior discussion and I'll read it.





diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 44bd5f2c805..d63ded5d4f0 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -385,6 +385,48 @@ 

Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-05 Thread 钟居哲
; vsetvli zero, a6, e64, m1, ta, mu
sllia7, a4, 3
add a5, a1, a7
vle64.v v8, (a5)
add a5, a2, a7
vle64.v v9, (a5)
vfadd.vvv8, v8, v9
add a7, a7, a0
add a4, a4, a6
vse64.v v8, (a7)
bne a4, a3, .LBB0_2
.LBB0_3:# %for.cond.cleanup
ret

So if we remove "select_vl" just use MIN_EXPR to force VF elements in non-final 
iteration, we will end up with having like this:

vec_add:# @vec_add
bleza3, .LBB0_3
csrrVF in bytes, vlenb
.LBB0_2:# %vector.body
sub a5, a3, VF in elements
a6 = min (a5, VF in elements) 
vsetvli zero, a6, e64, m1, ta, mu
add a5, a1, VF in bytes
vle64.v v8, (a5)
add a5, a2, VF in bytes
vle64.v v9, (a5)
vfadd.vvv8, v8, v9
add a7, a7, a0
vse64.v v8, (a7)
bne a4, a3, .LBB0_2
.LBB0_3:# %for.cond.cleanup
ret

Both codegen are working good for RVV.
Just only the second one can not have the RVV special optimization (even 
distributed workload in last 2 iterations).

Expecting any suggestions from you.
Thanks.


juzhe.zh...@rivai.ai
 
From: juzhe.zhong
Date: 2023-05-04 21:25
To: gcc-patches
CC: richard.sandiford; rguenther; Ju-Zhe Zhong
Subject: [PATCH V4] VECT: Add decrement IV iteration loop control by variable 
amount support
From: Ju-Zhe Zhong 
 
This patch is fixing V3 patch:
https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/
 
Fix issues according to Richard Sandiford && Richard Biener.
 
1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
2. Support multiple-rgroup for non-SLP auto-vectorization.
 
   For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the total 
length:
 
 _36 = MIN_EXPR ;
 
 First length (MIN (X, VF/N)):
   loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
 
 Second length (X - MIN (X, 1 * VF/N)):
   loop_len_16 = _36 - loop_len_15;
 
 Third length (X - MIN (X, 2 * VF/N)):
   _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
   loop_len_17 = _36 - _38;
 
 Forth length (X - MIN (X, 3 * VF/N)):
   _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
   loop_len_18 = _36 - _39;
 
The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length 
since using SELECT_VL
to adapt induction IV consumes more instructions than just using MIN_EXPR. 
Also, during testing,
I found it's hard to adjust length correctly according to SELECT_VL.
 
So, this patch we only use SELECT_VL for single-rgroup with single length 
control.
 
3. Fix document of select_vl for Richard Biener (remove mode N).
4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard 
Biener.
5. Keep loop_vinfo as first parameter for "vect_get_loop_len".
6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated 
at the caller site.
 
More comments from Richard Biener:
>> So it's not actually saturating.  The saturating operation is done by 
>> .WHILE_LEN?
I define the outcome of SELECT_VL (n, vf)  (WHILE_LEN) = IN_RANGE (0, min (n, 
vf)) will make 
the loop control counter never underflow zero.
 
>> I see.  I wonder if it makes sense to leave .WHILE_LEN aside for a start,
>> the above scheme should also work for single rgroups, no?
>> As said, it _looks_ like you can progress without .WHILE_LEN and using
>> .WHILE_LEN is a pure optimization?
Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow
target adjust any length = INRANGE (0, min (n, vf)) each iteration.
 
Let me known if I missed something for the V3 patch.
Thanks.
 
---
gcc/cfgloopmanip.cc|   2 +-
gcc/doc/md.texi|  34 +++
gcc/gimple-loop-interchange.cc |   2 +-
gcc/internal-fn.def|   1 +
gcc/optabs.def |   1 +
gcc/tree-ssa-loop-ivcanon.cc   |   2 +-
gcc/tree-ssa-loop-ivopts.cc|   2 +-
gcc/tree-ssa-loop-manip.cc |  18 +-
gcc/tree-ssa-loop-manip.h  |   4 +-
gcc/tree-vect-data-refs.cc |   8 +-
gcc/tree-vect-loop-manip.cc| 374 -
gcc/tree-vect-loop.cc  |  32 ++-
gcc/tree-vect-stmts.cc |  89 +++-
gcc/tree-vectorizer.h  |   4 +-
14 files changed, 535 insertions(+), 38 deletions(-)
 
diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index 0e3ad8ed742..6e09dcbb0b1 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -826,7 +826,7 @@ create_empty_loop_on_edge (edge entry_edge,
 }
   gsi = gsi_last_bb (loop_header);
-  create_iv (initial_value, stride, iv, loop, , false,
+  create_iv (initial_value, PLUS_EXPR, stride, iv, loop, , false,
 iv_before, iv_after);
   /* Insert loop ex

[PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-04 Thread juzhe . zhong
From: Ju-Zhe Zhong 

This patch is fixing V3 patch:
https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/

Fix issues according to Richard Sandiford && Richard Biener.

1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
2. Support multiple-rgroup for non-SLP auto-vectorization.

   For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the total 
length:

 _36 = MIN_EXPR ;

 First length (MIN (X, VF/N)):
   loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;

 Second length (X - MIN (X, 1 * VF/N)):
   loop_len_16 = _36 - loop_len_15;

 Third length (X - MIN (X, 2 * VF/N)):
   _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
   loop_len_17 = _36 - _38;

 Forth length (X - MIN (X, 3 * VF/N)):
   _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
   loop_len_18 = _36 - _39;

The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length 
since using SELECT_VL
to adapt induction IV consumes more instructions than just using MIN_EXPR. 
Also, during testing,
I found it's hard to adjust length correctly according to SELECT_VL.

So, this patch we only use SELECT_VL for single-rgroup with single length 
control.

3. Fix document of select_vl for Richard Biener (remove mode N).
4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard 
Biener.
5. Keep loop_vinfo as first parameter for "vect_get_loop_len".
6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated 
at the caller site.

More comments from Richard Biener:
>> So it's not actually saturating.  The saturating operation is done by 
>> .WHILE_LEN?
I define the outcome of SELECT_VL (n, vf)  (WHILE_LEN) = IN_RANGE (0, min (n, 
vf)) will make 
the loop control counter never underflow zero.

>> I see.  I wonder if it makes sense to leave .WHILE_LEN aside for a start,
>> the above scheme should also work for single rgroups, no?
>> As said, it _looks_ like you can progress without .WHILE_LEN and using
>> .WHILE_LEN is a pure optimization?
Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow
target adjust any length = INRANGE (0, min (n, vf)) each iteration.

Let me known if I missed something for the V3 patch.
Thanks.

---
 gcc/cfgloopmanip.cc|   2 +-
 gcc/doc/md.texi|  34 +++
 gcc/gimple-loop-interchange.cc |   2 +-
 gcc/internal-fn.def|   1 +
 gcc/optabs.def |   1 +
 gcc/tree-ssa-loop-ivcanon.cc   |   2 +-
 gcc/tree-ssa-loop-ivopts.cc|   2 +-
 gcc/tree-ssa-loop-manip.cc |  18 +-
 gcc/tree-ssa-loop-manip.h  |   4 +-
 gcc/tree-vect-data-refs.cc |   8 +-
 gcc/tree-vect-loop-manip.cc| 374 -
 gcc/tree-vect-loop.cc  |  32 ++-
 gcc/tree-vect-stmts.cc |  89 +++-
 gcc/tree-vectorizer.h  |   4 +-
 14 files changed, 535 insertions(+), 38 deletions(-)

diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index 0e3ad8ed742..6e09dcbb0b1 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -826,7 +826,7 @@ create_empty_loop_on_edge (edge entry_edge,
 }
 
   gsi = gsi_last_bb (loop_header);
-  create_iv (initial_value, stride, iv, loop, , false,
+  create_iv (initial_value, PLUS_EXPR, stride, iv, loop, , false,
 iv_before, iv_after);
 
   /* Insert loop exit condition.  */
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index cc4a93a8763..99cf0cdbdca 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
   operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
 @end smallexample
 
+@cindex @code{select_vl@var{m}} instruction pattern
+@item @code{select_vl@var{m}}
+Set operand 0 to the number of active elements in vector will be updated value.
+operand 1 is the total elements need to be updated value.
+operand 2 is the vectorization factor.
+The value of operand 0 is target dependent and flexible in each iteration.
+The operation of this pattern can be:
+
+@smallexample
+Case 1:
+operand0 = MIN (operand1, operand2);
+operand2 can be const_poly_int or poly_int related to vector mode size.
+Some target like RISC-V has a standalone instruction to get MIN (n, MODE SIZE) 
so
+that we can reduce a use of general purpose register.
+
+In this case, only the last iteration of the loop is partial iteration.
+@end smallexample
+
+@smallexample
+Case 2:
+if (operand1 <= operand2)
+  operand0 = operand1;
+else if (operand1 < 2 * operand2)
+  operand0 = IN_RANGE (ceil (operand1 / 2), operand2);
+else
+  operand0 = operand2;
+
+This case will evenly distribute work over the last 2 iterations of a 
stripmine loop.
+@end smallexample
+
+The output of this pattern is not only used as IV of loop control counter, but 
also
+is used as the IV of address calculation with multiply/shift operation. This 
allow
+us dynamic adjust the number of elements is processed in each iteration of the 
loop.
+
 @cindex