Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Hi, Richards. Could you take a look at this patch: https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618241.html Thanks juzhe.zh...@rivai.ai From: Richard Sandiford Date: 2023-05-11 20:42 To: juzhe.zhong\@rivai.ai CC: gcc-patches; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support "juzhe.zh...@rivai.ai" writes: > Thanks. I have read rgroup descriptions again. > Still I am not fully understand it clearly, bear with me :) > > I don't known how to differentiate Case 2 and Case 3. > > Case 2 is multiple rgroup for SLP. > Case 3 is multiple rgroup for non-SLP (VEC_PACK_TRUNC) > > Is it correct: > case 2: rgc->max_nscalarper_iter != 1 Yes. > Case 3 : rgc->max_nscalarper_iter == 1 but rgc->factor != 1? For case 3 it's: rgc->max_nscalars_per_iter == 1 && rgc != _VINFO_LENS (loop_vinfo)[0] rgc->factor is controlled by the target and just says what units IFN_LOAD_LEN works in. E.g. if we're loading 16-byte elements, but the underlying instruction measures bytes, the factor would be 2. Thanks, Richard
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
"juzhe.zh...@rivai.ai" writes: > Thanks. I have read rgroup descriptions again. > Still I am not fully understand it clearly, bear with me :) > > I don't known how to differentiate Case 2 and Case 3. > > Case 2 is multiple rgroup for SLP. > Case 3 is multiple rgroup for non-SLP (VEC_PACK_TRUNC) > > Is it correct: > case 2: rgc->max_nscalarper_iter != 1 Yes. > Case 3 : rgc->max_nscalarper_iter == 1 but rgc->factor != 1? For case 3 it's: rgc->max_nscalars_per_iter == 1 && rgc != _VINFO_LENS (loop_vinfo)[0] rgc->factor is controlled by the target and just says what units IFN_LOAD_LEN works in. E.g. if we're loading 16-byte elements, but the underlying instruction measures bytes, the factor would be 2. Thanks, Richard
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Thanks. I have read rgroup descriptions again. Still I am not fully understand it clearly, bear with me :) I don't known how to differentiate Case 2 and Case 3. Case 2 is multiple rgroup for SLP. Case 3 is multiple rgroup for non-SLP (VEC_PACK_TRUNC) Is it correct: case 2: rgc->max_nscalarper_iter != 1 Case 3 : rgc->max_nscalarper_iter == 1 but rgc->factor != 1? Thanks. juzhe.zh...@rivai.ai From: Richard Sandiford Date: 2023-05-11 19:29 To: juzhe.zhong\@rivai.ai CC: gcc-patches; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support "juzhe.zh...@rivai.ai" writes: > Oh, I see. But I saw there is a variable using_partial_vectors_p > in the loop data structure. > > Can I add a variable call using_select_vl_p ? Yeah. Please also add a wrapper macro like LOOP_VINFO_USING_PARTIAL_VECTORS_P. (I'm not really a fan of the wrappers, but it's better to be consistent.) > Since it may increase the size of data structure, I am not sure whether it is > appropriate. The structure is only temporary, and very few of them exist at a given time. Besides, there's already a layout hole on LP64 hosts around those booleans (between slp_unrolling_factor and scalar_loop). So the new boolean shouldn't grow the size of the structure. We can convert the booleans to bitfields if size ever becomes a problem. Thanks, Richard
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
"juzhe.zh...@rivai.ai" writes: > Oh, I see. But I saw there is a variable using_partial_vectors_p > in the loop data structure. > > Can I add a variable call using_select_vl_p ? Yeah. Please also add a wrapper macro like LOOP_VINFO_USING_PARTIAL_VECTORS_P. (I'm not really a fan of the wrappers, but it's better to be consistent.) > Since it may increase the size of data structure, I am not sure whether it is > appropriate. The structure is only temporary, and very few of them exist at a given time. Besides, there's already a layout hole on LP64 hosts around those booleans (between slp_unrolling_factor and scalar_loop). So the new boolean shouldn't grow the size of the structure. We can convert the booleans to bitfields if size ever becomes a problem. Thanks, Richard
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Oh, I see. But I saw there is a variable using_partial_vectors_p in the loop data structure. Can I add a variable call using_select_vl_p ? Since it may increase the size of data structure, I am not sure whether it is appropriate. Thanks. juzhe.zh...@rivai.ai From: Richard Sandiford Date: 2023-05-11 19:04 To: juzhe.zhong\@rivai.ai CC: gcc-patches; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support "juzhe.zh...@rivai.ai" writes: > Hi, Richard. Since create_iv has been approved and soon will be commited > after > we bootstrap && regression. > > Now, I plan to send patch for "decrement IV". > > After reading your comments, I have several questions: > > 1. >>if (use_bias_adjusted_len) >> return rgl->bias_adjusted_ctrl; >> + else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type, >> +OPTIMIZE_FOR_SPEED)) >> +{ >> + tree loop_len = rgl->controls[index]; >> + poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type); >> + poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype); >> + if (maybe_ne (nunits1, nunits2)) >> + { >> + /* A loop len for data type X can be reused for data type Y >> + if X has N times more elements than Y and if Y's elements >> + are N times bigger than X's. */ >> + gcc_assert (multiple_p (nunits1, nunits2)); >> + unsigned int factor = exact_div (nunits1, nunits2).to_constant (); >> + gimple_seq seq = NULL; >> + loop_len = gimple_build (, RDIV_EXPR, iv_type, loop_len, >> +build_int_cst (iv_type, factor)); >> + if (seq) >> + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT); >> + } >> + return loop_len; >> +} >>else >> return rgl->controls[index]; >> } > >> ...here. That is, the key isn't whether SELECT_VL is available, >> but instead whether we've decided to use it for this loop (unless >> I'm missing something). > > Let's me clarify it again: > > I do this here is for Case 2 SLP: > > Generate for len : _61 = _75 / 2; > I think it is similar with ARM SVE using VIEW_CONVER_EXPR to view_convert the > mask. > > You said we should not let SELECT_VL is available or not to decide it here. > Could you teach me how to handle this code here? Should I add a target hook > like: > TARGET_SLP_LOOP_LEN_RDIV_BY_FACTOR_P ? No. What I mean is: for each vectorised loop, we should make a decision, in one place only, whether to use SELECT_VL-based control flow or arithmetic-based control flow for that particular loop. That decision depends partly on direct_internal_fn_supported_p (a necessary but not sufficient condition), partly on whether the loop contains SLP nodes, etc. We should then record that decision in the loop_vec_info so that it is available to whichever code needs it. This is similar to LOOP_VINFO_USING_PARTIAL_VECTORS_P etc. Thanks, Richard
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
"juzhe.zh...@rivai.ai" writes: > Hi, Richard. Since create_iv has been approved and soon will be commited > after > we bootstrap && regression. > > Now, I plan to send patch for "decrement IV". > > After reading your comments, I have several questions: > > 1. >>if (use_bias_adjusted_len) >> return rgl->bias_adjusted_ctrl; >> + else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type, >> +OPTIMIZE_FOR_SPEED)) >> +{ >> + tree loop_len = rgl->controls[index]; >> + poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type); >> + poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype); >> + if (maybe_ne (nunits1, nunits2)) >> + { >> + /* A loop len for data type X can be reused for data type Y >> + if X has N times more elements than Y and if Y's elements >> + are N times bigger than X's. */ >> + gcc_assert (multiple_p (nunits1, nunits2)); >> + unsigned int factor = exact_div (nunits1, nunits2).to_constant (); >> + gimple_seq seq = NULL; >> + loop_len = gimple_build (, RDIV_EXPR, iv_type, loop_len, >> +build_int_cst (iv_type, factor)); >> + if (seq) >> + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT); >> + } >> + return loop_len; >> +} >>else >> return rgl->controls[index]; >> } > >> ...here. That is, the key isn't whether SELECT_VL is available, >> but instead whether we've decided to use it for this loop (unless >> I'm missing something). > > Let's me clarify it again: > > I do this here is for Case 2 SLP: > > Generate for len : _61 = _75 / 2; > I think it is similar with ARM SVE using VIEW_CONVER_EXPR to view_convert the > mask. > > You said we should not let SELECT_VL is available or not to decide it here. > Could you teach me how to handle this code here? Should I add a target hook > like: > TARGET_SLP_LOOP_LEN_RDIV_BY_FACTOR_P ? No. What I mean is: for each vectorised loop, we should make a decision, in one place only, whether to use SELECT_VL-based control flow or arithmetic-based control flow for that particular loop. That decision depends partly on direct_internal_fn_supported_p (a necessary but not sufficient condition), partly on whether the loop contains SLP nodes, etc. We should then record that decision in the loop_vec_info so that it is available to whichever code needs it. This is similar to LOOP_VINFO_USING_PARTIAL_VECTORS_P etc. Thanks, Richard
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Hi, Richard. Since create_iv has been approved and soon will be commited after we bootstrap && regression. Now, I plan to send patch for "decrement IV". After reading your comments, I have several questions: 1. >if (use_bias_adjusted_len) > return rgl->bias_adjusted_ctrl; > + else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type, > +OPTIMIZE_FOR_SPEED)) > +{ > + tree loop_len = rgl->controls[index]; > + poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type); > + poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype); > + if (maybe_ne (nunits1, nunits2)) > + { > + /* A loop len for data type X can be reused for data type Y > + if X has N times more elements than Y and if Y's elements > + are N times bigger than X's. */ > + gcc_assert (multiple_p (nunits1, nunits2)); > + unsigned int factor = exact_div (nunits1, nunits2).to_constant (); > + gimple_seq seq = NULL; > + loop_len = gimple_build (, RDIV_EXPR, iv_type, loop_len, > +build_int_cst (iv_type, factor)); > + if (seq) > + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT); > + } > + return loop_len; > +} >else > return rgl->controls[index]; > } > ...here. That is, the key isn't whether SELECT_VL is available, > but instead whether we've decided to use it for this loop (unless > I'm missing something). Let's me clarify it again: I do this here is for Case 2 SLP: Generate for len : _61 = _75 / 2; I think it is similar with ARM SVE using VIEW_CONVER_EXPR to view_convert the mask. You said we should not let SELECT_VL is available or not to decide it here. Could you teach me how to handle this code here? Should I add a target hook like: TARGET_SLP_LOOP_LEN_RDIV_BY_FACTOR_P ? 2. > _offsets); > + else if (loop_lens && loop_lens->length () == 1 > +&& direct_internal_fn_supported_p ( > + IFN_SELECT_VL, LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo), > + OPTIMIZE_FOR_SPEED) > +&& memory_access_type != VMAT_INVARIANT) > + dataref_ptr > + = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type, > + simd_lane_access_p ? loop : NULL, > + offset, , gsi, > + simd_lane_access_p, loop_lens, > + dr_info, memory_access_type); > Here too I think the check should be based on a cached record > of whether we're using SELECT_VL for this particular loop, Since only Case 1 is using SELECT_VL, Case 2 && 3 is using MIN_EXPR. I have no idea how to know whether it is using SELECT_VL. Could you teach me about this? >> What happens for loop_lens->length () != 1? As you can see, When we are using SELECT_VL (case 1), the length will be possible non-vf in non-final iteration. So here we need to adjust data reference pointer IV by the outcome of SELECT_VL. The gimple IR is like this: # vectp_B.8_16 = PHI ... _28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);ivtmp_15 = _28 * 4;..vect__1.10_18 = .LEN_LOAD (vectp_B.8_16, 128B, _28, 0);.. vectp_B.8_17 = vectp_B.8_16 + ivtmp_15; This is the case loop_lens->length () == 1. When it is loop_lens->length () != 1, it means it is Case 2 or Case 3. We always force MIN_EXPR using VF in non-final iteration.So the data reference IV is added by constant value (poly or non-poly). Maybe the codes here is ugly with using loop_lens->length () == 1?Could you give me the suggestions for this? I am gonna fix this patch by following your suggestions.Thanks. juzhe.zh...@rivai.ai From: Richard Sandiford Date: 2023-05-11 00:45 To: juzhe.zhong CC: gcc-patches; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support In addition to Jeff's comments: juzhe.zh...@rivai.ai writes: > [...] > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi > index cc4a93a8763..99cf0cdbdca 100644 > --- a/gcc/doc/md.texi > +++ b/gcc/doc/md.texi > @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++) >operand0[i] = operand0[i - 1] && (operand1 + i < operand2); > @end smallexample > > +@cindex @code{select_vl@var{m}} instruction pattern > +@item @code{select_vl@var{m}} > +Set operand 0 to the number of active elements in vector will be updated > value. > +operand 1 is the total elements need to be updated value. > +operand 2 is the vectorization factor. > +The value of operand 0 is target dependent and flexible in each iteration. > +The operation of this pattern can be: > + > +@smallexample > +Case 1: > +operand0 = MIN (operand1, operand2); > +operand2 can be const_poly_int or poly_int related to vector mode size. > +Some target like RISC-V has a standalone instruction to get MIN (n, MODE > SIZE) so > +that we can reduce a use of general purpose register. > + > +In
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Thank you so much. Can you take a look at this patch: https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618110.html Thanks. juzhe.zh...@rivai.ai From: Richard Sandiford Date: 2023-05-11 12:50 To: 钟居哲 CC: gcc-patches; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support 钟居哲 writes: > I am sorry that I am still confused about that. > > Is this what you want ? > > bool use_minus_p = TREE_CODE (step) == INTEGER_CST && ((TYPE_UNSIGNED > (TREE_TYPE (step)) && tree_int_cst_lt (step1, step)) > || (!TYPE_UNSIGNED (TREE_TYPE (step)) && > !tree_expr_nonnegative_warnv_p (step, ) && may_negate_without_overflow_p > (step))); > > /* For easier readability of the created code, produce MINUS_EXPRs > when suitable. */ > if (TREE_CODE (step) == INTEGER_CST) > { > if (TYPE_UNSIGNED (TREE_TYPE (step))) > { > step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > if (tree_int_cst_lt (step1, step)) > { > incr_op = MINUS_EXPR; /* Remove it. */ > step = step1; > } > } > else > { > bool ovf; > > if (!tree_expr_nonnegative_warnv_p (step, ) > && may_negate_without_overflow_p (step)) > { > incr_op = MINUS_EXPR; /* Remove it. */ > step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > } > } > } > if (POINTER_TYPE_P (TREE_TYPE (base))) > { > if (TREE_CODE (base) == ADDR_EXPR) > mark_addressable (TREE_OPERAND (base, 0)); > step = convert_to_ptrofftype (step); > if (incr_op == MINUS_EXPR) /* Change it into if (use_minus_p) */ > step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > incr_op = POINTER_PLUS_EXPR; /* Remove it. */ > } > /* Gimplify the step if necessary. We put the computations in front of the > loop (i.e. the step should be loop invariant). */ > step = force_gimple_operand (step, , true, NULL_TREE); > if (stmts) > gsi_insert_seq_on_edge_immediate (pe, stmts); > > if (POINTER_TYPE_P (TREE_TYPE (base))) > stmt = gimple_build_assign (va, POINTER_PLUS_EXPR, vb, step); > else if (use_minus_p) > stmt = gimple_build_assign (va, MINUS_EXPR, vb, step); > else > stmt = gimple_build_assign (va, incr_op, vb, step); > ... > > Since I have no idea to make stmts flips between PLUS_EXPR and MINUS_EXPR. No, I meant: - Rename the "code" argument to "incr_op". - Remove "tree_code incr_op = code;". - Replace both instances of: incr_op = MINUS_EXPR; with: incr_op = (incr_op == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR); The point is that the current code (rightly) assumes that incr_op always starts out as PLUS_EXPR, i.e. that STEP starts out applying positively. Making STEP apply in the opposite direction is then as simple as changing incr_op to MINUS_EXPR. But the new interface allows STEP to start out applying positively or negatively, and so this code needs to cope with both cases. Thanks, Richard
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
钟居哲 writes: > I am sorry that I am still confused about that. > > Is this what you want ? > > bool use_minus_p = TREE_CODE (step) == INTEGER_CST && ((TYPE_UNSIGNED > (TREE_TYPE (step)) && tree_int_cst_lt (step1, step)) > || (!TYPE_UNSIGNED (TREE_TYPE (step)) && > !tree_expr_nonnegative_warnv_p (step, ) && may_negate_without_overflow_p > (step))); > > /* For easier readability of the created code, produce MINUS_EXPRs > when suitable. */ > if (TREE_CODE (step) == INTEGER_CST) > { > if (TYPE_UNSIGNED (TREE_TYPE (step))) > { > step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > if (tree_int_cst_lt (step1, step)) > { > incr_op = MINUS_EXPR; /* Remove it. */ > step = step1; > } > } > else > { > bool ovf; > > if (!tree_expr_nonnegative_warnv_p (step, ) > && may_negate_without_overflow_p (step)) > { > incr_op = MINUS_EXPR; /* Remove it. */ > step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > } > } > } > if (POINTER_TYPE_P (TREE_TYPE (base))) > { > if (TREE_CODE (base) == ADDR_EXPR) > mark_addressable (TREE_OPERAND (base, 0)); > step = convert_to_ptrofftype (step); > if (incr_op == MINUS_EXPR) /* Change it into if (use_minus_p) */ > step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > incr_op = POINTER_PLUS_EXPR; /* Remove it. */ > } > /* Gimplify the step if necessary. We put the computations in front of the > loop (i.e. the step should be loop invariant). */ > step = force_gimple_operand (step, , true, NULL_TREE); > if (stmts) > gsi_insert_seq_on_edge_immediate (pe, stmts); > > if (POINTER_TYPE_P (TREE_TYPE (base))) > stmt = gimple_build_assign (va, POINTER_PLUS_EXPR, vb, step); > else if (use_minus_p) > stmt = gimple_build_assign (va, MINUS_EXPR, vb, step); > else > stmt = gimple_build_assign (va, incr_op, vb, step); > ... > > Since I have no idea to make stmts flips between PLUS_EXPR and MINUS_EXPR. No, I meant: - Rename the "code" argument to "incr_op". - Remove "tree_code incr_op = code;". - Replace both instances of: incr_op = MINUS_EXPR; with: incr_op = (incr_op == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR); The point is that the current code (rightly) assumes that incr_op always starts out as PLUS_EXPR, i.e. that STEP starts out applying positively. Making STEP apply in the opposite direction is then as simple as changing incr_op to MINUS_EXPR. But the new interface allows STEP to start out applying positively or negatively, and so this code needs to cope with both cases. Thanks, Richard
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
I am sorry that I am still confused about that. Is this what you want ? bool use_minus_p = TREE_CODE (step) == INTEGER_CST && ((TYPE_UNSIGNED (TREE_TYPE (step)) && tree_int_cst_lt (step1, step)) || (!TYPE_UNSIGNED (TREE_TYPE (step)) && !tree_expr_nonnegative_warnv_p (step, ) && may_negate_without_overflow_p (step))); /* For easier readability of the created code, produce MINUS_EXPRs when suitable. */ if (TREE_CODE (step) == INTEGER_CST) { if (TYPE_UNSIGNED (TREE_TYPE (step))) { step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); if (tree_int_cst_lt (step1, step)) { incr_op = MINUS_EXPR; /* Remove it. */ step = step1; } } else { bool ovf; if (!tree_expr_nonnegative_warnv_p (step, ) && may_negate_without_overflow_p (step)) { incr_op = MINUS_EXPR; /* Remove it. */ step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); } } } if (POINTER_TYPE_P (TREE_TYPE (base))) { if (TREE_CODE (base) == ADDR_EXPR) mark_addressable (TREE_OPERAND (base, 0)); step = convert_to_ptrofftype (step); if (incr_op == MINUS_EXPR) /* Change it into if (use_minus_p) */ step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); incr_op = POINTER_PLUS_EXPR; /* Remove it. */ } /* Gimplify the step if necessary. We put the computations in front of the loop (i.e. the step should be loop invariant). */ step = force_gimple_operand (step, , true, NULL_TREE); if (stmts) gsi_insert_seq_on_edge_immediate (pe, stmts); if (POINTER_TYPE_P (TREE_TYPE (base))) stmt = gimple_build_assign (va, POINTER_PLUS_EXPR, vb, step); else if (use_minus_p) stmt = gimple_build_assign (va, MINUS_EXPR, vb, step); else stmt = gimple_build_assign (va, incr_op, vb, step); ... Since I have no idea to make stmts flips between PLUS_EXPR and MINUS_EXPR. Thanks. juzhe.zh...@rivai.ai From: Richard Sandiford Date: 2023-05-11 05:28 To: 钟居哲 CC: gcc-patches; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support 钟居哲 writes: > Thanks Richard. > I am planning to seperate a patch with only creat_iv stuff only. > > Are you suggesting that I remove "tree_code incr_op = code;" > Use the argument directly ? > > I saw the codes here: > > /* For easier readability of the created code, produce MINUS_EXPRs > when suitable. */ > if (TREE_CODE (step) == INTEGER_CST) > { > if (TYPE_UNSIGNED (TREE_TYPE (step))) > { > step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > if (tree_int_cst_lt (step1, step)) > { > incr_op = MINUS_EXPR; > step = step1; > } > } > else > { > bool ovf; > > if (!tree_expr_nonnegative_warnv_p (step, ) > && may_negate_without_overflow_p (step)) > { > incr_op = MINUS_EXPR; > step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > } > } > } > if (POINTER_TYPE_P (TREE_TYPE (base))) > { > if (TREE_CODE (base) == ADDR_EXPR) > mark_addressable (TREE_OPERAND (base, 0)); > step = convert_to_ptrofftype (step); > if (incr_op == MINUS_EXPR) > step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > incr_op = POINTER_PLUS_EXPR; > } > /* Gimplify the step if necessary. We put the computations in front of the > loop (i.e. the step should be loop invariant). */ > step = force_gimple_operand (step, , true, NULL_TREE); > if (stmts) > gsi_insert_seq_on_edge_immediate (pe, stmts); > > stmt = gimple_build_assign (va, incr_op, vb, step); > ... > > It seems that it has complicated conditions here to change value of variable > "incr_op". > That's why I define a temporary variable "tree_code incr_op = code;" here and > let the following codes change the value of "incr_op". > > Could you give me some hints of dealing with this piece of code to get rid of > "tree_code incr_op = code;" ? Yeah, but like I said in the review, those later: incr_op = MINUS_EXPR; stmts need to be updated to something that flips between PLUS_EXPR and MINUS_EXPR (with updates to the comments). Just leaving them as-is is incorrect (in cases where the caller passed MINUS_EXPR rather than PLUS_EXPR). The POINTER_PLUS_EXPR handling is fine due to the conditional negate beforehand. Thanks, Richard
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
钟居哲 writes: > Thanks Richard. > I am planning to seperate a patch with only creat_iv stuff only. > > Are you suggesting that I remove "tree_code incr_op = code;" > Use the argument directly ? > > I saw the codes here: > > /* For easier readability of the created code, produce MINUS_EXPRs > when suitable. */ > if (TREE_CODE (step) == INTEGER_CST) > { > if (TYPE_UNSIGNED (TREE_TYPE (step))) > { > step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > if (tree_int_cst_lt (step1, step)) > { > incr_op = MINUS_EXPR; > step = step1; > } > } > else > { > bool ovf; > > if (!tree_expr_nonnegative_warnv_p (step, ) > && may_negate_without_overflow_p (step)) > { > incr_op = MINUS_EXPR; > step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > } > } > } > if (POINTER_TYPE_P (TREE_TYPE (base))) > { > if (TREE_CODE (base) == ADDR_EXPR) > mark_addressable (TREE_OPERAND (base, 0)); > step = convert_to_ptrofftype (step); > if (incr_op == MINUS_EXPR) > step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); > incr_op = POINTER_PLUS_EXPR; > } > /* Gimplify the step if necessary. We put the computations in front of the > loop (i.e. the step should be loop invariant). */ > step = force_gimple_operand (step, , true, NULL_TREE); > if (stmts) > gsi_insert_seq_on_edge_immediate (pe, stmts); > > stmt = gimple_build_assign (va, incr_op, vb, step); > ... > > It seems that it has complicated conditions here to change value of variable > "incr_op". > That's why I define a temporary variable "tree_code incr_op = code;" here and > let the following codes change the value of "incr_op". > > Could you give me some hints of dealing with this piece of code to get rid of > "tree_code incr_op = code;" ? Yeah, but like I said in the review, those later: incr_op = MINUS_EXPR; stmts need to be updated to something that flips between PLUS_EXPR and MINUS_EXPR (with updates to the comments). Just leaving them as-is is incorrect (in cases where the caller passed MINUS_EXPR rather than PLUS_EXPR). The POINTER_PLUS_EXPR handling is fine due to the conditional negate beforehand. Thanks, Richard
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Thanks Richard. I am planning to seperate a patch with only creat_iv stuff only. Are you suggesting that I remove "tree_code incr_op = code;" Use the argument directly ? I saw the codes here: /* For easier readability of the created code, produce MINUS_EXPRs when suitable. */ if (TREE_CODE (step) == INTEGER_CST) { if (TYPE_UNSIGNED (TREE_TYPE (step))) { step1 = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); if (tree_int_cst_lt (step1, step)) { incr_op = MINUS_EXPR; step = step1; } } else { bool ovf; if (!tree_expr_nonnegative_warnv_p (step, ) && may_negate_without_overflow_p (step)) { incr_op = MINUS_EXPR; step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); } } } if (POINTER_TYPE_P (TREE_TYPE (base))) { if (TREE_CODE (base) == ADDR_EXPR) mark_addressable (TREE_OPERAND (base, 0)); step = convert_to_ptrofftype (step); if (incr_op == MINUS_EXPR) step = fold_build1 (NEGATE_EXPR, TREE_TYPE (step), step); incr_op = POINTER_PLUS_EXPR; } /* Gimplify the step if necessary. We put the computations in front of the loop (i.e. the step should be loop invariant). */ step = force_gimple_operand (step, , true, NULL_TREE); if (stmts) gsi_insert_seq_on_edge_immediate (pe, stmts); stmt = gimple_build_assign (va, incr_op, vb, step); ... It seems that it has complicated conditions here to change value of variable "incr_op". That's why I define a temporary variable "tree_code incr_op = code;" here and let the following codes change the value of "incr_op". Could you give me some hints of dealing with this piece of code to get rid of "tree_code incr_op = code;" ? Thanks. juzhe.zh...@rivai.ai From: Richard Sandiford Date: 2023-05-11 00:45 To: juzhe.zhong CC: gcc-patches; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support In addition to Jeff's comments: juzhe.zh...@rivai.ai writes: > [...] > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi > index cc4a93a8763..99cf0cdbdca 100644 > --- a/gcc/doc/md.texi > +++ b/gcc/doc/md.texi > @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++) >operand0[i] = operand0[i - 1] && (operand1 + i < operand2); > @end smallexample > > +@cindex @code{select_vl@var{m}} instruction pattern > +@item @code{select_vl@var{m}} > +Set operand 0 to the number of active elements in vector will be updated > value. > +operand 1 is the total elements need to be updated value. > +operand 2 is the vectorization factor. > +The value of operand 0 is target dependent and flexible in each iteration. > +The operation of this pattern can be: > + > +@smallexample > +Case 1: > +operand0 = MIN (operand1, operand2); > +operand2 can be const_poly_int or poly_int related to vector mode size. > +Some target like RISC-V has a standalone instruction to get MIN (n, MODE > SIZE) so > +that we can reduce a use of general purpose register. > + > +In this case, only the last iteration of the loop is partial iteration. > +@end smallexample > + > +@smallexample > +Case 2: > +if (operand1 <= operand2) > + operand0 = operand1; > +else if (operand1 < 2 * operand2) > + operand0 = IN_RANGE (ceil (operand1 / 2), operand2); GCC's IN_RANGE is a predicate, so it would be best to avoid that here. Why isn't it simply ceil (operand1 / 2), which must be <= operand2? > +else > + operand0 = operand2; > + > +This case will evenly distribute work over the last 2 iterations of a > stripmine loop. > +@end smallexample > + > +The output of this pattern is not only used as IV of loop control counter, > but also > +is used as the IV of address calculation with multiply/shift operation. This > allow > +us dynamic adjust the number of elements is processed in each iteration of > the loop. > + > @cindex @code{check_raw_ptrs@var{m}} instruction pattern > @item @samp{check_raw_ptrs@var{m}} > Check whether, given two pointers @var{a} and @var{b} and a length @var{len}, > [...] > diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc > index 909b705d00d..5abca64379e 100644 > --- a/gcc/tree-ssa-loop-manip.cc > +++ b/gcc/tree-ssa-loop-manip.cc > @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3. If not see > so that we can free them all at once. */ > static bitmap_obstack loop_renamer_obstack; > > -/* Creates an induction variable with value BASE + STEP * iteration in LOOP. > +/* Creates an induction variable with value BASE (+/-) STEP * iteration in > LOOP. > + If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration. > + If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration. >
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
In addition to Jeff's comments: juzhe.zh...@rivai.ai writes: > [...] > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi > index cc4a93a8763..99cf0cdbdca 100644 > --- a/gcc/doc/md.texi > +++ b/gcc/doc/md.texi > @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++) >operand0[i] = operand0[i - 1] && (operand1 + i < operand2); > @end smallexample > > +@cindex @code{select_vl@var{m}} instruction pattern > +@item @code{select_vl@var{m}} > +Set operand 0 to the number of active elements in vector will be updated > value. > +operand 1 is the total elements need to be updated value. > +operand 2 is the vectorization factor. > +The value of operand 0 is target dependent and flexible in each iteration. > +The operation of this pattern can be: > + > +@smallexample > +Case 1: > +operand0 = MIN (operand1, operand2); > +operand2 can be const_poly_int or poly_int related to vector mode size. > +Some target like RISC-V has a standalone instruction to get MIN (n, MODE > SIZE) so > +that we can reduce a use of general purpose register. > + > +In this case, only the last iteration of the loop is partial iteration. > +@end smallexample > + > +@smallexample > +Case 2: > +if (operand1 <= operand2) > + operand0 = operand1; > +else if (operand1 < 2 * operand2) > + operand0 = IN_RANGE (ceil (operand1 / 2), operand2); GCC's IN_RANGE is a predicate, so it would be best to avoid that here. Why isn't it simply ceil (operand1 / 2), which must be <= operand2? > +else > + operand0 = operand2; > + > +This case will evenly distribute work over the last 2 iterations of a > stripmine loop. > +@end smallexample > + > +The output of this pattern is not only used as IV of loop control counter, > but also > +is used as the IV of address calculation with multiply/shift operation. This > allow > +us dynamic adjust the number of elements is processed in each iteration of > the loop. > + > @cindex @code{check_raw_ptrs@var{m}} instruction pattern > @item @samp{check_raw_ptrs@var{m}} > Check whether, given two pointers @var{a} and @var{b} and a length @var{len}, > [...] > diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc > index 909b705d00d..5abca64379e 100644 > --- a/gcc/tree-ssa-loop-manip.cc > +++ b/gcc/tree-ssa-loop-manip.cc > @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3. If not see > so that we can free them all at once. */ > static bitmap_obstack loop_renamer_obstack; > > -/* Creates an induction variable with value BASE + STEP * iteration in LOOP. > +/* Creates an induction variable with value BASE (+/-) STEP * iteration in > LOOP. > + If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration. > + If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration. > It is expected that neither BASE nor STEP are shared with other > expressions > (unless the sharing rules allow this). Use VAR as a base var_decl for it > (if NULL, a new temporary will be created). The increment will occur at > @@ -57,8 +59,8 @@ static bitmap_obstack loop_renamer_obstack; > VAR_AFTER (unless they are NULL). */ > > void > -create_iv (tree base, tree step, tree var, class loop *loop, > -gimple_stmt_iterator *incr_pos, bool after, > +create_iv (tree base, tree_code code, tree step, tree var, > +class loop *loop, gimple_stmt_iterator *incr_pos, bool after, > tree *var_before, tree *var_after) > { >gassign *stmt; > @@ -66,7 +68,9 @@ create_iv (tree base, tree step, tree var, class loop *loop, >tree initial, step1; >gimple_seq stmts; >tree vb, va; > - enum tree_code incr_op = PLUS_EXPR; > + /* The code can only be PLUS_EXPR or MINUS_EXPR. */ > + gcc_assert (code == PLUS_EXPR || code == MINUS_EXPR); > + tree_code incr_op = code; As Richard said, we should be able to get rid of incr_op, probably by calling the parameter incr_op. I think the later: incr_op = MINUS_EXPR; stmts need to be updated to something that flips between PLUS_EXPR and MINUS_EXPR (with updates to the comments). It would probably make sense to split the create_iv part out as a separate prepatch. >edge pe = loop_preheader_edge (loop); > >if (var != NULL_TREE) > @@ -1365,7 +1369,7 @@ tree_transform_and_unroll_loop (class loop *loop, > unsigned factor, >tree ctr_before, ctr_after; >gimple_stmt_iterator bsi = gsi_last_nondebug_bb (new_exit->src); >exit_if = as_a (gsi_stmt (bsi)); > - create_iv (exit_base, exit_step, NULL_TREE, loop, > + create_iv (exit_base, PLUS_EXPR, exit_step, NULL_TREE, loop, >, false, _before, _after); >gimple_cond_set_code (exit_if, exit_cmp); >gimple_cond_set_lhs (exit_if, ctr_after); > @@ -1580,8 +1584,8 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, > bool bump_in_latch) > gsi = gsi_last_bb (loop->latch); >else > gsi = gsi_last_nondebug_bb (loop->header); > - create_iv (build_int_cst_type (type, 0),
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
More details for Case 2: + _72 = MIN_EXPR ; + _75 = MIN_EXPR ; + ... + .LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0); + vectp_f.8_56 = vectp_f.8_51 + 16; + .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0); + ... + _61 = _75 / 2; + .LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0); + vectp_d.10_63 = vectp_d.10_59 + 16; + _64 = _72 / 2; + .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0); You may be confused that _61 = _75 / 2; and _64 = _72 / 2; Well, this is similiar VIEW_CONVERT_EXPR of mask in ARM SVE. Like ARM SVE: tree vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks, unsigned int nvectors, tree vectype, unsigned int index) { rgroup_controls *rgm = &(*masks)[nvectors - 1]; tree mask_type = rgm->type; /* Populate the rgroup's mask array, if this is the first time we've used it. */ if (rgm->controls.is_empty ()) { rgm->controls.safe_grow_cleared (nvectors, true); for (unsigned int i = 0; i < nvectors; ++i) { tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask"); /* Provide a dummy definition until the real one is available. */ SSA_NAME_DEF_STMT (mask) = gimple_build_nop (); rgm->controls[i] = mask; } } tree mask = rgm->controls[index]; if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type), TYPE_VECTOR_SUBPARTS (vectype))) { /* A loop mask for data type X can be reused for data type Y if X has N times more elements than Y and if Y's elements are N times bigger than X's. In this case each sequence of N elements in the loop mask will be all-zero or all-one. We can then view-convert the mask so that each sequence of N elements is replaced by a single element. */ gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type), TYPE_VECTOR_SUBPARTS (vectype))); gimple_seq seq = NULL; mask_type = truth_type_for (vectype); mask = gimple_build (, VIEW_CONVERT_EXPR, mask_type, mask); if (seq) gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT); } return mask; } I am doing similiar thing for RVV: +vect_get_loop_len (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi, + vec_loop_lens *lens, unsigned int nvectors, tree vectype, + unsigned int index) { rgroup_controls *rgl = &(*lens)[nvectors - 1]; bool use_bias_adjusted_len = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0; + tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo); /* Populate the rgroup's len array, if this is the first time we've used it. */ @@ -10400,6 +10403,27 @@ vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens, if (use_bias_adjusted_len) return rgl->bias_adjusted_ctrl; + else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type, + OPTIMIZE_FOR_SPEED)) +{ + tree loop_len = rgl->controls[index]; + poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type); + poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype); + if (maybe_ne (nunits1, nunits2)) + { + /* A loop len for data type X can be reused for data type Y +if X has N times more elements than Y and if Y's elements +are N times bigger than X's. */ + gcc_assert (multiple_p (nunits1, nunits2)); + unsigned int factor = exact_div (nunits1, nunits2).to_constant (); + gimple_seq seq = NULL; + loop_len = gimple_build (, RDIV_EXPR, iv_type, loop_len, + build_int_cst (iv_type, factor)); + if (seq) + gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT); + } + return loop_len; +} else return rgl->controls[index]; }Thansk. juzhe.zh...@rivai.ai From: juzhe.zh...@rivai.ai Date: 2023-05-09 21:27 To: richard.sandiford CC: gcc-patches; rguenther Subject: Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support Thanks, Richard. >> Could you go into more details about this? I imagined that for case 3, >> there would be a single SELECT_VL that decides how many scalar iterations >> will be handled by the current vector iteration, then we would "expand" >> the result (using MIN_EXPRs) to the multi-control cases. For case 2 , here is the example: + 2. Multiple rgroup, the Gimple IR should be: + + # i_23 = PHI + # vectp_f.8_51 = PHI + # vectp_d.10_59 = PHI + # ivtmp_70 = PHI + # ivtmp_73 = PHI + _72 = MIN_EXPR ; + _75 = MIN_EXPR ; + _1 = i_23 * 2; + _2 = (long unsigned int) _1; + _3 = _2 * 2; + _4 = f_15(D) + _3; + _5 = _2 + 1
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
ode. We just need to work out some of the details. Ok, I also prefer keeping select_vl. >>FWIW, I share Kewen's concern about duplicating too much logic between >>masks, current lengths, and SELECT_VL lengths. But I haven't looked at >>the patch yet and so I don't know how easy it would be to avoid that. I understand the concern, the current implementation are in the isolated function "vect_set_loop_controls_by_select_vl", it's easier to review the implementation. Maybe we can first make the whole implementation codes in "vect_set_loop_controls_by_select_vl" to be stable after review, then we can try to incorporate these codes of "vect_set_loop_controls_by_select_vl" into "vect_set_loop_controls_directly". Thanks. juzhe.zh...@rivai.ai From: Richard Sandiford Date: 2023-05-09 20:59 To: 钟居哲 CC: gcc-patches; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support 钟居哲 writes: > Hi, Richards. I would like to give more information about this patch so that > it will make this patch easier for you to review. > > Currently, I saw we have 3 situations that we need to handle in case of loop > control IV in auto-vectorization: > 1. Single rgroup loop control (ncopies == 1 && vec_num == 1 so > loop_len.length () == 1 or rgc->lengh () == 1) > 2. Multiple rgroup for SLP. > 3. Multiple rgroup for non-SLP which is Richard Sandiford point out > previously (For example, VEC_PACK_TRUNC). > > To talk about this patch, let me talk about RVV LLVM implementation first > which inspire me to send this patch: > https://reviews.llvm.org/D99750 > > According to LLVM implementation, they are adding a middle-end IR called > "get_vector_length" which has totally > same functionality as "select_vl" in this patch (I call it "while_len" > previously, now I rename it as "select_vl" following Richard suggestion). > > The LLVM implementation is only let "get_vector_length" calculate the number > of elements in single rgroup loop. > For multi rgroup, let's take a look at it: > https://godbolt.org/z/3GP78efTY > > void > foo1 (short *__restrict f, int *__restrict d, int n) > { > for (int i = 0; i < n; ++i) > { > f[i * 2 + 0] = 1; > f[i * 2 + 1] = 2; > d[i] = 3; > } > } > > RISC-V Clang: > foo1: # @foo1 > # %bb.0: > bleza2, .LBB0_8 > # %bb.1: > li a3, 16 > bgeua2, a3, .LBB0_3 > # %bb.2: > li a6, 0 > j .LBB0_6 > .LBB0_3: > andia6, a2, -16 > lui a3, 32 > addiw a3, a3, 1 > vsetivlizero, 8, e32, m2, ta, ma > vmv.v.x v8, a3 > vmv.v.i v10, 3 > mv a4, a6 > mv a5, a1 > mv a3, a0 > .LBB0_4:# =>This Inner Loop Header: Depth=1 > addia7, a5, 32 > addit0, a3, 32 > vsetivlizero, 16, e16, m2, ta, ma > vse16.v v8, (a3) > vse16.v v8, (t0) > vsetivlizero, 8, e32, m2, ta, ma > vse32.v v10, (a5) > vse32.v v10, (a7) > addia3, a3, 64 > addia4, a4, -16 > addia5, a5, 64 > bneza4, .LBB0_4 > # %bb.5: > beq a6, a2, .LBB0_8 > .LBB0_6: > sllia3, a6, 2 > add a0, a0, a3 > addia0, a0, 2 > add a1, a1, a3 > sub a2, a2, a6 > li a3, 1 > li a4, 2 > li a5, 3 > .LBB0_7:# =>This Inner Loop Header: Depth=1 > sh a3, -2(a0) > sh a4, 0(a0) > sw a5, 0(a1) > addia0, a0, 4 > addia2, a2, -1 > addia1, a1, 4 > bneza2, .LBB0_7 > .LBB0_8: > ret > > ARM GCC: > foo1: > cmp w2, 0 > ble .L1 > addvl x4, x0, #1 > mov x3, 0 > cntbx7 > cntbx6, all, mul #2 > sbfiz x2, x2, 1, 32 > ptrue p0.b, all > mov x5, x2 > adrpx8, .LC0 > uqdech x5 > add x8, x8, :lo12:.LC0 > whilelo p1.h, xzr, x5 > ld1rw z1.s, p0/z, [x8] > mov z0.s, #3 > whilelo p0.h, xzr, x2 > .L3: > st1hz1.h, p0, [x0, x3, lsl 1] > st1hz1.h, p1, [x4, x3, lsl 1] > st1wz0.s, p1, [x1, #1, mul vl] > add x3, x3, x7 > whilelo p1.h, x3, x5 > st1wz0.s, p0,
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Could you go into more details about this? I imagined that for case 3, there would be a single SELECT_VL that decides how many scalar iterations will be handled by the current vector iteration, then we would "expand" the result (using MIN_EXPRs) to the multi-control cases. In a sense that replicates what the SVE code above is doing. But for SVE, it's possible to "optimise" the unpacking of a WHILELO result due to the lack of implementation-defined behaviour. So conceptually we have a single WHILELO that is unpacked one layer to give two masks. But in reality we optimise that two WHILELOs. No such optimisation is possible with SELECT_VL, and maybe that produces poor-quality code. That might be what you mean (haven't had chance to look at the patch itself yet, but hope to tomorrow). For case 2 (max_nscalars_per_iter > 1), I think it would make conceptual sense to pass max_nscalars_per_iter to SELECT_VL or (alternatively) multiply the result of the SELECT_VL by max_nscalars_per_iter. But it's only worth doing one of those two things if it produces reasonable code for RVV. > Now base on these situations, we only have "select_vl" for single-rgroup, but > multiple-rgroup (both SLP and non-SLP), we just > use MIN_EXPR. > > Is it more appropriate that we should remove "select_vl" and just use > MIN_EXPR force VF elements in each non-final iteration in single rgroup? It's up to you. If you don't think select_vl is worth it then it would obviously make the vectoriser changes a bit simpler. But making the vectoriser simpler isn't IMO the goal here. SELECT_VL seems like a perfectly reasonable construct to add to target-independent code. We just need to work out some of the details. FWIW, I share Kewen's concern about duplicating too much logic between masks, current lengths, and SELECT_VL lengths. But I haven't looked at the patch yet and so I don't know how easy it would be to avoid that. Thanks, Richard > Like the codegen according to RVV ISA example (show as RVV LLVM): > https://repo.hca.bsc.es/epic/z/oynhzP > > ASM: > vec_add:# @vec_add > bleza3, .LBB0_3 > li a4, 0 > .LBB0_2:# %vector.body > sub a5, a3, a4 > vsetvli a6, a5, e64, m1, ta, mu ==> change it into a6 = min (a5, VF) > && vsetvli zero, a6, e64, m1, ta, mu > sllia7, a4, 3 > add a5, a1, a7 > vle64.v v8, (a5) > add a5, a2, a7 > vle64.v v9, (a5) > vfadd.vvv8, v8, v9 > add a7, a7, a0 > add a4, a4, a6 > vse64.v v8, (a7) > bne a4, a3, .LBB0_2 > .LBB0_3:# %for.cond.cleanup > ret > > So if we remove "select_vl" just use MIN_EXPR to force VF elements in > non-final iteration, we will end up with having like this: > > vec_add:# @vec_add > bleza3, .LBB0_3 > csrrVF in bytes, vlenb > .LBB0_2:# %vector.body > sub a5, a3, VF in elements > a6 = min (a5, VF in elements) > vsetvli zero, a6, e64, m1, ta, mu > add a5, a1, VF in bytes > vle64.v v8, (a5) > add a5, a2, VF in bytes > vle64.v v9, (a5) > vfadd.vvv8, v8, v9 > add a7, a7, a0 > vse64.v v8, (a7) > bne a4, a3, .LBB0_2 > .LBB0_3:# %for.cond.cleanup > ret > > Both codegen are working good for RVV. > Just only the second one can not have the RVV special optimization (even > distributed workload in last 2 iterations). > > Expecting any suggestions from you. > Thanks. > > > juzhe.zh...@rivai.ai > > From: juzhe.zhong > Date: 2023-05-04 21:25 > To: gcc-patches > CC: richard.sandiford; rguenther; Ju-Zhe Zhong > Subject: [PATCH V4] VECT: Add decrement IV iteration loop control by variable > amount support > From: Ju-Zhe Zhong > > This patch is fixing V3 patch: > https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/ > > Fix issues according to Richard Sandiford && Richard Biener. > > 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford. > 2. Support multiple-rgroup for non-SLP auto-vectorization. > >For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the > total length: > > _36 = MIN_EXPR ; > > First length (MIN (X, VF/N)): >loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>; > > Second length (X - MIN (X, 1 * VF/N)): >loop_len_16 = _36 - loop_len
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Hi,Richards. Would you mind reviewing this patch? Thanks. juzhe.zh...@rivai.ai From: Jeff Law Date: 2023-05-07 23:19 To: juzhe.zhong; gcc-patches CC: richard.sandiford; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support On 5/4/23 07:25, juzhe.zh...@rivai.ai wrote: > From: Ju-Zhe Zhong > > This patch is fixing V3 patch: > https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/ > > Fix issues according to Richard Sandiford && Richard Biener. > > 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford. > 2. Support multiple-rgroup for non-SLP auto-vectorization. > > For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the > total length: > > _36 = MIN_EXPR ; > > First length (MIN (X, VF/N)): > loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>; > > Second length (X - MIN (X, 1 * VF/N)): > loop_len_16 = _36 - loop_len_15; > > Third length (X - MIN (X, 2 * VF/N)): > _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>; > loop_len_17 = _36 - _38; > > Forth length (X - MIN (X, 3 * VF/N)): > _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>; > loop_len_18 = _36 - _39; > > The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length > since using SELECT_VL > to adapt induction IV consumes more instructions than just using MIN_EXPR. > Also, during testing, > I found it's hard to adjust length correctly according to SELECT_VL. > > So, this patch we only use SELECT_VL for single-rgroup with single length > control. > > 3. Fix document of select_vl for Richard Biener (remove mode N). > 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard > Biener. > 5. Keep loop_vinfo as first parameter for "vect_get_loop_len". > 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated > at the caller site. > > More comments from Richard Biener: >>> So it's not actually saturating. The saturating operation is done by >>> .WHILE_LEN? > I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) = IN_RANGE (0, min (n, > vf)) will make > the loop control counter never underflow zero. > >>> I see. I wonder if it makes sense to leave .WHILE_LEN aside for a start, >>> the above scheme should also work for single rgroups, no? >>> As said, it _looks_ like you can progress without .WHILE_LEN and using >>> .WHILE_LEN is a pure optimization? > Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow > target adjust any length = INRANGE (0, min (n, vf)) each iteration. > > Let me known if I missed something for the V3 patch. So at a high level this is pretty good. I think there's some improvements we should make in the documentation and comments, but I'm comfortable with most of the implementation details. > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi > index cc4a93a8763..99cf0cdbdca 100644 > --- a/gcc/doc/md.texi > +++ b/gcc/doc/md.texi > @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++) > operand0[i] = operand0[i - 1] && (operand1 + i < operand2); > @end smallexample > > +@cindex @code{select_vl@var{m}} instruction pattern > +@item @code{select_vl@var{m}} > +Set operand 0 to the number of active elements in vector will be updated > value. This reads rather poorly. Is this still accurate? Set operand 0 to the number of active elements in a vector to be updated in a loop iteration based on the total number of elements to be updated, the vectorization factor and vector properties of the target. > +operand 1 is the total elements need to be updated value. operand 1 is the total elements in the vector to be updated. > + > +The output of this pattern is not only used as IV of loop control counter, > but also > +is used as the IV of address calculation with multiply/shift operation. This > allow > +us dynamic adjust the number of elements is processed in each iteration of > the loop. This allows dynamic adjustment of the number of elements processed each loop iteration. -- is that still accurate and does it read better? > @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3. If not see > so that we can free them all at once. */ > static bitmap_obstack loop_renamer_obstack; > > -/* Creates an induction variable with value BASE + STEP * iteration in LOOP. > +/* Creates an induction variable with value BASE (+/-) STEP * iteration in > LOOP. > + If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration. > + If CODE is MINUS_EX
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Hi, Kewen. I have tried to implement "decrement IV" feature and incorporate into "vect_set_loop_controls_directly". Since the implementation is quite different from vect_set_loop_controls_directly, it will make vect_set_loop_controls_directly very complicated sometimes it makes me very hard to debug when I am testing. I am not sure but I can try again. This patch isolated those implementation into a single function "vect_set_loop_controls_by_select_vl" which makes Richards easier to review codes. Well, I think I can try again to incorporate those codes into "vect_set_loop_controls_directly" when they finish the review process of "vect_set_loop_controls_by_select_vl". Thanks. juzhe.zh...@rivai.ai From: Kewen.Lin Date: 2023-05-08 15:55 To: juzhe.zh...@rivai.ai CC: gcc-patches; rguenther; richard.sandiford Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support Hi Juzhe, > Hi, Kewen. > >>> Sorry for chiming in, I had some concern here. >>> We already have some handlings for partial vector in length in >>> vect_set_loop_controls_directly >>>(actually it deals with both mask and length), the proposed >>>vect_set_loop_controls_by_select_vl >>>for select_vl looks like a variant for partial vector in length (comparing >>>to the normal MIN), >>>and also adopts decrement IV. IMHO, it seems better to modify/extend the >>>current handling in >>>vect_set_loop_controls_directly for length, or factor out the handlings for >>>length there and >>>extend the factored one. Otherwise, it means we have two sets of handlings >>>for partial vector >>>in lengths, it looks bad to maintain. As the previous discussion, adopting >>>decrement IV is an >>>enhancement for loop control, it's good for both cases w/ or w/o select_vl. >>>If the above >>>understanding is correct, the possible steps seem to be: >>> - factor out the handling for length (* optional) >>> - modify it with decrement IV >>> - extend it with select_vl. >>>In future if some RVV vendor wants to degenerate select_vl to min, it can >>>just adopt the same >>>handlings with min by not defining select_vl optab. > > You mean like this: > doing this inside vect_set_loop_controls_directly ? > if (use_while_len_p) > return vect_set_loop_controls_by_while_len(...) No, I meant either factoring out those handlings for partial vector in length in function vect_set_loop_controls_directly to one separated function like: vect_set_loop_controls_directly_length and rename the existing one to vect_set_loop_controls_directly_mask, or keep the existing vect_set_loop_controls_directly for both mask and length but modify/extend the part for length. If there is no much code to share between mask and length, the former may be better. BR, Kewen
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Hi Juzhe, > Hi, Kewen. > >>> Sorry for chiming in, I had some concern here. >>> We already have some handlings for partial vector in length in >>> vect_set_loop_controls_directly >>>(actually it deals with both mask and length), the proposed >>>vect_set_loop_controls_by_select_vl >>>for select_vl looks like a variant for partial vector in length (comparing >>>to the normal MIN), >>>and also adopts decrement IV. IMHO, it seems better to modify/extend the >>>current handling in >>>vect_set_loop_controls_directly for length, or factor out the handlings for >>>length there and >>>extend the factored one. Otherwise, it means we have two sets of handlings >>>for partial vector >>>in lengths, it looks bad to maintain. As the previous discussion, adopting >>>decrement IV is an >>>enhancement for loop control, it's good for both cases w/ or w/o select_vl. >>>If the above >>>understanding is correct, the possible steps seem to be: >>> - factor out the handling for length (* optional) >>> - modify it with decrement IV >>> - extend it with select_vl. >>>In future if some RVV vendor wants to degenerate select_vl to min, it can >>>just adopt the same >>>handlings with min by not defining select_vl optab. > > You mean like this: > doing this inside vect_set_loop_controls_directly ? > if (use_while_len_p) > return vect_set_loop_controls_by_while_len(...) No, I meant either factoring out those handlings for partial vector in length in function vect_set_loop_controls_directly to one separated function like: vect_set_loop_controls_directly_length and rename the existing one to vect_set_loop_controls_directly_mask, or keep the existing vect_set_loop_controls_directly for both mask and length but modify/extend the part for length. If there is no much code to share between mask and length, the former may be better. BR, Kewen
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Hi, Kewen.>> Sorry for chiming in, I had some concern here. >> We already have some handlings for partial vector in length in >> vect_set_loop_controls_directly >>(actually it deals with both mask and length), the proposed >>vect_set_loop_controls_by_select_vl >>for select_vl looks like a variant for partial vector in length (comparing to >>the normal MIN), >>and also adopts decrement IV. IMHO, it seems better to modify/extend the >>current handling in >>vect_set_loop_controls_directly for length, or factor out the handlings for >>length there and >>extend the factored one. Otherwise, it means we have two sets of handlings >>for partial vector >>in lengths, it looks bad to maintain. As the previous discussion, adopting >>decrement IV is an >>enhancement for loop control, it's good for both cases w/ or w/o select_vl. >>If the above >>understanding is correct, the possible steps seem to be: >> - factor out the handling for length (* optional) >> - modify it with decrement IV >> - extend it with select_vl. >>In future if some RVV vendor wants to degenerate select_vl to min, it can >>just adopt the same >>handlings with min by not defining select_vl optab. You mean like this: doing this inside vect_set_loop_controls_directly ? if (use_while_len_p) return vect_set_loop_controls_by_while_len (...) Thanks. juzhe.zh...@rivai.ai From: Kewen.Lin Date: 2023-05-08 13:35 To: juzhe.zhong; rguenther; richard.sandiford CC: gcc-patches Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support Hi, on 2023/5/4 21:25, juzhe.zh...@rivai.ai wrote: > From: Ju-Zhe Zhong > > This patch is fixing V3 patch: > https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/ > > Fix issues according to Richard Sandiford && Richard Biener. > > 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford. > 2. Support multiple-rgroup for non-SLP auto-vectorization. > >For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the > total length: > > _36 = MIN_EXPR ; > > First length (MIN (X, VF/N)): >loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>; > > Second length (X - MIN (X, 1 * VF/N)): >loop_len_16 = _36 - loop_len_15; > > Third length (X - MIN (X, 2 * VF/N)): >_38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>; >loop_len_17 = _36 - _38; > > Forth length (X - MIN (X, 3 * VF/N)): >_39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>; >loop_len_18 = _36 - _39; > > The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length > since using SELECT_VL > to adapt induction IV consumes more instructions than just using MIN_EXPR. > Also, during testing, > I found it's hard to adjust length correctly according to SELECT_VL. > > So, this patch we only use SELECT_VL for single-rgroup with single length > control. > > 3. Fix document of select_vl for Richard Biener (remove mode N). > 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard > Biener. > 5. Keep loop_vinfo as first parameter for "vect_get_loop_len". > 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated > at the caller site. > > More comments from Richard Biener: >>> So it's not actually saturating. The saturating operation is done by >>> .WHILE_LEN? > I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) = IN_RANGE (0, min (n, > vf)) will make > the loop control counter never underflow zero. > >>> I see. I wonder if it makes sense to leave .WHILE_LEN aside for a start, >>> the above scheme should also work for single rgroups, no? >>> As said, it _looks_ like you can progress without .WHILE_LEN and using >>> .WHILE_LEN is a pure optimization? > Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow > target adjust any length = INRANGE (0, min (n, vf)) each iteration. > > Let me known if I missed something for the V3 patch. > Thanks. > > --- > gcc/cfgloopmanip.cc| 2 +- > gcc/doc/md.texi| 34 +++ > gcc/gimple-loop-interchange.cc | 2 +- > gcc/internal-fn.def| 1 + > gcc/optabs.def | 1 + > gcc/tree-ssa-loop-ivcanon.cc | 2 +- > gcc/tree-ssa-loop-ivopts.cc| 2 +- > gcc/tree-ssa-loop-manip.cc | 18 +- > gcc/tree-ssa-loop-manip.h | 4 +- > gcc/tree-vect-data-refs.cc | 8 +- > gcc/tree-vect-loop-manip.cc| 374 - > gcc/t
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Hi, on 2023/5/4 21:25, juzhe.zh...@rivai.ai wrote: > From: Ju-Zhe Zhong > > This patch is fixing V3 patch: > https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/ > > Fix issues according to Richard Sandiford && Richard Biener. > > 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford. > 2. Support multiple-rgroup for non-SLP auto-vectorization. > >For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the > total length: > > _36 = MIN_EXPR ; > > First length (MIN (X, VF/N)): >loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>; > > Second length (X - MIN (X, 1 * VF/N)): >loop_len_16 = _36 - loop_len_15; > > Third length (X - MIN (X, 2 * VF/N)): >_38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>; >loop_len_17 = _36 - _38; > > Forth length (X - MIN (X, 3 * VF/N)): >_39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>; >loop_len_18 = _36 - _39; > > The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length > since using SELECT_VL > to adapt induction IV consumes more instructions than just using MIN_EXPR. > Also, during testing, > I found it's hard to adjust length correctly according to SELECT_VL. > > So, this patch we only use SELECT_VL for single-rgroup with single length > control. > > 3. Fix document of select_vl for Richard Biener (remove mode N). > 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard > Biener. > 5. Keep loop_vinfo as first parameter for "vect_get_loop_len". > 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated > at the caller site. > > More comments from Richard Biener: >>> So it's not actually saturating. The saturating operation is done by >>> .WHILE_LEN? > I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) = IN_RANGE (0, min (n, > vf)) will make > the loop control counter never underflow zero. > >>> I see. I wonder if it makes sense to leave .WHILE_LEN aside for a start, >>> the above scheme should also work for single rgroups, no? >>> As said, it _looks_ like you can progress without .WHILE_LEN and using >>> .WHILE_LEN is a pure optimization? > Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow > target adjust any length = INRANGE (0, min (n, vf)) each iteration. > > Let me known if I missed something for the V3 patch. > Thanks. > > --- > gcc/cfgloopmanip.cc| 2 +- > gcc/doc/md.texi| 34 +++ > gcc/gimple-loop-interchange.cc | 2 +- > gcc/internal-fn.def| 1 + > gcc/optabs.def | 1 + > gcc/tree-ssa-loop-ivcanon.cc | 2 +- > gcc/tree-ssa-loop-ivopts.cc| 2 +- > gcc/tree-ssa-loop-manip.cc | 18 +- > gcc/tree-ssa-loop-manip.h | 4 +- > gcc/tree-vect-data-refs.cc | 8 +- > gcc/tree-vect-loop-manip.cc| 374 - > gcc/tree-vect-loop.cc | 32 ++- > gcc/tree-vect-stmts.cc | 89 +++- > gcc/tree-vectorizer.h | 4 +- > 14 files changed, 535 insertions(+), 38 deletions(-) > > diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc > index 0e3ad8ed742..6e09dcbb0b1 100644 > --- a/gcc/cfgloopmanip.cc > +++ b/gcc/cfgloopmanip.cc > @@ -826,7 +826,7 @@ create_empty_loop_on_edge (edge entry_edge, > } > >gsi = gsi_last_bb (loop_header); > - create_iv (initial_value, stride, iv, loop, , false, > + create_iv (initial_value, PLUS_EXPR, stride, iv, loop, , false, >iv_before, iv_after); > >/* Insert loop exit condition. */ > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi > index cc4a93a8763..99cf0cdbdca 100644 > --- a/gcc/doc/md.texi > +++ b/gcc/doc/md.texi > @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++) >operand0[i] = operand0[i - 1] && (operand1 + i < operand2); > @end smallexample > > +@cindex @code{select_vl@var{m}} instruction pattern > +@item @code{select_vl@var{m}} > +Set operand 0 to the number of active elements in vector will be updated > value. > +operand 1 is the total elements need to be updated value. > +operand 2 is the vectorization factor. > +The value of operand 0 is target dependent and flexible in each iteration. > +The operation of this pattern can be: > + > +@smallexample > +Case 1: > +operand0 = MIN (operand1, operand2); > +operand2 can be const_poly_int or poly_int related to vector mode size. > +Some target like RISC-V has a standalone instruction to get MIN (n, MODE > SIZE) so > +that we can reduce a use of general purpose register. > + > +In this case, only the last iteration of the loop is partial iteration. > +@end smallexample > + > +@smallexample > +Case 2: > +if (operand1 <= operand2) > + operand0 = operand1; > +else if (operand1 < 2 * operand2) > + operand0 = IN_RANGE (ceil (operand1 / 2), operand2); > +else > + operand0 = operand2; > + > +This case will evenly distribute work over the last 2 iterations of a >
Re: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
>> It's been pretty standard to stick with just PLUS_EXPR for this stuff >> and instead negate the constant to produce the same effect as >> MINUS_EXPR. Is there a reason we're not continuing that practice? >> Sorry if you've answered this already -- if you have, you can just point >> me at the prior discussion and I'll read it. Richard (Sandiford) has answered this question: https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616745.html And as Richard (Biener) said, it will make IVOPTs failed if we have variable IVs. However, we already have implemented variable IVs in downstream RVV GCC and works fine and I don't see any bad codegen so far. So, I think it may not be a serious issue for RVV. Besides, this implementation is not my idea which is just following the guide coming from RVV ISA spec. And also inspired by the implementation coming from LLVM. Reference: 1). RVV ISA: https://github.com/riscv/riscv-v-spec/blob/master/example/vvaddint32.s 2). LLVM length stuff implementation (Should note that "get_vector_length" pattern in LLVM is totally doing the same thing as "select_vl" pattern in GCC here: https://reviews.llvm.org/D99750 vvaddint32: vsetvli t0, a0, e32, ta, ma # Set vector length based on 32-bit vectors vle32.v v0, (a1) # Get first vector sub a0, a0, t0 # Decrement number done slli t0, t0, 2 # Multiply number done by 4 bytes add a1, a1, t0 # Bump pointer vle32.v v1, (a2) # Get second vector add a2, a2, t0 # Bump pointer vadd.vv v2, v0, v1 # Sum vectors vse32.v v2, (a3) # Store result add a3, a3, t0 # Bump pointer bnez a0, vvaddint32# Loop back ret# Finished Notice "sub a0, a0, t0", the "t0" is the variable coming from "vsetvli t0, a0, e32, ta, ma" which is generated by "get_vector_length" in LLVM, or similiar it is also generatd by "select_vl" in GCC too. Other comments from Jeff will be addressed in the next patch (V5), I will wait for Richards (both Sandiford && Biener that are the experts in Loop Vectorizer) comments. Then send V5 patch which is including all comments from Jeff && Richards (Sandiford && Biener). Thanks. juzhe.zh...@rivai.ai From: Jeff Law Date: 2023-05-07 23:19 To: juzhe.zhong; gcc-patches CC: richard.sandiford; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support On 5/4/23 07:25, juzhe.zh...@rivai.ai wrote: > From: Ju-Zhe Zhong > > This patch is fixing V3 patch: > https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/ > > Fix issues according to Richard Sandiford && Richard Biener. > > 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford. > 2. Support multiple-rgroup for non-SLP auto-vectorization. > > For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the > total length: > > _36 = MIN_EXPR ; > > First length (MIN (X, VF/N)): > loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>; > > Second length (X - MIN (X, 1 * VF/N)): > loop_len_16 = _36 - loop_len_15; > > Third length (X - MIN (X, 2 * VF/N)): > _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>; > loop_len_17 = _36 - _38; > > Forth length (X - MIN (X, 3 * VF/N)): > _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>; > loop_len_18 = _36 - _39; > > The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length > since using SELECT_VL > to adapt induction IV consumes more instructions than just using MIN_EXPR. > Also, during testing, > I found it's hard to adjust length correctly according to SELECT_VL. > > So, this patch we only use SELECT_VL for single-rgroup with single length > control. > > 3. Fix document of select_vl for Richard Biener (remove mode N). > 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard > Biener. > 5. Keep loop_vinfo as first parameter for "vect_get_loop_len". > 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated > at the caller site. > > More comments from Richard Biener: >>> So it's not actually saturating. The saturating operation is done by >>> .WHILE_LEN? > I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) = IN_RANGE (0, min (n, > vf)) will make > the loop control counter never underflow zero. > >>> I see. I wonder if it makes sense to leave .WHILE_LEN aside for a start, >>> the above scheme should also work for single rgroups, no? >>> As said, it _looks_ li
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
On 5/4/23 07:25, juzhe.zh...@rivai.ai wrote: From: Ju-Zhe Zhong This patch is fixing V3 patch: https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/ Fix issues according to Richard Sandiford && Richard Biener. 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford. 2. Support multiple-rgroup for non-SLP auto-vectorization. For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the total length: _36 = MIN_EXPR ; First length (MIN (X, VF/N)): loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>; Second length (X - MIN (X, 1 * VF/N)): loop_len_16 = _36 - loop_len_15; Third length (X - MIN (X, 2 * VF/N)): _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>; loop_len_17 = _36 - _38; Forth length (X - MIN (X, 3 * VF/N)): _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>; loop_len_18 = _36 - _39; The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length since using SELECT_VL to adapt induction IV consumes more instructions than just using MIN_EXPR. Also, during testing, I found it's hard to adjust length correctly according to SELECT_VL. So, this patch we only use SELECT_VL for single-rgroup with single length control. 3. Fix document of select_vl for Richard Biener (remove mode N). 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard Biener. 5. Keep loop_vinfo as first parameter for "vect_get_loop_len". 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated at the caller site. More comments from Richard Biener: So it's not actually saturating. The saturating operation is done by .WHILE_LEN? I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) = IN_RANGE (0, min (n, vf)) will make the loop control counter never underflow zero. I see. I wonder if it makes sense to leave .WHILE_LEN aside for a start, the above scheme should also work for single rgroups, no? As said, it _looks_ like you can progress without .WHILE_LEN and using .WHILE_LEN is a pure optimization? Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow target adjust any length = INRANGE (0, min (n, vf)) each iteration. Let me known if I missed something for the V3 patch. So at a high level this is pretty good. I think there's some improvements we should make in the documentation and comments, but I'm comfortable with most of the implementation details. diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index cc4a93a8763..99cf0cdbdca 100644 --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++) operand0[i] = operand0[i - 1] && (operand1 + i < operand2); @end smallexample +@cindex @code{select_vl@var{m}} instruction pattern +@item @code{select_vl@var{m}} +Set operand 0 to the number of active elements in vector will be updated value. This reads rather poorly. Is this still accurate? Set operand 0 to the number of active elements in a vector to be updated in a loop iteration based on the total number of elements to be updated, the vectorization factor and vector properties of the target. +operand 1 is the total elements need to be updated value. operand 1 is the total elements in the vector to be updated. + +The output of this pattern is not only used as IV of loop control counter, but also +is used as the IV of address calculation with multiply/shift operation. This allow +us dynamic adjust the number of elements is processed in each iteration of the loop. This allows dynamic adjustment of the number of elements processed each loop iteration. -- is that still accurate and does it read better? @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3. If not see so that we can free them all at once. */ static bitmap_obstack loop_renamer_obstack; -/* Creates an induction variable with value BASE + STEP * iteration in LOOP. +/* Creates an induction variable with value BASE (+/-) STEP * iteration in LOOP. + If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration. + If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration. It is expected that neither BASE nor STEP are shared with other expressions (unless the sharing rules allow this). Use VAR as a base var_decl for it (if NULL, a new temporary will be created). The increment will occur at It's been pretty standard to stick with just PLUS_EXPR for this stuff and instead negate the constant to produce the same effect as MINUS_EXPR. Is there a reason we're not continuing that practice? Sorry if you've answered this already -- if you have, you can just point me at the prior discussion and I'll read it. diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc index 44bd5f2c805..d63ded5d4f0 100644 --- a/gcc/tree-vect-loop-manip.cc +++ b/gcc/tree-vect-loop-manip.cc @@ -385,6 +385,48 @@
Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
; vsetvli zero, a6, e64, m1, ta, mu sllia7, a4, 3 add a5, a1, a7 vle64.v v8, (a5) add a5, a2, a7 vle64.v v9, (a5) vfadd.vvv8, v8, v9 add a7, a7, a0 add a4, a4, a6 vse64.v v8, (a7) bne a4, a3, .LBB0_2 .LBB0_3:# %for.cond.cleanup ret So if we remove "select_vl" just use MIN_EXPR to force VF elements in non-final iteration, we will end up with having like this: vec_add:# @vec_add bleza3, .LBB0_3 csrrVF in bytes, vlenb .LBB0_2:# %vector.body sub a5, a3, VF in elements a6 = min (a5, VF in elements) vsetvli zero, a6, e64, m1, ta, mu add a5, a1, VF in bytes vle64.v v8, (a5) add a5, a2, VF in bytes vle64.v v9, (a5) vfadd.vvv8, v8, v9 add a7, a7, a0 vse64.v v8, (a7) bne a4, a3, .LBB0_2 .LBB0_3:# %for.cond.cleanup ret Both codegen are working good for RVV. Just only the second one can not have the RVV special optimization (even distributed workload in last 2 iterations). Expecting any suggestions from you. Thanks. juzhe.zh...@rivai.ai From: juzhe.zhong Date: 2023-05-04 21:25 To: gcc-patches CC: richard.sandiford; rguenther; Ju-Zhe Zhong Subject: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support From: Ju-Zhe Zhong This patch is fixing V3 patch: https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/ Fix issues according to Richard Sandiford && Richard Biener. 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford. 2. Support multiple-rgroup for non-SLP auto-vectorization. For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the total length: _36 = MIN_EXPR ; First length (MIN (X, VF/N)): loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>; Second length (X - MIN (X, 1 * VF/N)): loop_len_16 = _36 - loop_len_15; Third length (X - MIN (X, 2 * VF/N)): _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>; loop_len_17 = _36 - _38; Forth length (X - MIN (X, 3 * VF/N)): _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>; loop_len_18 = _36 - _39; The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length since using SELECT_VL to adapt induction IV consumes more instructions than just using MIN_EXPR. Also, during testing, I found it's hard to adjust length correctly according to SELECT_VL. So, this patch we only use SELECT_VL for single-rgroup with single length control. 3. Fix document of select_vl for Richard Biener (remove mode N). 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard Biener. 5. Keep loop_vinfo as first parameter for "vect_get_loop_len". 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated at the caller site. More comments from Richard Biener: >> So it's not actually saturating. The saturating operation is done by >> .WHILE_LEN? I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) = IN_RANGE (0, min (n, vf)) will make the loop control counter never underflow zero. >> I see. I wonder if it makes sense to leave .WHILE_LEN aside for a start, >> the above scheme should also work for single rgroups, no? >> As said, it _looks_ like you can progress without .WHILE_LEN and using >> .WHILE_LEN is a pure optimization? Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow target adjust any length = INRANGE (0, min (n, vf)) each iteration. Let me known if I missed something for the V3 patch. Thanks. --- gcc/cfgloopmanip.cc| 2 +- gcc/doc/md.texi| 34 +++ gcc/gimple-loop-interchange.cc | 2 +- gcc/internal-fn.def| 1 + gcc/optabs.def | 1 + gcc/tree-ssa-loop-ivcanon.cc | 2 +- gcc/tree-ssa-loop-ivopts.cc| 2 +- gcc/tree-ssa-loop-manip.cc | 18 +- gcc/tree-ssa-loop-manip.h | 4 +- gcc/tree-vect-data-refs.cc | 8 +- gcc/tree-vect-loop-manip.cc| 374 - gcc/tree-vect-loop.cc | 32 ++- gcc/tree-vect-stmts.cc | 89 +++- gcc/tree-vectorizer.h | 4 +- 14 files changed, 535 insertions(+), 38 deletions(-) diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc index 0e3ad8ed742..6e09dcbb0b1 100644 --- a/gcc/cfgloopmanip.cc +++ b/gcc/cfgloopmanip.cc @@ -826,7 +826,7 @@ create_empty_loop_on_edge (edge entry_edge, } gsi = gsi_last_bb (loop_header); - create_iv (initial_value, stride, iv, loop, , false, + create_iv (initial_value, PLUS_EXPR, stride, iv, loop, , false, iv_before, iv_after); /* Insert loop ex
[PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
From: Ju-Zhe Zhong This patch is fixing V3 patch: https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zh...@rivai.ai/ Fix issues according to Richard Sandiford && Richard Biener. 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford. 2. Support multiple-rgroup for non-SLP auto-vectorization. For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the total length: _36 = MIN_EXPR ; First length (MIN (X, VF/N)): loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>; Second length (X - MIN (X, 1 * VF/N)): loop_len_16 = _36 - loop_len_15; Third length (X - MIN (X, 2 * VF/N)): _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>; loop_len_17 = _36 - _38; Forth length (X - MIN (X, 3 * VF/N)): _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>; loop_len_18 = _36 - _39; The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length since using SELECT_VL to adapt induction IV consumes more instructions than just using MIN_EXPR. Also, during testing, I found it's hard to adjust length correctly according to SELECT_VL. So, this patch we only use SELECT_VL for single-rgroup with single length control. 3. Fix document of select_vl for Richard Biener (remove mode N). 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard Biener. 5. Keep loop_vinfo as first parameter for "vect_get_loop_len". 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated at the caller site. More comments from Richard Biener: >> So it's not actually saturating. The saturating operation is done by >> .WHILE_LEN? I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) = IN_RANGE (0, min (n, vf)) will make the loop control counter never underflow zero. >> I see. I wonder if it makes sense to leave .WHILE_LEN aside for a start, >> the above scheme should also work for single rgroups, no? >> As said, it _looks_ like you can progress without .WHILE_LEN and using >> .WHILE_LEN is a pure optimization? Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow target adjust any length = INRANGE (0, min (n, vf)) each iteration. Let me known if I missed something for the V3 patch. Thanks. --- gcc/cfgloopmanip.cc| 2 +- gcc/doc/md.texi| 34 +++ gcc/gimple-loop-interchange.cc | 2 +- gcc/internal-fn.def| 1 + gcc/optabs.def | 1 + gcc/tree-ssa-loop-ivcanon.cc | 2 +- gcc/tree-ssa-loop-ivopts.cc| 2 +- gcc/tree-ssa-loop-manip.cc | 18 +- gcc/tree-ssa-loop-manip.h | 4 +- gcc/tree-vect-data-refs.cc | 8 +- gcc/tree-vect-loop-manip.cc| 374 - gcc/tree-vect-loop.cc | 32 ++- gcc/tree-vect-stmts.cc | 89 +++- gcc/tree-vectorizer.h | 4 +- 14 files changed, 535 insertions(+), 38 deletions(-) diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc index 0e3ad8ed742..6e09dcbb0b1 100644 --- a/gcc/cfgloopmanip.cc +++ b/gcc/cfgloopmanip.cc @@ -826,7 +826,7 @@ create_empty_loop_on_edge (edge entry_edge, } gsi = gsi_last_bb (loop_header); - create_iv (initial_value, stride, iv, loop, , false, + create_iv (initial_value, PLUS_EXPR, stride, iv, loop, , false, iv_before, iv_after); /* Insert loop exit condition. */ diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index cc4a93a8763..99cf0cdbdca 100644 --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++) operand0[i] = operand0[i - 1] && (operand1 + i < operand2); @end smallexample +@cindex @code{select_vl@var{m}} instruction pattern +@item @code{select_vl@var{m}} +Set operand 0 to the number of active elements in vector will be updated value. +operand 1 is the total elements need to be updated value. +operand 2 is the vectorization factor. +The value of operand 0 is target dependent and flexible in each iteration. +The operation of this pattern can be: + +@smallexample +Case 1: +operand0 = MIN (operand1, operand2); +operand2 can be const_poly_int or poly_int related to vector mode size. +Some target like RISC-V has a standalone instruction to get MIN (n, MODE SIZE) so +that we can reduce a use of general purpose register. + +In this case, only the last iteration of the loop is partial iteration. +@end smallexample + +@smallexample +Case 2: +if (operand1 <= operand2) + operand0 = operand1; +else if (operand1 < 2 * operand2) + operand0 = IN_RANGE (ceil (operand1 / 2), operand2); +else + operand0 = operand2; + +This case will evenly distribute work over the last 2 iterations of a stripmine loop. +@end smallexample + +The output of this pattern is not only used as IV of loop control counter, but also +is used as the IV of address calculation with multiply/shift operation. This allow +us dynamic adjust the number of elements is processed in each iteration of the loop. + @cindex