"Kewen.Lin" <li...@linux.ibm.com> writes: > Hi Richard, > > Thanks for the review again! > > on 2020/7/25 上午12:21, Richard Sandiford wrote: >> "Kewen.Lin" <li...@linux.ibm.com> writes: >> >> Thanks, the rearrangement of the existing code looks good. Could you >> split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff >> out into separate patches? >> > > Splitted to https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550691.html. > > errr... that subject should be with prefix "[PATCH] vect:". > > [snip ...] > (Some comments in the snipped content will be done in v4) > >>> + here. */ >>> + >>> + /* For now we only operate length-based partial vectors on Power, >>> + which has constant VF all the time, we need some tweakings below >>> + if it doesn't hold in future. */ >>> + gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()); >> >> Where do you rely on this? There didn't seem to be any obvious >> to_constant uses. Since this is “only” a cost calculation, we should >> be using assumed_vf. > > Sorry for the confusion. This was intended for the poly things like > VF or nitems_per_ctrl which isn't constant during compilation time, > then get people's attention on the possible runtime cost on things like > scaling up for nitems_step etc. But I just realized that the computations > like the multiply with another constant can operate on the coefficient, > it looks there is no runtime cost then? If so, I think I thought too > much before. ;-) > >>> - prologue_cost_vec.release (); >>> - epilogue_cost_vec.release (); >>> + (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, >>> scalar_stmt, >>> + NULL, NULL_TREE, 0, vect_prologue); >>> + (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, >>> scalar_stmt, >>> + NULL, NULL_TREE, 0, vect_body); >> >> IMO this seems to be reproducing too much of the functions that you >> referred to. And the danger with that is that they could easily >> get out of sync later. > > Good point! 
> The original intention was to model as much as we can, to avoid a bad
> decision due to some unmodeled pieces, such as when the loop body is
> small and some computation becomes non-negligible.  The same
> out-of-sync risk also applies to other code.  How about adding some
> "note" comments in those functions?
>
> The updated v4 is attached, addressing your comments as well as
> Segher's.
>
> BR,
> Kewen
> -----
>
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (rs6000_adjust_vect_cost_per_loop): New
> 	function.
> 	(rs6000_finish_cost): Call rs6000_adjust_vect_cost_per_loop.
> 	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
> 	modeling for vector with length.
> 	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Update
> 	function comment.
> 	* tree-vect-stmts.c (vect_gen_len): Likewise.
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 009afc5f894..86ef584e09b 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>    return retval;
>  }
>
> +/* For target-specific vectorization costs which can't be handled per
> +   stmt, check the requisite conditions and adjust the vectorization
> +   cost accordingly if they are satisfied.  One typical example is to
> +   model the shift cost for vectors with length by counting the number
> +   of required lengths under condition LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
> +{
> +  struct loop *loop = data->loop_info;
> +  gcc_assert (loop);
> +  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
> +
> +  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      unsigned int shift_cnt = 0;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  /* Each length needs one shift to fill into bits 0-7.  */
> +	  shift_cnt += num_vectors_m1 + 1;
> +
> +      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt,
> +			    scalar_stmt, NULL, NULL_TREE, 0, vect_body);
> +    }
> +}
> +
>  /* Implement targetm.vectorize.finish_cost.  */
>
>  static void
> @@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
>    rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
>
>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      rs6000_adjust_vect_cost_per_loop (cost_data);
> +      rs6000_density_test (cost_data);
> +    }
>
>    /* Don't vectorize minimum-vectorization-factor, simple copy loops
>       that require versioning for any reason.  The vectorization is at
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 490e7befea3..9d0e3fc525e 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -412,7 +412,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>
>     This means that we cannot guarantee that such an induction variable
>     would ever hit a value that produces a set of all-false masks or zero
> -   lengths for RGC.  */
> +   lengths for RGC.
> +
> +   Note: if this is changed, please check whether the cost modeling in
> +   vect_estimate_min_profitable_iters needs to be updated too.  */
>
>  static tree
>  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 06cde4b1da3..a00160a7f2d 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3798,6 +3798,70 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>        (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
>  			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
>      }
> +  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      /* Referring to the functions vect_set_loop_condition_partial_vectors
> +	 and vect_set_loop_controls_directly, we need to generate each
> +	 length in the prologue and in the loop body if required.  Although
> +	 there are some possible optimizations, we consider the worst case
> +	 here.  */
> +
> +      /* For wrap-around checking.  */
> +      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +      unsigned int compare_precision = TYPE_PRECISION (compare_type);
> +      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
> +
> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> +      bool need_iterate_p
> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
> +
> +      /* Init min/max, shift and minus costs relative to a single
> +	 scalar_stmt.  For now we only use length-based partial vectors
> +	 on Power; target-specific cost tweaking may be needed for other
> +	 ports in the future.  */
> +      unsigned int min_max_cost = 2;
> +      unsigned int shift_cost = 1, minus_cost = 1;
Please instead add a scalar_min_max to vect_cost_for_stmt, and use
scalar_stmt for the shift and minus.  There shouldn't be any Power things
hard-coded into target-independent code.

> +      /* Init cost relative to single scalar_stmt.  */
> +      unsigned int prologue_cnt = 0;
> +      unsigned int body_cnt = 0;

Maybe _stmts instead of _cnt, to make it clearer what we're counting.
At first it looked like this was actually the raw cost.  I guess with
the above change we'd also want separate counters for min_max.

> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  {
> +	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
> +
> +	    /* May need one shift for nitems_total computation.  */
> +	    if (nitems != 1 && !niters_known_p)
> +	      prologue_cnt += shift_cost;
> +
> +	    /* Need to handle wrap around.  */
> +	    if (iv_limit == -1
> +		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
> +		    > compare_precision))
> +	      prologue_cnt += (min_max_cost + minus_cost);

I think we should instead split this condition out into a subroutine
that both this function and vect_set_loop_condition_partial_vectors
can use.  It would take the loop_vec_info and rgroup_controls as
arguments.  That means that we might recompute things a few more
times, but this shouldn't be performance-critical.

Thanks,
Richard