"Kewen.Lin" <li...@linux.ibm.com> writes:
> Hi Richard,
>
> Thanks for the review again!
>
> on 2020/7/25 上午12:21, Richard Sandiford wrote:
>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>> 
>> Thanks, the rearrangement of the existing code looks good.  Could you
>> split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff
>> out into separate patches?
>> 
>
> Splitted to https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550691.html.
>
> errr... that subject should be with prefix "[PATCH] vect:".
>
> [snip ...] 
> (Some comments in the snipped content will be done in v4)
>
>>> +    here.  */
>>> +
>>> +      /* For now we only operate length-based partial vectors on Power,
>>> +    which has constant VF all the time, we need some tweakings below
>>> +    if it doesn't hold in future.  */
>>> +      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
>> 
>> Where do you rely on this?  There didn't seem to be any obvious
>> to_constant uses.  Since this is “only” a cost calculation, we should
>> be using assumed_vf.
>
> Sorry for the confusion.  This was intended for the poly things like
> VF or nitems_per_ctrl which isn't constant during compilation time,
> then get people's attention on the possible runtime cost on things like
> scaling up for nitems_step etc.  But I just realized that the computations
> like the multiply with another constant can operate on the coefficient,
> it looks there is no runtime cost then?  If so, I think I thought too
> much before.  ;-)
>
>>> -      prologue_cost_vec.release ();
>>> -      epilogue_cost_vec.release ();
>>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, 
>>> scalar_stmt,
>>> +                       NULL, NULL_TREE, 0, vect_prologue);
>>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, 
>>> scalar_stmt,
>>> +                       NULL, NULL_TREE, 0, vect_body);
>> 
>> IMO this seems to be reproducing too much of the functions that you
>> referred to.  And the danger with that is that they could easily
>> get out of sync later.
>
> Good point!  The original intention was to model as possible as we can,
> to avoid some bad decision due to some unmodeled pieces, like the case
> the loop body is small and some computation become nonnegligible.
> The unsync risks seems also applied for other codes.  How about adding
> some "note" comments in those functions?
>
> The updated v4 is attached by addressing your comments as well as Segher's
> comments.
>
> BR,
> Kewen
> -----
>
> gcc/ChangeLog:
>
>       * config/rs6000/rs6000.c (rs6000_adjust_vect_cost_per_loop): New
>       function.
>       (rs6000_finish_cost): Call rs6000_adjust_vect_cost_per_loop.
>       * tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
>       modeling for vector with length.
>       * tree-vect-loop-manip.c (vect_set_loop_controls_directly): Update
>       function comment.
>       * tree-vect-stmts.c (vect_gen_len): Likewise.
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 009afc5f894..86ef584e09b 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void 
> *data, int count,
>    return retval;
>  }
>  
> +/* For some target specific vectorization cost which can't be handled per 
> stmt,
> +   we check the requisite conditions and adjust the vectorization cost
> +   accordingly if satisfied.  One typical example is to model shift cost for
> +   vector with length by counting number of required lengths under condition
> +   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
> +{
> +  struct loop *loop = data->loop_info;
> +  gcc_assert (loop);
> +  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
> +
> +  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      unsigned int shift_cnt = 0;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +     if (rgc->type)
> +       /* Each length needs one shift to fill into bits 0-7.  */
> +       shift_cnt += num_vectors_m1 + 1;
> +
> +      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, 
> scalar_stmt,
> +                         NULL, NULL_TREE, 0, vect_body);
> +    }
> +}
> +
>  /* Implement targetm.vectorize.finish_cost.  */
>  
>  static void
> @@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned 
> *prologue_cost,
>    rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
>  
>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      rs6000_adjust_vect_cost_per_loop (cost_data);
> +      rs6000_density_test (cost_data);
> +    }
>  
>    /* Don't vectorize minimum-vectorization-factor, simple copy loops
>       that require versioning for any reason.  The vectorization is at
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 490e7befea3..9d0e3fc525e 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -412,7 +412,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, 
> rgroup_controls *dest_rgm,
>  
>     This means that we cannot guarantee that such an induction variable
>     would ever hit a value that produces a set of all-false masks or zero
> -   lengths for RGC.  */
> +   lengths for RGC.
> +
> +   Note that please check cost modeling whether needed to be updated in
> +   function vect_estimate_min_profitable_iters if any changes.  */
>  
>  static tree
>  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 06cde4b1da3..a00160a7f2d 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3798,6 +3798,70 @@ vect_estimate_min_profitable_iters (loop_vec_info 
> loop_vinfo,
>        (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
>                           vector_stmt, NULL, NULL_TREE, 0, vect_body);
>      }
> +  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      /* Referring to the functions vect_set_loop_condition_partial_vectors
> +      and vect_set_loop_controls_directly, we need to generate each
> +      length in the prologue and in the loop body if required. Although
> +      there are some possible optimizations, we consider the worst case
> +      here.  */
> +
> +      /* For wrap around checking.  */
> +      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +      unsigned int compare_precision = TYPE_PRECISION (compare_type);
> +      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
> +
> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> +      bool need_iterate_p
> +     = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +        && !vect_known_niters_smaller_than_vf (loop_vinfo));
> +
> +      /* Init min/max, shift and minus cost relative to single
> +      scalar_stmt. For now we only use length-based partial vectors on
> +      Power, target specific cost tweaking may be needed for other
> +      ports in future.  */
> +      unsigned int min_max_cost = 2;
> +      unsigned int shift_cost = 1, minus_cost = 1;

Please instead add a scalar_min_max to vect_cost_for_stmt, and use
scalar_stmt for shift and minus.  There shouldn't be any Power things
hard-coded into target-independent code.

> +      /* Init cost relative to single scalar_stmt.  */
> +      unsigned int prologue_cnt = 0;
> +      unsigned int body_cnt = 0;

Maybe _stmts instead of _cnt, to make it clearer what we're counting.
At first it looked like this was actually the raw cost.  I guess with
the above change we'd also want separate counters for min_max.

> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +     if (rgc->type)
> +       {
> +         unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
> +
> +         /* May need one shift for nitems_total computation.  */
> +         if (nitems != 1 && !niters_known_p)
> +           prologue_cnt += shift_cost;
> +
> +         /* Need to handle wrap around.  */
> +         if (iv_limit == -1
> +             || (wi::min_precision (iv_limit * nitems, UNSIGNED)
> +                 > compare_precision))
> +           prologue_cnt += (min_max_cost + minus_cost);

I think we should instead split this condition out into a subroutine
that both this function and vect_set_loop_condition_partial_vectors
can use.  It would take the loop_vec_info and rgroup_controls as
arguments.

That means that we might recompute things a few more times, but this
shouldn't be performance-critical.

Thanks,
Richard

Reply via email to