Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Mon, Nov 8, 2021 at 2:06 PM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>>
>> Richard Biener <richard.guent...@gmail.com> writes:
>> > On Mon, Nov 8, 2021 at 11:45 AM Richard Sandiford via Gcc-patches
>> > <gcc-patches@gcc.gnu.org> wrote:
>> >>
>> >> One of the things we want to do on AArch64 is compare vector loops
>> >> side-by-side and pick the best one.  For some targets, we want this
>> >> to be based on issue rates as well as the usual latency-based costs
>> >> (at least for loops with relatively high iteration counts).
>> >>
>> >> The current approach to doing this is: when costing vectorisation
>> >> candidate A, try to guess what the other main candidate B will look
>> >> like and adjust A's latency-based cost up or down based on the likely
>> >> difference between A and B's issue rates.  This effectively means
>> >> that we try to cost parts of B at the same time as A, without actually
>> >> being able to see B.
>> >
>> > I think the idea of the current costing is that compares are always
>> > to the original scalar loop (with comparing the resulting magic
>> > cost numbers) so that by means of transitivity comparing two
>> > vector variants work as well.  So I'm not sure where 'variant B'
>> > comes into play here?
>>
>> E.g. A could be SVE and B could be Advanced SIMD, or A could be
>> SVE with fully-packed vectors and B could be SVE with partially
>> unpacked vectors.
>>
>> One motivating case is Neoverse V1, which is a 256-bit SVE target.
>> There, scalar code, Advanced SIMD code and SVE code have different issue
>> characteristics.  SVE vector ops generally have the same latencies as
>> the corresponding Advanced SIMD ops.  Advanced SIMD ops generally
>> have double the throughput of SVE ones, so that the overall bit-for-bit
>> throughputs are roughly equal.  However, there are differences due to
>> predication, load/store handling, and so on, and those differences
>> should be decisive in some cases.
>>
>> For SLP, latency-based costs are fine.  But for loop costs, they hide
>> too many details.  What works best in practice, both for vector vs.
>> vector and vector vs. scalar, is:
>>
>> (1) compare issue rates between two loop bodies (vector vs. vector
>>     or vector vs. scalar)
>> (2) if issue rates are equal for a given number of scalar iterations,
>>     compare the latencies of the loop bodies, as we do now
>> (3) if both the above are equal, compare the latencies of the prologue
>>     and epilogue, as we do now
>>
>> It's very difficult to combine latency and issue rate information
>> into a single per-statement integer, in such a way that the integers
>> can just be summed and compared.
>
> Yeah, so the idea I had when proposing the init_cost/add_stmt/finish_cost
> was that finish_cost would determine the issue rate cost part, for
> example if the uarch can issue 2 ops with ISA A and 4 ops with ISA B
> then finish_cost would compute the number of cycles required to
> issue all vector stmts.  That yields something like an IPC figure that the
> latency cost could be divided by.  Doing that for both ISA A and ISA B would produce
> weighted latency values that can be compared?  Alternatively
> you can of course simulate issue with the actual instruction latencies
> in mind and produce an overall iteration latency number.
>
> Comparing just latency or issue rate in isolation is likely not good
> enough?

In practice, comparing issue rates in isolation does seem to be what we
want as the first level of comparison.  I don't think total latency /
issue rate is really a meaningful combination.  It makes sense to the
extent that “lower latency == good” and “higher issue rate == good”,
but I don't think it's the case that halving the total latency is
equally valuable as doubling the issue rate.  In practice the latter is
a significant win while the former might or might not be.

total latency / issue rate also runs the risk of double-counting:
reducing the number of ops decreases the total latency *and* increases
the issue rate, which could have a quadratic effect.  (This is why we
don't decrease SVE costs when we think that the SVE code will issue
more quickly than the Advanced SIMD code.)

For a long-running loop, the only latencies that really matter are
for loop-carried dependencies.  The issue rate information takes
those into account, in that it tracks reduction and (with later
patches) induction limiters.

Thanks,
Richard

>
>> So returning to the above, when costing an SVE loop A, we currently
>> collect on-the-side issue information about both A and the likely
>> Advanced SIMD implementation B.  If B would have a higher throughput
>> than A then we multiply A's latency-based costs by:
>>
>>    ceil(B throughput/A throughput)
>>
>> The problem is, we don't actually know whether B exists or what it
>> would look like (given that Advanced SIMD and SVE have different
>> features).
>>
>> In principle, we should do the same in reverse, but since SVE needs
>> fewer ops than Advanced SIMD, the latencies are usually smaller already.
>>
>> We also do something similar for the scalar code.  When costing both
>> Advanced SIMD and SVE, we try to estimate the issue rate of the
>> original scalar code.  If the scalar code could issue more quickly
>> than the equivalent vector code, we increase the latency-based cost
>> of the vector code to account for that.
>>
>> The idea of the patch (and others) is that we can do the (1), (2),
>> (3) comparison above directly rather than indirectly.  We can also
>> use the issue information calculated while costing the scalar code,
>> rather than having to estimate the scalar issue information while
>> costing the vector code.
>>
>> >> This is needlessly indirect and complex.  It was a compromise due
>> >> to the code being added (too) late in the GCC 11 cycle, so that
>> >> target-independent changes weren't possible.
>> >>
>> >> The target-independent code already compares two candidate loop_vec_infos
>> >> side-by-side, so that information about A and B above are available
>> >> directly.  This patch creates a way for targets to hook into this
>> >> comparison.
>> >
>> > You mean it has both loop_vec_infos.  But as said I'm not sure it's a good
>> > idea to compare those side-by-side - will that not lead to _more_ special-casing
>> > since you need to have a working compare to the scalar variant as well
>> > to determine the vectorization threshold?
>>
>> A later patch allows the code to do the same comparison for vector
>> vs. scalar.  It makes the target code significantly simpler compared
>> to now, since add_stmt_cost only needs to consider the latency and
>> issue rate of the code it's actually supposed to be costing, rather than
>> trying also to estimate the issue rate of the scalar code and the issue
>> rate of the Advanced SIMD code.
>>
>> Thanks,
>> Richard
>>
>> >
>> >>
>> >> The AArch64 code can therefore hook into better_main_loop_than_p to
>> >> compare issue rates.  If the issue rate comparison isn't decisive,
>> >> the code can fall back to the normal latency-based comparison instead.
>> >>
>> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>> >>
>> >> Richard
>> >>
>> >>
>> >> gcc/
>> >>         * tree-vectorizer.h (vector_costs::better_main_loop_than_p)
>> >>         (vector_costs::better_epilogue_loop_than_p)
>> >>         (vector_costs::compare_inside_loop_cost)
>> >>         (vector_costs::compare_outside_loop_cost): Declare.
>> >>         * tree-vectorizer.c (vector_costs::better_main_loop_than_p)
>> >>         (vector_costs::better_epilogue_loop_than_p)
>> >>         (vector_costs::compare_inside_loop_cost)
>> >>         (vector_costs::compare_outside_loop_cost): New functions,
>> >>         containing code moved from...
>> >>         * tree-vect-loop.c (vect_better_loop_vinfo_p): ...here.
>> >> ---
>> >>  gcc/tree-vect-loop.c  | 142 ++---------------------------
>> >>  gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++
>> >>  gcc/tree-vectorizer.h |  17 ++++
>> >>  3 files changed, 226 insertions(+), 137 deletions(-)
>> >>
>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> >> index dd4a363fee5..c9ee2e15e35 100644
>> >> --- a/gcc/tree-vect-loop.c
>> >> +++ b/gcc/tree-vect-loop.c
>> >> @@ -2784,144 +2784,12 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
>> >>         return new_simdlen_p;
>> >>      }
>> >>
>> >> -  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
>> >> -  if (main_loop)
>> >> -    {
>> >> -      poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
>> >> -      unsigned HOST_WIDE_INT main_vf;
>> >> -      unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
>> >> -      /* If we can determine how many iterations are left for the epilogue
>> >> -        loop, that is if both the main loop's vectorization factor and number
>> >> -        of iterations are constant, then we use them to calculate the cost of
>> >> -        the epilogue loop together with a 'likely value' for the epilogues
>> >> -        vectorization factor.  Otherwise we use the main loop's vectorization
>> >> -        factor and the maximum poly value for the epilogue's.  If the target
>> >> -        has not provided with a sensible upper bound poly vectorization
>> >> -        factors are likely to be favored over constant ones.  */
>> >> -      if (main_poly_vf.is_constant (&main_vf)
>> >> -         && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
>> >> -       {
>> >> -         unsigned HOST_WIDE_INT niters
>> >> -           = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
>> >> -         HOST_WIDE_INT old_likely_vf
>> >> -           = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
>> >> -         HOST_WIDE_INT new_likely_vf
>> >> -           = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
>> >> -
>> >> -         /* If the epilogue is using partial vectors we account for the
>> >> -            partial iteration here too.  */
>> >> -         old_factor = niters / old_likely_vf;
>> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
>> >> -             && niters % old_likely_vf != 0)
>> >> -           old_factor++;
>> >> -
>> >> -         new_factor = niters / new_likely_vf;
>> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
>> >> -             && niters % new_likely_vf != 0)
>> >> -           new_factor++;
>> >> -       }
>> >> -      else
>> >> -       {
>> >> -         unsigned HOST_WIDE_INT main_vf_max
>> >> -           = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
>> >> -
>> >> -         old_factor = main_vf_max / estimated_poly_value (old_vf,
>> >> -                                                          POLY_VALUE_MAX);
>> >> -         new_factor = main_vf_max / estimated_poly_value (new_vf,
>> >> -                                                          POLY_VALUE_MAX);
>> >> -
>> >> -         /* If the loop is not using partial vectors then it will iterate one
>> >> -            time less than one that does.  It is safe to subtract one here,
>> >> -            because the main loop's vf is always at least 2x bigger than that
>> >> -            of an epilogue.  */
>> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
>> >> -           old_factor -= 1;
>> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
>> >> -           new_factor -= 1;
>> >> -       }
>> >> -
>> >> -      /* Compute the costs by multiplying the inside costs with the factor and
>> >> -        add the outside costs for a more complete picture.  The factor is the
>> >> -        amount of times we are expecting to iterate this epilogue.  */
>> >> -      old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor;
>> >> -      new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor;
>> >> -      old_cost += old_loop_vinfo->vector_costs->outside_cost ();
>> >> -      new_cost += new_loop_vinfo->vector_costs->outside_cost ();
>> >> -      return new_cost < old_cost;
>> >> -    }
>> >> -
>> >> -  /* Limit the VFs to what is likely to be the maximum number of iterations,
>> >> -     to handle cases in which at least one loop_vinfo is fully-masked.  */
>> >> -  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
>> >> -  if (estimated_max_niter != -1)
>> >> -    {
>> >> -      if (known_le (estimated_max_niter, new_vf))
>> >> -       new_vf = estimated_max_niter;
>> >> -      if (known_le (estimated_max_niter, old_vf))
>> >> -       old_vf = estimated_max_niter;
>> >> -    }
>> >> -
>> >> -  /* Check whether the (fractional) cost per scalar iteration is lower
>> >> -     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
>> >> -  poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () * old_vf;
>> >> -  poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () * new_vf;
>> >> -
>> >> -  HOST_WIDE_INT est_rel_new_min
>> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MIN);
>> >> -  HOST_WIDE_INT est_rel_new_max
>> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MAX);
>> >> -
>> >> -  HOST_WIDE_INT est_rel_old_min
>> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MIN);
>> >> -  HOST_WIDE_INT est_rel_old_max
>> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MAX);
>> >> -
>> >> -  /* Check first if we can make out an unambigous total order from the minimum
>> >> -     and maximum estimates.  */
>> >> -  if (est_rel_new_min < est_rel_old_min
>> >> -      && est_rel_new_max < est_rel_old_max)
>> >> -    return true;
>> >> -  else if (est_rel_old_min < est_rel_new_min
>> >> -          && est_rel_old_max < est_rel_new_max)
>> >> -    return false;
>> >> -  /* When old_loop_vinfo uses a variable vectorization factor,
>> >> -     we know that it has a lower cost for at least one runtime VF.
>> >> -     However, we don't know how likely that VF is.
>> >> -
>> >> -     One option would be to compare the costs for the estimated VFs.
>> >> -     The problem is that that can put too much pressure on the cost
>> >> -     model.  E.g. if the estimated VF is also the lowest possible VF,
>> >> -     and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
>> >> -     for the estimated VF, we'd then choose new_loop_vinfo even
>> >> -     though (a) new_loop_vinfo might not actually be better than
>> >> -     old_loop_vinfo for that VF and (b) it would be significantly
>> >> -     worse at larger VFs.
>> >> -
>> >> -     Here we go for a hacky compromise: pick new_loop_vinfo if it is
>> >> -     no more expensive than old_loop_vinfo even after doubling the
>> >> -     estimated old_loop_vinfo VF.  For all but trivial loops, this
>> >> -     ensures that we only pick new_loop_vinfo if it is significantly
>> >> -     better than old_loop_vinfo at the estimated VF.  */
>> >> -
>> >> -  if (est_rel_old_min != est_rel_new_min
>> >> -      || est_rel_old_max != est_rel_new_max)
>> >> -    {
>> >> -      HOST_WIDE_INT est_rel_new_likely
>> >> -       = estimated_poly_value (rel_new, POLY_VALUE_LIKELY);
>> >> -      HOST_WIDE_INT est_rel_old_likely
>> >> -       = estimated_poly_value (rel_old, POLY_VALUE_LIKELY);
>> >> -
>> >> -      return est_rel_new_likely * 2 <= est_rel_old_likely;
>> >> -    }
>> >> -
>> >> -  /* If there's nothing to choose between the loop bodies, see whether
>> >> -     there's a difference in the prologue and epilogue costs.  */
>> >> -  auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost ();
>> >> -  auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost ();
>> >> -  if (new_outside_cost != old_outside_cost)
>> >> -    return new_outside_cost < old_outside_cost;
>> >> +  const auto *old_costs = old_loop_vinfo->vector_costs;
>> >> +  const auto *new_costs = new_loop_vinfo->vector_costs;
>> >> +  if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo))
>> >> +    return new_costs->better_epilogue_loop_than_p (old_costs, main_loop);
>> >>
>> >> -  return false;
>> >> +  return new_costs->better_main_loop_than_p (old_costs);
>> >>  }
>> >>
>> >>  /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
>> >> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
>> >> index 9ef76ce654b..dcbb2a3f13a 100644
>> >> --- a/gcc/tree-vectorizer.c
>> >> +++ b/gcc/tree-vectorizer.c
>> >> @@ -1744,3 +1744,207 @@ vector_costs::adjust_cost_for_freq (stmt_vec_info stmt_info,
>> >>      }
>> >>    return cost;
>> >>  }
>> >> +
>> >> +/* See the comment above the declaration for details.  */
>> >> +
>> >> +bool
>> >> +vector_costs::better_main_loop_than_p (const vector_costs *other) const
>> >> +{
>> >> +  int diff = compare_inside_loop_cost (other);
>> >> +  if (diff != 0)
>> >> +    return diff < 0;
>> >> +
>> >> +  /* If there's nothing to choose between the loop bodies, see whether
>> >> +     there's a difference in the prologue and epilogue costs.  */
>> >> +  diff = compare_outside_loop_cost (other);
>> >> +  if (diff != 0)
>> >> +    return diff < 0;
>> >> +
>> >> +  return false;
>> >> +}
>> >> +
>> >> +
>> >> +/* See the comment above the declaration for details.  */
>> >> +
>> >> +bool
>> >> +vector_costs::better_epilogue_loop_than_p (const vector_costs *other,
>> >> +                                          loop_vec_info main_loop) const
>> >> +{
>> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
>> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
>> >> +
>> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
>> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
>> >> +
>> >> +  poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
>> >> +  unsigned HOST_WIDE_INT main_vf;
>> >> +  unsigned HOST_WIDE_INT other_factor, this_factor, other_cost, this_cost;
>> >> +  /* If we can determine how many iterations are left for the epilogue
>> >> +     loop, that is if both the main loop's vectorization factor and number
>> >> +     of iterations are constant, then we use them to calculate the cost of
>> >> +     the epilogue loop together with a 'likely value' for the epilogues
>> >> +     vectorization factor.  Otherwise we use the main loop's vectorization
>> >> +     factor and the maximum poly value for the epilogue's.  If the target
>> >> +     has not provided with a sensible upper bound poly vectorization
>> >> +     factors are likely to be favored over constant ones.  */
>> >> +  if (main_poly_vf.is_constant (&main_vf)
>> >> +      && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
>> >> +    {
>> >> +      unsigned HOST_WIDE_INT niters
>> >> +       = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
>> >> +      HOST_WIDE_INT other_likely_vf
>> >> +       = estimated_poly_value (other_vf, POLY_VALUE_LIKELY);
>> >> +      HOST_WIDE_INT this_likely_vf
>> >> +       = estimated_poly_value (this_vf, POLY_VALUE_LIKELY);
>> >> +
>> >> +      /* If the epilogue is using partial vectors we account for the
>> >> +        partial iteration here too.  */
>> >> +      other_factor = niters / other_likely_vf;
>> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)
>> >> +         && niters % other_likely_vf != 0)
>> >> +       other_factor++;
>> >> +
>> >> +      this_factor = niters / this_likely_vf;
>> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)
>> >> +         && niters % this_likely_vf != 0)
>> >> +       this_factor++;
>> >> +    }
>> >> +  else
>> >> +    {
>> >> +      unsigned HOST_WIDE_INT main_vf_max
>> >> +       = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
>> >> +
>> >> +      other_factor = main_vf_max / estimated_poly_value (other_vf,
>> >> +                                                      POLY_VALUE_MAX);
>> >> +      this_factor = main_vf_max / estimated_poly_value (this_vf,
>> >> +                                                      POLY_VALUE_MAX);
>> >> +
>> >> +      /* If the loop is not using partial vectors then it will iterate one
>> >> +        time less than one that does.  It is safe to subtract one here,
>> >> +        because the main loop's vf is always at least 2x bigger than that
>> >> +        of an epilogue.  */
>> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo))
>> >> +       other_factor -= 1;
>> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo))
>> >> +       this_factor -= 1;
>> >> +    }
>> >> +
>> >> +  /* Compute the costs by multiplying the inside costs with the factor and
>> >> +     add the outside costs for a more complete picture.  The factor is the
>> >> +     amount of times we are expecting to iterate this epilogue.  */
>> >> +  other_cost = other->body_cost () * other_factor;
>> >> +  this_cost = this->body_cost () * this_factor;
>> >> +  other_cost += other->outside_cost ();
>> >> +  this_cost += this->outside_cost ();
>> >> +  return this_cost < other_cost;
>> >> +}
>> >> +
>> >> +/* A <=>-style subroutine of better_main_loop_than_p.  Check whether we can
>> >> +   determine the return value of better_main_loop_than_p by comparing the
>> >> +   inside (loop body) costs of THIS and OTHER.  Return:
>> >> +
>> >> +   * -1 if better_main_loop_than_p should return true.
>> >> +   * 1 if better_main_loop_than_p should return false.
>> >> +   * 0 if we can't decide.  */
>> >> +
>> >> +int
>> >> +vector_costs::compare_inside_loop_cost (const vector_costs *other) const
>> >> +{
>> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
>> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
>> >> +
>> >> +  struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo);
>> >> +  gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop);
>> >> +
>> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
>> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
>> >> +
>> >> +  /* Limit the VFs to what is likely to be the maximum number of iterations,
>> >> +     to handle cases in which at least one loop_vinfo is fully-masked.  */
>> >> +  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
>> >> +  if (estimated_max_niter != -1)
>> >> +    {
>> >> +      if (known_le (estimated_max_niter, this_vf))
>> >> +       this_vf = estimated_max_niter;
>> >> +      if (known_le (estimated_max_niter, other_vf))
>> >> +       other_vf = estimated_max_niter;
>> >> +    }
>> >> +
>> >> +  /* Check whether the (fractional) cost per scalar iteration is lower or
>> >> +     higher: this_inside_cost / this_vf vs. other_inside_cost / other_vf.  */
>> >> +  poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () * other_vf;
>> >> +  poly_int64 rel_other
>> >> +    = other_loop_vinfo->vector_costs->body_cost () * this_vf;
>> >> +
>> >> +  HOST_WIDE_INT est_rel_this_min
>> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MIN);
>> >> +  HOST_WIDE_INT est_rel_this_max
>> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MAX);
>> >> +
>> >> +  HOST_WIDE_INT est_rel_other_min
>> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MIN);
>> >> +  HOST_WIDE_INT est_rel_other_max
>> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MAX);
>> >> +
>> >> +  /* Check first if we can make out an unambigous total order from the minimum
>> >> +     and maximum estimates.  */
>> >> +  if (est_rel_this_min < est_rel_other_min
>> >> +      && est_rel_this_max < est_rel_other_max)
>> >> +    return -1;
>> >> +
>> >> +  if (est_rel_other_min < est_rel_this_min
>> >> +      && est_rel_other_max < est_rel_this_max)
>> >> +    return 1;
>> >> +
>> >> +  /* When other_loop_vinfo uses a variable vectorization factor,
>> >> +     we know that it has a lower cost for at least one runtime VF.
>> >> +     However, we don't know how likely that VF is.
>> >> +
>> >> +     One option would be to compare the costs for the estimated VFs.
>> >> +     The problem is that that can put too much pressure on the cost
>> >> +     model.  E.g. if the estimated VF is also the lowest possible VF,
>> >> +     and if other_loop_vinfo is 1 unit worse than this_loop_vinfo
>> >> +     for the estimated VF, we'd then choose this_loop_vinfo even
>> >> +     though (a) this_loop_vinfo might not actually be better than
>> >> +     other_loop_vinfo for that VF and (b) it would be significantly
>> >> +     worse at larger VFs.
>> >> +
>> >> +     Here we go for a hacky compromise: pick this_loop_vinfo if it is
>> >> +     no more expensive than other_loop_vinfo even after doubling the
>> >> +     estimated other_loop_vinfo VF.  For all but trivial loops, this
>> >> +     ensures that we only pick this_loop_vinfo if it is significantly
>> >> +     better than other_loop_vinfo at the estimated VF.  */
>> >> +  if (est_rel_other_min != est_rel_this_min
>> >> +      || est_rel_other_max != est_rel_this_max)
>> >> +    {
>> >> +      HOST_WIDE_INT est_rel_this_likely
>> >> +       = estimated_poly_value (rel_this, POLY_VALUE_LIKELY);
>> >> +      HOST_WIDE_INT est_rel_other_likely
>> >> +       = estimated_poly_value (rel_other, POLY_VALUE_LIKELY);
>> >> +
>> >> +      return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1;
>> >> +    }
>> >> +
>> >> +  return 0;
>> >> +}
>> >> +
>> >> +/* A <=>-style subroutine of better_main_loop_than_p, used when there is
>> >> +   nothing to choose between the inside (loop body) costs of THIS and OTHER.
>> >> +   Check whether we can determine the return value of better_main_loop_than_p
>> >> +   by comparing the outside (prologue and epilogue) costs of THIS and OTHER.
>> >> +   Return:
>> >> +
>> >> +   * -1 if better_main_loop_than_p should return true.
>> >> +   * 1 if better_main_loop_than_p should return false.
>> >> +   * 0 if we can't decide.  */
>> >> +
>> >> +int
>> >> +vector_costs::compare_outside_loop_cost (const vector_costs *other) const
>> >> +{
>> >> +  auto this_outside_cost = this->outside_cost ();
>> >> +  auto other_outside_cost = other->outside_cost ();
>> >> +  if (this_outside_cost != other_outside_cost)
>> >> +    return this_outside_cost < other_outside_cost ? -1 : 1;
>> >> +
>> >> +  return 0;
>> >> +}
>> >> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>> >> index 87d3f211a2e..0e3aad590e8 100644
>> >> --- a/gcc/tree-vectorizer.h
>> >> +++ b/gcc/tree-vectorizer.h
>> >> @@ -1419,6 +1419,21 @@ public:
>> >>       read back using the functions below.  */
>> >>    virtual void finish_cost ();
>> >>
>> >> +  /* The costs in THIS and OTHER both describe ways of vectorizing
>> >> +     a main loop.  Return true if the costs described by THIS are
>> >> +     cheaper than the costs described by OTHER.  Return false if any
>> >> +     of the following are true:
>> >> +
>> >> +     - THIS and OTHER are of equal cost
>> >> +     - OTHER is better than THIS
>> >> +     - we can't be sure about the relative costs of THIS and OTHER.  */
>> >> +  virtual bool better_main_loop_than_p (const vector_costs *other) const;
>> >> +
>> >> +  /* Likewise, but the costs in THIS and OTHER both describe ways of
>> >> +     vectorizing an epilogue loop of MAIN_LOOP.  */
>> >> +  virtual bool better_epilogue_loop_than_p (const vector_costs *other,
>> >> +                                           loop_vec_info main_loop) const;
>> >> +
>> >>    unsigned int prologue_cost () const;
>> >>    unsigned int body_cost () const;
>> >>    unsigned int epilogue_cost () const;
>> >> @@ -1429,6 +1444,8 @@ protected:
>> >>                                  unsigned int);
>> >>    unsigned int adjust_cost_for_freq (stmt_vec_info, vect_cost_model_location,
>> >>                                      unsigned int);
>> >> +  int compare_inside_loop_cost (const vector_costs *) const;
>> >> +  int compare_outside_loop_cost (const vector_costs *) const;
>> >>
>> >>    /* The region of code that we're considering vectorizing.  */
>> >>    vec_info *m_vinfo;
>> >> --
>> >> 2.25.1
>> >>
