On Fri, 14 Nov 2025, Richard Sandiford wrote:

> Richard Biener <[email protected]> writes:
> > The following ensures we still apply a runtime costmodel check even
> > when not peeling for niters.  This is relevant for fully masked
> > loop vectorization and in this case triggers versioning of the loop.
> > Given that the condition on whether we apply epilog peeling (not to
> > mention the condition determining whether to include a skip-vector
> > check there, which is where a cost model check is eventually placed)
> > is mightily complicated, this is an RFC; in the end we should have
> > made sure LOOP_REQUIRES_VERSIONING is true and added a separate
> > LOOP_REQUIRES_VERSIONING_FOR_COST.
> >
> > I think I have seen the same bug even without masking, but in those
> > cases vectorization tends to be a static profitability decision as we
> > know no epilog is needed.  For x86 the case of main loop masking isn't
> > too important right now (besides having this bug); for fixed-length
> > vectorization it's much better to rely on masked epilogs.  In the
> > testcase there is only a single masked vector iteration, and for a low
> > number of scalar iterations that's not profitable, like in this case
> > with an "emulated" fold-left reduction.
> >
> > The fact that we set check_profitability but then do not check
> > profitability seems to be a bug though.
> >
> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >
> > Comments welcome, in particular on whether it was a conscious choice
> > to never apply a runtime check for fully masked vector loops
> > when otherwise no versioning or epilog peeling is required?
> 
> It kind-of was for SVE, in that if we use SVE well, the vector loop
> should be better for all but degenerate cases.  (And if we don't use
> SVE well, we should fix that.)  Degenerate cases would include 1 or
> perhaps 2 scalar iterations, which we're sacrificing to some extent by
> vectorising in any case.  Also, adding a check for 1 or 2 iterations
> could be counterproductive if the branch turns out to be hard to predict.
> 
> This feeds into the Amdahl's Law argument around SVE: the idea was to
> vectorise as much as possible, without caring whether the code was
> particularly hot.  In the common case, one scalar instruction would
> become one vector instruction, with little additional overhead.
> It would somewhat go against that to version most loops, due to
> the code size impact if nothing else.
> 
> But (a) the decision not to version dates from before any actual SVE
> implementations had been developed, let alone added to the cost model;
> and (b) it was when the vectoriser still used a single version check for
> all conditions, with no shortcutting of things like alias detection.
> (b) meant that the runtime check would often be very imprecise.
> 
> I think enough has changed since then that the historical reasons
> are no longer relevant.

Coming back to this only now.  OK, so I think targets would still be
able to opt in to not doing the versioning if they arrange for the
costs to be comparable to those of a single scalar iteration.  We
could at least make sure to "round downwards" here.

I'll have to resolve some ICEs that came up during testing, likely
because I hacked the extra versioning in instead of properly
tracking it.  So I'll rework this a bit but also postpone it to
next stage1 (since on x86 we never run into this issue by default).

Richard.

> Richard
> 
> > Thanks,
> > Richard.
> >
> >     PR tree-optimization/110979
> >     * tree-vectorizer.h (vect_apply_runtime_profitability_check_p):
> >     Adjust for partial vectors, only require a check for th > 1.
> >     (vect_needs_epilog_peeling): Declare.
> >     * tree-vect-loop.cc (vect_analyze_loop_2): Make sure to set
> >     LOOP_VINFO_VERSIONING_THRESHOLD when we'll version.
> >     (vect_transform_loop): Also version when we need to
> >     check profitability but are not applying epilog peeling.
> >     * tree-vect-loop-manip.cc (vect_needs_epilog_peeling): New
> >     function, split out from ...
> >     (vect_do_peeling): ... here.
> >     (vect_loop_versioning): Always dump when we apply versioning.
> >
> >     * gcc.dg/vect/costmodel/x86_64/costmodel-pr110979.c: New testcase.
> > ---
> >  .../costmodel/x86_64/costmodel-pr110979.c     | 14 +++++
> >  gcc/tree-vect-loop-manip.cc                   | 63 ++++++++++++-------
> >  gcc/tree-vect-loop.cc                         | 11 ++--
> >  gcc/tree-vectorizer.h                         |  5 +-
> >  4 files changed, 67 insertions(+), 26 deletions(-)
> >  create mode 100644 
> > gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr110979.c
> >
> > diff --git 
> > a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr110979.c 
> > b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr110979.c
> > new file mode 100644
> > index 00000000000..99830dcb1dc
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr110979.c
> > @@ -0,0 +1,14 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target avx512f } */
> > +/* { dg-additional-options "-O3 -mavx512f --param 
> > vect-partial-vector-usage=2" } */
> > +
> > +float
> > +foo3 (float* __restrict a, int n)
> > +{
> > +  float sum = 0.0f;
> > +  for (int i = 0; i != n; i++)
> > +    sum += a[i];
> > +  return sum;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "applying loop versioning to loop" "vect" } 
> > } */
> > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > index 5862098803a..1eb2faeb74e 100644
> > --- a/gcc/tree-vect-loop-manip.cc
> > +++ b/gcc/tree-vect-loop-manip.cc
> > @@ -3048,6 +3048,38 @@ vect_get_main_loop_result (loop_vec_info loop_vinfo, 
> > tree main_loop_value,
> >    return phi_result;
> >  }
> >  
> > +/* Returns whether we need to perform epilog peeling and initializes
> > +   *EPILOG_PEELING and *BOUND_EPILOG if so.  */
> > +
> > +static bool
> > +vect_needs_epilog_peeling (loop_vec_info loop_vinfo,
> > +                      bool *epilog_peeling, poly_uint64 *bound_epilog)
> > +{
> > +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > +  *bound_epilog = 0;
> > +  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> > +      && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > +    *bound_epilog += vf - 1;
> > +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > +    *bound_epilog += 1;
> > +
> > +  /* For early breaks the scalar loop needs to execute at most VF times
> > +     to find the element that caused the break.  */
> > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > +    *bound_epilog = vf;
> > +
> > +  *epilog_peeling = maybe_ne (*bound_epilog, 0U);
> > +  return *epilog_peeling;
> > +}
> > +
> > +bool
> > +vect_needs_epilog_peeling (loop_vec_info loop_vinfo)
> > +{
> > +  bool epilog_peeling;
> > +  poly_uint64 bound_epilog;
> > +  return vect_needs_epilog_peeling (loop_vinfo, &epilog_peeling, 
> > &bound_epilog);
> > +}
> > +
> >  /* Function vect_do_peeling.
> >  
> >     Input:
> > @@ -3153,30 +3185,19 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> > niters, tree nitersm1,
> >    basic_block guard_bb, guard_to;
> >    profile_probability prob_prolog, prob_vector, prob_epilog;
> >    int estimated_vf;
> > -  int prolog_peeling = 0;
> >    bool vect_epilogues = loop_vinfo->epilogue_vinfo != NULL;
> > +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> >  
> > +  int prolog_peeling = 0;
> >    if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
> >      prolog_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> >  
> > -  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > -  poly_uint64 bound_epilog = 0;
> > -  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> > -      && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > -    bound_epilog += vf - 1;
> > -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > -    bound_epilog += 1;
> > -
> > -  /* For early breaks the scalar loop needs to execute at most VF times
> > -     to find the element that caused the break.  */
> > -  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > -    bound_epilog = vf;
> > -
> > -  bool epilog_peeling = maybe_ne (bound_epilog, 0U);
> > -  poly_uint64 bound_scalar = bound_epilog;
> > -
> > -  if (!prolog_peeling && !epilog_peeling)
> > +  bool epilog_peeling;
> > +  poly_uint64 bound_epilog;
> > +  if (!vect_needs_epilog_peeling (loop_vinfo, &epilog_peeling, 
> > &bound_epilog)
> > +      && prolog_peeling == 0)
> >      return NULL;
> > +  poly_uint64 bound_scalar = bound_epilog;
> >  
> >    /* Before doing any peeling make sure to reset debug binds outside of
> >       the loop refering to defs not in LC SSA.  */
> > @@ -4366,10 +4387,10 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
> >      }
> >    else
> >      {
> > -      if (loop_to_version != loop
> > -     && dump_enabled_p ())
> > +      if (dump_enabled_p ())
> >     dump_printf_loc (MSG_NOTE, vect_location,
> > -                    "applying loop versioning to outer loop %d\n",
> > +                    "applying loop versioning to %sloop %d\n",
> > +                    loop_to_version != loop ? "outer " : "",
> >                      loop_to_version->num);
> >  
> >        unsigned orig_pe_idx = loop_preheader_edge (loop)->dest_idx;
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index b27e762c69c..62542d400f4 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -2528,7 +2528,9 @@ start_over:
> >       analyzing the epilogue.  For instance, checks for alias versioning 
> > will be
> >       skipped when dealing with epilogues as we assume we already checked 
> > them
> >       for the main loop.  So instead we always check the 'orig_loop_vinfo'. 
> >  */
> > -  if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))
> > +  if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)
> > +      || (vect_apply_runtime_profitability_check_p (orig_loop_vinfo)
> > +     && !vect_needs_epilog_peeling (orig_loop_vinfo)))
> >      {
> >        poly_uint64 niters_th = 0;
> >        unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
> > @@ -2553,8 +2555,8 @@ start_over:
> >        if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> >     niters_th += 1;
> >  
> > -      /*  Use the same condition as vect_transform_loop to decide when to 
> > use
> > -     the cost to determine a versioning threshold.  */
> > +      /* Use the same condition as vect_transform_loop to decide when to 
> > use
> > +    the cost to determine a versioning threshold.  */
> >        if (vect_apply_runtime_profitability_check_p (loop_vinfo)
> >       && ordered_p (th, niters_th))
> >     niters_th = ordered_max (poly_uint64 (th), niters_th);
> > @@ -11093,7 +11095,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, 
> > gimple *loop_vectorized_call)
> >    /* Version the loop first, if required, so the profitability check
> >       comes first.  */
> >  
> > -  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
> > +  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > +      || (check_profitability && !vect_needs_epilog_peeling (loop_vinfo)))
> >      {
> >        class loop *sloop
> >     = vect_loop_versioning (loop_vinfo, loop_vectorized_call);
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index 9c349e8ffde..83689246340 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -2396,7 +2396,9 @@ vect_apply_runtime_profitability_check_p 
> > (loop_vec_info loop_vinfo)
> >  {
> >    unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
> >    return (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > -     && th >= vect_vf_for_cost (loop_vinfo));
> > +     && th > 1
> > +     && (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> > +         || th >= vect_vf_for_cost (loop_vinfo)));
> >  }
> >  
> >  /* Return true if CODE is a lane-reducing opcode.  */
> > @@ -2468,6 +2470,7 @@ class loop *slpeel_tree_duplicate_loop_to_edge_cfg 
> > (class loop *, edge,
> >                                                 edge, edge *, bool = true,
> >                                                 vec<basic_block> * = NULL);
> >  class loop *vect_loop_versioning (loop_vec_info, gimple *);
> > +extern bool vect_needs_epilog_peeling (loop_vec_info);
> >  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
> >                                 tree *, tree *, tree *, int, bool, bool,
> >                                 tree *);
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
