[Bug target/110979] Miss-optimization for O2 fully masked loop on floating point reduction.

rguenth at gcc dot gnu.org via Gcc-bugs Wed, 12 Nov 2025 05:31:52 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979


--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'll note that without partial vectors we also generate

  vect__4.8_29 = MEM <vector(16) float> [(float *)vectp_a.6_27];
  stmp_sum_9.9_30 = BIT_FIELD_REF <vect__4.8_29, 32, 0>;
  stmp_sum_9.9_31 = sum_13 + stmp_sum_9.9_30;
  stmp_sum_9.9_32 = BIT_FIELD_REF <vect__4.8_29, 32, 32>;
  stmp_sum_9.9_33 = stmp_sum_9.9_31 + stmp_sum_9.9_32;
  stmp_sum_9.9_34 = BIT_FIELD_REF <vect__4.8_29, 32, 64>;
  stmp_sum_9.9_35 = stmp_sum_9.9_33 + stmp_sum_9.9_34;
...

initially, but later forwprop "saves" us here, changing the full vector
decomposition to scalar loads.  That's of course not possible for the
masked load case.

Costing-wise, it's the usual issue of high load cost on the scalar side
offsetting loads of crap on the vector side.

When we change the scalar loop to run 'n' times instead of constant 100 we
compute

  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 14
t2.c:6:23: note:    Runtime profitability threshold = 14
t2.c:6:23: note:    Static estimate profitability threshold = 32
t2.c:6:23: note:  no need for a runtime choice between the scalar and vector loops

while without masking the cheap cost model limit hits.
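
For reference, the two loop shapes being compared might look like the sketch
below.  This is an assumed reconstruction, since t2.c itself is not quoted
in this comment; the function names sum_const and sum_n are made up for
illustration.

```c
/* Hypothetical reconstruction of the reduction discussed above; the real
   t2.c is not shown in this comment.  */

/* Trip count is the compile-time constant 100, so niters is known.  */
float
sum_const (const float *a)
{
  float sum = 0.0f;
  for (int i = 0; i < 100; i++)
    sum += a[i];
  return sum;
}

/* Trip count 'n' is only known at run time, so a profitability
   threshold can only be enforced by a runtime check.  */
float
sum_n (const float *a, int n)
{
  float sum = 0.0f;
  for (int i = 0; i < n; i++)
    sum += a[i];
  return sum;
}
```

Both are in-order float reductions; only the second leaves the iteration
count unknown at compile time.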

This is because

  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
      && min_profitable_iters < (assumed_vf + peel_iters_prologue))
    /* We want the vectorized loop to execute at least once.  */
    min_profitable_iters = assumed_vf + peel_iters_prologue;
  else if (min_profitable_iters < peel_iters_prologue)
    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
       vectorized loop executes at least once.  */
    min_profitable_iters = peel_iters_prologue;

where for non-masked vectorization the

  Calculated minimum iters for profitability: 2

is turned into the runtime threshold of 16 while for masked vectorization the

  Calculated minimum iters for profitability: 14

is unchanged.  But we do not apply a runtime profitability check because
there's no need for versioning or peeling in this testcase (with masked
vectorization) and vect_apply_runtime_profitability_check_p does not
consider partial vectorization at all.

/* Return true if LOOP_VINFO requires a runtime check for whether the
   vector loop is profitable.  */

inline bool
vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo)
{ 
  unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
  return (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
          && th >= vect_vf_for_cost (loop_vinfo));
}
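
The interaction of the clamp and the predicate can be modelled standalone.
The following is only a sketch: struct loop_model and both function names
are made up, the cost-model threshold is assumed to equal the clamped
min_profitable_iters, and vect_vf_for_cost is assumed to equal assumed_vf.

```c
#include <stdbool.h>

/* Simplified stand-in for the loop_vec_info fields involved here.  */
struct loop_model
{
  bool using_partial_vectors;
  bool niters_known;
  unsigned assumed_vf;
  unsigned peel_iters_prologue;
};

/* Mirrors the quoted clamp of min_profitable_iters.  */
unsigned
clamp_min_profitable_iters (const struct loop_model *l, unsigned min_iters)
{
  if (!l->using_partial_vectors
      && min_iters < l->assumed_vf + l->peel_iters_prologue)
    return l->assumed_vf + l->peel_iters_prologue;
  else if (min_iters < l->peel_iters_prologue)
    return l->peel_iters_prologue;
  return min_iters;
}

/* Mirrors vect_apply_runtime_profitability_check_p, with the threshold
   passed in and vect_vf_for_cost assumed equal to assumed_vf.  */
bool
apply_runtime_check_p (const struct loop_model *l, unsigned th)
{
  return !l->niters_known && th >= l->assumed_vf;
}
```

With a VF of 16 and no prologue peeling, the non-masked case clamps 2 up
to 16 and the predicate is true, while the masked case keeps 14 and the
predicate is false.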

I'll note that we still fail to version the loop for costing when
versioning/peeling isn't necessary, so

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 9c349e8ffde..eb3db8d7c5a 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2396,7 +2396,8 @@ vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo)
 {
   unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
   return (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-         && th >= vect_vf_for_cost (loop_vinfo));
+         && ((LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) && th > 1)
+             || th >= vect_vf_for_cost (loop_vinfo)));
 }

 /* Return true if CODE is a lane-reducing opcode.  */

is not enough to apply a runtime check, but with -O2 we'd reject vectorization
via

t2.c:6:23: missed:  one iteration of the vector loop would be more expensive than the equivalent number of iterations of the scalar loop
t2.c:6:23: missed:  Loop costings not worthwhile.

even though we'd not actually apply the versioning...
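
To make the effect of the changed predicate concrete, here is a standalone
sketch of the condition before and after the diff.  The function names and
the flattened parameter lists are made up for illustration; only the
boolean expressions follow the quoted code.

```c
#include <stdbool.h>

/* Predicate as in the current tree-vectorizer.h.  PARTIAL is unused and
   only kept so both functions share one signature.  */
bool
check_original (bool niters_known, bool partial, unsigned th, unsigned vf)
{
  (void) partial;
  return !niters_known && th >= vf;
}

/* Predicate as changed by the diff above: with partial vectors any
   threshold above 1 asks for a runtime check.  */
bool
check_patched (bool niters_known, bool partial, unsigned th, unsigned vf)
{
  return !niters_known && ((partial && th > 1) || th >= vf);
}
```

For the masked case above (th = 14, VF = 16, niters unknown) the original
predicate is false and the changed one is true.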

The case where we apply the profitability check w/o versioning is when
peeling for niters only, and there only as part of the 'skip_vector'
condition, which is enabled with either epilog vectorization or
!LOOP_VINFO_USE_VERSIONING_WITHOUT_PEELING.

I have a patch but I think the whole setup needs some TLC (we should have
LOOP_REQUIRES_VERSIONING for the cost model check when no peeling for
niters is done).  The actual condition for that is quite complicated
though.
