[Bug target/110979] Miss-optimization for O2 fully masked loop on floating point reduction.

rguenth at gcc dot gnu.org via Gcc-bugs Fri, 11 Aug 2023 06:32:23 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
For this particular costing there's also the issue that we perform costing at
vectorizable_reduction time, when we don't yet know whether we will end up
using partial vectors.  We try to apply the costs of using partial vectors
in vect_estimate_min_profitable_iters, but we have no good way to account
for extra costs that arise because some operations expand differently with
partial vectors than without them.  The only fix would be to separate the
costing of operations from the analysis phase, or alternatively to record
multiple cost variants during analysis and pick the correct one later.
I think the former, separating costing from analysis, might be the better
way in the end.

Note that we currently cost

_4 + sum_13 8 times vec_to_scalar costs 64 in body
_4 + sum_13 8 times scalar_stmt costs 96 in body
*_3 1 times unaligned_load (misalign -1) costs 12 in body
t.c:9:21: note:  operating on partial vectors.
<unknown> 2 times vector_stmt costs 8 in prologue
<unknown> 2 times vector_stmt costs 8 in body
t.c:9:21: note:  Cost model analysis:
  Vector inside of loop cost: 180
  Vector prologue cost: 8
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 8
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 8
t.c:9:21: note:    Runtime profitability threshold = 8
t.c:9:21: note:    Static estimate profitability threshold = 8
t.c:9:21: note:  ***** Analysis succeeded with vector mode V8DF

The vector body plus overhead (180 + 8 = 188) is thus cheaper than VF scalar
iterations (8 * 24 = 192), but that assumes we'd actually run a full round
of VF scalar iterations!
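
As a rough illustration of that assumption, the sketch below plugs the
figures from the dump above into a simplified cost model.  The model is an
assumption made for illustration, not the formula GCC uses: n scalar
iterations cost "scalar outside + n * scalar iteration", and a fully masked
loop pays the full vector body cost for every started vector iteration.
The function names are hypothetical.

```python
import math

# Figures copied from the first cost-model dump above.
VF = 8               # lanes in V8DF
VEC_INSIDE = 180     # "Vector inside of loop cost"
VEC_OUTSIDE = 8      # "Vector outside cost"
SCALAR_ITER = 24     # "Scalar iteration cost"
SCALAR_OUTSIDE = 0   # "Scalar outside cost"

def scalar_cost(n):
    """Modelled cost of n scalar iterations."""
    return SCALAR_OUTSIDE + SCALAR_ITER * n

def masked_vector_cost(n):
    """Modelled cost of a fully masked loop: each started vector
    iteration pays the whole body cost, however few lanes are active."""
    return VEC_OUTSIDE + VEC_INSIDE * math.ceil(n / VF)

def first_profitable(limit=64):
    """Smallest n at which the modelled vector loop is strictly cheaper."""
    for n in range(1, limit + 1):
        if masked_vector_cost(n) < scalar_cost(n):
            return n
    return None

# Print n, scalar cost, masked vector cost for a few trip counts.
for n in (1, 7, 8, 9, 15, 16):
    print(n, scalar_cost(n), masked_vector_cost(n))
```

Under this model the vector loop first wins at n = 8 (188 vs. 192), which
matches the reported threshold, but it loses again for n = 9 through 15,
where a second, mostly inactive vector iteration is paid in full.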

If we added 7 times vec_to_scalar + scalar_stmt as epilogue cost,
we'd raise the requirement considerably, because the difference between
scalar (192) and vector (180) is already quite small.
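
To put numbers on this, the sketch below derives per-operation costs from
the dump above by dividing each total by its count of 8.  It assumes the
intended reading is 7 * (vec_to_scalar + scalar_stmt); the variable names
are hypothetical.

```python
# Figures copied from the first cost-model dump above.
VF = 8
VEC_TO_SCALAR_TOTAL = 64   # "8 times vec_to_scalar costs 64 in body"
SCALAR_STMT_TOTAL = 96     # "8 times scalar_stmt costs 96 in body"
SCALAR_ITER = 24           # "Scalar iteration cost"
VEC_INSIDE = 180           # "Vector inside of loop cost"

per_extract = VEC_TO_SCALAR_TOTAL // VF   # cost of one vec_to_scalar
per_add = SCALAR_STMT_TOTAL // VF         # cost of one scalar_stmt

# Assumed reading: 7 * (vec_to_scalar + scalar_stmt) charged to the epilogue.
extra_epilogue = (VF - 1) * (per_extract + per_add)

# Margin by which the vector body beats VF scalar iterations.
margin = SCALAR_ITER * VF - VEC_INSIDE

print("extra epilogue cost:", extra_epilogue)
print("vector body margin:", margin)
```

With these inputs the extra epilogue cost comes to 140 against a margin of
only 12, so a single vector iteration could no longer pay for itself.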

I wonder whether we need to adjust our formula for static profitability
to account for partial vectors.

When using a variable upper loop bound we still see

t.c:9:21: note:  Cost model analysis:
  Vector inside of loop cost: 180
  Vector prologue cost: 8
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 32
  Vector outside cost: 8
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
t.c:9:21: note:    Runtime profitability threshold = 7
t.c:9:21: note:    Static estimate profitability threshold = 32
t.c:9:21: note:  no need for a runtime choice between the scalar and vector loops
t.c:9:21: note:  ***** Analysis succeeded with vector mode V8DF

but if the scalar loop iterated only once, we'd have a cost of just 24 there.
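
The single-iteration case can be sketched with the figures from the second
dump.  The cost model here is a simplifying assumption, not GCC's actual
computation: scalar cost is "outside + n * iteration", and the masked
vector loop pays one full body per started vector iteration.  The helper
names are hypothetical.

```python
import math

# Figures copied from the second (variable upper bound) dump above.
VF = 8
VEC_INSIDE, VEC_OUTSIDE = 180, 8
SCALAR_ITER, SCALAR_OUTSIDE = 24, 32

def scalar_total(n):
    """Modelled scalar cost including the scalar outside cost."""
    return SCALAR_OUTSIDE + SCALAR_ITER * n

def vector_total(n):
    """Modelled fully masked vector cost for n scalar iterations."""
    return VEC_OUTSIDE + VEC_INSIDE * math.ceil(n / VF)

# Smallest trip count at which the modelled vector loop is cheaper.
threshold = next(n for n in range(1, 65) if vector_total(n) < scalar_total(n))

print("modelled threshold:", threshold)
print("n=1 scalar body only:", SCALAR_ITER * 1)
print("n=1 scalar with outside cost:", scalar_total(1))
print("n=1 masked vector:", vector_total(1))
```

The modelled threshold agrees with the runtime profitability threshold of 7
in the dump, while at n = 1 the masked vector loop is modelled at 188
against 24 for the scalar body alone.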
