[Bug tree-optimization/110692] epilogues for loop which can be also vectorized with half size can be improved.

rguenth at gcc dot gnu.org via Gcc-bugs Mon, 17 Jul 2023 00:45:28 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110692


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
           Keywords|                            |missed-optimization
          Component|middle-end                  |tree-optimization
             Blocks|                            |53947

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
With the variable bound we have

a[i_11] 1 times vector_load costs 12 in body
_1 + 1 1 times scalar_to_vec costs 4 in prologue
_1 + 1 1 times vector_stmt costs 4 in body
_2 1 times vector_store costs 12 in body
t.c:4:28: note:  operating on full vectors for epilogue loop.
t.c:4:28: note:  cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
a[i_11] 1 times scalar_load costs 12 in epilogue
_1 + 1 1 times scalar_stmt costs 4 in epilogue
_2 1 times scalar_store costs 12 in epilogue
<unknown> 1 times cond_branch_taken costs 16 in epilogue
t.c:4:28: note:  Cost model analysis:
  Vector inside of loop cost: 28
  Vector prologue cost: 4
  Vector epilogue cost: 44
  Scalar iteration cost: 28
  Scalar outside cost: 32
  Vector outside cost: 48
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 1
t.c:4:28: note:    Runtime profitability threshold = 2
t.c:4:28: note:    Static estimate profitability threshold = 4
t.c:4:28: missed:  not vectorized: estimated iteration count too small.

With the static bound we instead get:

a[i_10] 1 times vector_load costs 12 in body
_1 + 1 1 times scalar_to_vec costs 4 in prologue
_1 + 1 1 times vector_stmt costs 4 in body
_2 1 times vector_store costs 12 in body
t.c:4:28: note:  operating on full vectors for epilogue loop.
a[i_10] 1 times scalar_load costs 12 in epilogue
_1 + 1 1 times scalar_stmt costs 4 in epilogue
_2 1 times scalar_store costs 12 in epilogue
t.c:4:28: note:  Cost model analysis:
  Vector inside of loop cost: 28
  Vector prologue cost: 4
  Vector epilogue cost: 28
  Scalar iteration cost: 28
  Scalar outside cost: 0
  Vector outside cost: 32
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2
t.c:4:28: note:    Runtime profitability threshold = 2
t.c:4:28: note:    Static estimate profitability threshold = 2

So it is the branch in the epilogue of the epilogue (the cond_branch_taken
entry costing 16) that raises the cost; I am not sure whether that is
reasonable for this case.  In the end I think that if the iteration count
is unknown, a fully peeled epilogue with 2 iterations is a reasonable
implementation.
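
The difference between the two dumps can be checked by hand.  The sketch
below only re-adds the per-statement costs printed above; the struct and
function names are hypothetical and are not GCC internals, and the
"outside = prologue + epilogue" relation is simply what both dumps show,
not a statement about how the cost model is implemented.

```c
/* Illustrative only: re-adds the per-statement costs from the two dumps.
   All names here are made up for the example, not taken from GCC.  */
struct epilogue_costs {
  int scalar_load;
  int scalar_stmt;
  int scalar_store;
  int cond_branch_taken;  /* 0 when the dump lists no branch entry */
};

/* Sum of the "in epilogue" entries of one dump.  */
int epilogue_cost(struct epilogue_costs c) {
  return c.scalar_load + c.scalar_stmt + c.scalar_store
         + c.cond_branch_taken;
}

/* In both dumps "Vector outside cost" equals prologue + epilogue.  */
int vector_outside_cost(int prologue, struct epilogue_costs c) {
  return prologue + epilogue_cost(c);
}

/* Values copied from the dumps: variable bound, then static bound.  */
const struct epilogue_costs variable_bound = {12, 4, 12, 16};
const struct epilogue_costs static_bound = {12, 4, 12, 0};
```

With the prologue cost of 4 from both dumps this gives 44 and 48 for the
variable bound versus 28 and 32 for the static bound, so the 16 units for
cond_branch_taken account for the whole difference in vector cost.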

We could decide that costing the epilogue vectorization isn't worthwhile,
but note that we do not implement all 8-byte vector ops as efficiently as
the 16-byte ones.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
