[Bug target/110979] Miss-optimization for O2 fully masked loop on floating point reduction.

2023-08-11 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #5 from Richard Biener  ---
For this particular costing there's also the issue that we perform costing at
vectorizable_reduction time but then we don't know yet whether we will use
partial vectors in the end.  We try to apply costs due to using partial vectors
in vect_estimate_min_profitable_iters but we don't have any good way to
account for extra costs because some operations might expand differently
when using partial vectors vs. not using partial vectors.  The only way
would be to separate costing of operations from the analysis phase or
alternatively record multiple cost variants during analysis and pick the
correct one later.  I think the former, separating costing from analysis,
might be the better way in the end.

Note currently we cost

_4 + sum_13 8 times vec_to_scalar costs 64 in body
_4 + sum_13 8 times scalar_stmt costs 96 in body
*_3 1 times unaligned_load (misalign -1) costs 12 in body
t.c:9:21: note:  operating on partial vectors.
 2 times vector_stmt costs 8 in prologue
 2 times vector_stmt costs 8 in body
t.c:9:21: note:  Cost model analysis:
  Vector inside of loop cost: 180
  Vector prologue cost: 8
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 8
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 8
t.c:9:21: note:  Runtime profitability threshold = 8
t.c:9:21: note:  Static estimate profitability threshold = 8
t.c:9:21: note:  * Analysis succeeded with vector mode V8DF

The vector + overhead is thus cheaper than the scalar version but
that assumes we'd actually run a full round of VF scalar iterations!

If we added 7 times vec_to_scalar + scalar_stmt as epilogue cost,
we'd raise the requirement considerably, because the difference between
scalar (192) and vector (180) is already quite small.
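
To make that arithmetic concrete, here is a minimal C sketch using only the numbers in the dump above. The per-statement costs are read off the dump (64/8 for vec_to_scalar, 96/8 for scalar_stmt). The function names are hypothetical, and this is not how the vectorizer computes costs internally.

```c
/* Numbers taken from the cost dump above; names are illustrative only. */
enum {
  VF = 8,              /* lanes in V8DF */
  VEC_TO_SCALAR = 8,   /* 64 / 8 */
  SCALAR_STMT = 12,    /* 96 / 8 */
  VECTOR_BODY = 180,   /* "Vector inside of loop cost" */
  VECTOR_OUTSIDE = 8,  /* "Vector outside cost" */
  SCALAR_ITER = 24     /* "Scalar iteration cost" */
};

/* Cost of a full round of VF scalar iterations. */
int scalar_round_cost(void)
{
  return VF * SCALAR_ITER;
}

/* Cost of one vector iteration plus overhead, as currently accounted. */
int vector_cost_current(void)
{
  return VECTOR_BODY + VECTOR_OUTSIDE;
}

/* Assumption: charge VF - 1 excess lanes (one vec_to_scalar plus one
   scalar_stmt each) as additional epilogue cost. */
int vector_cost_with_excess_lanes(void)
{
  return vector_cost_current() + (VF - 1) * (VEC_TO_SCALAR + SCALAR_STMT);
}
```

With these numbers the scalar round costs 192, the vector iteration costs 188 including the prologue, and charging the 7 excess lanes would raise the vector side to 328.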

I wonder if we wouldn't need to adjust our formula for static profitability
to account for partial vectors?

When using a variable upper loop bound we still see

t.c:9:21: note:  Cost model analysis:
  Vector inside of loop cost: 180
  Vector prologue cost: 8 
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 32
  Vector outside cost: 8
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
t.c:9:21: note:  Runtime profitability threshold = 7
t.c:9:21: note:  Static estimate profitability threshold = 32
t.c:9:21: note:  no need for a runtime choice between the scalar and vector loops
t.c:9:21: note:  * Analysis succeeded with vector mode V8DF

but if the scalar loop iterated only once, we'd have a cost of 24 there.
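
The break-even points in the two dumps can be approximated with a very simple model, sketched below in C. This is an assumption-laden simplification, not the formula used by vect_estimate_min_profitable_iters: it charges the full vector body cost for every masked iteration, has no epilogue loop, and all names are hypothetical.

```c
#include <stdbool.h>

/* Numbers taken from the cost dumps above. */
enum {
  PV_VF = 8,
  PV_VECTOR_BODY = 180,
  PV_VECTOR_OUTSIDE = 8,
  PV_SCALAR_ITER = 24
};

/* Total scalar cost for n iterations plus any outside cost. */
static int scalar_total(int n, int scalar_outside)
{
  return n * PV_SCALAR_ITER + scalar_outside;
}

/* Total vector cost with partial vectors: ceil(n / VF) masked
   iterations, each charged the full body cost. */
static int vector_total(int n)
{
  int vec_iters = (n + PV_VF - 1) / PV_VF;
  return vec_iters * PV_VECTOR_BODY + PV_VECTOR_OUTSIDE;
}

/* True if the modeled vector loop is strictly cheaper for n iterations. */
bool vector_is_cheaper(int n, int scalar_outside)
{
  return vector_total(n) < scalar_total(n, scalar_outside);
}

/* Smallest n in [1, limit] where the vector loop wins, or -1 if none. */
int first_profitable_iters(int scalar_outside, int limit)
{
  for (int n = 1; n <= limit; n++)
    if (vector_is_cheaper(n, scalar_outside))
      return n;
  return -1;
}
```

Under this model first_profitable_iters(0, 64) gives 8 and first_profitable_iters(32, 64) gives 7, which happens to agree with the two dumps. For a single iteration the scalar side costs 24 (56 with the outside cost of 32) against 188 for the vector side.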

2023-08-11 Thread rguenth at gcc dot gnu.org via Gcc-bugs

Richard Biener  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
             Status|NEW                            |ASSIGNED

--- Comment #4 from Richard Biener  ---
The wrong-code part is fixed now, what remains is the inefficiency.  I don't
think we currently cost the "excess" lanes in regular vectorized operations but
of course for open-coded fold-left reductions we should likely account for
possibly VF - 1 extra scalar ops (but in the "epilog" even if that doesn't
exist, since that only applies to the last vector iteration).  I fear it's not
going to be enough to fend off vectorization though.

2023-08-11 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs

--- Comment #3 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:798a880a0b1fed8a9e3b3e026dd8bd09314b7c38

commit r14-3149-g798a880a0b1fed8a9e3b3e026dd8bd09314b7c38
Author: Richard Biener 
Date:   Fri Aug 11 13:00:17 2023 +0200

tree-optimization/110979 - fold-left reduction and partial vectors

When we vectorize fold-left reductions with partial vectors but
no target operation available we use a vector conditional to force
excess elements to zero.  But that doesn't correctly preserve
the sign of zero.  The following patch disables partial vector
support when we have to do that and also need to honor rounding
modes other than round-to-nearest.  When round-to-nearest is in
effect and we have to preserve the sign of zero instead use
negative zero for the excess elements.

PR tree-optimization/110979
* tree-vect-loop.cc (vectorizable_reduction): For
FOLD_LEFT_REDUCTION without target support make sure
we don't need to honor signed zeros and sign dependent rounding.

* gcc.dg/torture/pr110979.c: New testcase.
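
The effect of the padding value described in the commit message can be imitated in plain scalar C. The sketch below is only a model of an open-coded fold-left reduction over partial vectors, assuming VF = 8 and the default round-to-nearest mode; fold_left_padded is a hypothetical helper, not code from tree-vect-loop.cc.

```c
#include <math.h>

#define VF 8  /* assumed vectorization factor, as for V8DF */

/* In-order (fold-left) sum of n doubles processed in chunks of VF lanes.
   Lanes past n in the final chunk are replaced by `pad`, mimicking the
   vector conditional that forces excess elements to a fixed value. */
double fold_left_padded(const double *a, int n, double init, double pad)
{
  double sum = init;
  for (int i = 0; i < n; i += VF)
    for (int lane = 0; lane < VF; lane++)
      sum += (i + lane < n) ? a[i + lane] : pad;
  return sum;
}

/* Returns nonzero if the padded sum has its sign bit set. */
int padded_sum_is_negative(const double *a, int n, double init, double pad)
{
  return signbit(fold_left_padded(a, n, init, pad)) != 0;
}
```

For 20 elements that are all -0.0 and an initial value of -0.0, padding with 0.0 yields +0.0 because (-0.0) + (+0.0) is +0.0 under round-to-nearest, while padding with -0.0 keeps the result at -0.0. Under round-toward-negative, (+0.0) + (-0.0) is -0.0, so -0.0 is no longer a neutral element there, which is consistent with the patch disabling partial vectors when sign-dependent rounding has to be honored.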

2023-08-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs

--- Comment #2 from Alexander Monakov  ---
Yes, it is wrong-code to the full extent. To demonstrate, you can initialize 'sum'
and the array to negative zeroes:

#define FLT double
#define N 20

__attribute__((noipa))
FLT
foo3 (FLT *a)
{
  FLT sum = -0.0;
  for (int i = 0; i != N; i++)
    sum += a[i];
  return sum;
}

int main()
{
  FLT a[N];
  for (int i = 0; i != N; i++)
    a[i] = -0.0;
  if (!__builtin_signbit(foo3(a)))
    __builtin_abort();
}

2023-08-11 Thread rguenth at gcc dot gnu.org via Gcc-bugs

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
   Last reconfirmed|                            |2023-08-11
     Ever confirmed|0                           |1
             Blocks|                            |53947
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener  ---
I think there's a duplicate bug showing that in-order reduction vectorization
results in bad code (effectively unrolling VF times).  I don't think this is in
any way connected to using partial vectors though.

Now, I think this is also wrong-code to some extent as without
-fno-signed-zeros adding 0.0 can result in a wrong sign?

Without partial vectors and i != 96 you get similar unrolling, but at least
we won't execute adds for lanes that are not active.  With -O3 you'd get an
unrolled main loop and an epilog loop (with the original loop bound).


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations