[Bug target/110979] Miss-optimization for O2 fully masked loop on floating point reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979

Richard Biener changed:

What    |Removed |Added
CC      |        |rsandifo at gcc dot gnu.org

--- Comment #5 from Richard Biener ---
For this particular costing there's also the issue that we perform costing at vectorizable_reduction time but then we don't know yet whether we will use partial vectors in the end. We try to apply costs due to using partial vectors in vect_estimate_min_profitable_iters but we don't have any good way to account for extra costs because some operations might expand differently when using partial vectors vs. not using partial vectors.

The only way would be to separate costing of operations from the analysis phase or alternatively record multiple cost variants during analysis and pick the correct one later. I think the former, separating costing from analysis, might be the better way in the end.

Note currently we cost

_4 + sum_13 8 times vec_to_scalar costs 64 in body
_4 + sum_13 8 times scalar_stmt costs 96 in body
*_3 1 times unaligned_load (misalign -1) costs 12 in body
t.c:9:21: note: operating on partial vectors.
2 times vector_stmt costs 8 in prologue
2 times vector_stmt costs 8 in body
t.c:9:21: note: Cost model analysis:
  Vector inside of loop cost: 180
  Vector prologue cost: 8
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 8
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 8
t.c:9:21: note: Runtime profitability threshold = 8
t.c:9:21: note: Static estimate profitability threshold = 8
t.c:9:21: note: * Analysis succeeded with vector mode V8DF

The vector + overhead is thus cheaper than the scalar version but that assumes we'd actually run a full round of VF scalar iterations! If we'd add 7 times vec_to_scalar + scalar_stmt as epilogue cost we'd up the requirement considerably because the difference between scalar (192) and vector (180) is already quite small.
I wonder if we wouldn't need to adjust our formula for static profitability to account for partial vectors? When using a variable upper loop bound we still see

t.c:9:21: note: Cost model analysis:
  Vector inside of loop cost: 180
  Vector prologue cost: 8
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 32
  Vector outside cost: 8
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
t.c:9:21: note: Runtime profitability threshold = 7
t.c:9:21: note: Static estimate profitability threshold = 32
t.c:9:21: note: no need for a runtime choice between the scalar and vector loops
t.c:9:21: note: * Analysis succeeded with vector mode V8DF

but if the scalar loop would only iterate once we'd have a cost of 24 there.
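The thresholds in the two dumps can be reproduced with a little arithmetic. What follows is a minimal sketch in C, not GCC's code: the function names are hypothetical, and the inequality in the comment is an assumption, chosen because it yields the reported thresholds of 8 and 7. It also illustrates the concern about short trip counts: a fully masked loop that covers a single scalar iteration is still charged one full vector iteration.

```c
#include <assert.h>

/* Hypothetical helpers; the cost numbers fed to them come from the
   dumps quoted in Comment #5. */

/* Smallest niters for which the scalar loop costs more than vec_iters
   vector iterations.  Assumed inequality (not taken from GCC):
     scalar_iter * niters + scalar_outside
       > vec_inside * vec_iters + vec_outside  */
static int min_profitable_iters(int vec_inside, int vec_outside,
                                int scalar_iter, int scalar_outside,
                                int vec_iters)
{
    assert(scalar_iter > 0 && vec_iters > 0);
    int threshold = vec_inside * vec_iters + vec_outside - scalar_outside;
    return threshold <= 0 ? 1 : threshold / scalar_iter + 1;
}

/* Cost of a fully masked loop: every vector iteration is charged in
   full, even when only some of its lanes are active. */
static int masked_vector_cost(int vec_inside, int vec_outside,
                              int niters, int vf)
{
    assert(vf > 0 && niters >= 0);
    int vec_iters = (niters + vf - 1) / vf;   /* round up */
    return vec_inside * vec_iters + vec_outside;
}
```

With the numbers from the first dump, min_profitable_iters(180, 8, 24, 0, 1) gives 8; with a scalar outside cost of 32 it gives 7. masked_vector_cost(180, 8, 1, 8) is 188, against a scalar cost of 24 for a single iteration.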
[Bug target/110979] Miss-optimization for O2 fully masked loop on floating point reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979

Richard Biener changed:

What     |Removed                     |Added
Assignee |unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
Status   |NEW                         |ASSIGNED

--- Comment #4 from Richard Biener ---
The wrong-code part is fixed now, what remains is the inefficiency. I don't think we currently cost the "excess" lanes in regular vectorized operations but of course for open-coded fold-left reductions we should likely account for possibly VF - 1 extra scalar ops (but in the "epilog" even if that doesn't exist, since that only applies to the last vector iteration).

I fear it's not going to be enough to fend off vectorization though.
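To make the "VF - 1 extra scalar ops" bound concrete, here is a small sketch in C that counts the excess lanes in the last vector iteration. The helper name is hypothetical and nothing here is GCC code; VF = 8 is assumed only because the dumps in this thread report vector mode V8DF.

```c
#include <assert.h>

/* Hypothetical helper: number of lanes in the last vector iteration
   that lie past the scalar trip count and are therefore masked off. */
static int excess_lanes(int niters, int vf)
{
    assert(vf > 0 && niters >= 0);
    return (vf - niters % vf) % vf;
}
```

For a trip count of 20 and an assumed VF of 8, excess_lanes(20, 8) is 4. The worst case is a trip count of 1, where excess_lanes(1, 8) is 7, that is VF - 1.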
[Bug target/110979] Miss-optimization for O2 fully masked loop on floating point reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979

--- Comment #3 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:798a880a0b1fed8a9e3b3e026dd8bd09314b7c38

commit r14-3149-g798a880a0b1fed8a9e3b3e026dd8bd09314b7c38
Author: Richard Biener
Date: Fri Aug 11 13:00:17 2023 +0200

tree-optimization/110979 - fold-left reduction and partial vectors

When we vectorize fold-left reductions with partial vectors but no target operation available we use a vector conditional to force excess elements to zero. But that doesn't correctly preserve the sign of zero. The following patch disables partial vector support when we have to do that and also need to honor rounding modes other than round-to-nearest. When round-to-nearest is in effect and we have to preserve the sign of zero instead use negative zero for the excess elements.

PR tree-optimization/110979
* tree-vect-loop.cc (vectorizable_reduction): For FOLD_LEFT_REDUCTION without target support make sure we don't need to honor signed zeros and sign dependent rounding.

* gcc.dg/torture/pr110979.c: New testcase.
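The sign-of-zero problem described in the commit message can be shown with a scalar emulation. This is a sketch under stated assumptions, not the vectorizer's code: the function names are hypothetical, the mask is modelled as a plain index comparison, and the default round-to-nearest mode is assumed. In that mode (-0.0) + (+0.0) is +0.0, while x + (-0.0) is x for every x, which is why -0.0 is the safe fill value for the excess lanes.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical scalar emulation of one masked fold-left step over vf
   lanes: the first `active` lanes contribute lane_value, the excess
   lanes contribute `fill`, and everything is added strictly in order. */
static double masked_fold_left(double sum, double lane_value,
                               int active, int vf, double fill)
{
    assert(active >= 0 && active <= vf);
    for (int i = 0; i < vf; i++)
        sum += (i < active) ? lane_value : fill;
    return sum;
}

/* Returns 1 if summing negative zeros with 4 of 8 lanes active keeps
   the negative sign for the given fill value, 0 otherwise. */
static int sign_preserved(double fill)
{
    return signbit(masked_fold_left(-0.0, -0.0, 4, 8, fill)) != 0;
}
```

sign_preserved(0.0) returns 0, which is the wrong-code case, and sign_preserved(-0.0) returns 1. Under round-toward-negative the sum (+0.0) + (-0.0) is -0.0, so the -0.0 fill is no longer neutral there, which matches the commit's decision to disable partial vectors when sign-dependent rounding has to be honored.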
[Bug target/110979] Miss-optimization for O2 fully masked loop on floating point reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979

--- Comment #2 from Alexander Monakov ---
Yes, it is wrong-code to full extent. To demonstrate, you can initialize 'sum' and the array to negative zeroes:

#define FLT double
#define N 20

__attribute__((noipa))
FLT foo3 (FLT *a)
{
  FLT sum = -0.0;
  for (int i = 0; i != N; i++)
    sum += a[i];
  return sum;
}

int main()
{
  FLT a[N];
  for (int i = 0; i != N; i++)
    a[i] = -0.0;
  if (!__builtin_signbit(foo3(a)))
    __builtin_abort();
}
[Bug target/110979] Miss-optimization for O2 fully masked loop on floating point reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979

Richard Biener changed:

What             |Removed     |Added
CC               |            |amonakov at gcc dot gnu.org, rguenth at gcc dot gnu.org
Last reconfirmed |            |2023-08-11
Ever confirmed   |0           |1
Blocks           |            |53947
Status           |UNCONFIRMED |NEW

--- Comment #1 from Richard Biener ---
I think there's a duplicate bug showing in-order reduction vectorization results in bad code (effectively unrolling VF times). I don't think this is in any way connected to using partial vectors though.

Now, I think this is also wrong-code to some extent as without -fno-signed-zeros adding 0.0 can result in a wrong sign?

Without partial vectors and i != 96 you get similar unrolling, but at least we won't execute adds for not executed lanes. With -O3 you'd get an unrolled main loop and an epilog loop (with the original loop bound).

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations