https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110692
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
           Keywords|                            |missed-optimization
          Component|middle-end                  |tree-optimization
             Blocks|                            |53947

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
With the variable bound we have

a[i_11] 1 times vector_load costs 12 in body
_1 + 1 1 times scalar_to_vec costs 4 in prologue
_1 + 1 1 times vector_stmt costs 4 in body
_2 1 times vector_store costs 12 in body
t.c:4:28: note: operating on full vectors for epilogue loop.
t.c:4:28: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
a[i_11] 1 times scalar_load costs 12 in epilogue
_1 + 1 1 times scalar_stmt costs 4 in epilogue
_2 1 times scalar_store costs 12 in epilogue
<unknown> 1 times cond_branch_taken costs 16 in epilogue
t.c:4:28: note: Cost model analysis:
  Vector inside of loop cost: 28
  Vector prologue cost: 4
  Vector epilogue cost: 44
  Scalar iteration cost: 28
  Scalar outside cost: 32
  Vector outside cost: 48
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 1
t.c:4:28: note: Runtime profitability threshold = 2
t.c:4:28: note: Static estimate profitability threshold = 4
t.c:4:28: missed: not vectorized: estimated iteration count too small.

with the static bound:

a[i_10] 1 times vector_load costs 12 in body
_1 + 1 1 times scalar_to_vec costs 4 in prologue
_1 + 1 1 times vector_stmt costs 4 in body
_2 1 times vector_store costs 12 in body
t.c:4:28: note: operating on full vectors for epilogue loop.
a[i_10] 1 times scalar_load costs 12 in epilogue
_1 + 1 1 times scalar_stmt costs 4 in epilogue
_2 1 times scalar_store costs 12 in epilogue
t.c:4:28: note: Cost model analysis:
  Vector inside of loop cost: 28
  Vector prologue cost: 4
  Vector epilogue cost: 28
  Scalar iteration cost: 28
  Scalar outside cost: 0
  Vector outside cost: 32
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2
t.c:4:28: note: Runtime profitability threshold = 2
t.c:4:28: note: Static estimate profitability threshold = 2

so it's the branch of the epilog of the epilog that ups the cost, not sure
whether in a reasonable way for this case. In the end I think if the count
is unknown a fully peeled epilog with 2 iterations is a reasonable
implementation.

We could decide that costing the epilogue vectorization isn't worthwhile
but note we are not implementing all 8 byte vector ops with the same
efficiency as the 16 byte ones.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
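For reference, the difference between the two dumps in comment #1 can be re-added by hand. The following is only a sketch in Python: the numbers are copied from the dumps, while the dictionary layout and the `summarize` helper are made up for illustration and are not how the vectorizer computes costs internally. It relies on one observation from the dumps themselves, namely that in both cases the reported "Vector outside cost" equals prologue cost plus epilogue cost.

```python
# Per-statement costs copied from the two vectorizer dumps in comment #1.
# The data layout and the helper are illustrative only; this is not GCC's
# internal cost-model code.

VARIABLE_BOUND = {
    "body": [12, 4, 12],          # vector_load, vector_stmt, vector_store
    "prologue": [4],              # scalar_to_vec
    "epilogue": [12, 4, 12, 16],  # scalar_load, scalar_stmt, scalar_store,
                                  # cond_branch_taken
}

STATIC_BOUND = {
    "body": [12, 4, 12],          # same vector body as above
    "prologue": [4],              # scalar_to_vec
    "epilogue": [12, 4, 12],      # no cond_branch_taken entry in this dump
}


def summarize(costs):
    """Sum the per-statement costs for each part of the loop."""
    inside = sum(costs["body"])
    prologue = sum(costs["prologue"])
    epilogue = sum(costs["epilogue"])
    return {
        "inside": inside,
        "prologue": prologue,
        "epilogue": epilogue,
        # Matches "Vector outside cost" in both dumps.
        "outside": prologue + epilogue,
    }


variable = summarize(VARIABLE_BOUND)
static = summarize(STATIC_BOUND)

# The whole gap in outside cost comes from the one extra epilogue branch.
extra = variable["outside"] - static["outside"]

print("variable bound:", variable)
print("static bound:  ", static)
print("extra outside cost with variable bound:", extra)
```

The sums reproduce the inside, prologue, epilogue and outside figures reported in both dumps, and the remaining gap is exactly the `cond_branch_taken` entry. The sketch does not attempt to reproduce the profitability thresholds, which depend on parts of the cost model not shown in the dumps.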