The existing Cortex-A57 vector costs block some beneficial vectorization. This is mostly because the vector statement cost is set to 3 and vector loads have a higher cost than scalar loads. As a result, even when we vectorize 4x, the cost of the vectorized loop can come out similar to that of the scalar version, and we fail to vectorize. For example, for a particular loop the costs for -mcpu=generic are:

note: Cost model analysis:
  Vector inside of loop cost: 146
  Vector prologue cost: 5
  Vector epilogue cost: 0
  Scalar iteration cost: 50
  Scalar outside cost: 0
  Vector outside cost: 5
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
note:   Runtime profitability threshold = 3
note:   Static estimate profitability threshold = 3
note: loop vectorized

While -mcpu=cortex-a57 reports:

note: Cost model analysis:
  Vector inside of loop cost: 294
  Vector prologue cost: 15
  Vector epilogue cost: 0
  Scalar iteration cost: 74
  Scalar outside cost: 0
  Vector outside cost: 15
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 31
note:   Runtime profitability threshold = 30
note:   Static estimate profitability threshold = 30
note: not vectorized: vectorization not profitable.
note: not vectorized: iteration count smaller than user specified loop bound parameter or minimum profitable iterations (whichever is more conservative).

Using a cost of 3 for a vector operation suggests that vector operations are 3 times as expensive as scalar operations. Since most vector operations have similar throughput to scalar operations, this is not correct. Using slightly lower values for these costs allows this loop and many others to be vectorized. On a proprietary benchmark the gain from vectorizing this loop is around 15-30%, which shows that vectorizing it is indeed beneficial.

ChangeLog:
2016-11-10  Wilco Dijkstra  <wdijk...@arm.com>

    * config/aarch64/aarch64.c (cortexa57_vector_cost): Change
    vec_stmt_cost, vec_align_load_cost and vec_unalign_load_cost.

--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 279a6dfaa4a9c306bc7a8dba9f4f53704f61fefe..cff2e8fc6e9309e6aa4f68a5aba3bfac3b737283 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -382,12 +382,12 @@ static const struct cpu_vector_cost cortexa57_vector_cost =
   1, /* scalar_stmt_cost  */
   4, /* scalar_load_cost  */
   1, /* scalar_store_cost  */
-  3, /* vec_stmt_cost  */
+  2, /* vec_stmt_cost  */
   3, /* vec_permute_cost  */
   8, /* vec_to_scalar_cost  */
   8, /* scalar_to_vec_cost  */
-  5, /* vec_align_load_cost  */
-  5, /* vec_unalign_load_cost  */
+  4, /* vec_align_load_cost  */
+  4, /* vec_unalign_load_cost  */
   1, /* vec_unalign_store_cost  */
   1, /* vec_store_cost  */
   1, /* cond_taken_branch_cost  */
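
For reference, the "Calculated minimum iters for profitability" figures in the two dumps above can be reproduced with a simple break-even calculation. The sketch below is a simplified model, not the vectorizer's actual code: the function name min_profitable_iters is hypothetical, a vectorization factor of 4 is assumed from the "vectorize 4x" remark, and prologue/epilogue peeling is ignored because both dumps report 0 iterations for it.

```c
#include <assert.h>

/* Simplified break-even model (an assumption for illustration, not GCC code).
   vec_inside:     cost of one vector loop iteration
   vec_outside:    one-off vector cost outside the loop (prologue + epilogue)
   scalar_iter:    cost of one scalar loop iteration
   scalar_outside: one-off scalar cost outside the loop
   vf:             vectorization factor (scalar iterations per vector iteration)
   Returns the smallest scalar iteration count at which the vector loop is
   strictly cheaper, or -1 if it never becomes cheaper.  */
static int
min_profitable_iters (int vec_inside, int vec_outside,
                      int scalar_iter, int scalar_outside, int vf)
{
  assert (vf > 0);

  /* Cost saved by one vector iteration compared with vf scalar iterations.  */
  int saving = scalar_iter * vf - vec_inside;
  if (saving <= 0)
    return -1;

  /* Extra one-off cost of the vector version, scaled by vf so that the
     comparison below stays in integer arithmetic.  */
  int overhead = (vec_outside - scalar_outside) * vf;

  int iters = overhead / saving;

  /* Integer division rounds down; step to the next count if the scalar
     loop is still no more expensive at this one.  */
  if (scalar_iter * vf * iters <= vec_inside * iters + overhead)
    iters++;

  return iters;
}
```

With the numbers from the dumps, min_profitable_iters (146, 5, 50, 0, 4) evaluates to 1 and min_profitable_iters (294, 15, 74, 0, 4) evaluates to 31. In the Cortex-A57 case four scalar iterations cost 296 against 294 for one vector iteration, so each vector iteration saves only 2 units and the outside cost of 15 takes about 30 iterations to recover.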