The existing vector costs prevent some beneficial vectorization.  This is
mostly due to the vector statement cost being set to 3 and to vector loads
having a higher cost than scalar loads.  As a result, even when we vectorize
4x, the cost of the vectorized loop can be similar to that of the scalar
version, and we fail to vectorize.  For example, for a particular loop the
costs for -mcpu=generic are:

note: Cost model analysis: 
  Vector inside of loop cost: 146
  Vector prologue cost: 5
  Vector epilogue cost: 0
  Scalar iteration cost: 50
  Scalar outside cost: 0
  Vector outside cost: 5
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
note:   Runtime profitability threshold = 3
note:   Static estimate profitability threshold = 3
note: loop vectorized


While -mcpu=cortex-a57 reports:

note: Cost model analysis: 
  Vector inside of loop cost: 294
  Vector prologue cost: 15
  Vector epilogue cost: 0
  Scalar iteration cost: 74
  Scalar outside cost: 0
  Vector outside cost: 15
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 31
note:   Runtime profitability threshold = 30
note:   Static estimate profitability threshold = 30
note: not vectorized: vectorization not profitable.
note: not vectorized: iteration count smaller than user specified loop bound parameter or minimum profitable iterations (whichever is more conservative).
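
As a rough cross-check, the thresholds in the two dumps can be reproduced by
hand.  The following is a simplified sketch of the break-even arithmetic, not
a copy of the vectorizer code.  The function names are made up for
illustration, a vectorization factor of 4 is assumed, and prologue/epilogue
iterations are taken to be 0 as reported above.

```c
/* Illustrative sketch only: hypothetical helpers, simplified to the case
   with no prologue or epilogue iterations.  */

/* Smallest iteration count n for which
     scalar_iter * n + scalar_outside > vec_inside * n / vf + vec_outside
   holds, or -1 if the vector body is never cheaper.  */
static int
min_profitable_iters (int scalar_iter, int vec_inside,
                      int scalar_outside, int vec_outside, int vf)
{
  /* Cost saved by one vector iteration over vf scalar iterations.  */
  int gain = scalar_iter * vf - vec_inside;
  /* Extra one-off cost of the vector version, scaled by vf.  */
  int overhead = (vec_outside - scalar_outside) * vf;
  int n;

  if (gain <= 0)
    return -1;
  if (overhead < 0)
    overhead = 0;  /* Assumption: treat a cheaper vector setup as free.  */

  n = overhead / gain;
  /* Bump by one if the vector version is not yet strictly cheaper.  */
  if (scalar_iter * vf * n <= vec_inside * n + overhead)
    n++;
  return n;
}

/* Threshold for a "niters <= threshold" guard: require at least one full
   vector iteration, then subtract one because the test uses <=.  */
static int
profitability_threshold (int min_iters, int vf)
{
  return (min_iters < vf ? vf : min_iters) - 1;
}
```

With the -mcpu=generic numbers (50, 146, 0, 5) this gives a minimum of 1 and a
threshold of 3; with the -mcpu=cortex-a57 numbers (74, 294, 0, 15) it gives 31
and 30, matching the dumps.  In the cortex-a57 case one vector iteration saves
only 4 * 74 - 294 = 2 cost units, so the threshold is very sensitive to small
changes in the vector costs.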


Using a cost of 3 for a vector operation suggests that vector operations are
3 times as expensive as scalar operations.  Since most vector operations have
a throughput similar to that of scalar operations, this is not correct.

Using slightly lower values for these heuristics allows this loop and many
others to be vectorized.  On a proprietary benchmark the gain from
vectorizing this loop is around 15-30%, which shows that vectorizing it is
indeed beneficial.

ChangeLog:
2016-11-10  Wilco Dijkstra  <wdijk...@arm.com>

        * config/aarch64/aarch64.c (cortexa57_vector_cost):
        Change vec_stmt_cost, vec_align_load_cost and vec_unalign_load_cost.

--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 279a6dfaa4a9c306bc7a8dba9f4f53704f61fefe..cff2e8fc6e9309e6aa4f68a5aba3bfac3b737283 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -382,12 +382,12 @@ static const struct cpu_vector_cost cortexa57_vector_cost =
   1, /* scalar_stmt_cost  */
   4, /* scalar_load_cost  */
   1, /* scalar_store_cost  */
-  3, /* vec_stmt_cost  */
+  2, /* vec_stmt_cost  */
   3, /* vec_permute_cost  */
   8, /* vec_to_scalar_cost  */
   8, /* scalar_to_vec_cost  */
-  5, /* vec_align_load_cost  */
-  5, /* vec_unalign_load_cost  */
+  4, /* vec_align_load_cost  */
+  4, /* vec_unalign_load_cost  */
   1, /* vec_unalign_store_cost  */
   1, /* vec_store_cost  */
   1, /* cond_taken_branch_cost  */
