A previous patch of mine correcting the vectorizer target cost model
to properly cost scalar FP ops vs. scalar INT ops regressed
416.gamess by ~10% on all modern x86 archs.

The following mitigates this in the cost model by noticing that
the vectorized loop in question performs all loads and stores
strided (built up from scalar loads/stores), building on
the pessimization of strided loads added last year.
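For reference, a loop of roughly the following shape (a made-up
sketch, not taken from gamess; the function and parameter names are
invented) has a data-ref step that is not an INTEGER_CST, so the
accesses are classified as VMAT_ELEMENTWISE and each vector has to
be assembled from, and disassembled into, scalar memory operations:

  /* The stride is unknown at compile time, so the vectorizer
     cannot use contiguous vector loads/stores and instead builds
     each vector element-wise from scalar loads and scatters the
     results as scalar stores.  */
  void
  foo (double *x, double *y, long stride, long n)
  {
    for (long i = 0; i < n; ++i)
      y[i * stride] = y[i * stride] + x[i * stride];
  }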

The first half treats strided stores the same as strided
loads, which may make sense on its own (though the latency and
dependence arguments do not apply to stores).  Unfortunately that
alone doesn't make 416.gamess vectorization fail because we end up
with TYPE_VECTOR_SUBPARTS == 2 (AVX256 vectorization is already
rejected for cost reasons).  The second half pushes it over the
edge by adjusting the previous pessimization to multiply by
TYPE_VECTOR_SUBPARTS + 1 instead of just TYPE_VECTOR_SUBPARTS,
which makes the biggest difference for smaller vectors.
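To make the effect of the new factor concrete (these are just the
relative scale factors from the patch, not measured costs):

  TYPE_VECTOR_SUBPARTS == 2:  scale 2 -> 3  (+50%)
  TYPE_VECTOR_SUBPARTS == 4:  scale 4 -> 5  (+25%)
  TYPE_VECTOR_SUBPARTS == 8:  scale 8 -> 9  (+12.5%)

so the extra penalty hits the remaining V2DF case hardest, which is
exactly what is needed to push its cost over the edge.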

I've benchmarked this on a Haswell machine with SPEC 2006,
confirming the regression is fixed, and re-benchmarked apparent
regressions with three runs, confirming they were noise; we may
even end up with a progression there
(see the bugzilla audit-trail for details).

Bootstrapped and tested on x86_64-unknown-linux-gnu.

OK for trunk?

Note I'm going to apply this as two revisions to allow bisection
between the two changes, first the one pessimizing strided
stores and then the factor adjustment.

Thanks,
Richard.

2019-03-15  Richard Biener  <rguent...@suse.de>

        PR target/87561
        * config/i386/i386.c (ix86_add_stmt_cost): Apply strided
        load pessimization to stores as well.
        * config/i386/i386.c (ix86_add_stmt_cost): Pessimize strided
        loads and stores a bit more.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 269683)
+++ gcc/config/i386/i386.c      (working copy)
@@ -50534,14 +50534,15 @@ ix86_add_stmt_cost (void *data, int coun
      latency and execution resources for the many scalar loads
      (AGU and load ports).  Try to account for this by scaling the
      construction cost by the number of elements involved.  */
-  if (kind == vec_construct
+  if ((kind == vec_construct || kind == vec_to_scalar)
       && stmt_info
-      && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+      && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+         || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
-      stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
+      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
     }
   if (stmt_cost == -1)
     stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
