Hello Richard,
On 14 Feb 11:26, Richard Biener wrote:
> 
> The following tries to account for the fact that when constructing
> AVX256 or AVX512 vectors from elements we can only use insertps to
> insert into the low 128 bits of a vector but have to use
> vinserti128 or vinserti64x4 to build larger AVX256/512 vectors.
> Those insert operations also have higher latency (Agner documents
> 3 cycles on Broadwell for reg-reg vinserti128, while insertps has
> one-cycle latency).  Agner doesn't have tables for AVX512 yet, but
> I guess the story is similar for vinserti64x4.
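> 
> To illustrate, here is a rough intrinsics sketch of such an
> elementwise v8sf build (a hand-written approximation, not the exact
> sequence GCC emits; for float data the combining insert is
> vinsertf128 rather than vinserti128):
> 
>   #include <immintrin.h>
> 
>   /* Build a 256-bit vector from eight scalar loads.  insertps can
>      only target the low 128 bits, so each 128-bit half is built
>      separately and then combined with the higher-latency cross-lane
>      128-bit insert.  */
>   __m256 build_v8sf (const float *p, const long *idx)
>   {
>     __m128 lo = _mm_load_ss (&p[idx[0]]);
>     lo = _mm_insert_ps (lo, _mm_load_ss (&p[idx[1]]), 0x10);
>     lo = _mm_insert_ps (lo, _mm_load_ss (&p[idx[2]]), 0x20);
>     lo = _mm_insert_ps (lo, _mm_load_ss (&p[idx[3]]), 0x30);
>     __m128 hi = _mm_load_ss (&p[idx[4]]);
>     hi = _mm_insert_ps (hi, _mm_load_ss (&p[idx[5]]), 0x10);
>     hi = _mm_insert_ps (hi, _mm_load_ss (&p[idx[6]]), 0x20);
>     hi = _mm_insert_ps (hi, _mm_load_ss (&p[idx[7]]), 0x30);
>     /* vinsertf128: combine the two 128-bit halves.  */
>     return _mm256_insertf128_ps (_mm256_castps128_ps256 (lo), hi, 1);
>   }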
> 
> The latency is similar for FP adds, so I re-used ix86_cost->addss
> for this cost.
> 
> This works towards fixing the PRs referenced below, where we end
> up vectorizing a lot of loads via elementwise construction, mostly
> "enabled" by the new support for alias versioning for variable
> strides.  There, as analyzed for PR84037, the large number of scalar
> loads and vector builds before any meaningful computation means
> the CPU is bottlenecked on AGU and load ops and doesn't get
> any meaningful work done, so the vectorization ends up being
> unprofitable (even with some more massaging in the vectorizer
> and using SLP, which reduces the number of loads a lot, I can
> only get into same-speed-as-not-vectorized territory).
> 
> So the real fix for those issues is to account for these
> microarchitectural constraints in the backend costing.  I've
> decided to plumb this onto the vector construction op when it
> happens to be fed by loads, scaling the construction cost by the
> number of vector elements (overall latency should grow with the
> number of dependences).
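> 
> As a worked example: with the patch below, a V8SF vec_construct
> costs 8 * sse_op for the element inserts plus one addss-equivalent
> for the 128-bit insert, and when fed by elementwise loads that sum
> is further multiplied by the 8 subparts, i.e. roughly
> (8 * sse_op + addss) * 8 (the exact numbers depend on the active
> cost table and tuning).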
> 
> Bootstrap/regtest running on x86_64-unknown-linux-gnu.
> 
> I've benchmarked this on Haswell with SPEC CPU 2006, and a three-run
> comparison shows that it doesn't regress any benchmark outside of the
> noise but improves 416.gamess by 7%, 465.tonto by 6% and 481.wrf by 2%.
> It also fixes the Polyhedron capacita regression (which is what I
> "tuned" the factoring with).  I've mentioned the bugs referring to any
> of the above affected benchmarks in the ChangeLog, but it still has to
> be verified whether those bugs are fully fixed (84037 is).
> 
> Ok for trunk?
Your patch is OK for trunk.

--
Thanks, K

> 
> Any confirmation of the microarchitectural bottleneck in, say,
> Capacita from people with access to cycle-accurate simulators
> is welcome ;)  Performance counters only help so much (not much...),
> so my guesses are based on Agner and finger-counting.
> 
> Thanks,
> Richard.
> 
> 2018-02-13  Richard Biener  <rguent...@suse.de>
> 
>       PR tree-optimization/84037
>       PR tree-optimization/84016
>       PR target/82862
>       * config/i386/i386.c (ix86_builtin_vectorization_cost):
>       Adjust vec_construct for the fact that we need additional,
>       higher-latency 128-bit inserts for AVX256 and AVX512 vector
>       builds.
>       (ix86_add_stmt_cost): Scale vector construction cost for
>       elementwise loads.
> 
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c    (revision 257620)
> +++ gcc/config/i386/i386.c    (working copy)
> @@ -45904,7 +45904,18 @@ ix86_builtin_vectorization_cost (enum ve
>                             ix86_cost->sse_op, true);
>  
>        case vec_construct:
> -     return ix86_vec_cost (mode, ix86_cost->sse_op, false);
> +     {
> +       /* N element inserts.  */
> +       int cost = ix86_vec_cost (mode, ix86_cost->sse_op, false);
> +       /* One vinserti128 for combining two SSE vectors for AVX256.  */
> +       if (GET_MODE_BITSIZE (mode) == 256)
> +         cost += ix86_vec_cost (mode, ix86_cost->addss, true);
> +       /* One vinserti64x4 and two vinserti128 for combining SSE
> +          and AVX256 vectors to AVX512.  */
> +       else if (GET_MODE_BITSIZE (mode) == 512)
> +         cost += 3 * ix86_vec_cost (mode, ix86_cost->addss, true);
> +       return cost;
> +     }
>  
>        default:
>          gcc_unreachable ();
> @@ -50243,6 +50254,18 @@ ix86_add_stmt_cost (void *data, int coun
>         break;
>       }
>      }
> +  /* If we do elementwise loads into a vector then we are bound by
> +     latency and execution resources for the many scalar loads
> +     (AGU and load ports).  Try to account for this by scaling the
> +     construction cost by the number of elements involved.  */
> +  if (kind == vec_construct
> +      && stmt_info
> +      && stmt_info->type == load_vec_info_type
> +      && stmt_info->memory_access_type == VMAT_ELEMENTWISE)
> +    {
> +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> +      stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
> +    }
>    if (stmt_cost == -1)
>      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>  
