Hello Richard,

On 14 Feb 11:26, Richard Biener wrote:
>
> The following tries to account for the fact that when constructing
> AVX256 or AVX512 vectors from elements we can only use insertps to
> insert into the low 128bits of a vector but have to use
> vinserti128 or vinserti64x4 to build larger AVX256/512 vectors.
> Those operations also have higher latency (Agner documents
> 3 cycles for Broadwell for reg-reg vinserti128 while insertps has
> one cycle latency).  Agner doesn't have tables for AVX512 yet but
> I guess the story is similar for vinserti64x4.
>
> Latency is similar for FP adds so I re-used ix86_cost->addss for
> this cost.
>
> This works towards fixing the referenced PRs below where we end
> up vectorizing a lot of loads via elementwise construction, mostly
> "enabled" by the new support for alias versioning for variable
> strides.  Here, analyzed for PR84037, the large number of scalar
> loads and vector builds before any meaningful computation means
> the CPU is bottlenecked with AGU and load ops and doesn't get
> any meaningful work done thus the vectorization should end up
> being not profitable (with some more massaging in the vectorizer
> and using SLP which reduces the number of loads a lot I only
> can get into same-speed as not vectorized territory).
>
> So the real fix for those issues is to account for those
> microarchitectural issues in the backend costing.  I've decided
> to plumb this onto the vector construction op if that happens
> to be fed by loads, scaling this cost by the number of
> vector elements (overall latency should grow with the number
> of dependences).
>
> Bootstrap/regtest running on x86_64-unknown-linux-gnu.
>
> I've benchmarked this on Haswell with SPEC CPU 2006 and a three-run
> reveals that it doesn't regress any benchmark off-noise but improves
> 416.gamess by 7%, 465.tonto by 6% and 481.wrf by 2%.  It also fixes
> the Polyhedron capacita regression (which is what I "tuned" the
> factoring with).  I've mentioned the bugs referring any of the above
> affected benchmarks in the ChangeLog but it still has to be verified
> if the bugs are fully fixed (84037 is).
>
> Ok for trunk?

Your patch is OK for trunk.

--
Thanks, K

>
> Any confirmation of the microarchitectural bottleneck in, say,
> Capacita from people with access to cycle-accurate simulators
> are welcome ;)  Performance counters only help so much (not much...),
> so my guesses are based on Agner and finger-counting.
>
> Thanks,
> Richard.
>
> 2018-02-13  Richard Biener  <rguent...@suse.de>
>
>     PR tree-optimization/84037
>     PR tree-optimization/84016
>     PR target/82862
>     * config/i386/i386.c (ix86_builtin_vectorization_cost):
>     Adjust vec_construct for the fact we need additional higher latency
>     128bit inserts for AVX256 and AVX512 vector builds.
>     (ix86_add_stmt_cost): Scale vector construction cost for
>     elementwise loads.
>
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c (revision 257620)
> +++ gcc/config/i386/i386.c (working copy)
> @@ -45904,7 +45904,18 @@ ix86_builtin_vectorization_cost (enum ve
>                            ix86_cost->sse_op, true);
>
>        case vec_construct:
> -        return ix86_vec_cost (mode, ix86_cost->sse_op, false);
> +        {
> +          /* N element inserts.  */
> +          int cost = ix86_vec_cost (mode, ix86_cost->sse_op, false);
> +          /* One vinserti128 for combining two SSE vectors for AVX256.  */
> +          if (GET_MODE_BITSIZE (mode) == 256)
> +            cost += ix86_vec_cost (mode, ix86_cost->addss, true);
> +          /* One vinserti64x4 and two vinserti128 for combining SSE
> +             and AVX256 vectors to AVX512.  */
> +          else if (GET_MODE_BITSIZE (mode) == 512)
> +            cost += 3 * ix86_vec_cost (mode, ix86_cost->addss, true);
> +          return cost;
> +        }
>
>        default:
>          gcc_unreachable ();
> @@ -50243,6 +50254,18 @@ ix86_add_stmt_cost (void *data, int coun
>            break;
>          }
>      }
> +  /* If we do elementwise loads into a vector then we are bound by
> +     latency and execution resources for the many scalar loads
> +     (AGU and load ports).  Try to account for this by scaling the
> +     construction cost by the number of elements involved.  */
> +  if (kind == vec_construct
> +      && stmt_info
> +      && stmt_info->type == load_vec_info_type
> +      && stmt_info->memory_access_type == VMAT_ELEMENTWISE)
> +    {
> +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> +      stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
> +    }
>    if (stmt_cost == -1)
>      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>
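
For anyone following the arithmetic in the quoted patch, here is a minimal
standalone sketch in C of the two cost adjustments.  It is not GCC code:
vec_cost is a simplified stand-in for ix86_vec_cost (it assumes "cost times
number of elements" for non-parallel operations and plain "cost" for parallel
ones), the SSE_OP_COST and ADDSS_COST values are hypothetical, and the
function names are made up for illustration.

```c
/* Hypothetical per-operation costs; not taken from any real GCC cost table.  */
enum { SSE_OP_COST = 1, ADDSS_COST = 3 };

/* Simplified stand-in for ix86_vec_cost (an assumption, not the real helper):
   a non-parallel operation is paid once per element, a parallel one once.  */
static int
vec_cost (int nunits, int cost, int parallel)
{
  return parallel ? cost : cost * nunits;
}

/* Mirrors the vec_construct case of the patch: N element inserts, plus one
   wide insert for a 256-bit vector or three wide inserts for a 512-bit one.  */
static int
vec_construct_cost (int bits, int nunits)
{
  int cost = vec_cost (nunits, SSE_OP_COST, 0);
  if (bits == 256)
    cost += vec_cost (nunits, ADDSS_COST, 1);
  else if (bits == 512)
    cost += 3 * vec_cost (nunits, ADDSS_COST, 1);
  return cost;
}

/* Mirrors the ix86_add_stmt_cost hunk: when the construction is fed by
   elementwise loads, scale the construction cost by the element count.  */
static int
elementwise_load_construct_cost (int bits, int nunits)
{
  return vec_construct_cost (bits, nunits) * nunits;
}
```

With these made-up numbers a 4-element 128-bit build costs 4, an 8-element
256-bit build costs 8 + 3 = 11, and a 16-element 512-bit build costs
16 + 9 = 25.  When fed by elementwise loads these become 16, 88 and 400, so
the penalty grows quickly with the element count.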