> The following fixes the vec_construct cost calculation to properly
> consider that the inserts will happen to SSE regs and thus forgo the
> multiplication done in ix86_vec_cost, which is passed the wrong mode.
> This gets rid of the only call passing false to ix86_vec_cost (so
> consider the patch amended to remove the arg if approved).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> OK for trunk?

OK, thanks!
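To make the new costing concrete, here is a minimal standalone sketch
(not the GCC code itself).  The cost values are made up for
illustration; the real ones come from the per-CPU cost tables, and the
real vinserti128 term still goes through ix86_vec_cost, so tunings may
scale it further:

#include <stdio.h>

#define SSE_OP 1   /* hypothetical stand-in for ix86_cost->sse_op */
#define ADDSS  4   /* hypothetical stand-in for ix86_cost->addss  */

static int
vec_construct_cost (int nunits, int mode_bits)
{
  /* N element inserts into SSE regs.  */
  int cost = nunits * SSE_OP;
  /* One vinserti128 for combining two SSE vectors for AVX256.  */
  if (mode_bits == 256)
    cost += ADDSS;
  return cost;
}

int
main (void)
{
  printf ("V4SI (128-bit): %d\n", vec_construct_cost (4, 128)); /* 4  */
  printf ("V8SI (256-bit): %d\n", vec_construct_cost (8, 256)); /* 12 */
  return 0;
}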
> I am considering to make the factor we apply in ix86_vec_cost,
> which currently depends on X86_TUNE_AVX128_OPTIMAL and
> X86_TUNE_SSE_SPLIT_REGS, part of the actual cost tables, since the
> reasons we apply them are underlying CPU architecture details.
> Was the original reason for doing the multiplication based on
> those tunings to be able to "share" the same basic cost table
> across architectures that differ in this important detail?

No, just to have fewer entries in the table (since they are rather big
and painful to fill in as they are already).

> I see X86_TUNE_SSE_SPLIT_REGS is only used for m_ATHLON_K8,
> and X86_TUNE_AVX128_OPTIMAL is used for m_BDVER, m_BTVER2
> and m_ZNVER1.  Those all have (multiple) exclusive
> processor_cost_table entries.
>
> As a first step I'd like to remove the use of ix86_vec_cost for
> the entries that already have entries for multiple modes
> (loads and stores) and apply the factor there.  For example,
> Zen can do two 128-bit loads per cycle but only one 128-bit store.

That sounds like a good plan (I think I introduced the entries in the
cost table only after introducing the vec_cost thingy).

> With multiplying AVX256 costs by two we seem to cost something like
> # instructions to dispatch * instruction latency, which is an odd
> thing.  I'd have expected # instructions to dispatch / instruction
> throughput * instruction latency - so an AVX256 add would cost the
> same as an AVX128 add, likewise for loads, but stores would be more
> expensive because of the throughput issue.  This all ignores
> resource utilization across multiple insns, but that's how the cost
> model works ...

Yep, the cost model simply uses latencies because it originated at a
time when CPUs were not parallel.  I know that the LLVM backend goes
the other way and uses throughputs only.  The correct thing would be to
build the dependence DAG and guess how the CPU will schedule, but that
would be fun to implement at the GIMPLE level...  (A toy comparison of
the two formulas is sketched after the quoted patch below.)

Honza

> Thanks,
> Richard.
>
> 2018-10-11  Richard Biener  <rguent...@suse.de>
>
> 	* config/i386/i386.c (ix86_vec_cost): Remove !parallel path
> 	and argument.
> 	(ix86_builtin_vectorization_cost): For vec_construct properly
> 	cost insertion into SSE regs.
> 	(...): Adjust calls to ix86_vec_cost.
>
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c	(revision 265022)
> +++ gcc/config/i386/i386.c	(working copy)
> @@ -39846,11 +39846,10 @@ ix86_set_reg_reg_cost (machine_mode mode
>  static int
>  ix86_vec_cost (machine_mode mode, int cost, bool parallel)
>  {
> +  gcc_assert (parallel);
>    if (!VECTOR_MODE_P (mode))
>      return cost;
> -
> -  if (!parallel)
> -    return cost * GET_MODE_NUNITS (mode);
> +
>    if (GET_MODE_BITSIZE (mode) == 128
>        && TARGET_SSE_SPLIT_REGS)
>      return cost * 2;
> @@ -45190,8 +45189,9 @@ ix86_builtin_vectorization_cost (enum ve
>
>        case vec_construct:
>  	{
> -	  /* N element inserts.  */
> -	  int cost = ix86_vec_cost (mode, ix86_cost->sse_op, false);
> +	  gcc_assert (VECTOR_MODE_P (mode));
> +	  /* N element inserts into SSE vectors.  */
> +	  int cost = GET_MODE_NUNITS (mode) * ix86_cost->sse_op;
>  	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
>  	  if (GET_MODE_BITSIZE (mode) == 256)
>  	    cost += ix86_vec_cost (mode, ix86_cost->addss, true);
>
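P.S. To make the two formulas above concrete, here is a toy comparison
(all numbers made up) for a Zen-like CPU that splits a 256-bit op into
two 128-bit uops and has two 128-bit add pipes but only one 128-bit
store pipe.  The throughput-aware model below charges one latency per
issue group, which is only one way the formula could be read:

#include <stdio.h>

struct op { const char *name; int uops; int pipes; int latency; };

/* Current model: every dispatched uop pays the full latency.  */
static int
cost_latency (struct op o)
{
  return o.uops * o.latency;
}

/* #insns / throughput * latency: uops that can issue in parallel
   share one latency; pipe-limited uops pay it again.  */
static int
cost_throughput (struct op o)
{
  return (o.uops + o.pipes - 1) / o.pipes * o.latency;
}

int
main (void)
{
  struct op ops[] = {
    { "128-bit add  ", 1, 2, 3 },
    { "256-bit add  ", 2, 2, 3 },
    { "128-bit store", 1, 1, 1 },
    { "256-bit store", 2, 1, 1 },
  };
  for (unsigned i = 0; i < sizeof ops / sizeof ops[0]; i++)
    printf ("%s  latency model: %d  throughput model: %d\n",
	    ops[i].name, cost_latency (ops[i]), cost_throughput (ops[i]));
  return 0;
}

With these numbers the 256-bit add costs the same as the 128-bit add
under the throughput-aware model (the two halves issue in parallel),
while the 256-bit store costs twice the 128-bit one because the single
store pipe serializes the halves - exactly the asymmetry described
above.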