https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 15 Dec 2017, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
>
> --- Comment #14 from sergey.shalnov at intel dot com ---
> " we have a basic-block vectorizer. Do you propose to remove it? "
> Definitely not! SLP vectorizer is very good to have!
>
> “What's the rationale for not using vector registers”
> I just tried " -fno-tree-slp-vectorize" option and found the performance gain
> for different -march= options.
>
> I see some misunderstanding here. Let me clarify the original question with
> -march=znver1.
> I use " -Ofast -mfpmath=sse -funroll-loops -march=znver1" options set for
> experiments.
>
> For the basic block we are discussing we have (in vect_analyze_slp_cost() in
> tree-vect-slp.c:1897):
>
> tmp[i_220][0] = _150;
> tmp[i_220][2] = _147;
> tmp[i_220][1] = _144;
> tmp[i_220][3] = _141;
>
> tmp[i_139][0] = _447;
> tmp[i_139][2] = _450;
> tmp[i_139][1] = _453;
> tmp[i_139][3] = _456;
>
> tmp[i_458][0] = _54;
> tmp[i_458][2] = _56;
> tmp[i_458][1] = _58;
> tmp[i_458][3] = _60;
>
> this is si->stmt printed in the loop with "vect_prologue" calculation.
>
> I see SLP statistic related to this BB:
> note: Cost model analysis:.
> Vector inside of basic block cost: 64
> Vector prologue cost: 32
> Vector epilogue cost: 0
> Scalar cost of basic block: 256
> note: Basic block will be vectorized using SLP

So it looks like both vector and scalar stores are costed 64 by the target
and the vector construction is costed 32, this is likely because

  COSTS_N_INSNS (1),    /* cost of cheap SSE instruction. */

and vector construction of 4 elements is thus cost 4 while

  {8, 8, 8},            /* cost of storing integer registers. */
...
  {8, 8, 8, 8, 16},     /* cost of storing SSE registers
                           in 32,64,128,256 and 512-bit. */

That cheap SSE instruction cost is esp. odd when looking at

  2, 3, 6,              /* cost of moving XMM,YMM,ZMM register. */

or

  COSTS_N_INSNS (3),    /* cost of ADDSS/SD SUBSS/SD insns. */

so if a movq %xmm0, %xmm1 costs 2 I can't imagine anything being cheaper
than that. Honza?

...

> Please correct me if I wrong but I think we have to have count=3 in
> prologue_cost_vec.

No, it's one vec_construct operation - it's the task of the target to
turn this into a cost comparable to vector_store and scalar_store in
this case.

> And this could slightly change costs for "Vector prologue cost" and might have
> an influence to vectorizer decision.
>
> Sergey
> PS
> Richard,
> I didn't catch your idea in " but DOM isn't powerful enough " sentence.
> Could you please slightly clarify it?

The BB vectorization prevents eliding 'tmp' from memory to registers,
the CSE pass after vectorization would be responsible for this but
the memory references look too "complicated" to the respective
implementation used.