https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87743
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-10-25
                 CC|                            |rguenth at gcc dot gnu.org
             Blocks|                            |53947
     Ever confirmed|0                           |1

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  It's a cost-model issue.  With GCC 7 the vectorization with AVX256
was not profitable so AVX128 was chosen:

t.c:12:1: note: Final SLP tree for instance:
t.c:12:1: note: node
t.c:12:1: note:         stmt 0 dst[0] = _11;
t.c:12:1: note:         stmt 1 dst[1] = _17;
t.c:12:1: note:         stmt 2 dst[2] = _23;
t.c:12:1: note:         stmt 3 dst[3] = _29;
t.c:12:1: note: node (external)
t.c:12:1: note:         stmt 0 _11 = (long long int) _10;
t.c:12:1: note:         stmt 1 _17 = (long long int) _16;
t.c:12:1: note:         stmt 2 _23 = (long long int) _22;
t.c:12:1: note:         stmt 3 _29 = (long long int) _28;
t.c:12:1: note: Cost model analysis:
  Vector inside of basic block cost: 2
  Vector prologue cost: 3
  Vector epilogue cost: 0
  Scalar cost of basic block: 4
t.c:12:1: note: not vectorized: vectorization is not profitable.
t.c:12:1: note: ***** Re-trying analysis with vector size 16

but with GCC 8 we now say

t.c:12:1: note: Cost model analysis:
  Vector inside of basic block cost: 20
  Vector prologue cost: 28
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.c:12:1: note: Basic block will be vectorized using SLP
t.c:12:1: note: SLPing BB part

costs on trunk are the same (the above is for generic, for haswell the
vector cost is even lower, 12).  So we end up with

  <bb 2> [local count: 214748369]:
  _10 = src[0];
  _11 = (long long int) _10;
  _16 = src[1];
  _17 = (long long int) _16;
  _22 = src[2];
  _23 = (long long int) _22;
  _28 = src[3];
  _29 = (long long int) _28;
  _13 = {_11, _17, _23, _29};
  vect_cst__19 = _13;
  MEM[(long long int *)&dst] = vect_cst__19;

note this just costs the vector construction + vector store against the
four scalar stores.
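
For readers following the numbers, here is a rough sketch of the comparison the two "Cost model analysis" dumps above appear to feed into.  The struct and function names are hypothetical and are not GCC internals.  The tie-breaking rule (vectorize unless the summed vector cost exceeds the scalar cost) is an assumption, chosen because it is consistent with both dumps: 2 + 3 + 0 = 5 against 4 is rejected, while 20 + 28 + 0 = 48 against 48 is accepted.

```c
#include <stdbool.h>

/* Hypothetical container for the four numbers printed by the
   "Cost model analysis" dump; not a GCC data structure.  */
struct bb_cost
{
  int vec_inside;   /* Vector inside of basic block cost */
  int vec_prologue; /* Vector prologue cost */
  int vec_epilogue; /* Vector epilogue cost */
  int scalar;       /* Scalar cost of basic block */
};

/* Sum of all vector-side costs for the basic block.  */
int
bb_vector_total (struct bb_cost c)
{
  return c.vec_inside + c.vec_prologue + c.vec_epilogue;
}

/* Assumed decision rule: reject only when the vector total is strictly
   greater than the scalar cost, so equal costs favor vectorization.
   This is inferred from the dumps, not taken from the GCC sources.  */
bool
bb_slp_profitable (struct bb_cost c)
{
  return bb_vector_total (c) <= c.scalar;
}
```

With the GCC 7 numbers (2, 3, 0, 4) the vector total is 5 and the check fails, matching "not vectorized: vectorization is not profitable".  With the GCC 8 generic numbers (20, 28, 0, 48) the total equals the scalar cost and the check passes, matching "Basic block will be vectorized using SLP".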

Note with my patches to consider both vector sizes this wouldn't be handled
either since I didn't update them to work for BB vectorization (and they
are not on trunk yet anyways).  It would be an apples to oranges comparison
anyways since the scalar cost differs (the SLP tree is different for AVX128).
Anyways, costing for AVX128 is

t.c:12:1: note: Cost model analysis:
  Vector inside of basic block cost: 44
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 96

(haswell).  So if you scale the vector cost by 0.5 because the scalar cost
is doubled you end up at 22 which would compare favorably to 12 + 28 == 40.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations