https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125931
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
twotff_ shows the largest difference on Zen4, where I reproduced it. We get new
vectorization here:
+mccas.fppized.f:3259:21: optimized: loop vectorized using 16 byte vectors and
unroll factor 2
+mccas.fppized.f:3259:21: optimized: loop versioned for vectorization because
of possible aliasing
+mccas.fppized.f:3304:21: optimized: loop vectorized using 16 byte vectors and
unroll factor 2
+mccas.fppized.f:3304:21: optimized: loop versioned for vectorization because
of possible aliasing
That's the old issue with the triangular nested loop, which we've never really
fixed but only accidentally handled with costing overrides:
      DO 30 MK=1,NOC
      DO 30 ML=1,MK
        MKL = MKL+1
        XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *       VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
        XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *       VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30 CONTINUE
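As a rough illustration of why the triangular shape is awkward (a sketch under
assumptions, not an analysis of the benchmark): the inner ML loop runs exactly
MK iterations, so for small MK it stays below any fixed profitability
threshold. The value NOC = 10 below is made up; the threshold 6 is the
"Calculated minimum iters for profitability" from the dump for the vectorized
case, and whether that is the number checked at run time is an assumption here.

```python
# Hypothetical illustration: noc is an assumed value, not taken from the benchmark.

def inner_trip_counts(noc):
    """Trip count of the ML loop for each MK in DO MK=1,NOC / DO ML=1,MK."""
    return list(range(1, noc + 1))


def fraction_below(noc, min_iters):
    """Fraction of inner-loop invocations with fewer than min_iters iterations."""
    counts = inner_trip_counts(noc)
    short = sum(1 for c in counts if c < min_iters)
    return short / len(counts)


if __name__ == "__main__":
    noc = 10        # assumed problem size
    min_iters = 6   # from the dump: minimum iters for profitability
    print(inner_trip_counts(noc))
    print(fraction_below(noc, min_iters))
```

With these assumed numbers, half of the inner-loop invocations are shorter than
the threshold, and each of them still goes through whatever checks guard the
vector version.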
Bad:
(*xpqkl_262(D))[_177] 1 times scalar_load costs 12 in body
(*xpqkl_262(D))[_177] 1 times scalar_load costs 12 in body
(*xpqkl_262(D))[_177] 1 times vec_construct costs 12 in body
(*co_263(D))[_183] 1 times scalar_load costs 12 in body
(*co_263(D))[_183] 1 times scalar_load costs 12 in body
(*co_263(D))[_183] 1 times vec_construct costs 12 in body
_181 * _184 1 times vector_stmt costs 12 in body
node 0x295e00c0 1 times scalar_to_vec costs 4 in prologue
(*co_263(D))[_186] 1 times scalar_load costs 12 in body
(*co_263(D))[_186] 1 times scalar_load costs 12 in body
(*co_263(D))[_186] 1 times vec_construct costs 12 in body
_187 * _189 1 times vector_stmt costs 12 in body
node 0x295dfe00 1 times scalar_to_vec costs 4 in prologue
_185 + _190 1 times vector_stmt costs 12 in body
_191 * val3_207 1 times vector_stmt costs 12 in body
node 0x295e0170 1 times scalar_to_vec costs 4 in prologue
_178 + _192 1 times vector_stmt costs 12 in body
_193 2 times scalar_store costs 32 in body
_193 1 times vec_deconstruct costs 12 in body
mccas.fppized.f:3304:21: note: Cost model analysis:
Vector inside of loop cost: 424
Vector prologue cost: 48
Vector epilogue cost: 240
Scalar iteration cost: 224
Scalar outside cost: 8
Vector outside cost: 288
prologue iterations: 0
epilogue iterations: 1
Calculated minimum iters for profitability: 6
Good:
(*xpqkl_262(D))[_177] 1 times scalar_load costs 12 in body
(*xpqkl_262(D))[_177] 1 times scalar_load costs 12 in body
(*xpqkl_262(D))[_177] 1 times vec_construct costs 12 in body
(*co_263(D))[_183] 1 times scalar_load costs 12 in body
(*co_263(D))[_183] 1 times scalar_load costs 12 in body
(*co_263(D))[_183] 1 times vec_construct costs 12 in body
_181 * _184 1 times vector_stmt costs 12 in body
node 0x15909490 1 times scalar_to_vec costs 4 in prologue
(*co_263(D))[_186] 1 times scalar_load costs 12 in body
(*co_263(D))[_186] 1 times scalar_load costs 12 in body
(*co_263(D))[_186] 1 times vec_construct costs 12 in body
_187 * _189 1 times vector_stmt costs 12 in body
node 0x15909750 1 times scalar_to_vec costs 4 in prologue
_185 + _190 1 times vector_stmt costs 12 in body
_191 * val3_207 1 times vector_stmt costs 12 in body
node 0x159091d0 1 times scalar_to_vec costs 4 in prologue
_178 + _192 1 times vector_stmt costs 12 in body
_193 2 times scalar_store costs 32 in body
_193 2 times vec_to_scalar costs 24 in body
mccas.fppized.f:3304:21: note: Cost model analysis:
Vector inside of loop cost: 448
Vector prologue cost: 48
Vector epilogue cost: 240
Scalar iteration cost: 224
Scalar outside cost: 8
Vector outside cost: 288
prologue iterations: 0
epilogue iterations: 1
mccas.fppized.f:3304:21: missed: cost model: the vector iteration cost = 448
divided by the scalar iteration cost = 224 is greater or equal to the
vectorization factor = 2.
mccas.fppized.f:3304:21: missed: not vectorized: vectorization not profitable.
mccas.fppized.f:3304:21: missed: not vectorized: vector version will never be
profitable.
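
The listed statements in the two dumps differ only in the last line
(vec_deconstruct costs 12 vs. 2 times vec_to_scalar costs 24), which moves the
vector inside-of-loop cost from 424 to 448. The sketch below just redoes the
comparison quoted in the "missed" note with the numbers from the dumps. The
function name is made up, and this mirrors the wording of the note, not the
vectorizer source.

```python
def passes_ratio_check(vector_iter_cost, scalar_iter_cost, vf):
    """Return True when vector cost / scalar cost is strictly below VF.

    Hypothetical helper mirroring the dump message: the loop is rejected when
    the vector iteration cost divided by the scalar iteration cost is greater
    than or equal to the vectorization factor. Written with a multiplication
    so the comparison stays in integers.
    """
    return vector_iter_cost < scalar_iter_cost * vf


if __name__ == "__main__":
    scalar_iter_cost = 224  # "Scalar iteration cost" in both dumps
    vf = 2                  # vectorization factor from the "missed" note
    for label, vec_cost in (("Bad", 424), ("Good", 448)):
        if passes_ratio_check(vec_cost, scalar_iter_cost, vf):
            verdict = "vectorize"
        else:
            verdict = "reject"
        print(label, vec_cost, verdict)
```

So 424 stays just under 2 * 224 = 448 and the loop is vectorized, while 448
hits the bound exactly and is rejected.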