https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117875
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #17)
> -sre_math.c:174:17: optimized: loop vectorized using 16 byte vectors
>
> -sre_math.c:192:17: optimized: loop vectorized using 16 byte vectors
Those two are identical,
float **
FMX2Alloc(int rows, int cols)
{
  float **mx;
  int r;
  mx = (float **) __builtin_malloc (sizeof(float *) * rows);
  mx[0] = (float *) __builtin_malloc (sizeof(float) * rows * cols);
  for (r = 1; r < rows; r++)
    mx[r] = mx[0] + r*cols;
  return mx;
}
where the "failure" is a missed epilogue vectorization due to cost
(reproducible with Zen2 and Zen4 tuning, not with generic). With SLP
the costs are
t.c:9:17: note:  Cost model analysis:
  Vector inside of loop cost: 136
  Vector prologue cost: 86
  Vector epilogue cost: 128
  Scalar iteration cost: 56
  Scalar outside cost: 32
  Vector outside cost: 214
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 6
and with classical (non-SLP) loop vectorization
t.c:9:17: note:  Cost model analysis:
  Vector inside of loop cost: 136
  Vector prologue cost: 68
  Vector epilogue cost: 128
  Scalar iteration cost: 56
  Scalar outside cost: 32
  Vector outside cost: 196
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 5
where the difference is in
cols_21(D) * r_42 1 times vector_stmt costs 12 in body
node 0x25bf6f00 1 times scalar_to_vec costs 10 in prologue
_8 w* 4 1 times vector_stmt costs 40 in prologue
<unknown> 1 times vector_load costs 12 in prologue
vs.
cols_21(D) * r_42 1 times scalar_to_vec costs 4 in prologue
cols_21(D) * r_42 1 times vector_stmt costs 12 in body
_8 w* 4 1 times vector_stmt costs 40 in prologue
We seem to forget to cost the load of the constant 4 in non-SLP, and
with SLP we run into target-specific costing of scalar_to_vec, which
applies a GPR->XMM move penalty that we only apply for SLP. So, SLP
looks fine here.
This does not look like an important vectorization. I verified that
with Zen2 tuning and epilogue vectorization disabled, the regression
triggered by --param vect-force-slp=1 remains.