[Bug tree-optimization/117875] [15 Regression] 28% regression for 456.hmmer on Zen4 with -Ofast -march=native

rguenth at gcc dot gnu.org via Gcc-bugs Mon, 20 Jan 2025 04:37:40 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117875


--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #17)
> -sre_math.c:174:17: optimized: loop vectorized using 16 byte vectors
> 
> -sre_math.c:192:17: optimized: loop vectorized using 16 byte vectors

Those two are identical:

float **
FMX2Alloc(int rows, int cols)
{
  float **mx;
  int     r;

  mx    = (float **) __builtin_malloc (sizeof(float *) * rows);
  mx[0] = (float *)  __builtin_malloc (sizeof(float) * rows * cols);
  for (r = 1; r < rows; r++)
    mx[r] = mx[0] + r*cols;
  return mx;
}

where the "failure" is a missed epilogue vectorization due to cost
(reproducible with Zen2 and Zen4 tuning, not with generic tuning).
The SLP costs are

t.c:9:17: note:  Cost model analysis: 
  Vector inside of loop cost: 136
  Vector prologue cost: 86
  Vector epilogue cost: 128
  Scalar iteration cost: 56
  Scalar outside cost: 32
  Vector outside cost: 214
  prologue iterations: 0
  epilogue iterations: 2 
  Calculated minimum iters for profitability: 6

and the classical (non-SLP) loop vectorization costs are

t.c:9:17: note:  Cost model analysis:
  Vector inside of loop cost: 136
  Vector prologue cost: 68
  Vector epilogue cost: 128
  Scalar iteration cost: 56
  Scalar outside cost: 32
  Vector outside cost: 196
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 5

where the difference is in

cols_21(D) * r_42 1 times vector_stmt costs 12 in body
node 0x25bf6f00 1 times scalar_to_vec costs 10 in prologue
_8 w* 4 1 times vector_stmt costs 40 in prologue
<unknown> 1 times vector_load costs 12 in prologue 

vs.

cols_21(D) * r_42 1 times scalar_to_vec costs 4 in prologue
cols_21(D) * r_42 1 times vector_stmt costs 12 in body
_8 w* 4 1 times vector_stmt costs 40 in prologue

We seem to forget to cost the load of the constant 4 in non-SLP, and we
run into target-specific costing of scalar_to_vec, which applies a
GPR->XMM move penalty that we only apply for SLP.  So, SLP looks fine here.

This does not look like an important vectorization.  I verified that with
Zen2 tuning and epilogue vectorization disabled, the regression triggered by
--param vect-force-slp=1 remains.
