[Bug tree-optimization/125931] [17 Regression] 6% slowdown of gamess on Zen2 since r17-1655-g7d351b07e5b85c

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 23 Jun 2026 03:53:29 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125931


--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
twotff_ shows the largest difference on Zen4, where I reproduced it.  We get
new vectorization here:

+mccas.fppized.f:3259:21: optimized: loop vectorized using 16 byte vectors and unroll factor 2
+mccas.fppized.f:3259:21: optimized:  loop versioned for vectorization because of possible aliasing
+mccas.fppized.f:3304:21: optimized: loop vectorized using 16 byte vectors and unroll factor 2
+mccas.fppized.f:3304:21: optimized:  loop versioned for vectorization because of possible aliasing

That's the old issue with the triangular nested loop, which we've never really
fixed but only accidentally handled with costing overrides:

            DO 30 MK=1,NOC
            DO 30 ML=1,MK
               MKL = MKL+1
               XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *               VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
               XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *               VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30       CONTINUE
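
For readers less used to this loop shape, the iteration space of the nest
above can be replayed in a few lines of Python.  This is only an illustrative
sketch: the function names are hypothetical, and it assumes MKL is zero on
entry to the nest, which the excerpt does not show.

```python
def walk_triangular(noc):
    """Replay the DO 30 nest: MK = 1..noc, ML = 1..MK, MKL bumped per inner step."""
    mkl = 0  # assumption: MKL is zero before the nest (not shown in the excerpt)
    visited = []
    for mk in range(1, noc + 1):
        for ml in range(1, mk + 1):
            mkl += 1
            visited.append((mk, ml, mkl))
    return visited


def closed_form_mkl(mk, ml):
    """Packed lower-triangular index reached at (mk, ml) under the same assumption."""
    return mk * (mk - 1) // 2 + ml


steps = walk_triangular(4)
# Inner-loop trip count for each MK: it grows with the outer induction variable.
trip_counts = [sum(1 for s in steps if s[0] == mk) for mk in range(1, 5)]
```

The inner trip count equals MK, so it differs on every outer iteration and
starts out very small, while MKL walks a packed triangular index across the
whole nest.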

Bad:

(*xpqkl_262(D))[_177] 1 times scalar_load costs 12 in body
(*xpqkl_262(D))[_177] 1 times scalar_load costs 12 in body 
(*xpqkl_262(D))[_177] 1 times vec_construct costs 12 in body
(*co_263(D))[_183] 1 times scalar_load costs 12 in body
(*co_263(D))[_183] 1 times scalar_load costs 12 in body 
(*co_263(D))[_183] 1 times vec_construct costs 12 in body
_181 * _184 1 times vector_stmt costs 12 in body 
node 0x295e00c0 1 times scalar_to_vec costs 4 in prologue
(*co_263(D))[_186] 1 times scalar_load costs 12 in body
(*co_263(D))[_186] 1 times scalar_load costs 12 in body
(*co_263(D))[_186] 1 times vec_construct costs 12 in body
_187 * _189 1 times vector_stmt costs 12 in body
node 0x295dfe00 1 times scalar_to_vec costs 4 in prologue
_185 + _190 1 times vector_stmt costs 12 in body
_191 * val3_207 1 times vector_stmt costs 12 in body
node 0x295e0170 1 times scalar_to_vec costs 4 in prologue
_178 + _192 1 times vector_stmt costs 12 in body
_193 2 times scalar_store costs 32 in body
_193 1 times vec_deconstruct costs 12 in body

mccas.fppized.f:3304:21: note:  Cost model analysis:
  Vector inside of loop cost: 424
  Vector prologue cost: 48
  Vector epilogue cost: 240
  Scalar iteration cost: 224
  Scalar outside cost: 8
  Vector outside cost: 288
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 6

Good:

(*xpqkl_262(D))[_177] 1 times scalar_load costs 12 in body
(*xpqkl_262(D))[_177] 1 times scalar_load costs 12 in body
(*xpqkl_262(D))[_177] 1 times vec_construct costs 12 in body
(*co_263(D))[_183] 1 times scalar_load costs 12 in body
(*co_263(D))[_183] 1 times scalar_load costs 12 in body
(*co_263(D))[_183] 1 times vec_construct costs 12 in body
_181 * _184 1 times vector_stmt costs 12 in body
node 0x15909490 1 times scalar_to_vec costs 4 in prologue
(*co_263(D))[_186] 1 times scalar_load costs 12 in body
(*co_263(D))[_186] 1 times scalar_load costs 12 in body
(*co_263(D))[_186] 1 times vec_construct costs 12 in body
_187 * _189 1 times vector_stmt costs 12 in body
node 0x15909750 1 times scalar_to_vec costs 4 in prologue
_185 + _190 1 times vector_stmt costs 12 in body
_191 * val3_207 1 times vector_stmt costs 12 in body
node 0x159091d0 1 times scalar_to_vec costs 4 in prologue
_178 + _192 1 times vector_stmt costs 12 in body
_193 2 times scalar_store costs 32 in body
_193 2 times vec_to_scalar costs 24 in body

mccas.fppized.f:3304:21: note:  Cost model analysis:
  Vector inside of loop cost: 448
  Vector prologue cost: 48
  Vector epilogue cost: 240
  Scalar iteration cost: 224
  Scalar outside cost: 8
  Vector outside cost: 288
  prologue iterations: 0
  epilogue iterations: 1 
mccas.fppized.f:3304:21: missed:  cost model: the vector iteration cost = 448 divided by the scalar iteration cost = 224 is greater or equal to the vectorization factor = 2.
mccas.fppized.f:3304:21: missed:  not vectorized: vectorization not profitable.
mccas.fppized.f:3304:21: missed:  not vectorized: vector version will never be profitable.
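
The rejection quoted in the "missed" note can be checked against the numbers
in both dumps.  The sketch below assumes the check is exactly the comparison
the note states (vector iteration cost divided by scalar iteration cost is
greater or equal to the vectorization factor).  The function name is
hypothetical and this is not GCC's actual code.

```python
def never_profitable(vec_inside_cost, scalar_iter_cost, vf):
    """True when vec_inside_cost / scalar_iter_cost >= vf, per the 'missed' note.

    Cross-multiplied so the comparison stays in integers.
    """
    return vec_inside_cost >= scalar_iter_cost * vf


# Numbers taken from the two "Cost model analysis" dumps above.
bad_rejected = never_profitable(424, 224, 2)   # "Bad" dump: loop gets vectorized
good_rejected = never_profitable(448, 224, 2)  # "Good" dump: loop is rejected
```

With a scalar iteration cost of 224 and a vectorization factor of 2, the
cutoff sits at 448, so the "Good" dump lands exactly on it and the "Bad" dump
falls 24 below it.  In the listings, the only line that differs is the last
one: one vec_deconstruct at cost 12 versus two vec_to_scalar at cost 24.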
