https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123603

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Josef Melcr from comment #7)
> 2006 calculix with -Ofast -march=x86-64-v3 -g -flto=128 on Zen4 is also
> affected.
> 
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1119.170.0

We're vectorizing

e_c3d.f:680:34: optimized: loop vectorized using 16 byte vectors and unroll factor 2

that's

                      do i1=1,3
                        iii1=ii1+i1-1
                        do j1=1,3
                          jjj1=jj1+j1-1
                          do k1=1,3
===> this loop              do l1=1,3
                              s(iii1,jjj1)=s(iii1,jjj1)
     &                         +anisox(i1,k1,j1,l1)*w(k1,l1)*weight
                              do m1=1,3
                                s(iii1,jjj1)=s(iii1,jjj1)
     &                              +anisox(i1,k1,m1,l1)*w(k1,l1)
     &                                 *vo(j1,m1)*weight
     &                              +anisox(m1,k1,j1,l1)*w(k1,l1)
     &                                 *vo(i1,m1)*weight
                                do n1=1,3
                                  s(iii1,jjj1)=s(iii1,jjj1)
     &                                  +anisox(m1,k1,n1,l1)
     &                                  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
     &                                  *weight
                                enddo
                              enddo
                            enddo
                          enddo
                        enddo
                      enddo

In GCC 15 we've not vectorized the loop.  -mno-fma makes no difference
in runtime (but we do have FMA chains in both cases).  This is because of

t.f:15:34: note:   ==> examining statement: _26 = (*w_93(D))[_25];
t.f:15:34: missed:   single-element interleaving not supported for not adjacent vector loads, using elementwise access

vs.

t.f:15:34: note:   ==> examining statement: _26 = (*w_93(D))[_25];
t.f:15:34: missed:   single-element interleaving not supported for not adjacent vector loads
t.f:17:72: missed:   not vectorized: relevant stmt not supported: _26 = (*w_93(D))[_25];

which means we can now vectorize something we couldn't vectorize before.  If
the same bisection holds, that change is what just made this loop appear
profitable.  It's definitely a different "bug".

t.f:15:34: note:  Cost model analysis:
  Vector inside of loop cost: 1008
  Vector prologue cost: 68
  Vector epilogue cost: 752
  Scalar iteration cost: 736
  Scalar outside cost: 0
  Vector outside cost: 820
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2
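The reported minimum iteration count is consistent with the usual
profitability comparison.  A quick sanity check (a sketch, not GCC's actual
vect_estimate_min_profitable_iters computation; it assumes 16 byte vectors of
real(kind=8) give VF 2, so with the unroll factor of 2 each vector body
iteration covers 4 scalar iterations, and it ignores peeling rounding):

```python
import math

# Numbers from the cost model dump above.
vec_inside = 1008    # vector cost per vector body iteration
vec_outside = 820    # prologue (68) + epilogue (752)
scalar_iter = 736    # scalar cost per iteration

# Assumption: each vector body iteration covers VF 2 * unroll 2 = 4
# scalar iterations.
per_scalar_iter = vec_inside / 4  # 252.0

# Vectorization pays off once
#   vec_outside + per_scalar_iter * n <= scalar_iter * n
min_iters = math.ceil(vec_outside / (scalar_iter - per_scalar_iter))
print(min_iters)  # 2
```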

I also see we're not hoisting invariant vector CTORs emitted by the
vectorizer, because CONSTRUCTOR_NELTS is easily lower than LIM_EXPENSIVE
(20).  Fixing that doesn't help though; RTL invariant motion already does
this, and we spill some of the required invariants.  With zen5 tuning
we don't vectorize, the costs there prevent this.
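As an illustration of that heuristic (a hypothetical sketch of the decision
described above, not GCC source; the real check lives in tree LIM and
compares a statement cost against the lim-expensive param):

```python
# Hypothetical sketch (not GCC source): tree LIM only considers a
# statement worth hoisting further when its estimated cost reaches the
# lim-expensive threshold (default 20), and a vector CONSTRUCTOR's cost
# scales with its element count (CONSTRUCTOR_NELTS).
LIM_EXPENSIVE = 20

def worth_hoisting(constructor_nelts):
    # A two-element CTOR like {_182, _182} scores 2, far below 20.
    return constructor_nelts >= LIM_EXPENSIVE

print(worth_hoisting(2))  # False: the CTOR stays in the loop
```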

We have building blocks like

  _489 = {_182, _182};
...
  vect__188.234_743 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)_860 + -72B + ivtmp.526_1082 * 1];
  vect__188.247_769 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)_860 + 136B + ivtmp.526_1082 * 1];
  vect__188.261_796 = VEC_PERM_EXPR <vect__188.234_743, vect__188.247_769, { 0, 3 }>;
  vect__201.203_117 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)_860];
  vect__201.216_66 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)_860 + 208B];
  vect__201.230_27 = VEC_PERM_EXPR <vect__201.203_117, vect__201.216_66, { 0, 3 }>;
  _869 = .FMA (vect__201.230_27, _489, vect__188.261_796);

where the permutes are basically from-pieces construction, so we have three
from-pieces vectors fed into the .FMA, which feeds a reduction chain.  Vector
costing does not account for FMA, so it costs two scalar adds / muls against
one vector add / mul plus the permutes on the vector side.

_78 + _250 1 times vector_stmt costs 12 in body
_121 * _220 1 times vector_stmt costs 20 in body
(*anisox_92(D))[_176] 1 times vec_perm costs 4 in body
(*anisox_92(D))[_249] 2 times unaligned_load (misalign -1) costs 24 in body

_182 * _201 1 times scalar_stmt costs 20 in epilogue
_55 + _188 1 times scalar_stmt costs 12 in epilogue
(*anisox_92(D))[_176] 1 times scalar_load costs 12 in epilogue

so for this building block the vector variant wins by 20 + 12 - 4, which
is enough.  Vector construction of the invariants also costs 4, so that's
at least sensibly the same as the permute.  What likely makes the difference
is tying previously independent chains into two-element vectors, giving the
CPU less out-of-order freedom.  That is something we do not model at all.
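Spelling out that arithmetic with the per-statement costs from the dump (a
sketch; it assumes the two unaligned vector loads at 24 cancel against the
two scalar loads at 2 * 12 and so drop out of the comparison):

```python
# Per-statement costs taken from the cost model dump above.
vec_add, vec_mul, vec_perm = 12, 20, 4
scalar_add, scalar_mul = 12, 20

# One two-lane vector building block replaces two scalar mul+add
# chains; the extra vector-side work is the from-pieces permute.
vector_cost = vec_add + vec_mul + vec_perm   # 36
scalar_cost = 2 * (scalar_add + scalar_mul)  # 64
print(scalar_cost - vector_cost)  # 28, i.e. 20 + 12 - 4
```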
