https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123603
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Josef Melcr from comment #7)
> 2006 calculix with -Ofast -march=x86-64-v3 -g -flto=128 on Zen4 is also
> affected.
>
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1119.170.0

We're vectorizing

e_c3d.f:680:34: optimized: loop vectorized using 16 byte vectors and unroll factor 2

that's

          do i1=1,3
            iii1=ii1+i1-1
            do j1=1,3
              jjj1=jj1+j1-1
              do k1=1,3
===> this loop
                do l1=1,3
                  s(iii1,jjj1)=s(iii1,jjj1)
     &                 +anisox(i1,k1,j1,l1)*w(k1,l1)*weight
                  do m1=1,3
                    s(iii1,jjj1)=s(iii1,jjj1)
     &                   +anisox(i1,k1,m1,l1)*w(k1,l1)
     &                   *vo(j1,m1)*weight
     &                   +anisox(m1,k1,j1,l1)*w(k1,l1)
     &                   *vo(i1,m1)*weight
                    do n1=1,3
                      s(iii1,jjj1)=s(iii1,jjj1)
     &                     +anisox(m1,k1,n1,l1)
     &                     *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
     &                     *weight
                    enddo
                  enddo
                enddo
              enddo
            enddo
          enddo

In GCC 15 we've not vectorized the loop.  -mno-fma makes no difference in
runtime (but we do have FMA chains in both cases).

This is because of

t.f:15:34: note:  ==> examining statement: _26 = (*w_93(D))[_25];
t.f:15:34: missed:   single-element interleaving not supported for not adjacent vector loads, using elementwise access

vs.

t.f:15:34: note:  ==> examining statement: _26 = (*w_93(D))[_25];
t.f:15:34: missed:   single-element interleaving not supported for not adjacent vector loads
t.f:17:72: missed:   not vectorized: relevant stmt not supported: _26 = (*w_93(D))[_25];

which means we can now vectorize something we couldn't before.  If the same
bisection holds, that is what just made it profitable.  It's definitely a
different "bug".

t.f:15:34: note:  Cost model analysis:
  Vector inside of loop cost: 1008
  Vector prologue cost: 68
  Vector epilogue cost: 752
  Scalar iteration cost: 736
  Scalar outside cost: 0
  Vector outside cost: 820
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2

I also see we're not hoisting invariant vector CTORs emitted by the
vectorizer, because CONSTRUCTOR_NELTS is easily lower than LIM_EXPENSIVE (20).
Fixing that doesn't help though; RTL invariant motion does this already and
we spill some of the required invariants.

With zen5 tuning we don't vectorize; the costs there prevent this.

We have building blocks like

  _489 = {_182, _182};
...
  vect__188.234_743 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)_860 + -72B + ivtmp.526_1082 * 1];
  vect__188.247_769 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)_860 + 136B + ivtmp.526_1082 * 1];
  vect__188.261_796 = VEC_PERM_EXPR <vect__188.234_743, vect__188.247_769, { 0, 3 }>;
  vect__201.203_117 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)_860];
  vect__201.216_66 = MEM <vector(2) real(kind=8)> [(real(kind=8) *)_860 + 208B];
  vect__201.230_27 = VEC_PERM_EXPR <vect__201.203_117, vect__201.216_66, { 0, 3 }>;
  _869 = .FMA (vect__201.230_27, _489, vect__188.261_796);

where the permutes are basically from-pieces construction, so we have three
from-pieces vectors fed into .FMA which feeds a reduction chain.  Vector
costing does not have FMA, so it costs two scalar adds/muls against one
vector add/mul plus the permutes on the vector side.

_78 + _250 1 times vector_stmt costs 12 in body
_121 * _220 1 times vector_stmt costs 20 in body
(*anisox_92(D))[_176] 1 times vec_perm costs 4 in body
(*anisox_92(D))[_249] 2 times unaligned_load (misalign -1) costs 24 in body
_182 * _201 1 times scalar_stmt costs 20 in epilogue
_55 + _188 1 times scalar_stmt costs 12 in epilogue
(*anisox_92(D))[_176] 1 times scalar_load costs 12 in epilogue

so for this building block the vector variant wins by 20 + 12 - 4, which is
enough.
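
Spelling that comparison out from the cost lines above (my reading, taking
the epilogue scalar costs as the per-lane scalar costs and one vector
iteration as standing in for two scalar lanes):

  scalar, two lanes: 2*20 (mul) + 2*12 (add) + 2*12 (load)          = 88
  vector:              20 (mul) +   12 (add) + 24 (loads) + 4 (perm) = 60

The loads cancel, so the vector side comes out 20 + 12 - 4 = 28 cheaper per
such building block.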
Vector construction of the invariants also costs 4, so that is essentially
the same cost as the permute.  What likely makes the difference is tying
previously independent chains into two-element vectors, giving less OOO
freedom to the CPU, something we do not model at all.
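
A minimal sketch of that last point (hand-written C with GCC vector
extensions, not the calculix code or the vectorizer's actual output): as
scalars the two accumulators are independent loop-carried chains the OOO
core can overlap; packed into a two-lane vector they become one serial
chain, which the cost comparison above never sees.

typedef double v2df __attribute__ ((vector_size (16)));

/* Two scalar accumulators: two independent loop-carried chains.
   Assumes n is even.  */
double dot_scalar (const double *a, const double *b, int n)
{
  double s0 = 0.0, s1 = 0.0;
  for (int i = 0; i < n; i += 2)
    {
      s0 += a[i] * b[i];
      s1 += a[i + 1] * b[i + 1];
    }
  return s0 + s1;
}

/* Same work with the lanes packed into one vector: the from-pieces
   constructors feed a single loop-carried chain through vs.  */
double dot_vector (const double *a, const double *b, int n)
{
  v2df vs = { 0.0, 0.0 };
  for (int i = 0; i < n; i += 2)
    {
      v2df va = { a[i], a[i + 1] };
      v2df vb = { b[i], b[i + 1] };
      vs += va * vb;
    }
  return vs[0] + vs[1];
}

Both variants do the same arithmetic; they differ only in how many
independent chains the scheduler sees, which is exactly the part the cost
model does not account for.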
