https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117874
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that m_mat_na.c and m_mat_nn.c are completely unrolled instead and not
vectorized by GCC 14 (nor by trunk), yet they are still slower as reported
(mult_su3_na/nn). Trunk seems to unroll less:
m_mat_nn.c:73:30: optimized: loop with 3 iterations completely unrolled (header execution count 268435456)
on trunk vs.
m_mat_nn.c:73:30: optimized: loop with 3 iterations completely unrolled (header execution count 268435456)
m_mat_nn.c:73:14: optimized: loop with 2 iterations completely unrolled (header execution count 89478486)
on branch. In particular, cunroll on GIMPLE does not unroll the outer loop on
trunk:
Loop 1 iterates 2 times.
Loop 1 iterates at most 2 times.
Loop 1 likely iterates at most 2 times.
size: 104-4, last_iteration: 104-4
Loop size: 104
Estimated size after unrolling: 300
Not unrolling loop 1: number of insns in the unrolled sequence reaches --param max-completely-peeled-insns limit.
Not peeling: upper bound is known so can unroll completely
vs branch:
Loop 1 iterates 2 times.
Loop 1 iterates at most 2 times.
Loop 1 likely iterates at most 2 times.
size: 104-4, last_iteration: 104-4
Loop size: 104
Estimated size after unrolling: 200
That's the r15-919-gef27b91b62c3aa change, I think. The heuristic, while
careful, doesn't accurately track what's "innermost" in this case, though
it's still correct that the body isn't simplified by 1/3: in this case
cunroll has 306 stmts while optimized has 276 (FMA disabled), so the
reduction is purely from CSE.
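To make the two estimates above concrete, here is a small C sketch of the
arithmetic as I read it from the dumps. It is a simplified model, not GCC's
actual code: the function name estimate_unrolled_size and its parameters are
hypothetical, and I'm assuming the 2/3 scaling is the only difference between
branch and trunk for this loop.

```c
/* Simplified model of the "Estimated size after unrolling" value in the
   cunroll dump.  Hypothetical helper, not GCC's implementation.

   overall / eliminated:            the "size: 104-4" pair
   last_overall / last_eliminated:  the "last_iteration: 104-4" pair
   nunroll:                         number of full copies (2 here)
   assume_simplification:           nonzero to assume that 1/3 of the
                                    unrolled body gets optimized away  */
int
estimate_unrolled_size (int overall, int eliminated,
                        int last_overall, int last_eliminated,
                        int nunroll, int assume_simplification)
{
  /* nunroll full copies of the body plus one copy of the last iteration.  */
  int insns = (overall - eliminated) * nunroll
              + (last_overall - last_eliminated);

  /* Optional guess that later passes shrink the result to 2/3.  */
  if (assume_simplification)
    insns = insns * 2 / 3;

  return insns > 0 ? insns : 1;
}

/* Nonzero when the estimate exceeds the given limit (standing in for
   --param max-completely-peeled-insns).  */
int
exceeds_limit (int estimated_insns, int limit)
{
  return estimated_insns > limit;
}
```

With the dump values (104-4, 104-4, two copies) this gives 200 when the 2/3
scaling is applied and 300 when it is not. Assuming a limit of 200, which I
believe is the default for --param max-completely-peeled-insns, only the
unscaled estimate is over the limit, matching the "Not unrolling loop 1"
message on trunk.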
Testcase:
typedef struct {
    double real;
    double imag;
} complex;
typedef struct { complex e[3][3]; } su3_matrix;
void mult_su3_nn( su3_matrix *a, su3_matrix *b, su3_matrix *c )
{
    int i,j;
    double t,ar,ai,br,bi,cr,ci;
    for(i=0;i<3;i++)for(j=0;j<3;j++){
        ar=a->e[i][0].real; ai=a->e[i][0].imag;
        br=b->e[0][j].real; bi=b->e[0][j].imag;
        cr=ar*br; t=ai*bi; cr -= t;
        ci=ar*bi; t=ai*br; ci += t;
        ar=a->e[i][1].real; ai=a->e[i][1].imag;
        br=b->e[1][j].real; bi=b->e[1][j].imag;
        t=ar*br; cr += t; t=ai*bi; cr -= t;
        t=ar*bi; ci += t; t=ai*br; ci += t;
        ar=a->e[i][2].real; ai=a->e[i][2].imag;
        br=b->e[2][j].real; bi=b->e[2][j].imag;
        t=ar*br; cr += t; t=ai*bi; cr -= t;
        t=ar*bi; ci += t; t=ai*br; ci += t;
        c->e[i][j].real=cr;
        c->e[i][j].imag=ci;
    }
}
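
For readers less familiar with the kernel: the testcase computes c = a * b for
3x3 complex matrices, with the innermost 3-term dot product already written
out by hand, so only the i and j loops are left for cunroll. A rolled
reference form, useful for sanity-checking results, might look like the sketch
below. The names complex_t, su3_matrix_t and mult_su3_nn_ref are hypothetical
(renamed so they don't clash with the testcase), and the floating-point
operation order differs from the testcase, so results may differ in the last
bits for general inputs.

```c
/* Hypothetical type names mirroring the testcase's complex / su3_matrix.  */
typedef struct { double real, imag; } complex_t;
typedef struct { complex_t e[3][3]; } su3_matrix_t;

/* Rolled reference: c = a * b for 3x3 complex matrices.
   Not the testcase itself; the summation order is different.  */
void
mult_su3_nn_ref (const su3_matrix_t *a, const su3_matrix_t *b,
                 su3_matrix_t *c)
{
  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      {
        double cr = 0.0, ci = 0.0;
        /* The k loop is what the testcase has unrolled by hand.  */
        for (int k = 0; k < 3; k++)
          {
            double ar = a->e[i][k].real, ai = a->e[i][k].imag;
            double br = b->e[k][j].real, bi = b->e[k][j].imag;
            cr += ar * br - ai * bi;   /* real part of a[i][k] * b[k][j] */
            ci += ar * bi + ai * br;   /* imaginary part */
          }
        c->e[i][j].real = cr;
        c->e[i][j].imag = ci;
      }
}
```

Note that c must not alias a or b here, since each output element is written
while the inputs are still being read.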