https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
On trunk I see with -fno-split-paths:
.L5:
        movq    (%r14,%rdx,8), %rcx
        vmovsd  (%rcx,%rbx), %xmm0
        vandpd  %xmm3, %xmm0, %xmm0
        vucomisd        %xmm1, %xmm0
        jbe     .L4
        vmovapd %xmm0, %xmm1
        movl    %edx, %r9d
.L4:
        addq    $1, %rdx
        cmpq    %rdi, %rdx
        jne     .L5
so a jump vs. the max/cmov. I wonder how this subloop can account for a 10%
performance difference... the main part should be the nest
  for (ii=j+1; ii<M; ii++)
    {
      double *Aii = A[ii];
      double *Aj = A[j];
      double AiiJ = Aii[j];
      int jj;
      for (jj=j+1; jj<N; jj++)
        Aii[jj] -= AiiJ * Aj[jj];
    }
but I never profiled LU...
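
For reference, a rough C sketch of what the .L5 loop above appears to compute, read back from the assembly: load the row pointer, load the column-j element, take its absolute value (the vandpd mask), and update the running maximum and its index when the new value is larger. The function names, the variable names (t, jp, ab) and the initial values are assumptions made for illustration, not taken from the LU source. The first form matches the jump in the listing; the second writes the same updates as selects, which a compiler may (but need not) turn into max/cmov.

```c
#include <math.h>

/* Branchy form: the update is skipped by a conditional jump,
   like the jbe .L4 in the listing. */
int pivot_branchy(double **A, int M, int j)
{
    int jp = j;                 /* index of the largest |A[i][j]| so far */
    double t = fabs(A[j][j]);   /* running maximum (assumed start value) */
    for (int i = j + 1; i < M; i++) {
        double ab = fabs(A[i][j]);
        if (ab > t) {
            t = ab;
            jp = i;
        }
    }
    return jp;
}

/* Select form: same result, written so that both updates are
   unconditional selects (candidates for max/cmov). */
int pivot_select(double **A, int M, int j)
{
    int jp = j;
    double t = fabs(A[j][j]);
    for (int i = j + 1; i < M; i++) {
        double ab = fabs(A[i][j]);
        jp = (ab > t) ? i : jp;  /* index select uses the old t */
        t = (ab > t) ? ab : t;   /* max(ab, t) */
    }
    return jp;
}
```

Both forms return the same index; which one the compiler emits, and which is faster, depends on how predictable the comparison is for the input data.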