https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118125
--- Comment #8 from Martin Jambor <jamborm at gcc dot gnu.org> ---
I guess I should have started by looking at the annotated assembly. The
hot loop in the hot functions changes from:
53 : ,-> 5534e0: lea (%r11,%rax,1),%rsi
659 : | 5534e4: mov (%rsi),%edi
485 : | 5534e6: mov 0x4(%rsi),%r8d
2173 : | 5534ea: vmovsd (%rcx,%rdi,8),%xmm0
1032 : | 5534ef: mov 0x8(%rsi),%edi
1673 : | 5534f2: mov 0xc(%rsi),%esi
550 : | 5534f5: vmovhpd (%rcx,%r8,8),%xmm0,%xmm0
24357 : | 5534fb: vmovsd (%rcx,%rdi,8),%xmm2
900 : | 553500: vmovhpd (%rcx,%rsi,8),%xmm2,%xmm2
: | s += *val_ptr++ * src(*colnum_ptr++);
2198 : | 553505: vmulpd 0x10(%rdx,%rax,2),%xmm2,%xmm2
10806 : | 55350b: vfmadd132pd (%rdx,%rax,2),%xmm2,%xmm0
19463 : | 553511: add $0x10,%rax
158 : | 553515: vaddpd %xmm0,%xmm1,%xmm1
: | while (val_ptr != val_end_of_row)
65079 : | 553519: cmp -0x538(%rbp),%rax
689 : '-- 553520: jne 5534e0
to:
7 : ,-> 5535a0: lea (%rdi,%r10,1),%rdx
: | return val[i];
408 : | 5535a4: mov %rdx,-0x500(%rbp)
2231 : | 5535ab: mov (%rdx),%edx
420 : | 5535ad: mov %rdx,-0x538(%rbp)
59 : | 5535b4: mov -0x500(%rbp),%rdx
2214 : | 5535bb: mov 0x4(%rdx),%edx
658 : | 5535be: mov %rdx,-0x540(%rbp)
21572 : | 5535c5: mov -0x538(%rbp),%rdx
1916 : | 5535cc: vmovsd (%r9,%rdx,8),%xmm0
987 : | 5535d2: mov -0x540(%rbp),%rdx
2341 : | 5535d9: vmovhpd (%r9,%rdx,8),%xmm0,%xmm0
9349 : | 5535df: mov -0x500(%rbp),%rdx
117 : | 5535e6: mov 0x8(%rdx),%edx
1162 : | 5535e9: mov %rdx,-0x538(%rbp)
581 : | 5535f0: mov -0x500(%rbp),%rdx
18660 : | 5535f7: mov 0xc(%rdx),%edx
1778 : | 5535fa: mov %rdx,-0x500(%rbp)
271 : | 553601: mov -0x538(%rbp),%rdx
2605 : | 553608: vmovsd (%r9,%rdx,8),%xmm2
4943 : | 55360e: mov -0x500(%rbp),%rdx
1206 : | 553615: vmovhpd (%r9,%rdx,8),%xmm2,%xmm2
: | s += *val_ptr++ * src(*colnum_ptr++);
11703 : | 55361b: vmulpd 0x10(%rax,%r10,2),%xmm2,%xmm2
56077 : | 553622: vfmadd132pd (%rax,%r10,2),%xmm2,%xmm0
47327 : | 553628: add $0x10,%r10
871 : | 55362c: vaddpd %xmm0,%xmm1,%xmm1
: | while (val_ptr != val_end_of_row)
66067 : | 553630: cmp %r11,%r10
1762 : `-- 553633: jne 5535a0
So it looks like a register allocation/spilling issue.
The gimple IL of the loop is the same in both cases, but the "local
count" of the BB with the loop body (in the optimized dump) is
3540039452134 in the fast version and only 832066009199 in the slow
one (so down ~77%).
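
For readers without the benchmark source at hand, here is a minimal sketch of
the loop shape that the two annotated source lines above come from. Only the
two statements `s += *val_ptr++ * src(*colnum_ptr++);` and
`while (val_ptr != val_end_of_row)` are taken from the annotation. The names
`Vec` and `row_dot`, the use of std::vector, and the 32-bit unsigned column
index type are assumptions made for illustration (the index width is inferred
from the 4-byte loads at offsets 0x0, 0x4, 0x8 and 0xc in the assembly).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the source vector; operator() mirrors the
// src(...) call that appears in the annotated source line.
struct Vec {
    std::vector<double> data;
    double operator()(unsigned int i) const { return data[i]; }
};

// Dot product of one sparse matrix row with src: val holds the non-zero
// entries of the row and colnum holds their column indices.
double row_dot(const std::vector<double>& val,
               const std::vector<unsigned int>& colnum,
               const Vec& src)
{
    assert(val.size() == colnum.size());

    const double* val_ptr = val.data();
    const double* const val_end_of_row = val_ptr + val.size();
    const unsigned int* colnum_ptr = colnum.data();

    double s = 0.0;
    // Indirect (gather-like) load of src through the column index,
    // multiplied by the matrix entry and accumulated.
    while (val_ptr != val_end_of_row)
        s += *val_ptr++ * src(*colnum_ptr++);
    return s;
}
```

In both assembly versions this loop is vectorized to handle four entries per
iteration: four indices are loaded, the four src values are assembled with
vmovsd/vmovhpd pairs, and the products are accumulated into %xmm1. The slow
version additionally stores each loaded index and the index pointer to stack
slots (-0x500, -0x538 and -0x540 off %rbp) and reloads them through %rdx.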