When I compile this test case with -O2 for x86_64:

extern void g (void);
float
f (float sum, float mult, int *pi)
{
  int i, j;

  for (i = 0; i < 10; ++i)
    {
      g ();
      for (j = 0; j < 1000; ++j)
        sum += *pi++ * mult;
    }
  return sum;
}

I get this result:

f:
.LFB2:
        pushq   %rbp
.LCFI0:
        movaps  %xmm0, %xmm2
        xorl    %ebp, %ebp
        pushq   %rbx
.LCFI1:
        movq    %rdi, %rbx
        subq    $40, %rsp
.LCFI2:
        movss   %xmm1, 28(%rsp)
.L2:
        movss   %xmm2, (%rsp)
        call    g
        cvtsi2ss        (%rbx), %xmm0
        leaq    4(%rbx), %rax
        movl    $1, %edx
        movss   (%rsp), %xmm2
        mulss   28(%rsp), %xmm0
        addss   %xmm0, %xmm2
        .p2align 4,,7
.L3:
        cvtsi2ss        (%rax), %xmm1
        addl    $1, %edx
        addq    $4, %rax
        cmpl    $1000, %edx
        mulss   28(%rsp), %xmm1
        addss   %xmm1, %xmm2
        jne     .L3
        addl    $1, %ebp
        addq    $4000, %rbx
        cmpl    $10, %ebp
        jne     .L2
        addq    $40, %rsp
        movaps  %xmm2, %xmm0
        popq    %rbx
        popq    %rbp
        ret

In the original code, the inner loop is performance critical.  Note that
this compiles into a mulss loading a value from memory.  It would be more
efficient to have the value in a register during the inner loop.  In fact
the value was in a register, but we stored it in the stack because it
crossed the function call, and we load it from the stack once for each
inner loop iteration rather than once for each outer loop iteration.

I don't see a simple approach to fixing this.  Some sort of live range
splitting might work.

-- 
           Summary: x86_64 poor floating point register allocation across
                    function call
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: ian at airs dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31704
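
As an illustration only, the effect that live range splitting would aim for can be pictured at the source level.  In the sketch below, mult is copied into a fresh variable after each call to g, so the copy's live range covers only the inner loop and never crosses the call.  The names f_split, m and g_calls, and the local definition of g, are hypothetical and not part of the original test case.  The compiler is free to coalesce m with mult again, so this is not claimed to change the generated code; it only shows the intent.

```c
/* Hypothetical stand-in for the external g () in the test case;
   it only counts calls so the sketch is self-contained.  */
static int g_calls;

void
g (void)
{
  ++g_calls;
}

/* Same computation as f in the test case, with the live range of
   mult split by hand (an assumption about what the allocator
   would ideally do, not what GCC does).  */
float
f_split (float sum, float mult, const int *pi)
{
  int i, j;

  for (i = 0; i < 10; ++i)
    {
      g ();
      /* Fresh copy made after the call: ideally one load per outer
         iteration, then held in a register for the inner loop.  */
      float m = mult;
      for (j = 0; j < 1000; ++j)
        sum += *pi++ * m;
    }
  return sum;
}
```

With an input of 10000 ints, f_split returns the same value as f would for the same arguments; only the intended register lifetime differs.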