https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66890
Bug ID: 66890
Summary: function splitting only works with profile feedback
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: andi-gcc at firstfloor dot org
Target Milestone: ---

Consider this simple example:

    volatile int count;

    int main()
    {
        int i;
        for (i = 0; i < 100000; i++) {
            if (i == 999)
                count *= 2;
            count++;
        }
    }

The default "EQ is unlikely" heuristic in predict.* predicts that the
if (i == 999) branch is unlikely. So the tracer moves the count *= 2 basic
block out of line to preserve instruction cache.

gcc50 -O2 -S thotcold.c

            movl    $1, %edx
            jmp     .L2
            .p2align 4,,10
            .p2align 3
    .L4:
            addl    $1, %edx
    .L2:
            cmpl    $1000, %edx
            movl    count(%rip), %eax
            je      .L6
            addl    $1, %eax
            cmpl    $100000, %edx
            movl    %eax, count(%rip)
            jne     .L4
            xorl    %eax, %eax
            ret

    # out of line code
    .L6:
            addl    %eax, %eax
            movl    %eax, count(%rip)
            movl    count(%rip), %eax
            addl    $1, %eax
            movl    %eax, count(%rip)
            jmp     .L4

Now if we enable -freorder-blocks-and-partition, I would expect the block to
also be put into .text.unlikely to give an even better cache layout. But that
is not what happens: it generates the same code.

Only when I use actual profile feedback together with
-freorder-blocks-and-partition does the code actually end up in a separate
section (it also unrolled the loop, so the code looks a bit different):

gcc -O2 -fprofile-generate -freorder-blocks-and-partition thotcold.c
./a.out
gcc -O2 -fprofile-use -freorder-blocks-and-partition thotcold.c

    ...
            .cfi_endproc
            .section        .text.unlikely
            .cfi_startproc
    .L55:
            movl    count(%rip), %ecx
            addl    $1, %eax
            addl    $1, %ecx
            cmpl    $100000, %eax
            movl    %ecx, count(%rip)
            je      .L6
            cmpl    $1, %edx
            je      .L5
            cmpl    $2, %edx
            je      .L28
            cmpl    $3, %edx

-freorder-blocks-and-partition should already use the extra section even
without profile feedback. I tested some larger programs, and without profile
feedback the unlikely section is always empty.

The heuristics in predict.* often work quite well, and a lot of code would
benefit from moving cold code out of the way of the caches. This would allow
using the option to improve frontend-bound code without needing full profile
feedback.
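
For comparison, here is a rough sketch of marking the rare path by hand instead of relying on the heuristics or on profile feedback. It is not part of the original report. The names rare_path and run_loop are hypothetical, and the sketch assumes a target (such as x86-64 ELF) where GCC places functions marked with the cold attribute into .text.unlikely. It moves a whole function, not a basic block, so it only approximates what the requested behaviour would do automatically.

```c
volatile int count;

/* Hypothetical helper: the rarely executed body of the if, outlined by hand.
   cold     - hints that the function is unlikely to be executed
   noinline - keeps the body from being inlined back into the loop */
__attribute__((cold, noinline))
static void rare_path(void)
{
    count *= 2;
}

/* Same loop as the test case, with the branch explicitly marked unlikely. */
int run_loop(void)
{
    int i;

    count = 0;
    for (i = 0; i < 100000; i++) {
        if (__builtin_expect(i == 999, 0))
            rare_path();
        count++;
    }
    return count;
}
```

Compiling this with gcc -O2 -S and checking which section directive precedes rare_path is one way to see where it landed; the exact output depends on the GCC version and target.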