https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200
--- Comment #20 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
I ran mcf under Intel SDE to get dynamic execution counts for the hot blocks.
<snip>
.L98:
jle .L97
cmpl $2, %r9d
jne .L97
.L99:
<snip>
BLOCK: 7 PC: 0000000000403252 ICOUNT: 9064729840 EXECUTIONS:
4532364920 #BYTES: 5 %: 3.08 cumltv%: 55.5 FN: primal_bea_mpp IMG:
/home/gccuser/work/GCC_Team/vekumar/ALU-tune/benchspec/CPU2006/429.mcf/build/build_base_gcc-notune.exe.0000/mcf
XDIS 0000000000403252: BASE 83FF02 cmp edi, 0x2
XDIS 0000000000403255: BASE 752A jnz 0x403281
BLOCK: 8 PC: 0000000000403250 ICOUNT: 6320686443 EXECUTIONS:
6320686443 #BYTES: 2 %: 2.15 cumltv%: 57.7 FN: primal_bea_mpp IMG:
/home/gccuser/work/GCC_Team/vekumar/ALU-tune/benchspec/CPU2006/429.mcf/build/build_base_gcc-notune.exe.0000/mcf
XDIS 0000000000403250: BASE 7E2F jle 0x403281
When I swap the compares:
.L98:
cmpl $2, %r9d
jne .L97
cmpq $0, %rdi
jle .L97
.L99:
BLOCK: 4 PC: 0000000000403250 ICOUNT: 12641372886 EXECUTIONS:
6320686443 #BYTES: 5 %: 4.33 cumltv%: 46.3 FN: primal_bea_mpp IMG:
/home/gccuser/work/GCC_Team/vekumar/ALU-tune/benchspec/CPU2006/429.mcf/build/build_base_gcc-notune.exe.0000/mcf
XDIS 0000000000403250: BASE 83FF02 cmp edi, 0x2
XDIS 0000000000403253: BASE 7542 jnz 0x403297
The block below is not visible at all in the top 300 hot blocks of mcf, which
shows that it is executed very rarely. So we are spending more cycles in the
regressing case.
cmpq $0, %rdi
jle .L97
.L99:
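As a rough cross-check of the numbers above, here is a small sketch that only redoes the arithmetic on the EXECUTIONS and ICOUNT fields from the SDE output. The variable names are mine, and the count for the cold block in the swapped order is not known (it is outside the top 300), so that side is treated as a lower bound.

```python
# Counts copied from the SDE output quoted above (EXECUTIONS / ICOUNT fields).
JLE_EXECS = 6_320_686_443         # original order: block 8, the lone jle
CMP2_EXECS_ORIG = 4_532_364_920   # original order: block 7, cmp $2 + jnz
CMP2_ICOUNT_ORIG = 9_064_729_840
CMP2_EXECS_SWAP = 6_320_686_443   # swapped order: block 4, cmp $2 + jnz
CMP2_ICOUNT_SWAP = 12_641_372_886

# ICOUNT should equal EXECUTIONS times the number of instructions in the block.
assert CMP2_ICOUNT_ORIG == 2 * CMP2_EXECS_ORIG
assert CMP2_ICOUNT_SWAP == 2 * CMP2_EXECS_SWAP

# Fraction of block 8 executions where jle falls through to the cmp $2 test.
jle_fallthrough = CMP2_EXECS_ORIG / JLE_EXECS

# Conditional branches executed in these two blocks, original order.
branches_orig = JLE_EXECS + CMP2_EXECS_ORIG

# Swapped order: the jne runs on every entry. The jle block behind it is
# outside the top 300, so its count is unknown and left out (lower bound).
branches_swap_lower_bound = CMP2_EXECS_SWAP

print(f"jle falls through {jle_fallthrough:.1%} of the time")
print(f"branches, original order: {branches_orig}")
print(f"branches, swapped order (lower bound): {branches_swap_lower_bound}")
```

In the original order the jle falls through to the second compare on most entries, while in the swapped order the first compare filters out almost everything, which matches the observation that the second block drops out of the hot list.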
Next I tried profile-guided optimization on mcf. The compares are not
swapped, but basic block reordering has happened: the hot block is placed
right after the compare. However, this does not improve the run time.
Pass1 -Ofast -march=znver1 -fprofile-generate
Pass2 -Ofast -march=znver1 -fprofile-use
(Snip)
.L14:
jle .L13
cmpl $2, %edi
je .L16
.L13: <== hot block placed near
addq %r9, %rax
cmpq %rax, %r8
jbe .L12
(snip)
Compared to -Ofast -march=znver1:
(snip)
.L198:
jle .L197
cmpl $2, %r9d
jne .L197
.L199: <== cold block
incq %r15
movq %rdi, %r12
movq perm(,%r15,8), %r9
sarq $63, %r12
movq %rdi, 8(%r9)
xorq %r12, %rdi
movq %rax, (%r9)
movq %rdi, 16(%r9)
subq %r12, 16(%r9)
.L197: <== hot block
addq %rbx, %rax
cmpq %rax, %r8
jbe .L196
(snip)
Runtime is better only when I swap the compares.