Hello, folks. This is to discuss GCC's heuristic strategy for choosing between predicated instructions and branches, and something that probably needs to be improved.
[The story]
A few weeks ago, I built a Huffman encoding program with O2, O3, and PGO respectively. The program is nothing special, just random code I found on the internet. You can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.

Build it with O2/O3/PGO (my GCC is 13.1):

$ gcc -O2 -march=native -g -o huffman huffman.c
$ gcc -O3 -march=native -g -o huffman.O3 huffman.c
$ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
$ ./huffman.instrumented test.data
$ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c

Run them on my 12900H laptop:

$ head -c 50M /dev/urandom > test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman.O3 test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman.pgo test.data

The result (p-core, no HT, no turbo, performance mode):

                                              O2              O3             PGO
cycles                             2,581,832,749   8,638,401,568   9,394,200,585
                                         (1.07s)         (3.49s)         (3.80s)
instructions                      12,609,600,094  11,827,675,782  12,036,010,638
branches                           2,303,416,221   2,671,184,833   2,723,414,574
branch-misses                              0.00%           7.94%           8.84%
cache-misses                           3,012,613       3,055,722       3,076,316
L1-icache-load-misses                 11,416,391      12,112,703      11,896,077
icache_tag.stalls                      1,553,521       1,364,092       1,896,066
itlb_misses.stlb_hit                       6,856          21,756          22,600
itlb_misses.walk_completed                14,430           4,454          15,084
baclears.any                             131,573         140,355         131,644
int_misc.clear_resteer_cycles          2,545,915     586,578,125     679,021,993
machine_clears.count                      22,235          39,671          37,307
dsb2mite_switches.penalty_cycles       6,985,838      12,929,675       8,405,493
frontend_retired.any_dsb_miss         28,785,677      28,161,724      28,093,319
idq.dsb_cycles_any                 1,986,038,896   5,683,820,258   5,971,969,906
idq.dsb_uops                      11,149,445,952  26,438,051,062  28,622,657,650
idq.mite_uops                        207,881,687     216,734,007     212,003,064

The data above shows:

o O3/PGO lead to a *2.3x/2.6x* performance drop compared with O2, respectively.
o O3/PGO reduce the instruction count by 6.2% and 4.5%. I think this is attributable to more aggressive inlining.
o O3/PGO introduce very bad branch prediction. I will explain this later.
o Code built with O3 has higher iTLB misses but much lower sTLB misses. This is beyond my expectation.
o O3/PGO introduce 78% and 68% more machine clears. This is interesting and I don't know why yet (the machine-clear subcategories are not measured yet).
o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
o idq.mite_uops increases by 4% for O3/PGO, while idq.dsb_uops increases about 2x. The DSB hits well, so frontend fetching and decoding is not a problem for O3/PGO.
o The other events are mainly affected by the bad branch misprediction.

Additionally, here is the TMA level-2 analysis. The main changes in the pipeline slots are in the Bad Speculation and Frontend Bound categories. I doubt the accuracy of tma_fetch_bandwidth, given the frontend_retired.any_dsb_miss and idq.mite_uops data above.

$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman test.data':

      0.0 %  tma_branch_mispredicts
      0.8 %  tma_core_bound
     11.4 %  tma_heavy_operations
     76.8 %  tma_light_operations
      2.0 %  tma_memory_bound
      8.3 %  tma_fetch_bandwidth
      0.8 %  tma_fetch_latency
      0.0 %  tma_machine_clears

       1.073381357 seconds time elapsed
       0.945233000 seconds user
       0.095719000 seconds sys

$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':

     38.2 %  tma_branch_mispredicts
      6.6 %  tma_core_bound
      3.5 %  tma_heavy_operations
     21.7 %  tma_light_operations
      0.9 %  tma_memory_bound
     20.9 %  tma_fetch_bandwidth
      7.5 %  tma_fetch_latency
      0.8 %  tma_machine_clears

       3.501875873 seconds time elapsed
       3.378572000 seconds user
       0.084163000 seconds sys

$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.pgo test.data':

     40.3 %  tma_branch_mispredicts
      6.3 %  tma_core_bound
      3.6 %  tma_heavy_operations
     19.4 %  tma_light_operations
      1.2 %  tma_memory_bound
     17.8 %  tma_fetch_bandwidth
     10.7 %  tma_fetch_latency
      0.8 %  tma_machine_clears

       3.803413059 seconds time elapsed
       3.686474000 seconds user
       0.079707000 seconds sys

I also tried the same program with O2/O3 on arm64, and there O3 leads to a *30%* performance drop compared with O2.

[Predicated Ins vs Branches]
Then I analyzed the Bad Speculation problem. 99% of the mispredictions in O3/PGO are caused by the branch below:

@@ -264,7 +264,7 @@ void bitout (FILE *f, char b) {
   /* put a one on the end of this byte if b is '1' */
-  if (b == '1') current_byte |= 1;
+  //if (b == '1') current_byte |= 1;
   /* one more bit */

If I comment it out as in the patch above, then O3/PGO show 16% and 12% performance improvements over O2 on x86:

                              O2              O3             PGO
cycles             2,497,674,824   2,104,993,224   2,199,753,593
instructions      10,457,508,646   9,723,056,131  10,457,216,225
branches           2,303,029,380   2,250,522,323   2,302,994,942
branch-misses              0.00%           0.01%           0.01%

The main difference in the compiled code around the mispredicted branch is:

o In O2, a predicated instruction (a cmov here) is selected to eliminate the branch above; cmov is truly better than a branch in this case.
o In O3/PGO, bitout() is inlined into encode_file() and a branch instruction is selected. But this branch is obviously *unpredictable*, and the compiler doesn't know that. This is why O3/PGO do so badly on this program.

GCC doesn't support __builtin_unpredictable(), which LLVM has introduced. So I tried to see whether __builtin_expect_with_probability(e, x, 0.5) can serve the same purpose. The result is negative.
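To make the comparison concrete, here is a small self-contained sketch of the variants in question. Form (1) reproduces the statement from bitout() (simplified; the FILE argument and the surrounding byte bookkeeping are dropped), while (2)-(4) are only illustrations of possible annotations and workarounds; the helper names are mine:

static unsigned char current_byte;

/* 1) The original form.  GCC -O2 selects a cmov for this; -O3/PGO, after
 *    inlining bitout() into encode_file(), select a conditional branch,
 *    which mispredicts heavily on random input. */
static inline void put_bit_original(char b)
{
    if (b == '1')
        current_byte |= 1;
}

/* 2) What Clang accepts today: hint that the condition is unpredictable,
 *    so the backend can prefer the predicated form.  GCC has no
 *    equivalent builtin. */
#ifdef __clang__
static inline void put_bit_unpredictable(char b)
{
    if (__builtin_unpredictable(b == '1'))
        current_byte |= 1;
}
#endif

/* 3) What I tried with GCC: a 50% probability hint.  It compiles, but it
 *    did not make -O3/PGO pick the cmov form. */
static inline void put_bit_with_probability(char b)
{
    if (__builtin_expect_with_probability(b == '1', 1, 0.5))
        current_byte |= 1;
}

/* 4) A manual workaround: write it branchlessly; (b == '1') evaluates to
 *    0 or 1, so there is nothing left to mispredict. */
static inline void put_bit_branchless(char b)
{
    current_byte |= (b == '1');
}

Variant (4) sidesteps the heuristic at the source level, but a hint like (2) would let the compiler keep making the choice itself.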
I think we can conclude that there must be something to improve in GCC's heuristic strategy for choosing between predicated instructions and branches, at least for O3 and PGO. And could we add __builtin_unpredictable() support to GCC? It is usually hard for the compiler to detect unpredictable branches on its own.

--
Cheers,
Changbin Du