On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> Hello, folks.
> This is to discuss GCC's heuristic strategy for choosing between predicated
> instructions and branches, and probably something that needs to be improved.
>
> [The story]
> Weeks ago, I built a Huffman encoding program with O2, O3, and PGO
> respectively. This program is nothing special, just some random code I found
> on the internet. You can download it from
> http://cau.ac.kr/~bongbong/dsd08/huffman.c.
>
> Build it with O2/O3/PGO (My GCC is 13.1):
> $ gcc -O2 -march=native -g -o huffman huffman.c
> $ gcc -O3 -march=native -g -o huffman.O3 huffman.c
>
> $ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
> $ ./huffman.instrumented test.data
> $ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c
>
> Run them on my 12900H laptop:
> $ head -c 50M /dev/urandom > test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.O3 test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.pgo test.data
>
> The result (p-core, no ht, no turbo, performance mode):
>
>                                              O2              O3             PGO
> cycles                            2,581,832,749   8,638,401,568   9,394,200,585
>                                         (1.07s)         (3.49s)         (3.80s)
> instructions                     12,609,600,094  11,827,675,782  12,036,010,638
> branches                          2,303,416,221   2,671,184,833   2,723,414,574
> branch-misses                             0.00%           7.94%           8.84%
> cache-misses                          3,012,613       3,055,722       3,076,316
> L1-icache-load-misses                11,416,391      12,112,703      11,896,077
> icache_tag.stalls                     1,553,521       1,364,092       1,896,066
> itlb_misses.stlb_hit                      6,856          21,756          22,600
> itlb_misses.walk_completed               14,430           4,454          15,084
> baclears.any                            131,573         140,355         131,644
> int_misc.clear_resteer_cycles         2,545,915     586,578,125     679,021,993
> machine_clears.count                     22,235          39,671          37,307
> dsb2mite_switches.penalty_cycles      6,985,838      12,929,675       8,405,493
> frontend_retired.any_dsb_miss        28,785,677      28,161,724      28,093,319
> idq.dsb_cycles_any                1,986,038,896   5,683,820,258   5,971,969,906
> idq.dsb_uops                     11,149,445,952  26,438,051,062  28,622,657,650
> idq.mite_uops                       207,881,687     216,734,007     212,003,064
>
>
> The above data shows:
> o O3/PGO are *2.3x/2.6x* slower than O2, respectively.
> o O3/PGO reduced instructions by 6.2% and 4.5%. I attribute this to more
>   aggressive inlining.
> o O3/PGO introduced very bad branch prediction. I will explain this later.
> o Code built with O3 has higher iTLB misses but much lower sTLB misses. This
>   is beyond my expectation.
> o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
>   I don't know why. (The MC subcategories are not measured yet.)
> o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
> o The idq.mite_uops of O3/PGO increased by 4%, while idq.dsb_uops increased
>   2x. The DSB hits well, so frontend fetching and decoding is not the
>   problem for O3/PGO.
> o The other events are mainly affected by the bad branch prediction.
>
> Additionally, here is the TMA level 2 analysis: the main changes in the
> pipeline slots are in the Bad Speculation and Frontend Bound categories. I
> doubt the accuracy of tma_fetch_bandwidth, given the above
> frontend_retired.any_dsb_miss and idq.mite_uops data.
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
> test.data.huf is 1.00% of test.data
>
> Performance counter stats for 'taskset -c 0 ./huffman test.data':
>
>   % tma_branch_mispredicts    0.0
>   % tma_core_bound            0.8
>   % tma_heavy_operations     11.4
>   % tma_light_operations     76.8
>   % tma_memory_bound          2.0
>   % tma_fetch_bandwidth       8.3
>   % tma_fetch_latency         0.8
>   % tma_machine_clears        0.0
>
> 1.073381357 seconds time elapsed
>
> 0.945233000 seconds user
> 0.095719000 seconds sys
>
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
> test.data.huf is 1.00% of test.data
>
> Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':
>
>   % tma_branch_mispredicts   38.2
>   % tma_core_bound            6.6
>   % tma_heavy_operations      3.5
>   % tma_light_operations     21.7
>   % tma_memory_bound          0.9
>   % tma_fetch_bandwidth      20.9
>   % tma_fetch_latency         7.5
>   % tma_machine_clears        0.8
>
> 3.501875873 seconds time elapsed
>
> 3.378572000 seconds user
> 0.084163000 seconds sys
>
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
> test.data.huf is 1.00% of test.data
>
> Performance counter stats for 'taskset -c 0 ./huffman.pgo test.data':
>
>   % tma_branch_mispredicts   40.3
>   % tma_core_bound            6.3
>   % tma_heavy_operations      3.6
>   % tma_light_operations     19.4
>   % tma_memory_bound          1.2
>   % tma_fetch_bandwidth      17.8
>   % tma_fetch_latency        10.7
>   % tma_machine_clears        0.8
>
> 3.803413059 seconds time elapsed
>
> 3.686474000 seconds user
> 0.079707000 seconds sys
>
>
> I also tried the same program with O2/O3 on arm64, where O3 led to a *30%*
> performance drop compared to O2.
>
>
> [Predicated Ins vs Branches]
> Then I analyzed the Bad Speculation problem. 99% of the mispredictions in
> O3/PGO are caused by the branch below.
>
> @@ -264,7 +264,7 @@ void bitout (FILE *f, char b) {
>
> /* put a one on the end of this byte if b is '1' */
>
> - if (b == '1') current_byte |= 1;
> + //if (b == '1') current_byte |= 1;
>
> /* one more bit */
>
> If I comment it out as in the above patch, then O3/PGO get 16% and 12%
> performance improvements over O2 on x86.
>
>                            O2              O3             PGO
> cycles          2,497,674,824   2,104,993,224   2,199,753,593
> instructions   10,457,508,646   9,723,056,131  10,457,216,225
> branches        2,303,029,380   2,250,522,323   2,302,994,942
> branch-misses           0.00%           0.01%           0.01%
>
> The main difference in the compiled output around the mispredicted branch is:
> o In O2: a predicated instruction (cmov here) is selected to eliminate the
>   branch. cmov is truly better than a branch here.
> o In O3/PGO: bitout() is inlined into encode_file(), and a branch instruction
>   is selected. But this branch is obviously *unpredictable*, and the compiler
>   doesn't know it. This is why O3/PGO are so bad for this program.
>
After some debugging, I found the if-conversion pass fails at the
'!REG_P (dest)' check below with O3:

static bool
check_cond_move_block (basic_block bb,
                       hash_map<rtx, rtx> *vals,
                       vec<rtx> *regs,
                       rtx cond)
{
  ...
      dest = SET_DEST (set);
      src = SET_SRC (set);
      if (!REG_P (dest)  /* REG_P (dest) returns 0 here */
          || (HARD_REGISTER_P (dest)
              && targetm.small_register_classes_for_mode_p (GET_MODE (dest))))
        {
          printf ("check_cond_move_block: failed at line %d\n", __LINE__); /* added for debugging */
          return false;
        }
--
Cheers,
Changbin Du