Hello, folks. This is to discuss GCC's heuristic strategy for choosing between predicated instructions and branches, and something that probably needs to be improved.
[The story]
A few weeks ago, I built a Huffman encoding program with O2, O3, and PGO respectively. The program is nothing special, just random code I found on the internet. You can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.

Build it with O2/O3/PGO (my GCC is 13.1):

$ gcc -O2 -march=native -g -o huffman huffman.c
$ gcc -O3 -march=native -g -o huffman.O3 huffman.c
$ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
$ ./huffman.instrumented test.data
$ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c

Run them on my 12900H laptop:

$ head -c 50M /dev/urandom > test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman.O3 test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman.pgo test.data

The result (p-core, no HT, no turbo, performance mode):

                                              O2              O3             PGO
cycles                             2,581,832,749   8,638,401,568   9,394,200,585
                                         (1.07s)         (3.49s)         (3.80s)
instructions                      12,609,600,094  11,827,675,782  12,036,010,638
branches                           2,303,416,221   2,671,184,833   2,723,414,574
branch-misses                              0.00%           7.94%           8.84%
cache-misses                           3,012,613       3,055,722       3,076,316
L1-icache-load-misses                 11,416,391      12,112,703      11,896,077
icache_tag.stalls                      1,553,521       1,364,092       1,896,066
itlb_misses.stlb_hit                       6,856          21,756          22,600
itlb_misses.walk_completed                14,430           4,454          15,084
baclears.any                             131,573         140,355         131,644
int_misc.clear_resteer_cycles          2,545,915     586,578,125     679,021,993
machine_clears.count                      22,235          39,671          37,307
dsb2mite_switches.penalty_cycles       6,985,838      12,929,675       8,405,493
frontend_retired.any_dsb_miss         28,785,677      28,161,724      28,093,319
idq.dsb_cycles_any                 1,986,038,896   5,683,820,258   5,971,969,906
idq.dsb_uops                      11,149,445,952  26,438,051,062  28,622,657,650
idq.mite_uops                        207,881,687     216,734,007     212,003,064

The data above shows:

o O3/PGO lead to a *2.3x/2.6x* performance drop compared with O2, respectively.
o O3/PGO reduce the instruction count by 6.2% and 4.5%. I think this is attributable to more aggressive inlining.
o O3/PGO introduce very bad branch prediction. I will explain this later.
o Code built with O3 has higher iTLB misses but much lower sTLB misses. This is beyond my expectation.
o O3/PGO introduce 78% and 68% more machine clears. This is interesting and I don't know why yet (the machine-clear subcategories are not measured yet).
o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
o idq.mite_uops increases by 4% for O3/PGO, while idq.dsb_uops increases about 2x. The DSB hits well, so frontend fetching and decoding is not a problem for O3/PGO.
o The other events are mainly affected by the bad branch misprediction.

Additionally, here is the TMA level-2 analysis. The main changes in the pipeline slots are in the Bad Speculation and Frontend Bound categories. I doubt the accuracy of tma_fetch_bandwidth, given the frontend_retired.any_dsb_miss and idq.mite_uops data above.

$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman test.data':

      0.0 %  tma_branch_mispredicts
      0.8 %  tma_core_bound
     11.4 %  tma_heavy_operations
     76.8 %  tma_light_operations
      2.0 %  tma_memory_bound
      8.3 %  tma_fetch_bandwidth
      0.8 %  tma_fetch_latency
      0.0 %  tma_machine_clears

       1.073381357 seconds time elapsed
       0.945233000 seconds user
       0.095719000 seconds sys

$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':

     38.2 %  tma_branch_mispredicts
      6.6 %  tma_core_bound
      3.5 %  tma_heavy_operations
     21.7 %  tma_light_operations
      0.9 %  tma_memory_bound
     20.9 %  tma_fetch_bandwidth
      7.5 %  tma_fetch_latency
      0.8 %  tma_machine_clears

       3.501875873 seconds time elapsed
       3.378572000 seconds user
       0.084163000 seconds sys

$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.pgo test.data':

     40.3 %  tma_branch_mispredicts
      6.3 %  tma_core_bound
      3.6 %  tma_heavy_operations
     19.4 %  tma_light_operations
      1.2 %  tma_memory_bound
     17.8 %  tma_fetch_bandwidth
     10.7 %  tma_fetch_latency
      0.8 %  tma_machine_clears

       3.803413059 seconds time elapsed
       3.686474000 seconds user
       0.079707000 seconds sys

I also tried the same program with O2/O3 on arm64, and there O3 leads to a *30%* performance drop compared with O2.

[Predicated Ins vs Branches]
Then I analyzed the Bad Speculation problem. 99% of the mispredictions in O3/PGO are caused by the branch below:

@@ -264,7 +264,7 @@ void bitout (FILE *f, char b) {
   /* put a one on the end of this byte if b is '1' */
-  if (b == '1') current_byte |= 1;
+  //if (b == '1') current_byte |= 1;
   /* one more bit */

If I comment it out as in the patch above, then O3/PGO show 16% and 12% performance improvements over O2 on x86:

                              O2              O3             PGO
cycles             2,497,674,824   2,104,993,224   2,199,753,593
instructions      10,457,508,646   9,723,056,131  10,457,216,225
branches           2,303,029,380   2,250,522,323   2,302,994,942
branch-misses              0.00%           0.01%           0.01%

The main difference in the compiled code around the mispredicted branch is:

o In O2, a predicated instruction (a cmov here) is selected to eliminate the branch above; cmov is truly better than a branch in this case.
o In O3/PGO, bitout() is inlined into encode_file() and a branch instruction is selected. But this branch is obviously *unpredictable*, and the compiler doesn't know that. This is why O3/PGO do so badly on this program.

GCC doesn't support __builtin_unpredictable(), which LLVM has introduced. So I tried to see whether __builtin_expect_with_probability(e, x, 0.5) can serve the same purpose. The result is negative.
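To make the comparison concrete, here is a small self-contained sketch of the variants in question. Form (1) reproduces the statement from bitout() (simplified; the FILE argument and the surrounding byte bookkeeping are dropped), while (2)-(4) are only illustrations of possible annotations and workarounds; the helper names are mine:

static unsigned char current_byte;

/* 1) The original form.  GCC -O2 selects a cmov for this; -O3/PGO, after
 *    inlining bitout() into encode_file(), select a conditional branch,
 *    which mispredicts heavily on random input. */
static inline void put_bit_original(char b)
{
    if (b == '1')
        current_byte |= 1;
}

/* 2) What Clang accepts today: hint that the condition is unpredictable,
 *    so the backend can prefer the predicated form.  GCC has no
 *    equivalent builtin. */
#ifdef __clang__
static inline void put_bit_unpredictable(char b)
{
    if (__builtin_unpredictable(b == '1'))
        current_byte |= 1;
}
#endif

/* 3) What I tried with GCC: a 50% probability hint.  It compiles, but it
 *    did not make -O3/PGO pick the cmov form. */
static inline void put_bit_with_probability(char b)
{
    if (__builtin_expect_with_probability(b == '1', 1, 0.5))
        current_byte |= 1;
}

/* 4) A manual workaround: write it branchlessly; (b == '1') evaluates to
 *    0 or 1, so there is nothing left to mispredict. */
static inline void put_bit_branchless(char b)
{
    current_byte |= (b == '1');
}

Variant (4) sidesteps the heuristic at the source level, but a hint like (2) would let the compiler keep making the choice itself.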
I think we can conclude that there must be something to improve in GCC's heuristic strategy for choosing between predicated instructions and branches, at least for O3 and PGO. And could we add __builtin_unpredictable() support to GCC? It is usually hard for the compiler to detect unpredictable branches on its own.

--
Cheers,
Changbin Du