On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> The result (p-core, no ht, no turbo, performance mode):
> 
>                                 O2                      O3              PGO
> cycles                          2,581,832,749   8,638,401,568   9,394,200,585
>                                 (1.07s)         (3.49s)         (3.80s)
> instructions                    12,609,600,094  11,827,675,782  12,036,010,638
> branches                        2,303,416,221   2,671,184,833   2,723,414,574
> branch-misses                   0.00%           7.94%           8.84%
> cache-misses                    3,012,613       3,055,722       3,076,316
> L1-icache-load-misses           11,416,391      12,112,703      11,896,077
> icache_tag.stalls               1,553,521       1,364,092       1,896,066
> itlb_misses.stlb_hit            6,856           21,756          22,600
> itlb_misses.walk_completed      14,430          4,454           15,084
> baclears.any                    131,573         140,355         131,644
> int_misc.clear_resteer_cycles   2,545,915       586,578,125     679,021,993
> machine_clears.count            22,235          39,671          37,307
> dsb2mite_switches.penalty_cycles 6,985,838      12,929,675      8,405,493
> frontend_retired.any_dsb_miss   28,785,677      28,161,724      28,093,319
> idq.dsb_cycles_any              1,986,038,896   5,683,820,258   5,971,969,906
> idq.dsb_uops                    11,149,445,952  26,438,051,062  28,622,657,650
> idq.mite_uops                   207,881,687     216,734,007     212,003,064
> 
> 
> The above data shows:
>   o O3/PGO lead to a *2.3x/2.6x* performance drop compared with O2, respectively.
>   o O3/PGO reduced instructions by 6.2% and 4.5%. I attribute this to
>     aggressive inlining.
>   o O3/PGO introduced very bad branch prediction. I will explain it later.
>   o Code built with O3 has far more iTLB misses but far fewer sTLB misses
>     (completed page walks). This is beyond my expectation.
>   o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
>     I don't know why. (MC subcategories are not measured yet)
The MCs are caused by memory ordering conflicts and are attributable to the
kernel RCU lock in the I/O path, when ext4 tries to update its journal.
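
For reference, the subcategory breakdown can be read directly with the MC
events. A minimal sketch (event names vary by microarchitecture, so check what
'perf list' exposes on the target; <workload> is a placeholder):

  # break machine clears down by cause; memory_ordering should dominate here
  perf stat -e machine_clears.count,machine_clears.memory_ordering,machine_clears.smc \
      -- <workload>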

>   o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
>   o The idq.mite_uops of O3/PGO increased ~4%, while idq.dsb_uops increased ~2x.
>     The DSB hit rate is good, so frontend fetching and decoding is not a
>     problem for O3/PGO.
>   o Other events are mainly affected by the heavy branch misprediction.
> 

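To locate the mispredicted branches behind the numbers above, one option is to
sample on the miss event or record LBR branch stacks. A sketch, not the exact
commands used for the table (<workload> is a placeholder):

  # sample where branch misses retire, then inspect the hot spots
  perf record -e branch-misses -- <workload>
  perf report

  # or capture taken-branch records (LBR) and sort by source/target symbol
  perf record -b -e cycles -- <workload>
  perf report --sort symbol_from,symbol_to
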
-- 
Cheers,
Changbin Du
