On Wed, Oct 03, 2018 at 16:04:54 -0400, Emilio G. Cota wrote:
> Updates can come from other threads, so readers that do not
> take tlb_lock must use atomic_read to avoid undefined
> behaviour (UB).
>
> This and the previous commit result in a small performance decrease,
> but this is a fair price for removing UB.
(snip)
> That is, a ~2% slowdown for the aarch64 bootup+shutdown test.

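
For reference, a minimal standalone sketch of the pattern the quoted
commit message describes: writers serialize on a lock, while readers
that skip the lock go through an atomic load so that a concurrent
update is not a data race. This is not the code from the patch. The
Demo* types and demo_tlb_* functions are hypothetical names, a pthread
mutex stands in for tlb_lock, and atomic_read is approximated with a
relaxed C11 load.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TLB_SIZE 8 /* hypothetical size, for illustration only */

/* Hypothetical stand-in for a TLB entry; not QEMU's CPUTLBEntry. */
typedef struct {
    _Atomic uint64_t addr_read;
} DemoTLBEntry;

typedef struct {
    pthread_mutex_t lock; /* plays the role of tlb_lock */
    DemoTLBEntry table[DEMO_TLB_SIZE];
} DemoTLB;

void demo_tlb_init(DemoTLB *tlb)
{
    pthread_mutex_init(&tlb->lock, NULL);
    for (size_t i = 0; i < DEMO_TLB_SIZE; i++) {
        atomic_init(&tlb->table[i].addr_read, UINT64_MAX);
    }
}

/* Writer: may run on any thread; the lock serializes concurrent writers. */
void demo_tlb_set(DemoTLB *tlb, size_t idx, uint64_t addr)
{
    assert(idx < DEMO_TLB_SIZE);
    pthread_mutex_lock(&tlb->lock);
    /* Atomic store, so lock-free readers never race with this update. */
    atomic_store_explicit(&tlb->table[idx].addr_read, addr,
                          memory_order_relaxed);
    pthread_mutex_unlock(&tlb->lock);
}

/* Reader: does not take the lock, so it must use an atomic load. */
uint64_t demo_tlb_get(DemoTLB *tlb, size_t idx)
{
    assert(idx < DEMO_TLB_SIZE);
    return atomic_load_explicit(&tlb->table[idx].addr_read,
                                memory_order_relaxed);
}
```

The writer-side lock acquisition is the kind of cost that profiling
would have to confirm or rule out as the source of the slowdown; the
sketch only shows the access pattern and says nothing about how
expensive it is.
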
I've run more tests. This slowdown is much more pronounced on
memory-heavy workloads. These are the numbers for SPEC06int:

[ASCII bar chart not recoverable: "Speedup over master", y-axis from
0.8 to 1.05; series: tlb-lock-noatomic, +atomic; x-axis: SPEC06int
benchmarks (labels truncated in the original: 401.bzi, 403.g, 429,
445.g, 456., 462.libq, 464.h, 471.omn, 4, 483.xalancb) and geomean]

That is, a 5% average slowdown, with a max slowdown of ~14% for mcf :-(

I'll profile tomorrow and see where the slowdown comes from. If the
lock is the issue, we might be better off shifting all the work to the
cross-vCPU call (e.g. doing a round of synchronous cross-vCPU calls via
run_on_cpu), if the assumption that those calls are very rare is
correct.

		Emilio