From: "Emilio G. Cota" <c...@braap.org> This yields sizable scalability improvements, as the below results show.
Host: Two Intel Xeon Silver 4114 CPUs (20 cores total) at 2.20 GHz
VM: Ubuntu 18.04 ppc64

[Plot: Speedup vs a single thread for kernel build -- baseline vs cpu lock;
 x-axis: Guest vCPUs (0-35), y-axis: speedup (1x to ~7x)]

Pictures are also here:

  https://drive.google.com/file/d/1ASg5XyP9hNfN9VysXC3qe5s9QSJlwFAt/view?usp=sharing

Some notes:
- baseline corresponds to the commit before this series
- cpu-lock is this series

Single-threaded performance is only lightly affected. Results below are for a
debian aarch64 bootup+test of the entire series on an Intel(R) Core(TM)
i7-6700K CPU @ 4.00GHz host:

- Before:

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7269.033478      task-clock (msec)    #    0.998 CPUs utilized       ( +-  0.06% )
     30,659,870,302      cycles              #    4.218 GHz                 ( +-  0.06% )
     54,790,540,051      instructions        #    1.79  insns per cycle     ( +-  0.05% )
      9,796,441,380      branches            # 1347.695 M/sec               ( +-  0.05% )
        165,132,201      branch-misses       #    1.69% of all branches     ( +-  0.12% )

       7.287011656 seconds time elapsed                                     ( +-  0.10% )

- After:

       7375.924053      task-clock (msec)    #    0.998 CPUs utilized       ( +-  0.13% )
     31,107,548,846      cycles              #    4.217 GHz                 ( +-  0.12% )
     55,355,668,947      instructions        #    1.78  insns per cycle     ( +-  0.05% )
      9,929,917,664      branches            # 1346.261 M/sec               ( +-  0.04% )
        166,547,442      branch-misses       #    1.68% of all branches     ( +-  0.09% )

       7.389068145 seconds time elapsed                                     ( +-  0.13% )

That is, a 1.37% slowdown.

Reviewed-by: Alex Bennée <alex.ben...@linaro.org>
Reviewed-by: Richard Henderson <richard.hender...@linaro.org>
Tested-by: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Emilio G. Cota <c...@braap.org>
[Updated the speedup chart results for re-based series.]
Signed-off-by: Robert Foley <robert.fo...@linaro.org>
---
 accel/tcg/cputlb.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 1e815357c7..7f75054643 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -299,7 +299,7 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
 
     CPU_FOREACH(cpu) {
         if (cpu != src) {
-            async_run_on_cpu(cpu, fn, d);
+            async_run_on_cpu_no_bql(cpu, fn, d);
         }
     }
 }
@@ -367,8 +367,8 @@ void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap)
     tlb_debug("mmu_idx: 0x%" PRIx16 "\n", idxmap);
 
     if (cpu->created && !qemu_cpu_is_self(cpu)) {
-        async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work,
-                         RUN_ON_CPU_HOST_INT(idxmap));
+        async_run_on_cpu_no_bql(cpu, tlb_flush_by_mmuidx_async_work,
+                                RUN_ON_CPU_HOST_INT(idxmap));
     } else {
         tlb_flush_by_mmuidx_async_work(cpu, RUN_ON_CPU_HOST_INT(idxmap));
     }
@@ -562,7 +562,7 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
          * we can stuff idxmap into the low TARGET_PAGE_BITS, avoid
          * allocating memory for this operation.
          */
-        async_run_on_cpu(cpu, tlb_flush_page_by_mmuidx_async_1,
+        async_run_on_cpu_no_bql(cpu, tlb_flush_page_by_mmuidx_async_1,
                          RUN_ON_CPU_TARGET_PTR(addr | idxmap));
     } else {
         TLBFlushPageByMMUIdxData *d = g_new(TLBFlushPageByMMUIdxData, 1);
@@ -570,7 +570,7 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
         /* Otherwise allocate a structure, freed by the worker. */
         d->addr = addr;
         d->idxmap = idxmap;
-        async_run_on_cpu(cpu, tlb_flush_page_by_mmuidx_async_2,
+        async_run_on_cpu_no_bql(cpu, tlb_flush_page_by_mmuidx_async_2,
                          RUN_ON_CPU_HOST_PTR(d));
     }
 }
-- 
2.17.1
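
As a side note for readers of the diff, below is a minimal sketch of the call
pattern being applied here. It assumes that async_run_on_cpu_no_bql(), added
earlier in this series, takes the same arguments as async_run_on_cpu() and
differs only in running the queued work item without taking the BQL. The names
example_tlb_work() and example_queue_flush() are hypothetical and exist only
for illustration; the snippet would only build inside accel/tcg/cputlb.c,
where tlb_flush_by_mmuidx_async_work() is defined.

#include "qemu/osdep.h"   /* already pulled in by cputlb.c */
#include "hw/core/cpu.h"  /* CPUState, run_on_cpu_data, RUN_ON_CPU_HOST_INT */

/*
 * Work item queued on the target vCPU.  It only touches that vCPU's own
 * TLB, which is guarded by the per-CPU TLB lock, so it has no need for
 * the BQL.
 */
static void example_tlb_work(CPUState *cpu, run_on_cpu_data data)
{
    tlb_flush_by_mmuidx_async_work(cpu, data);
}

/*
 * Hypothetical helper mirroring the pattern in this patch: flush the MMU
 * indexes in @idxmap on @cpu, running the work directly when we are already
 * on the target vCPU and otherwise queueing it without serializing on the
 * BQL.
 */
static void example_queue_flush(CPUState *cpu, uint16_t idxmap)
{
    if (qemu_cpu_is_self(cpu)) {
        example_tlb_work(cpu, RUN_ON_CPU_HOST_INT(idxmap));
    } else {
        async_run_on_cpu_no_bql(cpu, example_tlb_work,
                                RUN_ON_CPU_HOST_INT(idxmap));
    }
}

The rationale, per this series, is that these flush work items operate only on
per-vCPU TLB state, so funnelling them through the BQL adds contention without
protecting anything; removing that serialization is what produces the cpu-lock
curve in the chart above.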