The motivation here is reducing total tlb flush overhead.  Before a few
patches went into target-arm.next, I measured total tlb flush overhead
for aarch64 at 25%.  This series appears to reduce that to about 5%
(I do still need to re-run the control tests properly, not just watch
perf top as I'm doing now).
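For what it's worth, most of that reduction comes from eliding flush work
for TLBs that are already clean.  Below is a minimal, self-contained sketch
of the idea only; the struct layout, counter names, and helper functions are
simplified placeholders, not the code in these patches.

/*
 * Illustrative sketch: track a per-CPU bitmask of mmu_idx whose TLBs
 * have had entries installed since the last flush, and skip flush work
 * for any requested index that is still clean.
 */
#include <stdint.h>
#include <stdio.h>

#define NB_MMU_MODES    4
#define ALL_MMUIDX_BITS ((1u << NB_MMU_MODES) - 1)

typedef struct CPUTLBCommon {
    uint16_t dirty;             /* bit n set => mmu_idx n has valid entries */
    size_t part_flush_count;    /* flushes that touched only some indexes */
    size_t elide_flush_count;   /* flushes skipped entirely */
} CPUTLBCommon;

/* Mark an mmu_idx dirty when a TLB entry is installed for it. */
static void tlb_mark_dirty(CPUTLBCommon *c, int mmu_idx)
{
    c->dirty |= 1u << mmu_idx;
}

/* Flush only those requested indexes that are actually dirty. */
static void tlb_flush_filtered(CPUTLBCommon *c, uint16_t asked)
{
    uint16_t to_clean = asked & c->dirty;

    if (to_clean == 0) {
        c->elide_flush_count++;   /* all requested TLBs already clean */
        return;
    }
    c->dirty &= ~to_clean;
    c->part_flush_count++;
    /* ... clear the per-mmu_idx tables selected by to_clean ... */
}

int main(void)
{
    CPUTLBCommon c = { 0 };

    tlb_mark_dirty(&c, 1);
    tlb_flush_filtered(&c, ALL_MMUIDX_BITS);  /* flushes idx 1 */
    tlb_flush_filtered(&c, ALL_MMUIDX_BITS);  /* everything clean, elided */
    printf("partial=%zu elided=%zu\n", c.part_flush_count, c.elide_flush_count);
    return 0;
}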
The final patch is somewhat of an RFC.  I'd like to know what benchmark
was used when putting in pending_tlb_flushes, and I have not done any
archaeology to find out.  I suspect that it does not make any measurable
difference beyond tlb_c.dirty, and I think the code is a bit cleaner
without it.

r~

Richard Henderson (10):
  cputlb: Move tlb_lock to CPUTLBCommon
  cputlb: Remove tcg_enabled hack from tlb_flush_nocheck
  cputlb: Move cpu->pending_tlb_flush to env->tlb_c.pending_flush
  cputlb: Split large page tracking per mmu_idx
  cputlb: Move env->vtlb_index to env->tlb_d.vindex
  cputlb: Merge tlb_flush_nocheck into tlb_flush_by_mmuidx_async_work
  cputlb: Merge tlb_flush_page into tlb_flush_page_by_mmuidx
  cputlb: Count "partial" and "elided" tlb flushes
  cputlb: Filter flushes on already clean tlbs
  cputlb: Remove tlb_c.pending_flushes

 include/exec/cpu-defs.h   |  51 +++++-
 include/exec/cputlb.h     |   2 +-
 include/qom/cpu.h         |   6 -
 accel/tcg/cputlb.c        | 354 +++++++++++++++-----------------------
 accel/tcg/translate-all.c |   8 +-
 5 files changed, 184 insertions(+), 237 deletions(-)

-- 
2.17.2