On 03/06/16 23:40, Alex Bennée wrote:
> This is a current DRAFT of a design proposal for upgrading TCG emulation
> to take advantage of modern CPUs by running a thread-per-CPU. The
> document goes through the various areas of the code affected by such a
> change and proposes design requirements for each part of the solution.
>
> It has been written *without* explicit reference to the current ongoing
> efforts to introduce this feature. The hope being we can review and
> discuss the design choices without assuming the current choices taken by
> the implementation are correct.
>
> Signed-off-by: Alex Bennée <alex.ben...@linaro.org>
>
> ---
> v1
>   - initial version
> v2
>   - update discussion on locks
>   - bit more detail on vCPU scheduling
>   - explicitly mention Translation Blocks
>   - emulated hardware state already covered by iomutex
>   - a few minor rewords
> v3
>   - mention this covers system-mode
>   - describe main main-loop and lookup hot-path
>   - mention multi-concurrent-reader lookups
>   - enumerate reasons for invalidation
>   - add more details on lookup structures
>   - describe the softmmu hot-path better
>   - mention store-after-load barrier problem
> ---
>  docs/multi-thread-tcg.txt | 225 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 225 insertions(+)
>  create mode 100644 docs/multi-thread-tcg.txt
>
> diff --git a/docs/multi-thread-tcg.txt b/docs/multi-thread-tcg.txt
> new file mode 100644
> index 0000000..5c88c99
> --- /dev/null
> +++ b/docs/multi-thread-tcg.txt
> @@ -0,0 +1,225 @@
> +Copyright (c) 2015 Linaro Ltd.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later. See
> +the COPYING file in the top-level directory.
> +
> +STATUS: DRAFTING
> +
> +Introduction
> +============
> +
> +This document outlines the design for multi-threaded TCG system-mode
> +emulation. The current user-mode emulation mirrors the thread
> +structure of the translated executable.
> +
> +The original system-mode TCG implementation was single threaded and
> +dealt with multiple CPUs with simple round-robin scheduling. This
> +simplified a lot of things but became increasingly limited as systems
> +being emulated gained additional cores and per-core performance gains
> +for host systems started to level off.
> +
> +vCPU Scheduling
> +===============
> +
> +We introduce a new running mode where each vCPU will run on its own
> +user-space thread. This will be enabled by default for all
> +FE/BE combinations that have had the required work done to support
> +this safely.
> +
> +In the general case of running translated code there should be no
> +inter-vCPU dependencies and all vCPUs should be able to run at full
> +speed. Synchronisation will only be required while accessing internal
> +shared data structures or when the emulated architecture requires a
> +coherent representation of the emulated machine state.
> +
> +Shared Data Structures
> +======================
> +
> +Main Run Loop
> +-------------
> +
> +Even when there is no code being generated there are a number of
> +structures associated with the hot-path through the main run-loop.
> +These are associated with looking up the next translation block to
> +execute. These include:
> +
> +    tb_jmp_cache (per-vCPU, cache of recent jumps)
> +    tb_phys_hash (global, phys address->tb lookup)
> +
> +As TB linking only occurs when blocks are in the same page this code
> +is critical to performance as looking up the next TB to execute is the
> +most common reason to exit the generated code.
> +
> +DESIGN REQUIREMENT: Make access to lookup structures safe with
> +multiple reader/writer threads. Minimise any lock contention to do it.
> +
> +Global TCG State
> +----------------
> +
> +We need to protect the entire code generation cycle including any post
> +generation patching of the translated code. This also implies a shared
> +translation buffer which contains code running on all cores.
> +Any execution path that comes to the main run loop will need to hold a
> +mutex for code generation. This also includes times when we need to
> +flush code or entries from any shared lookups/caches. Structures held
> +on a per-vCPU basis won't need locking unless other vCPUs will need to
> +modify them.
> +
> +DESIGN REQUIREMENT: Add locking around all code generation and TB
> +patching. If possible make shared lookup/caches able to handle multiple
> +readers without locks, otherwise protect them with locks as well.
> +
> +Translation Blocks
> +------------------
> +
> +Currently the whole system shares a single code generation buffer
> +which when full will force a flush of all translations and start from
> +scratch again.
> +
> +Once a basic block has been translated it will continue to be used
> +until it is invalidated. These invalidation events are typically due to
> +a change to the state of a physical page:
> + - code modification (self-modifying code, patching code)
> + - page changes (new mapping to physical page)
Mapping changes invalidate translation blocks in user-mode emulation
only. In system-mode emulation, we just do a TLB flush and clean the
CPU's 'tb_jmp_cache' but don't invalidate any translation block; just
follow tlb_flush(). That is why in system-mode emulation we can't do
direct jumps to another page or to a TB which spans a page boundary.

> + - debugging operations (breakpoint insertion/removal)
> +
> +There exist several places reference to TBs exist which need to be
> +cleared in a safe way.

You mean: "There are several places where references to TBs exist
which ..."?

> +
> +The main reference is a global page table (l1_map) which provides a 2
> +level look-up for PageDesc structures which contain pointers to the
> +start of a linked list of all Translation Blocks in that page (see
> +page_next).

Actually, 'l1_map' is multi-level, see the comment above the
'V_L2_BITS' macro definition.

> +
> +When a block is invalidated any blocks which directly jump to it need
> +to have those jumps removed. This requires navigating the tb_jump_list
> +linked list as well as patching the jump code in a safe way.
> +
> +Finally there are a number of look-up mechanisms for accelerating
> +lookup of the next TB. These caches and hash tables need to have
> +references removed in a safe way.
> +
> +DESIGN REQUIREMENT: Safely handle invalidation of TBs
> + - safely patch direct jumps
> + - remove central PageDesc lookup entries
> + - ensure lookup caches/hashes are safely updated
> +
> +Memory maps and TLBs
> +--------------------
> +
> +The memory handling code is fairly critical to the speed of memory
> +access in the emulated system. The SoftMMU code is designed so the
> +hot-path can be handled entirely within translated code. This is
> +handled with a per-vCPU TLB structure which once populated will allow
> +a series of accesses to the page to occur without exiting the
> +translated code.
> +It is possible to set flags in the TLB address which
> +will ensure the slow-path is taken for each access. This can be done
> +to support:
> +
> + - Memory regions (dividing up access to PIO, MMIO and RAM)
> + - Dirty page tracking (for code gen, migration and display)
> + - Virtual TLB (for translating guest address->real address)
> +
> +When the TLB tables are updated we need to ensure they are done in a
> +safe way by bringing all executing threads to a halt before making the
> +modifications.

Actually, we just need to be thread-safe when modifying vCPU TLB
entries from another thread. If it is possible to do this using
(relaxed) atomic memory accesses, we could obviously benefit from it.

> +
> +DESIGN REQUIREMENTS:
> +
> + - TLB Flush All/Page
> +   - can be across-CPUs
> +   - will need all other CPUs brought to a halt

Only a *cross-CPU* TLB flush *may* need other CPUs brought to a halt.

> + - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
> +   - This is a per-CPU table - by definition can't race
> +   - updated by its own thread when the slow-path is forced
> +
> +Emulated hardware state
> +-----------------------
> +
> +Currently thanks to KVM work any access to IO memory is automatically
> +protected by the global iothread mutex. Any IO region that doesn't use
> +the global mutex is expected to do its own locking.

Is it worth mentioning that we're going to push iothread locking down
in TCG as much as possible?

> +
> +Memory Consistency
> +==================
> +
> +Between emulated guests and host systems there are a range of memory
> +consistency models. Even emulating weakly ordered systems on strongly
> +ordered hosts needs to ensure things like store-after-load re-ordering
> +can be prevented when the guest wants to.
> +
> +Memory Barriers
> +---------------
> +
> +Barriers (sometimes known as fences) provide a mechanism for software
> +to enforce a particular ordering of memory operations from the point
> +of view of external observers (e.g.
> +another processor core). They can
> +apply to any memory operations as well as just loads or stores.
> +
> +The Linux kernel has an excellent write-up on the various forms of
> +memory barrier and the guarantees they can provide [1].
> +
> +Barriers are often wrapped around synchronisation primitives to
> +provide explicit memory ordering semantics. However they can be used
> +by themselves to provide safe lockless access by ensuring, for example,
> +a signal flag will always be set after a payload.
> +
> +DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
> +
> +This would enforce a strong load/store ordering so all loads/stores
> +complete at the memory barrier. On single-core non-SMP strongly
> +ordered backends this could become a NOP.
> +
> +There may be a case for further refinement if this causes performance
> +bottlenecks.

I think it's worth mentioning that aside from explicit standalone
memory barrier instructions there are also implicit memory ordering
semantics which come with each guest memory access instruction,
e.g. relaxed, acquire/release, sequentially consistent.

> +
> +Memory Control and Maintenance
> +------------------------------
> +
> +This includes a class of instructions for controlling system cache
> +behaviour. While QEMU doesn't model cache behaviour these instructions
> +are often seen when code modification has taken place to ensure the
> +changes take effect.
> +
> +Synchronisation Primitives
> +--------------------------
> +
> +There are two broad types of synchronisation primitives found in
> +modern ISAs: atomic instructions and exclusive regions.
> +
> +The first type offers a simple atomic instruction which will guarantee
> +some sort of test and conditional store will be truly atomic w.r.t.
> +other cores sharing access to the memory. The classic example is the
> +x86 cmpxchg instruction.
> +
> +The second type offers a pair of load/store instructions which offer a
> +guarantee that a region of memory has not been touched between the
> +load and store instructions. An example of this is ARM's ldrex/strex
> +pair where the strex instruction will return a flag indicating a
> +successful store only if no other CPU has accessed the memory region
> +since the ldrex.
> +
> +Traditionally TCG has generated a series of operations that work
> +because they are within the context of a single translation block so
> +will have completed before another CPU is scheduled. However with
> +the ability to have multiple threads running to emulate multiple CPUs
> +we will need to explicitly expose these semantics.
> +
> +DESIGN REQUIREMENTS:
> + - atomics

What does "atomics" mean here ...

> +   - Introduce some atomic TCG ops for the common semantics
> +   - The default fallback helper function will use qemu_atomics
> +   - Each backend can then add a more efficient implementation
> + - load/store exclusive

... and how does it relate to "load/store exclusive"?

> +   [AJB:
> +   There are currently a number of proposals of interest:
> +    - Greensocs tweaks to ldst ex (using locks)
> +    - Slow-path for atomic instruction translation [2]
> +    - Helper-based Atomic Instruction Emulation (AIE) [3]
> +   ]

We also have a problem emulating a 64-bit guest on a 32-bit host:
while a naturally aligned machine-word memory access is usually
atomic, a single 64-bit guest memory access instruction is translated
into a series of two 32-bit host memory access instructions. That can
(will) break guest code assumptions about memory access atomicity.

Kind regards,
Sergey

> +
> +==========
> +
> +[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
> +[2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
> +[3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297