On 03/06/16 23:40, Alex Bennée wrote:
> This is a current DRAFT of a design proposal for upgrading TCG emulation
> to take advantage of modern CPUs by running a thread-per-CPU. The
> document goes through the various areas of the code affected by such a
> change and proposes design requirements for each part of the solution.
>
> It has been written *without* explicit reference to the current ongoing
> efforts to introduce this feature. The hope being we can review and
> discuss the design choices without assuming the current choices taken by
> the implementation are correct.
>
> Signed-off-by: Alex Bennée <alex.ben...@linaro.org>
>
> ---
> v1
>   - initial version
> v2
>   - update discussion on locks
>   - bit more detail on vCPU scheduling
>   - explicitly mention Translation Blocks
>   - emulated hardware state already covered by iomutex
>   - a few minor rewords
> v3
>   - mention this covers system-mode
>   - describe main main-loop and lookup hot-path
>   - mention multi-concurrent-reader lookups
>   - enumerate reasons for invalidation
>   - add more details on lookup structures
>   - describe the softmmu hot-path better
>   - mention store-after-load barrier problem
> ---
>  docs/multi-thread-tcg.txt | 225 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 225 insertions(+)
>  create mode 100644 docs/multi-thread-tcg.txt
>
> diff --git a/docs/multi-thread-tcg.txt b/docs/multi-thread-tcg.txt
> new file mode 100644
> index 0000000..5c88c99
> --- /dev/null
> +++ b/docs/multi-thread-tcg.txt
> @@ -0,0 +1,225 @@
> +Copyright (c) 2015 Linaro Ltd.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later. See
> +the COPYING file in the top-level directory.
> +
> +STATUS: DRAFTING
> +
> +Introduction
> +============
> +
> +This document outlines the design for multi-threaded TCG system-mode
> +emulation. The current user-mode emulation mirrors the thread
> +structure of the translated executable.
> +
> +The original system-mode TCG implementation was single threaded and
> +dealt with multiple CPUs with simple round-robin scheduling. This
> +simplified a lot of things but became increasingly limited as systems
> +being emulated gained additional cores and per-core performance gains
> +for host systems started to level off.
> +
> +vCPU Scheduling
> +===============
> +
> +We introduce a new running mode where each vCPU will run on its own
> +user-space thread. This will be enabled by default for all
> +FE/BE combinations that have had the required work done to support
> +this safely.
> +
> +In the general case of running translated code there should be no
> +inter-vCPU dependencies and all vCPUs should be able to run at full
> +speed. Synchronisation will only be required while accessing internal
> +shared data structures or when the emulated architecture requires a
> +coherent representation of the emulated machine state.
> +
> +Shared Data Structures
> +======================
> +
> +Main Run Loop
> +-------------
> +
> +Even when there is no code being generated there are a number of
> +structures associated with the hot-path through the main run-loop.
> +These are associated with looking up the next translation block to
> +execute. These include:
> +
> +    tb_jmp_cache (per-vCPU, cache of recent jumps)
> +    tb_phys_hash (global, phys address->tb lookup)
> +
> +As TB linking only occurs when blocks are in the same page this code
> +is critical to performance as looking up the next TB to execute is the
> +most common reason to exit the generated code.
> +
> +DESIGN REQUIREMENT: Make access to lookup structures safe with
> +multiple reader/writer threads. Minimise any lock contention to do it.
> +
> +Global TCG State
> +----------------
> +
> +We need to protect the entire code generation cycle including any post
> +generation patching of the translated code. This also implies a shared
> +translation buffer which contains code running on all cores.
> +Any execution path that comes to the main run loop will need to hold a
> +mutex for code generation. This also includes times when we need to
> +flush code or entries from any shared lookups/caches. Structures held
> +on a per-vCPU basis won't need locking unless other vCPUs will need to
> +modify them.
> +
> +DESIGN REQUIREMENT: Add locking around all code generation and TB
> +patching. If possible make shared lookup/caches able to handle multiple
> +readers without locks, otherwise protect them with locks as well.
> +
> +Translation Blocks
> +------------------
> +
> +Currently the whole system shares a single code generation buffer
> +which when full will force a flush of all translations and start from
> +scratch again.
> +
> +Once a basic block has been translated it will continue to be used
> +until it is invalidated. These invalidation events are typically due to
> +a change to the state of a physical page:
> + - code modification (self-modifying code, patching code)
> + - page changes (new mapping to physical page)
Mapping changes invalidate translation blocks in user-mode emulation
only. In system-mode emulation, we just do a TLB flush and clean the
CPU's 'tb_jmp_cache' but don't invalidate any translation block; just
follow tlb_flush(). That is why in system-mode emulation we can't do
direct jumps to another page or to a TB which spans a page boundary.

> + - debugging operations (breakpoint insertion/removal)
> +
> +There exist several places reference to TBs exist which need to be
> +cleared in a safe way.

You mean: "There are several places where references to TBs exist
which ..."?

> +
> +The main reference is a global page table (l1_map) which provides a 2
> +level look-up for PageDesc structures which contain pointers to the
> +start of a linked list of all Translation Blocks in that page (see
> +page_next).

Actually, 'l1_map' is multi-level, see the comment above the
'V_L2_BITS' macro definition.

> +
> +When a block is invalidated any blocks which directly jump to it need
> +to have those jumps removed. This requires navigating the tb_jump_list
> +linked list as well as patching the jump code in a safe way.
> +
> +Finally there are a number of look-up mechanisms for accelerating
> +lookup of the next TB. These caches and hash tables need to have
> +references removed in a safe way.
> +
> +DESIGN REQUIREMENT: Safely handle invalidation of TBs
> + - safely patch direct jumps
> + - remove central PageDesc lookup entries
> + - ensure lookup caches/hashes are safely updated
> +
> +Memory maps and TLBs
> +--------------------
> +
> +The memory handling code is fairly critical to the speed of memory
> +access in the emulated system. The SoftMMU code is designed so the
> +hot-path can be handled entirely within translated code. This is
> +handled with a per-vCPU TLB structure which once populated will allow
> +a series of accesses to the page to occur without exiting the
> +translated code.
> +It is possible to set flags in the TLB address which
> +will ensure the slow-path is taken for each access. This can be done
> +to support:
> +
> + - Memory regions (dividing up access to PIO, MMIO and RAM)
> + - Dirty page tracking (for code gen, migration and display)
> + - Virtual TLB (for translating guest address->real address)
> +
> +When the TLB tables are updated we need to ensure they are done in a
> +safe way by bringing all executing threads to a halt before making the
> +modifications.

Actually, we just need to be thread-safe when modifying vCPU TLB
entries from another thread. If it is possible to do this using
(relaxed) atomic memory accesses, we could obviously benefit from it.

> +
> +DESIGN REQUIREMENTS:
> +
> + - TLB Flush All/Page
> +   - can be across-CPUs
> +   - will need all other CPUs brought to a halt

Only a *cross-CPU* TLB flush *may* need other CPUs brought to a halt.

> + - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
> +   - This is a per-CPU table - by definition can't race
> +   - updated by its own thread when the slow-path is forced
> +
> +Emulated hardware state
> +-----------------------
> +
> +Currently thanks to KVM work any access to IO memory is automatically
> +protected by the global iothread mutex. Any IO region that doesn't use
> +the global mutex is expected to do its own locking.

Is it worth mentioning that we're going to push iothread locking down
in TCG as much as possible?

> +
> +Memory Consistency
> +==================
> +
> +Between emulated guests and host systems there are a range of memory
> +consistency models. Even emulating weakly ordered systems on strongly
> +ordered hosts needs to ensure things like store-after-load re-ordering
> +can be prevented when the guest wants to.
> +
> +Memory Barriers
> +---------------
> +
> +Barriers (sometimes known as fences) provide a mechanism for software
> +to enforce a particular ordering of memory operations from the point
> +of view of external observers (e.g.
> +another processor core). They can
> +apply to any memory operations as well as just loads or stores.
> +
> +The Linux kernel has an excellent write-up on the various forms of
> +memory barrier and the guarantees they can provide [1].
> +
> +Barriers are often wrapped around synchronisation primitives to
> +provide explicit memory ordering semantics. However they can be used
> +by themselves to provide safe lockless access by ensuring, for example,
> +a signal flag will always be set after a payload.
> +
> +DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
> +
> +This would enforce a strong load/store ordering so all loads/stores
> +complete at the memory barrier. On single-core non-SMP strongly
> +ordered backends this could become a NOP.
> +
> +There may be a case for further refinement if this causes performance
> +bottlenecks.

I think it's worth mentioning that aside from explicit standalone
memory barrier instructions there are also implicit memory ordering
semantics which come with each guest memory access instruction,
e.g. relaxed, acquire/release, sequentially consistent.

> +
> +Memory Control and Maintenance
> +------------------------------
> +
> +This includes a class of instructions for controlling system cache
> +behaviour. While QEMU doesn't model cache behaviour these instructions
> +are often seen when code modification has taken place to ensure the
> +changes take effect.
> +
> +Synchronisation Primitives
> +--------------------------
> +
> +There are two broad types of synchronisation primitives found in
> +modern ISAs: atomic instructions and exclusive regions.
> +
> +The first type offers a simple atomic instruction which will guarantee
> +some sort of test and conditional store will be truly atomic w.r.t.
> +other cores sharing access to the memory. The classic example is the
> +x86 cmpxchg instruction.
> +
> +The second type offers a pair of load/store instructions which offer a
> +guarantee that a region of memory has not been touched between the
> +load and store instructions. An example of this is ARM's ldrex/strex
> +pair where the strex instruction will return a flag indicating a
> +successful store only if no other CPU has accessed the memory region
> +since the ldrex.
> +
> +Traditionally TCG has generated a series of operations that work
> +because they are within the context of a single translation block so
> +will have completed before another CPU is scheduled. However with
> +the ability to have multiple threads running to emulate multiple CPUs
> +we will need to explicitly expose these semantics.
> +
> +DESIGN REQUIREMENTS:
> + - atomics

What does "atomics" mean here ...

> +   - Introduce some atomic TCG ops for the common semantics
> +   - The default fallback helper function will use qemu_atomics
> +   - Each backend can then add a more efficient implementation
> + - load/store exclusive

... and how does it relate to "load/store exclusive"?

> +   [AJB:
> +   There are currently a number of proposals of interest:
> +    - Greensocs tweaks to ldst ex (using locks)
> +    - Slow-path for atomic instruction translation [2]
> +    - Helper-based Atomic Instruction Emulation (AIE) [3]
> +   ]

We also have a problem emulating a 64-bit guest on a 32-bit host:
while a naturally aligned machine-word memory access is usually
atomic, a single 64-bit guest memory access instruction is translated
into a series of two 32-bit host memory access instructions. That can
(will) break guest code assumptions about memory access atomicity.

Kind regards,
Sergey

> +
> +==========
> +
> +[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
> +[2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
> +[3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297