Philippe Mathieu-Daudé <[email protected]> writes:

> Significantly expand the TCG documentation to provide a more
> comprehensive overview of its internal architecture.
>
> Use more rST anchors to improve cross-referencing across the
> documentation.
>
> Clarify front-end / optimization / back-end phases.
>
> Briefly describe memory consistency barriers in MTTCG mode.
>
> Add the following new sections:
>
>  - Register Allocation and Liveness analysis
>  - Overview of the Vector/SIMD internal strategy
>  - Deterministic Execution (icount)
>  - TCG Plugins
>  - Instruction Decoding with decodetree
>
> AI-used-for: docs
> Signed-off-by: Philippe Mathieu-Daudé <[email protected]>
> ---
> Based-on: <[email protected]>
> ---
>  docs/devel/multi-thread-tcg.rst |  2 +-
>  docs/devel/tcg-icount.rst       |  1 +
>  docs/devel/tcg.rst              | 89 +++++++++++++++++++++++++++++++++
>  3 files changed, 91 insertions(+), 1 deletion(-)
>
> diff --git a/docs/devel/multi-thread-tcg.rst b/docs/devel/multi-thread-tcg.rst
> index da9a1530c9f..aa0b11ab360 100644
> --- a/docs/devel/multi-thread-tcg.rst
> +++ b/docs/devel/multi-thread-tcg.rst
> @@ -4,7 +4,7 @@
>    This work is licensed under the terms of the GNU GPL, version 2 or
>    later. See the COPYING file in the top-level directory.
>  
> -.. _mttcg:
> +.. _MTTCG:
>  
>  ==================
>  Multi-threaded TCG
> diff --git a/docs/devel/tcg-icount.rst b/docs/devel/tcg-icount.rst
> index a1dcd79e0fd..848c19a746f 100644
> --- a/docs/devel/tcg-icount.rst
> +++ b/docs/devel/tcg-icount.rst
> @@ -2,6 +2,7 @@
>     Copyright (c) 2020, Linaro Limited
>     Written by Alex Bennée
>  
> +.. _icount:
>  
>  ========================
>  TCG Instruction Counting
> diff --git a/docs/devel/tcg.rst b/docs/devel/tcg.rst
> index 2786f2f6791..9af06018f6a 100644
> --- a/docs/devel/tcg.rst
> +++ b/docs/devel/tcg.rst
> @@ -13,6 +13,16 @@ performances.
>  QEMU's dynamic translation backend is called TCG, for "Tiny Code
>  Generator". For more information, please take a look at :ref:`tcg-ops-ref`.
>  
> +The translation process occurs in several distinct passes:
> +
> +1. **Front-end**: Guest instructions are parsed (often using the
> +   `decodetree <Instruction Decoding (decodetree)_>`_ tool) and converted
> +   into target-independent TCG Intermediate Representation (IR) opcodes.
> +2. **Optimization**: TCG performs passes such as constant folding, liveness
> +   analysis, and dead code elimination on the IR.

Not all optimisation is done here, by the way. Some of the front-end ops
select operations based on TCG_TARGET_HAS_ before we get to the
optimisation pass.
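
As a rough illustration of that kind of front-end selection, here is a
standalone model. It is not QEMU code: every model_* name and the
MODEL_TARGET_HAS_ANDC macro are made up for the example, standing in
for a TCG_TARGET_HAS_-style capability flag. The helper emits a single
op when the (hypothetical) backend advertises it, and otherwise expands
to two simpler ops before any optimisation pass has run.

```c
#include <stddef.h>

/* Hypothetical capability flag, standing in for a TCG_TARGET_HAS_* macro. */
#ifndef MODEL_TARGET_HAS_ANDC
#define MODEL_TARGET_HAS_ANDC 0
#endif

enum model_opc { MODEL_OP_AND, MODEL_OP_NOT, MODEL_OP_ANDC };

struct model_op {
    enum model_opc opc;
    int dst, a, b;              /* temp indices; b is -1 when unused */
};

struct model_ctx {
    struct model_op ops[16];    /* emitted op stream */
    size_t nb_ops;
    int next_temp;              /* next free temporary index */
};

static void model_emit(struct model_ctx *s, enum model_opc opc,
                       int dst, int a, int b)
{
    if (s->nb_ops < 16) {
        s->ops[s->nb_ops++] = (struct model_op){ opc, dst, a, b };
    }
}

/* Front-end helper for dst = a & ~b. */
static void model_gen_andc(struct model_ctx *s, int dst, int a, int b)
{
    if (MODEL_TARGET_HAS_ANDC) {
        /* Backend has the op: emit it directly. */
        model_emit(s, MODEL_OP_ANDC, dst, a, b);
    } else {
        /* Backend lacks it: expand to NOT + AND at emission time. */
        int t = s->next_temp++;
        model_emit(s, MODEL_OP_NOT, t, b, -1);
        model_emit(s, MODEL_OP_AND, dst, a, t);
    }
}
```

With the flag left at 0, one call to model_gen_andc() leaves two ops in
the stream, so the choice has already been made by the time a later
pass sees the IR.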

> +3. **Back-end**: The optimized IR is converted by a host-specific code
> +   generator into native instructions for the host CPU.
> +
>  The following sections outline some notable features and implementation
>  details of QEMU's dynamic translator.
>  
> @@ -44,6 +54,12 @@ translating it from the guest architecture if it isn’t already available
>  in memory. Then QEMU proceeds to execute this next TB, starting at the
>  prologue and then moving on to the translated instructions.
>  
> +In :ref:`MTTCG` mode, each guest CPU is emulated by a separate host thread.
> +TCG ensures memory consistency by inserting memory barrier (``mb``) opcodes
> +for guest instructions with ordering side effects. Direct block chaining
> +across page boundaries is restricted to ensure that changes to memory
> +mappings in one thread are correctly handled by others.
> +
>  Exiting from the TB this way will cause the ``cpu_exec_interrupt()``
>  callback to be re-evaluated before executing additional instructions.
>  It is mandatory to exit this way after any CPU state changes that may
> @@ -175,6 +191,12 @@ virtual to physical address translation is done at every memory
>  access.
>  
>  QEMU uses an address translation cache (TLB) to speed up the translation.
> +The software MMU partitions accesses into a **TLB fast-path** and a
> +**TLB slow-path**. The fast-path handles RAM and ROM areas, where the TLB
> +provides the direct offset between guest virtual addresses and host memory.
> +If an access does not match a fast-path entry, it falls through to the
> +slow-path, which calls C helper functions to handle MMIO device emulation.
> +
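
The fast-path/slow-path split described above can be sketched with a
small standalone model. This is only an illustration under simplifying
assumptions: the model_* names, the direct-mapped 256-entry table and
the 4 KiB page size are all invented here, and the slow path is a
placeholder that merely counts calls instead of emulating a device.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_BITS 12
#define MODEL_PAGE_MASK (~(((uint64_t)1 << MODEL_PAGE_BITS) - 1))
#define MODEL_TLB_SIZE  256

struct model_tlb_entry {
    uint64_t tag;       /* guest virtual page, UINT64_MAX when invalid */
    uintptr_t addend;   /* host address minus guest virtual address */
};

struct model_tlb {
    struct model_tlb_entry e[MODEL_TLB_SIZE];
    unsigned slow_calls;    /* how often the slow path was taken */
};

static struct model_tlb_entry *model_tlb_slot(struct model_tlb *tlb,
                                              uint64_t vaddr)
{
    return &tlb->e[(vaddr >> MODEL_PAGE_BITS) % MODEL_TLB_SIZE];
}

static void model_tlb_init(struct model_tlb *tlb)
{
    for (size_t i = 0; i < MODEL_TLB_SIZE; i++) {
        tlb->e[i].tag = UINT64_MAX;   /* never equals a page-aligned tag */
        tlb->e[i].addend = 0;
    }
    tlb->slow_calls = 0;
}

/* Install a RAM mapping so later accesses to vpage hit the fast path. */
static void model_tlb_fill_ram(struct model_tlb *tlb, uint64_t vpage,
                               uint8_t *host_page)
{
    struct model_tlb_entry *ent = model_tlb_slot(tlb, vpage);
    ent->tag = vpage & MODEL_PAGE_MASK;
    ent->addend = (uintptr_t)host_page - (uintptr_t)ent->tag;
}

/* Placeholder slow path: a real one would dispatch to device emulation. */
static uint8_t model_slow_load8(struct model_tlb *tlb, uint64_t vaddr)
{
    (void)vaddr;
    tlb->slow_calls++;
    return 0xff;
}

static uint8_t model_load8(struct model_tlb *tlb, uint64_t vaddr)
{
    struct model_tlb_entry *ent = model_tlb_slot(tlb, vaddr);

    if (ent->tag == (vaddr & MODEL_PAGE_MASK)) {
        /* Fast path: one compare, one add, one host load. */
        return *(const uint8_t *)((uintptr_t)vaddr + ent->addend);
    }
    /* No matching entry: fall through to the C helper. */
    return model_slow_load8(tlb, vaddr);
}
```

Storing an addend rather than a host pointer keeps the hit case down to
a single addition on the guest address, which is the point of the
"direct offset" wording in the paragraph.
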
>  In order to avoid flushing the translated code each time the MMU
>  mappings change, all caches in QEMU are physically indexed.  This
>  means that each basic block is indexed with its physical address.
> @@ -190,6 +212,73 @@ memory areas instead calls out to C code for device emulation.
>  Finally, the MMU helps tracking dirty pages and pages pointed to by
>  translation blocks.
>  
> +Register Allocation and Liveness
> +--------------------------------
> +
> +During the translation phase, guest instructions are converted into TCG IR
> +using an **unlimited number of temporaries (TEMPs)**.
> +This allows guest translators to express logic without being constrained
> +by the finite register set of the host CPU.
> +
> +To resolve these TEMPs into physical registers, TCG performs two passes:
> +
> +1. **Liveness Analysis**: This pass determines the "live range" of each
> +   temporary within a basic block. By identifying when a variable
> +   becomes "dead" (i.e., its value is no longer needed), TCG can suppress
> +   redundant moves and remove instructions that compute unused results.
> +2. **Register Allocation**: The Global Register Allocator maps live TEMPs
> +   to host physical registers. Fixed globals, such as the pointer
> +   to the CPU architecture state (``cpu_env``), are often permanently
> +   held in host registers to minimize memory traffic during execution.
> +
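
For readers new to liveness analysis, a backward walk over a block is
one textbook way to find the dead computations mentioned in step 1.
The sketch below is a standalone model, not the TCG implementation:
the model_* names are invented, every op is assumed to have exactly
one output and no side effects, and temps are plain indices.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_TEMPS 8

struct model_insn {
    int out;        /* temp written by this op */
    int in1, in2;   /* temps read, -1 when unused */
    bool dead;      /* set when the result is never needed */
};

/*
 * Walk the block from last op to first.  live_at_end[] says which temps
 * must still hold their value when the block exits.  Returns the number
 * of ops marked dead.
 */
static size_t model_liveness(struct model_insn *ops, size_t n,
                             const bool *live_at_end)
{
    bool live[MODEL_MAX_TEMPS];
    size_t removed = 0;

    for (size_t t = 0; t < MODEL_MAX_TEMPS; t++) {
        live[t] = live_at_end[t];
    }

    for (size_t i = n; i-- > 0; ) {
        struct model_insn *op = &ops[i];

        if (!live[op->out]) {
            /* Nobody reads the result later: the op can be dropped. */
            op->dead = true;
            removed++;
            continue;
        }
        op->dead = false;
        /* The output is defined here, so it is dead above this point... */
        live[op->out] = false;
        /* ...while the inputs must be live on entry to this op. */
        if (op->in1 >= 0) {
            live[op->in1] = true;
        }
        if (op->in2 >= 0) {
            live[op->in2] = true;
        }
    }
    return removed;
}
```

Clearing the output before setting the inputs matters for ops such as
"t1 = t1 + t2", where the same temp is both read and written.
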
> +Vector/SIMD Internal Strategy
> +-----------------------------
> +
> +TCG supports SIMD operations through a set of generic vector instructions
> +(e.g., ``add_vec``, ``shli_vec``) parameterized by vector length and element
> +size. The length is specified as a ``TCGType`` (V64, V128, or V256), and the
> +element size is given in log2 8-bit units.
> +
> +The internal strategy relies on the backend mapping these generic opcodes
> +to native host SIMD instructions, such as x86 AVX or ARM NEON. If the host
> +backend does not support a specific vector operation or length, TCG's
> +expansion layer automatically decomposes the opcode into smaller supported
> +vector sizes or standard integer operations.
> +
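
To make the "standard integer operations" fallback concrete, here is a
standalone sketch of a lane-wise add on a 128-bit value done with two
64-bit integer adds. It uses the common mask trick of adding the low
bits of each lane and patching the top bit back in with XOR. The
model_* names are invented, and whether a particular expansion takes
this route depends on the op and the host, so treat it as an
illustration rather than a description of the actual expander.

```c
#include <stdint.h>

struct model_v128 {
    uint64_t lo, hi;
};

/* Top bit of every (8 << vece)-bit lane in a 64-bit word, vece in 0..3. */
static uint64_t model_lane_msb_mask(unsigned vece)
{
    static const uint64_t masks[4] = {
        0x8080808080808080ull,  /* 8-bit lanes  */
        0x8000800080008000ull,  /* 16-bit lanes */
        0x8000000080000000ull,  /* 32-bit lanes */
        0x8000000000000000ull,  /* 64-bit lane  */
    };
    return masks[vece & 3];
}

/* Add all lanes of two 64-bit words without carries crossing lanes. */
static uint64_t model_add_lanes_i64(uint64_t a, uint64_t b, unsigned vece)
{
    uint64_t m = model_lane_msb_mask(vece);
    /* With the top bit of each lane cleared, a carry cannot escape. */
    uint64_t sum = (a & ~m) + (b & ~m);
    /* Top bit of each lane is a ^ b ^ carry-in; sum holds the carry-in. */
    return sum ^ ((a ^ b) & m);
}

/* A 128-bit vector add decomposed into two 64-bit integer operations. */
static struct model_v128 model_add_v128(struct model_v128 a,
                                        struct model_v128 b,
                                        unsigned vece)
{
    struct model_v128 r;
    r.lo = model_add_lanes_i64(a.lo, b.lo, vece);
    r.hi = model_add_lanes_i64(a.hi, b.hi, vece);
    return r;
}
```

Here vece plays the role of the log2 element size from the paragraph
above: 0 selects 8-bit lanes and 3 a single 64-bit lane per word.
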
> +Deterministic Execution (icount)
> +--------------------------------
> +
> +The :ref:`icount` mechanism provides deterministic execution by ensuring
> +that each Translation Block executes a fixed number of instructions. This
> +is essential for features like record/replay and deterministic virtual time,
> +where instruction counts serve as the system clock.
> +
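
One way to picture instruction counts acting as a clock is a budget
that each block charges on entry. The model below is a sketch under
that assumption only. The model_* names and fields are invented, and
the real mechanism has more to it (I/O handling, recompilation of
blocks, timers) than this shows.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_cpu {
    int32_t budget;     /* instructions allowed before the next deadline */
    uint64_t executed;  /* total guest instructions run so far */
};

struct model_tb {
    int32_t icount;     /* guest instructions in this block, known at
                           translation time */
};

/*
 * Charge a block against the budget before running it.  Returns false
 * when the block would overrun, in which case the caller is expected to
 * return to the main loop (to handle timers, refill the budget, ...).
 */
static bool model_tb_enter(struct model_cpu *cpu, const struct model_tb *tb)
{
    if (cpu->budget < tb->icount) {
        return false;
    }
    cpu->budget -= tb->icount;
    cpu->executed += (uint64_t)tb->icount;
    return true;
}
```

Because the count per block is fixed when the block is translated, the
same guest code charges the same amount on every run, which is what
lets the executed total stand in for time.
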
> +Instrumentation and Plugins
> +---------------------------
> +
> +:ref:`TCG Plugins` provide a mechanism for runtime instrumentation. Opcodes
> +like ``plugin_cb`` and ``plugin_mem_cb`` are inserted during translation to
> +trigger callbacks in external modules, allowing analysis of instruction
> +execution or memory access.
> +
> +Instruction Decoding (decodetree)
> +---------------------------------
> +
> +The first step of the translation process is converting a raw bitstream of
> +guest instructions into a structured format that the translator can process.
> +QEMU simplifies this using the ``decodetree.py`` script, which generates C
> +code decoders from a domain-specific language defined in ``.decode`` files.
> +
> +The decodetree tool allows developers to define instruction **patterns**
> +based on a bitmask and fixed bits. When a match is found, the generated
> +decoder automatically extracts defined **fields** (such as registers or
> +immediates) and passes them to a manually written translation function.
> +
> +This declarative approach drastically reduces the amount of error-prone
> +manual bit-shifting and nested "if-else" logic required in guest translators.
> +
> +For implementation details, see :ref:`decodetree`.
> +
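
A hand-written equivalent of what such a decoder does may help readers
who have not seen the generated code. The sketch below matches one
pattern by mask and fixed bits, extracts three fields and calls a
translation function. All model_* names are invented; the single entry
uses the RISC-V ADDI encoding purely as a familiar example; and a table
walk is used for brevity, which is not necessarily how the generated C
is structured.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Argument set handed to the translation function. */
struct model_args_rri {
    int rd, rs1, imm;
};

typedef bool (*model_trans_fn)(void *ctx, const struct model_args_rri *a);

struct model_pattern {
    uint32_t mask;      /* which bits of the insn are fixed */
    uint32_t fixed;     /* the value those bits must have */
    model_trans_fn trans;
};

/* Unsigned bitfield extract; len must be in 1..31. */
static uint32_t model_extract32(uint32_t v, int start, int len)
{
    return (v >> start) & ((1u << len) - 1);
}

/* Signed bitfield extract, written without shifting negative values. */
static int32_t model_sextract32(uint32_t v, int start, int len)
{
    uint32_t field = model_extract32(v, start, len);
    uint32_t sign = 1u << (len - 1);
    return (int32_t)(field ^ sign) - (int32_t)sign;
}

/* Stand-in translation function: records the decoded arguments in ctx. */
static bool model_trans_addi(void *ctx, const struct model_args_rri *a)
{
    *(struct model_args_rri *)ctx = *a;
    return true;
}

static bool model_decode(void *ctx, uint32_t insn)
{
    /* opcode[6:0] = 0x13 and funct3[14:12] = 0 select ADDI. */
    static const struct model_pattern table[] = {
        { 0x0000707fu, 0x00000013u, model_trans_addi },
    };

    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if ((insn & table[i].mask) == table[i].fixed) {
            struct model_args_rri a;
            a.rd  = (int)model_extract32(insn, 7, 5);
            a.rs1 = (int)model_extract32(insn, 15, 5);
            a.imm = model_sextract32(insn, 20, 12);
            return table[i].trans(ctx, &a);
        }
    }
    return false;   /* no pattern matched: illegal instruction */
}
```

Feeding it 0xffd30293 (addi x5, x6, -3) yields rd = 5, rs1 = 6 and
imm = -3, while an all-zero word matches nothing and is rejected.
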
>  Profiling JITted code
>  ---------------------

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
