tcg: Expand on multi-threaded TCG

Philippe Mathieu-Daudé Thu, 28 May 2026 01:21:08 -0700

Significantly expand the TCG documentation to provide a more
comprehensive overview of its internal architecture.


Use more rST anchors to improve cross-referencing across the
documentation.

Clarify front-end / optimization / back-end phases.

Briefly describe memory consistency barriers in MTTCG mode.

Add the following new sections:

 - Register Allocation and Liveness analysis
 - Overview of the Vector/SIMD internal strategy
 - Deterministic Execution (icount)
 - TCG Plugins
 - Instruction Decoding with decodetree

AI-used-for: docs
Signed-off-by: Philippe Mathieu-Daudé <[email protected]>
---
Based-on: <[email protected]>
---
 docs/devel/multi-thread-tcg.rst |  2 +-
 docs/devel/tcg-icount.rst       |  1 +
 docs/devel/tcg.rst              | 89 +++++++++++++++++++++++++++++++++
 3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/docs/devel/multi-thread-tcg.rst b/docs/devel/multi-thread-tcg.rst
index da9a1530c9f..aa0b11ab360 100644
--- a/docs/devel/multi-thread-tcg.rst
+++ b/docs/devel/multi-thread-tcg.rst
@@ -4,7 +4,7 @@
   This work is licensed under the terms of the GNU GPL, version 2 or
   later. See the COPYING file in the top-level directory.
 
-.. _mttcg:
+.. _MTTCG:
 
 ==================
 Multi-threaded TCG
diff --git a/docs/devel/tcg-icount.rst b/docs/devel/tcg-icount.rst
index a1dcd79e0fd..848c19a746f 100644
--- a/docs/devel/tcg-icount.rst
+++ b/docs/devel/tcg-icount.rst
@@ -2,6 +2,7 @@
    Copyright (c) 2020, Linaro Limited
    Written by Alex Bennée
 
+.. _icount:
 
 ========================
 TCG Instruction Counting
diff --git a/docs/devel/tcg.rst b/docs/devel/tcg.rst
index 2786f2f6791..9af06018f6a 100644
--- a/docs/devel/tcg.rst
+++ b/docs/devel/tcg.rst
@@ -13,6 +13,16 @@ performances.
 QEMU's dynamic translation backend is called TCG, for "Tiny Code
 Generator". For more information, please take a look at :ref:`tcg-ops-ref`.
 
+The translation process occurs in several distinct passes:
+
+1. **Front-end**: Guest instructions are parsed (often using the
+   `decodetree <Instruction Decoding (decodetree)_>`_ tool) and converted
+   into target-independent TCG Intermediate Representation (IR) opcodes.
+2. **Optimization**: TCG performs passes such as constant folding, liveness
+   analysis, and dead code elimination on the IR.
+3. **Back-end**: The optimized IR is converted by a host-specific code
+   generator into native instructions for the host CPU.
+
 The following sections outline some notable features and implementation
 details of QEMU's dynamic translator.
 
@@ -44,6 +54,12 @@ translating it from the guest architecture if it isn’t already available
 in memory. Then QEMU proceeds to execute this next TB, starting at the
 prologue and then moving on to the translated instructions.
 
+In :ref:`MTTCG` mode, each guest CPU is emulated by a separate host thread.
+TCG ensures memory consistency by inserting memory barrier (``mb``) opcodes
+for guest instructions with ordering side effects. Direct block chaining
+across page boundaries is restricted to ensure that changes to memory
+mappings in one thread are correctly handled by others.
+
 Exiting from the TB this way will cause the ``cpu_exec_interrupt()``
 callback to be re-evaluated before executing additional instructions.
 It is mandatory to exit this way after any CPU state changes that may
@@ -175,6 +191,12 @@ virtual to physical address translation is done at every memory
 access.
 
 QEMU uses an address translation cache (TLB) to speed up the translation.
+The software MMU partitions accesses into a **TLB fast-path** and a
+**TLB slow-path**. The fast-path handles RAM and ROM areas, where the TLB
+provides the direct offset between guest virtual addresses and host memory.
+If an access does not match a fast-path entry, it falls through to the
+slow-path, which calls C helper functions to handle MMIO device emulation.
+
 In order to avoid flushing the translated code each time the MMU
 mappings change, all caches in QEMU are physically indexed.  This
 means that each basic block is indexed with its physical address.
@@ -190,6 +212,73 @@ memory areas instead calls out to C code for device emulation.
 Finally, the MMU helps tracking dirty pages and pages pointed to by
 translation blocks.
 
+Register Allocation and Liveness
+--------------------------------
+
+During the translation phase, guest instructions are converted into TCG IR
+using a **virtually unlimited number of temporaries (TEMPs)**.
+This allows guest translators to express logic without being constrained
+by the finite register set of the host CPU.
+
+To resolve these TEMPs into physical registers, TCG performs two passes:
+
+1. **Liveness Analysis**: This pass determines the "live range" of each
+   temporary within a basic block. By identifying when a variable
+   becomes "dead" (i.e., its value is no longer needed), TCG can suppress
+   redundant moves and remove instructions that compute unused results.
+2. **Register Allocation**: The Global Register Allocator maps live TEMPs
+   to host physical registers. Fixed globals, such as the pointer
+   to the CPU architecture state (``tcg_env``), are often permanently
+   held in host registers to minimize memory traffic during execution.
+
+Vector/SIMD Internal Strategy
+-----------------------------
+
+TCG supports SIMD operations through a set of generic vector instructions
+(e.g., ``add_vec``, ``shli_vec``) parameterized by vector length and element
+size. The length is specified as a ``TCGType`` (V64, V128, or V256), and the
+element size is given as log2 of its size in bytes (0 for 8-bit elements).
+
+The internal strategy relies on the backend mapping these generic opcodes
+to native host SIMD instructions, such as x86 AVX or ARM NEON. If the host
+backend does not support a specific vector operation or length, TCG's
+expansion layer automatically decomposes the opcode into smaller supported
+vector sizes or standard integer operations.
+
+Deterministic Execution (icount)
+--------------------------------
+
+The :ref:`icount` mechanism provides deterministic execution by counting
+executed guest instructions: each Translation Block deducts its instruction
+count from a budget and exits once it is exhausted. This is essential for
+record/replay and deterministic virtual time, where counts serve as the clock.
+
+Instrumentation and Plugins
+---------------------------
+
+:ref:`TCG Plugins` provide a mechanism for runtime instrumentation. Opcodes
+like ``plugin_cb`` and ``plugin_mem_cb`` are inserted during translation to
+trigger callbacks in external modules, allowing analysis of instruction
+execution or memory access.
+
+Instruction Decoding (decodetree)
+---------------------------------
+
+The first step of the translation process is converting a raw bitstream of
+guest instructions into a structured format that the translator can process.
+QEMU simplifies this using the ``decodetree.py`` script, which generates C
+code decoders from a domain-specific language defined in ``.decode`` files.
+
+The decodetree tool allows developers to define instruction **patterns**
+based on a bitmask and fixed bits. When a match is found, the generated
+decoder automatically extracts defined **fields** (such as registers or
+immediates) and passes them to a manually written translation function.
+
+This declarative approach drastically reduces the amount of error-prone
+manual bit-shifting and nested "if-else" logic required in guest translators.
+
+For implementation details, see :ref:`decodetree`.
+
 Profiling JITted code
 ---------------------
 
-- 
2.53.0
