On 3/19/26 3:29 PM, Ruslan Ruslichenko wrote:
On Thu, Mar 19, 2026 at 8:04 PM Pierrick Bouvier
<[email protected]> wrote:

On 3/19/26 11:20 AM, Ruslan Ruslichenko wrote:
Hi Pierrick,

Thank you for the feedback and review!

Our current plan is to put this plugin through our internal workflows to gather
more data on its limitations and performance.
Based on results, we may consider extending or refining the implementation
in the future.

Any further feedback on potential issues is highly appreciated.


By design, the approach of modifying QEMU internals to allow injecting an
IRQ, setting a timer, or triggering SMMU faults has very little chance of
being integrated as is. At the very least, it should be discussed with the
relevant maintainers, to see whether they would be open to it or not.

It's not wrong in itself, if you want a downstream solution, but it does
not scale upstream if we have to consider and accept everyone's needs.
The plugin API itself can carry the burden for such things, but it's
harder to justify for internal code.

I believe it would be better to rely on ad hoc devices generating these,
with the advantage that even if they don't get accepted upstream, they
will be easier for you to maintain downstream than more intrusive patches.

On Wed, Mar 18, 2026 at 6:16 PM Pierrick Bouvier
<[email protected]> wrote:

Hi Ruslan,

On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
From: Ruslan Ruslichenko <[email protected]>

This patch series is submitted as an RFC to gather early feedback on a Fault 
Injection (FI) framework built on top of the QEMU TCG plugin subsystem.

Motivation

Testing guest operating systems, hypervisors (like Xen), and low-level drivers 
against unexpected hardware failures can be difficult.
This series provides an interface to inject faults dynamically without altering 
QEMU's core emulation source code for every test case.

Architecture & Key Features

The series introduces the core API extensions and implements a fault injection 
plugin (contrib/plugins/fault_injection.c) targeting AArch64.
The plugin can be controlled statically via XML configurations on boot, or 
dynamically at runtime via a UNIX socket (enabling integration with automated 
testing frameworks via Python or GDB).

New Plugin API Capabilities:

- MMIO Interception: Allows plugins to hook into
  memory_region_dispatch_read/write to modify hardware register reads or
  drop writes.
- Asynchronous Timers: Exposes QEMU_CLOCK_VIRTUAL to plugins, allowing
  callbacks to be scheduled based on guest virtual time.
- TB Cache Flushing: Exposes qemu_plugin_flush_tb_cache() so plugins can
  force re-translation when applying dynamic PC-based hooks.
- Interrupt & Exception Injection: Exposes APIs to raise/pulse hardware
  IRQs on the primary INTC and inject CPU exceptions (e.g., SErrors).
- Custom Device Faults: Introduces a registry where device models (e.g.,
  SMMUv3) can expose specific fault handlers (like CMDQ errors) to be
  triggered externally by plugins.

Patch Summary
Patch 1 (target/arm): Adds support for asynchronous CPU exception injection.
Patch 2-3 (plugins/api): Exposes virtual clock timers and TB cache flushing to 
the public plugin API.
Patch 4 (plugins): Introduces the core fault injection subsystem, IRQ/Exception 
routing, and the Custom Fault registry.
Patch 5 (system/memory): Adds the MMIO override hooks into the memory dispatch 
path.
Patch 6 (hw/intc): Registers the ARM GIC (v2/v3) with the plugin subsystem to 
enable direct hardware IRQ injection.
Patch 7 (hw/arm): Registers the SMMUv3 with the custom fault registry to 
demonstrate how device models can expose specific errors (like CMDQ faults) to 
plugins.
Patch 8 (contrib/plugins): Implements the actual fault_injection plugin using 
the new APIs.
Patch 9 (docs): Adds documentation and usage examples for the plugin.

Request for Comments & Feedback

Any suggestions on improvements, potential edge cases, or issues with the 
current design are highly welcome.

Ruslan Ruslichenko (9):
     target/arm: Add API for dynamic exception injection
     plugins/api: Expose virtual clock timers to plugins
     plugins: Expose Translation Block cache flush API to plugins
     plugins: Introduce fault injection API and core subsystem
     system/memory: Add plugin callbacks to intercept MMIO accesses
     hw/intc/arm_gic: Register primary GIC for plugin IRQ injection
     hw/arm/smmuv3: Add plugin fault handler for CMDQ errors
     contrib/plugins: Add fault injection plugin
     docs: Add description of fault-injection plugin and subsystem

    contrib/plugins/fault_injection.c | 772 ++++++++++++++++++++++++++++++
    contrib/plugins/meson.build       |   1 +
    docs/fault-injection.txt          | 111 +++++
    hw/arm/smmuv3.c                   |  54 +++
    hw/intc/arm_gic.c                 |  28 ++
    hw/intc/arm_gicv3.c               |  28 ++
    include/plugins/qemu-plugin.h     |  28 ++
    include/qemu/plugin.h             |  39 ++
    plugins/api.c                     |  62 +++
    plugins/core.c                    |  11 +
    plugins/fault.c                   | 116 +++++
    plugins/meson.build               |   1 +
    plugins/plugin.h                  |   2 +
    system/memory.c                   |   8 +
    target/arm/cpu.h                  |   4 +
    target/arm/helper.c               |  55 +++
    16 files changed, 1320 insertions(+)
    create mode 100644 contrib/plugins/fault_injection.c
    create mode 100644 docs/fault-injection.txt
    create mode 100644 plugins/fault.c


first, thanks for posting your series!

About the general approach.
As you noticed, this is exposing a lot of QEMU internals, which is
something we tend to avoid doing. It is also very architecture-specific,
which is another pattern we try to avoid.

For some of your needs (especially IRQ injection and timer injection),
did you consider writing a custom ad hoc device and timer generating those?
There is nothing preventing you from writing a plugin that communicates
with this specific device (through a socket, for instance) to request
specific injections. I feel that this would scale better than exposing all
of it through the QEMU plugin API.

For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu
test device, used with qtest to unit-test the SMMU implementation. We
could maybe leverage that on a full machine, combined with the
communication method mentioned above, to generate specific operations at
runtime, all triggered via a plugin.

Exposing qemu_plugin_flush_tb_cache is a hint that we are missing something
on the QEMU side. Better to fix that than to expose this very internal
function.

The reason this was needed is that the plugin may receive PC trigger
configuration dynamically and needs to register an instruction callback at
runtime. If the TB for that PC is already translated and cached, our newly
registered callback might not be executed.

If there is a more proper way to force QEMU to re-translate a specific TB,
or to attach a callback to a cached TB, that would be great and would
reduce the complexity here.


I understand better. The current QEMU plugin implementation is too limited
for this: everything has to be done/known at translation time.
What is your use case for receiving a PC trigger after translation? Do you
have some mechanism to communicate with the plugin for this?

Yes, exactly. If the guest has already executed the target code, the newly
added trigger will be ignored, as the TB is cached.

For runtime configuration, the plugin spawns a background thread that
listens on a socket. An external Python test script connects to this
socket to send dynamically generated XML fault descriptions.
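As a rough illustration of that runtime path, here is a self-contained
Python sketch of an external test script pushing an XML fault description
to such a listener over a UNIX socket. The socket path, XML element and
attribute names, and framing are illustrative assumptions, not the actual
format used by the series, and the listener here merely stands in for the
plugin's background thread:

```python
import os
import socket
import tempfile
import threading
import xml.etree.ElementTree as ET


def make_fault_xml(fault_type: str, pc: int) -> bytes:
    """Build a fault description. Element/attribute names are made up
    for illustration; the series defines its own XML schema."""
    root = ET.Element("fault", type=fault_type, trigger_pc=hex(pc))
    return ET.tostring(root)


def plugin_listener(srv: socket.socket, received: list) -> None:
    """Stand-in for the plugin's background listener thread:
    accept one connection and record the payload it delivers."""
    conn, _ = srv.accept()
    received.append(conn.recv(4096))
    conn.close()


def send_fault(path: str, payload: bytes) -> None:
    """What an external test script would do: connect and send the XML."""
    cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    cli.connect(path)
    cli.sendall(payload)
    cli.close()


if __name__ == "__main__":
    # Bind/listen before starting the thread so the client cannot race it.
    sock_path = os.path.join(tempfile.mkdtemp(), "fi.sock")
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(sock_path)
    srv.listen(1)

    received = []
    t = threading.Thread(target=plugin_listener, args=(srv, received))
    t.start()
    send_fault(sock_path, make_fault_xml("serror", 0xFFFF000010000000))
    t.join()
    srv.close()
    assert received and ET.fromstring(received[0]).tag == "fault"
```

In the real setup the listener side lives inside the fault_injection
plugin, which parses the XML and arms the corresponding trigger.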


Ok.

Internally, we have tb_invalidate_phys_range, which invalidates the TBs in a given range. It is called when a write lands on memory holding code at a given address.

Thus, from your plugin, if you write to the PC address with
qemu_plugin_write_memory_vaddr, it should trigger a re-translation of that TB. You'll need to read 1 byte and write it back. This should also be more efficient, since you will only invalidate that single TB.

Give it a try and let us know if it works for your needs.

There are several scenarios where this might be needed, mainly for faults
that are difficult to define statically at boot time.
Examples include injecting faults after a specific chain of events, and
freezing or overriding system register values at specific execution points
(since this is currently implemented via PC triggers). Supporting
environments with KASLR enabled might be one more case.


For system registers, you can (heavy, but it would work) unconditionally instrument all instructions that touch those registers, so there would be no need to flush anything. System registers are not accessed by every instruction, so hopefully it should not impact execution time too much.

With both solutions, it should remove the need to expose tb_flush through the plugin API.


The associated TRIGGER_ON_PC is very similar to existing inline
operations. They could be enhanced to support writing to a given register;
all the building blocks are there. For TRIGGER_ON_SYSREG it's a bit more
complex, but we might also enhance inline operations to support hooks on
specific register writes.

TRIGGER_ON_PC may also be used for generating other faults. For example,
one use case is to trigger CPU exceptions on specific instructions.
Supporting TRIGGER_ON_SYSREG as an inline operation sounds like a really
interesting direction to explore.


In general, having inline operation support on register reads/writes
would be a very nice thing to have (though it might be tricky to implement
correctly), and more efficient than the existing approach, which requires
checking their value every time.


For MMIO override, the current approach you have is good, and it's
definitely something we could integrate.

What are your thoughts about this? (especially the device-based approach,
in case you maybe tried that first)

I agree such an approach can work well for IRQs and timers, and would be a
cleaner way to implement this.

However, for SMMU and similar cases, triggering internal state errors is
not easy and requires access to internal logic. So for those specific
cases, a different approach may be needed.


Hence the iommu-testdev I mentioned, which could be extended to support this.


Regards,
Pierrick

BR,
Ruslan

Regards,
Pierrick

