Am 21.05.2026 um 16:18 hat Fiona Ebner geschrieben:
> Am 21.05.26 um 3:46 PM schrieb Kevin Wolf:
> > Am 21.05.2026 um 14:12 hat Fiona Ebner geschrieben:
> >> Am 27.04.26 um 7:04 PM schrieb Kevin Wolf:
> >> I'm still trying to figure things out and come up with a better
> >> reproducer, but wanted to let you know early, also because of the
> >> upcoming stable releases. Of course, I'd also be happy for hints/hunches
> >> and am happy to test suggestions!
> > 
> > Do you have any information about the options used with the image file?
> > In particular, is it using subclusters? Maybe just the 'qemu-img info'
> > output would already give a bit more context.
> 
> No subclusters if I'm not missing anything. When I created the image the
> output was:
> 
> Formatting '/mnt/pve/dir/images/300/vm-300-disk-0.qcow2', fmt=qcow2
> cluster_size=65536 extended_l2=off preallocation=metadata
> compression_type=zlib size=4510973952 lazy_refcounts=off refcount_bits=16
> 
> Our management layer doesn't log the command itself, but doing the same
> operation with logging added (and 301 instead of 300):
> 
> /usr/bin/qemu-img create -o preallocation=metadata -f qcow2
> /mnt/pve/dir/images/301/vm-301-disk-0.qcow2 4405248K
> 
> qemu-img info gives:
> [...]

Ok, looks like all default options.

> > Could you already locate the actual corruption and check what the
> > pattern looks like? Something like zeros where we would expect data or
> > the other way around? Or something less clear? (If you don't know,
> > that's a good answer too. I know well that this kind of things is hard
> > to debug.)
> 
> Unfortunately not. I can only see the symptom of memory swapped back in
> being corrupt (at least that's what happens AFAIU), leading to segfaults
> in various processes as well as issues with heap allocations, e.g.:
> corrupted double-linked list
> free(): invalid pointer
> 
> I'll write a small program which allocates memory with a fixed pattern
> and regularly dumps it, maybe that works to get an idea about the
> corruption.

AI suggests a scenario that looks like a real bug to me, though I'm not
sure if it's yours. See the reproducer below.

Basically it boils down to a non-allocating write being in flight to a
cluster that is concurrently discarded, turning the write essentially
into a host-cluster use-after-free. If you then allocate a new cluster
at the same time, the host cluster will be reused and the write that was
for a different guest cluster still writes to it.

I'm not completely sure yet what the right synchronisation mechanism
would be for this.

Anyway, as it depends on a specific pattern of discard and cluster
allocation happening while a write request is in flight, it should be
possible to use tracing to find out if anything like that is happening
in your case.

Kevin


blkdebug.conf:

[set-state]
state = "1"
event = "write_aio"
new_state = "2"

[set-state]
state = "2"
event = "cluster_alloc"
new_state = "3"


race_test.sh:

#!/bin/bash
#
# Reproducer for the wait_for_dependencies / skip_cow race in
# qcow2_subcluster_zeroize — demonstrating data corruption at an
# UNRELATED guest offset through host cluster reuse.
#
# The scenario:
#   1. Write A to a ZERO_ALLOC cluster creates l2meta. Data I/O suspended.
#   2. Write B to same cluster waits for A. Zero-write also waits for A.
#   3. A completes (cluster → NORMAL). B wakes first (FIFO), gets
#      skip_cow=true (no l2meta), starts data I/O — suspended by blkdebug.
#      Zero-write wakes, finds no deps (B invisible), frees cluster.
#   4. Write D to a DIFFERENT guest offset allocates the freed cluster.
#      D writes its data. D completes.
#   5. B resumes and writes to the same physical cluster, overwriting D.
#   6. Reading D's guest offset returns B's data. CORRUPTION.

set -e

DIR="$(cd "$(dirname "$0")" && pwd)"
QEMU_IO="${DIR}/../build/qemu-io"
QEMU_IMG="${DIR}/../build/qemu-img"
TEST_IMG="/tmp/race_test_$$.qcow2"
BLKDEBUG_CONF="${DIR}/blkdebug.conf"
LOG="/home/cursor/qemu/debug-8a8071.log"

cleanup() {
    rm -f "$TEST_IMG"
}
trap cleanup EXIT

echo "=== Creating test image ==="
"$QEMU_IMG" create -f qcow2 "$TEST_IMG" 1M

echo ""
echo "=== Preparing ZERO_ALLOC cluster at guest offset 0 ==="
"$QEMU_IO" -c "write -P 0x11 0 64k" \
            -c "write -z 0 64k" \
            "$TEST_IMG"

echo ""
echo "=== Running race reproducer ==="
#
# blkdebug.conf state machine:
#   State 1 --(write_aio)--> State 2 --(cluster_alloc)--> State 3
#
# - State 1: tagA breakpoint catches write A
# - State 2: tagB breakpoint catches write B (skip_cow write)
# - State 2→3 transition on cluster_alloc: D's allocation transitions
#   state to 3 BEFORE D fires write_aio, so D is NOT caught by tagB
#
# Sequence:
#   break write_aio tagA          -- breakpoint for state 1
#   aio_write A 0xAA 0 64k       -- suspended at tagA (state 1→2)
#   wait_break tagA
#   break write_aio tagB          -- breakpoint for state 2
#   aio_write B 0xBB 0 64k       -- waits for A (handle_dependencies)
#   aio_write -z -u 0 64k        -- waits for A (wait_for_dependencies)
#   resume tagA                   -- A completes. B wakes (skip_cow),
#                                    caught by tagB. Zero-write frees
#                                    cluster.
#   wait_break tagB               -- B suspended, cluster freed
#   write D 0xDD 64k 64k         -- D allocates the freed cluster
#                                    (cluster_alloc transitions to
#                                    state 3). D's write_aio fires at
#                                    state 3 — no breakpoint. D writes
#                                    its data and completes.
#   resume tagB                   -- B writes to the SAME physical
#                                    cluster, overwriting D's data
#   aio_flush
#
#   read -P 0xDD 64k 64k         -- EXPECTS D's data (0xDD)
#                                    GETS B's data (0xBB) → CORRUPTION

QEMU_IO_OUTPUT=$("$QEMU_IO" \
    -c "break write_aio tagA" \
    -c "aio_write -P 0xAA 0 64k" \
    -c "wait_break tagA" \
    -c "break write_aio tagB" \
    -c "aio_write -P 0xBB 0 64k" \
    -c "aio_write -z -u 0 64k" \
    -c "resume tagA" \
    -c "wait_break tagB" \
    -c "write -P 0xDD 64k 64k" \
    -c "resume tagB" \
    -c "aio_flush" \
    -c "read -vP 0xDD 64k 512" \
    -c "read -vP 0 0 512" \
    "blkdebug:${BLKDEBUG_CONF}:${TEST_IMG}" 2>&1) || true

echo "$QEMU_IO_OUTPUT"

PATTERN_FAIL=$(echo "$QEMU_IO_OUTPUT" | grep -c "Pattern verification failed" 
|| true)
if [ "$PATTERN_FAIL" -gt 0 ]; then
    echo ""
    echo "*** DATA CORRUPTION DETECTED at guest offset 64K ***"
    echo "*** D wrote 0xDD, but reading returns B's data (0xBB)."
    echo "*** B's write to the freed+reallocated cluster corrupted"
    echo "*** an UNRELATED guest address."
fi

echo ""
echo "=== Checking image integrity (metadata) ==="
"$QEMU_IMG" check "$TEST_IMG" || true

echo ""
echo "=== Allocation map ==="
"$QEMU_IMG" map --output=json "$TEST_IMG"

echo ""
echo "=== Checking log for race evidence ==="
if [ -f "$LOG" ]; then
    echo "--- Log entries (chronological) ---"
    cat "$LOG"
    echo ""
    echo "--- Race analysis ---"
    # Extract host offsets for the skip_cow write (B) and D's write
    B_HOST=$(grep '"has_l2meta":0' "$LOG" | grep -o '"host_offset":[0-9]*' | 
head -1 | grep -o '[0-9]*')
    D_HOST=$(grep '"offset":65536' "$LOG" | grep -o '"host_offset":[0-9]*' | 
head -1 | grep -o '[0-9]*')

    echo "Write B (skip_cow, no l2meta) host_offset: $B_HOST"
    echo "Write D (different guest offset)  host_offset: $D_HOST"

    if [ -n "$B_HOST" ] && [ -n "$D_HOST" ] && [ "$B_HOST" = "$D_HOST" ]; then
        echo ""
        echo "*** CLUSTER REUSE CONFIRMED: B and D write to the same"
        echo "*** physical cluster ($B_HOST) for different guest offsets."
        echo "*** B (guest offset 0) overwrites D (guest offset 64K)."
        echo "*** Reading guest offset 64K returns B's data → CORRUPTION"
        echo "*** at an unrelated guest address."
    fi
else
    echo "No log file found at $LOG"
fi


Reply via email to