On 21.05.2026 at 16:18, Fiona Ebner wrote:
> On 21.05.26 at 3:46 PM, Kevin Wolf wrote:
> > On 21.05.2026 at 14:12, Fiona Ebner wrote:
> >> On 27.04.26 at 7:04 PM, Kevin Wolf wrote:
> >> I'm still trying to figure things out and come up with a better
> >> reproducer, but wanted to let you know early, also because of the
> >> upcoming stable releases. Of course, I'd also be happy for hints/hunches
> >> and am happy to test suggestions!
> >
> > Do you have any information about the options used with the image file?
> > In particular, is it using subclusters? Maybe just the 'qemu-img info'
> > output would already give a bit more context.
>
> No subclusters if I'm not missing anything. When I created the image the
> output was:
>
> Formatting '/mnt/pve/dir/images/300/vm-300-disk-0.qcow2', fmt=qcow2
> cluster_size=65536 extended_l2=off preallocation=metadata
> compression_type=zlib size=4510973952 lazy_refcounts=off refcount_bits=16
>
> Our management layer doesn't log the command itself, but doing the same
> operation with logging added (and 301 instead of 300):
>
> /usr/bin/qemu-img create -o preallocation=metadata -f qcow2
> /mnt/pve/dir/images/301/vm-301-disk-0.qcow2 4405248K
>
> qemu-img info gives:
> [...]

OK, looks like all default options.

> > Could you already locate the actual corruption and check what the
> > pattern looks like? Something like zeros where we would expect data or
> > the other way around? Or something less clear? (If you don't know,
> > that's a good answer too. I know well that this kind of things is hard
> > to debug.)
>
> Unfortunately not. I can only see the symptom of memory swapped back in
> being corrupt (at least that's what happens AFAIU), leading to segfaults
> in various processes as well as issues with heap allocations, e.g.:
> corrupted double-linked list
> free(): invalid pointer
>
> I'll write a small program which allocates memory with a fixed pattern
> and regularly dumps it, maybe that works to get an idea about the
> corruption.

AI suggests a scenario that looks like a real bug to me, though I'm not
sure if it's yours. See the reproducer below.

Basically it boils down to a non-allocating write being in flight to a
cluster that is concurrently discarded, turning the write essentially
into a host-cluster use-after-free. If a new cluster is then allocated
while that write is still in flight, the freed host cluster can be
reused, and the stale write, which was meant for a different guest
cluster, still lands on it.
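
To make the ordering concrete, here is a toy model of that sequence in
bash. It is only a sketch and not QEMU code: the arrays and function
names (l2, host, free_list, begin_write and so on) are made up for
illustration, and one array slot stands in for one cluster.

```shell
#!/bin/bash
# Toy model only: none of these names exist in QEMU.
l2=()         # l2[guest_cluster] = host cluster index (unset = unallocated)
host=()       # host[host_cluster] = fill byte currently stored there
free_list=()  # freed host clusters, reused before growing the image
next_host=0   # next never-used host cluster

alloc_cluster() {     # map guest cluster $1, preferring a freed host cluster
    local g=$1 h n=${#free_list[@]}
    if [ "$n" -gt 0 ]; then
        h=${free_list[$((n - 1))]}
        unset "free_list[$((n - 1))]"
    else
        h=$next_host
        next_host=$((next_host + 1))
    fi
    l2[$g]=$h
}

discard_cluster() {   # unmap guest cluster $1 and free its host cluster
    free_list+=("${l2[$1]}")
    unset "l2[$1]"
}

begin_write() {       # non-allocating write: host cluster is resolved now,
    inflight_host=${l2[$1]}   # but the data only lands in complete_write
    inflight_byte=$2
}

complete_write() { host[$inflight_host]=$inflight_byte; }
write_now()      { host[${l2[$1]}]=$2; }
read_guest()     { echo "${host[${l2[$1]}]}"; }

alloc_cluster 0       # guest cluster 0 -> host cluster 0
begin_write 0 0xBB    # write B resolves host cluster 0, then stalls
discard_cluster 0     # discard frees host cluster 0 while B is in flight
alloc_cluster 1       # D's allocation reuses host cluster 0
write_now 1 0xDD      # D stores its data and completes
complete_write        # B finally lands on host cluster 0

result=$(read_guest 1)
echo "guest cluster 1 reads $result (D wrote 0xDD)"
```

The model has no synchronisation at all, which is the point: nothing
ties the lifetime of the host cluster to the write that is still in
flight to it.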

I'm not completely sure yet what the right synchronisation mechanism
would be for this.

Anyway, as it depends on a specific pattern of discard and cluster
allocation happening while a write request is in flight, it should be
possible to use tracing to find out if anything like that is happening
in your case.
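
As a rough sketch of what to look for in such a trace: the check is
whether a host cluster is freed between the start and the completion of
a write to it. The snippet below runs that check over a made-up,
normalized event format (one "write_start <host>", "write_done <host>"
or "free <host>" record per line, in time order). This is not what
QEMU's trace output looks like; mapping real trace events onto these
three records is an assumption left open here, and find_uaf_writes is a
hypothetical helper name.

```shell
#!/bin/bash
# find_uaf_writes: read normalized events on stdin and print every host
# offset that was freed while a write to it was still in flight.
# The event names and layout are hypothetical, not QEMU trace output.
find_uaf_writes() {
    awk '
        $1 == "write_start" { inflight[$2]++ }
        $1 == "write_done"  { if (inflight[$2] > 0) inflight[$2]-- }
        $1 == "free" && inflight[$2] > 0 { print $2 }
    '
}

# Synthetic sample: 327680 is freed under an in-flight write,
# 393216 is freed only after its write has completed.
sample_events='write_start 327680
free 327680
write_start 393216
write_done 393216
free 393216
write_done 327680'

suspects=$(printf '%s\n' "$sample_events" | find_uaf_writes)
echo "host offsets freed under an in-flight write: $suspects"
```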

Kevin

blkdebug.conf:

[set-state]
state = "1"
event = "write_aio"
new_state = "2"

[set-state]
state = "2"
event = "cluster_alloc"
new_state = "3"

race_test.sh:

#!/bin/bash
#
# Reproducer for the wait_for_dependencies / skip_cow race in
# qcow2_subcluster_zeroize, demonstrating data corruption at an
# UNRELATED guest offset through host cluster reuse.
#
# The scenario:
#   1. Write A to a ZERO_ALLOC cluster creates l2meta. Data I/O suspended.
#   2. Write B to same cluster waits for A. Zero-write also waits for A.
#   3. A completes (cluster → NORMAL). B wakes first (FIFO), gets
#      skip_cow=true (no l2meta), starts data I/O, suspended by blkdebug.
#      Zero-write wakes, finds no deps (B invisible), frees cluster.
#   4. Write D to a DIFFERENT guest offset allocates the freed cluster.
#      D writes its data. D completes.
#   5. B resumes and writes to the same physical cluster, overwriting D.
#   6. Reading D's guest offset returns B's data. CORRUPTION.

set -e
DIR="$(cd "$(dirname "$0")" && pwd)"
QEMU_IO="${DIR}/../build/qemu-io"
QEMU_IMG="${DIR}/../build/qemu-img"
TEST_IMG="/tmp/race_test_$$.qcow2"
BLKDEBUG_CONF="${DIR}/blkdebug.conf"
LOG="/home/cursor/qemu/debug-8a8071.log"

cleanup() {
    rm -f "$TEST_IMG"
}
trap cleanup EXIT

echo "=== Creating test image ==="
"$QEMU_IMG" create -f qcow2 "$TEST_IMG" 1M
echo ""
echo "=== Preparing ZERO_ALLOC cluster at guest offset 0 ==="
"$QEMU_IO" -c "write -P 0x11 0 64k" \
           -c "write -z 0 64k" \
           "$TEST_IMG"
echo ""
echo "=== Running race reproducer ==="
#
# blkdebug.conf state machine:
#   State 1 --(write_aio)--> State 2 --(cluster_alloc)--> State 3
#
#   - State 1: tagA breakpoint catches write A
#   - State 2: tagB breakpoint catches write B (skip_cow write)
#   - State 2→3 transition on cluster_alloc: D's allocation transitions
#     state to 3 BEFORE D fires write_aio, so D is NOT caught by tagB
#
# Sequence:
#   break write_aio tagA      -- breakpoint for state 1
#   aio_write A 0xAA 0 64k    -- suspended at tagA (state 1→2)
#   wait_break tagA
#   break write_aio tagB      -- breakpoint for state 2
#   aio_write B 0xBB 0 64k    -- waits for A (handle_dependencies)
#   aio_write -z -u 0 64k     -- waits for A (wait_for_dependencies)
#   resume tagA               -- A completes. B wakes (skip_cow),
#                                caught by tagB. Zero-write frees
#                                cluster.
#   wait_break tagB           -- B suspended, cluster freed
#   write D 0xDD 64k 64k      -- D allocates the freed cluster
#                                (cluster_alloc transitions to
#                                state 3). D's write_aio fires at
#                                state 3, so no breakpoint. D writes
#                                its data and completes.
#   resume tagB               -- B writes to the SAME physical
#                                cluster, overwriting D's data
#   aio_flush
#
#   read -vP 0xDD 64k 512     -- EXPECTS D's data (0xDD)
#                                GETS B's data (0xBB) → CORRUPTION
QEMU_IO_OUTPUT=$("$QEMU_IO" \
    -c "break write_aio tagA" \
    -c "aio_write -P 0xAA 0 64k" \
    -c "wait_break tagA" \
    -c "break write_aio tagB" \
    -c "aio_write -P 0xBB 0 64k" \
    -c "aio_write -z -u 0 64k" \
    -c "resume tagA" \
    -c "wait_break tagB" \
    -c "write -P 0xDD 64k 64k" \
    -c "resume tagB" \
    -c "aio_flush" \
    -c "read -vP 0xDD 64k 512" \
    -c "read -vP 0 0 512" \
    "blkdebug:${BLKDEBUG_CONF}:${TEST_IMG}" 2>&1) || true
echo "$QEMU_IO_OUTPUT"
# grep -c exits non-zero when there are no matches; "|| true" keeps set -e
# from aborting the script in that case.
PATTERN_FAIL=$(echo "$QEMU_IO_OUTPUT" \
    | grep -c "Pattern verification failed" || true)
if [ "$PATTERN_FAIL" -gt 0 ]; then
    echo ""
    echo "*** DATA CORRUPTION DETECTED at guest offset 64K ***"
    echo "*** D wrote 0xDD, but reading returns B's data (0xBB)."
    echo "*** B's write to the freed+reallocated cluster corrupted"
    echo "*** an UNRELATED guest address."
fi
echo ""
echo "=== Checking image integrity (metadata) ==="
"$QEMU_IMG" check "$TEST_IMG" || true
echo ""
echo "=== Allocation map ==="
"$QEMU_IMG" map --output=json "$TEST_IMG"
echo ""
echo "=== Checking log for race evidence ==="
if [ -f "$LOG" ]; then
    echo "--- Log entries (chronological) ---"
    cat "$LOG"
    echo ""
    echo "--- Race analysis ---"
    # Extract host offsets for the skip_cow write (B) and D's write.
    # "|| true" keeps set -e from aborting when the log has no match.
    B_HOST=$(grep '"has_l2meta":0' "$LOG" | grep -o '"host_offset":[0-9]*' |
             head -1 | grep -o '[0-9]*' || true)
    D_HOST=$(grep '"offset":65536' "$LOG" | grep -o '"host_offset":[0-9]*' |
             head -1 | grep -o '[0-9]*' || true)
    echo "Write B (skip_cow, no l2meta) host_offset: $B_HOST"
    echo "Write D (different guest offset) host_offset: $D_HOST"
    if [ -n "$B_HOST" ] && [ -n "$D_HOST" ] && [ "$B_HOST" = "$D_HOST" ]; then
        echo ""
        echo "*** CLUSTER REUSE CONFIRMED: B and D write to the same"
        echo "*** physical cluster ($B_HOST) for different guest offsets."
        echo "*** B (guest offset 0) overwrites D (guest offset 64K)."
        echo "*** Reading guest offset 64K returns B's data → CORRUPTION"
        echo "*** at an unrelated guest address."
    fi
else
    echo "No log file found at $LOG"
fi