Dear maintainers,

On 27.04.26 at 7:04 PM, Kevin Wolf wrote:
> Most code in qcow2 that accesses (and potentially modifies) L2 tables
> does so while holding s->lock.
> 
> There is one exception, which is allocating writes. They hold the lock
> initially while allocating clusters, but drop it for writing the guest
> payload before taking the lock again for updating the L2 tables. This
> allows concurrent requests that touch other parts of the image file to
> continue in parallel and is an important performance optimisation.
> 
> However, this means that other requests that run while the lock is
> dropped for writing guest data must synchronise with the list of
> allocating requests in s->cluster_allocs and wait if they would overlap.
> For writes, this is done in handle_dependencies(), but discard and write
> zeros operations neglect to synchronise with s->cluster_allocs.
> 
> This means that discard can free a cluster whose L2 entry will already
> be modified in qcow2_alloc_cluster_link_l2() by a previously started
> write. In the case of a pre-allocated zero cluster that is in the
> process of being overwritten, this means that discard can lead to a
> situation where the cluster is still mapped (because the write will
> restore the L2 entry just without the zero flag), but its refcount has
> been decreased, resulting in a corrupted image.
> 
> Add the missing synchronisation to qcow2_cluster_discard() and
> qcow2_subcluster_zeroize() to fix the problem.
> 
> Cc: [email protected]
> Reported-by: Denis V. Lunev <[email protected]>
> Signed-off-by: Kevin Wolf <[email protected]>

we had started rolling out a build of QEMU 11 with this patch already
included. However, some of our users reported issues with VMs using
qcow2 disks soon after [0][1]. I was able to reproduce the in-guest
segfaults from [1] in a memory-constrained Debian 12 guest when using a
swap partition on the same disk. Thanks to Thomas for the hunch about
swap! After reverting this patch, I could no longer reproduce the issue.
I do not have a better reproducer yet and am not sure about the
exact pattern causing the issue. It's related to the
wait_for_dependencies() call in qcow2_subcluster_zeroize(), because if I
revert just the one in qcow2_cluster_discard(), the issue still reproduces.
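
For readers less familiar with the mechanism, the following toy model
sketches the kind of overlap check that a discard or write-zeroes request
has to pass before touching L2 entries. It is not QEMU code: the names
InFlightAlloc, ranges_overlap() and must_wait() are hypothetical, and the
real wait_for_dependencies() operates on the entries of s->cluster_allocs
inside coroutines rather than on a plain list of byte ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InFlightAlloc:
    """Hypothetical stand-in for one in-flight allocating write."""
    offset: int  # guest offset of the allocating write, in bytes
    length: int  # length of the allocating write, in bytes


def ranges_overlap(a_off, a_len, b_off, b_len):
    """Half-open byte ranges overlap iff each starts before the other ends."""
    return a_off < b_off + b_len and b_off < a_off + a_len


def must_wait(in_flight, offset, length):
    """True if a request on [offset, offset + length) overlaps any in-flight
    allocating write and therefore would have to wait for it to finish."""
    return any(
        ranges_overlap(a.offset, a.length, offset, length) for a in in_flight
    )


if __name__ == "__main__":
    # One allocating write covering the second 64 KiB of the image.
    allocs = [InFlightAlloc(offset=65536, length=65536)]
    print(must_wait(allocs, 0, 65536))     # adjacent range, no overlap: False
    print(must_wait(allocs, 98304, 4096))  # inside the in-flight write: True
```

The model only answers whether a request would have to wait; it says
nothing about what happens to the L2 entry or the refcount once the
request resumes, which is the part I have not pinned down yet.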

The command line for my reproducer VM is in [2]. The issue does not happen
if I drop "detect-zeroes":"unmap". Note that I don't have discard-no-unref
for the qcow2 image, so in zero_in_l2_slice(), the branch with
qcow2_free_any_cluster() is taken. Could the conflict be related to that?

I'm still trying to figure things out and come up with a better
reproducer, but wanted to let you know early, also because of the
upcoming stable releases. Of course, I'd also be grateful for hints/hunches
and am happy to test suggestions!

Best Regards,
Fiona

[0]: https://forum.proxmox.com/threads/183679/
[1]: https://forum.proxmox.com/threads/183639/

[2]:

> ./qemu-system-x86_64 \
>   -accel kvm \
>   -chardev 'socket,id=qmp,path=/var/run/qemu-server/300.qmp,server=on,wait=off' \
>   -mon 'chardev=qmp,mode=control' \
>   -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect-ms=5000' \
>   -mon 'chardev=qmp-event,mode=control' \
>   -pidfile /var/run/qemu-server/300.pid \
>   -smp '4,sockets=2,cores=2,maxcpus=4' \
>   -nodefaults \
>   -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
>   -vnc 'unix:/var/run/qemu-server/300.vnc,password=on' \
>   -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt \
>   -m 256 \
>   -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
>   -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
>   -device 'pci-bridge,id=pci.3,chassis_nr=3,bus=pci.0,addr=0x5' \
>   -device 'VGA,id=vga,bus=pci.0,addr=0x2' \
>   -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1' \
>   -blockdev '{"detect-zeroes":"unmap","discard":"unmap","driver":"qcow2","file":{"detect-zeroes":"unmap","discard":"unmap","driver":"file","filename":"/mnt/pve/dir/images/300/vm-300-disk-0.qcow2","node-name":"e377549e25f53abd39f9ba01c03653e"},"node-name":"drive-scsi0"}' \
>   -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,device_id=drive-scsi0,bootindex=100' \
>   -netdev 'type=tap,id=net1,ifname=tap300i1,script=/usr/libexec/qemu-server/pve-bridge,downscript=/usr/libexec/qemu-server/pve-bridgedown,vhost=on' \
>   -device 'virtio-net-pci,mac=BC:24:11:CA:B4:EF,netdev=net1,bus=pci.0,addr=0x13,id=net1,rx_queue_size=1024,tx_queue_size=256,host_mtu=1500'



