Dear maintainers,

On 27.04.26 at 7:04 PM, Kevin Wolf wrote:
> Most code in qcow2 that accesses (and potentially modifies) L2 tables
> does so while holding s->lock.
>
> There is one exception, which is allocating writes. They hold the lock
> initially while allocating clusters, but drop it for writing the guest
> payload before taking the lock again for updating the L2 tables. This
> allows concurrent requests that touch other parts of the image file to
> continue in parallel and is an important performance optimisation.
>
> However, this means that other requests that run while the lock is
> dropped for writing guest data must synchronise with the list of
> allocating requests in s->cluster_allocs and wait if they would overlap.
> For writes, this is done in handle_dependencies(), but discard and write
> zeros operations neglect to synchronise with s->cluster_allocs.
>
> This means that discard can free a cluster whose L2 entry will already
> be modified in qcow2_alloc_cluster_link_l2() by a previously started
> write. In the case of a pre-allocated zero cluster that is in the
> process of being overwritten, this means that discard can lead to a
> situation where the cluster is still mapped (because the write will
> restore the L2 entry just without the zero flag), but its refcount has
> been decreased, resulting in a corrupted image.
>
> Add the missing synchronisation to qcow2_cluster_discard() and
> qcow2_subcluster_zeroize() to fix the problem.
>
> Cc: [email protected]
> Reported-by: Denis V. Lunev <[email protected]>
> Signed-off-by: Kevin Wolf <[email protected]>
We had started rolling out a build of QEMU 11 with this patch already
included. However, some of our users reported issues with VMs using
qcow2 disks soon after [0][1].

I was able to reproduce the in-guest segfaults from [1] in a
memory-constrained Debian 12 guest when using a swap partition on the
same disk. Thanks to Thomas for the hunch with swap! After reverting
this patch, I wasn't able to reproduce the issue anymore.

I do not have a better reproducer yet and am not sure about the exact
pattern causing the issue. It's related to the wait_for_dependencies()
call in qcow2_subcluster_zeroize(), because if I revert just the one in
qcow2_cluster_discard(), the issue still reproduces.

The command line for my reproducer VM is in [2]. The issue does not
happen if I drop "detect-zeroes":"unmap". Note that I don't have
discard-no-unref set for the qcow2 image, so in zero_in_l2_slice(), the
branch with qcow2_free_any_cluster() is taken. Could the conflict be
related to that?

I'm still trying to figure things out and come up with a better
reproducer, but wanted to let you know early, also because of the
upcoming stable releases. Of course, I'd also be happy for
hints/hunches and am happy to test suggestions!
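For readers following along, the synchronisation described in the commit message boils down to checking a request's range against the list of in-flight allocating writes. Below is a minimal, simplified sketch of that idea only. It is not the QEMU code: InFlightAlloc, ranges_overlap() and find_conflict() are hypothetical names made up for illustration, and the real logic works on s->cluster_allocs under s->lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of s->cluster_allocs: a guest range
 * whose L2 update is still pending while the guest payload is written. */
typedef struct InFlightAlloc {
    uint64_t offset; /* guest offset of the first byte */
    uint64_t bytes;  /* length of the range in bytes */
} InFlightAlloc;

/* True if [a_off, a_off + a_len) and [b_off, b_off + b_len) intersect.
 * Overflow of offset + length is ignored in this sketch. */
bool ranges_overlap(uint64_t a_off, uint64_t a_len,
                    uint64_t b_off, uint64_t b_len)
{
    assert(a_len > 0 && b_len > 0);
    return a_off < b_off + b_len && b_off < a_off + a_len;
}

/* Returns the first in-flight allocation that a discard or write-zeroes
 * request covering [offset, offset + bytes) would have to wait for, or
 * NULL if the request does not overlap any of them. */
const InFlightAlloc *find_conflict(const InFlightAlloc *allocs, size_t n,
                                   uint64_t offset, uint64_t bytes)
{
    for (size_t i = 0; i < n; i++) {
        if (ranges_overlap(allocs[i].offset, allocs[i].bytes,
                           offset, bytes)) {
            return &allocs[i];
        }
    }
    return NULL;
}
```

In this simplified model, a discard or zeroize request would call find_conflict() with the lock held and, on a non-NULL result, wait for that allocation to finish before re-checking. How the actual waiting interacts with the zeroize path is exactly the part I have not figured out yet.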
Best Regards,
Fiona

[0]: https://forum.proxmox.com/threads/183679/
[1]: https://forum.proxmox.com/threads/183639/
[2]:

> ./qemu-system-x86_64 \
>   -accel kvm \
>   -chardev 'socket,id=qmp,path=/var/run/qemu-server/300.qmp,server=on,wait=off' \
>   -mon 'chardev=qmp,mode=control' \
>   -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect-ms=5000' \
>   -mon 'chardev=qmp-event,mode=control' \
>   -pidfile /var/run/qemu-server/300.pid \
>   -smp '4,sockets=2,cores=2,maxcpus=4' \
>   -nodefaults \
>   -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
>   -vnc 'unix:/var/run/qemu-server/300.vnc,password=on' \
>   -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt \
>   -m 256 \
>   -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
>   -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
>   -device 'pci-bridge,id=pci.3,chassis_nr=3,bus=pci.0,addr=0x5' \
>   -device 'VGA,id=vga,bus=pci.0,addr=0x2' \
>   -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1' \
>   -blockdev '{"detect-zeroes":"unmap","discard":"unmap","driver":"qcow2","file":{"detect-zeroes":"unmap","discard":"unmap","driver":"file","filename":"/mnt/pve/dir/images/300/vm-300-disk-0.qcow2","node-name":"e377549e25f53abd39f9ba01c03653e"},"node-name":"drive-scsi0"}' \
>   -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,device_id=drive-scsi0,bootindex=100' \
>   -netdev 'type=tap,id=net1,ifname=tap300i1,script=/usr/libexec/qemu-server/pve-bridge,downscript=/usr/libexec/qemu-server/pve-bridgedown,vhost=on' \
>   -device 'virtio-net-pci,mac=BC:24:11:CA:B4:EF,netdev=net1,bus=pci.0,addr=0x13,id=net1,rx_queue_size=1024,tx_queue_size=256,host_mtu=1500'
