Re: Potential regression in 'qemu-img convert' to LVM
On 07/01/2021 21:03, Nir Soffer wrote:
> On Tue, Sep 15, 2020 at 2:51 PM Stefan Reiter wrote:
>> On 9/15/20 11:08 AM, Nir Soffer wrote:
>>> On Mon, Sep 14, 2020 at 3:25 PM Stefan Reiter wrote:
>>>> Hi list,
>>>>
>>>> following command fails since 5.1 (tested on kernel 5.4.60):
>>>>
>>>> # qemu-img convert -p -f raw -O raw /dev/zvol/pool/disk-1 /dev/vg/disk-1
>>>> qemu-img: error while writing at byte 2157968896: Device or resource busy
>>>>
>>>> (source is ZFS here, but doesn't matter in practice, it always fails the
>>>> same; offset changes slightly but consistently hovers around 2^31)
>>>>
>>>> strace shows the following:
>>>> fallocate(13, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2157968896,
>>>> 4608) = -1 EBUSY (Device or resource busy)
>>>
>>> What is the size of the LV?
>>
>> Same as the source, 5GB in my test case. Created with:
>>
>> # lvcreate -ay --size 5242880k --name disk-1 vg
>>
>>> Does it happen if you change sparse minimum size (-S)?
>>>
>>> For example: -S 64k
>>>
>>> qemu-img convert -p -f raw -O raw -S 64k /dev/zvol/pool/disk-1
>>> /dev/vg/disk-1
>>
>> Tried a few different values, always the same result: EBUSY at byte
>> 2157968896.
>>
>>>> Other fallocate calls leading up to this work fine.
>>>>
>>>> This happens since commit edafc70c0c "qemu-img convert: Don't pre-zero
>>>> images", before that all fallocates happened at the start. Reverting the
>>>> commit and calling qemu-img exactly the same way on the same data works
>>>> fine.
>>>
>>> But slowly, doing up to 100% more work for fully allocated images.
>>
>> Of course, I'm not saying the patch is wrong, reverting it just avoids
>> triggering the bug.
>>
>>>> Simply retrying the syscall on EBUSY (like EINTR) does *not* work,
>>>> once it fails it keeps failing with the same error.
>>>>
>>>> I couldn't find anything related to EBUSY on fallocate, and it only
>>>> happens on LVM targets... Any idea or pointers where to look?
>>>
>>> Is this thin LV?
>>
>> No, regular LV. See command above.
>>
>>> This works for us using regular LVs.
>>>
>>> Which kernel? which distro?
>>
>> Reproducible on:
>> * PVE w/ kernel 5.4.60 (Ubuntu based)
>> * Manjaro w/ kernel 5.8.6
>>
>> I found that it does not happen with all images, I suppose there must be
>> a certain number of smaller holes for it to happen. I am using a VM
>> image with a bare-bones Alpine Linux installation, but it's not an
>> isolated case, we've had two people report the issue on our bug tracker:
>> https://bugzilla.proxmox.com/show_bug.cgi?id=3002
>
> I think that this issue may be fixed by
> https://lists.nongnu.org/archive/html/qemu-block/2020-11/msg00358.html
>
> Nir

Sorry for the late reply, but yes, I can confirm this fixes the issue.

~
Re: Potential regression in 'qemu-img convert' to LVM
On Tue, Sep 15, 2020 at 2:51 PM Stefan Reiter wrote:
>
> On 9/15/20 11:08 AM, Nir Soffer wrote:
> > On Mon, Sep 14, 2020 at 3:25 PM Stefan Reiter wrote:
> >>
> >> Hi list,
> >>
> >> following command fails since 5.1 (tested on kernel 5.4.60):
> >>
> >> # qemu-img convert -p -f raw -O raw /dev/zvol/pool/disk-1 /dev/vg/disk-1
> >> qemu-img: error while writing at byte 2157968896: Device or resource busy
> >>
> >> (source is ZFS here, but doesn't matter in practice, it always fails the
> >> same; offset changes slightly but consistently hovers around 2^31)
> >>
> >> strace shows the following:
> >> fallocate(13, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2157968896,
> >> 4608) = -1 EBUSY (Device or resource busy)
> >
> > What is the size of the LV?
> >
>
> Same as the source, 5GB in my test case. Created with:
>
> # lvcreate -ay --size 5242880k --name disk-1 vg
>
> > Does it happen if you change sparse minimum size (-S)?
> >
> > For example: -S 64k
> >
> > qemu-img convert -p -f raw -O raw -S 64k /dev/zvol/pool/disk-1
> > /dev/vg/disk-1
> >
>
> Tried a few different values, always the same result: EBUSY at byte
> 2157968896.
>
> >> Other fallocate calls leading up to this work fine.
> >>
> >> This happens since commit edafc70c0c "qemu-img convert: Don't pre-zero
> >> images", before that all fallocates happened at the start. Reverting the
> >> commit and calling qemu-img exactly the same way on the same data works
> >> fine.
> >
> > But slowly, doing up to 100% more work for fully allocated images.
> >
>
> Of course, I'm not saying the patch is wrong, reverting it just avoids
> triggering the bug.
>
> >> Simply retrying the syscall on EBUSY (like EINTR) does *not* work,
> >> once it fails it keeps failing with the same error.
> >>
> >> I couldn't find anything related to EBUSY on fallocate, and it only
> >> happens on LVM targets... Any idea or pointers where to look?
> >
> > Is this thin LV?
> >
>
> No, regular LV. See command above.
>
> > This works for us using regular LVs.
> >
> > Which kernel? which distro?
> >
>
> Reproducible on:
> * PVE w/ kernel 5.4.60 (Ubuntu based)
> * Manjaro w/ kernel 5.8.6
>
> I found that it does not happen with all images, I suppose there must be
> a certain number of smaller holes for it to happen. I am using a VM
> image with a bare-bones Alpine Linux installation, but it's not an
> isolated case, we've had two people report the issue on our bug tracker:
> https://bugzilla.proxmox.com/show_bug.cgi?id=3002

I think that this issue may be fixed by
https://lists.nongnu.org/archive/html/qemu-block/2020-11/msg00358.html

Nir
Re: Potential regression in 'qemu-img convert' to LVM
On 9/15/20 11:08 AM, Nir Soffer wrote:
> On Mon, Sep 14, 2020 at 3:25 PM Stefan Reiter wrote:
>> Hi list,
>>
>> following command fails since 5.1 (tested on kernel 5.4.60):
>>
>> # qemu-img convert -p -f raw -O raw /dev/zvol/pool/disk-1 /dev/vg/disk-1
>> qemu-img: error while writing at byte 2157968896: Device or resource busy
>>
>> (source is ZFS here, but doesn't matter in practice, it always fails the
>> same; offset changes slightly but consistently hovers around 2^31)
>>
>> strace shows the following:
>> fallocate(13, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2157968896,
>> 4608) = -1 EBUSY (Device or resource busy)
>
> What is the size of the LV?

Same as the source, 5GB in my test case. Created with:

# lvcreate -ay --size 5242880k --name disk-1 vg

> Does it happen if you change sparse minimum size (-S)?
>
> For example: -S 64k
>
> qemu-img convert -p -f raw -O raw -S 64k /dev/zvol/pool/disk-1
> /dev/vg/disk-1

Tried a few different values, always the same result: EBUSY at byte
2157968896.

>> Other fallocate calls leading up to this work fine.
>>
>> This happens since commit edafc70c0c "qemu-img convert: Don't pre-zero
>> images", before that all fallocates happened at the start. Reverting the
>> commit and calling qemu-img exactly the same way on the same data works
>> fine.
>
> But slowly, doing up to 100% more work for fully allocated images.

Of course, I'm not saying the patch is wrong, reverting it just avoids
triggering the bug.

>> Simply retrying the syscall on EBUSY (like EINTR) does *not* work,
>> once it fails it keeps failing with the same error.
>>
>> I couldn't find anything related to EBUSY on fallocate, and it only
>> happens on LVM targets... Any idea or pointers where to look?
>
> Is this thin LV?

No, regular LV. See command above.

> This works for us using regular LVs.
>
> Which kernel? which distro?

Reproducible on:
* PVE w/ kernel 5.4.60 (Ubuntu based)
* Manjaro w/ kernel 5.8.6

I found that it does not happen with all images, I suppose there must be
a certain number of smaller holes for it to happen. I am using a VM
image with a bare-bones Alpine Linux installation, but it's not an
isolated case, we've had two people report the issue on our bug tracker:
https://bugzilla.proxmox.com/show_bug.cgi?id=3002

Thanks,
Stefan

> Nir
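The figures quoted in this message can be checked with a few lines of arithmetic. The following is only a sketch: the function name `describe` and the dictionary keys are made up for illustration, and the alignment checks are plain observations about the numbers, not a diagnosis that the thread establishes.

```python
# Values copied from the thread.
OFFSET = 2157968896    # failing byte offset from the qemu-img error
LENGTH = 4608          # length argument of the failing fallocate call
LV_SIZE_KIB = 5242880  # from "lvcreate --size 5242880k"


def describe(offset, length, lv_size_kib):
    """Return basic facts about the failing range (hypothetical helper)."""
    lv_bytes = lv_size_kib * 1024
    return {
        "lv_size_gib": lv_bytes / 2**30,
        "offset_minus_2_31": offset - 2**31,
        "offset_512_aligned": offset % 512 == 0,
        "offset_4096_aligned": offset % 4096 == 0,
        "length_sectors": length // 512,
        "end_within_lv": offset + length <= lv_bytes,
    }


if __name__ == "__main__":
    for key, value in describe(OFFSET, LENGTH, LV_SIZE_KIB).items():
        print(f"{key}: {value}")
```

With the values above, the LV is exactly 5 GiB, the failing offset lies 10485248 bytes (just under 10 MiB) past 2^31, the range is well inside the LV, and the offset is a multiple of 512 but not of 4096.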
Re: Potential regression in 'qemu-img convert' to LVM
On Mon, Sep 14, 2020 at 3:25 PM Stefan Reiter wrote:
>
> Hi list,
>
> following command fails since 5.1 (tested on kernel 5.4.60):
>
> # qemu-img convert -p -f raw -O raw /dev/zvol/pool/disk-1 /dev/vg/disk-1
> qemu-img: error while writing at byte 2157968896: Device or resource busy
>
> (source is ZFS here, but doesn't matter in practice, it always fails the
> same; offset changes slightly but consistently hovers around 2^31)
>
> strace shows the following:
> fallocate(13, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2157968896,
> 4608) = -1 EBUSY (Device or resource busy)

What is the size of the LV?

Does it happen if you change sparse minimum size (-S)?

For example: -S 64k

qemu-img convert -p -f raw -O raw -S 64k /dev/zvol/pool/disk-1 /dev/vg/disk-1

> Other fallocate calls leading up to this work fine.
>
> This happens since commit edafc70c0c "qemu-img convert: Don't pre-zero
> images", before that all fallocates happened at the start. Reverting the
> commit and calling qemu-img exactly the same way on the same data works
> fine.

But slowly, doing up to 100% more work for fully allocated images.

> Simply retrying the syscall on EBUSY (like EINTR) does *not* work,
> once it fails it keeps failing with the same error.
>
> I couldn't find anything related to EBUSY on fallocate, and it only
> happens on LVM targets... Any idea or pointers where to look?

Is this thin LV?

This works for us using regular LVs.

Which kernel? which distro?

Nir
Potential regression in 'qemu-img convert' to LVM
Hi list,

following command fails since 5.1 (tested on kernel 5.4.60):

# qemu-img convert -p -f raw -O raw /dev/zvol/pool/disk-1 /dev/vg/disk-1
qemu-img: error while writing at byte 2157968896: Device or resource busy

(source is ZFS here, but doesn't matter in practice, it always fails the
same; offset changes slightly but consistently hovers around 2^31)

strace shows the following:
fallocate(13, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2157968896,
4608) = -1 EBUSY (Device or resource busy)

Other fallocate calls leading up to this work fine.

This happens since commit edafc70c0c "qemu-img convert: Don't pre-zero
images", before that all fallocates happened at the start. Reverting the
commit and calling qemu-img exactly the same way on the same data works
fine.

Simply retrying the syscall on EBUSY (like EINTR) does *not* work,
once it fails it keeps failing with the same error.

I couldn't find anything related to EBUSY on fallocate, and it only
happens on LVM targets... Any idea or pointers where to look?

~ Stefan
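For anyone who wants to issue the failing call outside of qemu-img, the fallocate line from the strace output can be approximated from a script. This is a minimal sketch and not how qemu-img itself does it: the helper name `punch_hole` is hypothetical, it assumes 64-bit Linux with a libc that exports `fallocate`, and the two flag values are taken from `<linux/falloc.h>`. It retries only on EINTR, so an EBUSY result is returned to the caller as-is.

```python
import ctypes
import errno

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Call fallocate(2) with KEEP_SIZE|PUNCH_HOLE (hypothetical helper).

    Returns 0 on success, otherwise the errno value of the failed call.
    Only EINTR is retried; EBUSY and all other errors are returned.
    """
    # Assumption: 64-bit Linux, where off_t is a 64-bit integer.
    libc = ctypes.CDLL(None, use_errno=True)
    fallocate = libc.fallocate
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int

    mode = FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE
    while True:
        if fallocate(fd, mode, offset, length) == 0:
            return 0
        err = ctypes.get_errno()
        if err != errno.EINTR:
            return err
```

A possible use is to open the target with `os.open("/dev/vg/disk-1", os.O_WRONLY)` and compare the result of `punch_hole(fd, 2157968896, 4608)` against `errno.EBUSY`. Punching a hole discards the data in that range, so this should only be pointed at a scratch LV.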