io: avoid failure caused by misaligned BLKZEROOUT ioctl

Fiona Ebner Fri, 09 Jan 2026 04:10:55 -0800

Previous discussion here:
https://lore.kernel.org/qemu-devel/[email protected]/


Commit 5634622bcb ("file-posix: allow BLKZEROOUT with -t writeback")
enables the BLKZEROOUT ioctl when using 'writeback' cache, regressing
certain 'qemu-img convert' invocations, because of a pre-existing
issue. Namely, the BLKZEROOUT ioctl might fail with errno EINVAL when
the request is shorter than the block size of the block device.
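
To make the failure condition concrete, here is a minimal sketch of an
alignment check. It is not code from QEMU or from this series. The
helper name is hypothetical, and it assumes the kernel rejects
BLKZEROOUT requests whose offset or length is not a multiple of the
device's block size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of QEMU: returns true if both the offset
 * and the length of a zero-out request are multiples of the block size.
 * Assumption: a request failing this check is one that could get EINVAL.
 */
static inline bool sketch_zeroout_is_aligned(uint64_t offset, uint64_t bytes,
                                             uint32_t block_size)
{
    assert(block_size > 0);
    return offset % block_size == 0 && bytes % block_size == 0;
}
```

With a 4096-byte block size, a 512-byte request at offset 0 fails this
check, while a 4096-byte request at offset 0 passes it.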

Stefan suggested prioritizing bl.pwrite_zeroes_alignment in
bdrv_co_do_zero_pwritev(). This RFC explores that approach and the
issues with qcow2 I encountered, where
bl.pwrite_zeroes_alignment = s->subcluster_size;
I would be happy to discuss potential solutions and whether we should
use this approach after all.

For example, in iotests 154 and 271, there are assertion failures,
because the padded request extends beyond the end of the image:
Assertion `offset + bytes <= bs->total_sectors * BDRV_SECTOR_SIZE ||
child->perm & BLK_PERM_RESIZE' failed.
The total image length is not necessarily aligned to the cluster size.
This could be solved by shortening the relevant requests in
bdrv_co_do_zero_pwritev() and submitting them without the
BDRV_REQ_ZERO_WRITE flag and with bl.request_alignment as the
alignment, see patch 5/6.
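
The overshoot can be pictured with a small sketch that pads a request
to the write-zeroes alignment and then clamps it to the image length.
This is only an illustration under simplified assumptions. The type and
function names are hypothetical, and the actual patch additionally
resubmits the shortened part without the zero flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type for this sketch: a byte range within an image. */
typedef struct {
    uint64_t offset;
    uint64_t bytes;
} SketchRange;

/*
 * Pad [offset, offset + bytes) outwards to multiples of align, then
 * shorten the result so it never extends past image_len.
 */
static inline SketchRange sketch_pad_and_clamp(uint64_t offset, uint64_t bytes,
                                               uint64_t align,
                                               uint64_t image_len)
{
    assert(align > 0);
    assert(offset + bytes <= image_len);

    uint64_t start = offset - offset % align;   /* align down */
    uint64_t end = offset + bytes;

    if (end % align != 0) {
        end += align - end % align;             /* align up */
    }
    if (end > image_len) {
        end = image_len;                        /* image length is unaligned */
    }
    return (SketchRange){ .offset = start, .bytes = end - start };
}
```

For a 64KiB alignment and an image of 3 * 64KiB + 512 bytes, a 100-byte
request at offset 196700 would be padded to end at 262144, past the end
of the image. The clamped range is 512 bytes starting at 196608.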

For iotest 179, I would need to avoid clearing BDRV_REQ_ZERO_WRITE for
the head and tail parts as long as the buffer is fully zero.
Otherwise, we end up with more 'data' sectors in the target map. See
patch 6/6. With or without that, iotests 154 and 271 produce
different output (I think it might be expected, but haven't checked in
detail yet).
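
The condition for the head and tail parts can be sketched as follows.
This is a simplified illustration, not the code from patch 6/6. The
flag name, its value, and both helpers are stand-ins invented for this
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a zero-write request flag; the value is arbitrary here. */
#define SKETCH_REQ_ZERO_WRITE 0x2u

/* Naive byte-wise check whether a buffer contains only zeroes. */
static inline bool sketch_buffer_is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}

/*
 * Flags to use for a head or tail part: keep the zero flag only if the
 * padded buffer is fully zero, otherwise fall back to a regular write.
 */
static inline unsigned sketch_part_flags(unsigned flags, const uint8_t *buf,
                                         size_t len)
{
    if (!sketch_buffer_is_zero(buf, len)) {
        flags &= ~SKETCH_REQ_ZERO_WRITE;
    }
    return flags;
}
```

A part whose buffer is all zeroes keeps the flag and can still be
written as zeroes. A part containing any non-zero byte loses it.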

Another issue is exposed by iotest 177, where the (sub-)cluster size
is 1MiB, but max-transfer is only 64KiB, leading to assertion failures,
because max_transfer =
QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX), align);
evaluates to 0 (because align > bs->bl.max_transfer). This could be
fixed by doing the QEMU_ALIGN_DOWN only if the value is bigger than
align, see patch 4/6.
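
The arithmetic can be reproduced with a small standalone sketch. The
macro and functions below are simplified stand-ins written for this
example, not QEMU's own definitions, and the guarded variant only
approximates what patch 4/6 does.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Simplified stand-in: round n down to a multiple of m. */
#define SKETCH_ALIGN_DOWN(n, m) ((n) / (m) * (m))

/* Simplified stand-in: minimum of a and b, treating 0 as "no limit". */
static inline uint64_t sketch_min_non_zero(uint64_t a, uint64_t b)
{
    if (a == 0) {
        return b;
    }
    if (b == 0) {
        return a;
    }
    return a < b ? a : b;
}

/* Unguarded calculation: becomes 0 when align exceeds the limit. */
static inline uint64_t sketch_max_transfer_unguarded(uint64_t bl_max_transfer,
                                                     uint64_t align)
{
    assert(align > 0);
    return SKETCH_ALIGN_DOWN(sketch_min_non_zero(bl_max_transfer, INT_MAX),
                             align);
}

/* Guarded calculation: only align down if the value is bigger than align. */
static inline uint64_t sketch_max_transfer_guarded(uint64_t bl_max_transfer,
                                                   uint64_t align)
{
    uint64_t max_transfer = sketch_min_non_zero(bl_max_transfer, INT_MAX);

    assert(align > 0);
    if (max_transfer > align) {
        max_transfer = SKETCH_ALIGN_DOWN(max_transfer, align);
    }
    return max_transfer;
}
```

With a 64KiB max transfer and a 1MiB alignment, the unguarded variant
yields 0, while the guarded variant keeps 64KiB.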

I'm also not sure what to do about iotests 204 and 177, which use
'opt-write-zero=15M' for the blkdebug driver (which assigns that value
to pwrite_zeroes_alignment), making an is_power_of_2(align) assertion
fail.
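
A short sketch shows why 15M trips such a check. The helpers are
stand-ins written for this example. The mask-based remainder is
included on the assumption that the padding code relies on mask
arithmetic, which is only valid for power-of-two alignments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in power-of-two test: exactly one bit set. */
static inline bool sketch_is_power_of_2(uint64_t value)
{
    return value != 0 && (value & (value - 1)) == 0;
}

/* Remainder computed with a mask; only correct if align is a power of two. */
static inline uint64_t sketch_remainder_by_mask(uint64_t offset,
                                                uint64_t align)
{
    return offset & (align - 1);
}
```

15M is 0xF00000, which has four bits set, so the power-of-two test
fails. The mask-based remainder of 15M with respect to 15M comes out as
14M instead of 0.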

Yet another issue is the 'detect_zeroes' option. If the option is set,
bdrv_aligned_pwritev() might set the BDRV_REQ_ZERO_WRITE flag even if
the request is not aligned to pwrite_zeroes_alignment, so the original
bug could resurface.

Best Regards,
Fiona


Fiona Ebner (6):
  block/io: pass alignment to bdrv_init_padding()
  block/io: add 'bytes' parameter to bdrv_padding_rmw_read()
  block/io: honor pwrite_zeroes_alignment in bdrv_co_do_zero_pwritev()
  block/io: safeguard max transfer calculation in bdrv_aligned_pwritev()
  block/io: handle image length not aligned to write zeroes alignment in
    bdrv_co_do_zero_pwritev()
  block/io: keep zero flag for head/tail parts of misaligned zero write
    when possible

 block/io.c | 78 ++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 55 insertions(+), 23 deletions(-)

-- 
2.47.3

[RFC v2 0/6] block/io: avoid failure caused by misaligned BLKZEROOUT ioctl
