Re: [RFC v2 0/6] block/io: avoid failure caused by misaligned BLKZEROOUT ioctl

Stefan Hajnoczi Mon, 19 Jan 2026 11:39:17 -0800

On Fri, Jan 09, 2026 at 01:08:27PM +0100, Fiona Ebner wrote:
> Previous discussion here:
> https://lore.kernel.org/qemu-devel/[email protected]/
> 
> Commit 5634622bcb ("file-posix: allow BLKZEROOUT with -t writeback")
> enables the BLKZEROOUT ioctl when using 'writeback' cache, regressing
> certain 'qemu-img convert' invocations, because of a pre-existing
> issue. Namely, the BLKZEROOUT ioctl might fail with errno EINVAL when
> the request is shorter than the block size of the block device.
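
To make the failure condition concrete, here is a minimal sketch of the
alignment check a caller would need before issuing BLKZEROOUT. It assumes
the request is rejected whenever the offset or length is not a multiple of
the device's block size. The helper name zeroout_range_is_aligned() is
hypothetical and is not an existing QEMU function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of QEMU: returns true if the range
 * [offset, offset + bytes) starts and ends on a multiple of block_size.
 * block_size is assumed to be a power of two.
 */
static bool zeroout_range_is_aligned(uint64_t offset, uint64_t bytes,
                                     uint64_t block_size)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return bytes != 0 && ((offset | bytes) & (block_size - 1)) == 0;
}
```

With a 4096-byte block size, a 512-byte request at offset 0 fails this
check, which matches the "request shorter than the block size" case.
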
> 
> Stefan suggested prioritizing bl.pwrite_zeroes_alignment in
> bdrv_co_do_zero_pwritev(). This RFC explores that approach and the
> issues with qcow2 I encountered, where
> bl.pwrite_zeroes_alignment = s->subcluster_size;
> I would be happy to discuss potential solutions and whether we should
> use this approach after all.


These issues are a headache, but I think it's important for us to
consider them. They indicate that QEMU does not properly distinguish
between read/write and pwrite_zeroes constraints.

If we can agree on how the block layer should handle pwrite_zeroes
constraints in a consistent way that makes the tests pass, then that
should serve the QEMU block layer well in the future.

I will mention this patch series to Kevin as well so we can get his
opinion.

> 
> For example, in iotest 154 and 271, there are assertion failures,
> because the padded request extends beyond the end of the image:
> Assertion `offset + bytes <= bs->total_sectors * BDRV_SECTOR_SIZE ||
> child->perm & BLK_PERM_RESIZE' failed.
> The total image length is not necessarily aligned to the cluster size.
> This could be solved by shortening the relevant requests in
> bdrv_co_do_zero_pwritev() and submitting them without the
> BDRV_REQ_ZERO_WRITE flag and with bl.request_alignment as the
> alignment, see patch 5/6.
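
The situation can be sketched as follows. This only illustrates the
arithmetic and is not the code from patch 5/6: the type AlignedRange and the
functions pad_to_alignment() and pick_alignment() are hypothetical names.
The sketch assumes both alignments are powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative type, not a QEMU structure. */
typedef struct AlignedRange {
    uint64_t offset;
    uint64_t bytes;
} AlignedRange;

/* Round [offset, offset + bytes) outward to multiples of align. */
static AlignedRange pad_to_alignment(uint64_t offset, uint64_t bytes,
                                     uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uint64_t start = offset & ~(align - 1);
    uint64_t end = (offset + bytes + align - 1) & ~(align - 1);
    return (AlignedRange){ .offset = start, .bytes = end - start };
}

/*
 * Use the write-zeroes alignment only if the padded request still fits
 * inside the image; otherwise fall back to the plain request alignment.
 */
static uint64_t pick_alignment(uint64_t offset, uint64_t bytes,
                               uint64_t image_len, uint64_t zero_align,
                               uint64_t request_align)
{
    AlignedRange padded = pad_to_alignment(offset, bytes, zero_align);
    if (padded.offset + padded.bytes <= image_len) {
        return zero_align;
    }
    return request_align;
}
```

For an image of 3 * 64 KiB + 512 bytes and a 64 KiB write-zeroes alignment,
a 512-byte request at offset 196608 pads to [196608, 262144), which is past
the end of the image, so the sketch falls back to the 512-byte alignment.
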
> 
> For iotest 179, I would need to avoid clearing BDRV_REQ_ZERO_WRITE for
> the head and tail parts as long as the buffer is fully zero.
> Otherwise, we end up with more 'data' sectors in the target map. See
> patch 6/6. With or without that, iotests 154 and 271 produce
> different output (I think it might be expected, but haven't checked in
> detail yet).
> 
> Another issue is exposed by iotest 177, where the (sub-)cluster size
> is 1MiB, but max-transfer is only 64KiB leading to assertion failures,
> because max_transfer =
> QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX), align);
> evaluates to 0 (because align > bs->bl.max_transfer). This could be
> fixed by doing the QEMU_ALIGN_DOWN only if the value is bigger than
> align, see patch 4/6.
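
The arithmetic behind this is easy to reproduce in isolation. The sketch
below uses local stand-in functions (min_non_zero(), align_down()) instead
of the QEMU macros, and max_transfer_guarded() is only one possible reading
of the safeguard described above, not necessarily what patch 4/6 does.

```c
#include <limits.h>
#include <stdint.h>

/* Local stand-in: smaller of a and b, ignoring a value of zero. */
static uint64_t min_non_zero(uint64_t a, uint64_t b)
{
    if (a == 0) {
        return b;
    }
    if (b == 0) {
        return a;
    }
    return a < b ? a : b;
}

/* Local stand-in: round n down to a multiple of m (m must be non-zero). */
static uint64_t align_down(uint64_t n, uint64_t m)
{
    return n / m * m;
}

/* Current behaviour: becomes 0 when align exceeds the transfer limit. */
static uint64_t max_transfer_unguarded(uint64_t bl_max_transfer,
                                       uint64_t align)
{
    return align_down(min_non_zero(bl_max_transfer, INT_MAX), align);
}

/* Assumed safeguard: only align down when the limit is bigger than align. */
static uint64_t max_transfer_guarded(uint64_t bl_max_transfer,
                                     uint64_t align)
{
    uint64_t max = min_non_zero(bl_max_transfer, INT_MAX);
    return max > align ? align_down(max, align) : max;
}
```

With max-transfer = 64 KiB and align = 1 MiB, the unguarded variant returns
0 while the guarded one keeps 64 KiB. Note that in the guarded case the
result is no longer a multiple of align, so the caller has to cope with
that.
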
> 
> I'm also not sure what to do about iotests 204 and 177 which use
> 'opt-write-zero=15M' for the blkdebug driver (which assigns that value
> to pwrite_zeroes_alignment) making an is_power_of_2(align) assertion
> fail.
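
A small sketch of why a 15 MiB alignment is a problem for code that expects
a power of two. The function names here are local stand-ins for
illustration. One common reason for such an assertion is that mask-based
rounding only gives correct results for power-of-two alignments, although
whether that is the exact reason in this code path is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024)

/* Local stand-in for a power-of-two check. */
static bool is_pow2(uint64_t value)
{
    return value != 0 && (value & (value - 1)) == 0;
}

/* Mask-based rounding: only correct when align is a power of two. */
static uint64_t align_down_mask(uint64_t n, uint64_t align)
{
    return n & ~(align - 1);
}

/* Division-based rounding: correct for any non-zero align. */
static uint64_t align_down_div(uint64_t n, uint64_t align)
{
    return n / align * align;
}
```

For n = 20 MiB and align = 15 MiB, the mask variant yields 16 MiB, which is
not a multiple of 15 MiB, while the division variant yields 15 MiB.
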
> 
> Yet another issue is the 'detect_zeroes' option. If the option is set,
> bdrv_aligned_pwritev() might set the BDRV_REQ_ZERO_WRITE flag even if
> the request is not aligned to pwrite_zeroes_alignment and the original
> bug could resurface.
> 
> Best Regards,
> Fiona
> 
> 
> Fiona Ebner (6):
>   block/io: pass alignment to bdrv_init_padding()
>   block/io: add 'bytes' parameter to bdrv_padding_rmw_read()
>   block/io: honor pwrite_zeroes_alignment in bdrv_co_do_zero_pwritev()
>   block/io: safeguard max transfer calculation in bdrv_aligned_pwritev()
>   block/io: handle image length not aligned to write zeroes alignment in
>     bdrv_co_do_zero_pwritev()
>   block/io: keep zero flag for head/tail parts of misaligned zero write
>     when possible
> 
>  block/io.c | 78 ++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 55 insertions(+), 23 deletions(-)
> 
> -- 
> 2.47.3
> 
>
