On 2/19/21 10:07 AM, Nir Soffer wrote:
> When querying image extents for raw image, qemu-nbd reports holes as
> zero:
> 
> $ qemu-nbd -t -r -f raw empty-6g.raw
> 
> $ qemu-img map --output json nbd://localhost
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": true, 
> "offset": 0}]
> 
> $ qemu-img map --output json empty-6g.raw
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, 
> "offset": 0}]
> 
> Turns out that qemu-img map reports a hole based on BDRV_BLOCK_DATA, but
> nbd server reports a hole based on BDRV_BLOCK_ALLOCATED.
> 
> The NBD protocol says:
> 
>     NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
>     future writes to that area may cause fragmentation or encounter an
>     NBD_ENOSPC error); if clear, the block is allocated or the server
>     could not otherwise determine its status.

Notoriously, lseek(SEEK_HOLE) cannot actually tell us about which
portions of a file are allocated (it can only reliably tell us which
portions of a file read as zero); you have to use the FIEMAP ioctl or
similar to learn about allocation status of a raw file, which is too
expensive to be useful.  The fact that SEEK_HOLE does not tell us about
NBD_STATE_HOLE, but instead is solely about NBD_STATE_ZERO, is therefore
a misnomer that doesn't make it any easier to reason about.

But this patch is also an indication that our meaning of
BDRV_BLOCK_ALLOCATED is confused; there has been a mismatch over time on
whether it represents whether something is allocated by occupying space
on disk (which as pointed out above is too expensive to determine for
files, but easy to determine for qcow2) or represents allocated at this
depth within a backing chain (makes no sense for protocol layers, but
total sense for format layers like qcow2).  At one point, Vladimir
audited the code base to try and reserve BDRV_BLOCK_ALLOCATED solely for
the latter meaning (which layer of a backing chain supplies the data,
regardless of whether that extent occupies reserved storage space).

So on that grounds alone, I am inclined to agree with you that using
BDRV_BLOCK_ALLOCATED to feed NBD_STATE_HOLE is wrong: whether something
comes from the current layer of the backing image is unrelated to
whether that extent is known to occupy reserved storage in the file
system.  Failure to sest NBD_STATE_HOLE is not fatal (the NBD protocol
specifically chose NBD_STATE_HOLE and NBD_STATE_ZERO to have default
values of 0 in the common case, where setting them to 1 allows some
optimizations, but where skipping the chance to set them to 1 because
determining the answer is too expensive, such as FIEMAP, is fine as well).

> 
> qemu-img manual says:
> 
>     whether the sectors contain actual data or not (boolean field data;
>     if false, the sectors are either unallocated or stored as
>     optimized all-zero clusters);
> 
> To me, data=false looks compatible with NBD_STATE_HOLE. From user point
> of view, getting same results from qemu-nbd and qemu-img is more
> important than being more correct about allocation status.

More to the point, here is our inconsistency:

In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE

In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA

The fact that we are not doing a round-trip conversion means that one of
the two places is wrong.  And your argument that the server side is
wrong makes sense to me.

> 
> Changing nbd server to report holes using BDRV_BLOCK_DATA makes qemu-nbd
> results compatible with qemu-img map:
> 
> $ qemu-img map --output json nbd://localhost
> [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, 
> "offset": 0}]
> 
> Signed-off-by: Nir Soffer <nsof...@redhat.com>
> ---
>  nbd/server.c               | 4 ++--
>  tests/qemu-iotests/241.out | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)

Conflicts with my patch to make NBD_CMD_READ/NBD_CMD_BLOCK_STATUS obey
block alignment boundaries:
https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06010.html

Reviewed-by: Eric Blake <ebl...@redhat.com>

I'll wait a few days for any other reviewer commentary before taking
this through my NBD tree.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org


Reply via email to