On 2/19/21 10:07 AM, Nir Soffer wrote: > When querying image extents for raw image, qemu-nbd reports holes as > zero: > > $ qemu-nbd -t -r -f raw empty-6g.raw > > $ qemu-img map --output json nbd://localhost > [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": true, > "offset": 0}] > > $ qemu-img map --output json empty-6g.raw > [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, > "offset": 0}] > > Turns out that qemu-img map reports a hole based on BDRV_BLOCK_DATA, but > nbd server reports a hole based on BDRV_BLOCK_ALLOCATED. > > The NBD protocol says: > > NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and > future writes to that area may cause fragmentation or encounter an > NBD_ENOSPC error); if clear, the block is allocated or the server > could not otherwise determine its status.
Notoriously, lseek(SEEK_HOLE) cannot actually tell us about which portions of a file are allocated (it can only reliably tell us which portions of a file read as zero); you have to use the FIEMAP ioctl or similar to learn about allocation status of a raw file, which is too expensive to be useful. The fact that SEEK_HOLE does not tell us about NBD_STATE_HOLE, but instead is solely about NBD_STATE_ZERO, is therefore a misnomer that doesn't make it any easier to reason about. But this patch is also an indication that our meaning of BDRV_BLOCK_ALLOCATED is confused; there has been a mismatch over time on whether it represents whether something is allocated by occupying space on disk (which as pointed out above is too expensive to determine for files, but easy to determine for qcow2) or represents allocated at this depth within a backing chain (makes no sense for protocol layers, but total sense for format layers like qcow2). At one point, Vladimir audited the code base to try and reserve BDRV_BLOCK_ALLOCATED solely for the latter meaning (which layer of a backing chain supplies the data, regardless of whether that extent occupies reserved storage space). So on that grounds alone, I am inclined to agree with you that using BDRV_BLOCK_ALLOCATED to feed NBD_STATE_HOLE is wrong: whether something comes from the current layer of the backing image is unrelated to whether that extent is known to occupy reserved storage in the file system. Failure to sest NBD_STATE_HOLE is not fatal (the NBD protocol specifically chose NBD_STATE_HOLE and NBD_STATE_ZERO to have default values of 0 in the common case, where setting them to 1 allows some optimizations, but where skipping the chance to set them to 1 because determining the answer is too expensive, such as FIEMAP, is fine as well). > > qemu-img manual says: > > whether the sectors contain actual data or not (boolean field data; > if false, the sectors are either unallocated or stored as > optimized all-zero clusters); > > To me, data=false looks compatible with NBD_STATE_HOLE. From user point > of view, getting same results from qemu-nbd and qemu-img is more > important than being more correct about allocation status. More to the point, here is our inconsistency: In nbd/server.c, we turn !BDRV_BLOCK_ALLOCATED into NBD_STATE_HOLE In block/nbd.c, we turn !NBD_STATE_HOLE into BDRV_BLOCK_DATA The fact that we are not doing a round-trip conversion means that one of the two places is wrong. And your argument that the server side is wrong makes sense to me. > > Changing nbd server to report holes using BDRV_BLOCK_DATA makes qemu-nbd > results compatible with qemu-img map: > > $ qemu-img map --output json nbd://localhost > [{ "start": 0, "length": 6442450944, "depth": 0, "zero": true, "data": false, > "offset": 0}] > > Signed-off-by: Nir Soffer <nsof...@redhat.com> > --- > nbd/server.c | 4 ++-- > tests/qemu-iotests/241.out | 4 ++-- > 2 files changed, 4 insertions(+), 4 deletions(-) Conflicts with my patch to make NBD_CMD_READ/NBD_CMD_BLOCK_STATUS obey block alignment boundaries: https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06010.html Reviewed-by: Eric Blake <ebl...@redhat.com> I'll wait a few days for any other reviewer commentary before taking this through my NBD tree. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org