On Mon, Apr 04, 2016 at 05:32:34PM -0600, Eric Blake wrote:
> On 04/04/2016 05:08 PM, Wouter Verhelst wrote:
> > On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
> >> saying about dirtiness, we would soon come to the fact, that
> >> we can have several dirtiness states regarding different
> >> lines of incremental backups. This complexity is hidden
> >> inside QEMU and it would be very difficult to publish and
> >> reuse it.
> > 
> > How about this then.
> > 
> > A reply to GET_BLOCK_STATUS containing chunks of this:
> > 
> > 32-bit length
> > 32-bit "snapshot status"
> > if bit 0 in the latter field is set, that means the block is allocated
> >   on the original device
> > if bit 1 is set, that means the block is allocated on the first-level
> >   snapshot
> > if bit 2 is set, that means the block is allocated on the second-level
> >   snapshot
> 
> The idea of allocation is orthogonal from the idea of reads as zeroes.
> That is, a client may usefully guarantee that something reads as zeroes,
> whether or not it is allocated (but knowing whether it is a hole or
> allocated will determine whether future writes to that area will cause
> file system fragmentation or be at risk of ENOSPC on thin-provisioning).
>  If we want to expose the notion of depth (and I'm not sure about that
> yet), we may want to reserve bit 0 for 'reads as zero' and bits 1-30 as
> 'allocated at depth "bit-1"'

That works too, I suppose.

> (and bit 31 as 'allocated at depth 30 or greater').

The reason I said "the protocol does not define" was specifically to
avoid this. There is nothing that compels a server to map its
implementation-specific snapshots one-to-one on NBD protocol snapshots.
E.g., a client could ask for a snapshot out of band; the server would
then create a new snapshot on top of its existing snapshots, and map
things as: bit 0 for "reads as zeroes", bit 1 for "allocated before the
snapshot you just asked me for", folding together all allocation bits
for older snapshots, and bit 2 as "this block has been changed in the
context of the snapshot you just asked me for".

The out-of-band protocol could alternatively say "NBD snapshot level 13
is the snapshot I just created for you".
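
As a rough sketch of the three-bit mapping described above (the macro
and function names are made up for illustration; the proposal only
talks about bit positions, and nothing here is part of the protocol):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the discussion. */
#define SNAP_READS_ZERO      (UINT32_C(1) << 0) /* block reads as zeroes */
#define SNAP_ALLOC_BEFORE    (UINT32_C(1) << 1) /* allocated before the requested snapshot */
#define SNAP_CHANGED_IN_SNAP (UINT32_C(1) << 2) /* changed in the requested snapshot */

/* Build the status word a server might report for one extent. */
static uint32_t snap_status(bool reads_zero, bool alloc_before, bool changed)
{
    uint32_t status = 0;

    if (reads_zero)
        status |= SNAP_READS_ZERO;
    if (alloc_before)
        status |= SNAP_ALLOC_BEFORE;
    if (changed)
        status |= SNAP_CHANGED_IN_SNAP;
    return status;
}
```

A server that keeps many internal snapshots would OR all of its older
allocation information into SNAP_ALLOC_BEFORE before replying.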

> I don't know if the idea of depth of allocation is useful enough to
> expose in this manner; qemu could certainly advertise depth if the
> protocol calls it out, but I'm still not sure whether knowing depth
> helps any algorithms.

The point is that with the proposed format, depth *can* be exposed but
doesn't *have* to be. The original proposal provided two levels of
allocation (on the original device, and on the snapshot or whatever
tracking method is used). This proposal provides 32 (if counting the "is
zeroes" bit as a separate level).

> > If all flags are cleared, that means the block is not allocated (i.e.,
> > is a hole) and MUST read as zeroes.
> 
> That's too strong.  NBD_CMD_TRIM says that we can create holes whose
> data does not necessarily read as zeroes (and SCSI definitely has
> semantics like this - not all devices guarantee zero reads when you
> UNMAP; and WRITE_SAME has an UNMAP flag to control whether you are okay
> with the faster unmapping operation at the expense of bad reads, or
> slower explicit writes).  Hence my complaint that we have to treat
> 'reads as zero' as an orthogonal bit to 'allocated at depth X'.

Right, so we should special-case the first bit then, as you suggest.

> > If a flag is set at a particular level X, that means the device is dirty
> > at the Xth-level snapshot.
> > 
> > If at least one flag is set for a region, that means the data may read
> > as "not zero".
> > 
> > The protocol does not define what it means to have multiple levels of
> > snapshots, other than:
> > 
> > - Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
> >   Xth level flag if the Yth level flag is not also cleared at the same
> >   time, for any Y > X
> > - Any write (as above) MAY set or clear multiple levels of flags at the
> >   same time, as long as the above holds
> > 
> > Having a 32-bit snapshot status field allows for 32 levels of snapshots.
> > We could switch length and flags to 64 bits so that things continue to
> > align nicely, and then we have a maximum of 64 levels of snapshots.
> 
> 64 bits may not lay out as nicely (a 12-byte struct is not as efficient
> to copy between the wire and a C array as an 8-byte struct).

Which is why I said to switch *both* fields to 64 bits, so that it
becomes 16 bytes rather than 12 :-)
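
To make the sizes concrete, here is a sketch of the two layouts in C.
The struct and field names are hypothetical, and this only illustrates
the size argument; it says nothing about byte order on the wire.

```c
#include <stdint.h>

/* Original proposal: 32-bit length plus 32-bit snapshot status. */
struct status_chunk32 {
    uint32_t length;
    uint32_t status;
};

/* Both fields widened to 64 bits, allowing up to 64 flag bits. */
struct status_chunk64 {
    uint64_t length;
    uint64_t status;
};

/* Neither layout needs padding, so an array of chunks stays dense. */
_Static_assert(sizeof(struct status_chunk32) == 8, "32-bit chunk is 8 bytes");
_Static_assert(sizeof(struct status_chunk64) == 16, "64-bit chunk is 16 bytes");
```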

> > (I'm not going to write this up formally at this time of night, but you
> > get the general idea)
> 
> The idea may make it possible to expose dirty information as a layer of
> depth (from the qemu perspective, each qcow2 file would occupy 2 layers
> of depth: one if dirty, and another if allocated; then deeper layers are
> determined by backing files).  But I'm also worried that it may be more
> complicated than the original question at hand (qemu wants to know, in
> advance of a read, which portions of a file are worth reading, because
> they are either allocated, or because they are dirty; but doesn't care
> to what depth the server has to go to actually perform the reads).

You just do "if ([snapshot status] >= (1 << X))" to see if any bits at
position X or higher are set. That's a fairly simple operation. If a
client isn't interested in anything more than "modified beyond the
original device", it checks whether the snapshot status field has a
value >= 4. If so, it knows it must read the relevant blocks.
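
Spelled out in C, the client-side check could look something like this
(function names are made up for illustration; it assumes bit 0 is
"reads as zeroes" and bit 1 is "allocated on the original device", as
discussed above):

```c
#include <stdbool.h>
#include <stdint.h>

/* True if any flag at bit position x or higher is set; x must be < 32. */
static bool set_at_or_above(uint32_t status, unsigned x)
{
    return status >= (UINT32_C(1) << x);
}

/* "Modified beyond the original device": any bit at position 2 or higher. */
static bool changed_beyond_original(uint32_t status)
{
    return set_at_or_above(status, 2);
}
```

The comparison works because any set bit at position X or higher makes
the value at least 1 << X, whatever the lower bits happen to be.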

If the server doesn't want to expose more information than the client
would need for such an operation, it could fold all snapshots beyond the
original device into the first level snapshot (i.e., only sending
information in the first three bits of the field), and then the client
would have less information, but still enough to do what it needs to do.
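
A server doing that kind of folding might use something along these
lines (a sketch only; the function name is hypothetical and the bit
assignments are the ones assumed earlier in this mail):

```c
#include <stdint.h>

/*
 * Collapse a full status word into the first three bits:
 * bits 0 and 1 are passed through unchanged, and bit 2 is set
 * if any bit at position 2 or higher was set in the input.
 */
static uint32_t fold_to_three_bits(uint32_t status)
{
    uint32_t low = status & UINT32_C(0x3);
    uint32_t any_snapshot = (status >> 2) != 0;

    return low | (any_snapshot ? UINT32_C(0x4) : 0);
}
```

A client comparing the folded value against 4 gets the same answer as
it would from the unfolded one.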

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12
