On Tue, May 19, 2026 at 11:20:18PM +0200, Sam Li wrote:
> On Tue, May 19, 2026 at 5:49 PM Stefan Hajnoczi <[email protected]> wrote:
> >
> > On Mon, May 18, 2026 at 12:21:55AM +0200, Sam Li wrote:
> > > On Thu, May 14, 2026 at 9:49 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > On Sun, May 10, 2026 at 07:50:57PM +0200, Sam Li wrote:
> > > > > +         48 - 55:  zonedmeta_offset
> > > > > > +                   The offset of zoned metadata structure in the contained
> > > > > +                   image, in bytes.
> > > >
> > > > Do you want to say anything about the order in which metadata is
> > > > persisted to disk when zones are used? I guess the data is written into the
> > > > image file first, then the non-zoned qcow2 L1/L2/refcount metadata is
> > > > updated, and finally the write pointer is written. Write pointers are
> > > > not guaranteed to be updated on disk until the write request followed by
> > > > a flush request are both completed.
> > >
> > > The current ordering is not like that. The write pointer is written
> > > persistently first, then the data writes and the non-zoned qcow2
> > > L1/L2/refcount metadata updates. On IO failure, the corresponding
> > > write pointer is re-read from disk. As noted in the previous comment,
> > > the wp must be updated when issuing the IO, under the assumption that
> > > the write IO will succeed.
> > >
> > > The ordering has been settled this way since v7 to deal with
> > > concurrent zone append writes. If the wp was only updated after data
> > > I/O, two concurrent appends would both have read the same wp and tried
> > > to write to the same position.
> > >
> > > >
> > > > (The idea is that the data must be visible in the qcow2 file before it
> > > > is safe to update the write pointer. Otherwise a power failure would
> > > > leave the file in an inconsistent state where the write pointer has
> > > > advanced but the data was not written.)
> > >
> > > The crash-consistency is a concern...
> >
> > Yes, I'm thinking about crash-consistency. The ordering you described
> > can result in qcow2 images where the write pointer is ahead of the
> > actually written data after a power failure or maybe a QEMU crash.
> >
> > QEMU's block layer must follow the same data integrity behavior that
> > real devices guarantee.
> 
> I may have found a solution to deal with both cases. The fix is to
> update wp in memory instead of flushing it before qcow2 metadata and
> data writes. The zone append write path would become:
> 
> On submission:
> 
> 1) wp_lock()
> 2) Check write alignment
> 3) wp_update (in memory)
> 4) wp_unlock()
> 5) Issue write
> 
> And on completion:
> 1) If no error: wp_flush with locks and return success

The data may not be visible in the qcow2 file yet because qcow2's L1/L2/refcount
cache is not written back to the file until a flush request. I think the
write pointer updates should have a dependency on the qcow2 metadata so
that write pointers are only written after qcow2 metadata.

See block/qcow2-cache.c and qcow2_cache_set_dependency(). The idea is
that one type of cached metadata can set a dependency on another type of
cached metadata so that ordering is guaranteed.
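
To make the ordering idea concrete, here is a toy model of it. This is not
QEMU code: ToyCache, toy_cache_set_dependency() and toy_cache_flush() are
hypothetical names that only loosely mirror what qcow2_cache_set_dependency()
is meant to achieve, namely that flushing one cache first writes back the
cache it depends on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these types or functions are QEMU's. */
typedef struct ToyCache {
    const char *name;
    int dirty;
    struct ToyCache *depends;   /* must be written back before this cache */
} ToyCache;

/* Records the order in which caches were written back. */
typedef struct FlushLog {
    char order[64];
} FlushLog;

/* Declare that 'dependency' must reach the file before 'c' does. */
static void toy_cache_set_dependency(ToyCache *c, ToyCache *dependency)
{
    assert(c != NULL && c != dependency);
    c->depends = dependency;
}

/* Write back 'c', flushing its dependency first if one is set. */
static void toy_cache_flush(ToyCache *c, FlushLog *log)
{
    if (c->depends) {
        toy_cache_flush(c->depends, log);
        c->depends = NULL;
    }
    if (c->dirty) {
        if (log->order[0] != '\0') {
            strcat(log->order, ",");
        }
        strcat(log->order, c->name);
        c->dirty = 0;
    }
}

/* Returns 1 if flushing the write pointer cache wrote L2 metadata first. */
static int demo_wp_flushed_after_l2(void)
{
    ToyCache l2 = { "l2", 1, NULL };
    ToyCache wp = { "wp", 1, NULL };
    FlushLog log = { "" };

    toy_cache_set_dependency(&wp, &l2);
    toy_cache_flush(&wp, &log);
    return strcmp(log.order, "l2,wp") == 0;
}
```

In this model the write pointer can never be recorded as flushed ahead of
the metadata it depends on, which is the property I am asking for.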

> 2) else, wp_lock()
> 3) read_wp (from disk) and use the read wp value as the current wp
> 4) wp_unlock()
> 5) return IO error
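
The submission and completion steps listed above could be sketched roughly
as follows. ToyZone and the toy_* functions are hypothetical stand-ins, not
the patch's code: wp_disk stands in for the on-disk write pointer, and the
assignments to it stand in for wp_flush and read_wp. The sketch follows the
listed steps literally and does not try to handle several requests being in
flight at completion time.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical zone state; not QEMU's data structures. */
typedef struct ToyZone {
    pthread_mutex_t wp_lock;
    uint64_t wp_mem;    /* in-memory write pointer */
    uint64_t wp_disk;   /* last write pointer persisted to the image */
    uint64_t end;       /* first byte past the end of the zone */
} ToyZone;

static void toy_zone_init(ToyZone *z, uint64_t size)
{
    pthread_mutex_init(&z->wp_lock, NULL);
    z->wp_mem = 0;
    z->wp_disk = 0;
    z->end = size;
}

/*
 * Submission: under the lock, check alignment and advance the in-memory
 * write pointer, so concurrent appends get distinct start offsets.
 * Returns 0 and sets *start on success, -1 otherwise.
 */
static int toy_append_submit(ToyZone *z, uint64_t len, uint64_t align,
                             uint64_t *start)
{
    int ret = -1;

    assert(align != 0);
    pthread_mutex_lock(&z->wp_lock);
    if (len % align == 0 && z->wp_mem + len <= z->end) {
        *start = z->wp_mem;
        z->wp_mem += len;
        ret = 0;
    }
    pthread_mutex_unlock(&z->wp_lock);
    return ret;     /* the caller issues the data write after unlocking */
}

/*
 * Completion: persist the write pointer on success; on error fall back
 * to the persisted value and return the error to the caller.
 */
static int toy_append_complete(ToyZone *z, int io_error)
{
    pthread_mutex_lock(&z->wp_lock);
    if (io_error == 0) {
        z->wp_disk = z->wp_mem;     /* stands in for wp_flush */
    } else {
        z->wp_mem = z->wp_disk;     /* stands in for read_wp from disk */
    }
    pthread_mutex_unlock(&z->wp_lock);
    return io_error;
}
```

Two back-to-back calls to toy_append_submit() with a 512-byte length on an
empty zone return start offsets 0 and 512, while wp_disk stays at 0 until a
completion succeeds.
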
> 
> Sam
> 
> >
> > Damien: Do real zoned block devices guarantee that the updated write
> > pointer is persisted only after appended data has been
> > persisted?
> >
> > Stefan
> 
