On Wed, May 20, 2026 at 08:23:41PM +0200, Sam Li wrote:
> On Wed, May 20, 2026 at 7:59 PM Stefan Hajnoczi <[email protected]> wrote:
> >
> > On Tue, May 19, 2026 at 11:20:18PM +0200, Sam Li wrote:
> > > > On Tue, May 19, 2026 at 5:49 PM Stefan Hajnoczi <[email protected]> wrote:
> > > >
> > > > On Mon, May 18, 2026 at 12:21:55AM +0200, Sam Li wrote:
> > > > > > On Thu, May 14, 2026 at 9:49 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > > > On Sun, May 10, 2026 at 07:50:57PM +0200, Sam Li wrote:
> > > > > > > +         48 - 55:  zonedmeta_offset
> > > > > > > +                   The offset of zoned metadata structure in the 
> > > > > > > contained
> > > > > > > +                   image, in bytes.
> > > > > >
> > > > > > > Do you want to say anything about the order in which metadata is
> > > > > > > persisted to disk when zones are used? I guess the data is written
> > > > > > > into the image file first, then the non-zoned qcow2 L1/L2/refcount
> > > > > > > metadata is updated, and finally the write pointer is written. Write
> > > > > > > pointers are not guaranteed to be updated on disk until the write
> > > > > > > request followed by a flush request are both completed.
> > > > >
> > > > > The current ordering is not like that. The write pointer is written
> > > > > persistently first, then the data writes and the non-zoned qcow2
> > > > > L1/L2/refcount metadata updates. On IO failure, the corresponding
> > > > > write pointer is re-read from disk. As noted in the previous comment,
> > > > > the wp must be updated when issuing the IO, under the assumption that
> > > > > the write IO will succeed.
> > > > >
> > > > > The ordering has been settled this way since v7 to deal with
> > > > > concurrent zone append writes. If the wp was only updated after data
> > > > > I/O, two concurrent appends would both have read the same wp and tried
> > > > > to write to the same position.
> > > > >
> > > > > >
> > > > > > > (The idea is that the data must be visible in the qcow2 file before
> > > > > > > it is safe to update the write pointer. Otherwise a power failure
> > > > > > > would leave the file in an inconsistent state where the write
> > > > > > > pointer has advanced but the data was not written.)
> > > > >
> > > > > The crash-consistency is a concern...
> > > >
> > > > Yes, I'm thinking about crash-consistency. The ordering you described
> > > > can result in qcow2 images where the write pointer is ahead of the
> > > > actually written data after a power failure or maybe a QEMU crash.
> > > >
> > > > QEMU's block layer must follow the same data integrity behavior that
> > > > real devices guarantee.
> > >
> > > I may have found a solution to deal with both cases. The fix is to
> > > update wp in memory instead of flushing it before qcow2 metadata and
> > > data writes. The zone append write path would become:
> > >
> > > On submission:
> > >
> > > 1) wp_lock()
> > > 2) Check write alignment
> > > 3) wp_update (in memory)
> > > 4) wp_unlock()
> > > 5) Issue write
> > >
> > > And on completion:
> > > 1) If no error: wp_flush with locks and return success
> >
> > The data may not be visible in the qcow2 file yet because qcow2's
> > L1/L2/refcount cache is not written back to the file until a flush
> > request. I think the write pointer updates should have a dependency on
> > the qcow2 metadata so that write pointers are only written after qcow2
> > metadata.
> 
> Indeed. The qcow2 cache was also my concern. Since wp should be
> persisted after corresponding data is flushed, the cache dependency
> would be qcow2 metadata -> data -> wp. Can we set wp's dependency on
> the data so that wp is written after data is persisted? I might be
> missing something here.

The cached metadata is written after the data, so you don't need to do
anything special to ensure data -> qcow2 metadata -> wp ordering.

One thing to consider is when to increment the write pointer in the
cache. When there are concurrent requests, the wp written to file should
reflect the last _completed_ data write and not in-flight data writes.

It might be necessary to use additional state rather than incrementing
the wp cache immediately when submitting a write request. For example,
iterating over in-flight write requests to calculate the next wp based
on the maximum offset + length and only falling back to the wp cache
when there are no in-flight append requests in this zone.
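
A minimal sketch of that calculation in C, assuming a hypothetical
per-zone list of in-flight append requests. The names ZoneState,
InflightAppend and zone_next_append_offset are illustrative only; they
are not taken from the patch series or from existing QEMU code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-flight zone append request; not a QEMU structure. */
typedef struct InflightAppend {
    uint64_t offset;              /* start offset assigned at submission */
    uint64_t len;                 /* request length in bytes */
    struct InflightAppend *next;  /* next in-flight append in this zone */
} InflightAppend;

/* Hypothetical per-zone state. */
typedef struct ZoneState {
    uint64_t cached_wp;           /* wp covering completed writes only */
    InflightAppend *inflight;     /* appends submitted but not yet completed */
} ZoneState;

/*
 * Offset at which the next append in this zone should land: the maximum
 * of offset + len over the in-flight appends. When nothing is in flight
 * the loop does not run and the cached wp is returned unchanged.
 */
static uint64_t zone_next_append_offset(const ZoneState *z)
{
    assert(z != NULL);

    uint64_t next = z->cached_wp;
    for (const InflightAppend *r = z->inflight; r != NULL; r = r->next) {
        uint64_t end = r->offset + r->len;
        if (end > next) {
            next = end;
        }
    }
    return next;
}
```

With something like this, cached_wp would only be advanced on completion,
so the value that eventually reaches the file never covers in-flight data.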

> 
> >
> > See block/qcow2-cache.c and qcow2_cache_set_dependency(). The idea is
> > that one type of cached metadata can set a dependency on another type of
> > cached metadata so that ordering is guaranteed.
> 
> Thanks, I'll check it out.

By the way, I think this will require making the wp metadata a qcow2
cache object that is created with qcow2_cache_create().
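
To illustrate the ordering a cache dependency gives, here is a toy model
in C. It deliberately does not use the real qcow2 cache API: ToyCache,
toy_cache_flush and flush_log are invented for this sketch, which only
shows that flushing the dependent cache writes its dependency first:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a metadata cache; not QEMU's Qcow2Cache. */
typedef struct ToyCache {
    const char *name;
    int dirty;                  /* non-zero if there is data to write back */
    struct ToyCache *depends;   /* cache that must be flushed before this one */
} ToyCache;

/* Records the order of write-backs; sized for the short names used here. */
static char flush_log[64];

/* Flush the dependency (if any) first, then write back this cache. */
static void toy_cache_flush(ToyCache *c)
{
    assert(c != NULL);

    if (c->depends != NULL) {
        toy_cache_flush(c->depends);
        c->depends = NULL;      /* dependency satisfied */
    }
    if (c->dirty) {
        strcat(flush_log, c->name);
        strcat(flush_log, ";");
        c->dirty = 0;
    }
}
```

In this model, a wp cache that depends on the L2 cache can never be
written back while dirty L2 entries are still only in memory.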

Stefan

> 
> >
> > > 2) else, wp_lock()
> > > 3) read_wp (from disk) and use the read wp value as the current wp
> > > 4) wp_unlock()
> > > 5) return IO error
> > >
> > > Sam
> > >
> > > >
> > > > > Damien: Do real zoned block devices guarantee that the updated write
> > > > > pointer is persisted only after the appended data has been
> > > > > persisted?
> > > >
> > > > Stefan
> > >
> 
