On Wed, May 20, 2026 at 7:59 PM Stefan Hajnoczi <[email protected]> wrote:
>
> On Tue, May 19, 2026 at 11:20:18PM +0200, Sam Li wrote:
> > On Tue, May 19, 2026 at 5:49 PM Stefan Hajnoczi <[email protected]> wrote:
> > >
> > > On Mon, May 18, 2026 at 12:21:55AM +0200, Sam Li wrote:
> > > > On Thu, May 14, 2026 at 9:49 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > > On Sun, May 10, 2026 at 07:50:57PM +0200, Sam Li wrote:
> > > > > > +         48 - 55:  zonedmeta_offset
> > > > > > +                   The offset of zoned metadata structure in the contained
> > > > > > +                   image, in bytes.
> > > > >
> > > > > Do you want to say anything about the order in which metadata is
> > > > > persisted to disk when zones are used? I guess the data is written into
> > > > > the image file first, then the non-zoned qcow2 L1/L2/refcount metadata
> > > > > is updated, and finally the write pointer is written. Write pointers are
> > > > > not guaranteed to be updated on disk until the write request and a
> > > > > subsequent flush request have both completed.
> > > >
> > > > The current ordering is not like that. The write pointer is written
> > > > persistently first, then the data writes and the non-zoned qcow2
> > > > L1/L2/refcount metadata updates. On IO failure, the corresponding
> > > > write pointer is re-read from disk. As noted in the previous comment,
> > > > the wp must be updated when issuing the IO, under the assumption that
> > > > the write IO will succeed.
> > > >
> > > > The ordering has been settled this way since v7 to deal with
> > > > concurrent zone append writes. If the wp were updated only after the
> > > > data I/O, two concurrent appends could both read the same wp and try
> > > > to write to the same position.
> > > >
> > > > >
> > > > > (The idea is that the data must be visible in the qcow2 file before it
> > > > > is safe to update the write pointer. Otherwise a power failure would
> > > > > leave the file in an inconsistent state where the write pointer has
> > > > > advanced but the data was not written.)
> > > >
> > > > The crash-consistency is a concern...
> > >
> > > Yes, I'm thinking about crash-consistency. The ordering you described
> > > can result in qcow2 images where the write pointer is ahead of the
> > > actually written data after a power failure or maybe a QEMU crash.
> > >
> > > QEMU's block layer must follow the same data integrity behavior that
> > > real devices guarantee.
> >
> > I may have found a solution to deal with both cases. The fix is to
> > update wp in memory instead of flushing it before qcow2 metadata and
> > data writes. The zone append write path would become:
> >
> > On submission:
> >
> > 1) wp_lock()
> > 2) Check write alignment
> > 3) wp_update (in memory)
> > 4) wp_unlock()
> > 5) Issue write
> >
> > And on completion:
> > 1) If no error: wp_flush with locks and return success
>
> The data may not be visible in the qcow2 file yet because qcow2's
> L1/L2/refcount cache is not written back to the file until a flush
> request. I think the write pointer updates should have a dependency on
> the qcow2 metadata so that write pointers are only written after qcow2
> metadata.

Indeed. The qcow2 cache was also my concern. Since the wp should be
persisted only after the corresponding data is flushed, the cache
dependency would be qcow2 metadata -> data -> wp. Can we set the wp's
dependency on the data so that the wp is written after the data is
persisted? I might be missing something here.

>
> See block/qcow2-cache.c and qcow2_cache_set_dependency(). The idea is
> that one type of cached metadata can set a dependency on another type of
> cached metadata so that ordering is guaranteed.

Thanks, I'll check it out.
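
For illustration only, here is a toy model of the dependency idea described
above. It is not QEMU code: ToyCache and all toy_* names are made up for this
sketch, and it only assumes that flushing a cache first flushes whatever that
cache depends on. Under that assumption, making the wp depend on the L2
metadata means the wp reaches disk second.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_LOG_MAX 8

/* Hypothetical stand-in for a metadata cache; not the QEMU Qcow2Cache type. */
typedef struct ToyCache {
    const char *name;
    int dirty;                 /* has unwritten changes */
    struct ToyCache *depends;  /* must reach disk before this cache */
} ToyCache;

/* Records the order in which caches were "written to disk". */
static const char *toy_log[TOY_LOG_MAX];
static size_t toy_log_len;

/* Declare that `dependency` must be flushed before `c`. */
static void toy_cache_set_dependency(ToyCache *c, ToyCache *dependency)
{
    c->depends = dependency;
}

/* Flush the dependency first, then the cache itself if it is dirty. */
static void toy_cache_flush(ToyCache *c)
{
    if (c->depends) {
        toy_cache_flush(c->depends);
        c->depends = NULL;
    }
    if (c->dirty) {
        assert(toy_log_len < TOY_LOG_MAX);
        toy_log[toy_log_len++] = c->name;
        c->dirty = 0;
    }
}

/* Returns 1 if flushing the wp cache wrote "l2" first and "wp" second. */
static int toy_demo_order(void)
{
    ToyCache l2 = { "l2", 1, NULL };
    ToyCache wp = { "wp", 1, NULL };

    toy_log_len = 0;
    toy_cache_set_dependency(&wp, &l2);
    toy_cache_flush(&wp);

    return toy_log_len == 2 &&
           strcmp(toy_log[0], "l2") == 0 &&
           strcmp(toy_log[1], "wp") == 0;
}
```

The sketch does not model the guest data itself, only the ordering between two
kinds of cached metadata.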

>
> > 2) else, wp_lock()
> > 3) read_wp (from disk) and use the read wp value as the current wp
> > 4) wp_unlock()
> > 5) return IO error
> >
> > Sam
> >
> > >
> > > Damien: Do real zoned block devices guarantee that the updated write
> > > pointer is persisted only after the appended data has been
> > > persisted?
> > >
> > > Stefan
> >
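
The submission and completion steps quoted above can be sketched roughly as
follows. This is a simplified, single-threaded toy, not the patch itself:
ToyZone, TOY_BLOCK_SIZE and the toy_* helpers are hypothetical names, the lock
is modelled as a flag, and "disk" is just a second field. It ignores the qcow2
cache ordering discussed in this thread and does not try to handle completion
with several appends in flight.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 512u  /* assumed write alignment for this sketch */

typedef struct ToyZone {
    uint64_t start;     /* first byte of the zone */
    uint64_t capacity;  /* writable bytes in the zone */
    uint64_t wp_mem;    /* in-memory write pointer */
    uint64_t wp_disk;   /* stands in for the persisted write pointer */
    int locked;         /* toy replacement for the wp lock */
} ToyZone;

static void toy_wp_lock(ToyZone *z)   { assert(!z->locked); z->locked = 1; }
static void toy_wp_unlock(ToyZone *z) { assert(z->locked);  z->locked = 0; }

/*
 * Submission: under the lock, check alignment and room, then advance the
 * in-memory wp. On success *offset is where the caller issues the write.
 * Returns 0 on success, -1 if the request is rejected.
 */
static int toy_append_submit(ToyZone *z, uint64_t len, uint64_t *offset)
{
    int ret = -1;

    toy_wp_lock(z);
    if (len > 0 && len % TOY_BLOCK_SIZE == 0 &&
        z->wp_mem + len <= z->start + z->capacity) {
        *offset = z->wp_mem;
        z->wp_mem += len;
        ret = 0;
    }
    toy_wp_unlock(z);
    return ret;
}

/*
 * Completion: on success persist the wp, on failure fall back to the
 * value last stored on "disk". Returns io_ret unchanged.
 */
static int toy_append_complete(ToyZone *z, int io_ret)
{
    toy_wp_lock(z);
    if (io_ret == 0) {
        z->wp_disk = z->wp_mem;  /* stands in for wp_flush */
    } else {
        z->wp_mem = z->wp_disk;  /* stands in for re-reading the wp */
    }
    toy_wp_unlock(z);
    return io_ret;
}
```

Because the in-memory wp is advanced at submission, two back-to-back
toy_append_submit() calls are handed different offsets, which is the property
the wp update at submission time is meant to preserve.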
