On Wed, May 20, 2026 at 08:23:41PM +0200, Sam Li wrote:
> On Wed, May 20, 2026 at 7:59 PM Stefan Hajnoczi <[email protected]> wrote:
> >
> > On Tue, May 19, 2026 at 11:20:18PM +0200, Sam Li wrote:
> > > On Tue, May 19, 2026 at 5:49 PM Stefan Hajnoczi <[email protected]> wrote:
> > > >
> > > > On Mon, May 18, 2026 at 12:21:55AM +0200, Sam Li wrote:
> > > > > On Thu, May 14, 2026 at 9:49 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > > > On Sun, May 10, 2026 at 07:50:57PM +0200, Sam Li wrote:
> > > > > > > + 48 - 55: zonedmeta_offset
> > > > > > > + The offset of zoned metadata structure in the contained
> > > > > > > + image, in bytes.
> > > > > >
> > > > > > Do you want to say anything about the order in which metadata is
> > > > > > persisted to disk when zones used? I guess the data is written
> > > > > > into the image file first, then the non-zoned qcow2
> > > > > > L1/L2/refcount metadata is updated, and finally the write
> > > > > > pointer is written. Write pointers are not guaranteed to be
> > > > > > updated on disk until the write request followed by a flush
> > > > > > request are both completed.
> > > > >
> > > > > The current ordering is not like that. The write pointer is written
> > > > > persistently first, then the data writes and the non-zoned qcow2
> > > > > L1/L2/refcount metadata updates. On IO failure, the corresponding
> > > > > write pointer is re-read from disk. As noted in the previous comment,
> > > > > the wp must be updated when issuing the IO, under the assumption that
> > > > > the write IO will succeed.
> > > > >
> > > > > The ordering has been settled this way since v7 to deal with
> > > > > concurrent zone append writes. If the wp was only updated after data
> > > > > I/O, two concurrent appends would both have read the same wp and tried
> > > > > to write to the same position.
> > > > >
> > > > > >
> > > > > > (The idea is that the data must be visible in the qcow2 file
> > > > > > before it is safe to update the write pointer. Otherwise a power
> > > > > > failure would leave the file in an inconsistent state where the
> > > > > > write pointer has advanced but the data was not written.)
> > > > >
> > > > > The crash-consistency is a concern...
> > > >
> > > > Yes, I'm thinking about crash-consistency. The ordering you described
> > > > can result in qcow2 images where the write pointer is ahead of the
> > > > actually written data after a power failure or maybe a QEMU crash.
> > > >
> > > > QEMU's block layer must follow the same data integrity behavior that
> > > > real devices guarantee.
> > >
> > > I may have found a solution to deal with both cases. The fix is to
> > > update wp in memory instead of flushing it before qcow2 metadata and
> > > data writes. The zone append write path would become:
> > >
> > > On submission:
> > >
> > > 1) wp_lock()
> > > 2) Check write alignment
> > > 3) wp_update (in memory)
> > > 4) wp_unlock()
> > > 5) Issue write
> > >
> > > And on completion:
> > > 1) If no error: wp_flush with locks and return success
> >
> > The data may not be visible in the qcow2 file yet because qcow2's
> > L1/L2/refcount cache is not written back to the file until a flush
> > request. I think the write pointer updates should have a dependency on
> > the qcow2 metadata so that write pointers are only written after qcow2
> > metadata.
>
> Indeed. The qcow2 cache was also my concern. Since wp should be
> persisted after corresponding data is flushed, the cache dependency
> would be qcow2 metadata -> data -> wp. Can we set wp's dependency on
> the data so that wp is written after data is persisted? I might be
> missing something here.

The cached metadata is written after the data, so you don't need to do
anything special to ensure data -> qcow2 metadata -> wp ordering.

One thing to consider is when to increment the write pointer in the
cache. When there are concurrent requests, the wp written to file should
reflect the last _completed_ data write and not in-flight data writes.
It might be necessary to use additional state rather than incrementing
the wp cache immediately when submitting a write request. For example,
iterating over in-flight write requests to calculate the next wp based
on the maximum offset + length and only falling back to the wp cache
when there are no in-flight append requests in this zone.

>
> >
> > See block/qcow2-cache.c and qcow2_cache_set_dependency(). The idea is
> > that one type of cached metadata can set a dependency on another type of
> > cached metadata so that ordering is guaranteed.
>
> Thanks, I'll check it out.

By the way, I think this will require making the wp metadata a qcow2
cache object that is created with qcow2_cache_create().

Stefan

> > >
> > > 2) else, wp_lock()
> > > 3) read_wp (from disk) and use the read wp value as the current wp
> > > 4) wp_unlock()
> > > 5) return IO error
> > >
> > > Sam
> > >
> > > >
> > > > Damien: Do real zoned block devices guarantee that the updated write
> > > > pointer is persisted only after appended data has written been
> > > > persisted?
> > > >
> > > > Stefan
> > > >
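
To make the in-flight tracking idea above a bit more concrete, here is a
rough sketch. It is plain Python for brevity, not QEMU code, and every
name in it (ZoneAppendState, wp_cache, in_flight, done) is hypothetical.
It assumes, beyond what was said above, that appends may complete out of
order, so the cached wp only advances across a contiguous run of
completed writes, and completed-but-not-yet-counted writes are included
when picking the next append position. Error handling (re-reading the wp
from disk) and locking are left out.

```python
class ZoneAppendState:
    """Hypothetical per-zone bookkeeping; not an existing QEMU structure."""

    def __init__(self, wp):
        self.wp_cache = wp   # reflects completed data writes only
        self.in_flight = {}  # start offset -> length, submitted but not done
        self.done = {}       # completed out of order, not yet folded into wp

    def next_wp(self):
        # Maximum offset + length over pending appends; fall back to the
        # wp cache only when nothing is pending in this zone.
        ends = [off + n for off, n in self.in_flight.items()]
        ends += [off + n for off, n in self.done.items()]
        return max(ends) if ends else self.wp_cache

    def submit(self, length):
        # The wp cache is deliberately NOT advanced at submission time.
        off = self.next_wp()
        self.in_flight[off] = length
        return off

    def complete(self, off):
        self.done[off] = self.in_flight.pop(off)
        # Assumption: advance the wp only over a contiguous completed
        # prefix, so it never moves past data that is still in flight.
        while self.wp_cache in self.done:
            self.wp_cache += self.done.pop(self.wp_cache)


zone = ZoneAppendState(wp=0)
first = zone.submit(4)    # placed at offset 0
second = zone.submit(4)   # placed at offset 4, no overlap with the first
zone.complete(second)     # wp_cache stays at 0: the first is still in flight
zone.complete(first)      # wp_cache now covers both writes
```

With this shape, the value that would eventually be written to the file
is wp_cache, which never runs ahead of completed data, while concurrent
appends still get distinct positions from next_wp().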
