On Tue, May 19, 2026 at 11:20:18PM +0200, Sam Li wrote:
> On Tue, May 19, 2026 at 5:49 PM Stefan Hajnoczi <[email protected]> wrote:
> >
> > On Mon, May 18, 2026 at 12:21:55AM +0200, Sam Li wrote:
> > > On Thu, May 14, 2026 at 9:49 PM Stefan Hajnoczi <[email protected]>
> > > wrote:
> > > > On Sun, May 10, 2026 at 07:50:57PM +0200, Sam Li wrote:
> > > > > + 48 - 55: zonedmeta_offset
> > > > > + The offset of zoned metadata structure in the
> > > > > contained
> > > > > + image, in bytes.
> > > >
> > > > Do you want to say anything about the order in which metadata is
> > > > persisted to disk when zones used? I guess the data is written into the
> > > > image file first, then the non-zoned qcow2 L1/L2/refcount metadata is
> > > > updated, and finally the write pointer is written. Write pointers are
> > > > not guaranteed to be updated on disk until the write request followed by
> > > > a flush request are both completed.
> > >
> > > The current ordering is not like that. The write pointer is written
> > > persistently first, then the data writes and the non-zoned qcow2
> > > L1/L2/refcount metadata updates. On IO failure, the corresponding
> > > write pointer is re-read from disk. As noted in the previous comment,
> > > the wp must be updated when issuing the IO, under the assumption that
> > > the write IO will succeed.
> > >
> > > The ordering has been settled this way since v7 to deal with
> > > concurrent zone append writes. If the wp was only updated after data
> > > I/O, two concurrent appends would both have read the same wp and tried
> > > to write to the same position.
> > >
> > > >
> > > > (The idea is that the data must be visible in the qcow2 file before it
> > > > is safe to update the write pointer. Otherwise a power failure would
> > > > leave the file in an inconsistent state where the write pointer has
> > > > advanced but the data was not written.)
> > >
> > > The crash-consistency is a concern...
> >
> > Yes, I'm thinking about crash-consistency. The ordering you described
> > can result in qcow2 images where the write pointer is ahead of the
> > actually written data after a power failure or maybe a QEMU crash.
> >
> > QEMU's block layer must follow the same data integrity behavior that
> > real devices guarantee.
>
> I may have found a solution to deal with both cases. The fix is to
> update wp in memory instead of flushing it before qcow2 metadata and
> data writes. The zone append write path would become:
>
> On submission:
>
> 1) wp_lock()
> 2) Check write alignment
> 3) wp_update (in memory)
> 4) wp_unlock()
> 5) Issue write
>
> And on completion:
> 1) If no error: wp_flush with locks and return success

The data may not be visible in the qcow2 file yet because qcow2's
L1/L2/refcount cache is not written back to the file until a flush
request. I think the write pointer updates should have a dependency on
the qcow2 metadata so that write pointers are only written after qcow2
metadata.

See block/qcow2-cache.c and qcow2_cache_set_dependency(). The idea is
that one type of cached metadata can set a dependency on another type of
cached metadata so that ordering is guaranteed.

> 2) else, wp_lock()
> 3) read_wp (from disk) and use the read wp value as the current wp
> 4) wp_unlock()
> 5) return IO error
>
> Sam
>
> >
> > Damien: Do real zoned block devices guarantee that the updated write
> > pointer is persisted only after appended data has written been
> > persisted?
> >
> > Stefan
>
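
For illustration only, the submission and completion steps quoted above
could be modelled in plain C roughly as follows. This is a sketch under
assumptions, not QEMU code: ToyZone, toy_zone_init(),
toy_append_submit(), toy_append_complete() and TOY_BLOCK_SIZE are
hypothetical names, a pthread mutex stands in for the wp lock, and the
wp_disk field stands in for the write pointer stored in the image file.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 512u /* assumed write alignment for this sketch */

/* Hypothetical toy model of one zone; none of these names exist in QEMU. */
typedef struct {
    pthread_mutex_t lock; /* stands in for the wp lock */
    uint64_t wp_mem;      /* in-memory write pointer */
    uint64_t wp_disk;     /* stands in for the persisted write pointer */
    uint64_t zone_end;    /* first byte past the end of the zone */
} ToyZone;

static void toy_zone_init(ToyZone *z, uint64_t start, uint64_t len)
{
    assert(len % TOY_BLOCK_SIZE == 0);
    pthread_mutex_init(&z->lock, NULL);
    z->wp_mem = start;
    z->wp_disk = start;
    z->zone_end = start + len;
}

/*
 * Submission, steps 1-4: under the lock, check alignment and capacity,
 * hand the caller the current in-memory wp as its write offset, and
 * advance the in-memory wp. Step 5 (issuing the write) is left to the
 * caller. Returns 0 on success, -1 if the request is rejected.
 */
static int toy_append_submit(ToyZone *z, uint64_t len, uint64_t *offset)
{
    int ret = -1;

    pthread_mutex_lock(&z->lock);
    if (len > 0 && len % TOY_BLOCK_SIZE == 0 &&
        len <= z->zone_end - z->wp_mem) {
        *offset = z->wp_mem;
        z->wp_mem += len;
        ret = 0;
    }
    pthread_mutex_unlock(&z->lock);
    return ret;
}

/*
 * Completion: on success, persist the in-memory wp (the "wp_flush"
 * step); on error, fall back to the last persisted value (the
 * "read_wp from disk" step). Returns 0 on success, -1 on IO error.
 */
static int toy_append_complete(ToyZone *z, int io_error)
{
    pthread_mutex_lock(&z->lock);
    if (!io_error) {
        z->wp_disk = z->wp_mem;
    } else {
        z->wp_mem = z->wp_disk;
    }
    pthread_mutex_unlock(&z->lock);
    return io_error ? -1 : 0;
}
```

In this model two appends submitted back to back receive distinct
offsets, because each reserves its range under the lock before the
write is issued, and wp_disk does not change until a completion
succeeds. The sketch does not model qcow2 L1/L2/refcount writeback, so
it says nothing about ordering the write pointer after that metadata.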
