On 2024/09/23 15:40, Sam Li wrote:
> Hi Damien,
>
> On Mon, Sep 23, 2024 at 15:22, Damien Le Moal <dlem...@kernel.org> wrote:
>>
>> On 2024/09/23 13:06, Sam Li wrote:
>>
>> [...]
>>
>>>>> @@ -2837,6 +3180,19 @@ qcow2_co_pwritev_part(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>>>          qiov_offset += cur_bytes;
>>>>>          trace_qcow2_writev_done_part(qemu_coroutine_self(), cur_bytes);
>>>>>      }
>>>>> +
>>>>> +    if (bs->bl.zoned == BLK_Z_HM) {
>>>>> +        index = start_offset / zone_size;
>>>>> +        wp = &bs->wps->wp[index];
>>>>> +        if (!QCOW2_ZT_IS_CONV(*wp)) {
>>>>> +            /* Advance the write pointer when the write completes */
>>>>
>>>> Updating the write pointer after I/O does not prevent other write
>>>> requests from beginning at the same offset as this request. Multiple
>>>> write request coroutines can run concurrently and only the first one
>>>> should succeed. The others should fail if they are using the same
>>>> offset.
>>>>
>>>> The comment above says "Real drives change states before it can write to
>>>> the zone" and I think it's appropriate to update the write pointer
>>>> before performing the write too. The qcow2 zone emulation code is
>>>> different from the file-posix.c passthrough code. We are responsible for
>>>> maintaining zoned metadata state and cannot wait for the result of the
>>>> I/O to tell us what happened.
>>
>> Yes, correct. The wp MUST be updated when issuing the IO, with the
>> assumption that the write IO will succeed (errors are rare!).
>>
>>> The problem with updating the write pointer before IO completion is the
>>> failure case. It can't be predicted in advance whether an IO will fail.
>>> When a write I/O fails, the wp should not be updated.
>>
>> Correct, if an IO fails, the wp should not be updated. However, that is not
>> difficult to deal with:
>> 1) Under the zone lock, advance the wp position when issuing the write IO.
>> 2) When the write IO completes with success, nothing else needs to be done.
>> 3) When *any* write IO completes with an error, you need to:
>>    - Lock the zone
>>    - Do a report zones for the target zone of the failed write to get the
>>      current wp location
>>    - Update bs->wps->wp[index] using that current wp location
>>    - Unlock the zone
>>
>> With that, one may get a few errors if multiple async writes are being
>> issued, but that behavior is consistent with what happens on a real drive,
>> so no issue. And since the report zones gives you the current wp location,
>> the user can restart writing from that location once it has dealt with all
>> the previous write failures.
>
> I see. To allow concurrent writes, the lock will only be used on the
> failure path while processing append writes.
>
>>
>>> The alternative way is to hold the wps lock, which is also required for
>>> accessing the wp. Therefore only one of multiple concurrent write
>>> requests will succeed.
>>
>> That is a very simple solution that avoids the above error recovery, but it
>> would be very bad for performance (especially for a pure sequential write
>> workload, as we would limit IOs to queue depth 1). So if we can avoid this
>> simple approach, that would be a lot better.
>
> Yeah, I'll drop this approach. Although, it reminds me of how the
> file-posix driver emulates zone_append. It holds the lock whenever
> accessing wps. Does that limit IOs to QD 1 too? If so, it can be
> improved.
>
> -- one zone_append starts
>>> wp_lock()
>>>> IO processing
>>>>> wp_update
>>>>>> wp_unlock()
> -- ends
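[Editor's note: the recovery steps quoted above amount to something like the
following sketch. This is an illustration only; the ZoneWpState type and the
report_zone_wp() helper are made-up names, not part of the patch or of QEMU.]

#include <pthread.h>
#include <stdint.h>

typedef struct ZoneWpState {
    pthread_mutex_t lock;   /* protects wp[] */
    uint64_t *wp;           /* emulated write pointer of each zone */
} ZoneWpState;

/* Assumed helper: return the authoritative write pointer of one zone,
 * i.e. the emulated equivalent of a report zones command. */
uint64_t report_zone_wp(ZoneWpState *wps, uint32_t index);

/* Called when a write to a sequential zone completed with an error. */
static void zone_write_error_recovery(ZoneWpState *wps, uint32_t index)
{
    pthread_mutex_lock(&wps->lock);                  /* lock the zone   */
    wps->wp[index] = report_zone_wp(wps, index);     /* re-read the wp  */
    pthread_mutex_unlock(&wps->lock);                /* unlock the zone */
}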
Yes, this limits the IO queue depth to 1, so not great. file-posix can use
the exact same error recovery mechanism as I explained. The write IO flow
would thus become:

On submission:
1) wp_lock()
2) Check write alignment with wp (or change the zone append into a regular write)
3) Issue the write as an asynchronous IO
4) wp_update
5) wp_unlock()

And on completion, all you need is:
1) If no error, return success
2) wp_lock()
3) Do a report zones and use the reported wp value as the current wp
4) wp_unlock()
5) Return the IO error

With this simple scheme, when there are no IO errors, things are fast and
there is no "big lock" limiting the number of writes that can be issued.
Only write error recovery ends up locking everything during the report
zones, but that is fine. Errors are rare :)

--
Damien Le Moal
Western Digital Research
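[Editor's note: the submission/completion flow described above could look
roughly like the sketch below. Illustration only; ZoneWpState and
report_zone_wp() repeat the hypothetical names from the earlier sketch so
this fragment stands on its own, and submit_async_write() is likewise a
stand-in for whatever async write mechanism the driver uses, not an actual
file-posix or qcow2 function.]

#include <errno.h>
#include <pthread.h>
#include <stdint.h>

typedef struct ZoneWpState {
    pthread_mutex_t lock;   /* protects wp[] */
    uint64_t *wp;           /* emulated write pointer of each zone */
} ZoneWpState;

typedef struct ZoneWrite {
    ZoneWpState *wps;
    uint32_t index;         /* zone targeted by this write */
} ZoneWrite;

/* Assumed helpers: queue an asynchronous write (cb(opaque, error) runs when
 * it completes) and query the current wp of a zone (emulated report zones). */
int submit_async_write(uint64_t offset, uint64_t len,
                       void (*cb)(void *opaque, int error), void *opaque);
uint64_t report_zone_wp(ZoneWpState *wps, uint32_t index);

/* Completion: only the error path takes the lock and does a report zones. */
static void zone_write_complete(void *opaque, int error)
{
    ZoneWrite *zw = opaque;

    if (error) {
        pthread_mutex_lock(&zw->wps->lock);                          /* 2) wp_lock()   */
        zw->wps->wp[zw->index] = report_zone_wp(zw->wps, zw->index); /* 3) resync wp   */
        pthread_mutex_unlock(&zw->wps->lock);                        /* 4) wp_unlock() */
    }
    /* 1)/5) the IO status (success or error) is returned to the caller as-is */
}

/* Submission: the wp is advanced as soon as the write is issued. */
static int zone_write_submit(ZoneWrite *zw, uint64_t offset, uint64_t len)
{
    ZoneWpState *wps = zw->wps;
    int ret;

    pthread_mutex_lock(&wps->lock);            /* 1) wp_lock() */
    if (offset != wps->wp[zw->index]) {        /* 2) check alignment with wp   */
        pthread_mutex_unlock(&wps->lock);      /*    (a zone append would use  */
        return -EINVAL;                        /*    wps->wp[index] instead)   */
    }
    ret = submit_async_write(offset, len,      /* 3) issue the asynchronous IO */
                             zone_write_complete, zw);
    if (ret == 0) {
        wps->wp[zw->index] += len;             /* 4) wp_update */
    }
    pthread_mutex_unlock(&wps->lock);          /* 5) wp_unlock() */
    return ret;
}

With a scheme along these lines, concurrent sequential writes only serialize
on the short wp check/update critical section, and the "big lock" is only
held across a report zones on the rare error path.]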