On Mon, Sep 23, 2024 at 15:48 Damien Le Moal <dlem...@kernel.org> wrote:
> On 2024/09/23 15:40, Sam Li wrote:
> > Hi Damien,
> >
> > Damien Le Moal <dlem...@kernel.org> wrote on Mon, Sep 23, 2024 at 15:22:
> >>
> >> On 2024/09/23 13:06, Sam Li wrote:
> >>
> >> [...]
> >>
> >>>>> @@ -2837,6 +3180,19 @@ qcow2_co_pwritev_part(BlockDriverState *bs, int64_t offset, int64_t bytes,
> >>>>>          qiov_offset += cur_bytes;
> >>>>>          trace_qcow2_writev_done_part(qemu_coroutine_self(), cur_bytes);
> >>>>>      }
> >>>>> +
> >>>>> +    if (bs->bl.zoned == BLK_Z_HM) {
> >>>>> +        index = start_offset / zone_size;
> >>>>> +        wp = &bs->wps->wp[index];
> >>>>> +        if (!QCOW2_ZT_IS_CONV(*wp)) {
> >>>>> +            /* Advance the write pointer when the write completes */
> >>>>
> >>>> Updating the write pointer after I/O does not prevent other write
> >>>> requests from beginning at the same offset as this request. Multiple
> >>>> write request coroutines can run concurrently and only the first one
> >>>> should succeed. The others should fail if they are using the same
> >>>> offset.
> >>>>
> >>>> The comment above says "Real drives change states before it can write
> >>>> to the zone" and I think it's appropriate to update the write pointer
> >>>> before performing the write too. The qcow2 zone emulation code is
> >>>> different from the file-posix.c passthrough code. We are responsible
> >>>> for maintaining zoned metadata state and cannot wait for the result of
> >>>> the I/O to tell us what happened.
> >>
> >> Yes, correct. The wp MUST be updated when issuing the IO, with the
> >> assumption that the write IO will succeed (errors are rare!).
> >>
> >>> The problem with updating the write pointer before IO completion is the
> >>> failure case. It can't be predicted in advance whether an IO will fail.
> >>> When a write I/O fails, the wp should not be updated.
> >>
> >> Correct, if an IO fails, the wp should not be updated. However, that is
> >> not difficult to deal with:
> >> 1) Under the zone lock, advance the wp position when issuing the write IO
> >> 2) When the write IO completes with success, nothing else needs to be done
> >> 3) When *any* write IO completes with an error, you need to:
> >>        - Lock the zone
> >>        - Do a report zone for the target zone of the failed write to get
> >>          the current wp location
> >>        - Update bs->wps->wp[index] using that current wp location
> >>        - Unlock the zone
> >>
> >> With that, one may get a few errors if multiple async writes are being
> >> issued, but that behavior is consistent with what happens on a real
> >> drive. So no issue. And since the report zones gets you the current wp
> >> location, the user can restart writing from that location once it has
> >> dealt with all the previous write failures.
> >
> > I see. To allow concurrent writes, the lock will only be used on the
> > failure path while processing append writes.
> >
> >>
> >>> The alternative way is to hold the wps lock, as is also required for wp
> >>> accesses. Then only one of multiple concurrent write requests will
> >>> succeed.
> >>
> >> That is a very simple solution that avoids the above error recovery, but
> >> that would be very bad for performance (especially for a pure sequential
> >> write workload, as we would limit IOs to queue depth 1). So if we can
> >> avoid this simple approach, that would be a lot better.
> >
> > Yeah, I'll drop this approach. Although, it reminds me of how the
> > file-posix driver emulates zone_append. It holds the lock whenever
> > accessing wps. Does that limit IOs to QD 1 too? If so, it can be
> > improved.
> > -- one zone_append starts
> >>> wp_lock()
> >>>> IO processing
> >>>>> wp_update
> >>>>>> wp_unlock()
> > -- ends
>
> Yes, this limits the IO queue depth to 1, so not great. file-posix can use
> the exact same error recovery mechanism as I explained. The write IO flow
> would thus become:
>
> On submission:
>
> 1) wp_lock()
> 2) Check write alignment with wp (or change zone append into regular write)
> 3) Issue write as an asynchronous IO
> 4) wp_update
> 5) wp_unlock()
>
> And on completion, all you need is:
>
> 1) If no error, return success
> 2) wp_lock()
> 3) Do a report zone and use the reported wp value as the current wp
> 4) wp_unlock()
> 5) Return IO error
>
> With this simple scheme, when there are no IO errors, things are fast and
> there is no "big lock" limiting the number of writes that can be issued.
> Only write error recovery ends up locking everything during the report
> zones, but that is fine. Errors are rare :)

Neat! I'll send a separate patch for that.

Additionally, I was thinking about changing the zone size/capacity fields
to 64 bit in the block layer and qcow2 configuration, as Stefan once
suggested; that will be in the next version of this patch series.
Eventually, the 4GB zone size limit might be hit. Better to change it
beforehand.

Sam

>
> --
> Damien Le Moal
> Western Digital Research
>
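[Editorial note for readers of the archive: the following is a minimal C
sketch of the "advance wp on submission, resync via report zones on error"
scheme described above. The type ZoneWps and the helpers report_zone_wp()
and submit_async_write() are hypothetical names invented for illustration,
and a plain pthread mutex stands in for the zone lock; the actual QEMU
file-posix/qcow2 code would use the block-layer primitives (CoMutex,
bs->wps, zone report callbacks) instead.]

/*
 * Sketch: write pointer handling for emulated zoned writes.
 * Submission advances the cached wp under the lock; only a failed
 * completion takes the lock again to resync the wp from a zone report.
 */
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

typedef struct ZoneWps {
    pthread_mutex_t lock;   /* protects wp[] */
    uint64_t *wp;           /* per-zone write pointer, in bytes */
    uint64_t zone_size;
} ZoneWps;

/* Hypothetical: query the device/metadata for a zone's current wp. */
extern int report_zone_wp(uint32_t zone_idx, uint64_t *wp_out);
/* Hypothetical: kick off the actual asynchronous write. */
extern int submit_async_write(uint64_t offset, uint64_t len);

/* Submission path: validate against wp and advance it before the IO runs. */
int zoned_write_submit(ZoneWps *wps, uint64_t offset, uint64_t len)
{
    uint32_t idx = offset / wps->zone_size;
    int ret = 0;

    pthread_mutex_lock(&wps->lock);
    if (offset != wps->wp[idx]) {
        /* Unaligned write: fail it, just as a real drive would.
         * (A zone append would instead take its offset from wp here.) */
        ret = -EINVAL;
    } else {
        /* Assume the write will succeed and advance the wp now, so that
         * concurrent writes to the same zone see the updated position. */
        wps->wp[idx] += len;
    }
    pthread_mutex_unlock(&wps->lock);

    if (ret == 0) {
        ret = submit_async_write(offset, len);
    }
    return ret;
}

/* Completion path: only the error case needs the lock and a zone report. */
int zoned_write_complete(ZoneWps *wps, uint64_t offset, int io_status)
{
    uint32_t idx = offset / wps->zone_size;
    uint64_t real_wp;

    if (io_status == 0) {
        return 0;               /* success: wp was already advanced */
    }

    pthread_mutex_lock(&wps->lock);
    if (report_zone_wp(idx, &real_wp) == 0) {
        /* Resync the cached wp with the zone's actual position. */
        wps->wp[idx] = real_wp;
    }
    pthread_mutex_unlock(&wps->lock);

    return io_status;           /* propagate the IO error */
}

With this shape, the lock is taken once per write on the success path and
never again, so sequential writes are not serialized to queue depth 1;
only a failed write pays for the report-zone resync under the lock.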