On 2024/09/23 13:06, Sam Li wrote:

[...]

>>> @@ -2837,6 +3180,19 @@ qcow2_co_pwritev_part(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>          qiov_offset += cur_bytes;
>>>          trace_qcow2_writev_done_part(qemu_coroutine_self(), cur_bytes);
>>>      }
>>> +
>>> +    if (bs->bl.zoned == BLK_Z_HM) {
>>> +        index = start_offset / zone_size;
>>> +        wp = &bs->wps->wp[index];
>>> +        if (!QCOW2_ZT_IS_CONV(*wp)) {
>>> +            /* Advance the write pointer when the write completes */
>>
>> Updating the write pointer after I/O does not prevent other write
>> requests from beginning at the same offset as this request. Multiple
>> write request coroutines can run concurrently and only the first one
>> should succeed. The others should fail if they are using the same
>> offset.
>>
>> The comment above says "Real drives change states before it can write to
>> the zone" and I think it's appropriate to update the write pointer
>> before performing the write too. The qcow2 zone emulation code is
>> different from the file-posix.c passthrough code. We are responsible for
>> maintaining zoned metadata state and cannot wait for the result of the
>> I/O to tell us what happened.

Yes, correct. The wp MUST be updated when issuing the IO, with the assumption
that the write IO will succeed (errors are rare !).

> The problem of updating the write pointer before IO completion is the
> failure case.  It can't be predicted in advance if an IO fails or not.
> When write I/O fails, the wp should not be updated.

Correct, if an IO fails, the wp should not be updated. However, that is not
difficult to deal with:
1) under the zone lock, advance the wp position when issuing the write IO
2) When the write IO completes with success, nothing else needs to be done.
3) When *any* write IO completes with an error, you need to:
        - Lock the zone
        - Do a report zones for the target zone of the failed write to get the
          current wp location
        - Update bs->wps->wp[index] using that current wp location
        - Unlock the zone
   (A rough sketch of this flow is below.)
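
Something like this, as plain illustrative C with stand-in names (zone_state,
the per-zone lock and report_zone_wp() are placeholders for illustration, not
the actual qcow2/QEMU structures or helpers):

#include <pthread.h>
#include <stdint.h>

struct zone_state {
    pthread_mutex_t lock;   /* per-zone (or wps) lock, stand-in */
    uint64_t wp;            /* cached write pointer, in bytes */
};

/* Assumed helper: read the zone metadata to get the real wp. */
extern uint64_t report_zone_wp(int zone_index);

/* 1) Advance the cached wp under the lock when *issuing* the write,
 *    assuming the IO will succeed. */
static void zone_write_issue(struct zone_state *z, uint64_t bytes)
{
    pthread_mutex_lock(&z->lock);
    z->wp += bytes;
    pthread_mutex_unlock(&z->lock);
}

/* 2) Nothing to do on a successful completion.
 * 3) On *any* failed write completion, resync the cached wp from a
 *    report zones of the target zone, under the lock. */
static void zone_write_failed(struct zone_state *z, int zone_index)
{
    pthread_mutex_lock(&z->lock);
    z->wp = report_zone_wp(zone_index);
    pthread_mutex_unlock(&z->lock);
}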

With that, one may get a few errors if multiple async writes are being issued,
but that behavior is consistent with what happens with a real drive. So no
issue. And since the report zones gets you the current wp location, the user can
restart writing from that location once they have dealt with all the previous
write failures.

> The alternative way is to hold the wps lock as is also required for wp
> accessing. Therefore only one of multiple concurrent write requests
> will succeed.

That is a very simple solution that avoids the above error recovery, but it
would be very bad for performance (especially for a pure sequential write
workload, as it would limit IOs to queue depth 1). So if we can avoid this
simple approach, that would be a lot better.
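
For comparison, the lock-across-the-IO alternative would look roughly like this
(same stand-in zone_state as above, do_write_io() is an assumed helper). It is
trivially correct because the wp only moves after a successful IO, but the lock
is held for the whole IO, so writes to a zone execute one at a time:

#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Assumed IO helper, returns 0 on success or a negative errno. */
extern int do_write_io(uint64_t offset, uint64_t bytes);

static int zone_write_locked(struct zone_state *z, uint64_t offset,
                             uint64_t bytes)
{
    int ret;

    pthread_mutex_lock(&z->lock);
    if (offset != z->wp) {              /* not at the write pointer: reject */
        pthread_mutex_unlock(&z->lock);
        return -EINVAL;
    }
    ret = do_write_io(offset, bytes);   /* lock held across the whole IO */
    if (ret == 0) {
        z->wp += bytes;                 /* advance only on success */
    }
    pthread_mutex_unlock(&z->lock);
    return ret;
}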


-- 
Damien Le Moal
Western Digital Research
