On Mon, Jun 01, 2026 at 11:44:04PM +0200, Sam Li wrote:
> By adding zone operations and zoned metadata, the zoned emulation
> capability enables full emulation support of zoned device using
> a qcow2 file. The zoned device metadata includes zone type,
> zoned device state and write pointer (WP) of each zone, which is
> stored to an array of unsigned integers.
> 
> WP accessor (qcow2_rw_wp_at) routes reads and writes of an 8-byte
> WP slot through the write pointer cache. The write pointer cache is
> written to disk after the qcow2 metadata is written, thus guaranteeing
> that the write pointer is updated after the corresponding data is
> written. Per-completion cache flush is deferred. The WP cluster reaches
> disk on the next flush.
> 
> Each zone of a zoned device makes state transitions following
> the zone state machine. The zone state machine mainly describes
> five states, IMPLICIT OPEN, EXPLICIT OPEN, FULL, EMPTY and CLOSED.
> READ ONLY and OFFLINE states will generally be affected by device
> internal events. The operations on zones cause corresponding state
> changing.
> 
> Zoned devices have limits on zone resources, which put constraints on
> write operations on zones. It is managed by active zone queues
> following LRU policy.
> 
> Signed-off-by: Sam Li <[email protected]>
> ---
>  block/qcow2-cache.c    |    8 +
>  block/qcow2-refcount.c |    7 +
>  block/qcow2.c          | 1137 +++++++++++++++++++++++++++++++++++++++-
>  block/trace-events     |    2 +
>  4 files changed, 1149 insertions(+), 5 deletions(-)
> 
> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
> index 23d9588b08..bdfb11ce88 100644
> --- a/block/qcow2-cache.c
> +++ b/block/qcow2-cache.c
> @@ -275,6 +275,14 @@ int qcow2_cache_set_dependency(BlockDriverState *bs, 
> Qcow2Cache *c,
>  {
>      int ret;
>  
> +    /*
> +     * If the dependency graph is unchanged, nothing to do. This avoids
> +     * a synchronous flush on every call below.
> +     */
> +    if (c->depends == dependency) {
> +        return 0;
> +    }

This makes part of the expression below tautologous:

    if (c->depends && (c->depends != dependency)) {
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
        ret = qcow2_cache_flush_dependency(bs, c);
        if (ret < 0) {
            return ret;
        }
    }

That sub-expression could be dropped, but it makes me worry that the
earlier if (dependency->depends) statement is needed even when
c->depends == dependency.

Kevin: Any thoughts on this?

> +
>      if (dependency->depends) {
>          ret = qcow2_cache_flush_dependency(bs, dependency);
>          if (ret < 0) {
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 6512cda407..f551726609 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -1239,6 +1239,13 @@ int qcow2_write_caches(BlockDriverState *bs)
>          }
>      }
>  
> +    if (s->wp_cache) {
> +        ret = qcow2_cache_write(bs, s->wp_cache);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }
> +
>      return 0;
>  }
>  
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 29eec33e34..bdc8923b71 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -195,6 +195,300 @@ qcow2_extract_crypto_opts(QemuOpts *opts, const char 
> *fmt, Error **errp)
>      return cryptoopts_qdict;
>  }
>  
> +#define QCOW2_ZT_IS_CONV(wp)    (wp & 1ULL << 59)
> +#define QCOW2_GET_WP(wp)        ((wp << 5) >> 5)
> +
> +/*
> + * To emulate a real zoned device, closed, empty and full states are
> + * preserved after a power cycle. The open states are in-memory and will
> + * be lost after closing the device. Read-only and offline states are
> + * device-internal events, which are not considered for simplicity.
> + */
> +static inline BlockZoneState qcow2_get_zone_state(BlockDriverState *bs,
> +                                                  uint32_t index)

I guess this function requires bs->wps->colock or s->lock, otherwise the
TAILQ accesses could race? Please check and document the locking
requirements.

I/O requests may be processed in multiple threads simultaneously.
s->lock protects qcow2 state.

> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +    uint64_t zone_wp = bs->wps->wp[index];
> +    uint64_t zone_start;
> +
> +    if (QCOW2_ZT_IS_CONV(zone_wp)) {
> +        return BLK_ZS_NOT_WP;
> +    }
> +
> +    if (QTAILQ_IN_USE(zone_entry, exp_open_zone_entry)) {
> +        return BLK_ZS_EOPEN;
> +    }
> +    if (QTAILQ_IN_USE(zone_entry, imp_open_zone_entry)) {
> +        return BLK_ZS_IOPEN;
> +    }
> +
> +    zone_start = index * bs->bl.zone_size;

This is a uint32_t * uint32_t multiplication that can overflow. Avoid
that with:

  zone_start = (uint64_t)index * bs->bl.zone_size;

> +    if (zone_wp == zone_start) {
> +        return BLK_ZS_EMPTY;
> +    }
> +    if (zone_wp >= zone_start + bs->bl.zone_capacity) {
> +        return BLK_ZS_FULL;
> +    }
> +    if (zone_wp > zone_start) {
> +        if (!QTAILQ_IN_USE(zone_entry, closed_zone_entry)) {
> +            /*
> +             * The number of closed zones is not always updated in time when
> +             * the device is closed. However, it only matters when doing
> +             * zone report. Refresh the count and list of closed zones to
> +             * provide correct zone states for zone report.
> +             */
> +            QTAILQ_INSERT_HEAD(&s->closed_zones, zone_entry, 
> closed_zone_entry);
> +            s->nr_zones_closed++;
> +        }
> +        return BLK_ZS_CLOSED;
> +    }
> +    return BLK_ZS_NOT_WP;
> +}
> +
> +static void qcow2_rm_exp_open_zone(BDRVQcow2State *s,
> +                                   uint32_t index)

Locking requirements here and in the functions that follow?

> +{
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +    QTAILQ_REMOVE(&s->exp_open_zones, zone_entry, exp_open_zone_entry);
> +    s->nr_zones_exp_open--;
> +}
> +
> +static void qcow2_rm_imp_open_zone(BDRVQcow2State *s,
> +                                   int32_t index)
> +{
> +    Qcow2ZoneListEntry *zone_entry;
> +    if (index < 0) {
> +        /* Apply LRU when the index is not specified. */
> +        zone_entry = QTAILQ_LAST(&s->imp_open_zones);
> +    } else {
> +        zone_entry = &s->zone_list_entries[index];
> +    }
> +
> +    QTAILQ_REMOVE(&s->imp_open_zones, zone_entry, imp_open_zone_entry);
> +    s->nr_zones_imp_open--;
> +}
> +
> +static void qcow2_rm_open_zone(BDRVQcow2State *s,
> +                               uint32_t index)
> +{
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +    if (QTAILQ_IN_USE(zone_entry, exp_open_zone_entry)) {
> +        qcow2_rm_exp_open_zone(s, index);
> +    } else if (QTAILQ_IN_USE(zone_entry, imp_open_zone_entry)) {
> +        qcow2_rm_imp_open_zone(s, index);
> +    }
> +}
> +
> +static void qcow2_rm_closed_zone(BDRVQcow2State *s,
> +                                 uint32_t index)
> +{
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +    QTAILQ_REMOVE(&s->closed_zones, zone_entry, closed_zone_entry);
> +    s->nr_zones_closed--;
> +}
> +
> +static void qcow2_do_imp_open_zone(BDRVQcow2State *s,
> +                                   uint32_t index,
> +                                   BlockZoneState zs)
> +{
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +    switch (zs) {
> +    case BLK_ZS_EMPTY:
> +        break;
> +    case BLK_ZS_CLOSED:
> +        qcow2_rm_closed_zone(s, index);
> +        break;
> +    case BLK_ZS_IOPEN:
> +        /*
> +         * The LRU policy: update the zone that is most recently
> +         * used to the head of the zone list
> +         */
> +        if (zone_entry == QTAILQ_FIRST(&s->imp_open_zones)) {
> +            return;
> +        }
> +        QTAILQ_REMOVE(&s->imp_open_zones, zone_entry, imp_open_zone_entry);
> +        s->nr_zones_imp_open--;
> +        break;
> +    default:
> +        return;
> +    }
> +
> +    QTAILQ_INSERT_HEAD(&s->imp_open_zones, zone_entry, imp_open_zone_entry);
> +    s->nr_zones_imp_open++;
> +}
> +
> +static void qcow2_do_exp_open_zone(BDRVQcow2State *s,
> +                                   uint32_t index)
> +{
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +    QTAILQ_INSERT_HEAD(&s->exp_open_zones, zone_entry, exp_open_zone_entry);
> +    s->nr_zones_exp_open++;
> +}
> +
> +/*
> + * The list of zones is managed using an LRU policy: the last
> + * zone of the list is always the one that was least recently used
> + * for writing and is chosen as the zone to close to be able to
> + * implicitly open another zone.
> + *
> + * We can only close the open zones. The index is not specified
> + * when it is less than 0.
> + */
> +static void qcow2_do_close_zone(BlockDriverState *bs,
> +                                int32_t index,
> +                                BlockZoneState zs)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    Qcow2ZoneListEntry *zone_entry;
> +
> +    if (index >= 0) {
> +        zone_entry = &s->zone_list_entries[index];
> +    } else {
> +        /* before removal of the last implicitly open zone */
> +        zone_entry = QTAILQ_LAST(&s->imp_open_zones);

There is an assumption that zone_entry is no NULL when zs ==
BLK_ZS_IOPEN? I think that make sense and it means NULL dereferences
cannot happen, but I wanted to check. You could add an assert(zone_entry
!= NULL) here to make that explicit.

> +    }
> +
> +    if (zs == BLK_ZS_IOPEN) {
> +        qcow2_rm_imp_open_zone(s, index);
> +        goto close_zone;
> +    }
> +
> +    if (index >= 0 && zs == BLK_ZS_EOPEN) {
> +        qcow2_rm_exp_open_zone(s, index);
> +        /*
> +         * The zone state changes when the zone is removed from the list of
> +         * open zones (explicitly open -> empty). The closed zone list is
> +         * refreshed during get_zone_state().
> +         */
> +        qcow2_get_zone_state(bs, index);
> +    }
> +    return;
> +
> +close_zone:
> +    QTAILQ_INSERT_HEAD(&s->closed_zones, zone_entry, closed_zone_entry);
> +    s->nr_zones_closed++;

Is the goto and label necessary? Maybe move this inside the if
statement instead to simplify the function.

> +}

I've reviewed up to here for now.

Attachment: signature.asc
Description: PGP signature

Reply via email to