RE: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set

2020-10-21 Thread Dmitry Fomichev
> -Original Message-
> From: Klaus Jensen 
> Sent: Wednesday, October 21, 2020 6:26 AM
> To: Dmitry Fomichev 
> Cc: Keith Busch ; Klaus Jensen
> ; Kevin Wolf ; Philippe
> Mathieu-Daudé ; Maxim Levitsky
> ; Fam Zheng ; Niklas Cassel
> ; Damien Le Moal ;
> qemu-block@nongnu.org; qemu-de...@nongnu.org; Alistair Francis
> ; Matias Bjorling 
> Subject: Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace
> Command Set
> 
> On Oct 19 11:17, Dmitry Fomichev wrote:
> > +/*
> > + * Close or finish all the zones that are currently open.
> > + */
> > +static void nvme_zoned_clear_ns(NvmeNamespace *ns)
> > +{
> > +NvmeZone *zone;
> > +uint32_t set_state;
> > +int i;
> > +
> > +zone = ns->zone_array;
> > +for (i = 0; i < ns->num_zones; i++, zone++) {
> > +switch (nvme_get_zone_state(zone)) {
> > +case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> > +QTAILQ_REMOVE(&ns->imp_open_zones, zone, entry);
> > +break;
> > +case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> > +QTAILQ_REMOVE(&ns->exp_open_zones, zone, entry);
> > +break;
> > +case NVME_ZONE_STATE_CLOSED:
> > +/* fall through */
> > +default:
> > +continue;
> > +}
> > +
> > +if (zone->d.wp == zone->d.zslba) {
> > +set_state = NVME_ZONE_STATE_EMPTY;
> > +} else {
> > +set_state = NVME_ZONE_STATE_CLOSED;
> > +}
> > +
> > +switch (set_state) {
> > +case NVME_ZONE_STATE_CLOSED:
> > +trace_pci_nvme_clear_ns_close(nvme_get_zone_state(zone),
> > +  zone->d.zslba);
> > +QTAILQ_INSERT_TAIL(&ns->closed_zones, zone, entry);
> > +break;
> > +case NVME_ZONE_STATE_EMPTY:
> > +trace_pci_nvme_clear_ns_reset(nvme_get_zone_state(zone),
> > +  zone->d.zslba);
> > +break;
> > +case NVME_ZONE_STATE_FULL:
> > +trace_pci_nvme_clear_ns_full(nvme_get_zone_state(zone),
> > + zone->d.zslba);
> > +zone->d.wp = nvme_zone_wr_boundary(zone);
> > +QTAILQ_INSERT_TAIL(&ns->full_zones, zone, entry);
> > +}
> 
> No need for the switch here - just add to the closed list in the
> conditional.

The switch becomes handy later in the series, particularly after adding
descriptor extensions. For easier reviewing, it makes sense to add it from
the beginning even though it is rudimentary at this point.

> 
> The NVME_ZONE_STATE_FULL case is unreachable.

Indeed. This should be introduced in the next patch.

Now, I've looked at this code again and the active/open counting in this
function ends up to be not quite right, I am fixing it.

> 
> > +
> > +zone->w_ptr = zone->d.wp;
> > +nvme_set_zone_state(zone, set_state);
> > +}
> > +}


Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set

2020-10-21 Thread Klaus Jensen
On Oct 19 11:17, Dmitry Fomichev wrote:
> +/*
> + * Close or finish all the zones that are currently open.
> + */
> +static void nvme_zoned_clear_ns(NvmeNamespace *ns)
> +{
> +NvmeZone *zone;
> +uint32_t set_state;
> +int i;
> +
> +zone = ns->zone_array;
> +for (i = 0; i < ns->num_zones; i++, zone++) {
> +switch (nvme_get_zone_state(zone)) {
> +case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +QTAILQ_REMOVE(&ns->imp_open_zones, zone, entry);
> +break;
> +case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +QTAILQ_REMOVE(&ns->exp_open_zones, zone, entry);
> +break;
> +case NVME_ZONE_STATE_CLOSED:
> +/* fall through */
> +default:
> +continue;
> +}
> +
> +if (zone->d.wp == zone->d.zslba) {
> +set_state = NVME_ZONE_STATE_EMPTY;
> +} else {
> +set_state = NVME_ZONE_STATE_CLOSED;
> +}
> +
> +switch (set_state) {
> +case NVME_ZONE_STATE_CLOSED:
> +trace_pci_nvme_clear_ns_close(nvme_get_zone_state(zone),
> +  zone->d.zslba);
> +QTAILQ_INSERT_TAIL(&ns->closed_zones, zone, entry);
> +break;
> +case NVME_ZONE_STATE_EMPTY:
> +trace_pci_nvme_clear_ns_reset(nvme_get_zone_state(zone),
> +  zone->d.zslba);
> +break;
> +case NVME_ZONE_STATE_FULL:
> +trace_pci_nvme_clear_ns_full(nvme_get_zone_state(zone),
> + zone->d.zslba);
> +zone->d.wp = nvme_zone_wr_boundary(zone);
> +QTAILQ_INSERT_TAIL(&ns->full_zones, zone, entry);
> +}

No need for the switch here - just add to the closed list in the
conditional.

The NVME_ZONE_STATE_FULL case is unreachable.

> +
> +zone->w_ptr = zone->d.wp;
> +nvme_set_zone_state(zone, set_state);
> +}
> +}


signature.asc
Description: PGP signature


Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set

2020-10-20 Thread Klaus Jensen
On Oct 19 11:17, Dmitry Fomichev wrote:
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 974aea33f7..fedfad595c 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -133,6 +320,12 @@ static Property nvme_ns_props[] = {
>  DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
>  DEFINE_PROP_UUID("uuid", NvmeNamespace, params.uuid),
>  DEFINE_PROP_BOOL("attached", NvmeNamespace, params.attached, true),
> +DEFINE_PROP_BOOL("zoned", NvmeNamespace, params.zoned, false),

Instead of using a 'zoned' property here, can we add an 'iocs' or 'csi'
property in the namespace types patch? Then, in the future if we add
additional command sets we won't need another property (like 'kv').

> +DEFINE_PROP_SIZE("zone_size", NvmeNamespace, params.zone_size_bs,
> + NVME_DEFAULT_ZONE_SIZE),
> +DEFINE_PROP_SIZE("zone_capacity", NvmeNamespace, params.zone_cap_bs, 0),

I would like that the zone_size and zone_capacity were named zoned.zsze
and zoned.zcap and were in terms of logical blocks, like in the spec.
Putting them in a pseudo-namespace makes it clear that the options
affect the zoned command set and reduces the risk of anything clashing
with the addition of other command sets (like 'kv') in the future.

> +DEFINE_PROP_BOOL("cross_zone_read", NvmeNamespace,
> + params.cross_zone_read, false),

Instead of cluttering the parameters with a bunch of these when others
zone operational characteristics are added, can we use a 'zoned.zoc'
parameter that matches the spec?

> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 93728e51b3..34d0d0250d 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -3079,6 +4001,9 @@ static Property nvme_props[] = {
>  DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, params.aer_max_queued, 
> 64),
>  DEFINE_PROP_UINT8("mdts", NvmeCtrl, params.mdts, 7),
>  DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
> +DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
> +DEFINE_PROP_SIZE32("zone_append_size_limit", NvmeCtrl, params.zasl_bs,
> +   NVME_DEFAULT_MAX_ZA_SIZE),

Similar to my reasoning above, I would like this to be zoned.zasl and in
terms of logical blocks like the spec. Also, I think '0' is a better
default since zero values typically identify a default value in the spec
as well.

I know this might sound like bikeshedding, but I wanna make sure that we
get the parameters right since we cannot get rid of them once they are
there. Following the definitions of the spec makes it very clear what
their meaning are and should be. 'mdts' is currently the only other
parameter like this, but that is also specified as in the spec, and not
as an absolute value.

My preference also applies to subsequent patches, like using `zoned.mor`
and `zoned.mar` for the resource limits.


signature.asc
Description: PGP signature


Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set

2020-10-19 Thread Klaus Jensen
On Oct 19 11:50, Klaus Jensen wrote:
> On Oct 19 11:17, Dmitry Fomichev wrote:
> > +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> > +  bool failed)
> > +{
> > +NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> > +NvmeZone *zone;
> > +uint64_t slba, start_wp = req->cqe.result64;
> > +uint32_t nlb;
> > +
> > +if (rw->opcode != NVME_CMD_WRITE &&
> > +rw->opcode != NVME_CMD_ZONE_APPEND &&
> > +rw->opcode != NVME_CMD_WRITE_ZEROES) {
> > +return false;
> > +}
> > +
> > +slba = le64_to_cpu(rw->slba);
> > +nlb = le16_to_cpu(rw->nlb) + 1;
> > +zone = nvme_get_zone_by_slba(ns, slba);
> > +
> > +if (!failed && zone->w_ptr < start_wp + nlb) {
> > +/*
> > + * A preceding queued write to the zone has failed,
> > + * now this write is not at the WP, fail it too.
> > + */
> > +failed = true;
> > +}
> > +
> > +if (failed) {
> > +if (zone->w_ptr > start_wp) {
> > +zone->w_ptr = start_wp;
> > +zone->d.wp = start_wp;
> > +}
> 
> This doesn't fix the data corruption. The example from my last review
> still applies.
> 

An easy fix is to just unconditionally advance the write pointer in all
cases.


signature.asc
Description: PGP signature


Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set

2020-10-19 Thread Klaus Jensen
On Oct 19 11:17, Dmitry Fomichev wrote:
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index d6b2808b97..170cbb8cdc 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -34,6 +45,18 @@ typedef struct NvmeNamespace {
>  const uint32_t *iocs;
>  uint8_t  csi;
>  
> +NvmeIdNsZoned   *id_ns_zoned;
> +NvmeZone*zone_array;
> +QTAILQ_HEAD(, NvmeZone) exp_open_zones;
> +QTAILQ_HEAD(, NvmeZone) imp_open_zones;
> +QTAILQ_HEAD(, NvmeZone) closed_zones;
> +QTAILQ_HEAD(, NvmeZone) full_zones;

Apart from the imp_open_zones list that is being used in a later patch
to support Implicitly Opened to Closed transitions, these lists seem
rather pointless. As far as I can tell the only use they have is being
inserted into, removed from and checking if a zone is in one of those
four states?

The Zone Management Receive (and Send with Select All) is just iterating
on all zones and matching on state.


signature.asc
Description: PGP signature


Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set

2020-10-19 Thread Klaus Jensen
On Oct 19 11:17, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Define values and structures that are needed to support Zoned
> Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
> Define trace events where needed in newly introduced code.
> 
> In order to improve scalability, all open, closed and full zones
> are organized in separate linked lists. Consequently, almost all
> zone operations don't require scanning of the entire zone array
> (which potentially can be quite large) - it is only necessary to
> enumerate one or more zone lists.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> Subsequent commits in this series add ZDE support and checks for
> active and open zone limits.
> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Hans Holmberg 
> Signed-off-by: Ajay Joshi 
> Signed-off-by: Chaitanya Kulkarni 
> Signed-off-by: Matias Bjorling 
> Signed-off-by: Aravind Ramesh 
> Signed-off-by: Shin'ichiro Kawasaki 
> Signed-off-by: Adam Manzanares 
> Signed-off-by: Dmitry Fomichev 
> ---
>  block/nvme.c  |   2 +-
>  hw/block/nvme-ns.c| 193 +
>  hw/block/nvme-ns.h|  54 +++
>  hw/block/nvme.c   | 975 --
>  hw/block/nvme.h   |   9 +
>  hw/block/trace-events |  21 +
>  include/block/nvme.h  | 113 -
>  7 files changed, 1339 insertions(+), 28 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> +static void nvme_fill_read_data(NvmeRequest *req, uint64_t offset,
> +uint32_t max_len, uint8_t pattern)
> +{
> +QEMUSGList *qsg = &req->qsg;
> +QEMUIOVector *iov = &req->iov;
> +ScatterGatherEntry *entry;
> +uint32_t len, ent_len;
> +
> +if (qsg->nsg > 0) {
> +entry = qsg->sg;
> +len = qsg->size;
> +if (max_len) {
> +len = MIN(len, max_len);
> +}
> +for (; len > 0; len -= ent_len) {
> +ent_len = MIN(len, entry->len);
> +if (offset > ent_len) {
> +offset -= ent_len;
> +} else if (offset != 0) {
> +dma_memory_set(qsg->as, entry->base + offset,
> +   pattern, ent_len - offset);
> +offset = 0;
> +} else {
> +dma_memory_set(qsg->as, entry->base, pattern, ent_len);
> +}
> +entry++;

dma_memory_set can fail. In that case, I think just fail the command
with NVME_DATA_TRAS_ERROR.

But I think this should be removed in any case, see below.

> +static uint16_t nvme_check_zone_read(NvmeNamespace *ns, NvmeZone *zone,
> + uint64_t slba, uint32_t nlb,
> + NvmeReadFillCtx *rfc)
> +{
> +NvmeZone *next_zone;
> +uint64_t bndry = nvme_zone_rd_boundary(ns, zone);
> +uint64_t end = slba + nlb, wp1, wp2;
> +uint16_t status;
> +
> +rfc->pre_rd_fill_slba = ~0ULL;
> +rfc->pre_rd_fill_nlb = 0;
> +rfc->read_slba = slba;
> +rfc->read_nlb = nlb;
> +rfc->post_rd_fill_slba = ~0ULL;
> +rfc->post_rd_fill_nlb = 0;
> +
> +status = nvme_zone_state_ok_to_read(zone);
> +if (status != NVME_SUCCESS) {
> +;
> +} else if (likely(end <= bndry)) {
> +if (end > zone->w_ptr) {
> +wp1 = zone->w_ptr;
> +if (slba >= wp1) {
> +/* No i/o necessary, just fill */
> +rfc->pre_rd_fill_slba = slba;
> +rfc->pre_rd_fill_nlb = nlb;
> +rfc->read_nlb = 0;
> +} else {
> +rfc->read_nlb = wp1 - slba;
> +rfc->post_rd_fill_slba = wp1;
> +rfc->post_rd_fill_nlb = nlb - rfc->read_nlb;
> +   }
> +}
> +} else if (!ns->params.cross_zone_read) {
> +status = NVME_ZONE_BOUNDARY_ERROR;
> +} else {
> +/*
> + * Read across zone boundary, look at the next zone.
> + * Earlier bounds checks ensure that the current zone
> + * is not the last one.
> + */
> +next_zone = zone + 1;
> +status = nvme_zone_sta