Re: [PATCH v2 00/16] nvme: refactoring and cleanups

2020-04-21 Thread Klaus Birkelund Jensen
On Apr 21 19:24, Maxim Levitsky wrote:
> Should I also review the V7 series or I should wait for V8 which will
> not include these cleanups?
 
Hi Maxim,

Just wait for another series - I don't think I will post a v8; I will
chop up the series into smaller ones instead.

Most patches will hopefully not change too much, so should keep your
Reviewed-by's ;)


Thanks,
Klaus



Re: [PATCH v2 00/16] nvme: refactoring and cleanups

2020-04-20 Thread Klaus Birkelund Jensen
On Apr 21 02:38, Keith Busch wrote:
> The series looks good to me.
> 
> Reviewed-by: Keith Busch 

Thanks for the review Keith!

Kevin, should I rebase this on block-next? I think it might have some
conflicts with the PMR patch that went in previously.

Philippe, then I can also change the *err to *local_err ;)



Re: [PATCH v2 00/16] nvme: refactoring and cleanups

2020-04-19 Thread Klaus Birkelund Jensen
On Apr 15 15:01, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Changes since v1
> 
> * nvme: fix pci doorbell size calculation
>   - added some defines and a better comment (Philippe)
> 
> * nvme: rename trace events to pci_nvme
>   - changed the prefix from nvme_dev to pci_nvme (Philippe)
> 
> * nvme: add max_ioqpairs device parameter
>   - added a deprecation comment. I doubt this will go in until 5.1, so
> changed it to "deprecated from 5.1" (Philippe)
> 
> * nvme: factor out property/constraint checks
> * nvme: factor out block backend setup
>   - changed to return void and propagate errors in proper QEMU style
> (Philippe)
> 
> * nvme: add namespace helpers
>   - use the helper immediately (Philippe)
> 
> * nvme: factor out pci setup
>   - removed setting of vendor and device id which is already inherited
> from nvme_class_init() (Philippe)
> 
> * nvme: factor out cmb setup
>   - add lost comment (Philippe)
> 
> 
> Klaus Jensen (16):
>   nvme: fix pci doorbell size calculation
>   nvme: rename trace events to pci_nvme
>   nvme: remove superfluous breaks
>   nvme: move device parameters to separate struct
>   nvme: use constants in identify
>   nvme: refactor nvme_addr_read
>   nvme: add max_ioqpairs device parameter
>   nvme: remove redundant cmbloc/cmbsz members
>   nvme: factor out property/constraint checks
>   nvme: factor out device state setup
>   nvme: factor out block backend setup
>   nvme: add namespace helpers
>   nvme: factor out namespace setup
>   nvme: factor out pci setup
>   nvme: factor out cmb setup
>   nvme: factor out controller identify setup
> 
>  hw/block/nvme.c   | 433 --
>  hw/block/nvme.h   |  36 +++-
>  hw/block/trace-events | 172 -
>  include/block/nvme.h  |   8 +
>  4 files changed, 372 insertions(+), 277 deletions(-)
> 
> -- 
> 2.26.0
> 

Hi Keith,

You have acked most of this previously, but not in its most recent
state. Since a good bunch of the refactoring patches have been split up
and changed, only a small subset of the patches still carry your
Acked-by.

The 'nvme: fix pci doorbell size calculation' and 'nvme: add
max_ioqpairs device parameter' patches are new since your ack, and
given their nature a review from you would be nice :)


Thanks,
Klaus



Re: [PATCH v2 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 15:26, Philippe Mathieu-Daudé wrote:
> On 4/15/20 3:20 PM, Klaus Birkelund Jensen wrote:
> > 
> > I'll get the v1.3 series ready next.
> > 
> 
> Cool. What really matters (to me) is seeing tests. If we can merge tests
> (without multiple namespaces) before the rest of your series, even better.
> Tests give reviewers/maintainers confidence that code isn't breaking ;)
> 

The patches that I contribute have been pretty extensively tested by
various means in a "host setting" (e.g. blktests and some internal
tools), which really exercise the device by doing heavy I/O, testing for
compliance and also just being mean to it (e.g. tripping bus mastering
while doing I/O).

Don't misunderstand me as trying to weasel my way out of writing tests;
I just want to understand the scope of the tests that you are looking
for. I believe (hope!) that you are not asking me to implement a
user-space NVMe driver in the test, so I assume the tests should verify
more low-level details?



Re: [PATCH v2 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 15:16, Philippe Mathieu-Daudé wrote:
> On 4/15/20 3:01 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.c | 46 ++
> >   1 file changed, 26 insertions(+), 20 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index d5244102252c..2b007115c302 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1358,6 +1358,27 @@ static void nvme_init_blk(NvmeCtrl *n, Error **errp)
> > false, errp);
> >   }
> > +static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error 
> > **errp)
> > +{
> > +int64_t bs_size;
> > +NvmeIdNs *id_ns = &ns->id_ns;
> > +
> > +bs_size = blk_getlength(n->conf.blk);
> > +if (bs_size < 0) {
> > +error_setg_errno(errp, -bs_size, "could not get backing file 
> > size");
> > +return;
> > +}
> > +
> > +n->ns_size = bs_size;
> > +
> > +id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
> > +
> > +/* no thin provisioning */
> > +id_ns->ncap = id_ns->nsze;
> > +id_ns->nuse = id_ns->ncap;
> > +}
> > +
> >   static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >   {
> >   NvmeCtrl *n = NVME(pci_dev);
> > @@ -1365,7 +1386,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   Error *err = NULL;
> >   int i;
> > -int64_t bs_size;
> >   uint8_t *pci_conf;
> >   nvme_check_constraints(n, &err);
> > @@ -1376,12 +1396,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   nvme_init_state(n);
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg(errp, "could not get backing file size");
> > -return;
> > -}
> > -
> >   nvme_init_blk(n, &err);
> >   if (err) {
> >   error_propagate(errp, err);
> > @@ -1394,8 +1408,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> >   pcie_endpoint_cap_init(pci_dev, 0x80);
> > -n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> 
> Valid because currently 'n->num_namespaces' = 1, OK.
> 
> Reviewed-by: Philippe Mathieu-Daudé 
> 
 
Thank you for the reviews, Philippe, and for suggesting that I split up
the series :)

I'll get the v1.3 series ready next.



Re: [PATCH 11/16] nvme: factor out block backend setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 13:02, Klaus Birkelund Jensen wrote:
> On Apr 15 12:52, Philippe Mathieu-Daudé wrote:
> > On 4/15/20 12:24 PM, Klaus Jensen wrote:
> > > From: Klaus Jensen 
> > > 
> > > Signed-off-by: Klaus Jensen 
> > > ---
> > >   hw/block/nvme.c | 15 ---
> > >   1 file changed, 12 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index e67f578fbf79..f0989cbb4335 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -1348,6 +1348,17 @@ static void nvme_init_state(NvmeCtrl *n)
> > >   n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
> > >   }
> > > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > > +{
> > > +blkconf_blocksizes(&n->conf);
> > > +if (!blkconf_apply_backend_options(&n->conf, 
> > > blk_is_read_only(n->conf.blk),
> > > +   false, errp)) {
> > > +return -1;
> > > +}
> > > +
> > > +return 0;
> > 
> > I'm not sure this is a correct usage of the 'propagating errors' API (see
> > CODING_STYLE.rst and include/qapi/error.h), I'd expect this function to
> > return void, and use a local_error & error_propagate() in nvme_realize().
> > 
> > However this works, so:
> > Reviewed-by: Philippe Mathieu-Daudé 
> > 
> 
> So, I get that and did use the propagate functionality earlier. But I
> still used the int return. I'm not sure about the style if returning
> void - should I check if errp is now non-NULL? Point is that I need to
> return early since the later calls could fail if previous calls did not
> complete successfully.

Nevermind, I got it. I've changed it to propagate it correctly.



Re: [PATCH 11/16] nvme: factor out block backend setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 12:52, Philippe Mathieu-Daudé wrote:
> On 4/15/20 12:24 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.c | 15 ---
> >   1 file changed, 12 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index e67f578fbf79..f0989cbb4335 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1348,6 +1348,17 @@ static void nvme_init_state(NvmeCtrl *n)
> >   n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
> >   }
> > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > +{
> > +blkconf_blocksizes(&n->conf);
> > +if (!blkconf_apply_backend_options(&n->conf, 
> > blk_is_read_only(n->conf.blk),
> > +   false, errp)) {
> > +return -1;
> > +}
> > +
> > +return 0;
> 
> I'm not sure this is a correct usage of the 'propagating errors' API (see
> CODING_STYLE.rst and include/qapi/error.h), I'd expect this function to
> return void, and use a local_error & error_propagate() in nvme_realize().
> 
> However this works, so:
> Reviewed-by: Philippe Mathieu-Daudé 
> 

So, I get that and did use the propagate functionality earlier. But I
still used the int return. I'm not sure about the style if returning
void - should I check if errp is now non-NULL? Point is that I need to
return early since the later calls could fail if previous calls did not
complete successfully.



Re: [PATCH 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 12:53, Klaus Birkelund Jensen wrote:
> On Apr 15 12:38, Philippe Mathieu-Daudé wrote:
> > 
> > I'm not sure this line belong to this patch.
> > 
>  
> It does. It is already there in the middle of the realize function. It
> is moved to nvme_init_namespace as
> 
>   n->ns_size = bs_size;
> 
> since only a single namespace can be configured anyway. I will remove
> the for-loop that initializes multiple namespaces as well.

I'm going to backtrack on that. Removing that for loop is just noise, I
think.



Re: [PATCH 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 12:38, Philippe Mathieu-Daudé wrote:
> On 4/15/20 12:24 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.c | 47 ++-
> >   1 file changed, 26 insertions(+), 21 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f0989cbb4335..08f7ae0a48b3 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1359,13 +1359,35 @@ static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> >   return 0;
> >   }
> > +static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error 
> > **errp)
> > +{
> > +int64_t bs_size;
> > +NvmeIdNs *id_ns = &ns->id_ns;
> > +
> > +bs_size = blk_getlength(n->conf.blk);
> > +if (bs_size < 0) {
> > +error_setg_errno(errp, -bs_size, "could not get backing file 
> > size");
> > +return -1;
> > +}
> > +
> > +n->ns_size = bs_size;
> > +
> > +id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
> > +
> > +/* no thin provisioning */
> > +id_ns->ncap = id_ns->nsze;
> > +id_ns->nuse = id_ns->ncap;
> > +
> > +return 0;
> > +}
> > +
> >   static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >   {
> >   NvmeCtrl *n = NVME(pci_dev);
> >   NvmeIdCtrl *id = &n->id_ctrl;
> >   int i;
> > -int64_t bs_size;
> >   uint8_t *pci_conf;
> >   if (nvme_check_constraints(n, errp)) {
> > @@ -1374,12 +1396,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   nvme_init_state(n);
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg(errp, "could not get backing file size");
> > -return;
> > -}
> > -
> >   if (nvme_init_blk(n, errp)) {
> >   return;
> >   }
> > @@ -1390,8 +1406,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> >   pcie_endpoint_cap_init(pci_dev, 0x80);
> > -n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> 
> I'm not sure this line belong to this patch.
> 
 
It does. It is already there in the middle of the realize function. It
is moved to nvme_init_namespace as

  n->ns_size = bs_size;

since only a single namespace can be configured anyway. I will remove
the for-loop that initializes multiple namespaces as well.



Re: [PATCH v7 11/48] nvme: refactor device realization

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:55, Philippe Mathieu-Daudé wrote:
> On 4/15/20 9:25 AM, Klaus Birkelund Jensen wrote:
> > On Apr 15 09:14, Philippe Mathieu-Daudé wrote:
> > > Hi Klaus,
> > > 
> > > This patch is a pain to review... Could you split it? I'd use one trivial
> > > patch for each function extracted from nvme_realize().
> > > 
> > 
> > Understood, I will split it up!
> 
> Thanks, that will help the review.
> 
> As this series is quite big, I recommend you to split it, so part of it can
> get merged quicker and you don't have to carry tons of patches that scare
> reviewers/maintainers.
> 
> Suggestions:
> 
> - 1: cleanups/refactors
> - 2: support v1.3
> - 3: more refactors, strengthening code
> - 4: improve DMA & S/G
> - 5: support for multiple NS
> - 6: tests for multiple NS feature
> - 7: tests bus unplug/replug (idea)
> 
> Or less :)
> 

Okay, good idea. Thanks.



Re: [PATCH v7 45/48] nvme: support multiple namespaces

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:38, Philippe Mathieu-Daudé wrote:
> On 4/15/20 7:51 AM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >-drive file=nvme0n1.img,if=none,id=disk1
> >-drive file=nvme0n2.img,if=none,id=disk2
> >-device nvme,serial=deadbeef,id=nvme0
> >-device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >-device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > Reviewed-by: Keith Busch 
> > ---
> >   hw/block/Makefile.objs |   2 +-
> >   hw/block/nvme-ns.c | 157 +++
> >   hw/block/nvme-ns.h |  60 +++
> >   hw/block/nvme.c| 233 +++--
> >   hw/block/nvme.h|  47 -
> >   hw/block/trace-events  |   8 +-
> >   6 files changed, 396 insertions(+), 111 deletions(-)
> >   create mode 100644 hw/block/nvme-ns.c
> >   create mode 100644 hw/block/nvme-ns.h
> > 
> > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> > index 4b4a2b338dc4..d9141d6a4b9b 100644
> > --- a/hw/block/Makefile.objs
> > +++ b/hw/block/Makefile.objs
> > @@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
> >   common-obj-$(CONFIG_XEN) += xen-block.o
> >   common-obj-$(CONFIG_ECC) += ecc.o
> >   common-obj-$(CONFIG_ONENAND) += onenand.o
> > -common-obj-$(CONFIG_NVME_PCI) += nvme.o
> > +common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
> >   common-obj-$(CONFIG_SWIM) += swim.o
> >   common-obj-$(CONFIG_SH4) += tc58128.o
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > new file mode 100644
> > index ..bd64d4a94632
> > --- /dev/null
> > +++ b/hw/block/nvme-ns.c
> > @@ -0,0 +1,157 @@
> 
> Missing copyright + license.
> 

Fixed.

> > +
> > +switch (n->conf.wce) {
> > +case ON_OFF_AUTO_ON:
> > +n->features.volatile_wc = 1;
> > +break;
> > +case ON_OFF_AUTO_OFF:
> > +n->features.volatile_wc = 0;
> 
> Missing 'break'?
> 

Ouch. Fixed.

> > +case ON_OFF_AUTO_AUTO:
> > +n->features.volatile_wc = blk_enable_write_cache(ns->blk);
> > +break;
> > +default:
> > +abort();
> > +}
> > +
> > +blk_set_enable_write_cache(ns->blk, n->features.volatile_wc);
> > +
> > +return 0;
> > +}
> > +
> > +static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
> > +{
> > +if (!ns->blk) {
> > +error_setg(errp, "block backend not configured");
> > +return -1;
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > +{
> > +if (nvme_ns_check_constraints(ns, errp)) {
> > +return -1;
> > +}
> > +
> > +if (nvme_ns_init_blk(n, ns, &n->id_ctrl, errp)) {
> > +return -1;
> > +}
> > +
> > +nvme_ns_init(ns);
> > +if (nvme_register_namespace(n, ns, errp)) {
> > +return -1;
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +static void nvme_ns_realize(DeviceState *dev, Error **errp)
> > +{
> > +NvmeNamespace *ns = NVME_NS(dev);
> > +BusState *s = qdev_get_parent_bus(dev);
> > +NvmeCtrl *n = NVME(s->parent);
> > +Error *local_err = NULL;
> > +
> > +if (nvme_ns_setup(n, ns, &local_err)) {
> > +error_propagate_prepend(errp, local_err,
> > +"could not setup namespace: ");
> > +return;
> > +}
> > +}
> > +
> > +static Property nvme_ns_props[] = {
> > +DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
> > +DEFINE_PROP_END_OF_LIST(),
> > +};
> > +
> > +static void nvme_ns_class_init(ObjectClass *oc, void *data)
> > +{
> > +DeviceClass *dc = DEVICE_CLASS(oc);
> > +
> > +set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
> > +
> > +dc->bus_type = TYPE_NVME_BUS;
> > +dc->realize = nvme_ns_realize;
> > +device_class_set_props(dc, nvme_ns_props);
> > +dc->desc = "virtual nvme namespace";
> 
> "Virtual NVMe namespace"?
> 

Fixed.

> > +}
> > +
> > +static void nvme_ns_instance_init(Object *obj)
> > +{
> > +NvmeNamespace *ns = NVME_NS(obj);
> > +char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
> > +
> > +device_add_bootindex_property(obj, &ns->bootindex, "bootindex",
> > +  bootindex, DEVICE(obj), &error_abort);
> > +
> > +g_free(bootindex);
> > +}
> > +
> > +static const TypeInfo nvme_ns_info = {
> > +.name = TYPE_NVME_NS,
> > +.

Re: [PATCH v7 06/48] nvme: refactor nvme_addr_read

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:03, Philippe Mathieu-Daudé wrote:
> On 4/15/20 7:50 AM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Pull the controller memory buffer check to its own function. The check
> > will be used on its own in later patches.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >   hw/block/nvme.c | 16 
> >   1 file changed, 12 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 622103c42d0a..02d3dde90842 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -52,14 +52,22 @@
> >   static void nvme_process_sq(void *opaque);
> > +static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> 
> 'inline' not really necessary here.
> 

Fixed.




Re: [PATCH v7 12/48] nvme: add temperature threshold feature

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:24, Klaus Birkelund Jensen wrote:
> On Apr 15 09:19, Philippe Mathieu-Daudé wrote:
> > On 4/15/20 7:51 AM, Klaus Jensen wrote:
> > > From: Klaus Jensen 
> > > 
> > > It might seem wierd to implement this feature for an emulated device,
> > 
> > 'weird'
> 
> Thanks, fixed :)
> 
> > 
> > > but it is mandatory to support and the feature is useful for testing
> > > asynchronous event request support, which will be added in a later
> > > patch.
> > 
> > Which patch? I can't find how you set the temperature in this series.
> > 
> 
> The temperature cannot be changed, but the thresholds can with the Set
> Features command (and that can then trigger AERs). That is added in
> "nvme: add temperature threshold feature" and "nvme: add support for the
> asynchronous event request command" respectively.
> 
> There is a test in SPDK that does this.
> 

Oh, I think I misunderstood you.

No, setting the temperature was moved to the "nvme: add support for the
get log page command" patch since that is the patch that actually uses
it. This was on request by Maxim in an earlier review.



Re: [PATCH v7 11/48] nvme: refactor device realization

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:14, Philippe Mathieu-Daudé wrote:
> Hi Klaus,
> 
> This patch is a pain to review... Could you split it? I'd use one trivial
> patch for each function extracted from nvme_realize().
> 

Understood, I will split it up!



Re: [PATCH v7 12/48] nvme: add temperature threshold feature

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:19, Philippe Mathieu-Daudé wrote:
> On 4/15/20 7:51 AM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > It might seem wierd to implement this feature for an emulated device,
> 
> 'weird'

Thanks, fixed :)

> 
> > but it is mandatory to support and the feature is useful for testing
> > asynchronous event request support, which will be added in a later
> > patch.
> 
> Which patch? I can't find how you set the temperature in this series.
> 

The temperature cannot be changed, but the thresholds can with the Set
Features command (and that can then trigger AERs). That is added in
"nvme: add temperature threshold feature" and "nvme: add support for the
asynchronous event request command" respectively.

There is a test in SPDK that does this.




Re: [PATCH v7 10/48] nvme: remove redundant cmbloc/cmbsz members

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:10, Philippe Mathieu-Daudé wrote:
> 
> "hw/block/nvme.h" should not pull in "block/nvme.h", both should include a
> common "hw/block/nvme_spec.h" (or better named). Not related to this patch
> although.
> 

Hmm. It does pull in the "include/block/nvme.h" which is basically the
"nvme_spec.h" you are talking about. That file holds all spec related
structs that are shared between the VFIO based nvme driver
(block/nvme.c) and the emulated device (hw/block/nvme.c).

Isn't that what is intended?



Re: [PATCH v6 32/42] nvme: allow multiple aios per command

2020-04-08 Thread Klaus Birkelund Jensen
On Mar 31 12:10, Maxim Levitsky wrote:
> On Tue, 2020-03-31 at 07:47 +0200, Klaus Birkelund Jensen wrote:
> > On Mar 25 12:57, Maxim Levitsky wrote:
> > > On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > > > @@ -516,10 +613,10 @@ static inline uint16_t nvme_check_prinfo(NvmeCtrl 
> > > > *n, NvmeNamespace *ns,
> > > >  return NVME_SUCCESS;
> > > >  }
> > > >  
> > > > -static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace 
> > > > *ns,
> > > > - uint64_t slba, uint32_t nlb,
> > > > - NvmeRequest *req)
> > > > +static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
> > > > + uint32_t nlb, NvmeRequest 
> > > > *req)
> > > >  {
> > > > +NvmeNamespace *ns = req->ns;
> > > >  uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
> > > 
> > > This should go to the patch that added nvme_check_bounds as well
> > > 
> > 
> > We can't really, because the NvmeRequest does not hold a reference to
> > the namespace as a struct member at that point. This is also an issue
> > with the nvme_check_prinfo function above.
> 
> I see it now. The changes to NvmeRequest together with this are a good 
> candidate
> to split from this patch to get this patch to size that is easy to review.
> 

I'm factoring those changes and other stuff out into separate patches!





Re: [PATCH v6 14/42] nvme: add missing mandatory features

2020-04-08 Thread Klaus Birkelund Jensen
On Mar 31 12:39, Maxim Levitsky wrote:
> On Tue, 2020-03-31 at 07:41 +0200, Klaus Birkelund Jensen wrote:
> > On Mar 25 12:41, Maxim Levitsky wrote:
> > > BTW the user of the device doesn't have to have 1:1 mapping between qid 
> > > and msi interrupt index,
> > > in fact when MSI is not used, all the queues will map to the same vector, 
> > > which will be interrupt 0
> > > from point of view of the device IMHO.
> > > So it kind of makes sense IMHO to have num_irqs or something, even if it 
> > > technically equals to number of queues.
> > > 
> > 
> > Yeah, but the device will still *support* the N IVs, so they can still
> > be configured even though they will not be used. So I don't think we
> > need to introduce an additional parameter?
> 
> Yes and no.
> I wasn't thinking to add a new parameter for number of supporter interrupt 
> vectors,
> but just to have an internal variable to represent it so that we could 
> support in future
> case where these are not equal.
> 
> Also from point of view of validating the users of this virtual nvme drive, I 
> think it kind
> of makes sense to allow having less supported IRQ vectors than IO queues, so 
> to check
> how userspace copes with it. It is valid after all to have same interrupt 
> vector shared between
> multiple queues.
> 

I see that this could be useful for testing, but I think we can defer
that to a later patch. Would you be okay with that for now?

> In fact in theory (but that would complicate the implementation greatly) we 
> should even support
> case when number of submission queues is not equal to number of completion 
> queues. Yes nobody does in real hardware,
> and at least Linux nvme driver hard assumes 1:1 SQ/CQ mapping but still.
> 

It is not the hardware that decides this, and I believe there
definitely are applications that choose to associate multiple SQs with
a single CQ. The CQ is an attribute of the SQ, and the IV of the CQ is
also specified in the create command. I believe this is already
supported.

> My nvme-mdev doesn't make this assumpiton (and neither any assumptions on 
> interrupt vector counts) 
> and allows the user to have any SQ/CQ mapping as far as the spec allows
> (but it does hardcode maximum number of SQ/CQ supported)
> 
> BTW, I haven't looked at that but we should check that the virtual nvme drive 
> can cope with using legacy
> interrupt (that is MSI disabled) - nvme-mdev does support this and was tested 
> with it.
> 

Yes, this is definitely not very well tested.

If you insist on all of the above being implemented, then I will do it,
but I would rather defer this to later patches as this series is already
pretty large ;)



Re: [PATCH v1] nvme: indicate CMB support through controller capabilities register

2020-04-07 Thread Klaus Birkelund Jensen
On Apr  1 11:42, Andrzej Jakowski wrote:
> This patch sets CMBS bit in controller capabilities register when user
> configures NVMe driver with CMB support, so capabilites are correctly reported
> to guest OS.
> 
> Signed-off-by: Andrzej Jakowski 
> ---
>  hw/block/nvme.c  | 2 ++
>  include/block/nvme.h | 4 
>  2 files changed, 6 insertions(+)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index d28335cbf3..986803398f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1393,6 +1393,8 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> **errp)
>  n->bar.intmc = n->bar.intms = 0;
>  
>  if (n->cmb_size_mb) {
> +/* Contoller capabilities */
> +NVME_CAP_SET_CMBS(n->bar.cap, 1);
>  
>  NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
>  NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 8fb941c653..561891b140 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -27,6 +27,7 @@ enum NvmeCapShift {
>  CAP_CSS_SHIFT  = 37,
>  CAP_MPSMIN_SHIFT   = 48,
>  CAP_MPSMAX_SHIFT   = 52,
> +CAP_CMB_SHIFT  = 57,
>  };
>  
>  enum NvmeCapMask {
> @@ -39,6 +40,7 @@ enum NvmeCapMask {
>  CAP_CSS_MASK   = 0xff,
>  CAP_MPSMIN_MASK= 0xf,
>  CAP_MPSMAX_MASK= 0xf,
> +CAP_CMB_MASK   = 0x1,
>  };
>  
>  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
> @@ -69,6 +71,8 @@ enum NvmeCapMask {
> << 
> CAP_MPSMIN_SHIFT)
>  #define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & 
> CAP_MPSMAX_MASK)\
>  << 
> CAP_MPSMAX_SHIFT)
> +#define NVME_CAP_SET_CMBS(cap, val) (cap |= (uint64_t)(val & CAP_CMB_MASK)\
> +<< CAP_CMB_SHIFT)
>  
>  enum NvmeCcShift {
>  CC_EN_SHIFT = 0,
> -- 
> 2.21.1
> 

Looks good.

Reviewed-by: Klaus Jensen 



Re: [PATCH v6 12/42] nvme: add support for the get log page command

2020-03-31 Thread Klaus Birkelund Jensen
On Mar 31 12:45, Maxim Levitsky wrote:
> On Tue, 2020-03-31 at 07:41 +0200, Klaus Birkelund Jensen wrote:
> > On Mar 25 12:40, Maxim Levitsky wrote:
> > > On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > > > From: Klaus Jensen 
> > > > 
> > > > Add support for the Get Log Page command and basic implementations of
> > > > the mandatory Error Information, SMART / Health Information and Firmware
> > > > Slot Information log pages.
> > > > 
> > > > In violation of the specification, the SMART / Health Information log
> > > > page does not persist information over the lifetime of the controller
> > > > because the device has no place to store such persistent state.
> > > > 
> > > > Note that the LPA field in the Identify Controller data structure
> > > > intentionally has bit 0 cleared because there is no namespace specific
> > > > information in the SMART / Health information log page.
> > > > 
> > > > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > > > Section 5.10 ("Get Log Page command").
> > > > 
> > > > Signed-off-by: Klaus Jensen 
> > > > Acked-by: Keith Busch 
> > > > ---
> > > >  hw/block/nvme.c   | 138 +-
> > > >  hw/block/nvme.h   |  10 +++
> > > >  hw/block/trace-events |   2 +
> > > >  3 files changed, 149 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > > index 64c42101df5c..83ff3fbfb463 100644
> > > > 
> > > > +static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > > > buf_len,
> > > > +uint64_t off, NvmeRequest *req)
> > > > +{
> > > > +uint32_t trans_len;
> > > > +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
> > > > +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
> > > > +uint8_t errlog[64];
> > > 
> > > I'll would replace this with sizeof(NvmeErrorLogEntry)
> > > (and add NvmeErrorLogEntry to the nvme.h), just for the sake of 
> > > consistency,
> > > and in case we end up reporting some errors to the log in the future.
> > > 
> > 
> > NvmeErrorLog is already in nvme.h; Fixed to actually use it.
> True that! I'll would rename it to NvmeErrorLogEntry to be honest
> (in that patch that added many nvme spec changes) but I don't mind
> keeping it as is as well.
> 
 
It is used in the block driver (block/nvme.c) as well, and I'd rather
not involve that too much in this series.



Re: [PATCH v6 07/42] nvme: refactor nvme_addr_read

2020-03-31 Thread Klaus Birkelund Jensen
On Mar 31 13:41, Maxim Levitsky wrote:
> On Tue, 2020-03-31 at 07:39 +0200, Klaus Birkelund Jensen wrote:
> > On Mar 25 12:38, Maxim Levitsky wrote:
> > > Note that this patch still contains a bug that it removes the check 
> > > against the accessed
> > > size, which you fix in later patch.
> > > I prefer to not add a bug in first place
> > > However if you have a reason for this, I won't mind.
> > > 
> > 
> > So yeah. The resons is that there is actually no bug at this point
> > because the controller only supports PRPs. I actually thought there was
> > a bug as well and reported it to qemu-security some months ago as a
> > potential out of bounds access. I was then schooled by Keith on how PRPs
> > work ;) Below is a paraphrased version of Keiths analysis.
> > 
> > The PRPs does not cross page boundaries:
> True
> 
> > 
> > trans_len = n->page_size - (prp1 % n->page_size);
> > 
> > The PRPs are always verified to be page aligned:
> True
> > 
> > if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
> > 
> > and the transfer length won't go above the page size. So, since the
> > beginning of the address is within the CMB, and considering that the
> > CMB is of an MB-aligned and MB-sized granularity, we can never cross
> > outside it with PRPs.
> I understand now, however the reason I am arguing about this is
> that this patch actually _removes_ the size bound check
> 
> It was before the patch:
> 
> n->cmbsz && addr >= n->ctrl_mem.addr &&
>   addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)
> 

I don't think it does - the check is just moved to nvme_addr_is_cmb:

static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
{
hwaddr low = n->ctrl_mem.addr;
hwaddr hi  = n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size);

return addr >= low && addr < hi;
}

We check that `addr` is less than `hi`. Maybe the name is unfortunate...





Re: [PATCH v6 38/42] nvme: support multiple namespaces

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:59, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > Reviewed-by: Keith Busch 
> > ---
> >  hw/block/Makefile.objs |   2 +-
> >  hw/block/nvme-ns.c | 157 +++
> >  hw/block/nvme-ns.h |  60 +++
> >  hw/block/nvme.c| 233 ++---
> >  hw/block/nvme.h|  47 -
> >  hw/block/trace-events  |   4 +-
> >  6 files changed, 389 insertions(+), 114 deletions(-)
> >  create mode 100644 hw/block/nvme-ns.c
> >  create mode 100644 hw/block/nvme-ns.h
> > 
> > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> > index 4b4a2b338dc4..d9141d6a4b9b 100644
> > --- a/hw/block/Makefile.objs
> > +++ b/hw/block/Makefile.objs

> > @@ -2518,9 +2561,6 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  id->psd[0].mp = cpu_to_le16(0x9c4);
> >  id->psd[0].enlat = cpu_to_le32(0x10);
> >  id->psd[0].exlat = cpu_to_le32(0x4);
> > -if (blk_enable_write_cache(n->conf.blk)) {
> > -id->vwc = 1;
> > -}
> Shouldn't that be kept? Assuming that user used the legacy 'drive' option,
> and it had no write cache enabled.
> 

When using the drive option we still end up calling the same code that
handles the "new style" namespaces and that code will handle the write
cache similarly.

> >  
> >  n->bar.cap = 0;
> >  NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
> > @@ -2533,25 +2573,34 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  n->bar.intmc = n->bar.intms = 0;
> >  }
> >  
> > -static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error 
> > **errp)
> > +int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> >  {
> > -int64_t bs_size;
> > -NvmeIdNs *id_ns = &ns->id_ns;
> > +uint32_t nsid = nvme_nsid(ns);
> >  
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg_errno(errp, -bs_size, "blk_getlength");
> > +if (nsid > NVME_MAX_NAMESPACES) {
> > +error_setg(errp, "invalid nsid (must be between 0 and %d)",
> > +   NVME_MAX_NAMESPACES);
> >  return -1;
> >  }
> >  
> > -id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > -n->ns_size = bs_size;
> > +if (!nsid) {
> > +for (int i = 1; i <= n->num_namespaces; i++) {
> > +NvmeNamespace *ns = nvme_ns(n, i);
> > +if (!ns) {
> > +nsid = i;
> > +break;
> > +}
> > +}
> This misses an edge error case, where all the namespaces are allocated.
> Yes, it would be insane to allocate all 256 namespaces but still.
> 

Impressive catch! Fixed!

> 
> > +} else {
> > +if (n->namespaces[nsid - 1]) {
> > +error_setg(errp, "nsid must be unique");
> 
> I''l would change that error message to something like 
> "namespace id %d is already in use" or something like that.
> 

Done.




Re: [PATCH v6 32/42] nvme: allow multiple aios per command

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:57, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > This refactors how the device issues asynchronous block backend
> > requests. The NvmeRequest now holds a queue of NvmeAIOs that are
> > associated with the command. This allows multiple aios to be issued for
> > a command. Only when all requests have been completed will the device
> > post a completion queue entry.
> > 
> > Because the device is currently guaranteed to only issue a single aio
> > request per command, the benefit is not immediately obvious. But this
> > functionality is required to support metadata, the dataset management
> > command and other features.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c   | 377 +++---
> >  hw/block/nvme.h   | 129 +--
> >  hw/block/trace-events |   6 +
> >  3 files changed, 407 insertions(+), 105 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 0d2b5b45b0c5..817384e3b1a9 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -373,6 +374,99 @@ static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, 
> > QEMUSGList *qsg,
> >  return nvme_map_prp(n, qsg, iov, prp1, prp2, len, req);
> >  }
> >  
> > +static void nvme_aio_destroy(NvmeAIO *aio)
> > +{
> > +g_free(aio);
> > +}
> > +
> > +static inline void nvme_req_register_aio(NvmeRequest *req, NvmeAIO *aio,
> I guess I'll call this nvme_req_add_aio,
> or nvme_add_aio_to_reg.
> Thoughts?
> Also you can leave this as is, but add a comment on top explaining this
> 

nvme_req_add_aio it is :) And comment added.

> > + NvmeAIOOp opc)
> > +{
> > +aio->opc = opc;
> > +
> > +trace_nvme_dev_req_register_aio(nvme_cid(req), aio, blk_name(aio->blk),
> > +aio->offset, aio->len,
> > +nvme_aio_opc_str(aio), req);
> > +
> > +if (req) {
> > +QTAILQ_INSERT_TAIL(&req->aio_tailq, aio, tailq_entry);
> > +}
> > +}
> > +
> > +static void nvme_submit_aio(NvmeAIO *aio)
> OK, this name makes sense
> Also please add a comment on top.

Done.

> > @@ -505,9 +600,11 @@ static inline uint16_t nvme_check_mdts(NvmeCtrl *n, 
> > size_t len,
> >  return NVME_SUCCESS;
> >  }
> >  
> > -static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, NvmeNamespace *ns,
> > - uint16_t ctrl, NvmeRequest *req)
> > +static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, uint16_t ctrl,
> > + NvmeRequest *req)
> >  {
> > +NvmeNamespace *ns = req->ns;
> > +
> This should go to the patch that added nvme_check_prinfo
> 

Probably killing that patch.

> > @@ -516,10 +613,10 @@ static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, 
> > NvmeNamespace *ns,
> >  return NVME_SUCCESS;
> >  }
> >  
> > -static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
> > - uint64_t slba, uint32_t nlb,
> > - NvmeRequest *req)
> > +static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
> > + uint32_t nlb, NvmeRequest *req)
> >  {
> > +NvmeNamespace *ns = req->ns;
> >  uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
> This should go to the patch that added nvme_check_bounds as well
> 

We can't really, because the NvmeRequest does not hold a reference to
the namespace as a struct member at that point. This is also an issue
with the nvme_check_prinfo function above.

> >  
> >  if (unlikely(UINT64_MAX - slba < nlb || slba + nlb > nsze)) {
> > @@ -530,55 +627,154 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl 
> > *n, NvmeNamespace *ns,
> >  return NVME_SUCCESS;
> >  }
> >  
> > -static void nvme_rw_cb(void *opaque, int ret)
> > +static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +NvmeNamespace *ns = req->ns;
> > +NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
> > +uint16_t ctrl = le16_to_cpu(rw->control);
> > +size_t len = req->nlb << nvme_ns_lbads(ns);
> > +uint16_t status;
> > +
> > +status = nvme_check_mdts(n, len, req);
> > +if (status) {
> > +return status;
> > +}
> > +
> > +status = nvme_check_prinfo(n, ctrl, req);
> > +if (status) {
> > +return status;
> > +}
> > +
> > +status = nvme_check_bounds(n, req->slba, req->nlb, req);
> > +if (status) {
> > +return status;
> > +}
> > +
> > +return NVME_SUCCESS;
> > +}
> 
> Nitpick: I hate to say it but nvme_check_rw should be in a separate patch as 
> well.
> It will also make diff more readable (when adding a funtion and changing a 
> function
> at the same time, you get a diff between two unrelated things)
> 

Done, but had to do it as a follow-up patch.

Re: [PATCH v6 36/42] nvme: add support for scatter gather lists

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:58, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > For now, support the Data Block, Segment and Last Segment descriptor
> > types.
> > 
> > See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c   | 310 +++---
> >  hw/block/trace-events |   4 +
> >  2 files changed, 262 insertions(+), 52 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 49d323566393..b89b96990f52 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -76,7 +76,12 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr 
> > addr)
> >  
> >  static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> > -if (n->bar.cmbsz && nvme_addr_is_cmb(n, addr)) {
> > +hwaddr hi = addr + size;
> > +if (hi < addr) {
> > +return 1;
> > +}
> > +
> > +if (n->bar.cmbsz && nvme_addr_is_cmb(n, addr) && nvme_addr_is_cmb(n, 
> > hi)) {
> 
> I would suggest to split this into a separate patch as well, since this 
> contains not just one but 2 bugfixes
> for this function and they are not related to sg lists.
> Or at least move this to 'nvme: refactor nvme_addr_read' and rename this patch
> to something like 'nvme: fix and refactor nvme_addr_read'
> 

I've split it into a patch.

> 
> >  memcpy(buf, nvme_addr_to_cmb(n, addr), size);
> >  return 0;
> >  }
> > @@ -328,13 +333,242 @@ unmap:
> >  return status;
> >  }
> >  
> > -static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > - uint64_t prp1, uint64_t prp2, DMADirection 
> > dir,
> > +static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
> > +  QEMUIOVector *iov,
> > +  NvmeSglDescriptor *segment, uint64_t 
> > nsgld,
> > +  size_t *len, NvmeRequest *req)
> > +{
> > +dma_addr_t addr, trans_len;
> > +uint32_t blk_len;
> > +uint16_t status;
> > +
> > +for (int i = 0; i < nsgld; i++) {
> > +uint8_t type = NVME_SGL_TYPE(segment[i].type);
> > +
> > +if (type != NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> > +switch (type) {
> > +case NVME_SGL_DESCR_TYPE_BIT_BUCKET:
> > +case NVME_SGL_DESCR_TYPE_KEYED_DATA_BLOCK:
> > +return NVME_SGL_DESCR_TYPE_INVALID | NVME_DNR;
> > +default:
> To be honest I don't like that 'default'
> I would explicitly state which segment types remain 
> (I think segment list and last segment list, and various reserved types)
> In fact for the reserved types you probably also want to return 
> NVME_SGL_DESCR_TYPE_INVALID)
> 

I "negated" the logic which I think is more readable. I still really
want to keep the default, for instance, nvme v1.4 adds a new type that
we do not support (the Transport SGL Data Block descriptor).

> Also this function as well really begs to have a description prior to it,
> something like 'map a sg list section, assuming that it only contains SGL 
> data descriptions,
> caller has to ensure this'.
> 

Done.

> 
> > +return NVME_INVALID_NUM_SGL_DESCRS | NVME_DNR;
> > +}
> > +}
> > +
> > +if (*len == 0) {
> > +uint16_t sgls = le16_to_cpu(n->id_ctrl.sgls);
> Nitpick: I would add a small comment here as well describing
> what this does (we reach this point if the sg list covers more than what
> was specified in the command, and the NVME_CTRL_SGLS_EXCESS_LENGTH controller
> capability indicates that we support just throwing the extra data away)
> 

Adding a comment. It's the other way around. The size as indicated by
NLB (or whatever, depending on the command) is the "authoritative" source
of information for the size of the payload. We will never accept an SGL
that is too short such that we lose or throw away data, but we might
accept ignoring parts of the SGL.

> > +if (sgls & NVME_CTRL_SGLS_EXCESS_LENGTH) {
> > +break;
> > +}
> > +
> > +trace_nvme_dev_err_invalid_sgl_excess_length(nvme_cid(req));
> > +return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
> > +}
> > +
> > +addr = le64_to_cpu(segment[i].addr);
> > +blk_len = le32_to_cpu(segment[i].len);
> > +
> > +if (!blk_len) {
> > +continue;
> > +}
> > +
> > +if (UINT64_MAX - addr < blk_len) {
> > +return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
> > +}
> Good!
> > +
> > +trans_len = MIN(*len, blk_len);
> > +
> > +status = nvme_map_addr(n, qsg, iov, addr, trans_len);
> > +if (status) {
> > +return status;
> > +}
> > +
> > +*len -= trans_len;
> > +}
> > +
> > +return NVME_SUCCESS;
> > +}
> > +
> > +static uint16_t nv

Re: [PATCH v6 35/42] nvme: handle dma errors

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:58, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Handling DMA errors gracefully is required for the device to pass the
> > block/011 test ("disable PCI device while doing I/O") in the blktests
> > suite.
> > 
> > With this patch the device passes the test by retrying "critical"
> > transfers (posting of completion entries and processing of submission
> > queue entries).
> > 
> > If DMA errors occur at any other point in the execution of the command
> > (say, while mapping the PRPs), the command is aborted with a Data
> > Transfer Error status code.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c   | 45 ---
> >  hw/block/trace-events |  2 ++
> >  include/block/nvme.h  |  2 +-
> >  3 files changed, 37 insertions(+), 12 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 15ca2417af04..49d323566393 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -164,7 +164,7 @@ static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, 
> > QEMUIOVector *iov, hwaddr addr,
> >size_t len)
> >  {
> >  if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len)) {
> > -return NVME_DATA_TRAS_ERROR;
> > +return NVME_DATA_TRANSFER_ERROR;
> 
> Minor nitpick: this is also a non functional refactoring.
> I don't think that each piece of a refactoring should be in a separate patch,
> so I usually group all the non functional (aka cosmetic) refactoring in one 
> patch, usually the first in the series.
> But I try not to leave such refactoring in the functional patches.
> 
> However, since there is not that much cases like that left, I don't mind 
> leaving this particular case as is.
> 

Noted. Keeping it here for now ;)

> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 




Re: [PATCH v6 31/42] nvme: add check for prinfo

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:57, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Check the validity of the PRINFO field.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 50 ---
> >  hw/block/trace-events |  1 +
> >  include/block/nvme.h  |  1 +
> >  3 files changed, 44 insertions(+), 8 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 7d5340c272c6..0d2b5b45b0c5 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -505,6 +505,17 @@ static inline uint16_t nvme_check_mdts(NvmeCtrl *n, 
> > size_t len,
> >  return NVME_SUCCESS;
> >  }
> >  
> > +static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, NvmeNamespace *ns,
> > + uint16_t ctrl, NvmeRequest *req)
> > +{
> > +if ((ctrl & NVME_RW_PRINFO_PRACT) && !(ns->id_ns.dps & DPS_TYPE_MASK)) 
> > {
> > +trace_nvme_dev_err_prinfo(nvme_cid(req), ctrl);
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> 
> I refreshed my (still very limited) knowledge on the metadata
> and the protection info, and this is what I found:
> 
> I think that this is very far from complete, because we also have:
> 
> 1. PRCHECK. According to the spec it is independent of PRACT
>And when some of it is set, 
>together with enabled protection (the DPS field in namespace),
>Then the 8 bytes of the protection info is checked (optionally using the
>the EILBRT and ELBAT/ELBATM fields in the command and CRC of the data for 
> the guard field)
> 
>So this field should also be checked to be zero when protection is disabled
>(I don't see an explicit requirement for that in the spec, but neither I 
> see
>such requirement for PRACT)
> 
> 2. The protection values to be written / checked ((E)ILBRT/(E)LBATM/(E)LBAT)
>Same here, but also these should not be set when PRCHECK is not set for 
> reads,
>plus some are protection type specific.
> 
> 
> The spec does mention the 'Invalid Protection Information' error code which
> refers to invalid values in the PRINFO field.
> So this error code I think should be returned instead of the 'Invalid field'
> 
> Another thing to optionaly check is that the metadata pointer for separate 
> metadata.
>  Is zero as long as we don't support metadata
> (again I don't see an explicit requirement for this in the spec, but it 
> mentions:
> 
> "This field is valid only if the command has metadata that is not interleaved 
> with
> the logical block data, as specified in the Format NVM command"
> 
> )
> 

I'm kinda inclined to just drop this patch. The spec actually says that
the PRACT and PRCHK fields are used only if the namespace is formatted
to use end-to-end protection information. Since we do not support that,
I don't think we even need to check it.

Any opinion on this?



Re: [PATCH v6 23/42] nvme: add mapping helpers

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:45, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add nvme_map_addr, nvme_map_addr_cmb and nvme_addr_to_cmb helpers and
> > use them in nvme_map_prp.
> > 
> > This fixes a bug where in the case of a CMB transfer, the device would
> > map to the buffer with a wrong length.
> > 
> > Fixes: b2b2b67a00574 ("nvme: Add support for Read Data and Write Data in 
> > CMBs.")
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 97 +++
> >  hw/block/trace-events |  1 +
> >  2 files changed, 81 insertions(+), 17 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 08267e847671..187c816eb6ad 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -153,29 +158,79 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue 
> > *cq)
> >  }
> >  }
> >  
> > +static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr 
> > addr,
> > +  size_t len)
> > +{
> > +if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len)) {
> > +return NVME_DATA_TRAS_ERROR;
> > +}
> 
> I just noticed that
> in theory (not that it really matters) but addr+len refers to the byte which 
> is already 
> not the part of the transfer.
> 

Oh. Good catch - and I think that it does matter? Can't we end up
rejecting a valid access? Anyway, I fixed it with a '- 1'.

> 
> > +
> > +qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len);
> Also interesting: we can add a 0-sized iovec.
> 

I added a check on len. This also makes sure the above '- 1' fix doesn't
cause an 'addr + 0 - 1' to be done.




Re: [PATCH v6 24/42] nvme: remove redundant has_sg member

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:45, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Remove the has_sg member from NvmeRequest since it's redundant.
> 
> To be honest this patch also replaces the dma_acct_start with block_acct_start
> which looks right to me, and IMHO its OK to have both in the same patch,
> but that should be mentioned in the commit message
> 

I pulled it to a separate patch :)

> With this fixed,
> Reviewed-by: Maxim Levitsky 
> 




Re: [PATCH v6 29/42] nvme: refactor request bounds checking

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:56, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 28 ++--
> >  1 file changed, 22 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index eecfad694bf8..ba520c76bae5 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -562,13 +577,14 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace 
> > *ns, NvmeCmd *cmd,
> >  uint64_t data_offset = slba << data_shift;
> >  int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
> >  enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : 
> > BLOCK_ACCT_READ;
> > +uint16_t status;
> >  
> >  trace_nvme_dev_rw(is_write ? "write" : "read", nlb, data_size, slba);
> >  
> > -if (unlikely((slba + nlb) > ns->id_ns.nsze)) {
> > +status = nvme_check_bounds(n, ns, slba, nlb, req);
> > +if (status) {
> >  block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > -trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
> > -return NVME_LBA_RANGE | NVME_DNR;
> > +return status;
> >  }
> >  
> >  if (nvme_map(n, cmd, &req->qsg, &req->iov, data_size, req)) {
> Looks good as well, once we get support for discard, it will
> use this as well, but for now indeed only write zeros and read/write
> need bounds checking on the IO path.
> 

I have that patch in the submission queue and the check is factored out
there ;)

> Reviewed-by: Maxim Levitsky 
> 




Re: [PATCH v6 14/42] nvme: add missing mandatory features

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:41, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add support for returning a reasonable response to Get/Set Features of
> > mandatory features.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c   | 60 ++-
> >  hw/block/trace-events |  2 ++
> >  include/block/nvme.h  |  6 -
> >  3 files changed, 66 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index ff8975cd6667..eb9c722df968 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1058,6 +1069,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  break;
> >  case NVME_TIMESTAMP:
> >  return nvme_get_feature_timestamp(n, cmd);
> > +case NVME_INTERRUPT_COALESCING:
> > +result = cpu_to_le32(n->features.int_coalescing);
> > +break;
> > +case NVME_INTERRUPT_VECTOR_CONF:
> > +if ((dw11 & 0x) > n->params.max_ioqpairs + 1) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> I still think that this should be >= since the interrupt vector is not zero 
> based.
> So if we have for example 3 IO queues, then we have 4 queues in total
> which translates to irq numbers 0..3.
> 

Yes you are right. The device will support max_ioqpairs + 1 IVs, so
trying to access that would actually go 1 beyond the array.

Fixed.

> BTW the user of the device doesn't have to have 1:1 mapping between qid and 
> msi interrupt index,
> in fact when MSI is not used, all the queues will map to the same vector, 
> which will be interrupt 0
> from point of view of the device IMHO.
> So it kind of makes sense IMHO to have num_irqs or something, even if it 
> technically equals to number of queues.
> 

Yeah, but the device will still *support* the N IVs, so they can still
be configured even though they will not be used. So I don't think we
need to introduce an additional parameter?

> > @@ -1120,6 +1146,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  
> >  break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> > +if (blk_enable_write_cache(n->conf.blk)) {
> > +blk_flush(n->conf.blk);
> > +}
> 
> (not your fault) but the blk_enable_write_cache function name is highly 
> misleading,
> since it doesn't enable anything but just gets the flag if the write cache is 
> enabled.
> It really should be called blk_get_enable_write_cache.
> 

Agreed :)

> > @@ -1804,6 +1860,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  id->cqes = (0x4 << 4) | 0x4;
> >  id->nn = cpu_to_le32(n->num_namespaces);
> >  id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
> > +
> Unrelated whitespace change

Fixed.

> 
> Best regards,
>   Maxim Levitsky
> 
> 
> 
> 



Re: [PATCH v6 12/42] nvme: add support for the get log page command

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:40, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add support for the Get Log Page command and basic implementations of
> > the mandatory Error Information, SMART / Health Information and Firmware
> > Slot Information log pages.
> > 
> > In violation of the specification, the SMART / Health Information log
> > page does not persist information over the lifetime of the controller
> > because the device has no place to store such persistent state.
> > 
> > Note that the LPA field in the Identify Controller data structure
> > intentionally has bit 0 cleared because there is no namespace specific
> > information in the SMART / Health information log page.
> > 
> > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > Section 5.10 ("Get Log Page command").
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c   | 138 +-
> >  hw/block/nvme.h   |  10 +++
> >  hw/block/trace-events |   2 +
> >  3 files changed, 149 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 64c42101df5c..83ff3fbfb463 100644
> >
> > +static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > buf_len,
> > +uint64_t off, NvmeRequest *req)
> > +{
> > +uint32_t trans_len;
> > +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
> > +uint8_t errlog[64];
> I would replace this with sizeof(NvmeErrorLogEntry)
> (and add NvmeErrorLogEntry to the nvme.h), just for the sake of consistency,
> and in case we end up reporting some errors to the log in the future.
> 

NvmeErrorLog is already in nvme.h; Fixed to actually use it.

> 
> > +
> > +if (off > sizeof(errlog)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +memset(errlog, 0x0, sizeof(errlog));
> > +
> > +trans_len = MIN(sizeof(errlog) - off, buf_len);
> > +
> > +return nvme_dma_read_prp(n, errlog, trans_len, prp1, prp2);
> > +}
> Besides this, looks good now.
> 
> 
> Best regards,
>   Maxim Levitsky
> 




Re: [PATCH v6 19/42] nvme: enforce valid queue creation sequence

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:43, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Support returning Command Sequence Error if Set Features on Number of
> > Queues is called after queues have been created.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 7 +++
> >  hw/block/nvme.h | 1 +
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 007f8817f101..b40d27cddc46 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -881,6 +881,8 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd 
> > *cmd)
> >  cq = g_malloc0(sizeof(*cq));
> >  nvme_init_cq(cq, n, prp1, cqid, vector, qsize + 1,
> >  NVME_CQ_FLAGS_IEN(qflags));
> > +
> > +n->qs_created = true;
> Very minor nitpick, maybe it is worth mentioning in a comment,
> why this is only needed in CQ creation, as you explained to me.
> 

Added.

> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 
> 
> 
> 
> 



Re: [PATCH v6 11/42] nvme: add temperature threshold feature

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:40, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > It might seem weird to implement this feature for an emulated device,
> > but it is mandatory to support and the feature is useful for testing
> > asynchronous event request support, which will be added in a later
> > patch.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c  | 48 
> >  hw/block/nvme.h  |  2 ++
> >  include/block/nvme.h |  8 +++-
> >  3 files changed, 57 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index b7c465560eea..8cda5f02c622 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -108,6 +108,7 @@ typedef struct NvmeCtrl {
> >  uint64_tirq_status;
> >  uint64_thost_timestamp; /* Timestamp sent by the 
> > host */
> >  uint64_ttimestamp_set_qemu_clock_ms;/* QEMU clock time */
> > +uint16_ttemperature;
> You forgot to move this too.
> 

Fixed!
> 
> With 'temperature' field removed from the header:
> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 




Re: [PATCH v6 09/42] nvme: add max_ioqpairs device parameter

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:39, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > The num_queues device parameter has a slightly confusing meaning because
> > it accounts for the admin queue pair which is not really optional.
> > Secondly, it is really a maximum value of queues allowed.
> > 
> > Add a new max_ioqpairs parameter that only accounts for I/O queue pairs,
> > but keep num_queues for compatibility.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 45 ++---
> >  hw/block/nvme.h |  4 +++-
> >  2 files changed, 29 insertions(+), 20 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 7cf7cf55143e..7dfd8a1a392d 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1332,9 +1333,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >  int64_t bs_size;
> >  uint8_t *pci_conf;
> >  
> > -if (!n->params.num_queues) {
> > -error_setg(errp, "num_queues can't be zero");
> > -return;
> > +if (n->params.num_queues) {
> > +warn_report("nvme: num_queues is deprecated; please use 
> > max_ioqpairs "
> > +"instead");
> > +
> > +n->params.max_ioqpairs = n->params.num_queues - 1;
> > +}
> > +
> > +if (!n->params.max_ioqpairs) {
> > +error_setg(errp, "max_ioqpairs can't be less than 1");
> >  }
> This is not even a nitpick, but just and idea.
> 
> It might be worth it to allow max_ioqpairs=0 to simulate a 'broken'
> nvme controller. I know that the kernel has special handling for such controllers,
> which include only creation of the control character device (/dev/nvme*) 
> through
> which the user can submit commands to try and 'fix' the controller (by 
> re-uploading firmware
> maybe or something like that).
> 
> 

Not sure about the implications of this, so I'll leave that on the TODO
:) But a controller with no I/O queues is an "Administrative Controller"
and perfectly legal in NVMe v1.4 AFAIK.

> >  
> >  if (!n->conf.blk) {
> > @@ -1365,19 +1372,19 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >  pcie_endpoint_cap_init(pci_dev, 0x80);
> >  
> >  n->num_namespaces = 1;
> > -n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
> > +n->reg_size = pow2ceil(0x1008 + 2 * (n->params.max_ioqpairs) * 4);
> 
> I hate to say it, but it looks like this thing (which I mentioned to
> you in V5) was a pre-existing bug, which is indeed fixed now.
> In theory such fixes should go to separate patches, but in this case I
> guess it would be too much to ask for it.
> Maybe mention this in the commit message instead, so that this fix
> doesn't stay hidden like that?
> 
> 

I'm convinced now. I have added a preparatory bugfix patch before this
patch.

> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 




Re: [PATCH v6 10/42] nvme: refactor device realization

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:40, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > This patch splits up nvme_realize into multiple individual functions,
> > each initializing a different subset of the device.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c | 178 ++--
> >  hw/block/nvme.h |  23 ++-
> >  2 files changed, 134 insertions(+), 67 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 7dfd8a1a392d..665485045066 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1340,57 +1337,100 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >  n->params.max_ioqpairs = n->params.num_queues - 1;
> >  }
> >  
> > -if (!n->params.max_ioqpairs) {
> > -error_setg(errp, "max_ioqpairs can't be less than 1");
> > +if (params->max_ioqpairs < 1 ||
> > +params->max_ioqpairs > PCI_MSIX_FLAGS_QSIZE) {
> > +error_setg(errp, "nvme: max_ioqpairs must be ");
> Looks like the error message is not complete now.

Fixed!

> 
> Small nitpick: To be honest this is not only refactoring of the device
> realization, since you also (rightfully) removed the duplicated
> cmbsz/cmbloc, so I would add a mention of this in the commit message.
> But that doesn't matter that much, so
> 

You are right. I've added it as a separate patch.

> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 




Re: [PATCH v6 05/42] nvme: use constant for identify data size

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:37, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 40cb176dea3c..f716f690a594 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -679,7 +679,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
> >  
> >  static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
> >  {
> > -static const int data_len = 4 * KiB;
> > +static const int data_len = NVME_IDENTIFY_DATA_SIZE;
> >  uint32_t min_nsid = le32_to_cpu(c->nsid);
> >  uint64_t prp1 = le64_to_cpu(c->prp1);
> >  uint64_t prp2 = le64_to_cpu(c->prp2);
> 
> I'll probably squash this with some other refactoring patch,
> but I absolutely don't mind leaving this as is.
> Fine grained patches never cause any harm.
> 

Squashed into 06/42.

> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 
> 
> 
> 



Re: [PATCH v6 06/42] nvme: add identify cns values in header

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:37, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f716f690a594..b38d7e548a60 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -709,11 +709,11 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> >  NvmeIdentify *c = (NvmeIdentify *)cmd;
> >  
> >  switch (le32_to_cpu(c->cns)) {
> > -case 0x00:
> > +case NVME_ID_CNS_NS:
> >  return nvme_identify_ns(n, c);
> > -case 0x01:
> > +case NVME_ID_CNS_CTRL:
> >  return nvme_identify_ctrl(n, c);
> > -case 0x02:
> > +case NVME_ID_CNS_NS_ACTIVE_LIST:
> >  return nvme_identify_nslist(n, c);
> >  default:
> >  trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
> 
> This is a very good candidate to be squashed with patch 5 IMHO,
> but you can leave this as is as well. I don't mind.
> 

Squashed!

> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 
> 
> 
> 
> 
> 



Re: [PATCH v6 07/42] nvme: refactor nvme_addr_read

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:38, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Pull the controller memory buffer check to its own function. The check
> > will be used on its own in later patches.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c | 16 
> >  1 file changed, 12 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index b38d7e548a60..08a83d449de3 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -52,14 +52,22 @@
> >  
> >  static void nvme_process_sq(void *opaque);
> >  
> > +static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> > +{
> > +hwaddr low = n->ctrl_mem.addr;
> > +hwaddr hi  = n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size);
> > +
> > +return addr >= low && addr < hi;
> > +}
> > +
> >  static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> > -if (n->cmbsz && addr >= n->ctrl_mem.addr &&
> > -addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
> > +if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> >  memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
> > -} else {
> > -pci_dma_read(&n->parent_obj, addr, buf, size);
> > +return;
> >  }
> > +
> > +pci_dma_read(&n->parent_obj, addr, buf, size);
> >  }
> >  
> >  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
> 
> Note that this patch still contains a bug in that it removes the check
> against the accessed size, which you fix in a later patch.
> I prefer not to add a bug in the first place.
> However if you have a reason for this, I won't mind.
> 

So yeah. The reason is that there is actually no bug at this point
because the controller only supports PRPs. I actually thought there was
a bug as well and reported it to qemu-security some months ago as a
potential out of bounds access. I was then schooled by Keith on how PRPs
work ;) Below is a paraphrased version of Keith's analysis.

The PRPs do not cross page boundaries:

trans_len = n->page_size - (prp1 % n->page_size);

The PRPs are always verified to be page aligned:

if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {

and the transfer length won't go above the page size. So, since the
start address is within the CMB, and considering that the CMB is
MB-aligned and MB-sized, we can never cross outside it with PRPs.

I could add the check at this point (because it *is* needed for when
SGLs are introduced), but I think it would just be noise and I would
need to explain why the check is there, but not really needed at this
point. Instead I'm adding a new patch before the SGL patch that explains
this.



Re: [PATCH v6 04/42] nvme: bump spec data structures to v1.3

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:37, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add missing fields in the Identify Controller and Identify Namespace
> > data structures to bring them in line with NVMe v1.3.
> > 
> > This also adds data structures and defines for SGL support which
> > requires a couple of trivial changes to the nvme block driver as well.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Fam Zheng 
> > ---
> >  block/nvme.c |  18 ++---
> >  hw/block/nvme.c  |  12 ++--
> >  include/block/nvme.h | 153 ++-
> >  3 files changed, 151 insertions(+), 32 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index d41c4bda6e39..99b9bb3dac96 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -589,6 +675,16 @@ enum NvmeIdCtrlOncs {
> >  #define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf)
> >  #define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf)
> >  
> > +#define NVME_CTRL_SGLS_SUPPORTED_MASK(0x3 <<  0)
> > +#define NVME_CTRL_SGLS_SUPPORTED_NO_ALIGNMENT(0x1 <<  0)
> > +#define NVME_CTRL_SGLS_SUPPORTED_DWORD_ALIGNMENT (0x1 <<  1)
> > +#define NVME_CTRL_SGLS_KEYED (0x1 <<  2)
> > +#define NVME_CTRL_SGLS_BITBUCKET (0x1 << 16)
> > +#define NVME_CTRL_SGLS_MPTR_CONTIGUOUS   (0x1 << 17)
> > +#define NVME_CTRL_SGLS_EXCESS_LENGTH (0x1 << 18)
> > +#define NVME_CTRL_SGLS_MPTR_SGL  (0x1 << 19)
> > +#define NVME_CTRL_SGLS_ADDR_OFFSET   (0x1 << 20)
> OK
> > +
> >  typedef struct NvmeFeatureVal {
> >  uint32_tarbitration;
> >  uint32_tpower_mgmt;
> > @@ -611,6 +707,10 @@ typedef struct NvmeFeatureVal {
> >  #define NVME_INTC_THR(intc) (intc & 0xff)
> >  #define NVME_INTC_TIME(intc)((intc >> 8) & 0xff)
> >  
> > +#define NVME_TEMP_THSEL(temp)  ((temp >> 20) & 0x3)
> Nitpick: If we are adding this, I'll add a #define for the values as well
> 

Done. And used in the subsequent "nvme: add temperature threshold
feature" patch.

> > +#define NVME_TEMP_TMPSEL(temp) ((temp >> 16) & 0xf)
> > +#define NVME_TEMP_TMPTH(temp)  ((temp >>  0) & 0x)
> > +
> >  enum NvmeFeatureIds {
> >  NVME_ARBITRATION= 0x1,
> >  NVME_POWER_MANAGEMENT   = 0x2,
> > @@ -653,18 +753,37 @@ typedef struct NvmeIdNs {
> >  uint8_t mc;
> >  uint8_t dpc;
> >  uint8_t dps;
> > -
> >  uint8_t nmic;
> >  uint8_t rescap;
> >  uint8_t fpi;
> >  uint8_t dlfeat;
> > -
> > -uint8_t res34[94];
> > +uint16_tnawun;
> > +uint16_tnawupf;
> > +uint16_tnacwu;
> > +uint16_tnabsn;
> > +uint16_tnabo;
> > +uint16_tnabspf;
> > +uint16_tnoiob;
> > +uint8_t nvmcap[16];
> > +uint8_t rsvd64[40];
> > +uint8_t nguid[16];
> > +uint64_teui64;
> >  NvmeLBAFlbaf[16];
> > -uint8_t res192[192];
> > +uint8_t rsvd192[192];
> >  uint8_t vs[3712];
> >  } NvmeIdNs;
> Also checked this against V5, looks OK now
> 
> >  
> > +typedef struct NvmeIdNsDescr {
> > +uint8_t nidt;
> > +uint8_t nidl;
> > +uint8_t rsvd2[2];
> > +} NvmeIdNsDescr;
> OK
> 
> 
> 
> > +
> > +#define NVME_NIDT_UUID_LEN 16
> > +
> > +enum {
> > +NVME_NIDT_UUID = 0x3,
> Very minor nitpick: I would add the others as well, just for the sake
> of better understanding what this is
> 

Done.

> > +};
> >  
> >  /*Deallocate Logical Block Features*/
> >  #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)   ((dlfeat) & 0x10)
> 
> Looks very good.
> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 



Re: [PATCH v6 01/42] nvme: rename trace events to nvme_dev

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:36, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Change the prefix of all nvme device related trace events to 'nvme_dev'
> > to not clash with trace events from the nvme block driver.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > Reviewed-by: Maxim Levitsky 
> > ---
> >  hw/block/nvme.c   | 188 +-
> >  hw/block/trace-events | 172 +++---
> >  2 files changed, 180 insertions(+), 180 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index d28335cbf377..3e4b18956ed2 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1035,32 +1035,32 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
> >  switch (offset) {
> >  case 0xc:   /* INTMS */
> >  if (unlikely(msix_enabled(&(n->parent_obj)))) {
> > -NVME_GUEST_ERR(nvme_ub_mmiowr_intmask_with_msix,
> > +NVME_GUEST_ERR(nvme_dev_ub_mmiowr_intmask_with_msix,
> > "undefined access to interrupt mask set"
> > " when MSI-X is enabled");
> >  /* should be ignored, fall through for now */
> >  }
> >  n->bar.intms |= data & 0x;
> >  n->bar.intmc = n->bar.intms;
> > -trace_nvme_mmio_intm_set(data & 0x,
> > +trace_nvme_dev_mmio_intm_set(data & 0x,
> >   n->bar.intmc);
> Indention.
> 

Fixed.

> >  nvme_irq_check(n);
> >  break;
> >  case 0x10:  /* INTMC */
> >  if (unlikely(msix_enabled(&(n->parent_obj)))) {
> > -NVME_GUEST_ERR(nvme_ub_mmiowr_intmask_with_msix,
> > +NVME_GUEST_ERR(nvme_dev_ub_mmiowr_intmask_with_msix,
> > "undefined access to interrupt mask clr"
> > " when MSI-X is enabled");
> >  /* should be ignored, fall through for now */
> >  }
> >  n->bar.intms &= ~(data & 0x);
> >  n->bar.intmc = n->bar.intms;
> > -trace_nvme_mmio_intm_clr(data & 0x,
> > +trace_nvme_dev_mmio_intm_clr(data & 0x,
> >   n->bar.intmc);
> Indention.
> 

Fixed.

> 
> 
> Other that indention nitpicks, no changes vs V5,
> so my reviewed-by kept correctly.
> 
> Best regards,
>   Maxim Levitsky
> 



Re: [PATCH RESEND v4] nvme: introduce PMR support from NVMe 1.4 spec

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 31 03:13, Keith Busch wrote:
> On Mon, Mar 30, 2020 at 08:07:32PM +0200, Klaus Birkelund Jensen wrote:
> > On Mar 31 01:55, Keith Busch wrote:
> > > On Mon, Mar 30, 2020 at 09:46:56AM -0700, Andrzej Jakowski wrote:
> > > > This patch introduces support for PMR that has been defined as
> > > > part of NVMe 1.4 spec. User can now specify a pmrdev option that
> > > > should point to HostMemoryBackend. pmrdev memory region will
> > > > subsequently be exposed as PCI BAR 2 in emulated NVMe device.
> > > > Guest OS can perform mmio read and writes to the PMR region that
> > > > will stay persistent across system reboot.
> > > 
> > > Looks pretty good to me.
> > > 
> > > Reviewed-by: Keith Busch 
> > > 
> > > For a possible future extension, it could be interesting to select the
> > > BAR for PMR dynamically, so that you could have CMB and PMR enabled on
> > > the same device.
> > > 
> > 
> > I thought about the same thing, but this would require disabling MSI-X
> > right? Otherwise there is not enough room. AFAIU, the iomem takes up
> > BAR0 and BAR1, CMB (or PMR) uses BAR2 and BAR3 and MSI-X uses BAR4 or 5.
> 
> That's the way the emulated device is currently set, but there's no
> reason for CMB or MSI-X to use an exclusive bar. Both of those can
> be offsets into BAR 0/1, for example. PMR is the only one that can't
> share.
> 

Oh, I did not consider that at all. Thanks!



Re: [PATCH RESEND v4] nvme: introduce PMR support from NVMe 1.4 spec

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 31 01:55, Keith Busch wrote:
> On Mon, Mar 30, 2020 at 09:46:56AM -0700, Andrzej Jakowski wrote:
> > This patch introduces support for PMR that has been defined as part
> > of NVMe 1.4 spec. User can now specify a pmrdev option that should
> > point to HostMemoryBackend. pmrdev memory region will subsequently
> > be exposed as PCI BAR 2 in emulated NVMe device. Guest OS can
> > perform mmio read and writes to the PMR region that will stay
> > persistent across system reboot.
> 
> Looks pretty good to me.
> 
> Reviewed-by: Keith Busch 
> 
> For a possible future extension, it could be interesting to select the
> BAR for PMR dynamically, so that you could have CMB and PMR enabled on
> the same device.
> 

I thought about the same thing, but this would require disabling MSI-X
right? Otherwise there is not enough room. AFAIU, the iomem takes up
BAR0 and BAR1, CMB (or PMR) uses BAR2 and BAR3 and MSI-X uses BAR4 or 5.



Re: [PATCH v3] block/nvme: introduce PMR support from NVMe 1.4 spec

2020-03-19 Thread Klaus Birkelund Jensen
On Mar 18 13:03, Andrzej Jakowski wrote:
> This patch introduces support for PMR that has been defined as part of
> NVMe 1.4 spec. User can now specify a pmrdev option that should point
> to HostMemoryBackend. pmrdev memory region will subsequently be
> exposed as PCI BAR 2 in emulated NVMe device. Guest OS can perform
> mmio read and writes to the PMR region that will stay persistent
> across system reboot.
> 
> Signed-off-by: Andrzej Jakowski 
> ---
> v2:
>  - reworked PMR to use HostMemoryBackend instead of directly mapping
>    PMR backend file into qemu [1] (Stefan)
> 
> v1:
>  - provided support for Bit 1 from PMRWBM register instead of Bit 0 to
>    ensure improved performance in virtualized environment [2] (Stefan)
> 
>  - added check if pmr size is power of two in size [3] (David)
> 
>  - addressed cross compilation build problems reported by CI environment
> 
> [1]: https://lore.kernel.org/qemu-devel/20200306223853.37958-1-andrzej.jakow...@linux.intel.com/
> [2]: https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf
> [3]: https://lore.kernel.org/qemu-devel/20200218224811.30050-1-andrzej.jakow...@linux.intel.com/
> ---
> Persistent Memory Region (PMR) is a new optional feature provided in NVMe 1.4
> specification. This patch implements initial support for it in NVMe driver.
> ---
>  hw/block/nvme.c   | 117 +++-
>  hw/block/nvme.h   |   2 +
>  hw/block/trace-events |   5 ++
>  include/block/nvme.h  | 172 ++
>  4 files changed, 294 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index d28335cbf3..70fd09d293 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -19,10 +19,18 @@
>   *  -drive file=,if=none,id=
>   *  -device nvme,drive=,serial=,id=, \
>   *  cmb_size_mb=, \
> + *  [pmrdev=,] \
>   *  num_queues=
>   *
>   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
>   * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> + *
> + * Either cmb or pmr - due to limitation in available BAR indexes.
> + * pmr_file file needs to be power of two in size.
> + * Enabling pmr emulation can be achieved by pointing to memory-backend-file.
> + * For example:
> + * -object memory-backend-file,id=,share=on,mem-path=, \
> + *  size=  -device nvme,...,pmrdev=
>   */
>  
>  #include "qemu/osdep.h"
> @@ -35,7 +43,9 @@
>  #include "sysemu/sysemu.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> +#include "sysemu/hostmem.h"
>  #include "sysemu/block-backend.h"
> +#include "exec/ramblock.h"
>  
>  #include "qemu/log.h"
>  #include "qemu/module.h"
> @@ -1141,6 +1151,26 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>  NVME_GUEST_ERR(nvme_ub_mmiowr_cmbsz_readonly,
> "invalid write to read only CMBSZ, ignored");
>  return;
> +case 0xE00: /* PMRCAP */
> +NVME_GUEST_ERR(nvme_ub_mmiowr_pmrcap_readonly,
> +   "invalid write to PMRCAP register, ignored");
> +return;
> +case 0xE04: /* TODO PMRCTL */
> +break;
> +case 0xE08: /* PMRSTS */
> +NVME_GUEST_ERR(nvme_ub_mmiowr_pmrsts_readonly,
> +   "invalid write to PMRSTS register, ignored");
> +return;
> +case 0xE0C: /* PMREBS */
> +NVME_GUEST_ERR(nvme_ub_mmiowr_pmrebs_readonly,
> +   "invalid write to PMREBS register, ignored");
> +return;
> +case 0xE10: /* PMRSWTP */
> +NVME_GUEST_ERR(nvme_ub_mmiowr_pmrswtp_readonly,
> +   "invalid write to PMRSWTP register, ignored");
> +return;
> +case 0xE14: /* TODO PMRMSC */
> + break;
>  default:
>  NVME_GUEST_ERR(nvme_ub_mmiowr_invalid,
> "invalid MMIO write,"
> @@ -1169,6 +1199,23 @@ static uint64_t nvme_mmio_read(void *opaque, hwaddr addr, unsigned size)
>  }
>  
>  if (addr < sizeof(n->bar)) {
> +/*
> + * When PMRWBM bit 1 is set then read from
> + * from PMRSTS should ensure prior writes
> + * made it to persistent media
> + */
> +if (addr == 0xE08 &&
> +(NVME_PMRCAP_PMRWBM(n->bar.pmrcap) & 0x02) >> 1) {

Don't think that shift is needed.

> +int status;
> +
> +status = qemu_msync((void *)n->pmrdev->mr.ram_block->host,
> +n->pmrdev->size,
> +n->pmrdev->mr.ram_block->fd);
> +if (!status) {
> +NVME_GUEST_ERR(nvme_ub_mmiord_pmrread_barrier,
> +   "error while persisting data");
> +}
> +}
>  memcpy(&val, ptr + addr, size);
>  } else {
>  NVME_GUEST_ERR(nvme_ub_mmiord_invalid_ofs,
> @@ -1332,6 +1379,23 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> *

Re: [PATCH v5 20/26] nvme: handle dma errors

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 13:52, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > Handling DMA errors gracefully is required for the device to pass the
> > block/011 test ("disable PCI device while doing I/O") in the blktests
> > suite.
> > 
> > With this patch the device passes the test by retrying "critical"
> > transfers (posting of completion entries and processing of submission
> > queue entries).
> > 
> > If DMA errors occur at any other point in the execution of the command
> > (say, while mapping the PRPs), the command is aborted with a Data
> > Transfer Error status code.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 42 +-
> >  hw/block/trace-events |  2 ++
> >  include/block/nvme.h  |  2 +-
> >  3 files changed, 36 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f8c81b9e2202..204ae1d33234 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -73,14 +73,14 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> >  return addr >= low && addr < hi;
> >  }
> >  
> > -static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> > +static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> >  if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> >  memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
> > -return;
> > +return 0;
> >  }
> >  
> > -pci_dma_read(&n->parent_obj, addr, buf, size);
> > +return pci_dma_read(&n->parent_obj, addr, buf, size);
> >  }
> >  
> >  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
> > @@ -168,6 +168,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >  uint16_t status = NVME_SUCCESS;
> >  bool is_cmb = false;
> >  bool prp_list_in_cmb = false;
> > +int ret;
> >  
> >  trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
> >  prp1, prp2, num_prps);
> > @@ -218,7 +219,12 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >  
> >  nents = (len + n->page_size - 1) >> n->page_bits;
> >  prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > -nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> > +ret = nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> > +if (ret) {
> > +trace_nvme_dev_err_addr_read(prp2);
> > +status = NVME_DATA_TRANSFER_ERROR;
> > +goto unmap;
> > +}
> >  while (len != 0) {
> >  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
> >  
> > @@ -237,7 +243,13 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >  i = 0;
> >  nents = (len + n->page_size - 1) >> n->page_bits;
> > -prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > -nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
> > +ret = nvme_addr_read(n, prp_ent, (void *) prp_list,
> > +prp_trans);
> > +if (ret) {
> > +trace_nvme_dev_err_addr_read(prp_ent);
> > +status = NVME_DATA_TRANSFER_ERROR;
> > +goto unmap;
> > +}
> >  prp_ent = le64_to_cpu(prp_list[i]);
> >  }
> >  
> > @@ -443,6 +455,7 @@ static void nvme_post_cqes(void *opaque)
> >  NvmeCQueue *cq = opaque;
> >  NvmeCtrl *n = cq->ctrl;
> >  NvmeRequest *req, *next;
> > +int ret;
> >  
> >  QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
> >  NvmeSQueue *sq;
> > @@ -452,15 +465,21 @@ static void nvme_post_cqes(void *opaque)
> >  break;
> >  }
> >  
> > -QTAILQ_REMOVE(&cq->req_list, req, entry);
> >  sq = req->sq;
> >  req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
> >  req->cqe.sq_id = cpu_to_le16(sq->sqid);
> >  req->cqe.sq_head = cpu_to_le16(sq->head);
> >  addr = cq->dma_addr + cq->tail * n->cqe_size;
> > -nvme_inc_cq_tail(cq);
> > -pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> > +ret = pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> >  sizeof(req->cqe));
> > +if (ret) {
> > +trace_nvme_dev_err_addr_write(addr);
> > +timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> > +100 * SCALE_MS);
> > +break;
> > +}
> > +QTAILQ_REMOVE(&cq->req_list, req, entry);
> > +nvme_inc_cq_tail(cq);
> >  nvme_req_clear(req);
> >  QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> >  }
> > @@ -1588,7 +1607,12 @@ static vo

Re: [PATCH v5 16/26] nvme: refactor prp mapping

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 13:44, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Refactor nvme_map_prp and allow PRPs to be located in the CMB. The logic
> > ensures that if some of the PRP is in the CMB, all of it must be located
> > there, as per the specification.
> 
> To be honest this looks like a bugfix rather than refactoring
> (the old code was just assuming that if the first prp entry is in the
> cmb, the rest also are)

I split it up into a separate bugfix patch.

> > 
> > Also combine nvme_dma_{read,write}_prp into a single nvme_dma_prp that
> > takes an additional DMADirection parameter.
> 
> To be honest 'nvme_dma_prp' was not a clear function name to me at
> first glance. Could you rename this to nvme_dma_prp_rw or so?
> (Although even that is somewhat unclear in conveying the meaning of
> reading/writing the data to/from the guest memory areas defined by
> the prp list.)
> Also could you split this change into a new patch?
> 

Splitting into new patch.

> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> Now you even use your both addresses :-)
> 
> > ---
> >  hw/block/nvme.c   | 245 +++---
> >  hw/block/nvme.h   |   2 +-
> >  hw/block/trace-events |   1 +
> >  include/block/nvme.h  |   1 +
> >  4 files changed, 160 insertions(+), 89 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 4acfc85b56a2..334265efb21e 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -58,6 +58,11 @@
> >  
> >  static void nvme_process_sq(void *opaque);
> >  
> > +static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
> > +{
> > +return &n->cmbuf[addr - n->ctrl_mem.addr];
> > +}
> 
> To my taste I would put this together with the patch that
> added nvme_addr_is_cmb. I know that some people are against
> this citing the fact that you should use the code you add
> in the same patch. Your call.
> 
> Regardless of this, I also prefer to put refactoring patches first in
> the series.
> 
> > +
> >  static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> >  {
> >  hwaddr low = n->ctrl_mem.addr;
> > @@ -152,138 +157,187 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
> >  }
> >  }
> >  
> > -static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
> > - uint64_t prp2, uint32_t len, NvmeCtrl *n)
> > +static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > +uint64_t prp1, uint64_t prp2, uint32_t len, NvmeRequest *req)
> 
> Split line alignment (it was correct before).
> Also, while at the refactoring, it would be great to add some
> documentation to this and a few more functions, since it's not
> immediately clear what this does.
> 
> 
> >  {
> >  hwaddr trans_len = n->page_size - (prp1 % n->page_size);
> >  trans_len = MIN(len, trans_len);
> >  int num_prps = (len >> n->page_bits) + 1;
> > +uint16_t status = NVME_SUCCESS;
> > +bool is_cmb = false;
> > +bool prp_list_in_cmb = false;
> > +
> > +trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
> > +prp1, prp2, num_prps);
> >  
> >  if (unlikely(!prp1)) {
> >  trace_nvme_dev_err_invalid_prp();
> >  return NVME_INVALID_FIELD | NVME_DNR;
> > -} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
> > -   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
> > -qsg->nsg = 0;
> > +}
> > +
> > +if (nvme_addr_is_cmb(n, prp1)) {
> > +is_cmb = true;
> > +
> >  qemu_iovec_init(iov, num_prps);
> > -qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
> > +
> > +/*
> > + * PRPs do not cross page boundaries, so if the start address 
> > (here,
> > + * prp1) is within the CMB, it cannot cross outside the controller
> > + * memory buffer range. This is ensured by
> > + *
> > + *   len = n->page_size - (addr % n->page_size)
> > + *
> > + * Thus, we can directly add to the iovec without risking an out of
> > + * bounds access. This also holds for the remaining qemu_iovec_add
> > + * calls.
> > + */
> > +qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp1), trans_len);
> >  } else {
> >  pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
> >  qemu_sglist_add(qsg, prp1, trans_len);
> >  }
> > +
> >  len -= trans_len;
> >  if (len) {
> >  if (unlikely(!prp2)) {
> >  trace_nvme_dev_err_invalid_prp2_missing();
> > +status = NVME_INVALID_FIELD | NVME_DNR;
> >  goto unmap;
> >  }
> > +
> >  if (len > n->page_size) {
> >  uint64_t prp_list[n->max_prp_ents];
> >  uint32_t nents, prp_trans;
> >  int i = 0;
> >  
> > +if (nvme_addr_is_cmb(n, prp2)) {
> > +prp_list

Re: [PATCH v5 15/26] nvme: bump supported specification to 1.3

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 12:35, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Add new fields to the Identify Controller and Identify Namespace data
> > structures according to NVM Express 1.3d.
> > 
> > NVM Express 1.3d requires the following additional features:
> >   - addition of the Namespace Identification Descriptor List (CNS 03h)
> > for the Identify command
> >   - support for returning Command Sequence Error if a Set Features
> > command is submitted for the Number of Queues feature after any I/O
> > queues have been created.
> >   - The addition of the Log Specific Field (LSP) in the Get Log Page
> > command.
> 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 57 ---
> >  hw/block/nvme.h   |  1 +
> >  hw/block/trace-events |  3 ++-
> >  include/block/nvme.h  | 20 ++-
> >  4 files changed, 71 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 900732bb2f38..4acfc85b56a2 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -9,7 +9,7 @@
> >   */
> >  
> >  /**
> > - * Reference Specification: NVM Express 1.2.1
> > + * Reference Specification: NVM Express 1.3d
> >   *
> >   *   https://nvmexpress.org/resources/specifications/
> >   */
> > @@ -43,7 +43,7 @@
> >  #include "trace.h"
> >  #include "nvme.h"
> >  
> > -#define NVME_SPEC_VER 0x00010201
> > +#define NVME_SPEC_VER 0x00010300
> >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> >  #define NVME_TEMPERATURE 0x143
> >  #define NVME_TEMPERATURE_WARNING 0x157
> > @@ -735,6 +735,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> >  uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> >  uint8_t  lid = dw10 & 0xff;
> > +uint8_t  lsp = (dw10 >> 8) & 0xf;
> >  uint8_t  rae = (dw10 >> 15) & 0x1;
> >  uint32_t numdl, numdu;
> >  uint64_t off, lpol, lpou;
> > @@ -752,7 +753,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> >  
> > -trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
> > +trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off);
> >  
> >  switch (lid) {
> >  case NVME_LOG_ERROR_INFO:
> > @@ -863,6 +864,8 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
> >  cq = g_malloc0(sizeof(*cq));
> >  nvme_init_cq(cq, n, prp1, cqid, vector, qsize + 1,
> >  NVME_CQ_FLAGS_IEN(qflags));
> Code alignment on that '('
> > +
> > +n->qs_created = true;
> Should be done also at nvme_create_sq

No, because you can't create a SQ without a matching CQ:

if (unlikely(!cqid || nvme_check_cqid(n, cqid))) {
trace_nvme_dev_err_invalid_create_sq_cqid(cqid);
return NVME_INVALID_CQID | NVME_DNR;
}


So if there is a matching cq, then qs_created = true.

> >  return NVME_SUCCESS;
> >  }
> >  
> > @@ -924,6 +927,47 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
> >  return ret;
> >  }
> >  
> > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > +{
> > +static const int len = 4096;
> The spec caps the Identify payload size to 4K,
> thus this should go to nvme.h

Done.

> > +
> > +struct ns_descr {
> > +uint8_t nidt;
> > +uint8_t nidl;
> > +uint8_t rsvd2[2];
> > +uint8_t nid[16];
> > +};
> This is also part of the spec, thus should
> move to nvme.h
> 

Done - and cleaned up.

> > +
> > +uint32_t nsid = le32_to_cpu(c->nsid);
> > +uint64_t prp1 = le64_to_cpu(c->prp1);
> > +uint64_t prp2 = le64_to_cpu(c->prp2);
> > +
> > +struct ns_descr *list;
> > +uint16_t ret;
> > +
> > +trace_nvme_dev_identify_ns_descr_list(nsid);
> > +
> > +if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > +trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > +return NVME_INVALID_NSID | NVME_DNR;
> > +}
> > +
> > +/*
> > + * Because the NGUID and EUI64 fields are 0 in the Identify Namespace 
> > data
> > + * structure, a Namespace UUID (nidt = 0x3) must be reported in the
> > + * Namespace Identification Descriptor. Add a very basic Namespace UUID
> > + * here.
> Some per namespace uuid qemu property will be very nice to have to have a 
> uuid that
> is at least somewhat unique.
> Linux kernel I think might complain if it detects namespaces with duplicate 
> uuids.

It will be "unique" per controller (because it's just the namespace id).
The spec also says that it should be fixed for the lifetime of the
namespace, but I'm not sure how to ensure that without keeping that
state on disk somehow. I have a solution for this in a later series, but
for now, I think this is ok.
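For illustration, a descriptor filled this way might look as follows. The struct layout mirrors the one quoted above; the helper name and the exact byte placement of the nsid are assumptions, not the actual patch code:

```c
#include <stdint.h>
#include <string.h>

/*
 * NIDT 0x3 = Namespace UUID, NIDL 16 (NVMe 1.3 Namespace Identification
 * Descriptor). Deriving the UUID from the namespace id keeps it stable
 * per controller but is NOT globally unique -- exactly the limitation
 * discussed above.
 */
struct ns_descr {
    uint8_t nidt;
    uint8_t nidl;
    uint8_t rsvd2[2];
    uint8_t nid[16];
};

static void fill_uuid_descr(struct ns_descr *d, uint32_t nsid)
{
    memset(d, 0, sizeof(*d));
    d->nidt = 0x3;  /* UUID */
    d->nidl = 16;
    /* place the nsid in the low bytes of the 16-byte identifier */
    memcpy(&d->nid[12], &nsid, sizeof(nsid));
}
```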

But since we actually support multiple controllers, there certainly is
an issue here. Maybe we can b

Re: [PATCH v5 21/26] nvme: add support for scatter gather lists

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 14:07, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > For now, support the Data Block, Segment and Last Segment descriptor
> > types.
> > 
> > See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Fam Zheng 
> > ---
> >  block/nvme.c  |  18 +-
> >  hw/block/nvme.c   | 375 +++---
> >  hw/block/trace-events |   4 +
> >  include/block/nvme.h  |  62 ++-
> >  4 files changed, 389 insertions(+), 70 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index d41c4bda6e39..521f521054d5 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -446,7 +446,7 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >  error_setg(errp, "Cannot map buffer for DMA");
> >  goto out;
> >  }
> > -cmd.prp1 = cpu_to_le64(iova);
> > +cmd.dptr.prp.prp1 = cpu_to_le64(iova);
> >  
> >  if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> >  error_setg(errp, "Failed to identify controller");
> > @@ -545,7 +545,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, 
> > Error **errp)
> >  }
> >  cmd = (NvmeCmd) {
> >  .opcode = NVME_ADM_CMD_CREATE_CQ,
> > -.prp1 = cpu_to_le64(q->cq.iova),
> > +.dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
> > >  .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> >  .cdw11 = cpu_to_le32(0x3),
> >  };
> > @@ -556,7 +556,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, 
> > Error **errp)
> >  }
> >  cmd = (NvmeCmd) {
> >  .opcode = NVME_ADM_CMD_CREATE_SQ,
> > -.prp1 = cpu_to_le64(q->sq.iova),
> > +.dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
> > >  .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> >  .cdw11 = cpu_to_le32(0x1 | (n << 16)),
> >  };
> > @@ -906,16 +906,16 @@ try_map:
> >  case 0:
> >  abort();
> >  case 1:
> > -cmd->prp1 = pagelist[0];
> > -cmd->prp2 = 0;
> > +cmd->dptr.prp.prp1 = pagelist[0];
> > +cmd->dptr.prp.prp2 = 0;
> >  break;
> >  case 2:
> > -cmd->prp1 = pagelist[0];
> > -cmd->prp2 = pagelist[1];
> > +cmd->dptr.prp.prp1 = pagelist[0];
> > +cmd->dptr.prp.prp2 = pagelist[1];
> >  break;
> >  default:
> > -cmd->prp1 = pagelist[0];
> > -cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
> > +cmd->dptr.prp.prp1 = pagelist[0];
> > +cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + 
> > sizeof(uint64_t));
> >  break;
> >  }
> >  trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 204ae1d33234..a91c60fdc111 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -75,8 +75,10 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr 
> > addr)
> >  
> >  static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> > -if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> > -memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
> > +hwaddr hi = addr + size;
> Are you sure you don't want to check for overflow here?
> It's a theoretical issue since addr has to be almost the full 64 bits,
> but still for those things I check this very defensively.
> 

The use of nvme_addr_read in map_prp simply cannot overflow due to how
the size is calculated, but for SGLs it's different. But the overflow is
checked in map_sgl because we have to return a special error code in
that case.

On the other hand there may be other callers of nvme_addr_read in the
future that do not check this, so I'll re-add it.
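A defensive check along these lines could look like the following standalone sketch (`hwaddr` is just typedef'd here and the helper name is illustrative):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/*
 * 'addr + size' silently wraps when addr is near UINT64_MAX, so the
 * computed upper bound must be compared back against the base address
 * instead of being trusted as-is.
 */
static bool addr_range_valid(hwaddr addr, size_t size)
{
    hwaddr hi = addr + size;
    /* hi < addr can only happen if the addition wrapped around */
    return hi >= addr;
}
```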

> > +
> > +if (n->cmbsz && nvme_addr_is_cmb(n, addr) && nvme_addr_is_cmb(n, hi)) {
> Here you fix the bug I mentioned in patch 6. I suggest you to move the fix 
> there.

Done.

> > +memcpy(buf, nvme_addr_to_cmb(n, addr), size);
> >  return 0;
> >  }
> >  
> > @@ -159,6 +161,48 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue 
> > *cq)
> >  }
> >  }
> >  
> > +static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr 
> > addr,
> > +size_t len)
> > +{
> > +if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len)) {
> > +return NVME_DATA_TRANSFER_ERROR;
> > +}
> > +
> > +qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len);
> > +
> > +return NVME_SUCCESS;
> > +}
> > +
> > +static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector 
> > *iov,
> > +hwaddr addr, size_t len)
> > +{
> > +bool addr_is_cmb = nvme_addr_is_cmb(n, addr);
> > +
> > +if (addr_is_cmb) {
> > +if (qsg->sg) {
> > +return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +}
> > +
> > +if (!iov->iov) {
> > +qemu_iovec_init(iov, 1);
> > +}
> > +
> > 

Re: [PATCH v5 12/26] nvme: add missing mandatory features

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 12:27, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Add support for returning a reasonable response to Get/Set Features of
> > mandatory features.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 57 ---
> >  hw/block/trace-events |  2 ++
> >  include/block/nvme.h  |  3 ++-
> >  3 files changed, 58 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index a186d95df020..3267ee2de47a 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1008,7 +1008,15 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  uint32_t result;
> >  
> > +trace_nvme_dev_getfeat(nvme_cid(req), dw10);
> > +
> >  switch (dw10) {
> > +case NVME_ARBITRATION:
> > +result = cpu_to_le32(n->features.arbitration);
> > +break;
> > +case NVME_POWER_MANAGEMENT:
> > +result = cpu_to_le32(n->features.power_mgmt);
> > +break;
> >  case NVME_TEMPERATURE_THRESHOLD:
> >  result = 0;
> >  
> > @@ -1029,6 +1037,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  break;
> >  }
> >  
> > +break;
> > +case NVME_ERROR_RECOVERY:
> > +result = cpu_to_le32(n->features.err_rec);
> >  break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  result = blk_enable_write_cache(n->conf.blk);
> 
> This is existing code but I'd still like to point out that the endianness conversion 
> is missing.

Fixed.

> Also we need to think if we need to do some flush if the write cache is 
> disabled.
> I don't know yet that area well enough.
> 

Looking at the block layer code it just sets a flag when disabling, but
subsequent requests will have BDRV_REQ_FUA set. So to make sure that
stuff in the cache is flushed, let's do a flush.
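The flush-before-disable semantics can be modeled standalone (toy types, not QEMU's block layer; the real calls would be blk_flush() and blk_set_enable_write_cache()):

```c
#include <stdbool.h>

/*
 * Toy model of the write-cache rule discussed above: disabling the
 * cache must first flush it, otherwise writes that were already
 * acknowledged to the host could be lost. Names are illustrative.
 */
typedef struct {
    bool cache_enabled;
    int  dirty;   /* writes sitting in the volatile cache */
    int  flushed; /* writes made durable */
} ToyBackend;

static void toy_flush(ToyBackend *b)
{
    b->flushed += b->dirty;
    b->dirty = 0;
}

static void toy_set_write_cache(ToyBackend *b, bool enable)
{
    /* mirror the fix: flush before turning the cache off */
    if (!enable && b->cache_enabled) {
        toy_flush(b);
    }
    b->cache_enabled = enable;
}
```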

> > @@ -1041,6 +1052,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  break;
> >  case NVME_TIMESTAMP:
> >  return nvme_get_feature_timestamp(n, cmd);
> > +case NVME_INTERRUPT_COALESCING:
> > +result = cpu_to_le32(n->features.int_coalescing);
> > +break;
> > +case NVME_INTERRUPT_VECTOR_CONF:
> > +if ((dw11 & 0xffff) > n->params.num_queues) {
> Looks like it should be >= since interrupt vector is not zero based.

Fixed in other patch.

> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
> > +break;
> > +case NVME_WRITE_ATOMICITY:
> > +result = cpu_to_le32(n->features.write_atomicity);
> > +break;
> >  case NVME_ASYNCHRONOUS_EVENT_CONF:
> >  result = cpu_to_le32(n->features.async_config);
> >  break;
> > @@ -1076,6 +1100,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> >  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  
> > +trace_nvme_dev_setfeat(nvme_cid(req), dw10, dw11);
> > +
> >  switch (dw10) {
> >  case NVME_TEMPERATURE_THRESHOLD:
> >  if (NVME_TEMP_TMPSEL(dw11)) {
> > @@ -1116,6 +1142,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  case NVME_ASYNCHRONOUS_EVENT_CONF:
> >  n->features.async_config = dw11;
> >  break;
> > +case NVME_ARBITRATION:
> > +case NVME_POWER_MANAGEMENT:
> > +case NVME_ERROR_RECOVERY:
> > +case NVME_INTERRUPT_COALESCING:
> > +case NVME_INTERRUPT_VECTOR_CONF:
> > +case NVME_WRITE_ATOMICITY:
> > +return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
> >  default:
> >  trace_nvme_dev_err_invalid_setfeat(dw10);
> >  return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1689,6 +1722,21 @@ static void nvme_init_state(NvmeCtrl *n)
> >  n->temperature = NVME_TEMPERATURE;
> >  n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> >  n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > +
> > +/*
> > + * There is no limit on the number of commands that the controller may
> > + * launch at one time from a particular Submission Queue.
> > + */
> > +n->features.arbitration = 0x7;
> A nice #define in nvme.h stating that 0x7 means no burst limit would be nice.
> 

Done.

> > +
> > +n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
> > +sizeof(*n->features.int_vector_config));
> > +
> > +/* disable coalescing (not supported) */
> > +for (int i = 0; i < n->params.num_queues; i++) {
> > +n->features.int_vector_config[i] = i | (1 << 16);
> Same here

Done.

> > +}
> > +
> >  n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
> >  }
> >  
> > @@ -1782,15 +1830,17 @@ static void nvme_init_ctrl(NvmeCtr

Re: [PATCH v5 14/26] nvme: make sure ncqr and nsqr is valid

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 12:30, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > 0xffff is not an allowed value for NCQR and NSQR in Set Features on
> > Number of Queues.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 4 
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 30c5b3e7a67d..900732bb2f38 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1133,6 +1133,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >  break;
> >  case NVME_NUMBER_OF_QUEUES:
> > +if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) == 0xffff) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> Very minor nitpick: since this spec requirement is not obvious, a 
> quote/reference to the spec
> would be nice to have here. 
> 

Added.

> > +
> > +trace_nvme_dev_setfeat_numq((dw11 & 0xffff) + 1,
> > ((dw11 >> 16) & 0xffff) + 1, n->params.num_queues - 1,
> >  n->params.num_queues - 1);
> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 



Re: [PATCH v5 17/26] nvme: allow multiple aios per command

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 13:48, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > This refactors how the device issues asynchronous block backend
> > requests. The NvmeRequest now holds a queue of NvmeAIOs that are
> > associated with the command. This allows multiple aios to be issued for
> > a command. Only when all requests have been completed will the device
> > post a completion queue entry.
> > 
> > Because the device is currently guaranteed to only issue a single aio
> > request per command, the benefit is not immediately obvious. But this
> > functionality is required to support metadata, the dataset management
> > command and other features.
> 
> I don't know what strategy will be chosen for supporting metadata
> (qemu doesn't have any notion of metadata in the block layer), but for 
> dataset management
> you are right. Dataset management command can contain a table of areas to 
> discard
> (although in reality I have seen no driver putting there more than one entry).
> 

The strategy is different depending on how the metadata is transferred
between host and device. For the "separate buffer" case, metadata is
transferred using a separate memory pointer in the nvme command (MPTR).
In this case the metadata is kept separately on a new blockdev attached
to the namespace.

In the other case, metadata is transferred as part of an extended lba
(say 512 + 8 bytes) and kept inline on the main namespace blockdev. This
is challenging for QEMU as it breaks interoperability of the image with
other devices. But that is a discussion for a fresh RFC ;)

Note that the support for multiple AIOs is also used for DULBE support
down the line when I get around to posting those patches. So this is
preparatory for a lot of features that require persistent state across
device power off.
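The difference between the two layouts boils down to the byte-offset computation. A sketch with assumed names (`lbads` being the log2 of the data size, `ms` the per-block metadata size; neither helper exists in the patch):

```c
#include <stdint.h>

/*
 * Separate-buffer layout: data and metadata live on different backends,
 * so the data offset is simply lba << lbads, and the image stays a
 * plain data image.
 */
static uint64_t data_off_separate(uint64_t lba, unsigned lbads)
{
    return lba << lbads;
}

/*
 * Extended-LBA layout: each logical block is stored as data + metadata
 * back to back, so every offset scales by the extended block size --
 * which is what breaks interoperability of the image with other
 * devices.
 */
static uint64_t data_off_extended(uint64_t lba, unsigned lbads, unsigned ms)
{
    return lba * ((1ull << lbads) + ms);
}
```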

> 
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 449 +-
> >  hw/block/nvme.h   | 134 +++--
> >  hw/block/trace-events |   8 +
> >  3 files changed, 480 insertions(+), 111 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 334265efb21e..e97da35c4ca1 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -19,7 +19,8 @@
> >   *  -drive file=,if=none,id=
> >   *  -device nvme,drive=,serial=,id=, \
> >   *  cmb_size_mb=, \
> > - *  num_queues=
> > + *  num_queues=, \
> > + *  mdts=
> 
> Could you split mdts checks into a separate patch? This is not related to the 
> series.

Absolutely. Done.

> 
> >   *
> >   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
> >   * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> > @@ -57,6 +58,7 @@
> >  } while (0)
> >  
> >  static void nvme_process_sq(void *opaque);
> > +static void nvme_aio_cb(void *opaque, int ret);
> >  
> >  static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
> >  {
> > @@ -341,6 +343,107 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t 
> > *ptr, uint32_t len,
> >  return status;
> >  }
> >  
> > +static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +NvmeNamespace *ns = req->ns;
> > +
> > +uint32_t len = req->nlb << nvme_ns_lbads(ns);
> > +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +
> > +return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> > +}
> 
> Same here, this is another nice refactoring and it should be in separate 
> patch.

Done.

> 
> > +
> > +static void nvme_aio_destroy(NvmeAIO *aio)
> > +{
> > +g_free(aio);
> > +}
> > +
> > +static inline void nvme_req_register_aio(NvmeRequest *req, NvmeAIO *aio,
> > +NvmeAIOOp opc)
> > +{
> > +aio->opc = opc;
> > +
> > +trace_nvme_dev_req_register_aio(nvme_cid(req), aio, blk_name(aio->blk),
> > +aio->offset, aio->len, nvme_aio_opc_str(aio), req);
> > +
> > +if (req) {
> > +QTAILQ_INSERT_TAIL(&req->aio_tailq, aio, tailq_entry);
> > +}
> > +}
> > +
> > +static void nvme_aio(NvmeAIO *aio)
> Function name not clear to me. Maybe change this to something like 
> nvme_submit_aio.

Fixed.

> > +{
> > +BlockBackend *blk = aio->blk;
> > +BlockAcctCookie *acct = &aio->acct;
> > +BlockAcctStats *stats = blk_get_stats(blk);
> > +
> > +bool is_write, dma;
> > +
> > +switch (aio->opc) {
> > +case NVME_AIO_OPC_NONE:
> > +break;
> > +
> > +case NVME_AIO_OPC_FLUSH:
> > +block_acct_start(stats, acct, 0, BLOCK_ACCT_FLUSH);
> > +aio->aiocb = blk_aio_flush(blk, nvme_aio_cb, aio);
> > +break;
> > +
> > +case NVME_AIO_OPC_WRITE_ZEROES:
> > +block_acct_start(stats, acct, aio->len, BLOCK_ACCT_WRITE);
> > +aio->aiocb = blk_aio_pwrite_zeroes(blk, aio->offset, aio->len,
> > +BDRV_REQ_MAY_UNMAP, nvme_aio_cb, aio);
> > +break;
> > +
> > 

Re: [PATCH v5 22/26] nvme: support multiple namespaces

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 14:34, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> Very reasonable way to do it. 
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/Makefile.objs |   2 +-
> >  hw/block/nvme-ns.c | 158 +++
> >  hw/block/nvme-ns.h |  60 +++
> >  hw/block/nvme.c| 235 +
> >  hw/block/nvme.h|  47 -
> >  hw/block/trace-events  |   6 +-
> >  6 files changed, 389 insertions(+), 119 deletions(-)
> >  create mode 100644 hw/block/nvme-ns.c
> >  create mode 100644 hw/block/nvme-ns.h
> > 
> > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> > index 28c2495a00dc..45f463462f1e 100644
> > --- a/hw/block/Makefile.objs
> > +++ b/hw/block/Makefile.objs
> > @@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
> >  common-obj-$(CONFIG_XEN) += xen-block.o
> >  common-obj-$(CONFIG_ECC) += ecc.o
> >  common-obj-$(CONFIG_ONENAND) += onenand.o
> > -common-obj-$(CONFIG_NVME_PCI) += nvme.o
> > +common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
> >  common-obj-$(CONFIG_SWIM) += swim.o
> >  
> >  obj-$(CONFIG_SH4) += tc58128.o
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > new file mode 100644
> > index ..0e5be44486f4
> > --- /dev/null
> > +++ b/hw/block/nvme-ns.c
> > @@ -0,0 +1,158 @@
> > +#include "qemu/osdep.h"
> > +#include "qemu/units.h"
> > +#include "qemu/cutils.h"
> > +#include "qemu/log.h"
> > +#include "hw/block/block.h"
> > +#include "hw/pci/msix.h"
> Do you need this include?

No, I needed hw/pci/pci.h instead :)

> > +#include "sysemu/sysemu.h"
> > +#include "sysemu/block-backend.h"
> > +#include "qapi/error.h"
> > +
> > +#include "hw/qdev-properties.h"
> > +#include "hw/qdev-core.h"
> > +
> > +#include "nvme.h"
> > +#include "nvme-ns.h"
> > +
> > +static int nvme_ns_init(NvmeNamespace *ns)
> > +{
> > +NvmeIdNs *id_ns = &ns->id_ns;
> > +
> > +id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->nuse = id_ns->ncap = id_ns->nsze =
> > +cpu_to_le64(nvme_ns_nlbas(ns));
> Nitpick: To be honest I don't really like that chain assignment, 
> especially since it forces to wrap the line, but that is just my
> personal taste.

Fixed, and also added a comment as to why they are the same.

> > +
> > +return 0;
> > +}
> > +
> > +static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, NvmeIdCtrl *id,
> > +Error **errp)
> > +{
> > +uint64_t perm, shared_perm;
> > +
> > +Error *local_err = NULL;
> > +int ret;
> > +
> > +perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
> > +shared_perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
> > +BLK_PERM_GRAPH_MOD;
> > +
> > +ret = blk_set_perm(ns->blk, perm, shared_perm, &local_err);
> > +if (ret) {
> > +error_propagate_prepend(errp, local_err, "blk_set_perm: ");
> > +return ret;
> > +}
> 
> You should consider using blkconf_apply_backend_options.
> Take a look at for example virtio_blk_device_realize.
> That will give you support for read only block devices as well.

So, yeah. There is a reason for this. And I will add that as a comment,
but I will write it here for posterity.

The problem is when the nvme-ns device starts getting more than just a
single drive attached (I have patches ready that will add a "metadata"
and a "state" drive). The blkconf_ functions work on a BlockConf that
embeds a BlockBackend, so you can't have one BlockConf with multiple
BlockBackend's. That is why I'm kinda copying the "good parts" of
the blkconf_apply_backend_options code here.

> 
> I personally only once grazed the area of block permissions,
> so I prefer someone from the block layer to review this as well.
> 
> > +
> > +ns->size = blk_getlength(ns->blk);
> > +if (ns->size < 0) {
> > +error_setg_errno(errp, -ns->size, "blk_getlength");
> > +return 1;
> > +}
> > +
> > +switch (n->conf.wce) {
> > +case ON_OFF_AUTO_ON:
> > +n->features.volatile_wc = 1;
> > +break;
> > +case ON_OFF_AUTO_OFF:
> > +n->features.volatile_wc = 0;
> > +case ON

Re: [PATCH v5 10/26] nvme: add support for the get log page command

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 11:35, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Add support for the Get Log Page command and basic implementations of
> > the mandatory Error Information, SMART / Health Information and Firmware
> > Slot Information log pages.
> > 
> > In violation of the specification, the SMART / Health Information log
> > page does not persist information over the lifetime of the controller
> > because the device has no place to store such persistent state.
> Yea, not the end of the world.
> > 
> > Note that the LPA field in the Identify Controller data structure
> > intentionally has bit 0 cleared because there is no namespace specific
> > information in the SMART / Health information log page.
> Makes sense.
> > 
> > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > Section 5.10 ("Get Log Page command").
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 122 +-
> >  hw/block/nvme.h   |  10 
> >  hw/block/trace-events |   2 +
> >  include/block/nvme.h  |   2 +-
> >  4 files changed, 134 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f72348344832..468c36918042 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -569,6 +569,123 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd 
> > *cmd)
> >  return NVME_SUCCESS;
> >  }
> >  
> > +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > buf_len,
> > +uint64_t off, NvmeRequest *req)
> > +{
> > +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +uint32_t nsid = le32_to_cpu(cmd->nsid);
> > +
> > +uint32_t trans_len;
> > +time_t current_ms;
> > +uint64_t units_read = 0, units_written = 0, read_commands = 0,
> > +write_commands = 0;
> > +NvmeSmartLog smart;
> > +BlockAcctStats *s;
> > +
> > +if (nsid && nsid != 0xffffffff) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +s = blk_get_stats(n->conf.blk);
> > +
> > +units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > +units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > +read_commands = s->nr_ops[BLOCK_ACCT_READ];
> > +write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> > +
> > +if (off > sizeof(smart)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +trans_len = MIN(sizeof(smart) - off, buf_len);
> > +
> > +memset(&smart, 0x0, sizeof(smart));
> > +
> > +smart.data_units_read[0] = cpu_to_le64(units_read / 1000);
> > +smart.data_units_written[0] = cpu_to_le64(units_written / 1000);
> > +smart.host_read_commands[0] = cpu_to_le64(read_commands);
> > +smart.host_write_commands[0] = cpu_to_le64(write_commands);
> > +
> > +smart.temperature[0] = n->temperature & 0xff;
> > +smart.temperature[1] = (n->temperature >> 8) & 0xff;
> > +
> > +if ((n->temperature > n->features.temp_thresh_hi) ||
> > +(n->temperature < n->features.temp_thresh_low)) {
> > +smart.critical_warning |= NVME_SMART_TEMPERATURE;
> > +}
> > +
> > +current_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > +smart.power_on_hours[0] = cpu_to_le64(
> > +(((current_ms - n->starttime_ms) / 1000) / 60) / 60);
> > +
> > +return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > +prp2);
> > +}
> Looks OK.
> > +
> > +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > buf_len,
> > +uint64_t off, NvmeRequest *req)
> > +{
> > +uint32_t trans_len;
> > +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +NvmeFwSlotInfoLog fw_log;
> > +
> > +if (off > sizeof(fw_log)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +memset(&fw_log, 0, sizeof(NvmeFwSlotInfoLog));
> > +
> > +trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > +
> > +return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > +prp2);
> > +}
> Looks OK
> > +
> > +static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > +uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > +uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> > +uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> > +uint8_t  lid = dw10 & 0xff;
> > +uint8_t  rae = (dw10 >> 15) & 0x1;
> > +uint32_t numdl, numdu;
> > +uint64_t off, lpol, lpou;
> > +size_t   len;
> > +
> > +numdl = (dw10 >> 16);
> > +numdu = (dw11 & 0xffff);
> > +lpol = dw12;
> > +lpou = dw13;
> > +
> > +len = (((numdu << 16) | numdl) + 1) << 2;
> > +off = (lpou << 32ULL) | lpol;
> > +
> > +if (off & 0x3) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> 
> Good. 
> Note that there are plenty of other place
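The length/offset arithmetic quoted above can be recomputed standalone (helper names are assumptions; NUMD is the 0-based dword count split across CDW10/CDW11, LPOL/LPOU form a 64-bit byte offset per the Get Log Page command in NVMe 1.3):

```c
#include <stddef.h>
#include <stdint.h>

/* NUMDL sits in CDW10[31:16], NUMDU in CDW11[15:0]; the combined value
 * is 0-based, so +1, then *4 to convert dwords to bytes. */
static size_t get_log_len(uint32_t dw10, uint32_t dw11)
{
    uint32_t numdl = dw10 >> 16;
    uint32_t numdu = dw11 & 0xffff;
    return (size_t)(((numdu << 16) | numdl) + 1) << 2;
}

/* LPOL is CDW12, LPOU is CDW13; together a 64-bit byte offset. */
static uint64_t get_log_off(uint32_t dw12, uint32_t dw13)
{
    return ((uint64_t)dw13 << 32) | dw12;
}
```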

Re: [PATCH v5 09/26] nvme: add temperature threshold feature

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 11:31, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > It might seem weird to implement this feature for an emulated device,
> > but it is mandatory to support and the feature is useful for testing
> > asynchronous event request support, which will be added in a later
> > patch.
> 
> Absolutely but as the old saying is, rules are rules.
> At least, to the defense of the spec, making this mandatory
> forced the vendors to actually report some statistics about
> the device in neutral format as opposed to yet another
> vendor proprietary thing (I am talking about SMART log page).
> 
> > 
> > Signed-off-by: Klaus Jensen 
> 
> I noticed that you sign off some patches with your @samsung.com email,
> and some with @cnexlabs.com
> Is there a reason for that?

Yeah. Some of this code was made while I was at CNEX Labs. I've since
moved to Samsung. But credit where credit's due.

> 
> 
> > ---
> >  hw/block/nvme.c  | 50 
> >  hw/block/nvme.h  |  2 ++
> >  include/block/nvme.h |  7 ++-
> >  3 files changed, 58 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 81514eaef63a..f72348344832 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -45,6 +45,9 @@
> >  
> >  #define NVME_SPEC_VER 0x00010201
> >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> > +#define NVME_TEMPERATURE 0x143
> > +#define NVME_TEMPERATURE_WARNING 0x157
> > +#define NVME_TEMPERATURE_CRITICAL 0x175
> >  
> >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> >  do { \
> > @@ -798,9 +801,31 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl 
> > *n, NvmeCmd *cmd)
> >  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest 
> > *req)
> >  {
> >  uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > +uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  uint32_t result;
> >  
> >  switch (dw10) {
> > +case NVME_TEMPERATURE_THRESHOLD:
> > +result = 0;
> > +
> > +/*
> > + * The controller only implements the Composite Temperature 
> > sensor, so
> > + * return 0 for all other sensors.
> > + */
> > +if (NVME_TEMP_TMPSEL(dw11)) {
> > +break;
> > +}
> > +
> > +switch (NVME_TEMP_THSEL(dw11)) {
> > +case 0x0:
> > +result = cpu_to_le16(n->features.temp_thresh_hi);
> > +break;
> > +case 0x1:
> > +result = cpu_to_le16(n->features.temp_thresh_low);
> > +break;
> > +}
> > +
> > +break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  result = blk_enable_write_cache(n->conf.blk);
> >  trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
> > @@ -845,6 +870,23 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  
> >  switch (dw10) {
> > +case NVME_TEMPERATURE_THRESHOLD:
> > +if (NVME_TEMP_TMPSEL(dw11)) {
> > +break;
> > +}
> > +
> > +switch (NVME_TEMP_THSEL(dw11)) {
> > +case 0x0:
> > +n->features.temp_thresh_hi = NVME_TEMP_TMPTH(dw11);
> > +break;
> > +case 0x1:
> > +n->features.temp_thresh_low = NVME_TEMP_TMPTH(dw11);
> > +break;
> > +default:
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >  break;
> > @@ -1366,6 +1408,9 @@ static void nvme_init_state(NvmeCtrl *n)
> >  n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >  n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> >  n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > +
> > +n->temperature = NVME_TEMPERATURE;
> 
> This appears not to be used in the patch.
> I think you should move that to the next patch that
> adds the get log page support.
> 

Fixed.

> > +n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> >  }
> >  
> >  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > @@ -1447,6 +1492,11 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  id->acl = 3;
> >  id->frmw = 7 << 1;
> >  id->lpa = 1 << 0;
> > +
> > +/* recommended default value (~70 C) */
> > +id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> > +id->cctemp = cpu_to_le16(NVME_TEMPERATURE_CRITICAL);
> > +
> >  id->sqes = (0x6 << 4) | 0x6;
> >  id->cqes = (0x4 << 4) | 0x4;
> >  id->nn = cpu_to_le32(n->num_namespaces);
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index a867bdfabafd..1518f32557a3 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -108,6 +108,7 @@ typedef struct NvmeCtrl {
> >  uint64_tirq_status;
> >  uint64_thost_timestamp; /* Timestamp sent by the 
> > host */
> >  ui

Re: [PATCH v5 08/26] nvme: refactor device realization

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 11:27, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > This patch splits up nvme_realize into multiple individual functions,
> > each initializing a different subset of the device.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 175 +++-
> >  hw/block/nvme.h |  21 ++
> >  2 files changed, 133 insertions(+), 63 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index e1810260d40b..81514eaef63a 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -44,6 +44,7 @@
> >  #include "nvme.h"
> >  
> >  #define NVME_SPEC_VER 0x00010201
> > +#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> >  
> >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> >  do { \
> > @@ -1325,67 +1326,106 @@ static const MemoryRegionOps nvme_cmb_ops = {
> >  },
> >  };
> >  
> > -static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > +static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
> >  {
> > -NvmeCtrl *n = NVME(pci_dev);
> > -NvmeIdCtrl *id = &n->id_ctrl;
> > -
> > -int i;
> > -int64_t bs_size;
> > -uint8_t *pci_conf;
> > -
> > -if (!n->params.num_queues) {
> > -error_setg(errp, "num_queues can't be zero");
> > -return;
> > -}
> > +NvmeParams *params = &n->params;
> >  
> >  if (!n->conf.blk) {
> > -error_setg(errp, "drive property not set");
> > -return;
> > +error_setg(errp, "nvme: block backend not configured");
> > +return 1;
> As a matter of taste, negative values indicate error, and 0 is the success 
> value.
> In Linux kernel this is even an official rule.
> >  }

Fixed.

> >  
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg(errp, "could not get backing file size");
> > -return;
> > +if (!params->serial) {
> > +error_setg(errp, "nvme: serial not configured");
> > +return 1;
> >  }
> >  
> > -if (!n->params.serial) {
> > -error_setg(errp, "serial property not set");
> > -return;
> > +if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
> > +error_setg(errp, "nvme: invalid queue configuration");
> Maybe something like "nvme: invalid queue count specified, should be between 
> 1 and ..."?
> > +return 1;
> >  }

Fixed.

> > +
> > +return 0;
> > +}
> > +
> > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > +{
> >  blkconf_blocksizes(&n->conf);
> >  if (!blkconf_apply_backend_options(&n->conf, 
> > blk_is_read_only(n->conf.blk),
> > -   false, errp)) {
> > -return;
> > +false, errp)) {
> > +return 1;
> >  }
> >  
> > -pci_conf = pci_dev->config;
> > -pci_conf[PCI_INTERRUPT_PIN] = 1;
> > -pci_config_set_prog_interface(pci_dev->config, 0x2);
> > -pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> > -pcie_endpoint_cap_init(pci_dev, 0x80);
> > +return 0;
> > +}
> >  
> > +static void nvme_init_state(NvmeCtrl *n)
> > +{
> >  n->num_namespaces = 1;
> >  n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
> 
> Isn't that wrong?
> First 4K of mmio (0x1000) is the registers, and that is followed by the 
> doorbells,
> and each doorbell takes 8 bytes (assuming regular doorbell stride).
> so n->params.num_queues + 1 should be total number of queues, thus the 0x1004 
> should be 0x1000 IMHO.
> I might miss some rounding magic here though.
> 

Yeah. I think you are right. It all becomes slightly more fishy due to
the num_queues device parameter being 1's based and accounting for the
admin queue pair.

But in get/set features, the value has to be 0's based and only account
for the I/O queues, so we need to subtract 2 from the value. It's
confusing all around.

Since the admin queue pair isn't really optional I think it would be
better if we introduce a new max_ioqpairs parameter that is 1's based,
counts the number of pairs and obviously only accounts for the I/O
queues.

I guess we need to keep the num_queues parameter around for
compatibility.

The doorbells are only 4 bytes btw, but the calculation still looks
wrong. With a max_ioqpairs parameter in place, the reg_size should be

pow2ceil(0x1008 + 2 * (n->params.max_ioqpairs) * 4)

Right? That's 0x1000 for the core registers, 8 bytes for the sq/cq
doorbells for the admin queue pair, and then room for the i/o queue
pairs.

I added a patch for this in v6.

> > -n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> > -
> >  n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >  n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> >  n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > +}
> >  
> > -memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
> > -  "nvme", n->reg_size);
> > +static void nvme_in

Re: [PATCH RESEND v2] block/nvme: introduce PMR support from NVMe 1.4 spec

2020-03-11 Thread Klaus Birkelund Jensen
On Mar 11 15:54, Andrzej Jakowski wrote:
> On 3/11/20 2:20 AM, Stefan Hajnoczi wrote:
> > Please try:
> > 
> >   $ git grep pmem
> > 
> > backends/hostmem-file.c is the backend that can be used and the
> > pmem_persist() API can be used to flush writes.
> 
> I've reworked this patch into hostmem-file type of backend.
> From simple tests in virtual machine: writing to PMR region
> and then reading from it after VM power cycle I have observed that
> there is no persistency.
> 
> I guess that persistent behavior can be achieved if memory backend file
> resides on actual persistent memory in VMM. I haven't found mechanism to
> persist memory backend file when it resides in the file system on block
> storage. My original mmap + msync based solution worked well there.
> I believe that main problem with mmap was with "ifdef _WIN32" that made it 
> platform specific and w/o it patchew CI complained. 
> Is there a way that I could rework mmap + msync solution so it would fit
> into qemu design?
> 

Hi Andrzej,

Thanks for working on this!

FWIW, I have implemented other stuff for the NVMe device that requires
persistent storage (e.g. LBA allocation tracking for DULBE support). I
used the approach of adding an additional blockdev and simply using the
qemu block layer. This would also make it work on WIN32. And if we just
set bit 0 in PMRWBM and disable the write cache on the blockdev we
should be good on the durability requirements.

Unfortunately, I do not see (or know, maybe Stefan has an idea?) an easy
way of using the MemoryRegionOps nicely with async block backend I/O, so
we either have to use blocking I/O or fire-and-forget aio. Or, we can
maybe keep bit 1 set in PMRWBM and force a blocking blk_flush on PMRSTS
read.

Finally, a thing to consider is that this is adding an optional NVMe 1.4
feature to an already frankenstein device that doesn't even implement
mandatory v1.2. I think that bumping the NVMe version to 1.4 is out of
the question until we actually implement it fully wrt. mandatory
features. My patchset brings the device up to v1.3 and I have v1.4 ready
for posting, so I think we can get there.


Klaus



Re: [PATCH v5 01/26] nvme: rename trace events to nvme_dev

2020-02-12 Thread Klaus Birkelund Jensen
On Feb 12 11:08, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Change the prefix of all nvme device related trace events to 'nvme_dev'
> > to not clash with trace events from the nvme block driver.
> > 

Hi Maxim,

Thank you very much for your thorough reviews! Utterly appreciated!

I'll start going through your suggested changes. There is a bit of work
to do on splitting patches into refactoring and bugfixes, but I can
definitely see the reason for this, so I'll get to work.

You mention the alignment with split lines a lot. I actually thought I
was following CODING_STYLE.rst (which allows a single 4 space indent for
functions, but not statements such as if/else and while/for). But since
hw/block/nvme.c is originally written in the style of aligning with the
opening parenthesis I'm in the wrong here, so I will of course amend
it. Should have done that from the beginning, it's just my personal
taste shining through ;)


Thanks again,
Klaus



Re: [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:47, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:51:42AM +0100, Klaus Jensen wrote:
> > Hi,
> > 
> > 
> > Changes since v4
> >  - Changed vendor and device id to use a Red Hat allocated one. For
> >backwards compatibility add the 'x-use-intel-id' nvme device
> >parameter. This is off by default but is added as a machine compat
> >property to be true for machine types <= 4.2.
> > 
> >  - SGL mapping code has been refactored.
> 
> Looking pretty good to me. For the series beyond the individually
> reviewed patches:
> 
> Acked-by: Keith Busch 
> 
> If you need to send a v5, you may add my tag to the patches that are not
> substaintially modified if you like.

I'll send a v6 with the changes to "nvme: make lba data size
configurable". It won't be substantially changed, I will just only
accept 9 and 12 as valid values for lbads.

Thanks for the Ack's and Reviews Keith!


Klaus


Re: [PATCH v5 24/26] nvme: change controller pci id

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:35, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:06AM +0100, Klaus Jensen wrote:
> > There are two reasons for changing this:
> > 
> >   1. The nvme device currently uses an internal Intel device id.
> > 
> >   2. Since commits "nvme: fix write zeroes offset and count" and "nvme:
> >  support multiple namespaces" the controller device no longer has
> >  the quirks that the Linux kernel think it has.
> > 
> >  As the quirks are applied based on pci vendor and device id, change
> >  them to get rid of the quirks.
> > 
> > To keep backward compatibility, add a new 'x-use-intel-id' parameter to
> > the nvme device to force use of the Intel vendor and device id. This is
> > off by default but add a compat property to set this for machines 4.2
> > and older.
> > 
> > Signed-off-by: Klaus Jensen 
> 
> Yay, thank you for following through on getting this identifier assigned.
> 
> Reviewed-by: Keith Busch 

This is technically not "officially" sanctioned yet, but I got an
indication from Gerd that we are good to proceed with this.



Re: [PATCH v5 22/26] nvme: support multiple namespaces

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:31, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:04AM +0100, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> 
> I like this feature a lot, thanks for doing it.
> 
> Reviewed-by: Keith Busch 
> 
> > @@ -1256,18 +1272,24 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, 
> > NvmeCmd *cmd, uint8_t rae,
> >  uint64_t units_read = 0, units_written = 0, read_commands = 0,
> >  write_commands = 0;
> >  NvmeSmartLog smart;
> > -BlockAcctStats *s;
> >  
> >  if (nsid && nsid != 0x) {
> >  return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> 
> This is totally optional, but worth mentioning: this patch makes it
> possible to remove this check and allow per-namespace smart logs. The
> ID_CTRL.LPA would need to updated to reflect that if you wanted to
> go that route.

Yeah, I thought about that, but with NVMe v1.4 support arriving in a
later series, there is no longer any namespace-specific stuff in the
log page anyway.

The spec isn't really clear on what the preferred behavior for a 1.4
compliant device is. Either

  1. LPA bit 0 set and just return the same page for each namespace or,
  2. LPA bit 0 unset and fail when NSID is set



Re: [PATCH v5 26/26] nvme: make lba data size configurable

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:43, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:08AM +0100, Klaus Jensen wrote:
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme-ns.c | 2 +-
> >  hw/block/nvme-ns.h | 4 +++-
> >  hw/block/nvme.c| 1 +
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 0e5be44486f4..981d7101b8f2 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -18,7 +18,7 @@ static int nvme_ns_init(NvmeNamespace *ns)
> >  {
> >  NvmeIdNs *id_ns = &ns->id_ns;
> >  
> > -id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->lbaf[0].ds = ns->params.lbads;
> >  id_ns->nuse = id_ns->ncap = id_ns->nsze =
> >  cpu_to_le64(nvme_ns_nlbas(ns));
> >  
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index b564bac25f6d..f1fe4db78b41 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -7,10 +7,12 @@
> >  
> >  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> >  DEFINE_PROP_DRIVE("drive", _state, blk), \
> > -DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > +DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> > +DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)
> 
> I think we need to validate the parameter is between 9 and 12 before
> trusting it can be used safely.
> 
> Alternatively, add supported formats to the lbaf array and let the host
> decide on a live system with the 'format' command.

The device does not yet support Format NVM, but we have a patch ready
for that to be submitted with a new series when this is merged.

For now, while it does not support Format, I will change this patch such
that it defaults to 9 (BDRV_SECTOR_BITS) and only accepts 12 as an
alternative (while always keeping the number of formats available to 1).


Re: [PATCH v4 20/24] nvme: add support for scatter gather lists

2020-01-13 Thread Klaus Birkelund Jensen
On Jan  9 11:44, Beata Michalska wrote:
> Hi Klaus,
> 
> On Thu, 19 Dec 2019 at 13:09, Klaus Jensen  wrote:
> > @@ -73,7 +73,12 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr 
> > addr)
> >
> >  static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> > -if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> > +hwaddr hi = addr + size;
> > +if (hi < addr) {
> 
> What is the actual use case for that ?

This was for detecting wrap around in the unsigned addition. I found
that nvme_map_sgl does not check if addr + size is out of bounds (which
it should). With that in place this check is belt and braces, so I might
remove it.

> > +static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
> > +QEMUIOVector *iov, NvmeSglDescriptor *segment, uint64_t nsgld,
> > +uint32_t *len, bool is_cmb, NvmeRequest *req)
> > +{
> > +dma_addr_t addr, trans_len;
> > +uint16_t status;
> > +
> > +for (int i = 0; i < nsgld; i++) {
> > +if (NVME_SGL_TYPE(segment[i].type) != SGL_DESCR_TYPE_DATA_BLOCK) {
> > +trace_nvme_dev_err_invalid_sgl_descriptor(nvme_cid(req),
> > +NVME_SGL_TYPE(segment[i].type));
> > +return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
> > +}
> > +
> > +if (*len == 0) {
> > +if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
> > +
> > trace_nvme_dev_err_invalid_sgl_excess_length(nvme_cid(req));
> > +return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > +}
> > +
> > +break;
> > +}
> > +
> > +addr = le64_to_cpu(segment[i].addr);
> > +trans_len = MIN(*len, le64_to_cpu(segment[i].len));
> > +
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +/*
> > + * All data and metadata, if any, associated with a particular
> > + * command shall be located in either the CMB or host memory. 
> > Thus,
> > + * if an address if found to be in the CMB and we have already
> 
> s/address if/address is ?

Fixed, thanks.

> > +static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector 
> > *iov,
> > +NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
> > +{
> > +const int MAX_NSGLD = 256;
> > +
> > +NvmeSglDescriptor segment[MAX_NSGLD];
> > +uint64_t nsgld;
> > +uint16_t status;
> > +bool is_cmb = false;
> > +bool sgl_in_cmb = false;
> > +hwaddr addr = le64_to_cpu(sgl.addr);
> > +
> > +trace_nvme_dev_map_sgl(nvme_cid(req), NVME_SGL_TYPE(sgl.type), 
> > req->nlb, len);
> > +
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +is_cmb = true;
> > +
> > +qemu_iovec_init(iov, 1);
> > +} else {
> > +pci_dma_sglist_init(qsg, &n->parent_obj, 1);
> > +}
> > +
> > +/*
> > + * If the entire transfer can be described with a single data block it 
> > can
> > + * be mapped directly.
> > + */
> > +if (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_DATA_BLOCK) {
> > +status = nvme_map_sgl_data(n, qsg, iov, &sgl, 1, &len, is_cmb, 
> > req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +goto out;
> > +}
> > +
> > +/*
> > + * If the segment is located in the CMB, the submission queue of the
> > + * request must also reside there.
> > + */
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
> > +return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +}
> > +
> > +sgl_in_cmb = true;
> 
> Why not combining this with the condition few lines above
> for the nvme_addr_is_cmb ? Also is the sgl_in_cmb really needed ?
> If the address is from CMB, that  implies the queue is also there,
> otherwise we wouldn't progress beyond this point. Isn't is_cmb sufficient ?
> 

You are right, there is no need for sgl_in_cmb.

But checking if the queue is in the cmb only needs to be done if the
descriptor in DPTR is *not* a "singleton" data block. But I think I can
refactor it to be slightly nicer, or at least be more specific in the
comments.

> > +}
> > +
> > +while (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_SEGMENT) {
> > +bool addr_is_cmb;
> > +
> > +nsgld = le64_to_cpu(sgl.len) / sizeof(NvmeSglDescriptor);
> > +
> > +/* read the segment in chunks of 256 descriptors (4k) */
> > +while (nsgld > MAX_NSGLD) {
> > +if (nvme_addr_read(n, addr, segment, sizeof(segment))) {
> > +trace_nvme_dev_err_addr_read(addr);
> > +status = NVME_DATA_TRANSFER_ERROR;
> > +goto unmap;
> > +}
> > +
> > +status = nvme_map_sgl_data(n, qsg, iov, segment, MAX_NSGLD, 
> > &len,
> > +is_cmb, req);
> 
> This will probably fail if there is a BitBucket Descriptor on the way (?)
> 

nvme_map_sgl_data will error out on any descriptors different from
"DATA_BLOCK". So I

Re: [PATCH v4 17/24] nvme: allow multiple aios per command

2020-01-13 Thread Klaus Birkelund Jensen
On Jan  9 11:40, Beata Michalska wrote:
> Hi Klaus,
> 
> On Thu, 19 Dec 2019 at 13:09, Klaus Jensen  wrote:
> > +static NvmeAIO *nvme_aio_new(BlockBackend *blk, int64_t offset, size_t len,
> > +QEMUSGList *qsg, QEMUIOVector *iov, NvmeRequest *req,
> > +NvmeAIOCompletionFunc *cb)
> 
> Minor: The indentation here (and in a few other places across the patchset)
> does not seem right . And maybe inline ?

I tried to follow the style in CODING_STYLE.rst for "Multiline Indent",
but the style for function definitions is a bit underspecified.

I can change it to align with the opening parenthesis. I just found the
"one indent" more readable for these long function definitions.

> Also : seems that there are cases when some of the parameters are
> not required (NULL) , maybe having a simplified version for those cases
> might be useful ?
> 

True. Actually - at this point in the series there are no users of the
NvmeAIOCompletionFunc. It is preparatory for other patches I have in the
pipeline. But I'll clean it up.

> > +static void nvme_aio_cb(void *opaque, int ret)
> > +{
> > +NvmeAIO *aio = opaque;
> > +NvmeRequest *req = aio->req;
> > +
> > +BlockBackend *blk = aio->blk;
> > +BlockAcctCookie *acct = &aio->acct;
> > +BlockAcctStats *stats = blk_get_stats(blk);
> > +
> > +Error *local_err = NULL;
> > +
> > +trace_nvme_dev_aio_cb(nvme_cid(req), aio, blk_name(blk), aio->offset,
> > +nvme_aio_opc_str(aio), req);
> > +
> > +if (req) {
> > +QTAILQ_REMOVE(&req->aio_tailq, aio, tailq_entry);
> > +}
> > +
> >  if (!ret) {
> > -block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
> > -req->status = NVME_SUCCESS;
> > +block_acct_done(stats, acct);
> > +
> > +if (aio->cb) {
> > +aio->cb(aio, aio->cb_arg);
> 
> We are dropping setting status to SUCCESS here,
> is that expected ?

Yes, that is on purpose. nvme_aio_cb is called for *each* issued AIO and
we do not want to overwrite a previously set error status with a success
(if one aio in the request fails even though others succeed, it should
not go unnoticed). Note that NVME_SUCCESS is the default setting in the
request, so if no one sets an error code we are still good.

> Also the aio callback will not get
> called case failure and it probably should ?
> 

I tried both but ended up with just not calling it on failure, but I
think that in the future some AIO callbacks might want to take a
different action if the request failed, so I'll add it back in and add
the aio return value (ret) to the callback function definition.


Thanks,
Klaus


Re: [PATCH v4 19/24] nvme: handle dma errors

2020-01-13 Thread Klaus Birkelund Jensen
On Jan  9 11:35, Beata Michalska wrote:
> Hi Klaus,
> 

Hi Beata,

Your reviews are, as always, much appreciated! Thanks!

> On Thu, 19 Dec 2019 at 13:09, Klaus Jensen  wrote:
> > @@ -1595,7 +1611,12 @@ static void nvme_process_sq(void *opaque)
> >
> >  while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
> >  addr = sq->dma_addr + sq->head * n->sqe_size;
> > -nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd));
> > +if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
> > +trace_nvme_dev_err_addr_read(addr);
> > +timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> > +100 * SCALE_MS);
> > +break;
> > +}
> 
> Is there a chance we will end up repeatedly triggering the read error here
> as this will come back to the same memory location each time (the sq->head
> is not moving here) ?
> 

Absolutely, and that was the point. Not being able to read the
submission queue is pretty bad, so the device just keeps retrying every
100 ms. This is the same for when writing to the completion queue fails.

But... It would probably be prudent to track how long it has been since
a successful DMA transfer was done and time out, shutting down the
device. Say maybe after 60 seconds. I'll try to add something like that.


Thanks,
Klaus


Re: [PATCH v4 22/24] nvme: bump controller pci device id

2019-12-19 Thread Klaus Birkelund Jensen
On Dec 20 02:46, Keith Busch wrote:
> On Thu, Dec 19, 2019 at 06:24:57PM +0100, Klaus Birkelund Jensen wrote:
> > On Dec 20 01:16, Keith Busch wrote:
> > > On Thu, Dec 19, 2019 at 02:09:19PM +0100, Klaus Jensen wrote:
> > > > @@ -2480,7 +2480,7 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice 
> > > > *pci_dev)
> > > >  pci_conf[PCI_INTERRUPT_PIN] = 1;
> > > >  pci_config_set_prog_interface(pci_conf, 0x2);
> > > >  pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> > > > -pci_config_set_device_id(pci_conf, 0x5845);
> > > > +pci_config_set_device_id(pci_conf, 0x5846);
> > > >  pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
> > > >  pcie_endpoint_cap_init(pci_dev, 0x80);
> > > 
> > > We can't just pick a number here, these are supposed to be assigned by the
> > > vendor. A day will come when I will be in trouble for using the existing
> > > identifier: I found out to late it was supposed to be for internal use
> > > only as it was never officially reserved, so lets not make the same
> > > mistake for some future device.
> > > 
> > 
> > Makes sense. And there is no "QEMU" vendor, is there?
> 
> I'm not sure if we can use this, but there is a PCI_VENDOR_ID_QEMU,
> 0x1234, defined in include/hw/pci/pci.h.
> 

Maybe it's possible to use PCI_VENDOR_ID_REDHAT?



Re: [PATCH v4 21/24] nvme: support multiple namespaces

2019-12-19 Thread Klaus Birkelund Jensen
On Dec 19 16:11, Michal Prívozník wrote:
> On 12/19/19 2:09 PM, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> 
> Klaus, just to make sure I understand correctly, this implements
> multiple namespaces for *emulated* NVMe, right? I'm asking because I
> just merged libvirt patches to support:
> 
> -drive
> file.driver=nvme,file.device=:01:00.0,file.namespace=1,format=raw,if=none,id=drive-virtio-disk0
> -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
> 
> and seeing these patches made me doubt my design. But if your patches
> touch emulated NVMe only, then libvirt's fine because it doesn't expose
> that just yet.
> 
> Michal
> 

Hi Michal,

Yes, this is only for the emulated nvme controller.



Re: [PATCH v4 22/24] nvme: bump controller pci device id

2019-12-19 Thread Klaus Birkelund Jensen
On Dec 20 01:16, Keith Busch wrote:
> On Thu, Dec 19, 2019 at 02:09:19PM +0100, Klaus Jensen wrote:
> > @@ -2480,7 +2480,7 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice 
> > *pci_dev)
> >  pci_conf[PCI_INTERRUPT_PIN] = 1;
> >  pci_config_set_prog_interface(pci_conf, 0x2);
> >  pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> > -pci_config_set_device_id(pci_conf, 0x5845);
> > +pci_config_set_device_id(pci_conf, 0x5846);
> >  pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
> >  pcie_endpoint_cap_init(pci_dev, 0x80);
> 
> We can't just pick a number here, these are supposed to be assigned by the
> vendor. A day will come when I will be in trouble for using the existing
> identifier: I found out to late it was supposed to be for internal use
> only as it was never officially reserved, so lets not make the same
> mistake for some future device.
> 

Makes sense. And there is no "QEMU" vendor, is there? But it would be
really nice to get rid of the quirks.



[Qemu-block] [PATCH 12/16] nvme: bump supported NVMe revision to 1.3d

2019-07-05 Thread Klaus Birkelund Jensen
Add the new Namespace Identification Descriptor List (CNS 03h) and track
creation of queues to enable the controller to return Command Sequence
Error if Set Features is called for Number of Queues after any queues
have been created.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 84 ---
 hw/block/nvme.h   |  1 +
 hw/block/trace-events |  4 ++-
 include/block/nvme.h  | 30 +---
 4 files changed, 102 insertions(+), 17 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 8259dd7c1d6c..8ad95fdfa261 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,20 +9,22 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.3d, 1.2, 1.1, 1.0e
  *
  *  http://www.nvmexpress.org/resources/
  */
 
 /**
  * Usage: add options:
- *  -drive file=,if=none,id=
- *  -device nvme,drive=,serial=,id=, \
- *  cmb_size_mb=, \
- *  num_queues=
+ * -drive file=,if=none,id=
+ * -device nvme,drive=,serial=,id=
  *
- * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
- * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
+ * Advanced optional options:
+ *
+ *   num_queues=  : Maximum number of IO Queues.
+ *  Default: 64
+ *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
+ *  Default: 0 (disabled)
  */
 
 #include "qemu/osdep.h"
@@ -43,6 +45,7 @@
 #define NVME_ELPE 3
 #define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
+
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
 (trace_##trace)(__VA_ARGS__); \
@@ -316,6 +319,8 @@ static void nvme_post_cqes(void *opaque)
 static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 {
 assert(cq->cqid == req->sq->cqid);
+
+trace_nvme_enqueue_req_completion(req->cqe.cid, cq->cqid);
 QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
 QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
@@ -534,6 +539,7 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
 if (sq->sqid) {
 g_free(sq);
 }
+n->qs_created--;
 }
 
 static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -600,6 +606,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, 
uint64_t dma_addr,
 cq = n->cq[cqid];
 QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
 n->sq[sqid] = sq;
+n->qs_created++;
 }
 
 static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -649,6 +656,7 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
 if (cq->cqid) {
 g_free(cq);
 }
+n->qs_created--;
 }
 
 static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -689,6 +697,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, 
uint64_t dma_addr,
 msix_vector_use(&n->parent_obj, cq->vector);
 n->cq[cqid] = cq;
 cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+n->qs_created++;
 }
 
 static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -762,7 +771,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify 
*c)
 prp1, prp2);
 }
 
-static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
+static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
 {
 static const int data_len = 4 * KiB;
 uint32_t min_nsid = le32_to_cpu(c->nsid);
@@ -772,7 +781,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
NvmeIdentify *c)
 uint16_t ret;
 int i, j = 0;
 
-trace_nvme_identify_nslist(min_nsid);
+trace_nvme_identify_ns_list(min_nsid);
 
 list = g_malloc0(data_len);
 for (i = 0; i < n->num_namespaces; i++) {
@@ -789,6 +798,47 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
NvmeIdentify *c)
 return ret;
 }
 
+static uint16_t nvme_identify_ns_descriptor_list(NvmeCtrl *n, NvmeCmd *c)
+{
+static const int data_len = 4 * KiB;
+
+/*
+ * The device model does not have anywhere to store a persistent UUID, so
+ * conjure up something that is reproducible. We generate an UUID of the
+ * form "----", where nsid is similar to, say,
+ * 0001.
+ */
+struct ns_descr {
+uint8_t nidt;
+uint8_t nidl;
+uint8_t rsvd[14];
+uint32_t nid;
+};
+
+uint32_t nsid = le32_to_cpu(c->nsid);
+uint64_t prp1 = le64_to_cpu(c->prp1);
+uint64_t prp2 = le64_to_cpu(c->prp2);
+
+struct ns_descr *list;
+uint16_t ret;
+
+trace_nvme_identify_ns_descriptor_list(nsid);
+
+if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
+return NVME_INVALID_NSID | NVME_DNR;
+}
+
+list = g_malloc0(data_len);
+   

[Qemu-block] [PATCH 14/16] nvme: support multiple block requests per request

2019-07-05 Thread Klaus Birkelund Jensen
Currently, the device only issues a single block backend request per
NVMe request, but as we move towards supporting metadata (and
discontiguous vector requests supported by OpenChannel 2.0) it will be
required to issue multiple block backend requests per NVMe request.

With this patch the NVMe device is ready for that.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 322 --
 hw/block/nvme.h   |  49 +--
 hw/block/trace-events |   3 +
 3 files changed, 290 insertions(+), 84 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 02888dbfdbc1..b285119fd29a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -25,6 +25,8 @@
  *  Default: 64
  *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
  *  Default: 0 (disabled)
+ *   mdts= : Maximum Data Transfer Size (power of two)
+ *  Default: 7
  */
 
 #include "qemu/osdep.h"
@@ -319,10 +321,9 @@ static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
 uint64_t prp1, uint64_t prp2, NvmeRequest *req)
 {
-QEMUSGList qsg;
 uint16_t err = NVME_SUCCESS;
 
-err = nvme_map_prp(n, &qsg, prp1, prp2, len, req);
+err = nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
 if (err) {
 return err;
 }
@@ -330,8 +331,8 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 if (req->is_cmb) {
 QEMUIOVector iov;
 
-qemu_iovec_init(&iov, qsg.nsg);
-dma_to_cmb(n, &qsg, &iov);
+qemu_iovec_init(&iov, req->qsg.nsg);
+dma_to_cmb(n, &req->qsg, &iov);
 
 if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
 trace_nvme_err_invalid_dma();
@@ -343,17 +344,86 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
 goto out;
 }
 
-if (unlikely(dma_buf_read(ptr, len, &qsg))) {
+if (unlikely(dma_buf_read(ptr, len, &req->qsg))) {
 trace_nvme_err_invalid_dma();
 err = NVME_INVALID_FIELD | NVME_DNR;
 }
 
 out:
-qemu_sglist_destroy(&qsg);
+qemu_sglist_destroy(&req->qsg);
 
 return err;
 }
 
+static void nvme_blk_req_destroy(NvmeBlockBackendRequest *blk_req)
+{
+if (blk_req->iov.nalloc) {
+qemu_iovec_destroy(&blk_req->iov);
+}
+
+g_free(blk_req);
+}
+
+static void nvme_blk_req_put(NvmeCtrl *n, NvmeBlockBackendRequest *blk_req)
+{
+nvme_blk_req_destroy(blk_req);
+}
+
+static NvmeBlockBackendRequest *nvme_blk_req_get(NvmeCtrl *n, NvmeRequest *req,
+QEMUSGList *qsg)
+{
+NvmeBlockBackendRequest *blk_req = g_malloc0(sizeof(*blk_req));
+
+blk_req->req = req;
+
+if (qsg) {
+blk_req->qsg = qsg;
+}
+
+return blk_req;
+}
+
+static uint16_t nvme_blk_setup(NvmeCtrl *n, NvmeNamespace *ns, QEMUSGList *qsg,
+NvmeRequest *req)
+{
+NvmeBlockBackendRequest *blk_req = nvme_blk_req_get(n, req, qsg);
+if (!blk_req) {
+NVME_GUEST_ERR(nvme_err_internal_dev_error, "nvme_blk_req_get: %s",
+"could not allocate memory");
+return NVME_INTERNAL_DEV_ERROR;
+}
+
+blk_req->slba = req->slba;
+blk_req->nlb = req->nlb;
+blk_req->blk_offset = req->slba * nvme_ns_lbads_bytes(ns);
+
+QTAILQ_INSERT_TAIL(&req->blk_req_tailq, blk_req, tailq_entry);
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+NvmeNamespace *ns = req->ns;
+uint16_t err;
+
+uint32_t len = req->nlb * nvme_ns_lbads_bytes(ns);
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+err = nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
+if (err) {
+return err;
+}
+
+err = nvme_blk_setup(n, ns, &req->qsg, req);
+if (err) {
+return err;
+}
+
+return NVME_SUCCESS;
+}
+
 static void nvme_post_cqes(void *opaque)
 {
 NvmeCQueue *cq = opaque;
@@ -388,6 +458,10 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 {
 assert(cq->cqid == req->sq->cqid);
 
+if (req->qsg.nalloc) {
+qemu_sglist_destroy(&req->qsg);
+}
+
 trace_nvme_enqueue_req_completion(req->cqe.cid, cq->cqid);
 QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
 QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
@@ -471,130 +545,224 @@ static void nvme_process_aers(void *opaque)
 
 static void nvme_rw_cb(void *opaque, int ret)
 {
-NvmeRequest *req = opaque;
+NvmeBlockBackendRequest *blk_req = opaque;
+NvmeRequest *req = blk_req->req;
 NvmeSQueue *sq = req->sq;
 NvmeCtrl *

[Qemu-block] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Klaus Birkelund Jensen
This adds support for multiple namespaces by introducing a new 'nvme-ns'
device model. The nvme device creates a bus named from the device name
('id'). The nvme-ns devices then connect to this and registers
themselves with the nvme device.

This changes how an nvme device is created. Example with two namespaces:

  -drive file=nvme0n1.img,if=none,id=disk1
  -drive file=nvme0n2.img,if=none,id=disk2
  -device nvme,serial=deadbeef,id=nvme0
  -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
  -device nvme-ns,drive=disk2,bus=nvme0,nsid=2

A maximum of 256 namespaces can be configured.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/Makefile.objs |   2 +-
 hw/block/nvme-ns.c | 139 +
 hw/block/nvme-ns.h |  35 +
 hw/block/nvme.c| 169 -
 hw/block/nvme.h|  29 ---
 hw/block/trace-events  |   1 +
 6 files changed, 255 insertions(+), 120 deletions(-)
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
index f5f643f0cc06..d44a2f4b780d 100644
--- a/hw/block/Makefile.objs
+++ b/hw/block/Makefile.objs
@@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
 common-obj-$(CONFIG_XEN) += xen-block.o
 common-obj-$(CONFIG_ECC) += ecc.o
 common-obj-$(CONFIG_ONENAND) += onenand.o
-common-obj-$(CONFIG_NVME_PCI) += nvme.o
+common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
 
 obj-$(CONFIG_SH4) += tc58128.o
 
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
new file mode 100644
index ..11b594467991
--- /dev/null
+++ b/hw/block/nvme-ns.c
@@ -0,0 +1,139 @@
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
+#include "hw/block/block.h"
+#include "hw/pci/msix.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+
+#include "hw/qdev-core.h"
+
+#include "nvme.h"
+#include "nvme-ns.h"
+
+static uint64_t nvme_ns_calc_blks(NvmeNamespace *ns)
+{
+return ns->size / nvme_ns_lbads_bytes(ns);
+}
+
+static void nvme_ns_init_identify(NvmeIdNs *id_ns)
+{
+id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+}
+
+static int nvme_ns_init(NvmeNamespace *ns)
+{
+uint64_t ns_blks;
+NvmeIdNs *id_ns = &ns->id_ns;
+
+nvme_ns_init_identify(id_ns);
+
+ns_blks = nvme_ns_calc_blks(ns);
+id_ns->nuse = id_ns->ncap = id_ns->nsze = cpu_to_le64(ns_blks);
+
+return 0;
+}
+
+static int nvme_ns_init_blk(NvmeNamespace *ns, NvmeIdCtrl *id, Error **errp)
+{
+blkconf_blocksizes(&ns->conf);
+
+if (!blkconf_apply_backend_options(&ns->conf,
+blk_is_read_only(ns->conf.blk), false, errp)) {
+return 1;
+}
+
+ns->size = blk_getlength(ns->conf.blk);
+if (ns->size < 0) {
+error_setg_errno(errp, -ns->size, "blk_getlength");
+return 1;
+}
+
+if (!blk_enable_write_cache(ns->conf.blk)) {
+id->vwc = 0;
+}
+
+return 0;
+}
+
+static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
+{
+if (!ns->conf.blk) {
+error_setg(errp, "nvme-ns: block backend not configured");
+return 1;
+}
+
+return 0;
+}
+
+
+static void nvme_ns_realize(DeviceState *dev, Error **errp)
+{
+NvmeNamespace *ns = NVME_NS(dev);
+BusState *s = qdev_get_parent_bus(dev);
+NvmeCtrl *n = NVME(s->parent);
+Error *local_err = NULL;
+
+if (nvme_ns_check_constraints(ns, &local_err)) {
+error_propagate_prepend(errp, local_err,
+"nvme_ns_check_constraints: ");
+return;
+}
+
+if (nvme_ns_init_blk(ns, &n->id_ctrl, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_ns_init_blk: ");
+return;
+}
+
+nvme_ns_init(ns);
+if (nvme_register_namespace(n, ns, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_register_namespace: ");
+return;
+}
+}
+
+static Property nvme_ns_props[] = {
+DEFINE_BLOCK_PROPERTIES(NvmeNamespace, conf),
+DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
+DEFINE_PROP_END_OF_LIST(),
+};
+
+static void nvme_ns_class_init(ObjectClass *oc, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(oc);
+
+set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
+
+dc->bus_type = TYPE_NVME_BUS;
+dc->realize = nvme_ns_realize;
+dc->props = nvme_ns_props;
+dc->desc = "virtual nvme namespace";
+}
+
+static void nvme_ns_instance_init(Object *obj)
+{
+NvmeNamespace *ns = NVME_NS(obj);
+char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
+
+device_add_bootindex_property(obj, &ns->conf.booti

[Qemu-block] [PATCH 13/16] nvme: simplify dma/cmb mappings

2019-07-05 Thread Klaus Birkelund Jensen
Instead of handling both QSGs and IOVs in multiple places, simply use
QSGs everywhere by assuming that the request does not involve the
controller memory buffer (CMB). If the request is found to involve the
CMB, convert the QSG to an IOV and issue the I/O. The QSG is converted
to an IOV by the dma helpers anyway, so the CMB path is not unfairly
affected by this simplifying change.

As a side-effect, this patch also allows PRPs to be located in the CMB.
The logic ensures that if some of the PRP is in the CMB, all of it must
be located there.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 277 --
 hw/block/nvme.h   |   3 +-
 hw/block/trace-events |   1 +
 include/block/nvme.h  |   1 +
 4 files changed, 187 insertions(+), 95 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 8ad95fdfa261..02888dbfdbc1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -55,14 +55,21 @@
 
 static void nvme_process_sq(void *opaque);
 
+static inline uint8_t nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
+{
+return n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size));
+}
+
 static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
-if (n->cmbsz && addr >= n->ctrl_mem.addr &&
-addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+if (nvme_addr_is_cmb(n, addr)) {
 memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
-} else {
-pci_dma_read(&n->parent_obj, addr, buf, size);
+
+return;
 }
+
+pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
 static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
@@ -151,139 +158,200 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
 }
 }
 
-static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
- uint64_t prp2, uint32_t len, NvmeCtrl *n)
+static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
+uint64_t prp2, uint32_t len, NvmeRequest *req)
 {
 hwaddr trans_len = n->page_size - (prp1 % n->page_size);
 trans_len = MIN(len, trans_len);
 int num_prps = (len >> n->page_bits) + 1;
+uint16_t status = NVME_SUCCESS;
+bool prp_list_in_cmb = false;
+
+trace_nvme_map_prp(req->cmd.opcode, trans_len, len, prp1, prp2, num_prps);
 
 if (unlikely(!prp1)) {
 trace_nvme_err_invalid_prp();
 return NVME_INVALID_FIELD | NVME_DNR;
-} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
-   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
-qsg->nsg = 0;
-qemu_iovec_init(iov, num_prps);
-qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
-} else {
-pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
-qemu_sglist_add(qsg, prp1, trans_len);
 }
+
+if (nvme_addr_is_cmb(n, prp1)) {
+req->is_cmb = true;
+}
+
+pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+qemu_sglist_add(qsg, prp1, trans_len);
+
 len -= trans_len;
 if (len) {
 if (unlikely(!prp2)) {
 trace_nvme_err_invalid_prp2_missing();
+status = NVME_INVALID_FIELD | NVME_DNR;
 goto unmap;
 }
+
 if (len > n->page_size) {
 uint64_t prp_list[n->max_prp_ents];
 uint32_t nents, prp_trans;
 int i = 0;
 
+if (nvme_addr_is_cmb(n, prp2)) {
+prp_list_in_cmb = true;
+}
+
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
+nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
 while (len != 0) {
+bool addr_is_cmb;
 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
 
 if (i == n->max_prp_ents - 1 && len > n->page_size) {
 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
 trace_nvme_err_invalid_prplist_ent(prp_ent);
+status = NVME_INVALID_FIELD | NVME_DNR;
+goto unmap;
+}
+
+addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
+if ((prp_list_in_cmb && !addr_is_cmb) ||
+(!prp_list_in_cmb && addr_is_cmb)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
 goto unmap;
 }
 
 i = 0;
 nents = (len + n->page_size 

[Qemu-block] [PATCH 09/16] nvme: support Asynchronous Event Request command

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.2 ("Asynchronous Event Request command").

Modified from Keith's qemu-nvme tree.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 88 ++-
 hw/block/nvme.h   |  7 
 hw/block/trace-events |  7 
 3 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index eb6af6508e2d..a20576654f1b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -39,6 +39,7 @@
 #include "nvme.h"
 
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -318,6 +319,51 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_process_aers(void *opaque)
+{
+NvmeCtrl *n = opaque;
+NvmeRequest *req;
+NvmeAerResult *result;
+NvmeAsyncEvent *event, *next;
+
+trace_nvme_process_aers();
+
+QSIMPLEQ_FOREACH_SAFE(event, &n->aer_queue, entry, next) {
+/* can't post cqe if there is nothing to complete */
+if (!n->outstanding_aers) {
+trace_nvme_no_outstanding_aers();
+break;
+}
+
+/* ignore if masked (cqe posted, but event not cleared) */
+if (n->aer_mask & (1 << event->result.event_type)) {
+trace_nvme_aer_masked(event->result.event_type, n->aer_mask);
+continue;
+}
+
+QSIMPLEQ_REMOVE_HEAD(&n->aer_queue, entry);
+
+n->aer_mask |= 1 << event->result.event_type;
+n->aer_mask_queued &= ~(1 << event->result.event_type);
+n->outstanding_aers--;
+
+req = n->aer_reqs[n->outstanding_aers];
+
+result = (NvmeAerResult *) &req->cqe.result;
+result->event_type = event->result.event_type;
+result->event_info = event->result.event_info;
+result->log_page = event->result.log_page;
+g_free(event);
+
+req->status = NVME_SUCCESS;
+
+trace_nvme_aer_post_cqe(result->event_type, result->event_info,
+result->log_page);
+
+nvme_enqueue_req_completion(&n->admin_cq, req);
+}
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
 NvmeRequest *req = opaque;
@@ -796,6 +842,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
+case NVME_ASYNCHRONOUS_EVENT_CONF:
+result = cpu_to_le32(n->features.async_config);
 break;
 default:
 trace_nvme_err_invalid_getfeat(dw10);
@@ -841,11 +889,11 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
 ((n->params.num_queues - 2) << 16));
 break;
-
 case NVME_TIMESTAMP:
 return nvme_set_feature_timestamp(n, cmd);
+case NVME_ASYNCHRONOUS_EVENT_CONF:
+n->features.async_config = dw11;
 break;
-
 default:
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -854,6 +902,22 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 return NVME_SUCCESS;
 }
 
+static uint16_t nvme_aer(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+trace_nvme_aer(req->cqe.cid);
+
+if (n->outstanding_aers > NVME_AERL) {
+trace_nvme_aer_aerl_exceeded();
+return NVME_AER_LIMIT_EXCEEDED;
+}
+
+n->aer_reqs[n->outstanding_aers] = req;
+timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+n->outstanding_aers++;
+
+return NVME_NO_COMPLETE;
+}
+
 static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 NvmeSQueue *sq;
@@ -918,6 +982,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 return nvme_set_feature(n, cmd, req);
 case NVME_ADM_CMD_GET_FEATURES:
 return nvme_get_feature(n, cmd, req);
+case NVME_ADM_CMD_ASYNC_EV_REQ:
+return nvme_aer(n, cmd, req);
 case NVME_ADM_CMD_ABORT:
 return nvme_abort(n, cmd, req);
 default:
@@ -963,6 +1029,7 @@ static void nvme_process_sq(void *opaque)
 
 static void nvme_clear_ctrl(NvmeCtrl *n)
 {
+NvmeAsyncEvent *event;
 int i;
 
 blk_drain(n->conf.blk);
@@ -978,8 +1045,19 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 }
 }
 
+if (n->aer_timer) {
+timer_del(n->aer_timer);
+timer_free(n->aer_timer);
+n->aer_timer = NULL;
+}
+while ((event = QSIMPLEQ_FIRST(&n->aer_queue)) != NULL) {
+QSI

[Qemu-block] [PATCH 05/16] nvme: populate the mandatory subnqn and ver fields

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1 or later. See NVM
Express 1.2.1, Section 5.11 ("Identify command"), Figure 90 and Section
7.9 ("NVMe Qualified Names").

This also bumps the supported version to 1.2.1.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ce2e5365385b..3c392dc336a8 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1364,12 +1364,18 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 id->ieee[0] = 0x00;
 id->ieee[1] = 0x02;
 id->ieee[2] = 0xb3;
+id->ver = cpu_to_le32(0x00010201);
 id->oacs = cpu_to_le16(0);
 id->frmw = 7 << 1;
 id->sqes = (0x6 << 4) | 0x6;
 id->cqes = (0x4 << 4) | 0x4;
 id->nn = cpu_to_le32(n->num_namespaces);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
+
+strcpy((char *) id->subnqn, "nqn.2014-08.org.nvmexpress:uuid:");
+qemu_uuid_unparse(&qemu_uuid,
+(char *) id->subnqn + strlen((char *) id->subnqn));
+
 id->psd[0].mp = cpu_to_le16(0x9c4);
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
@@ -1384,7 +1390,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CAP_SET_CSS(n->bar.cap, 1);
 NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
-n->bar.vs = 0x00010200;
+n->bar.vs = 0x00010201;
 n->bar.intmc = n->bar.intms = 0;
 
 if (n->params.cmb_size_mb) {
-- 
2.20.1




[Qemu-block] [PATCH 07/16] nvme: support Abort command

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.1 ("Abort command").

Extracted from Keith's qemu-nvme tree. Modified to only consider queued
and not executing commands.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 56 +
 1 file changed, 56 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b31e5ff681bd..4b9ff51868c0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -38,6 +38,7 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
 (trace_##trace)(__VA_ARGS__); \
@@ -848,6 +849,54 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+NvmeSQueue *sq;
+NvmeRequest *new;
+uint32_t index = 0;
+uint16_t sqid = cmd->cdw10 & 0xffff;
+uint16_t cid = (cmd->cdw10 >> 16) & 0xffff;
+
+req->cqe.result = 1;
+if (nvme_check_sqid(n, sqid)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+sq = n->sq[sqid];
+
+/* only consider queued (and not executing) commands for abort */
+while ((sq->head + index) % sq->size != sq->tail) {
+NvmeCmd abort_cmd;
+hwaddr addr;
+
+addr = sq->dma_addr + ((sq->head + index) % sq->size) * n->sqe_size;
+
+nvme_addr_read(n, addr, (void *) &abort_cmd, sizeof(abort_cmd));
+if (abort_cmd.cid == cid) {
+req->cqe.result = 0;
+new = QTAILQ_FIRST(&sq->req_list);
+QTAILQ_REMOVE(&sq->req_list, new, entry);
+QTAILQ_INSERT_TAIL(&sq->out_req_list, new, entry);
+
+memset(&new->cqe, 0, sizeof(new->cqe));
+new->cqe.cid = cid;
+new->status = NVME_CMD_ABORT_REQ;
+
+abort_cmd.opcode = NVME_OP_ABORTED;
+nvme_addr_write(n, addr, (void *) &abort_cmd, sizeof(abort_cmd));
+
+nvme_enqueue_req_completion(n->cq[sq->cqid], new);
+
+return NVME_SUCCESS;
+}
+
+++index;
+}
+
 return NVME_SUCCESS;
 }
 
@@ -868,6 +917,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 return nvme_set_feature(n, cmd, req);
 case NVME_ADM_CMD_GET_FEATURES:
 return nvme_get_feature(n, cmd, req);
+case NVME_ADM_CMD_ABORT:
+return nvme_abort(n, cmd, req);
 default:
 trace_nvme_err_invalid_admin_opc(cmd->opcode);
 return NVME_INVALID_OPCODE | NVME_DNR;
@@ -890,6 +941,10 @@ static void nvme_process_sq(void *opaque)
 nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd));
 nvme_inc_sq_head(sq);
 
+if (cmd.opcode == NVME_OP_ABORTED) {
+continue;
+}
+
 req = QTAILQ_FIRST(&sq->req_list);
 QTAILQ_REMOVE(&sq->req_list, req, entry);
 QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
@@ -1376,6 +1431,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 id->ieee[2] = 0xb3;
 id->ver = cpu_to_le32(0x00010201);
 id->oacs = cpu_to_le16(0);
+id->acl = 3;
 id->frmw = 7 << 1;
 id->sqes = (0x6 << 4) | 0x6;
 id->cqes = (0x4 << 4) | 0x4;
-- 
2.20.1




[Qemu-block] [PATCH 11/16] nvme: add missing mandatory Features

2019-07-05 Thread Klaus Birkelund Jensen
Add support for returning a reasonable response to Get/Set Features for
mandatory features.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 49 ---
 hw/block/trace-events |  2 ++
 include/block/nvme.h  |  3 ++-
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 93f5dff197e0..8259dd7c1d6c 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -860,13 +860,24 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
+uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 uint32_t result;
 
+trace_nvme_getfeat(dw10);
+
 switch (dw10) {
+case NVME_ARBITRATION:
+result = cpu_to_le32(n->features.arbitration);
+break;
+case NVME_POWER_MANAGEMENT:
+result = cpu_to_le32(n->features.power_mgmt);
+break;
 case NVME_TEMPERATURE_THRESHOLD:
 result = cpu_to_le32(n->features.temp_thresh);
 break;
 case NVME_ERROR_RECOVERY:
+result = cpu_to_le32(n->features.err_rec);
+break;
 case NVME_VOLATILE_WRITE_CACHE:
 result = blk_enable_write_cache(n->conf.blk);
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -878,6 +889,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
+case NVME_INTERRUPT_COALESCING:
+result = cpu_to_le32(n->features.int_coalescing);
+break;
+case NVME_INTERRUPT_VECTOR_CONF:
+if ((dw11 & 0xffff) > n->params.num_queues) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
+break;
+case NVME_WRITE_ATOMICITY:
+result = cpu_to_le32(n->features.write_atomicity);
+break;
 case NVME_ASYNCHRONOUS_EVENT_CONF:
 result = cpu_to_le32(n->features.async_config);
 break;
@@ -913,6 +937,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
+trace_nvme_setfeat(dw10, dw11);
+
 switch (dw10) {
 case NVME_TEMPERATURE_THRESHOLD:
 n->features.temp_thresh = dw11;
@@ -937,6 +963,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 case NVME_ASYNCHRONOUS_EVENT_CONF:
 n->features.async_config = dw11;
 break;
+case NVME_ARBITRATION:
+case NVME_POWER_MANAGEMENT:
+case NVME_ERROR_RECOVERY:
+case NVME_INTERRUPT_COALESCING:
+case NVME_INTERRUPT_VECTOR_CONF:
+case NVME_WRITE_ATOMICITY:
+return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
 default:
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -1693,6 +1726,14 @@ static void nvme_init_state(NvmeCtrl *n)
 n->aer_reqs = g_new0(NvmeRequest *, NVME_AERL + 1);
 n->temperature = NVME_TEMPERATURE;
 n->features.temp_thresh = 0x14d;
+n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
+sizeof(*n->features.int_vector_config));
+
+/* disable coalescing (not supported) */
+for (int i = 0; i < n->params.num_queues; i++) {
+n->features.int_vector_config[i] = i | (1 << 16);
+}
+
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -1769,6 +1810,10 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 id->nn = cpu_to_le32(n->num_namespaces);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
 
+if (blk_enable_write_cache(n->conf.blk)) {
+id->vwc = 1;
+}
+
 strcpy((char *) id->subnqn, "nqn.2014-08.org.nvmexpress:uuid:");
 qemu_uuid_unparse(&qemu_uuid,
 (char *) id->subnqn + strlen((char *) id->subnqn));
@@ -1776,9 +1821,6 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 id->psd[0].mp = cpu_to_le16(0x9c4);
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
-if (blk_enable_write_cache(n->conf.blk)) {
-id->vwc = 1;
-}
 
 n->bar.cap = 0;
 NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
@@ -1876,6 +1918,7 @@ static void nvme_exit(PCIDevice *pci_dev)
 g_free(n->sq);
 g_free(n->elpes);
 g_free(n->aer_reqs);
+g_free(n->features.int_vector_config);
 
 if (n->params.cmb_size_mb) {
 g_free(n->cmbuf);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index ed666bbc94f2..17485bb0375b 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -41,6 +41,8 @@ nvme_del_cq(uint16_t cqid) "deleted c

[Qemu-block] [PATCH 15/16] nvme: support scatter gather lists

2019-07-05 Thread Klaus Birkelund Jensen
For now, support the Data Block, Segment and Last Segment descriptor
types.

See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").

Signed-off-by: Klaus Birkelund Jensen 
---
 block/nvme.c  |  18 +-
 hw/block/nvme.c   | 390 +++---
 hw/block/nvme.h   |   6 +
 hw/block/trace-events |   3 +
 include/block/nvme.h  |  64 ++-
 5 files changed, 410 insertions(+), 71 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 73ed5fa75f2e..907a610633f2 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -438,7 +438,7 @@ static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
 }
-cmd.prp1 = cpu_to_le64(iova);
+cmd.dptr.prp.prp1 = cpu_to_le64(iova);
 
 if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
 error_setg(errp, "Failed to identify controller");
@@ -512,7 +512,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_CQ,
-.prp1 = cpu_to_le64(q->cq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
.cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xffff)),
 .cdw11 = cpu_to_le32(0x3),
 };
@@ -523,7 +523,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_SQ,
-.prp1 = cpu_to_le64(q->sq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
.cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xffff)),
 .cdw11 = cpu_to_le32(0x1 | (n << 16)),
 };
@@ -858,16 +858,16 @@ try_map:
 case 0:
 abort();
 case 1:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = 0;
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = 0;
 break;
 case 2:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = pagelist[1];
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = pagelist[1];
 break;
 default:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
 break;
 }
 trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b285119fd29a..6bf62952dd13 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -273,6 +273,198 @@ unmap:
 return status;
 }
 
+static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor *segment, uint64_t nsgld, uint32_t *len,
+NvmeRequest *req)
+{
+dma_addr_t addr, trans_len;
+
+for (int i = 0; i < nsgld; i++) {
+if (NVME_SGL_TYPE(segment[i].type) != SGL_DESCR_TYPE_DATA_BLOCK) {
+trace_nvme_err_invalid_sgl_descriptor(req->cqe.cid,
+NVME_SGL_TYPE(segment[i].type));
+return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
+}
+
+if (*len == 0) {
+if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
+trace_nvme_err_invalid_sgl_excess_length(req->cqe.cid);
+return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+}
+
+break;
+}
+
+addr = le64_to_cpu(segment[i].addr);
+trans_len = MIN(*len, le64_to_cpu(segment[i].len));
+
+if (nvme_addr_is_cmb(n, addr)) {
+/*
+ * All data and metadata, if any, associated with a particular
+ * command shall be located in either the CMB or host memory. Thus,
+ * if an address if found to be in the CMB and we have already
+ * mapped data that is in host memory, the use is invalid.
+ */
+if (!req->is_cmb && qsg->size) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+
+req->is_cmb = true;
+} else {
+/*
+ * Similarly, if the address does not reference the CMB, but we
+ * have already established that the request has data or metadata
+ * in the CMB, the use is invalid.
+ */
+if (req->is_cmb) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+}
+
+qemu_sglist_add(qsg, addr, trans_len);
+
+*len -= trans_len;
+}
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
+{
+const int MAX_NSGLD = 256;
+
+NvmeSglDescriptor segment[MAX_NSGLD];
+uint64_t nsgld;
+uint16_t status;
+bool sgl_in_cmb = false;
+hwaddr addr = le64_to_cpu(sgl.addr);
+
+t

[Qemu-block] [PATCH 06/16] nvme: support completion queue in cmb

2019-07-05 Thread Klaus Birkelund Jensen
While not particularly useful, allow completion queues in the controller
memory buffer. Could be useful for testing.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3c392dc336a8..b31e5ff681bd 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -57,6 +57,16 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 }
 }
 
+static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
+{
+if (n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+memcpy((void *)&n->cmbuf[addr - n->ctrl_mem.addr], buf, size);
+return;
+}
+pci_dma_write(&n->parent_obj, addr, buf, size);
+}
+
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
 return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
@@ -276,6 +286,7 @@ static void nvme_post_cqes(void *opaque)
 
 QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
 NvmeSQueue *sq;
+NvmeCqe *cqe = &req->cqe;
 hwaddr addr;
 
 if (nvme_cq_full(cq)) {
@@ -289,8 +300,7 @@ static void nvme_post_cqes(void *opaque)
 req->cqe.sq_head = cpu_to_le16(sq->head);
 addr = cq->dma_addr + cq->tail * n->cqe_size;
 nvme_inc_cq_tail(cq);
-pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
-sizeof(req->cqe));
+nvme_addr_write(n, addr, (void *) cqe, sizeof(*cqe));
 QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
 }
 if (cq->tail != cq->head) {
@@ -1399,7 +1409,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
 
 NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 1);
 NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
 NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
 NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
-- 
2.20.1




[Qemu-block] [PATCH 00/16] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-07-05 Thread Klaus Birkelund Jensen
Matt Fitzpatrick's post ("[RFC,v1] Namespace Management Support") pushed
me to finally get my head out of my a** and post this series.

This is basically a follow-up to my previous series ("nvme: v1.3, sgls,
metadata and new 'ocssd' device"), but I'm not tagging it as a v2
because the patches for metadata and the ocssd device have been dropped.
Instead, this series also includes a patch that enables support for
multiple namespaces in a "proper" way by adding a new 'nvme-ns' device
model such that the "real" nvme device is composed of the 'nvme' device
model (the core controller) and multiple 'nvme-ns' devices that model
the namespaces.

All in all, the patches in this series should be less controversial, but
I know there is a lot to go through. I've kept commit 011de3d531b6
("nvme: refactor device realization") as a single commit, but I can chop
it up if any reviewers would prefer that; the series is already at 16
patches as it is. The refactor patch is basically just code movement.

At a glance, this series:

  - generally fixes up the device to be as close to NVMe 1.3d compliant as
possible (in terms of 'mandatory' features) by:
  - adding proper setting of the SUBNQN and VER fields
  - supporting the Abort command
  - supporting the Asynchronous Event Request command
  - supporting the Get Log Page command
  - providing reasonable stub responses to Get/Set Feature command of
mandatory features
  - adds support for scatter gather lists (SGLs)
  - simplifies DMA/CMB mappings and support PRPs/SGLs in the CMB
  - adds support for multiple block requests per nvme request (this is
useful for future support for metadata, OCSSD 2.0 vector requests
and upcoming zoned namespaces)
  - adds support for multiple namespaces


Thanks to everyone who chipped in on the discussion on multiple
namespaces! You're CC'ed ;)


Klaus Birkelund Jensen (16):
  nvme: simplify namespace code
  nvme: move device parameters to separate struct
  nvme: fix lpa field
  nvme: add missing fields in identify controller
  nvme: populate the mandatory subnqn and ver fields
  nvme: support completion queue in cmb
  nvme: support Abort command
  nvme: refactor device realization
  nvme: support Asynchronous Event Request command
  nvme: support Get Log Page command
  nvme: add missing mandatory Features
  nvme: bump supported NVMe revision to 1.3d
  nvme: simplify dma/cmb mappings
  nvme: support multiple block requests per request
  nvme: support scatter gather lists
  nvme: support multiple namespaces

 block/nvme.c   |   18 +-
 hw/block/Makefile.objs |2 +-
 hw/block/nvme-ns.c |  139 
 hw/block/nvme-ns.h |   35 +
 hw/block/nvme.c| 1629 
 hw/block/nvme.h|   99 ++-
 hw/block/trace-events  |   24 +-
 include/block/nvme.h   |  130 +++-
 8 files changed, 1739 insertions(+), 337 deletions(-)
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

-- 
2.20.1




[Qemu-block] [PATCH 02/16] nvme: move device parameters to separate struct

2019-07-05 Thread Klaus Birkelund Jensen
Move device configuration parameters to separate struct to make it
explicit what is configurable and what is set internally.

Also, clean up some includes.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 54 +++--
 hw/block/nvme.h | 16 ---
 2 files changed, 38 insertions(+), 32 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 28ebaf1368b1..a3f83f3c2135 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -27,18 +27,14 @@
 
 #include "qemu/osdep.h"
 #include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
 #include "hw/block/block.h"
-#include "hw/hw.h"
 #include "hw/pci/msix.h"
-#include "hw/pci/pci.h"
 #include "sysemu/sysemu.h"
-#include "qapi/error.h"
-#include "qapi/visitor.h"
 #include "sysemu/block-backend.h"
+#include "qapi/error.h"
 
-#include "qemu/log.h"
-#include "qemu/module.h"
-#include "qemu/cutils.h"
 #include "trace.h"
 #include "nvme.h"
 
@@ -63,12 +59,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->num_queues && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->num_queues && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -630,7 +626,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_nvme_err_invalid_create_cq_addr(prp1);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
-if (unlikely(vector > n->num_queues)) {
+if (unlikely(vector > n->params.num_queues)) {
 trace_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -782,7 +778,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 
16));
+result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 trace_nvme_getfeat_numq(result);
 break;
 case NVME_TIMESTAMP:
@@ -827,9 +824,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_NUMBER_OF_QUEUES:
 trace_nvme_setfeat_numq((dw11 & 0x) + 1,
 ((dw11 >> 16) & 0x) + 1,
-n->num_queues - 1, n->num_queues - 1);
-req->cqe.result =
-cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+n->params.num_queues - 1,
+n->params.num_queues - 1);
+req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 break;
 
 case NVME_TIMESTAMP:
@@ -903,12 +901,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
 blk_drain(n->conf.blk);
 
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->sq[i] != NULL) {
 nvme_free_sq(n->sq[i], n);
 }
 }
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->cq[i] != NULL) {
 nvme_free_cq(n->cq[i], n);
 }
@@ -1311,7 +1309,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 int64_t bs_size;
 uint8_t *pci_conf;
 
-if (!n->num_queues) {
+if (!n->params.num_queues) {
 error_setg(errp, "num_queues can't be zero");
 return;
 }
@@ -1327,7 +1325,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-if (!n->serial) {
+if (!n->params.serial) {
 error_setg(errp, "serial property not set");
 return;
 }
@@ -1344,24 +1342,24 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 pcie_endpoint_cap_init(pci_dev, 0x80);
 
 n->num_namespaces = 1;
-n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
+n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
-n->sq = g_new0(NvmeSQueue *, n->num_queues);
-n->cq = g_new0(NvmeCQueue *, 

[Qemu-block] [PATCH 04/16] nvme: add missing fields in identify controller

2019-07-05 Thread Klaus Birkelund Jensen
Not used by the device model but added for completeness. See NVM Express
1.2.1, Section 5.11 ("Identify command"), Figure 90.

Signed-off-by: Klaus Birkelund Jensen 
---
 include/block/nvme.h | 34 +-
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 3ec8efcc435e..1b0accd4fe2b 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -543,7 +543,13 @@ typedef struct NvmeIdCtrl {
 uint8_t ieee[3];
 uint8_t cmic;
 uint8_t mdts;
-uint8_t rsvd255[178];
+uint16_tcntlid;
+uint32_tver;
+uint32_trtd3r;
+uint32_trtd3e;
+uint32_toaes;
+uint32_tctratt;
+uint8_t rsvd255[156];
 uint16_toacs;
 uint8_t acl;
 uint8_t aerl;
@@ -551,10 +557,22 @@ typedef struct NvmeIdCtrl {
 uint8_t lpa;
 uint8_t elpe;
 uint8_t npss;
-uint8_t rsvd511[248];
+uint8_t avscc;
+uint8_t apsta;
+uint16_twctemp;
+uint16_tcctemp;
+uint16_tmtfa;
+uint32_thmpre;
+uint32_thmmin;
+uint8_t tnvmcap[16];
+uint8_t unvmcap[16];
+uint32_trpmbs;
+uint8_t rsvd319[4];
+uint16_tkas;
+uint8_t rsvd511[190];
 uint8_t sqes;
 uint8_t cqes;
-uint16_trsvd515;
+uint16_tmaxcmd;
 uint32_tnn;
 uint16_toncs;
 uint16_tfuses;
@@ -562,8 +580,14 @@ typedef struct NvmeIdCtrl {
 uint8_t vwc;
 uint16_tawun;
 uint16_tawupf;
-uint8_t rsvd703[174];
-uint8_t rsvd2047[1344];
+uint8_t nvscc;
+uint8_t rsvd531;
+uint16_tacwu;
+uint16_trsvd535;
+uint32_tsgls;
+uint8_t rsvd767[228];
+uint8_t subnqn[256];
+uint8_t rsvd2047[1024];
 NvmePSD psd[32];
 uint8_t vs[1024];
 } NvmeIdCtrl;
-- 
2.20.1




[Qemu-block] [PATCH 10/16] nvme: support Get Log Page command

2019-07-05 Thread Klaus Birkelund Jensen
Add support for the Get Log Page command and stub/dumb implementations
of the mandatory Error Information, SMART/Health Information and
Firmware Slot Information log pages.

Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.10 ("Get Log Page command").

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 209 ++
 hw/block/nvme.h   |   3 +
 hw/block/trace-events |   3 +
 include/block/nvme.h  |   4 +-
 4 files changed, 217 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a20576654f1b..93f5dff197e0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -39,6 +39,8 @@
 #include "nvme.h"
 
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_TEMPERATURE 0x143
+#define NVME_ELPE 3
 #define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
@@ -319,6 +321,36 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type,
+uint8_t event_info, uint8_t log_page)
+{
+NvmeAsyncEvent *event;
+
+trace_nvme_enqueue_event(event_type, event_info, log_page);
+
+/*
+ * Do not enqueue the event if something of this type is already queued.
+ * This bounds the size of the event queue and makes sure it does not grow
+ * indefinitely when events are not processed by the host (i.e. does not
+ * issue any AERs).
+ */
+if (n->aer_mask_queued & (1 << event_type)) {
+return;
+}
+n->aer_mask_queued |= (1 << event_type);
+
+event = g_new(NvmeAsyncEvent, 1);
+event->result = (NvmeAerResult) {
+.event_type = event_type,
+.event_info = event_info,
+.log_page   = log_page,
+};
+
+QSIMPLEQ_INSERT_TAIL(&n->aer_queue, event, entry);
+
+timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+}
+
 static void nvme_process_aers(void *opaque)
 {
 NvmeCtrl *n = opaque;
@@ -831,6 +863,10 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 uint32_t result;
 
 switch (dw10) {
+case NVME_TEMPERATURE_THRESHOLD:
+result = cpu_to_le32(n->features.temp_thresh);
+break;
+case NVME_ERROR_RECOVERY:
 case NVME_VOLATILE_WRITE_CACHE:
 result = blk_enable_write_cache(n->conf.blk);
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -878,6 +914,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
 switch (dw10) {
+case NVME_TEMPERATURE_THRESHOLD:
+n->features.temp_thresh = dw11;
+if (n->features.temp_thresh <= n->temperature) {
+nvme_enqueue_event(n, NVME_AER_TYPE_SMART,
+NVME_AER_INFO_SMART_TEMP_THRESH, NVME_LOG_SMART_INFO);
+}
+break;
 case NVME_VOLATILE_WRITE_CACHE:
 blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
 break;
@@ -902,6 +945,137 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 return NVME_SUCCESS;
 }
 
+static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
+{
+n->aer_mask &= ~(1 << event_type);
+if (!QSIMPLEQ_EMPTY(&n->aer_queue)) {
+timer_mod(n->aer_timer,
+qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+}
+}
+
+static uint16_t nvme_error_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
+uint32_t buf_len, uint64_t off, NvmeRequest *req)
+{
+uint32_t trans_len;
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+if (off > sizeof(*n->elpes) * (NVME_ELPE + 1)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(*n->elpes) * (NVME_ELPE + 1) - off, buf_len);
+
+if (!rae) {
+nvme_clear_events(n, NVME_AER_TYPE_ERROR);
+}
+
+return nvme_dma_read_prp(n, (uint8_t *) n->elpes + off, trans_len, prp1,
+prp2);
+}
+
+static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
+uint32_t buf_len, uint64_t off, NvmeRequest *req)
+{
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+uint32_t trans_len;
+time_t current_ms;
+NvmeSmartLog smart;
+
+if (cmd->nsid != 0 && cmd->nsid != 0x) {
+trace_nvme_err(req->cqe.cid, "smart log not supported for namespace",
+NVME_INVALID_FIELD | NVME_DNR);
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+if (off > sizeof(smart)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(smart) - off, buf_len);
+
+memset(&

[Qemu-block] [PATCH 03/16] nvme: fix lpa field

2019-07-05 Thread Klaus Birkelund Jensen
The Log Page Attributes in the Identify Controller structure indicates
that the controller supports the SMART / Health Information log page on
a per namespace basis. It does not, given that neither this log page nor
the Get Log Page command is implemented.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a3f83f3c2135..ce2e5365385b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1366,7 +1366,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 id->ieee[2] = 0xb3;
 id->oacs = cpu_to_le16(0);
 id->frmw = 7 << 1;
-id->lpa = 1 << 0;
 id->sqes = (0x6 << 4) | 0x6;
 id->cqes = (0x4 << 4) | 0x4;
 id->nn = cpu_to_le32(n->num_namespaces);
-- 
2.20.1




[Qemu-block] [PATCH 01/16] nvme: simplify namespace code

2019-07-05 Thread Klaus Birkelund Jensen
The device model currently only supports a single namespace and also
specifically sets num_namespaces to 1. Take this into account and
simplify the code.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 26 +++---
 hw/block/nvme.h |  2 +-
 2 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 36d6a8bb3a3e..28ebaf1368b1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -424,7 +424,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return NVME_INVALID_NSID | NVME_DNR;
 }
 
-ns = &n->namespaces[nsid - 1];
+ns = &n->namespace;
 switch (cmd->opcode) {
 case NVME_CMD_FLUSH:
 return nvme_flush(n, ns, cmd, req);
@@ -670,7 +670,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify 
*c)
 return NVME_INVALID_NSID | NVME_DNR;
 }
 
-ns = &n->namespaces[nsid - 1];
+ns = &n->namespace;
 
 return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
 prp1, prp2);
@@ -1306,8 +1306,8 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
 NvmeCtrl *n = NVME(pci_dev);
 NvmeIdCtrl *id = &n->id_ctrl;
+NvmeIdNs *id_ns = &n->namespace.id_ns;
 
-int i;
 int64_t bs_size;
 uint8_t *pci_conf;
 
@@ -1347,7 +1347,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
-n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
 n->sq = g_new0(NvmeSQueue *, n->num_queues);
 n->cq = g_new0(NvmeCQueue *, n->num_queues);
 
@@ -1416,20 +1415,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 
 }
 
-for (i = 0; i < n->num_namespaces; i++) {
-NvmeNamespace *ns = &n->namespaces[i];
-NvmeIdNs *id_ns = &ns->id_ns;
-id_ns->nsfeat = 0;
-id_ns->nlbaf = 0;
-id_ns->flbas = 0;
-id_ns->mc = 0;
-id_ns->dpc = 0;
-id_ns->dps = 0;
-id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
-id_ns->ncap  = id_ns->nuse = id_ns->nsze =
-cpu_to_le64(n->ns_size >>
-id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas)].ds);
-}
+id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+id_ns->ncap  = id_ns->nuse = id_ns->nsze =
+cpu_to_le64(n->ns_size >>
+id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)].ds);
 }
 
 static void nvme_exit(PCIDevice *pci_dev)
@@ -1437,7 +1426,6 @@ static void nvme_exit(PCIDevice *pci_dev)
 NvmeCtrl *n = NVME(pci_dev);
 
 nvme_clear_ctrl(n);
-g_free(n->namespaces);
 g_free(n->cq);
 g_free(n->sq);
 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 557194ee1954..40cedb1ec932 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -83,7 +83,7 @@ typedef struct NvmeCtrl {
 uint64_ttimestamp_set_qemu_clock_ms;/* QEMU clock time */
 
 char*serial;
-NvmeNamespace   *namespaces;
+NvmeNamespace   namespace;
 NvmeSQueue  **sq;
 NvmeCQueue  **cq;
 NvmeSQueue  admin_sq;
-- 
2.20.1




[Qemu-block] [PATCH 08/16] nvme: refactor device realization

2019-07-05 Thread Klaus Birkelund Jensen
Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 196 ++--
 hw/block/nvme.h |  11 +++
 2 files changed, 152 insertions(+), 55 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4b9ff51868c0..eb6af6508e2d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -38,6 +38,7 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -1365,66 +1366,105 @@ static const MemoryRegionOps nvme_cmb_ops = {
 },
 };
 
-static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
-NvmeCtrl *n = NVME(pci_dev);
-NvmeIdCtrl *id = &n->id_ctrl;
-NvmeIdNs *id_ns = &n->namespace.id_ns;
-
-int64_t bs_size;
-uint8_t *pci_conf;
-
-if (!n->params.num_queues) {
-error_setg(errp, "num_queues can't be zero");
-return;
-}
+NvmeParams *params = &n->params;
 
 if (!n->conf.blk) {
-error_setg(errp, "drive property not set");
-return;
+error_setg(errp, "nvme: block backend not configured");
+return 1;
 }
 
-bs_size = blk_getlength(n->conf.blk);
-if (bs_size < 0) {
-error_setg(errp, "could not get backing file size");
-return;
+if (!params->serial) {
+error_setg(errp, "nvme: serial not configured");
+return 1;
 }
 
-if (!n->params.serial) {
-error_setg(errp, "serial property not set");
-return;
+if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
+error_setg(errp, "nvme: invalid queue configuration");
+return 1;
 }
+
+return 0;
+}
+
+static int nvme_init_blk(NvmeCtrl *n, Error **errp)
+{
 blkconf_blocksizes(&n->conf);
 if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
-   false, errp)) {
-return;
+false, errp)) {
+return 1;
 }
 
-pci_conf = pci_dev->config;
-pci_conf[PCI_INTERRUPT_PIN] = 1;
-pci_config_set_prog_interface(pci_dev->config, 0x2);
-pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
-pcie_endpoint_cap_init(pci_dev, 0x80);
+return 0;
+}
 
+static void nvme_init_state(NvmeCtrl *n)
+{
 n->num_namespaces = 1;
 n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
-n->ns_size = bs_size / (uint64_t)n->num_namespaces;
-
 n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
 n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
+}
 
-memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
-  "nvme", n->reg_size);
+static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
+NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
+
+NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
+NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
+
+n->cmbloc = n->bar.cmbloc;
+n->cmbsz = n->bar.cmbsz;
+
+n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
+"nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
+PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
+PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
+}
+
+static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+uint8_t *pci_conf = pci_dev->config;
+
+pci_conf[PCI_INTERRUPT_PIN] = 1;
+pci_config_set_prog_interface(pci_conf, 0x2);
+pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
+pci_config_set_device_id(pci_conf, 0x5845);
+pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+pcie_endpoint_cap_init(pci_dev, 0x80);
+
+memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
+n->reg_size);
 pci_register_bar(pci_dev, 0,
 PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
 &n->iomem);
 msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
 
+if (n->params.cmb_size_mb) {
+nvme_init_cmb(n, pci_dev);
+}
+}
+
+static void nvme_init_ctrl(NvmeCtrl *n)
+{
+NvmeIdCtrl *id = &n->id_ctrl;
+NvmeParams *params = &n->params;
+uint

[Qemu-block] [PATCH] nvme: do not advertise support for unsupported arbitration mechanism

2019-06-06 Thread Klaus Birkelund Jensen
The device mistakenly reports that the Weighted Round Robin with Urgent
Priority Class arbitration mechanism is supported.

It is not.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 30e50f7a3853..415b4641d6b4 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1383,7 +1383,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 n->bar.cap = 0;
 NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
 NVME_CAP_SET_CQR(n->bar.cap, 1);
-NVME_CAP_SET_AMS(n->bar.cap, 1);
 NVME_CAP_SET_TO(n->bar.cap, 0xf);
 NVME_CAP_SET_CSS(n->bar.cap, 1);
 NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
-- 
2.21.0




[Qemu-block] [PATCH] nvme: fix copy direction in DMA reads going to CMB

2019-05-18 Thread Klaus Birkelund Jensen
`nvme_dma_read_prp` erroneously used `qemu_iovec_*to*_buf` instead of
`qemu_iovec_*from*_buf` when the request involved the controller memory
buffer.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 7caf92532a09..63a5b58849fb 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -238,7 +238,7 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 }
 qemu_sglist_destroy(&qsg);
 } else {
-if (unlikely(qemu_iovec_to_buf(&iov, 0, ptr, len) != len)) {
+if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
 trace_nvme_err_invalid_dma();
 status = NVME_INVALID_FIELD | NVME_DNR;
 }
-- 
2.21.0




[Qemu-block] [PATCH 2/8] nvme: bump supported spec to 1.3

2019-05-17 Thread Klaus Birkelund Jensen
Bump the supported NVMe version to 1.3. To do so, this patch adds a
number of missing 'Mandatory' features from the spec:

  * Support for returning a Namespace Identification Descriptor List in
the Identify command (CNS 03h).
  * Support for the Asynchronous Event Request command.
  * Support for the Get Log Page command and the mandatory Error
Information, Smart / Health Information and Firmware Slot
Information log pages.
  * Support for the Abort command.

As a side-effect, this bump also fixes support for multiple namespaces.

The implementation of AER, Get Log Page and Abort commands has been
imported and slightly modified from Keith's qemu-nvme tree[1]. Thanks!

  [1]: http://git.infradead.org/users/kbusch/qemu-nvme.git

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 792 --
 hw/block/nvme.h   |  31 +-
 hw/block/trace-events |  16 +-
 include/block/nvme.h  |  58 +++-
 4 files changed, 783 insertions(+), 114 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b689c0776e72..65dfc04f71e5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,17 +9,35 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.3d, 1.2, 1.1, 1.0e
  *
  *  http://www.nvmexpress.org/resources/
  */
 
 /**
  * Usage: add options:
- *  -drive file=,if=none,id=
- *  -device nvme,drive=,serial=,id=, \
- *  cmb_size_mb=, \
- *  num_queues=
+ * -drive file=,if=none,id=
+ * -device nvme,drive=,serial=,id=
+ *
+ * The "file" option must point to a path to a real file that you will use as
+ * the backing storage for your NVMe device. It must be a non-zero length, as
+ * this will be the disk image that your nvme controller will use to carve up
+ * namespaces for storage.
+ *
+ * Note the "drive" option's "id" name must match the "device nvme" drive's
+ * name to link the block device used for backing storage to the nvme
+ * interface.
+ *
+ * Advanced optional options:
+ *
+ *   num_ns=  : Namespaces to make out of the backing storage,
+ *   Default:1
+ *   num_queues=  : Number of possible IO Queues, Default:64
+ *   cmb_size_mb= : Size of CMB in MBs, Default:0
+ *
+ * Parameters will be verified against conflicting capabilities and attributes
+ * and fail to load if there is a conflict or a configuration the emulated
+ * device is unable to handle.
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -38,6 +56,12 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_TEMPERATURE 0x143
+#define NVME_ELPE 3
+#define NVME_AERL 3
+#define NVME_OP_ABORTED 0xff
+
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
 (trace_##trace)(__VA_ARGS__); \
@@ -57,6 +81,16 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 }
 }
 
+static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
+{
+if (n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+memcpy((void *)&n->cmbuf[addr - n->ctrl_mem.addr], buf, size);
+return;
+}
+pci_dma_write(&n->parent_obj, addr, buf, size);
+}
+
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
 return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
@@ -244,6 +278,24 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 return status;
 }
 
+static void nvme_post_cqe(NvmeCQueue *cq, NvmeRequest *req)
+{
+NvmeCtrl *n = cq->ctrl;
+NvmeSQueue *sq = req->sq;
+NvmeCqe *cqe = &req->cqe;
+uint8_t phase = cq->phase;
+hwaddr addr;
+
+addr = cq->dma_addr + cq->tail * n->cqe_size;
+cqe->status = cpu_to_le16((req->status << 1) | phase);
+cqe->sq_id = cpu_to_le16(sq->sqid);
+cqe->sq_head = cpu_to_le16(sq->head);
+nvme_addr_write(n, addr, (void *) cqe, sizeof(*cqe));
+nvme_inc_cq_tail(cq);
+
+QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
+}
+
 static void nvme_post_cqes(void *opaque)
 {
 NvmeCQueue *cq = opaque;
@@ -251,24 +303,14 @@ static void nvme_post_cqes(void *opaque)
 NvmeRequest *req, *next;
 
 QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
-NvmeSQueue *sq;
-hwaddr addr;
-
 if (nvme_cq_full(cq)) {
 break;
 }
 
 QTAILQ_REMOVE(&cq->req_list, req, entry);
-sq = req->sq;
-req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
-req->cqe.sq_id = cpu_to_l

[Qemu-block] [PATCH 6/8] nvme: add support for scatter gather lists

2019-05-17 Thread Klaus Birkelund Jensen
Add partial SGL support. For now, only support a single data block or
last segment descriptor. This is in line with what, for instance, SPDK
currently supports.

Signed-off-by: Klaus Birkelund Jensen 
---
 block/nvme.c  |  18 ++--
 hw/block/nvme.c   | 242 +-
 hw/block/nvme.h   |   6 ++
 hw/block/trace-events |   1 +
 include/block/nvme.h  |  81 +-
 5 files changed, 285 insertions(+), 63 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 0684bbd077dd..12d98c0d0be6 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -437,7 +437,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
 }
-cmd.prp1 = cpu_to_le64(iova);
+cmd.dptr.prp.prp1 = cpu_to_le64(iova);
 
 if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
 error_setg(errp, "Failed to identify controller");
@@ -511,7 +511,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_CQ,
-.prp1 = cpu_to_le64(q->cq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0x)),
 .cdw11 = cpu_to_le32(0x3),
 };
@@ -522,7 +522,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_SQ,
-.prp1 = cpu_to_le64(q->sq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0x)),
 .cdw11 = cpu_to_le32(0x1 | (n << 16)),
 };
@@ -857,16 +857,16 @@ try_map:
 case 0:
 abort();
 case 1:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = 0;
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = 0;
 break;
 case 2:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = pagelist[1];
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = pagelist[1];
 break;
 default:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + 
sizeof(uint64_t));
 break;
 }
 trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 675967a596d1..81201a8b4834 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -279,6 +279,96 @@ unmap:
 return status;
 }
 
+static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
+{
+NvmeSglDescriptor *sgl_descriptors;
+uint64_t nsgld;
+uint16_t status = NVME_SUCCESS;
+
+trace_nvme_map_sgl(req->cqe.cid, le64_to_cpu(sgl.generic.type), req->nlb,
+len);
+
+int cmb = 0;
+
+switch (le64_to_cpu(sgl.generic.type)) {
+case SGL_DESCR_TYPE_DATA_BLOCK:
+sgl_descriptors = &sgl;
+nsgld = 1;
+
+break;
+
+case SGL_DESCR_TYPE_LAST_SEGMENT:
+sgl_descriptors = g_malloc0(le64_to_cpu(sgl.unkeyed.len));
+nsgld = le64_to_cpu(sgl.unkeyed.len) / sizeof(NvmeSglDescriptor);
+
+if (nvme_addr_is_cmb(n, sgl.addr)) {
+cmb = 1;
+}
+
+nvme_addr_read(n, le64_to_cpu(sgl.addr), sgl_descriptors,
+le64_to_cpu(sgl.unkeyed.len));
+
+break;
+
+default:
+return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
+}
+
+if (nvme_addr_is_cmb(n, le64_to_cpu(sgl_descriptors[0].addr))) {
+if (!cmb) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+goto maybe_free;
+}
+
+req->is_cmb = true;
+} else {
+if (cmb) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+goto maybe_free;
+}
+
+req->is_cmb = false;
+}
+
+pci_dma_sglist_init(qsg, &n->parent_obj, nsgld);
+
+for (int i = 0; i < nsgld; i++) {
+uint64_t addr;
+uint32_t trans_len;
+
+if (len == 0) {
+if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
+status = NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+qemu_sglist_destroy(qsg);
+goto maybe_free;
+}
+
+break;
+}
+
+addr = le64_to_cpu(sgl_descriptors[i].addr);
+trans_len = MIN(len, le64_to_cpu(sgl_descriptors[i].unkeyed.len));
+
+if (req->is_cmb && !nvme_addr_is_cmb(n, addr)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+qemu_sglist_destroy(qsg);
+goto maybe_free;
+}
+
+qemu_sglist_add(qsg, addr, trans_len);
+
+len -= trans_len;

[Qemu-block] [PATCH 5/8] nvme: add support for metadata

2019-05-17 Thread Klaus Birkelund Jensen
The new `ms` parameter may be used to indicate the number of metadata
bytes provided per LBA.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 31 +--
 hw/block/nvme.h | 11 ++-
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index c514f93f3867..675967a596d1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -33,6 +33,8 @@
  *   num_ns=  : Namespaces to make out of the backing storage,
  *   Default:1
  *   num_queues=  : Number of possible IO Queues, Default:64
+ *   ms=  : Number of metadata bytes provided per LBA,
+ *   Default:0
  *   cmb_size_mb= : Size of CMB in MBs, Default:0
  *
  * Parameters will be verified against conflicting capabilities and attributes
@@ -386,6 +388,8 @@ static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 
 uint32_t unit_len = nvme_ns_lbads_bytes(ns);
 uint32_t len = req->nlb * unit_len;
+uint32_t meta_unit_len = nvme_ns_ms(ns);
+uint32_t meta_len = req->nlb * meta_unit_len;
 uint64_t prp1 = le64_to_cpu(cmd->prp1);
 uint64_t prp2 = le64_to_cpu(cmd->prp2);
 
@@ -399,6 +403,19 @@ static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return err;
 }
 
+qsg.nsg = 0;
+qsg.size = 0;
+
+if (cmd->mptr && n->params.ms) {
+qemu_sglist_add(&qsg, le64_to_cpu(cmd->mptr), meta_len);
+
+err = nvme_blk_setup(n, ns, &qsg, ns->blk_offset_md, meta_unit_len,
+req);
+if (err) {
+return err;
+}
+}
+
 qemu_sglist_destroy(&qsg);
 
 return NVME_SUCCESS;
@@ -1902,6 +1919,11 @@ static int nvme_check_constraints(NvmeCtrl *n, Error 
**errp)
 return 1;
 }
 
+if (params->ms && !is_power_of_2(params->ms)) {
+error_setg(errp, "nvme: invalid metadata configuration");
+return 1;
+}
+
 return 0;
 }
 
@@ -2066,17 +2088,20 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 
 static uint64_t nvme_ns_calc_blks(NvmeCtrl *n, NvmeNamespace *ns)
 {
-return n->ns_size / nvme_ns_lbads_bytes(ns);
+return n->ns_size / (nvme_ns_lbads_bytes(ns) + nvme_ns_ms(ns));
 }
 
 static void nvme_ns_init_identify(NvmeCtrl *n, NvmeIdNs *id_ns)
 {
+NvmeParams *params = &n->params;
+
 id_ns->nlbaf = 0;
 id_ns->flbas = 0;
-id_ns->mc = 0;
+id_ns->mc = params->ms ? 0x2 : 0;
 id_ns->dpc = 0;
 id_ns->dps = 0;
 id_ns->lbaf[0].lbads = BDRV_SECTOR_BITS;
+id_ns->lbaf[0].ms = params->ms;
 }
 
 static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -2086,6 +2111,8 @@ static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 nvme_ns_init_identify(n, id_ns);
 
 ns->ns_blks = nvme_ns_calc_blks(n, ns);
+ns->blk_offset_md = ns->blk_offset + nvme_ns_lbads_bytes(ns) * ns->ns_blks;
+
 id_ns->nuse = id_ns->ncap = id_ns->nsze = cpu_to_le64(ns->ns_blks);
 
 return 0;
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 711ca249eac5..81ee0c5173d5 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -8,13 +8,15 @@
 DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
 DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
 DEFINE_PROP_UINT32("num_ns", _state, _props.num_ns, 1), \
-DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7)
+DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7), \
+DEFINE_PROP_UINT8("ms", _state, _props.ms, 0)
 
 typedef struct NvmeParams {
 char *serial;
 uint32_t num_queues;
 uint32_t num_ns;
 uint8_t  mdts;
+uint8_t  ms;
 uint32_t cmb_size_mb;
 } NvmeParams;
 
@@ -91,6 +93,7 @@ typedef struct NvmeNamespace {
 uint32_tid;
 uint64_tns_blks;
 uint64_tblk_offset;
+uint64_tblk_offset_md;
 } NvmeNamespace;
 
 #define TYPE_NVME "nvme"
@@ -154,4 +157,10 @@ static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
 return 1 << nvme_ns_lbads(ns);
 }
 
+static inline uint16_t nvme_ns_ms(NvmeNamespace *ns)
+{
+NvmeIdNs *id = &ns->id_ns;
+return le16_to_cpu(id->lbaf[NVME_ID_NS_FLBAS_INDEX(id->flbas)].ms);
+}
+
 #endif /* HW_NVME_H */
-- 
2.21.0




[Qemu-block] [PATCH 3/8] nvme: simplify PRP mappings

2019-05-17 Thread Klaus Birkelund Jensen
Instead of handling both QSGs and IOVs in multiple places, simply use
QSGs everywhere by assuming that the request does not involve the
controller memory buffer (CMB). If the request is found to involve the
CMB, convert the QSG to an IOV and issue the I/O.

The QSG is converted to an IOV by the dma helpers anyway, so it is not
like the CMB path is unfairly affected by this simplifying change.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 205 +++---
 hw/block/nvme.h   |   3 +-
 hw/block/trace-events |   1 +
 include/block/nvme.h  |   1 +
 4 files changed, 138 insertions(+), 72 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 65dfc04f71e5..453213f9abb4 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -71,14 +71,21 @@
 
 static void nvme_process_sq(void *opaque);
 
+static inline uint8_t nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
+{
+return n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size));
+}
+
 static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
-if (n->cmbsz && addr >= n->ctrl_mem.addr &&
-addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+if (nvme_addr_is_cmb(n, addr)) {
 memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
-} else {
-pci_dma_read(&n->parent_obj, addr, buf, size);
+
+return;
 }
+
+pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
 static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
@@ -167,31 +174,48 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
 }
 }
 
-static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
- uint64_t prp2, uint32_t len, NvmeCtrl *n)
+static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
+uint64_t prp2, uint32_t len, NvmeRequest *req)
 {
 hwaddr trans_len = n->page_size - (prp1 % n->page_size);
 trans_len = MIN(len, trans_len);
 int num_prps = (len >> n->page_bits) + 1;
+uint16_t status = NVME_SUCCESS;
+
+trace_nvme_map_prp(req->cmd_opcode, trans_len, len, prp1, prp2, num_prps);
 
 if (unlikely(!prp1)) {
 trace_nvme_err_invalid_prp();
 return NVME_INVALID_FIELD | NVME_DNR;
-} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
-   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
-qsg->nsg = 0;
-qemu_iovec_init(iov, num_prps);
-qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
+}
+
+if (nvme_addr_is_cmb(n, prp1)) {
+NvmeSQueue *sq = req->sq;
+if (!nvme_addr_is_cmb(n, sq->dma_addr)) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+
+req->is_cmb = true;
 } else {
-pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
-qemu_sglist_add(qsg, prp1, trans_len);
+req->is_cmb = false;
 }
+
+pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+qemu_sglist_add(qsg, prp1, trans_len);
+
 len -= trans_len;
 if (len) {
 if (unlikely(!prp2)) {
 trace_nvme_err_invalid_prp2_missing();
+status = NVME_INVALID_FIELD | NVME_DNR;
 goto unmap;
 }
+
+if (req->is_cmb && !nvme_addr_is_cmb(n, prp2)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+goto unmap;
+}
+
 if (len > n->page_size) {
 uint64_t prp_list[n->max_prp_ents];
 uint32_t nents, prp_trans;
@@ -203,79 +227,99 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
 while (len != 0) {
 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
 
+if (req->is_cmb && !nvme_addr_is_cmb(n, prp_ent)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+goto unmap;
+}
+
 if (i == n->max_prp_ents - 1 && len > n->page_size) {
 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
 trace_nvme_err_invalid_prplist_ent(prp_ent);
+status = NVME_INVALID_FIELD | NVME_DNR;
 goto unmap;
 }
 
 i = 0;
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp_ent, (void *)prp_list,
-prp_trans);
+nvme_addr_read(n, prp_ent, (void *)prp_list, prp_trans);
  

[Qemu-block] [PATCH 7/8] nvme: keep a copy of the NVMe command in request

2019-05-17 Thread Klaus Birkelund Jensen
Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 4 ++--
 hw/block/nvme.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 81201a8b4834..5cd593806701 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -184,7 +184,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
 int num_prps = (len >> n->page_bits) + 1;
 uint16_t status = NVME_SUCCESS;
 
-trace_nvme_map_prp(req->cmd_opcode, trans_len, len, prp1, prp2, num_prps);
+trace_nvme_map_prp(req->cmd.opcode, trans_len, len, prp1, prp2, num_prps);
 
 if (unlikely(!prp1)) {
 trace_nvme_err_invalid_prp();
@@ -1559,7 +1559,7 @@ static void nvme_init_req(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 memset(&req->cqe, 0, sizeof(req->cqe));
 req->cqe.cid = le16_to_cpu(cmd->cid);
 
-req->cmd_opcode = cmd->opcode;
+memcpy(&req->cmd, cmd, sizeof(NvmeCmd));
 req->is_cmb = false;
 
 req->status = NVME_SUCCESS;
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 70f4781a1b61..7e1e026d90e6 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -52,7 +52,7 @@ typedef struct NvmeRequest {
 uint16_t status;
 bool is_cmb;
 bool is_write;
-uint8_t  cmd_opcode;
+NvmeCmd  cmd;
 
 QTAILQ_HEAD(, NvmeBlockBackendRequest) blk_req_tailq;
 QTAILQ_ENTRY(NvmeRequest)entry;
@@ -143,7 +143,7 @@ typedef struct NvmeCtrl {
 
 static inline bool nvme_rw_is_write(NvmeRequest *req)
 {
-return req->cmd_opcode == NVME_CMD_WRITE;
+return req->cmd.opcode == NVME_CMD_WRITE;
 }
 
 static inline bool nvme_is_error(uint16_t status, uint16_t err)
-- 
2.21.0




[Qemu-block] [PATCH 1/8] nvme: move device parameters to separate struct

2019-05-17 Thread Klaus Birkelund Jensen
Move device configuration parameters to a separate struct to make it
explicit what is configurable and what is set internally.

Also, clean up some includes.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 53 +++--
 hw/block/nvme.h | 16 ---
 2 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 7caf92532a09..b689c0776e72 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -27,17 +27,14 @@
 
 #include "qemu/osdep.h"
 #include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
 #include "hw/block/block.h"
-#include "hw/hw.h"
 #include "hw/pci/msix.h"
-#include "hw/pci/pci.h"
 #include "sysemu/sysemu.h"
-#include "qapi/error.h"
-#include "qapi/visitor.h"
 #include "sysemu/block-backend.h"
+#include "qapi/error.h"
 
-#include "qemu/log.h"
-#include "qemu/cutils.h"
 #include "trace.h"
 #include "nvme.h"
 
@@ -62,12 +59,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->num_queues && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->num_queues && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -605,7 +602,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_nvme_err_invalid_create_cq_addr(prp1);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
-if (unlikely(vector > n->num_queues)) {
+if (unlikely(vector > n->params.num_queues)) {
 trace_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -707,7 +704,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 trace_nvme_getfeat_numq(result);
 break;
 default:
@@ -731,9 +729,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 case NVME_NUMBER_OF_QUEUES:
 trace_nvme_setfeat_numq((dw11 & 0x) + 1,
 ((dw11 >> 16) & 0x) + 1,
-n->num_queues - 1, n->num_queues - 1);
-req->cqe.result =
-cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+n->params.num_queues - 1,
+n->params.num_queues - 1);
+req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
+  ((n->params.num_queues - 2) << 16));
 break;
 default:
 trace_nvme_err_invalid_setfeat(dw10);
@@ -802,12 +801,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
 blk_drain(n->conf.blk);
 
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->sq[i] != NULL) {
 nvme_free_sq(n->sq[i], n);
 }
 }
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->cq[i] != NULL) {
 nvme_free_cq(n->cq[i], n);
 }
@@ -1208,7 +1207,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 int64_t bs_size;
 uint8_t *pci_conf;
 
-if (!n->num_queues) {
+if (!n->params.num_queues) {
 error_setg(errp, "num_queues can't be zero");
 return;
 }
@@ -1224,7 +1223,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-if (!n->serial) {
+if (!n->params.serial) {
 error_setg(errp, "serial property not set");
 return;
 }
@@ -1241,25 +1240,25 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 pcie_endpoint_cap_init(pci_dev, 0x80);
 
 n->num_namespaces = 1;
-n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
+n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
 n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
-  

[Qemu-block] [PATCH 0/8] nvme: v1.3, sgls, metadata and new 'ocssd' device

2019-05-17 Thread Klaus Birkelund Jensen
Hi,

This series of patches contains a number of refactorings to the emulated
nvme device, adds additional features, such as support for metadata and
scatter gather lists, and bumps the supported NVMe version to 1.3.
Lastly, it contains a new 'ocssd' device.

The motivation for the first seven patches is to set everything up for
the final patch that adds a new 'ocssd' device and associated block
driver that implements the OpenChannel 2.0 specification[1]. Many of us
in the OpenChannel community have used a qemu fork[2] for emulation of
OpenChannel devices. The fork is itself based on Keith's qemu-nvme
tree[3] and we recently merged mainline qemu into it, but the result is
still a "hybrid" nvme device that supports both conventional nvme and
the OCSSD 2.0 spec through a 'dialect' mechanism. Merging instead of
rebasing also created a pretty messy commit history and my efforts to
try to rebase our work onto mainline were getting hairy to say the
least. And I was never really happy with the dialect approach anyway.

I have instead prepared this series of fresh patches that incrementally
adds additional features to the nvme device to bring it into shape for
finally introducing a new (and separate) 'ocssd' device that emulates an
OpenChannel 2.0 device by reusing core functionality from the nvme
device. Providing a separate ocssd device ensures that no ocssd specific
stuff creeps into the nvme device.

The ocssd device is backed by a new 'ocssd' block driver that holds
internal metadata and keeps state persistent across power cycles. In the
future I think we could use the same approach for the nvme device to
keep internal metadata such as utilization and deallocated blocks. For
now, the nvme device does not support the Deallocated and Unwritten
Logical Block Error (DULBE) feature or the Data Set Management command
as these would require such persistent state.

I have tried to make the patches to the nvme device in this series as
digestible as possible, but I understand that commit 310fcd5965e5 ("nvme:
bump supported spec to 1.3") is pretty huge. I can try to chop it up if
required, but the changes pretty much needs to be done in bulk to
actually implement v1.3.

This version was recently used to find a bug in use of SGLs in the Linux
kernel, so I believe there is some value in introducing these new
features. As for the ocssd device I believe that it is time it is
included upstream and not kept separately. I have knowledge of at least
one other qemu fork implementing OCSSD 2.0 used by the SPDK team and I
think we could all benefit from using a common implementation. The ocssd
device is feature complete with respect to the OCSSD 2.0 spec (mandatory
as well as optional features).

  [1]: http://lightnvm.io/docs/OCSSD-2_0-20180129.pdf
  [2]: https://github.com/OpenChannelSSD/qemu-nvme
  [3]: http://git.infradead.org/users/kbusch/qemu-nvme.git


Klaus Birkelund Jensen (8):
  nvme: move device parameters to separate struct
  nvme: bump supported spec to 1.3
  nvme: simplify PRP mappings
  nvme: allow multiple i/o's per request
  nvme: add support for metadata
  nvme: add support for scatter gather lists
  nvme: keep a copy of the NVMe command in request
  nvme: add an OpenChannel 2.0 NVMe device (ocssd)

 MAINTAINERS|   14 +-
 Makefile.objs  |1 +
 block.c|2 +-
 block/Makefile.objs|2 +-
 block/nvme.c   |   20 +-
 block/ocssd.c  |  690 ++
 hw/block/Makefile.objs |2 +-
 hw/block/nvme.c| 1405 ---
 hw/block/nvme.h|   92 --
 hw/block/nvme/nvme.c   | 2485 +
 hw/block/nvme/ocssd.c  | 2647 
 hw/block/nvme/ocssd.h  |  140 ++
 hw/block/nvme/trace-events |  136 ++
 hw/block/trace-events  |   91 --
 include/block/block_int.h  |3 +
 include/block/nvme.h   |  152 ++-
 include/block/ocssd.h  |  231 
 include/hw/block/nvme.h|  233 
 include/hw/pci/pci_ids.h   |2 +
 qapi/block-core.json   |   47 +-
 20 files changed, 6774 insertions(+), 1621 deletions(-)
 create mode 100644 block/ocssd.c
 delete mode 100644 hw/block/nvme.c
 delete mode 100644 hw/block/nvme.h
 create mode 100644 hw/block/nvme/nvme.c
 create mode 100644 hw/block/nvme/ocssd.c
 create mode 100644 hw/block/nvme/ocssd.h
 create mode 100644 hw/block/nvme/trace-events
 create mode 100644 include/block/ocssd.h
 create mode 100644 include/hw/block/nvme.h

-- 
2.21.0



[Qemu-block] [PATCH 4/8] nvme: allow multiple i/o's per request

2019-05-17 Thread Klaus Birkelund Jensen
Introduce a new NvmeBlockBackendRequest and move the QEMUSGList and
QEMUIOVector into it from the NvmeRequest.

This is in preparation for metadata support and makes it easier to
handle multiple block backend requests to different offsets.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 319 --
 hw/block/nvme.h   |  47 +--
 hw/block/trace-events |   2 +
 3 files changed, 286 insertions(+), 82 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 453213f9abb4..c514f93f3867 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -322,6 +322,88 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
 return err;
 }
 
+static void nvme_blk_req_destroy(NvmeBlockBackendRequest *blk_req)
+{
+if (blk_req->qsg.nalloc) {
+qemu_sglist_destroy(&blk_req->qsg);
+}
+
+if (blk_req->iov.nalloc) {
+qemu_iovec_destroy(&blk_req->iov);
+}
+
+g_free(blk_req);
+}
+
+static void nvme_blk_req_put(NvmeCtrl *n, NvmeBlockBackendRequest *blk_req)
+{
+nvme_blk_req_destroy(blk_req);
+}
+
+static NvmeBlockBackendRequest *nvme_blk_req_get(NvmeCtrl *n, NvmeRequest *req,
+QEMUSGList *qsg)
+{
+NvmeBlockBackendRequest *blk_req = g_malloc0(sizeof(*blk_req));
+
+blk_req->req = req;
+
+if (qsg) {
+pci_dma_sglist_init(&blk_req->qsg, &n->parent_obj, qsg->nsg);
+memcpy(blk_req->qsg.sg, qsg->sg, qsg->nsg * sizeof(ScatterGatherEntry));
+
+blk_req->qsg.nsg = qsg->nsg;
+blk_req->qsg.size = qsg->size;
+}
+
+return blk_req;
+}
+
+static uint16_t nvme_blk_setup(NvmeCtrl *n, NvmeNamespace *ns, QEMUSGList *qsg,
+uint64_t blk_offset, uint32_t unit_len, NvmeRequest *req)
+{
+NvmeBlockBackendRequest *blk_req = nvme_blk_req_get(n, req, qsg);
+if (!blk_req) {
+NVME_GUEST_ERR(nvme_err_internal_dev_error, "nvme_blk_req_get: %s",
+"could not allocate memory");
+return NVME_INTERNAL_DEV_ERROR;
+}
+
+blk_req->slba = req->slba;
+blk_req->nlb = req->nlb;
+blk_req->blk_offset = blk_offset + req->slba * unit_len;
+
+QTAILQ_INSERT_TAIL(&req->blk_req_tailq, blk_req, tailq_entry);
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+NvmeNamespace *ns = req->ns;
+uint16_t err;
+
+QEMUSGList qsg;
+
+uint32_t unit_len = nvme_ns_lbads_bytes(ns);
+uint32_t len = req->nlb * unit_len;
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+err = nvme_map_prp(n, &qsg, prp1, prp2, len, req);
+if (err) {
+return err;
+}
+
+err = nvme_blk_setup(n, ns, &qsg, ns->blk_offset, unit_len, req);
+if (err) {
+return err;
+}
+
+qemu_sglist_destroy(&qsg);
+
+return NVME_SUCCESS;
+}
+
 static void nvme_post_cqe(NvmeCQueue *cq, NvmeRequest *req)
 {
 NvmeCtrl *n = cq->ctrl;
@@ -447,114 +529,190 @@ static void nvme_process_aers(void *opaque)
 
 static void nvme_rw_cb(void *opaque, int ret)
 {
-NvmeRequest *req = opaque;
+NvmeBlockBackendRequest *blk_req = opaque;
+NvmeRequest *req = blk_req->req;
 NvmeSQueue *sq = req->sq;
 NvmeCtrl *n = sq->ctrl;
 NvmeCQueue *cq = n->cq[sq->cqid];
+NvmeNamespace *ns = req->ns;
+
+QTAILQ_REMOVE(&req->blk_req_tailq, blk_req, tailq_entry);
+
+trace_nvme_rw_cb(req->cqe.cid, ns->id);
 
 if (!ret) {
-block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
-req->status = NVME_SUCCESS;
+block_acct_done(blk_get_stats(n->conf.blk), &blk_req->acct);
 } else {
-block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
-req->status = NVME_INTERNAL_DEV_ERROR;
+block_acct_failed(blk_get_stats(n->conf.blk), &blk_req->acct);
+NVME_GUEST_ERR(nvme_err_internal_dev_error, "block request failed: %s", strerror(-ret));
+req->status = NVME_INTERNAL_DEV_ERROR | NVME_DNR;
 }
 
-if (req->qsg.nalloc) {
-qemu_sglist_destroy(&req->qsg);
-}
-if (req->iov.nalloc) {
-qemu_iovec_destroy(&req->iov);
+if (QTAILQ_EMPTY(&req->blk_req_tailq)) {
+nvme_enqueue_req_completion(cq, req);
 }
 
-nvme_enqueue_req_completion(cq, req);
+nvme_blk_req_put(n, blk_req);
 }
 
-static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
-NvmeRequest *req)
+static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
-block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
+NvmeBlockBackendRequest *blk_req = nvme_blk_req_get(n, req, NULL);
+if (!blk_req) {
+NVME_GUEST_ERR(nvme_err_internal_dev_error, "nvme_b