Re: [PATCH v2 00/16] nvme: refactoring and cleanups

2020-04-22 Thread Klaus Birkelund Jensen
On Apr 21 19:24, Maxim Levitsky wrote:
> Should I also review the V7 series, or should I wait for V8, which will
> not include these cleanups?
 
Hi Maxim,

Just wait for the next series - I don't think I will post a v8; I will
chop up the series into smaller ones instead.

Most patches will hopefully not change too much, so they should keep your
Reviewed-by tags ;)


Thanks,
Klaus



Re: [PATCH v2 00/16] nvme: refactoring and cleanups

2020-04-21 Thread Klaus Birkelund Jensen
On Apr 21 02:38, Keith Busch wrote:
> The series looks good to me.
> 
> Reviewed-by: Keith Busch 

Thanks for the review, Keith!

Kevin, should I rebase this on block-next? I think it might have some
conflicts with the PMR patch that went in previously.

Philippe, then I can also change the *err to *local_err ;)



Re: [PATCH v2 00/16] nvme: refactoring and cleanups

2020-04-19 Thread Klaus Birkelund Jensen
On Apr 15 15:01, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Changes since v1
> 
> * nvme: fix pci doorbell size calculation
>   - added some defines and a better comment (Philippe)
> 
> * nvme: rename trace events to pci_nvme
>   - changed the prefix from nvme_dev to pci_nvme (Philippe)
> 
> * nvme: add max_ioqpairs device parameter
>   - added a deprecation comment. I doubt this will go in until 5.1, so
> changed it to "deprecated from 5.1" (Philippe)
> 
> * nvme: factor out property/constraint checks
> * nvme: factor out block backend setup
>   - changed to return void and propagate errors in proper QEMU style
> (Philippe)
> 
> * nvme: add namespace helpers
>   - use the helper immediately (Philippe)
> 
> * nvme: factor out pci setup
>   - removed setting of vendor and device id which is already inherited
> from nvme_class_init() (Philippe)
> 
> * nvme: factor out cmb setup
>   - add lost comment (Philippe)
> 
> 
> Klaus Jensen (16):
>   nvme: fix pci doorbell size calculation
>   nvme: rename trace events to pci_nvme
>   nvme: remove superfluous breaks
>   nvme: move device parameters to separate struct
>   nvme: use constants in identify
>   nvme: refactor nvme_addr_read
>   nvme: add max_ioqpairs device parameter
>   nvme: remove redundant cmbloc/cmbsz members
>   nvme: factor out property/constraint checks
>   nvme: factor out device state setup
>   nvme: factor out block backend setup
>   nvme: add namespace helpers
>   nvme: factor out namespace setup
>   nvme: factor out pci setup
>   nvme: factor out cmb setup
>   nvme: factor out controller identify setup
> 
>  hw/block/nvme.c   | 433 --
>  hw/block/nvme.h   |  36 +++-
>  hw/block/trace-events | 172 -
>  include/block/nvme.h  |   8 +
>  4 files changed, 372 insertions(+), 277 deletions(-)
> 
> -- 
> 2.26.0
> 

Hi Keith,

You have acked most of this previously, but not in its most recent
state. Since a good number of the refactoring patches have been split up
and changed, only a small subset of the patches still carry your
Acked-by.

The 'nvme: fix pci doorbell size calculation' and 'nvme: add
max_ioqpairs device parameter' patches are new since your ack, and given
their nature a review from you would be nice :)


Thanks,
Klaus



Re: [PATCH v2 13/16] nvme: factor out namespace setup

2020-04-16 Thread Klaus Birkelund Jensen
On Apr 15 15:26, Philippe Mathieu-Daudé wrote:
> On 4/15/20 3:20 PM, Klaus Birkelund Jensen wrote:
> > 
> > I'll get the v1.3 series ready next.
> > 
> 
> Cool. What really matters (to me) is seeing tests. If we can merge tests
> (without multiple namespaces) before the rest of your series, even better.
> Tests give reviewers/maintainers confidence that code isn't breaking ;)
> 

The patches that I contribute have been pretty extensively tested by
various means in a "host setting" (e.g. blktests and some internal
tools), which really exercise the device by doing heavy I/O, testing for
compliance and also just being mean to it (e.g. tripping bus mastering
while doing I/O).

Don't misunderstand me as trying to weasel my way out of writing tests,
but I just want to understand the scope of the tests that you are
looking for. I believe (hope!) that you are not asking me to implement a
user-space NVMe driver in the test, so I assume the tests should verify
more low-level details?



Re: [PATCH v2 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 15:16, Philippe Mathieu-Daudé wrote:
> On 4/15/20 3:01 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.c | 46 ++
> >   1 file changed, 26 insertions(+), 20 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index d5244102252c..2b007115c302 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1358,6 +1358,27 @@ static void nvme_init_blk(NvmeCtrl *n, Error **errp)
> > false, errp);
> >   }
> > +static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error 
> > **errp)
> > +{
> > +int64_t bs_size;
> > +NvmeIdNs *id_ns = &ns->id_ns;
> > +
> > +bs_size = blk_getlength(n->conf.blk);
> > +if (bs_size < 0) {
> > +error_setg_errno(errp, -bs_size, "could not get backing file 
> > size");
> > +return;
> > +}
> > +
> > +n->ns_size = bs_size;
> > +
> > +id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
> > +
> > +/* no thin provisioning */
> > +id_ns->ncap = id_ns->nsze;
> > +id_ns->nuse = id_ns->ncap;
> > +}
> > +
> >   static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >   {
> >   NvmeCtrl *n = NVME(pci_dev);
> > @@ -1365,7 +1386,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   Error *err = NULL;
> >   int i;
> > -int64_t bs_size;
> >   uint8_t *pci_conf;
> >   nvme_check_constraints(n, &err);
> > @@ -1376,12 +1396,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   nvme_init_state(n);
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg(errp, "could not get backing file size");
> > -return;
> > -}
> > -
> >   nvme_init_blk(n, &err);
> >   if (err) {
> >   error_propagate(errp, err);
> > @@ -1394,8 +1408,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> >   pcie_endpoint_cap_init(pci_dev, 0x80);
> > -n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> 
> Valid because currently 'n->num_namespaces' = 1, OK.
> 
> Reviewed-by: Philippe Mathieu-Daudé 
> 
 
Thank you for the reviews, Philippe, and for suggesting that I split up
the series :)

I'll get the v1.3 series ready next.



Re: [PATCH 11/16] nvme: factor out block backend setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 13:02, Klaus Birkelund Jensen wrote:
> On Apr 15 12:52, Philippe Mathieu-Daudé wrote:
> > On 4/15/20 12:24 PM, Klaus Jensen wrote:
> > > From: Klaus Jensen 
> > > 
> > > Signed-off-by: Klaus Jensen 
> > > ---
> > >   hw/block/nvme.c | 15 ---
> > >   1 file changed, 12 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index e67f578fbf79..f0989cbb4335 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -1348,6 +1348,17 @@ static void nvme_init_state(NvmeCtrl *n)
> > >   n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
> > >   }
> > > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > > +{
> > > +blkconf_blocksizes(&n->conf);
> > > +if (!blkconf_apply_backend_options(&n->conf, 
> > > blk_is_read_only(n->conf.blk),
> > > +   false, errp)) {
> > > +return -1;
> > > +}
> > > +
> > > +return 0;
> > 
> > I'm not sure this is a correct usage of the 'propagating errors' API (see
> > CODING_STYLE.rst and include/qapi/error.h), I'd expect this function to
> > return void, and use a local_error & error_propagate() in nvme_realize().
> > 
> > However this works, so:
> > Reviewed-by: Philippe Mathieu-Daudé 
> > 
> 
> So, I get that, and I did use the propagate functionality earlier, but I
> still used the int return. I'm not sure about the style when returning
> void - should I check whether errp is now non-NULL? The point is that I
> need to return early, since the later calls could fail if previous calls
> did not complete successfully.

Nevermind, I got it. I've changed it to propagate it correctly.
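
For reference, a minimal sketch of the propagation style Philippe points to
(following CODING_STYLE.rst and include/qapi/error.h; bodies trimmed, not the
exact code that will be posted):

static void nvme_init_blk(NvmeCtrl *n, Error **errp)
{
    blkconf_blocksizes(&n->conf);
    if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
                                       false, errp)) {
        /* errp is already set by the failing call; nothing more to do */
        return;
    }
}

static void nvme_realize(PCIDevice *pci_dev, Error **errp)
{
    NvmeCtrl *n = NVME(pci_dev);
    Error *local_err = NULL;

    nvme_init_blk(n, &local_err);
    if (local_err) {
        error_propagate(errp, local_err);
        /* return early so later setup does not run after a failure */
        return;
    }

    /* ... remaining setup ... */
}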



Re: [PATCH 11/16] nvme: factor out block backend setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 12:52, Philippe Mathieu-Daudé wrote:
> On 4/15/20 12:24 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.c | 15 ---
> >   1 file changed, 12 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index e67f578fbf79..f0989cbb4335 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1348,6 +1348,17 @@ static void nvme_init_state(NvmeCtrl *n)
> >   n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
> >   }
> > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > +{
> > +blkconf_blocksizes(&n->conf);
> > +if (!blkconf_apply_backend_options(&n->conf, 
> > blk_is_read_only(n->conf.blk),
> > +   false, errp)) {
> > +return -1;
> > +}
> > +
> > +return 0;
> 
> I'm not sure this is a correct usage of the 'propagating errors' API (see
> CODING_STYLE.rst and include/qapi/error.h), I'd expect this function to
> return void, and use a local_error & error_propagate() in nvme_realize().
> 
> However this works, so:
> Reviewed-by: Philippe Mathieu-Daudé 
> 

So, I get that, and I did use the propagate functionality earlier, but I
still used the int return. I'm not sure about the style when returning
void - should I check whether errp is now non-NULL? The point is that I
need to return early, since the later calls could fail if previous calls
did not complete successfully.



Re: [PATCH 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 12:53, Klaus Birkelund Jensen wrote:
> On Apr 15 12:38, Philippe Mathieu-Daudé wrote:
> > 
> > I'm not sure this line belong to this patch.
> > 
>  
> It does. It is already there in the middle of the realize function. It
> is moved to nvme_init_namespace as
> 
>   n->ns_size = bs_size;
> 
> since only a single namespace can be configured anyway. I will remove
> the for-loop that initializes multiple namespaces as well.

I'm going to backtrack on that. Removing that for-loop is just noise, I
think.



Re: [PATCH 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 12:38, Philippe Mathieu-Daudé wrote:
> On 4/15/20 12:24 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.c | 47 ++-
> >   1 file changed, 26 insertions(+), 21 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f0989cbb4335..08f7ae0a48b3 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1359,13 +1359,35 @@ static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> >   return 0;
> >   }
> > +static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error 
> > **errp)
> > +{
> > +int64_t bs_size;
> > +NvmeIdNs *id_ns = &ns->id_ns;
> > +
> > +bs_size = blk_getlength(n->conf.blk);
> > +if (bs_size < 0) {
> > +error_setg_errno(errp, -bs_size, "could not get backing file 
> > size");
> > +return -1;
> > +}
> > +
> > +n->ns_size = bs_size;
> > +
> > +id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
> > +
> > +/* no thin provisioning */
> > +id_ns->ncap = id_ns->nsze;
> > +id_ns->nuse = id_ns->ncap;
> > +
> > +return 0;
> > +}
> > +
> >   static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >   {
> >   NvmeCtrl *n = NVME(pci_dev);
> >   NvmeIdCtrl *id = &n->id_ctrl;
> >   int i;
> > -int64_t bs_size;
> >   uint8_t *pci_conf;
> >   if (nvme_check_constraints(n, errp)) {
> > @@ -1374,12 +1396,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   nvme_init_state(n);
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg(errp, "could not get backing file size");
> > -return;
> > -}
> > -
> >   if (nvme_init_blk(n, errp)) {
> >   return;
> >   }
> > @@ -1390,8 +1406,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> >   pcie_endpoint_cap_init(pci_dev, 0x80);
> > -n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> 
> I'm not sure this line belong to this patch.
> 
 
It does. It is already there in the middle of the realize function. It
is moved to nvme_init_namespace as

  n->ns_size = bs_size;

since only a single namespace can be configured anyway. I will remove
the for-loop that initializes multiple namespaces as well.



Re: [PATCH v7 11/48] nvme: refactor device realization

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:55, Philippe Mathieu-Daudé wrote:
> On 4/15/20 9:25 AM, Klaus Birkelund Jensen wrote:
> > On Apr 15 09:14, Philippe Mathieu-Daudé wrote:
> > > Hi Klaus,
> > > 
> > > This patch is a pain to review... Could you split it? I'd use one trivial
> > > patch for each function extracted from nvme_realize().
> > > 
> > 
> > Understood, I will split it up!
> 
> Thanks, that will help the review.
> 
> As this series is quite big, I recommend you split it, so that parts of it can
> get merged quicker and you don't have to carry tons of patches that scare
> reviewers/maintainers.
> 
> Suggestions:
> 
> - 1: cleanups/refactors
> - 2: support v1.3
> - 3: more refactors, strengthening code
> - 4: improve DMA & S/G
> - 5: support for multiple NS
> - 6: tests for multiple NS feature
> - 7: tests bus unplug/replug (idea)
> 
> Or less :)
> 

Okay, good idea. Thanks.



Re: [PATCH v7 45/48] nvme: support multiple namespaces

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:38, Philippe Mathieu-Daudé wrote:
> On 4/15/20 7:51 AM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and register
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >-drive file=nvme0n1.img,if=none,id=disk1
> >-drive file=nvme0n2.img,if=none,id=disk2
> >-device nvme,serial=deadbeef,id=nvme0
> >-device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >-device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > Reviewed-by: Keith Busch 
> > ---
> >   hw/block/Makefile.objs |   2 +-
> >   hw/block/nvme-ns.c | 157 +++
> >   hw/block/nvme-ns.h |  60 +++
> >   hw/block/nvme.c| 233 +++--
> >   hw/block/nvme.h|  47 -
> >   hw/block/trace-events  |   8 +-
> >   6 files changed, 396 insertions(+), 111 deletions(-)
> >   create mode 100644 hw/block/nvme-ns.c
> >   create mode 100644 hw/block/nvme-ns.h
> > 
> > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> > index 4b4a2b338dc4..d9141d6a4b9b 100644
> > --- a/hw/block/Makefile.objs
> > +++ b/hw/block/Makefile.objs
> > @@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
> >   common-obj-$(CONFIG_XEN) += xen-block.o
> >   common-obj-$(CONFIG_ECC) += ecc.o
> >   common-obj-$(CONFIG_ONENAND) += onenand.o
> > -common-obj-$(CONFIG_NVME_PCI) += nvme.o
> > +common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
> >   common-obj-$(CONFIG_SWIM) += swim.o
> >   common-obj-$(CONFIG_SH4) += tc58128.o
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > new file mode 100644
> > index ..bd64d4a94632
> > --- /dev/null
> > +++ b/hw/block/nvme-ns.c
> > @@ -0,0 +1,157 @@
> 
> Missing copyright + license.
> 

Fixed.

> > +
> > +switch (n->conf.wce) {
> > +case ON_OFF_AUTO_ON:
> > +n->features.volatile_wc = 1;
> > +break;
> > +case ON_OFF_AUTO_OFF:
> > +n->features.volatile_wc = 0;
> 
> Missing 'break'?
> 

Ouch. Fixed.

> > +case ON_OFF_AUTO_AUTO:
> > +n->features.volatile_wc = blk_enable_write_cache(ns->blk);
> > +break;
> > +default:
> > +abort();
> > +}
> > +
> > +blk_set_enable_write_cache(ns->blk, n->features.volatile_wc);
> > +
> > +return 0;
> > +}
> > +
> > +static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
> > +{
> > +if (!ns->blk) {
> > +error_setg(errp, "block backend not configured");
> > +return -1;
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > +{
> > +if (nvme_ns_check_constraints(ns, errp)) {
> > +return -1;
> > +}
> > +
> > +if (nvme_ns_init_blk(n, ns, &n->id_ctrl, errp)) {
> > +return -1;
> > +}
> > +
> > +nvme_ns_init(ns);
> > +if (nvme_register_namespace(n, ns, errp)) {
> > +return -1;
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +static void nvme_ns_realize(DeviceState *dev, Error **errp)
> > +{
> > +NvmeNamespace *ns = NVME_NS(dev);
> > +BusState *s = qdev_get_parent_bus(dev);
> > +NvmeCtrl *n = NVME(s->parent);
> > +Error *local_err = NULL;
> > +
> > +if (nvme_ns_setup(n, ns, &local_err)) {
> > +error_propagate_prepend(errp, local_err,
> > +"could not setup namespace: ");
> > +return;
> > +}
> > +}
> > +
> > +static Property nvme_ns_props[] = {
> > +DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
> > +DEFINE_PROP_END_OF_LIST(),
> > +};
> > +
> > +static void nvme_ns_class_init(ObjectClass *oc, void *data)
> > +{
> > +DeviceClass *dc = DEVICE_CLASS(oc);
> > +
> > +set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
> > +
> > +dc->bus_type = TYPE_NVME_BUS;
> > +dc->realize = nvme_ns_realize;
> > +device_class_set_props(dc, nvme_ns_props);
> > +dc->desc = "virtual nvme namespace";
> 
> "Virtual NVMe namespace"?
> 

Fixed.

> > +}
> > +
> > +static void nvme_ns_instance_init(Object *obj)
> > +{
> > +NvmeNamespace *ns = NVME_NS(obj);
> > +char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
> > +
> > +device_add_bootindex_property(obj, &ns->bootindex, "bootindex",
> > +  bootindex, DEVICE(obj), &error_abort);
> > +
> > +g_free(bootindex);
> > +}
> > +
> > +static const TypeInfo nvme_ns_info = {
> > +.name = TYPE_NVME_NS,
> > +.parent = 

Re: [PATCH v7 06/48] nvme: refactor nvme_addr_read

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:03, Philippe Mathieu-Daudé wrote:
> On 4/15/20 7:50 AM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Pull the controller memory buffer check to its own function. The check
> > will be used on its own in later patches.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >   hw/block/nvme.c | 16 
> >   1 file changed, 12 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 622103c42d0a..02d3dde90842 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -52,14 +52,22 @@
> >   static void nvme_process_sq(void *opaque);
> > +static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> 
> 'inline' not really necessary here.
> 

Fixed.




Re: [PATCH v7 12/48] nvme: add temperature threshold feature

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:24, Klaus Birkelund Jensen wrote:
> On Apr 15 09:19, Philippe Mathieu-Daudé wrote:
> > On 4/15/20 7:51 AM, Klaus Jensen wrote:
> > > From: Klaus Jensen 
> > > 
> > > It might seem wierd to implement this feature for an emulated device,
> > 
> > 'weird'
> 
> Thanks, fixed :)
> 
> > 
> > > but it is mandatory to support and the feature is useful for testing
> > > asynchronous event request support, which will be added in a later
> > > patch.
> > 
> > Which patch? I can't find how you set the temperature in this series.
> > 
> 
> The temperature cannot be changed, but the thresholds can with the Set
> Features command (and that can then trigger AERs). That is added in
> "nvme: add temperature threshold feature" and "nvme: add support for the
> asynchronous event request command" respectively.
> 
> There is a test in SPDK that does this.
> 

Oh, I think I misunderstood you.

No, setting the temperature was moved to the "nvme: add support for the
get log page command" patch since that is the patch that actually uses
it. This was done at Maxim's request in an earlier review.



Re: [PATCH v7 11/48] nvme: refactor device realization

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:14, Philippe Mathieu-Daudé wrote:
> Hi Klaus,
> 
> This patch is a pain to review... Could you split it? I'd use one trivial
> patch for each function extracted from nvme_realize().
> 

Understood, I will split it up!



Re: [PATCH v7 12/48] nvme: add temperature threshold feature

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:19, Philippe Mathieu-Daudé wrote:
> On 4/15/20 7:51 AM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > It might seem wierd to implement this feature for an emulated device,
> 
> 'weird'

Thanks, fixed :)

> 
> > but it is mandatory to support and the feature is useful for testing
> > asynchronous event request support, which will be added in a later
> > patch.
> 
> Which patch? I can't find how you set the temperature in this series.
> 

The temperature cannot be changed, but the thresholds can with the Set
Features command (and that can then trigger AERs). That is added in
"nvme: add temperature threshold feature" and "nvme: add support for the
asynchronous event request command" respectively.

There is a test in SPDK that does this.




Re: [PATCH v7 10/48] nvme: remove redundant cmbloc/cmbsz members

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:10, Philippe Mathieu-Daudé wrote:
> 
> "hw/block/nvme.h" should not pull in "block/nvme.h", both should include a
> common "hw/block/nvme_spec.h" (or better named). Not related to this patch
> although.
> 

Hmm. It does pull in "include/block/nvme.h", which is basically the
"nvme_spec.h" you are talking about. That file holds all the spec-related
structs that are shared between the VFIO-based nvme driver
(block/nvme.c) and the emulated device (hw/block/nvme.c).

Isn't that what is intended?



Re: [PATCH v6 32/42] nvme: allow multiple aios per command

2020-04-08 Thread Klaus Birkelund Jensen
On Mar 31 12:10, Maxim Levitsky wrote:
> On Tue, 2020-03-31 at 07:47 +0200, Klaus Birkelund Jensen wrote:
> > On Mar 25 12:57, Maxim Levitsky wrote:
> > > On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > > > @@ -516,10 +613,10 @@ static inline uint16_t nvme_check_prinfo(NvmeCtrl 
> > > > *n, NvmeNamespace *ns,
> > > >  return NVME_SUCCESS;
> > > >  }
> > > >  
> > > > -static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace 
> > > > *ns,
> > > > - uint64_t slba, uint32_t nlb,
> > > > - NvmeRequest *req)
> > > > +static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
> > > > + uint32_t nlb, NvmeRequest 
> > > > *req)
> > > >  {
> > > > +NvmeNamespace *ns = req->ns;
> > > >  uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
> > > 
> > > This should go to the patch that added nvme_check_bounds as well
> > > 
> > 
> > We can't really, because the NvmeRequest does not hold a reference to
> > the namespace as a struct member at that point. This is also an issue
> > with the nvme_check_prinfo function above.
> 
> I see it now. The changes to NvmeRequest together with this are a good 
> candidate
> to split from this patch to get this patch to size that is easy to review.
> 

I'm factoring those changes and other stuff out into separate patches!





Re: [PATCH v6 14/42] nvme: add missing mandatory features

2020-04-08 Thread Klaus Birkelund Jensen
On Mar 31 12:39, Maxim Levitsky wrote:
> On Tue, 2020-03-31 at 07:41 +0200, Klaus Birkelund Jensen wrote:
> > On Mar 25 12:41, Maxim Levitsky wrote:
> > > BTW the user of the device doesn't have to have 1:1 mapping between qid 
> > > and msi interrupt index,
> > > in fact when MSI is not used, all the queues will map to the same vector, 
> > > which will be interrupt 0
> > > from point of view of the device IMHO.
> > > So it kind of makes sense IMHO to have num_irqs or something, even if it 
> > > technically equals to number of queues.
> > > 
> > 
> > Yeah, but the device will still *support* the N IVs, so they can still
> > be configured even though they will not be used. So I don't think we
> > need to introduce an additional parameter?
> 
> Yes and no.
> I wasn't thinking of adding a new parameter for the number of supported
> interrupt vectors, but just of having an internal variable to represent it,
> so that we could support, in the future, the case where these are not equal.
> 
> Also, from the point of view of validating the users of this virtual nvme
> drive, I think it kind of makes sense to allow having fewer supported IRQ
> vectors than IO queues, to check how userspace copes with it. It is valid,
> after all, to have the same interrupt vector shared between multiple queues.
> 

I see that this could be useful for testing, but I think we can defer
that to a later patch. Would you be okay with that for now?

> In fact, in theory (but that would complicate the implementation greatly) we
> should even support the case where the number of submission queues is not
> equal to the number of completion queues. Yes, nobody does that in real
> hardware, and at least the Linux nvme driver hard-assumes a 1:1 SQ/CQ
> mapping, but still.
> 

It is not the hardware that decides this, and I believe that there
definitely are applications that choose to associate multiple SQs with
a single CQ. The CQ is an attribute of the SQ, and the IV of the CQ is
also specified in the create command. I believe this is already
supported.
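
For reference, this is roughly how that wiring looks at the command level (my
reading of NVM Express 1.3, sections 5.3 and 5.4; the two helpers below are
just an illustration, not code from the series):

/*
 * Create I/O Completion Queue: CDW11 bits 31:16 carry the interrupt vector.
 * Create I/O Submission Queue: CDW11 bits 31:16 carry the CQID it posts to.
 * Several SQs may name the same CQID, so the SQ:CQ mapping need not be 1:1.
 */
static inline uint16_t create_cq_iv(uint32_t cdw11)   { return cdw11 >> 16; }
static inline uint16_t create_sq_cqid(uint32_t cdw11) { return cdw11 >> 16; }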

> My nvme-mdev doesn't make this assumption (nor any assumptions about
> interrupt vector counts) and allows the user to have any SQ/CQ mapping as far
> as the spec allows (but it does hardcode the maximum number of SQs/CQs
> supported).
> 
> BTW, I haven't looked at that, but we should check that the virtual nvme
> drive can cope with using legacy interrupts (that is, MSI disabled) -
> nvme-mdev does support this and was tested with it.
> 

Yes, this is definitely not very well tested.

If you insist on all of the above being implemented, then I will do it,
but I would rather defer this to later patches as this series is already
pretty large ;)



Re: [PATCH v1] nvme: indicate CMB support through controller capabilities register

2020-04-07 Thread Klaus Birkelund Jensen
On Apr  1 11:42, Andrzej Jakowski wrote:
> This patch sets the CMBS bit in the controller capabilities register when the
> user configures the NVMe driver with CMB support, so capabilities are
> correctly reported to the guest OS.
> 
> Signed-off-by: Andrzej Jakowski 
> ---
>  hw/block/nvme.c  | 2 ++
>  include/block/nvme.h | 4 
>  2 files changed, 6 insertions(+)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index d28335cbf3..986803398f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1393,6 +1393,8 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> **errp)
>  n->bar.intmc = n->bar.intms = 0;
>  
>  if (n->cmb_size_mb) {
> +/* Controller capabilities */
> +NVME_CAP_SET_CMBS(n->bar.cap, 1);
>  
>  NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
>  NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 8fb941c653..561891b140 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -27,6 +27,7 @@ enum NvmeCapShift {
>  CAP_CSS_SHIFT  = 37,
>  CAP_MPSMIN_SHIFT   = 48,
>  CAP_MPSMAX_SHIFT   = 52,
> +CAP_CMB_SHIFT  = 57,
>  };
>  
>  enum NvmeCapMask {
> @@ -39,6 +40,7 @@ enum NvmeCapMask {
>  CAP_CSS_MASK   = 0xff,
>  CAP_MPSMIN_MASK= 0xf,
>  CAP_MPSMAX_MASK= 0xf,
> +CAP_CMB_MASK   = 0x1,
>  };
>  
>  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
> @@ -69,6 +71,8 @@ enum NvmeCapMask {
> << 
> CAP_MPSMIN_SHIFT)
>  #define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & 
> CAP_MPSMAX_MASK)\
>  << 
> CAP_MPSMAX_SHIFT)
> +#define NVME_CAP_SET_CMBS(cap, val) (cap |= (uint64_t)(val & CAP_CMB_MASK)\
> +<< CAP_CMB_SHIFT)
>  
>  enum NvmeCcShift {
>  CC_EN_SHIFT = 0,
> -- 
> 2.21.1
> 

Looks good.

Reviewed-by: Klaus Jensen 



Re: [PATCH v6 12/42] nvme: add support for the get log page command

2020-03-31 Thread Klaus Birkelund Jensen
On Mar 31 12:45, Maxim Levitsky wrote:
> On Tue, 2020-03-31 at 07:41 +0200, Klaus Birkelund Jensen wrote:
> > On Mar 25 12:40, Maxim Levitsky wrote:
> > > On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > > > From: Klaus Jensen 
> > > > 
> > > > Add support for the Get Log Page command and basic implementations of
> > > > the mandatory Error Information, SMART / Health Information and Firmware
> > > > Slot Information log pages.
> > > > 
> > > > In violation of the specification, the SMART / Health Information log
> > > > page does not persist information over the lifetime of the controller
> > > > because the device has no place to store such persistent state.
> > > > 
> > > > Note that the LPA field in the Identify Controller data structure
> > > > intentionally has bit 0 cleared because there is no namespace specific
> > > > information in the SMART / Health information log page.
> > > > 
> > > > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > > > Section 5.10 ("Get Log Page command").
> > > > 
> > > > Signed-off-by: Klaus Jensen 
> > > > Acked-by: Keith Busch 
> > > > ---
> > > >  hw/block/nvme.c   | 138 +-
> > > >  hw/block/nvme.h   |  10 +++
> > > >  hw/block/trace-events |   2 +
> > > >  3 files changed, 149 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > > index 64c42101df5c..83ff3fbfb463 100644
> > > > 
> > > > +static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > > > buf_len,
> > > > +uint64_t off, NvmeRequest *req)
> > > > +{
> > > > +uint32_t trans_len;
> > > > +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
> > > > +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
> > > > +uint8_t errlog[64];
> > > 
> > > I would replace this with sizeof(NvmeErrorLogEntry)
> > > (and add NvmeErrorLogEntry to the nvme.h), just for the sake of 
> > > consistency,
> > > and in case we end up reporting some errors to the log in the future.
> > > 
> > 
> > NvmeErrorLog is already in nvme.h; Fixed to actually use it.
> True that! I would rename it to NvmeErrorLogEntry, to be honest
> (in that patch that added many nvme spec changes) but I don't mind
> keeping it as is as well.
> 
 
It is used in the block driver (block/nvme.c) as well, and I'd rather
not involve that too much in this series.



Re: [PATCH v6 07/42] nvme: refactor nvme_addr_read

2020-03-31 Thread Klaus Birkelund Jensen
On Mar 31 13:41, Maxim Levitsky wrote:
> On Tue, 2020-03-31 at 07:39 +0200, Klaus Birkelund Jensen wrote:
> > On Mar 25 12:38, Maxim Levitsky wrote:
> > > Note that this patch still contains a bug in that it removes the check
> > > against the accessed size, which you fix in a later patch.
> > > I prefer not to add a bug in the first place.
> > > However, if you have a reason for this, I won't mind.
> > > 
> > 
> > So yeah. The reason is that there is actually no bug at this point,
> > because the controller only supports PRPs. I actually thought there was
> > a bug as well and reported it to qemu-security some months ago as a
> > potential out-of-bounds access. I was then schooled by Keith on how PRPs
> > work ;) Below is a paraphrased version of Keith's analysis.
> > 
> > The PRPs do not cross page boundaries:
> True
> 
> > 
> > trans_len = n->page_size - (prp1 % n->page_size);
> > 
> > The PRPs are always verified to be page aligned:
> True
> > 
> > if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
> > 
> > and the transfer length won't go above the page size. So, since the
> > beginning of the address is within the CMB, and considering that the CMB
> > is MB-aligned and MB-sized, we can never cross outside it with PRPs.
> I understand now, however the reason I am arguing about this is
> that this patch actually _removes_ the size bound check
> 
> It was before the patch:
> 
> n->cmbsz && addr >= n->ctrl_mem.addr &&
>   addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)
> 

I don't think it does - the check is just moved to nvme_addr_is_cmb:

static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
{
hwaddr low = n->ctrl_mem.addr;
hwaddr hi  = n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size);

return addr >= low && addr < hi;
}

We check that `addr` is less than `hi`. Maybe the name is unfortunate...





Re: [PATCH v6 38/42] nvme: support multiple namespaces

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:59, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and register
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > Reviewed-by: Keith Busch 
> > ---
> >  hw/block/Makefile.objs |   2 +-
> >  hw/block/nvme-ns.c | 157 +++
> >  hw/block/nvme-ns.h |  60 +++
> >  hw/block/nvme.c| 233 ++---
> >  hw/block/nvme.h|  47 -
> >  hw/block/trace-events  |   4 +-
> >  6 files changed, 389 insertions(+), 114 deletions(-)
> >  create mode 100644 hw/block/nvme-ns.c
> >  create mode 100644 hw/block/nvme-ns.h
> > 
> > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> > index 4b4a2b338dc4..d9141d6a4b9b 100644
> > --- a/hw/block/Makefile.objs
> > +++ b/hw/block/Makefile.objs

> > @@ -2518,9 +2561,6 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  id->psd[0].mp = cpu_to_le16(0x9c4);
> >  id->psd[0].enlat = cpu_to_le32(0x10);
> >  id->psd[0].exlat = cpu_to_le32(0x4);
> > -if (blk_enable_write_cache(n->conf.blk)) {
> > -id->vwc = 1;
> > -}
> Shouldn't that be kept? Assuming that user used the legacy 'drive' option,
> and it had no write cache enabled.
> 

When using the drive option we still end up calling the same code that
handles the "new style" namespaces, and that code will handle the write
cache similarly.

> >  
> >  n->bar.cap = 0;
> >  NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
> > @@ -2533,25 +2573,34 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  n->bar.intmc = n->bar.intms = 0;
> >  }
> >  
> > -static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error 
> > **errp)
> > +int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> >  {
> > -int64_t bs_size;
> > -NvmeIdNs *id_ns = &ns->id_ns;
> > +uint32_t nsid = nvme_nsid(ns);
> >  
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg_errno(errp, -bs_size, "blk_getlength");
> > +if (nsid > NVME_MAX_NAMESPACES) {
> > +error_setg(errp, "invalid nsid (must be between 0 and %d)",
> > +   NVME_MAX_NAMESPACES);
> >  return -1;
> >  }
> >  
> > -id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > -n->ns_size = bs_size;
> > +if (!nsid) {
> > +for (int i = 1; i <= n->num_namespaces; i++) {
> > +NvmeNamespace *ns = nvme_ns(n, i);
> > +if (!ns) {
> > +nsid = i;
> > +break;
> > +}
> > +}
> This misses an edge error case, where all the namespaces are allocated.
> Yes, it would be insane to allocate all 256 namespaces but still.
> 

Impressive catch! Fixed!
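
For reference, a minimal sketch of how the auto-allocation path could handle
the exhausted case (the error message and exact shape are hypothetical, not
necessarily what will be posted):

if (!nsid) {
    for (int i = 1; i <= n->num_namespaces; i++) {
        if (!nvme_ns(n, i)) {
            nsid = i;
            break;
        }
    }

    if (!nsid) {
        /* all NVME_MAX_NAMESPACES slots are already taken */
        error_setg(errp, "no free namespace id");
        return -1;
    }
} else if (n->namespaces[nsid - 1]) {
    error_setg(errp, "namespace id %u is already in use", nsid);
    return -1;
}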

> 
> > +} else {
> > +if (n->namespaces[nsid - 1]) {
> > +error_setg(errp, "nsid must be unique");
> 
> I would change that error message to something like
> "namespace id %d is already in use".
> 

Done.




Re: [PATCH v6 36/42] nvme: add support for scatter gather lists

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:58, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > For now, support the Data Block, Segment and Last Segment descriptor
> > types.
> > 
> > See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c   | 310 +++---
> >  hw/block/trace-events |   4 +
> >  2 files changed, 262 insertions(+), 52 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 49d323566393..b89b96990f52 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -76,7 +76,12 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr 
> > addr)
> >  
> >  static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> > -if (n->bar.cmbsz && nvme_addr_is_cmb(n, addr)) {
> > +hwaddr hi = addr + size;
> > +if (hi < addr) {
> > +return 1;
> > +}
> > +
> > +if (n->bar.cmbsz && nvme_addr_is_cmb(n, addr) && nvme_addr_is_cmb(n, 
> > hi)) {
> 
> I would suggest splitting this into a separate patch as well, since this
> contains not just one but two bugfixes for this function, and they are not
> related to sg lists.
> Or at least move this to 'nvme: refactor nvme_addr_read' and rename this patch
> to something like 'nvme: fix and refactor nvme_addr_read'.
> 

I've split it into a patch.

> 
> >  memcpy(buf, nvme_addr_to_cmb(n, addr), size);
> >  return 0;
> >  }
> > @@ -328,13 +333,242 @@ unmap:
> >  return status;
> >  }
> >  
> > -static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > - uint64_t prp1, uint64_t prp2, DMADirection 
> > dir,
> > +static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
> > +  QEMUIOVector *iov,
> > +  NvmeSglDescriptor *segment, uint64_t 
> > nsgld,
> > +  size_t *len, NvmeRequest *req)
> > +{
> > +dma_addr_t addr, trans_len;
> > +uint32_t blk_len;
> > +uint16_t status;
> > +
> > +for (int i = 0; i < nsgld; i++) {
> > +uint8_t type = NVME_SGL_TYPE(segment[i].type);
> > +
> > +if (type != NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> > +switch (type) {
> > +case NVME_SGL_DESCR_TYPE_BIT_BUCKET:
> > +case NVME_SGL_DESCR_TYPE_KEYED_DATA_BLOCK:
> > +return NVME_SGL_DESCR_TYPE_INVALID | NVME_DNR;
> > +default:
> To be honest I don't like that 'default'.
> I would explicitly state which segment types remain
> (I think segment list and last segment list, and various reserved types).
> In fact, for the reserved types you probably also want to return
> NVME_SGL_DESCR_TYPE_INVALID.
> 

I "negated" the logic which I think is more readable. I still really
want to keep the default, for instance, nvme v1.4 adds a new type that
we do not support (the Transport SGL Data Block descriptor).

> Also, this function really begs to have a description prior to it,
> something like 'map a sg list section, assuming that it only contains SGL
> data descriptors; the caller has to ensure this'.
> 

Done.

> 
> > +return NVME_INVALID_NUM_SGL_DESCRS | NVME_DNR;
> > +}
> > +}
> > +
> > +if (*len == 0) {
> > +uint16_t sgls = le16_to_cpu(n->id_ctrl.sgls);
> Nitpick: I would add a small comment here as well describing
> what this does (we reach this point if the sg list covers more than what
> was specified in the command, and the NVME_CTRL_SGLS_EXCESS_LENGTH controller
> capability indicates that we support just throwing the extra data away).
> 

Adding a comment. It's the other way around. The size as indicated by
NLB (or whatever field, depending on the command) is the "authoritative"
source of information for the size of the payload. We will never accept
an SGL that is too short such that we lose or throw away data, but we
might accept ignoring parts of the SGL.
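
Roughly, the comment being added will say something like this (paraphrasing;
the code itself is the same as in the quoted hunk):

if (*len == 0) {
    /*
     * The command's NLB (or equivalent) field is the authoritative size of
     * the payload. If the SGL describes more data than that, we either
     * silently ignore the excess (when the controller advertises the SGL
     * Excess Length capability) or fail the command.
     */
    uint16_t sgls = le16_to_cpu(n->id_ctrl.sgls);
    if (sgls & NVME_CTRL_SGLS_EXCESS_LENGTH) {
        break;
    }

    trace_nvme_dev_err_invalid_sgl_excess_length(nvme_cid(req));
    return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
}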

> > +if (sgls & NVME_CTRL_SGLS_EXCESS_LENGTH) {
> > +break;
> > +}
> > +
> > +trace_nvme_dev_err_invalid_sgl_excess_length(nvme_cid(req));
> > +return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
> > +}
> > +
> > +addr = le64_to_cpu(segment[i].addr);
> > +blk_len = le32_to_cpu(segment[i].len);
> > +
> > +if (!blk_len) {
> > +continue;
> > +}
> > +
> > +if (UINT64_MAX - addr < blk_len) {
> > +return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
> > +}
> Good!
> > +
> > +trans_len = MIN(*len, blk_len);
> > +
> > +status = nvme_map_addr(n, qsg, iov, addr, trans_len);
> > +if (status) {
> > +return status;
> > +}
> > +
> > +*len -= trans_len;
> > +}
> > +
> > +return NVME_SUCCESS;
> > +}
> > +
> > +static uint16_t 

Re: [PATCH v6 35/42] nvme: handle dma errors

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:58, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Handling DMA errors gracefully is required for the device to pass the
> > block/011 test ("disable PCI device while doing I/O") in the blktests
> > suite.
> > 
> > With this patch the device passes the test by retrying "critical"
> > transfers (posting of completion entries and processing of submission
> > queue entries).
> > 
> > If DMA errors occur at any other point in the execution of the command
> > (say, while mapping the PRPs), the command is aborted with a Data
> > Transfer Error status code.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c   | 45 ---
> >  hw/block/trace-events |  2 ++
> >  include/block/nvme.h  |  2 +-
> >  3 files changed, 37 insertions(+), 12 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 15ca2417af04..49d323566393 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -164,7 +164,7 @@ static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, 
> > QEMUIOVector *iov, hwaddr addr,
> >size_t len)
> >  {
> >  if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len)) {
> > -return NVME_DATA_TRAS_ERROR;
> > +return NVME_DATA_TRANSFER_ERROR;
> 
> Minor nitpick: this is also a non-functional refactoring.
> I don't think that each piece of a refactoring should be in a separate patch,
> so I usually group all the non-functional (aka cosmetic) refactoring in one
> patch, usually the first in the series.
> But I try not to leave such refactoring in the functional patches.
> 
> However, since there are not that many cases like that left, I don't mind
> leaving this particular case as is.
> 

Noted. Keeping it here for now ;)

> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 




Re: [PATCH v6 29/42] nvme: refactor request bounds checking

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:56, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 28 ++--
> >  1 file changed, 22 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index eecfad694bf8..ba520c76bae5 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -562,13 +577,14 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace 
> > *ns, NvmeCmd *cmd,
> >  uint64_t data_offset = slba << data_shift;
> >  int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
> >  enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : 
> > BLOCK_ACCT_READ;
> > +uint16_t status;
> >  
> >  trace_nvme_dev_rw(is_write ? "write" : "read", nlb, data_size, slba);
> >  
> > -if (unlikely((slba + nlb) > ns->id_ns.nsze)) {
> > +status = nvme_check_bounds(n, ns, slba, nlb, req);
> > +if (status) {
> >  block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > -trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
> > -return NVME_LBA_RANGE | NVME_DNR;
> > +return status;
> >  }
> >  
> >   if (nvme_map(n, cmd, &req->qsg, &req->iov, data_size, req)) {
> Looks good as well; once we get support for discard, it will use this too,
> but for now indeed only write zeroes and read/write need bounds checking on
> the IO path.
> 

I have that patch in the submission queue and the check is factored out
there ;)

> Reviewed-by: Maxim Levitsky 
> 




Re: [PATCH v6 32/42] nvme: allow multiple aios per command

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:57, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > This refactors how the device issues asynchronous block backend
> > requests. The NvmeRequest now holds a queue of NvmeAIOs that are
> > associated with the command. This allows multiple aios to be issued for
> > a command. Only when all requests have been completed will the device
> > post a completion queue entry.
> > 
> > Because the device is currently guaranteed to only issue a single aio
> > request per command, the benefit is not immediately obvious. But this
> > functionality is required to support metadata, the dataset management
> > command and other features.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c   | 377 +++---
> >  hw/block/nvme.h   | 129 +--
> >  hw/block/trace-events |   6 +
> >  3 files changed, 407 insertions(+), 105 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 0d2b5b45b0c5..817384e3b1a9 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -373,6 +374,99 @@ static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, 
> > QEMUSGList *qsg,
> >  return nvme_map_prp(n, qsg, iov, prp1, prp2, len, req);
> >  }
> >  
> > +static void nvme_aio_destroy(NvmeAIO *aio)
> > +{
> > +g_free(aio);
> > +}
> > +
> > +static inline void nvme_req_register_aio(NvmeRequest *req, NvmeAIO *aio,
> I guess I'll call this nvme_req_add_aio,
> or nvme_add_aio_to_reg.
> Thoughts?
> Also you can leave this as is, but add a comment on top explaining this
> 

nvme_req_add_aio it is :) And comment added.

> > + NvmeAIOOp opc)
> > +{
> > +aio->opc = opc;
> > +
> > +trace_nvme_dev_req_register_aio(nvme_cid(req), aio, blk_name(aio->blk),
> > +aio->offset, aio->len,
> > +nvme_aio_opc_str(aio), req);
> > +
> > +if (req) {
> > +QTAILQ_INSERT_TAIL(&req->aio_tailq, aio, tailq_entry);
> > +}
> > +}
> > +
> > +static void nvme_submit_aio(NvmeAIO *aio)
> OK, this name makes sense
> Also please add a comment on top.

Done.

> > @@ -505,9 +600,11 @@ static inline uint16_t nvme_check_mdts(NvmeCtrl *n, 
> > size_t len,
> >  return NVME_SUCCESS;
> >  }
> >  
> > -static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, NvmeNamespace *ns,
> > - uint16_t ctrl, NvmeRequest *req)
> > +static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, uint16_t ctrl,
> > + NvmeRequest *req)
> >  {
> > +NvmeNamespace *ns = req->ns;
> > +
> This should go to the patch that added nvme_check_prinfo
> 

Probably killing that patch.

> > @@ -516,10 +613,10 @@ static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, 
> > NvmeNamespace *ns,
> >  return NVME_SUCCESS;
> >  }
> >  
> > -static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
> > - uint64_t slba, uint32_t nlb,
> > - NvmeRequest *req)
> > +static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
> > + uint32_t nlb, NvmeRequest *req)
> >  {
> > +NvmeNamespace *ns = req->ns;
> >  uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
> This should go to the patch that added nvme_check_bounds as well
> 

We can't really, because the NvmeRequest does not hold a reference to
the namespace as a struct member at that point. This is also an issue
with the nvme_check_prinfo function above.

> >  
> >  if (unlikely(UINT64_MAX - slba < nlb || slba + nlb > nsze)) {
> > @@ -530,55 +627,154 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl 
> > *n, NvmeNamespace *ns,
> >  return NVME_SUCCESS;
> >  }
> >  
> > -static void nvme_rw_cb(void *opaque, int ret)
> > +static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +NvmeNamespace *ns = req->ns;
> > +NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
> > +uint16_t ctrl = le16_to_cpu(rw->control);
> > +size_t len = req->nlb << nvme_ns_lbads(ns);
> > +uint16_t status;
> > +
> > +status = nvme_check_mdts(n, len, req);
> > +if (status) {
> > +return status;
> > +}
> > +
> > +status = nvme_check_prinfo(n, ctrl, req);
> > +if (status) {
> > +return status;
> > +}
> > +
> > +status = nvme_check_bounds(n, req->slba, req->nlb, req);
> > +if (status) {
> > +return status;
> > +}
> > +
> > +return NVME_SUCCESS;
> > +}
> 
> Nitpick: I hate to say it, but nvme_check_rw should be in a separate patch as
> well.
> It will also make the diff more readable (when adding a function and changing
> a function at the same time, you get a diff between two unrelated things).
> 

Done, but I had to do it as a follow-up patch.

> >  

Re: [PATCH v6 24/42] nvme: remove redundant has_sg member

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:45, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Remove the has_sg member from NvmeRequest since it's redundant.
> 
> To be honest this patch also replaces dma_acct_start with block_acct_start,
> which looks right to me, and IMHO it's OK to have both in the same patch,
> but that should be mentioned in the commit message.
> 

I pulled it to a separate patch :)

> With this fixed,
> Reviewed-by: Maxim Levitsky 
> 




Re: [PATCH v6 31/42] nvme: add check for prinfo

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:57, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Check the validity of the PRINFO field.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 50 ---
> >  hw/block/trace-events |  1 +
> >  include/block/nvme.h  |  1 +
> >  3 files changed, 44 insertions(+), 8 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 7d5340c272c6..0d2b5b45b0c5 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -505,6 +505,17 @@ static inline uint16_t nvme_check_mdts(NvmeCtrl *n, 
> > size_t len,
> >  return NVME_SUCCESS;
> >  }
> >  
> > +static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, NvmeNamespace *ns,
> > + uint16_t ctrl, NvmeRequest *req)
> > +{
> > +if ((ctrl & NVME_RW_PRINFO_PRACT) && !(ns->id_ns.dps & DPS_TYPE_MASK)) 
> > {
> > +trace_nvme_dev_err_prinfo(nvme_cid(req), ctrl);
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> 
> I refreshed my (still very limited) knowledge of the metadata
> and the protection info, and this is what I found:
> 
> I think that this is very far from complete, because we also have:
> 
> 1. PRCHECK. According to the spec it is independent of PRACT.
>    When some of it is set, together with enabled protection (the DPS field
>    in the namespace), the 8 bytes of protection info are checked (optionally
>    using the EILBRT and ELBAT/ELBATM fields in the command and a CRC of the
>    data for the guard field).
> 
>    So this field should also be checked to be zero when protection is
>    disabled (I don't see an explicit requirement for that in the spec, but
>    neither do I see such a requirement for PRACT).
> 
> 2. The protection values to be written / checked ((E)ILBRT/(E)LBATM/(E)LBAT).
>    Same here, but also these should not be set when PRCHECK is not set for
>    reads, plus some are protection type specific.
> 
> The spec does mention the 'Invalid Protection Information' error code, which
> refers to invalid values in the PRINFO field.
> So I think this error code should be returned instead of 'Invalid Field'.
> 
> Another thing to optionally check is that the metadata pointer for separate
> metadata is zero as long as we don't support metadata
> (again, I don't see an explicit requirement for this in the spec, but it
> mentions:
> 
> "This field is valid only if the command has metadata that is not interleaved
> with the logical block data, as specified in the Format NVM command"
> 
> )
> 

I'm kinda inclined to just drop this patch. The spec actually says that
the PRACT and PRCHK fields are used only if the namespace is formatted
to use end-to-end protection information. Since we do not support that,
I don't think we even need to check it.

Any opinion on this?



Re: [PATCH v6 23/42] nvme: add mapping helpers

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:45, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add nvme_map_addr, nvme_map_addr_cmb and nvme_addr_to_cmb helpers and
> > use them in nvme_map_prp.
> > 
> > This fixes a bug where in the case of a CMB transfer, the device would
> > map to the buffer with a wrong length.
> > 
> > Fixes: b2b2b67a00574 ("nvme: Add support for Read Data and Write Data in 
> > CMBs.")
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 97 +++
> >  hw/block/trace-events |  1 +
> >  2 files changed, 81 insertions(+), 17 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 08267e847671..187c816eb6ad 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -153,29 +158,79 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue 
> > *cq)
> >  }
> >  }
> >  
> > +static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr 
> > addr,
> > +  size_t len)
> > +{
> > +if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len)) {
> > +return NVME_DATA_TRAS_ERROR;
> > +}
> 
> I just noticed that,
> in theory (not that it really matters), addr+len refers to a byte which is
> not part of the transfer.
> 

Oh. Good catch - and I think that it does matter? Can't we end up
rejecting a valid access? Anyway, I fixed it with a '- 1'.

> 
> > +
> > +qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len);
> Also interesting: we can add a 0-sized iovec.
> 

I added a check on len. This also makes sure the above '- 1' fix doesn't
cause an 'addr + 0 - 1' to be done.
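
To make that concrete, a rough sketch of the helper with both fixes applied
(not necessarily the exact final code):

static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr addr,
                                  size_t len)
{
    if (!len) {
        /* nothing to map; also avoids evaluating 'addr + 0 - 1' below */
        return NVME_SUCCESS;
    }

    /* addr + len - 1 is the last byte that is actually transferred */
    if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len - 1)) {
        return NVME_DATA_TRAS_ERROR;
    }

    qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len);

    return NVME_SUCCESS;
}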




Re: [PATCH v6 14/42] nvme: add missing mandatory features

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:41, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add support for returning a reasonable response to Get/Set Features of
> > mandatory features.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c   | 60 ++-
> >  hw/block/trace-events |  2 ++
> >  include/block/nvme.h  |  6 -
> >  3 files changed, 66 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index ff8975cd6667..eb9c722df968 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1058,6 +1069,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  break;
> >  case NVME_TIMESTAMP:
> >  return nvme_get_feature_timestamp(n, cmd);
> > +case NVME_INTERRUPT_COALESCING:
> > +result = cpu_to_le32(n->features.int_coalescing);
> > +break;
> > +case NVME_INTERRUPT_VECTOR_CONF:
> > +if ((dw11 & 0x) > n->params.max_ioqpairs + 1) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> I still think that this should be >= since the interrupt vector is not zero 
> based.
> So if we have for example 3 IO queues, then we have 4 queues in total
> which translates to irq numbers 0..3.
> 

Yes, you are right. The device will support max_ioqpairs + 1 IVs, so
trying to access that one would actually go 1 beyond the array.

Fixed.
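
A sketch of the corrected check, pulled into a hypothetical helper just to
make the off-by-one explicit (not code from the series):

/*
 * Interrupt vectors are zero-based: a controller exposing max_ioqpairs + 1
 * queue pairs (admin + I/O) supports vectors 0 .. max_ioqpairs, so anything
 * at or above max_ioqpairs + 1 must be rejected.
 */
static bool nvme_iv_is_valid(NvmeCtrl *n, uint32_t dw11)
{
    uint16_t iv = dw11 & 0xffff;

    return iv < n->params.max_ioqpairs + 1;
}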

> BTW the user of the device doesn't have to have 1:1 mapping between qid and 
> msi interrupt index,
> in fact when MSI is not used, all the queues will map to the same vector, 
> which will be interrupt 0
> from point of view of the device IMHO.
> So it kind of makes sense IMHO to have num_irqs or something, even if it 
> technically equals to number of queues.
> 

Yeah, but the device will still *support* the N IVs, so they can still
be configured even though they will not be used. So I don't think we
need to introduce an additional parameter?

> > @@ -1120,6 +1146,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  
> >  break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> > +if (blk_enable_write_cache(n->conf.blk)) {
> > +blk_flush(n->conf.blk);
> > +}
> 
> (not your fault) but the blk_enable_write_cache function name is highly
> misleading, since it doesn't enable anything but just returns whether the
> write cache is enabled.
> It really should be called blk_get_enable_write_cache.
> 

Agreed :)

> > @@ -1804,6 +1860,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  id->cqes = (0x4 << 4) | 0x4;
> >  id->nn = cpu_to_le32(n->num_namespaces);
> >  id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
> > +
> Unrelated whitespace change

Fixed.

> 
> Best regards,
>   Maxim Levitsky
> 
> 
> 
> 



Re: [PATCH v6 12/42] nvme: add support for the get log page command

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:40, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add support for the Get Log Page command and basic implementations of
> > the mandatory Error Information, SMART / Health Information and Firmware
> > Slot Information log pages.
> > 
> > In violation of the specification, the SMART / Health Information log
> > page does not persist information over the lifetime of the controller
> > because the device has no place to store such persistent state.
> > 
> > Note that the LPA field in the Identify Controller data structure
> > intentionally has bit 0 cleared because there is no namespace specific
> > information in the SMART / Health information log page.
> > 
> > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > Section 5.10 ("Get Log Page command").
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c   | 138 +-
> >  hw/block/nvme.h   |  10 +++
> >  hw/block/trace-events |   2 +
> >  3 files changed, 149 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 64c42101df5c..83ff3fbfb463 100644
> >
> > +static uint16_t nvme_error_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > buf_len,
> > +uint64_t off, NvmeRequest *req)
> > +{
> > +uint32_t trans_len;
> > +uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
> > +uint8_t errlog[64];
> I'll would replace this with sizeof(NvmeErrorLogEntry)
> (and add NvmeErrorLogEntry to the nvme.h), just for the sake of consistency,
> and in case we end up reporting some errors to the log in the future.
> 

NvmeErrorLog is already in nvme.h; Fixed to actually use it.

> 
> > +
> > +if (off > sizeof(errlog)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +memset(errlog, 0x0, sizeof(errlog));
> > +
> > +trans_len = MIN(sizeof(errlog) - off, buf_len);
> > +
> > +return nvme_dma_read_prp(n, errlog, trans_len, prp1, prp2);
> > +}
> Besides this, looks good now.
> 
> 
> Best regards,
>   Maxim Levitsky
> 




Re: [PATCH v6 19/42] nvme: enforce valid queue creation sequence

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:43, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Support returning Command Sequence Error if Set Features on Number of
> > Queues is called after queues have been created.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 7 +++
> >  hw/block/nvme.h | 1 +
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 007f8817f101..b40d27cddc46 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -881,6 +881,8 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd 
> > *cmd)
> >  cq = g_malloc0(sizeof(*cq));
> >  nvme_init_cq(cq, n, prp1, cqid, vector, qsize + 1,
> >  NVME_CQ_FLAGS_IEN(qflags));
> > +
> > +n->qs_created = true;
> Very minor nitpick, maybe it is worth mentioning in a comment,
> why this is only needed in CQ creation, as you explained to me.
> 

Added.
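
For the record, the comment says something to this effect (paraphrased):

    /*
     * It is only required to set qs_created when creating a completion
     * queue; creating a submission queue without a matching completion
     * queue will fail, so a submission queue implies that a completion
     * queue has already been created.
     */
    n->qs_created = true;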

> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 
> 
> 
> 
> 



Re: [PATCH v6 10/42] nvme: refactor device realization

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:40, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > This patch splits up nvme_realize into multiple individual functions,
> > each initializing a different subset of the device.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c | 178 ++--
> >  hw/block/nvme.h |  23 ++-
> >  2 files changed, 134 insertions(+), 67 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 7dfd8a1a392d..665485045066 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1340,57 +1337,100 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >  n->params.max_ioqpairs = n->params.num_queues - 1;
> >  }
> >  
> > -if (!n->params.max_ioqpairs) {
> > -error_setg(errp, "max_ioqpairs can't be less than 1");
> > +if (params->max_ioqpairs < 1 ||
> > +params->max_ioqpairs > PCI_MSIX_FLAGS_QSIZE) {
> > +error_setg(errp, "nvme: max_ioqpairs must be ");
> Looks like the error message is not complete now.

Fixed!

> 
> Small nitpick: To be honest this not only refactoring in the device 
> realization since you also (rightfully)
> removed the duplicated cmbsz/cmbloc so I would add a mention for this in the 
> commit message.
> But that doesn't matter that much, so
> 

You are right. I've added it as a separate patch.

> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 




Re: [PATCH v6 06/42] nvme: add identify cns values in header

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:37, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f716f690a594..b38d7e548a60 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -709,11 +709,11 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd 
> > *cmd)
> >  NvmeIdentify *c = (NvmeIdentify *)cmd;
> >  
> >  switch (le32_to_cpu(c->cns)) {
> > -case 0x00:
> > +case NVME_ID_CNS_NS:
> >  return nvme_identify_ns(n, c);
> > -case 0x01:
> > +case NVME_ID_CNS_CTRL:
> >  return nvme_identify_ctrl(n, c);
> > -case 0x02:
> > +case NVME_ID_CNS_NS_ACTIVE_LIST:
> >  return nvme_identify_nslist(n, c);
> >  default:
> >  trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
> 
> This is a very good candidate to be squished with the patch 5 IMHO,
> but you can leave this as is as well. I don't mind.
> 

Squashed!

> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 
> 
> 
> 
> 
> 



Re: [PATCH v6 09/42] nvme: add max_ioqpairs device parameter

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:39, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > The num_queues device parameter has a slightly confusing meaning because
> > it accounts for the admin queue pair which is not really optional.
> > Secondly, it is really a maximum value of queues allowed.
> > 
> > Add a new max_ioqpairs parameter that only accounts for I/O queue pairs,
> > but keep num_queues for compatibility.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 45 ++---
> >  hw/block/nvme.h |  4 +++-
> >  2 files changed, 29 insertions(+), 20 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 7cf7cf55143e..7dfd8a1a392d 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1332,9 +1333,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >  int64_t bs_size;
> >  uint8_t *pci_conf;
> >  
> > -if (!n->params.num_queues) {
> > -error_setg(errp, "num_queues can't be zero");
> > -return;
> > +if (n->params.num_queues) {
> > +warn_report("nvme: num_queues is deprecated; please use 
> > max_ioqpairs "
> > +"instead");
> > +
> > +n->params.max_ioqpairs = n->params.num_queues - 1;
> > +}
> > +
> > +if (!n->params.max_ioqpairs) {
> > +error_setg(errp, "max_ioqpairs can't be less than 1");
> >  }
> This is not even a nitpick, but just and idea.
> 
> It might be worth it to allow max_ioqpairs=0 to simulate a 'broken'
> nvme controller. I know that kernel has special handling for such controllers,
> which include only creation of the control character device (/dev/nvme*) 
> through
> which the user can submit commands to try and 'fix' the controller (by 
> re-uploading firmware
> maybe or something like that).
> 
> 

Not sure about the implications of this, so I'll leave that on the TODO
:) But a controller with no I/O queues is an "Administrative Controller"
and perfectly legal in NVMe v1.4 AFAIK.

> >  
> >  if (!n->conf.blk) {
> > @@ -1365,19 +1372,19 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >  pcie_endpoint_cap_init(pci_dev, 0x80);
> >  
> >  n->num_namespaces = 1;
> > -n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
> > +n->reg_size = pow2ceil(0x1008 + 2 * (n->params.max_ioqpairs) * 4);
> 
> I hate to say it, but it looks like this thing (which I mentioned to you in 
> V5)
> was pre-existing bug, which is indeed fixed now.
> In theory such fixes should go to separate patches, but in this case, I guess 
> it would
> be too much to ask for it.
> Maybe mention this in the commit message instead, so that this fix doesn't 
> stay hidden like that?
> 
> 

I'm convinced now. I have added a preparatory bugfix patch before this
patch.
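
For the record, the reasoning behind the corrected calculation is roughly
this (illustration only):

    /*
     * Doorbell registers start at offset 0x1000. Assuming a doorbell
     * stride (CAP.DSTRD) of 0, each queue pair needs a 4 byte SQ tail
     * doorbell and a 4 byte CQ head doorbell. Counting the admin queue
     * pair plus max_ioqpairs I/O queue pairs:
     *
     *   reg_size = 0x1000 + 2 * (max_ioqpairs + 1) * 4
     *            = 0x1008 + 2 * max_ioqpairs * 4
     *
     * which is then rounded up to a power of two with pow2ceil().
     */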

> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 




Re: [PATCH v6 11/42] nvme: add temperature threshold feature

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:40, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > It might seem weird to implement this feature for an emulated device,
> > but it is mandatory to support and the feature is useful for testing
> > asynchronous event request support, which will be added in a later
> > patch.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c  | 48 
> >  hw/block/nvme.h  |  2 ++
> >  include/block/nvme.h |  8 +++-
> >  3 files changed, 57 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index b7c465560eea..8cda5f02c622 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -108,6 +108,7 @@ typedef struct NvmeCtrl {
> >  uint64_tirq_status;
> >  uint64_thost_timestamp; /* Timestamp sent by the 
> > host */
> >  uint64_ttimestamp_set_qemu_clock_ms;/* QEMU clock time */
> > +uint16_ttemperature;
> You forgot to move this too.
> 

Fixed!
> 
> With 'temperature' field removed from the header:
> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 




Re: [PATCH v6 07/42] nvme: refactor nvme_addr_read

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:38, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Pull the controller memory buffer check to its own function. The check
> > will be used on its own in later patches.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >  hw/block/nvme.c | 16 
> >  1 file changed, 12 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index b38d7e548a60..08a83d449de3 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -52,14 +52,22 @@
> >  
> >  static void nvme_process_sq(void *opaque);
> >  
> > +static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> > +{
> > +hwaddr low = n->ctrl_mem.addr;
> > +hwaddr hi  = n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size);
> > +
> > +return addr >= low && addr < hi;
> > +}
> > +
> >  static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> > -if (n->cmbsz && addr >= n->ctrl_mem.addr &&
> > -addr < (n->ctrl_mem.addr + 
> > int128_get64(n->ctrl_mem.size))) {
> > +if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> >  memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
> > -} else {
> > -pci_dma_read(&n->parent_obj, addr, buf, size);
> > +return;
> >  }
> > +
> > +pci_dma_read(&n->parent_obj, addr, buf, size);
> >  }
> >  
> >  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
> 
> Note that this patch still contains a bug that it removes the check against 
> the accessed
> size, which you fix in later patch.
> I prefer to not add a bug in first place
> However if you have a reason for this, I won't mind.
> 

So yeah. The reason is that there is actually no bug at this point
because the controller only supports PRPs. I actually thought there was
a bug as well and reported it to qemu-security some months ago as a
potential out of bounds access. I was then schooled by Keith on how PRPs
work ;) Below is a paraphrased version of Keith's analysis.

PRPs do not cross page boundaries:

trans_len = n->page_size - (prp1 % n->page_size);

The PRPs are always verified to be page aligned:

if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {

and the transfer length won't exceed the page size. So, since the start
address is within the CMB, and since the CMB is MB-aligned and MB-sized,
a PRP transfer can never cross outside of it.

I could add the check at this point (because it *is* needed for when
SGLs are introduced), but I think it would just be noise and I would
need to explain why the check is there, but not really needed at this
point. Instead I'm adding a new patch before the SGL patch that explains
this.
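
To illustrate the argument with a 4 KiB page size (example values only):

    /*
     * If prp1 = cmb_base + 0xf00, then
     *
     *   trans_len = n->page_size - (prp1 % n->page_size)
     *             = 0x1000 - 0xf00 = 0x100
     *
     * so the transfer covers [cmb_base + 0xf00, cmb_base + 0x1000) and
     * never crosses the page boundary - and therefore never crosses the
     * MB-aligned and MB-sized CMB boundary either.
     */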



Re: [PATCH v6 04/42] nvme: bump spec data structures to v1.3

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:37, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Add missing fields in the Identify Controller and Identify Namespace
> > data structures to bring them in line with NVMe v1.3.
> > 
> > This also adds data structures and defines for SGL support which
> > requires a couple of trivial changes to the nvme block driver as well.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Fam Zheng 
> > ---
> >  block/nvme.c |  18 ++---
> >  hw/block/nvme.c  |  12 ++--
> >  include/block/nvme.h | 153 ++-
> >  3 files changed, 151 insertions(+), 32 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index d41c4bda6e39..99b9bb3dac96 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -589,6 +675,16 @@ enum NvmeIdCtrlOncs {
> >  #define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf)
> >  #define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf)
> >  
> > +#define NVME_CTRL_SGLS_SUPPORTED_MASK(0x3 <<  0)
> > +#define NVME_CTRL_SGLS_SUPPORTED_NO_ALIGNMENT(0x1 <<  0)
> > +#define NVME_CTRL_SGLS_SUPPORTED_DWORD_ALIGNMENT (0x1 <<  1)
> > +#define NVME_CTRL_SGLS_KEYED (0x1 <<  2)
> > +#define NVME_CTRL_SGLS_BITBUCKET (0x1 << 16)
> > +#define NVME_CTRL_SGLS_MPTR_CONTIGUOUS   (0x1 << 17)
> > +#define NVME_CTRL_SGLS_EXCESS_LENGTH (0x1 << 18)
> > +#define NVME_CTRL_SGLS_MPTR_SGL  (0x1 << 19)
> > +#define NVME_CTRL_SGLS_ADDR_OFFSET   (0x1 << 20)
> OK
> > +
> >  typedef struct NvmeFeatureVal {
> >  uint32_tarbitration;
> >  uint32_tpower_mgmt;
> > @@ -611,6 +707,10 @@ typedef struct NvmeFeatureVal {
> >  #define NVME_INTC_THR(intc) (intc & 0xff)
> >  #define NVME_INTC_TIME(intc)((intc >> 8) & 0xff)
> >  
> > +#define NVME_TEMP_THSEL(temp)  ((temp >> 20) & 0x3)
> Nitpick: If we are adding this, I'll add a #define for the values as well
> 

Done. And used in the subsequent "nvme: add temperature threshold
feature" patch.

> > +#define NVME_TEMP_TMPSEL(temp) ((temp >> 16) & 0xf)
> > +#define NVME_TEMP_TMPTH(temp)  ((temp >>  0) & 0xffff)
> > +
> >  enum NvmeFeatureIds {
> >  NVME_ARBITRATION= 0x1,
> >  NVME_POWER_MANAGEMENT   = 0x2,
> > @@ -653,18 +753,37 @@ typedef struct NvmeIdNs {
> >  uint8_t mc;
> >  uint8_t dpc;
> >  uint8_t dps;
> > -
> >  uint8_t nmic;
> >  uint8_t rescap;
> >  uint8_t fpi;
> >  uint8_t dlfeat;
> > -
> > -uint8_t res34[94];
> > +uint16_tnawun;
> > +uint16_tnawupf;
> > +uint16_tnacwu;
> > +uint16_tnabsn;
> > +uint16_tnabo;
> > +uint16_tnabspf;
> > +uint16_tnoiob;
> > +uint8_t nvmcap[16];
> > +uint8_t rsvd64[40];
> > +uint8_t nguid[16];
> > +uint64_teui64;
> >  NvmeLBAFlbaf[16];
> > -uint8_t res192[192];
> > +uint8_t rsvd192[192];
> >  uint8_t vs[3712];
> >  } NvmeIdNs;
> Also checked this against V5, looks OK now
> 
> >  
> > +typedef struct NvmeIdNsDescr {
> > +uint8_t nidt;
> > +uint8_t nidl;
> > +uint8_t rsvd2[2];
> > +} NvmeIdNsDescr;
> OK
> 
> 
> 
> > +
> > +#define NVME_NIDT_UUID_LEN 16
> > +
> > +enum {
> > +NVME_NIDT_UUID = 0x3,
> Very minor nitpick: I'll would add others as well just for the sake
> of better understanding what this is
> 

Done.

> > +};
> >  
> >  /*Deallocate Logical Block Features*/
> >  #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)   ((dlfeat) & 0x10)
> 
> Looks very good.
> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 



Re: [PATCH v6 05/42] nvme: use constant for identify data size

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:37, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 40cb176dea3c..f716f690a594 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -679,7 +679,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
> > NvmeIdentify *c)
> >  
> >  static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
> >  {
> > -static const int data_len = 4 * KiB;
> > +static const int data_len = NVME_IDENTIFY_DATA_SIZE;
> >  uint32_t min_nsid = le32_to_cpu(c->nsid);
> >  uint64_t prp1 = le64_to_cpu(c->prp1);
> >  uint64_t prp2 = le64_to_cpu(c->prp2);
> 
> I'll probably squash this with some other refactoring patch,
> but I absolutely don't mind leaving this as is.
> Fine grained patches never cause any harm.
> 

Squashed into 06/42.

> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 
> 
> 
> 



Re: [PATCH v6 01/42] nvme: rename trace events to nvme_dev

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 25 12:36, Maxim Levitsky wrote:
> On Mon, 2020-03-16 at 07:28 -0700, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Change the prefix of all nvme device related trace events to 'nvme_dev'
> > to not clash with trace events from the nvme block driver.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > Reviewed-by: Maxim Levitsky 
> > ---
> >  hw/block/nvme.c   | 188 +-
> >  hw/block/trace-events | 172 +++---
> >  2 files changed, 180 insertions(+), 180 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index d28335cbf377..3e4b18956ed2 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1035,32 +1035,32 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr 
> > offset, uint64_t data,
> >  switch (offset) {
> >  case 0xc:   /* INTMS */
> >  if (unlikely(msix_enabled(&(n->parent_obj)))) {
> > -NVME_GUEST_ERR(nvme_ub_mmiowr_intmask_with_msix,
> > +NVME_GUEST_ERR(nvme_dev_ub_mmiowr_intmask_with_msix,
> > "undefined access to interrupt mask set"
> > " when MSI-X is enabled");
> >  /* should be ignored, fall through for now */
> >  }
> >  n->bar.intms |= data & 0xffff;
> >  n->bar.intmc = n->bar.intms;
> > -trace_nvme_mmio_intm_set(data & 0xffff,
> > +trace_nvme_dev_mmio_intm_set(data & 0xffff,
> >   n->bar.intmc);
> Indention.
> 

Fixed.

> >  nvme_irq_check(n);
> >  break;
> >  case 0x10:  /* INTMC */
> >  if (unlikely(msix_enabled(&(n->parent_obj)))) {
> > -NVME_GUEST_ERR(nvme_ub_mmiowr_intmask_with_msix,
> > +NVME_GUEST_ERR(nvme_dev_ub_mmiowr_intmask_with_msix,
> > "undefined access to interrupt mask clr"
> > " when MSI-X is enabled");
> >  /* should be ignored, fall through for now */
> >  }
> >  n->bar.intms &= ~(data & 0xffff);
> >  n->bar.intmc = n->bar.intms;
> > -trace_nvme_mmio_intm_clr(data & 0xffff,
> > +trace_nvme_dev_mmio_intm_clr(data & 0xffff,
> >   n->bar.intmc);
> Indention.
> 

Fixed.

> 
> 
> Other that indention nitpicks, no changes vs V5,
> so my reviewed-by kept correctly.
> 
> Best regards,
>   Maxim Levitsky
> 



Re: [PATCH RESEND v4] nvme: introduce PMR support from NVMe 1.4 spec

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 31 03:13, Keith Busch wrote:
> On Mon, Mar 30, 2020 at 08:07:32PM +0200, Klaus Birkelund Jensen wrote:
> > On Mar 31 01:55, Keith Busch wrote:
> > > On Mon, Mar 30, 2020 at 09:46:56AM -0700, Andrzej Jakowski wrote:
> > > > This patch introduces support for PMR that has been defined as part of 
> > > > NVMe 1.4
> > > > spec. User can now specify a pmrdev option that should point to 
> > > > HostMemoryBackend.
> > > > pmrdev memory region will subsequently be exposed as PCI BAR 2 in 
> > > > emulated NVMe
> > > > device. Guest OS can perform mmio read and writes to the PMR region 
> > > > that will stay
> > > > persistent across system reboot.
> > > 
> > > Looks pretty good to me.
> > > 
> > > Reviewed-by: Keith Busch 
> > > 
> > > For a possible future extention, it could be interesting to select the
> > > BAR for PMR dynamically, so that you could have CMB and PMR enabled on
> > > the same device.
> > > 
> > 
> > I thought about the same thing, but this would require disabling MSI-X
> > right? Otherwise there is not enough room. AFAIU, the iomem takes up
> > BAR0 and BAR1, CMB (or PMR) uses BAR2 and BAR3 and MSI-X uses BAR4 or 5.
> 
> That's the way the emulated device is currently set, but there's no
> reason for CMB or MSIx to use an exlusive bar. Both of those can be
> be offsets into BAR 0/1, for example. PMR is the only one that can't
> share.
> 

Oh, I did not consider that at all. Thanks!



Re: [PATCH RESEND v4] nvme: introduce PMR support from NVMe 1.4 spec

2020-03-30 Thread Klaus Birkelund Jensen
On Mar 31 01:55, Keith Busch wrote:
> On Mon, Mar 30, 2020 at 09:46:56AM -0700, Andrzej Jakowski wrote:
> > This patch introduces support for PMR that has been defined as part of NVMe 
> > 1.4
> > spec. User can now specify a pmrdev option that should point to 
> > HostMemoryBackend.
> > pmrdev memory region will subsequently be exposed as PCI BAR 2 in emulated 
> > NVMe
> > device. Guest OS can perform mmio read and writes to the PMR region that 
> > will stay
> > persistent across system reboot.
> 
> Looks pretty good to me.
> 
> Reviewed-by: Keith Busch 
> 
> For a possible future extention, it could be interesting to select the
> BAR for PMR dynamically, so that you could have CMB and PMR enabled on
> the same device.
> 

I thought about the same thing, but this would require disabling MSI-X
right? Otherwise there is not enough room. AFAIU, the iomem takes up
BAR0 and BAR1, CMB (or PMR) uses BAR2 and BAR3 and MSI-X uses BAR4 or 5.



Re: [PATCH v3] block/nvme: introduce PMR support from NVMe 1.4 spec

2020-03-19 Thread Klaus Birkelund Jensen
On Mar 18 13:03, Andrzej Jakowski wrote:
> This patch introduces support for PMR that has been defined as part of NVMe 
> 1.4
> spec. User can now specify a pmrdev option that should point to 
> HostMemoryBackend.
> pmrdev memory region will subsequently be exposed as PCI BAR 2 in emulated 
> NVMe
> device. Guest OS can perform mmio read and writes to the PMR region that will 
> stay
> persistent across system reboot.
> 
> Signed-off-by: Andrzej Jakowski 
> ---
> v2:
>  - reworked PMR to use HostMemoryBackend instead of directly mapping PMR
>backend file into qemu [1] (Stefan)
> 
> v1:
>  - provided support for Bit 1 from PMRWBM register instead of Bit 0 to ensure
>improved performance in virtualized environment [2] (Stefan)
> 
>  - added check if pmr size is power of two in size [3] (David)
> 
>  - addressed cross compilation build problems reported by CI environment
> 
> [1]: 
> https://lore.kernel.org/qemu-devel/20200306223853.37958-1-andrzej.jakow...@linux.intel.com/
> [2]: 
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf
> [3]: 
> https://lore.kernel.org/qemu-devel/20200218224811.30050-1-andrzej.jakow...@linux.intel.com/
> ---
> Persistent Memory Region (PMR) is a new optional feature provided in NVMe 1.4
> specification. This patch implements initial support for it in NVMe driver.
> ---
>  hw/block/nvme.c   | 117 +++-
>  hw/block/nvme.h   |   2 +
>  hw/block/trace-events |   5 ++
>  include/block/nvme.h  | 172 ++
>  4 files changed, 294 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index d28335cbf3..70fd09d293 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -19,10 +19,18 @@
>   *  -drive file=,if=none,id=
>   *  -device nvme,drive=,serial=,id=, \
>   *  cmb_size_mb=, \
> + *  [pmrdev=,] \
>   *  num_queues=
>   *
>   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
>   * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> + *
> + * Either cmb or pmr - due to limitation in available BAR indexes.
> + * pmr_file file needs to be power of two in size.
> + * Enabling pmr emulation can be achieved by pointing to memory-backend-file.
> + * For example:
> + * -object memory-backend-file,id=,share=on,mem-path=, \
> + *  size=  -device nvme,...,pmrdev=
>   */
>  
>  #include "qemu/osdep.h"
> @@ -35,7 +43,9 @@
>  #include "sysemu/sysemu.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> +#include "sysemu/hostmem.h"
>  #include "sysemu/block-backend.h"
> +#include "exec/ramblock.h"
>  
>  #include "qemu/log.h"
>  #include "qemu/module.h"
> @@ -1141,6 +1151,26 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
> uint64_t data,
>  NVME_GUEST_ERR(nvme_ub_mmiowr_cmbsz_readonly,
> "invalid write to read only CMBSZ, ignored");
>  return;
> +case 0xE00: /* PMRCAP */
> +NVME_GUEST_ERR(nvme_ub_mmiowr_pmrcap_readonly,
> +   "invalid write to PMRCAP register, ignored");
> +return;
> +case 0xE04: /* TODO PMRCTL */
> +break;
> +case 0xE08: /* PMRSTS */
> +NVME_GUEST_ERR(nvme_ub_mmiowr_pmrsts_readonly,
> +   "invalid write to PMRSTS register, ignored");
> +return;
> +case 0xE0C: /* PMREBS */
> +NVME_GUEST_ERR(nvme_ub_mmiowr_pmrebs_readonly,
> +   "invalid write to PMREBS register, ignored");
> +return;
> +case 0xE10: /* PMRSWTP */
> +NVME_GUEST_ERR(nvme_ub_mmiowr_pmrswtp_readonly,
> +   "invalid write to PMRSWTP register, ignored");
> +return;
> +case 0xE14: /* TODO PMRMSC */
> + break;
>  default:
>  NVME_GUEST_ERR(nvme_ub_mmiowr_invalid,
> "invalid MMIO write,"
> @@ -1169,6 +1199,23 @@ static uint64_t nvme_mmio_read(void *opaque, hwaddr 
> addr, unsigned size)
>  }
>  
>  if (addr < sizeof(n->bar)) {
> +/*
> + * When PMRWBM bit 1 is set then read from
> + * from PMRSTS should ensure prior writes
> + * made it to persistent media
> + */
> +if (addr == 0xE08 &&
> +(NVME_PMRCAP_PMRWBM(n->bar.pmrcap) & 0x02) >> 1) {

Don't think that shift is needed.

> +int status;
> +
> +status = qemu_msync((void *)n->pmrdev->mr.ram_block->host,
> +n->pmrdev->size,
> +n->pmrdev->mr.ram_block->fd);
> +if (!status) {
> +NVME_GUEST_ERR(nvme_ub_mmiord_pmrread_barrier,
> +   "error while persisting data");
> +}
> +}
> >  memcpy(&val, ptr + addr, size);
>  } else {
>  NVME_GUEST_ERR(nvme_ub_mmiord_invalid_ofs,
> @@ -1332,6 +1379,23 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> 

Re: [PATCH v5 22/26] nvme: support multiple namespaces

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 14:34, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> Very reasonable way to do it. 
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/Makefile.objs |   2 +-
> >  hw/block/nvme-ns.c | 158 +++
> >  hw/block/nvme-ns.h |  60 +++
> >  hw/block/nvme.c| 235 +
> >  hw/block/nvme.h|  47 -
> >  hw/block/trace-events  |   6 +-
> >  6 files changed, 389 insertions(+), 119 deletions(-)
> >  create mode 100644 hw/block/nvme-ns.c
> >  create mode 100644 hw/block/nvme-ns.h
> > 
> > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> > index 28c2495a00dc..45f463462f1e 100644
> > --- a/hw/block/Makefile.objs
> > +++ b/hw/block/Makefile.objs
> > @@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
> >  common-obj-$(CONFIG_XEN) += xen-block.o
> >  common-obj-$(CONFIG_ECC) += ecc.o
> >  common-obj-$(CONFIG_ONENAND) += onenand.o
> > -common-obj-$(CONFIG_NVME_PCI) += nvme.o
> > +common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
> >  common-obj-$(CONFIG_SWIM) += swim.o
> >  
> >  obj-$(CONFIG_SH4) += tc58128.o
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > new file mode 100644
> > index ..0e5be44486f4
> > --- /dev/null
> > +++ b/hw/block/nvme-ns.c
> > @@ -0,0 +1,158 @@
> > +#include "qemu/osdep.h"
> > +#include "qemu/units.h"
> > +#include "qemu/cutils.h"
> > +#include "qemu/log.h"
> > +#include "hw/block/block.h"
> > +#include "hw/pci/msix.h"
> Do you need this include?

No, I needed hw/pci/pci.h instead :)

> > +#include "sysemu/sysemu.h"
> > +#include "sysemu/block-backend.h"
> > +#include "qapi/error.h"
> > +
> > +#include "hw/qdev-properties.h"
> > +#include "hw/qdev-core.h"
> > +
> > +#include "nvme.h"
> > +#include "nvme-ns.h"
> > +
> > +static int nvme_ns_init(NvmeNamespace *ns)
> > +{
> > +NvmeIdNs *id_ns = &ns->id_ns;
> > +
> > +id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->nuse = id_ns->ncap = id_ns->nsze =
> > +cpu_to_le64(nvme_ns_nlbas(ns));
> Nitpick: To be honest I don't really like that chain assignment, 
> especially since it forces to wrap the line, but that is just my
> personal taste.

Fixed, and also added a comment as to why they are the same.
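
Roughly like so (a sketch of the change, not necessarily the final patch):

    /* no thin provisioning, so size, capacity and utilization coincide */
    id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
    id_ns->ncap = id_ns->nsze;
    id_ns->nuse = id_ns->ncap;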

> > +
> > +return 0;
> > +}
> > +
> > +static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, NvmeIdCtrl *id,
> > +Error **errp)
> > +{
> > +uint64_t perm, shared_perm;
> > +
> > +Error *local_err = NULL;
> > +int ret;
> > +
> > +perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
> > +shared_perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
> > +BLK_PERM_GRAPH_MOD;
> > +
> > +ret = blk_set_perm(ns->blk, perm, shared_perm, &local_err);
> > +if (ret) {
> > +error_propagate_prepend(errp, local_err, "blk_set_perm: ");
> > +return ret;
> > +}
> 
> You should consider using blkconf_apply_backend_options.
> Take a look at for example virtio_blk_device_realize.
> That will give you support for read only block devices as well.

So, yeah. There is a reason for this. And I will add that as a comment,
but I will write it here for posterity.

The problem is when the nvme-ns device starts getting more than just a
single drive attached (I have patches ready that will add a "metadata"
and a "state" drive). The blkconf_ functions work on a BlockConf that
embeds a BlockBackend, so you can't have one BlockConf with multiple
BlockBackends. That is why I'm kinda copying the "good parts" of
the blkconf_apply_backend_options code here.

> 
> I personally only once grazed the area of block permissions,
> so I prefer someone from the block layer to review this as well.
> 
> > +
> > +ns->size = blk_getlength(ns->blk);
> > +if (ns->size < 0) {
> > +error_setg_errno(errp, -ns->size, "blk_getlength");
> > +return 1;
> > +}
> > +
> > +switch (n->conf.wce) {
> > +case ON_OFF_AUTO_ON:
> > +n->features.volatile_wc = 1;
> > +break;
> > +case ON_OFF_AUTO_OFF:
> > +n->features.volatile_wc = 0;
> > +case 

Re: [PATCH v5 20/26] nvme: handle dma errors

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 13:52, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > Handling DMA errors gracefully is required for the device to pass the
> > block/011 test ("disable PCI device while doing I/O") in the blktests
> > suite.
> > 
> > With this patch the device passes the test by retrying "critical"
> > transfers (posting of completion entries and processing of submission
> > queue entries).
> > 
> > If DMA errors occur at any other point in the execution of the command
> > (say, while mapping the PRPs), the command is aborted with a Data
> > Transfer Error status code.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 42 +-
> >  hw/block/trace-events |  2 ++
> >  include/block/nvme.h  |  2 +-
> >  3 files changed, 36 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f8c81b9e2202..204ae1d33234 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -73,14 +73,14 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr 
> > addr)
> >  return addr >= low && addr < hi;
> >  }
> >  
> > -static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> > +static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> >  if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> >  memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
> > -return;
> > +return 0;
> >  }
> >  
> > -pci_dma_read(&n->parent_obj, addr, buf, size);
> > +return pci_dma_read(&n->parent_obj, addr, buf, size);
> >  }
> >  
> >  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
> > @@ -168,6 +168,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList 
> > *qsg, QEMUIOVector *iov,
> >  uint16_t status = NVME_SUCCESS;
> >  bool is_cmb = false;
> >  bool prp_list_in_cmb = false;
> > +int ret;
> >  
> >  trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
> >  prp1, prp2, num_prps);
> > @@ -218,7 +219,12 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList 
> > *qsg, QEMUIOVector *iov,
> >  
> >  nents = (len + n->page_size - 1) >> n->page_bits;
> >  prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > -nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> > +ret = nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> > +if (ret) {
> > +trace_nvme_dev_err_addr_read(prp2);
> > +status = NVME_DATA_TRANSFER_ERROR;
> > +goto unmap;
> > +}
> >  while (len != 0) {
> >  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
> >  
> > @@ -237,7 +243,13 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList 
> > *qsg, QEMUIOVector *iov,
> >  i = 0;
> >  nents = (len + n->page_size - 1) >> n->page_bits;
> >  prp_trans = MIN(n->max_prp_ents, nents) * 
> > sizeof(uint64_t);
> > -nvme_addr_read(n, prp_ent, (void *) prp_list, 
> > prp_trans);
> > +ret = nvme_addr_read(n, prp_ent, (void *) prp_list,
> > +prp_trans);
> > +if (ret) {
> > +trace_nvme_dev_err_addr_read(prp_ent);
> > +status = NVME_DATA_TRANSFER_ERROR;
> > +goto unmap;
> > +}
> >  prp_ent = le64_to_cpu(prp_list[i]);
> >  }
> >  
> > @@ -443,6 +455,7 @@ static void nvme_post_cqes(void *opaque)
> >  NvmeCQueue *cq = opaque;
> >  NvmeCtrl *n = cq->ctrl;
> >  NvmeRequest *req, *next;
> > +int ret;
> >  
> >  QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
> >  NvmeSQueue *sq;
> > @@ -452,15 +465,21 @@ static void nvme_post_cqes(void *opaque)
> >  break;
> >  }
> >  
> > -QTAILQ_REMOVE(&cq->req_list, req, entry);
> >  sq = req->sq;
> >  req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
> >  req->cqe.sq_id = cpu_to_le16(sq->sqid);
> >  req->cqe.sq_head = cpu_to_le16(sq->head);
> >  addr = cq->dma_addr + cq->tail * n->cqe_size;
> > -nvme_inc_cq_tail(cq);
> > -pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> > +ret = pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> >  sizeof(req->cqe));
> > +if (ret) {
> > +trace_nvme_dev_err_addr_write(addr);
> > +timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> > +100 * SCALE_MS);
> > +break;
> > +}
> > +QTAILQ_REMOVE(&cq->req_list, req, entry);
> > +nvme_inc_cq_tail(cq);
> >  nvme_req_clear(req);
> >  QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> >  }
> > @@ -1588,7 +1607,12 @@ static void nvme_process_sq(void *opaque)
> >  
> 

Re: [PATCH v5 17/26] nvme: allow multiple aios per command

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 13:48, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > This refactors how the device issues asynchronous block backend
> > requests. The NvmeRequest now holds a queue of NvmeAIOs that are
> > associated with the command. This allows multiple aios to be issued for
> > a command. Only when all requests have been completed will the device
> > post a completion queue entry.
> > 
> > Because the device is currently guaranteed to only issue a single aio
> > request per command, the benefit is not immediately obvious. But this
> > functionality is required to support metadata, the dataset management
> > command and other features.
> 
> I don't know what the strategy will be chosen for supporting metadata
> (qemu doesn't have any notion of metadata in the block layer), but for 
> dataset management
> you are right. Dataset management command can contain a table of areas to 
> discard
> (although in reality I have seen no driver putting there more that one entry).
> 

The strategy is different depending on how the metadata is transferred
between host and device. For the "separate buffer" case, metadata is
transferred using a separate memory pointer in the nvme command (MPTR).
In this case the metadata is kept separately on a new blockdev attached
to the namespace.

In the other case, metadata is transferred as part of an extended lba
(say 512 + 8 bytes) and kept inline on the main namespace blockdev. This
is challenging for QEMU as it breaks interoperability of the image with
other devices. But that is a discussion for a fresh RFC ;)

Note that the support for multiple AIOs is also used for DULBE support
down the line when I get around to posting those patches. So this is
preparatory for a lot of features that require persistent state across
device power off.

> 
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 449 +-
> >  hw/block/nvme.h   | 134 +++--
> >  hw/block/trace-events |   8 +
> >  3 files changed, 480 insertions(+), 111 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 334265efb21e..e97da35c4ca1 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -19,7 +19,8 @@
> >   *  -drive file=,if=none,id=
> >   *  -device nvme,drive=,serial=,id=, \
> >   *  cmb_size_mb=, \
> > - *  num_queues=
> > + *  num_queues=, \
> > + *  mdts=
> 
> Could you split mdts checks into a separate patch? This is not related to the 
> series.

Absolutely. Done.

> 
> >   *
> >   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
> >   * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> > @@ -57,6 +58,7 @@
> >  } while (0)
> >  
> >  static void nvme_process_sq(void *opaque);
> > +static void nvme_aio_cb(void *opaque, int ret);
> >  
> >  static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
> >  {
> > @@ -341,6 +343,107 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t 
> > *ptr, uint32_t len,
> >  return status;
> >  }
> >  
> > +static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +NvmeNamespace *ns = req->ns;
> > +
> > +uint32_t len = req->nlb << nvme_ns_lbads(ns);
> > +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +
> > +return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> > +}
> 
> Same here, this is another nice refactoring and it should be in separate 
> patch.

Done.

> 
> > +
> > +static void nvme_aio_destroy(NvmeAIO *aio)
> > +{
> > +g_free(aio);
> > +}
> > +
> > +static inline void nvme_req_register_aio(NvmeRequest *req, NvmeAIO *aio,
> > +NvmeAIOOp opc)
> > +{
> > +aio->opc = opc;
> > +
> > +trace_nvme_dev_req_register_aio(nvme_cid(req), aio, blk_name(aio->blk),
> > +aio->offset, aio->len, nvme_aio_opc_str(aio), req);
> > +
> > +if (req) {
> > +QTAILQ_INSERT_TAIL(&req->aio_tailq, aio, tailq_entry);
> > +}
> > +}
> > +
> > +static void nvme_aio(NvmeAIO *aio)
> Function name not clear to me. Maybe change this to something like 
> nvme_submit_aio.

Fixed.

> > +{
> > +BlockBackend *blk = aio->blk;
> > +BlockAcctCookie *acct = &aio->acct;
> > +BlockAcctStats *stats = blk_get_stats(blk);
> > +
> > +bool is_write, dma;
> > +
> > +switch (aio->opc) {
> > +case NVME_AIO_OPC_NONE:
> > +break;
> > +
> > +case NVME_AIO_OPC_FLUSH:
> > +block_acct_start(stats, acct, 0, BLOCK_ACCT_FLUSH);
> > +aio->aiocb = blk_aio_flush(blk, nvme_aio_cb, aio);
> > +break;
> > +
> > +case NVME_AIO_OPC_WRITE_ZEROES:
> > +block_acct_start(stats, acct, aio->len, BLOCK_ACCT_WRITE);
> > +aio->aiocb = blk_aio_pwrite_zeroes(blk, aio->offset, aio->len,
> > +BDRV_REQ_MAY_UNMAP, nvme_aio_cb, aio);
> > +break;
> > +
> > +case 

Re: [PATCH v5 16/26] nvme: refactor prp mapping

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 13:44, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Refactor nvme_map_prp and allow PRPs to be located in the CMB. The logic
> > ensures that if some of the PRP is in the CMB, all of it must be located
> > there, as per the specification.
> 
> To be honest this looks like not refactoring but a bugfix
> (old code was just assuming that if first prp entry is in cmb, the rest also 
> is)

I split it up into a separate bugfix patch.

> > 
> > Also combine nvme_dma_{read,write}_prp into a single nvme_dma_prp that
> > takes an additional DMADirection parameter.
> 
> To be honest 'nvme_dma_prp' was not a clear function name to me at first 
> glance.
> Could you rename this to nvme_dma_prp_rw or so? (Although even that is 
> somewhat unclear
> to convey the meaning of read/write the data to/from the guest memory areas 
> defined by the prp list.
> Also could you split this change into a new patch?
> 

Splitting into new patch.

> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> Now you even use your both addresses :-)
> 
> > ---
> >  hw/block/nvme.c   | 245 +++---
> >  hw/block/nvme.h   |   2 +-
> >  hw/block/trace-events |   1 +
> >  include/block/nvme.h  |   1 +
> >  4 files changed, 160 insertions(+), 89 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 4acfc85b56a2..334265efb21e 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -58,6 +58,11 @@
> >  
> >  static void nvme_process_sq(void *opaque);
> >  
> > +static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
> > +{
> > +return &n->cmbuf[addr - n->ctrl_mem.addr];
> > +}
> 
> To my taste I would put this together with the patch that
> added nvme_addr_is_cmb. I know that some people are against
> this citing the fact that you should use the code you add
> in the same patch. Your call.
> 
> Regardless of this I also prefer to put refactoring patches first in the 
> series.
> 
> > +
> >  static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> >  {
> >  hwaddr low = n->ctrl_mem.addr;
> > @@ -152,138 +157,187 @@ static void nvme_irq_deassert(NvmeCtrl *n, 
> > NvmeCQueue *cq)
> >  }
> >  }
> >  
> > -static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t 
> > prp1,
> > - uint64_t prp2, uint32_t len, NvmeCtrl *n)
> > +static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector 
> > *iov,
> > +uint64_t prp1, uint64_t prp2, uint32_t len, NvmeRequest *req)
> 
> Split line alignment (it was correct before).
> Also while at the refactoring, it would be great to add some documentation
> to this and few more functions, since its not clear immediately what this 
> does.
> 
> 
> >  {
> >  hwaddr trans_len = n->page_size - (prp1 % n->page_size);
> >  trans_len = MIN(len, trans_len);
> >  int num_prps = (len >> n->page_bits) + 1;
> > +uint16_t status = NVME_SUCCESS;
> > +bool is_cmb = false;
> > +bool prp_list_in_cmb = false;
> > +
> > +trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
> > +prp1, prp2, num_prps);
> >  
> >  if (unlikely(!prp1)) {
> >  trace_nvme_dev_err_invalid_prp();
> >  return NVME_INVALID_FIELD | NVME_DNR;
> > -} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
> > -   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
> > -qsg->nsg = 0;
> > +}
> > +
> > +if (nvme_addr_is_cmb(n, prp1)) {
> > +is_cmb = true;
> > +
> >  qemu_iovec_init(iov, num_prps);
> > -qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr],
> > trans_len);
> > +
> > +/*
> > + * PRPs do not cross page boundaries, so if the start address 
> > (here,
> > + * prp1) is within the CMB, it cannot cross outside the controller
> > + * memory buffer range. This is ensured by
> > + *
> > + *   len = n->page_size - (addr % n->page_size)
> > + *
> > + * Thus, we can directly add to the iovec without risking an out of
> > + * bounds access. This also holds for the remaining qemu_iovec_add
> > + * calls.
> > + */
> > +qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp1), trans_len);
> >  } else {
> >  pci_dma_sglist_init(qsg, >parent_obj, num_prps);
> >  qemu_sglist_add(qsg, prp1, trans_len);
> >  }
> > +
> >  len -= trans_len;
> >  if (len) {
> >  if (unlikely(!prp2)) {
> >  trace_nvme_dev_err_invalid_prp2_missing();
> > +status = NVME_INVALID_FIELD | NVME_DNR;
> >  goto unmap;
> >  }
> > +
> >  if (len > n->page_size) {
> >  uint64_t prp_list[n->max_prp_ents];
> >  uint32_t nents, prp_trans;
> >  int i = 0;
> >  
> > +if (nvme_addr_is_cmb(n, prp2)) {
> > +prp_list_in_cmb 

Re: [PATCH v5 21/26] nvme: add support for scatter gather lists

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 14:07, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > For now, support the Data Block, Segment and Last Segment descriptor
> > types.
> > 
> > See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Fam Zheng 
> > ---
> >  block/nvme.c  |  18 +-
> >  hw/block/nvme.c   | 375 +++---
> >  hw/block/trace-events |   4 +
> >  include/block/nvme.h  |  62 ++-
> >  4 files changed, 389 insertions(+), 70 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index d41c4bda6e39..521f521054d5 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -446,7 +446,7 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >  error_setg(errp, "Cannot map buffer for DMA");
> >  goto out;
> >  }
> > -cmd.prp1 = cpu_to_le64(iova);
> > +cmd.dptr.prp.prp1 = cpu_to_le64(iova);
> >  
> >  if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> >  error_setg(errp, "Failed to identify controller");
> > @@ -545,7 +545,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, 
> > Error **errp)
> >  }
> >  cmd = (NvmeCmd) {
> >  .opcode = NVME_ADM_CMD_CREATE_CQ,
> > -.prp1 = cpu_to_le64(q->cq.iova),
> > +.dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
> >  .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xffff)),
> >  .cdw11 = cpu_to_le32(0x3),
> >  };
> > @@ -556,7 +556,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, 
> > Error **errp)
> >  }
> >  cmd = (NvmeCmd) {
> >  .opcode = NVME_ADM_CMD_CREATE_SQ,
> > -.prp1 = cpu_to_le64(q->sq.iova),
> > +.dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
> >  .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xffff)),
> >  .cdw11 = cpu_to_le32(0x1 | (n << 16)),
> >  };
> > @@ -906,16 +906,16 @@ try_map:
> >  case 0:
> >  abort();
> >  case 1:
> > -cmd->prp1 = pagelist[0];
> > -cmd->prp2 = 0;
> > +cmd->dptr.prp.prp1 = pagelist[0];
> > +cmd->dptr.prp.prp2 = 0;
> >  break;
> >  case 2:
> > -cmd->prp1 = pagelist[0];
> > -cmd->prp2 = pagelist[1];
> > +cmd->dptr.prp.prp1 = pagelist[0];
> > +cmd->dptr.prp.prp2 = pagelist[1];
> >  break;
> >  default:
> > -cmd->prp1 = pagelist[0];
> > -cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
> > +cmd->dptr.prp.prp1 = pagelist[0];
> > +cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + 
> > sizeof(uint64_t));
> >  break;
> >  }
> >  trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 204ae1d33234..a91c60fdc111 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -75,8 +75,10 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr 
> > addr)
> >  
> >  static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> > -if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> > -memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
> > +hwaddr hi = addr + size;
> Are you sure you don't want to check for overflow here?
> Its theoretical issue since addr has to be almost full 64 bit
> but still for those things I check this very defensively.
> 

The use of nvme_addr_read in map_prp simply cannot overflow due to how
the size is calculated, but for SGLs it's different. But the overflow is
checked in map_sgl because we have to return a special error code in
that case.

On the other hand there may be other callers of nvme_addr_read in the
future that do not check this, so I'll re-add it.
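
One way to re-add it (a sketch, also guarding against wrap-around; not
necessarily the exact final code):

    static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
    {
        hwaddr hi = addr + size - 1;

        /* reject a range that wraps around the address space */
        if (hi < addr) {
            return 1;
        }

        if (n->cmbsz && nvme_addr_is_cmb(n, addr) && nvme_addr_is_cmb(n, hi)) {
            memcpy(buf, nvme_addr_to_cmb(n, addr), size);
            return 0;
        }

        return pci_dma_read(&n->parent_obj, addr, buf, size);
    }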

> > +
> > +if (n->cmbsz && nvme_addr_is_cmb(n, addr) && nvme_addr_is_cmb(n, hi)) {
> Here you fix the bug I mentioned in patch 6. I suggest you to move the fix 
> there.

Done.

> > +memcpy(buf, nvme_addr_to_cmb(n, addr), size);
> >  return 0;
> >  }
> >  
> > @@ -159,6 +161,48 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue 
> > *cq)
> >  }
> >  }
> >  
> > +static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr 
> > addr,
> > +size_t len)
> > +{
> > +if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len)) {
> > +return NVME_DATA_TRANSFER_ERROR;
> > +}
> > +
> > +qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len);
> > +
> > +return NVME_SUCCESS;
> > +}
> > +
> > +static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector 
> > *iov,
> > +hwaddr addr, size_t len)
> > +{
> > +bool addr_is_cmb = nvme_addr_is_cmb(n, addr);
> > +
> > +if (addr_is_cmb) {
> > +if (qsg->sg) {
> > +return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +}
> > +
> > +if (!iov->iov) {
> > +qemu_iovec_init(iov, 1);
> > +}
> > +
> > +  

Re: [PATCH v5 15/26] nvme: bump supported specification to 1.3

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 12:35, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Add new fields to the Identify Controller and Identify Namespace data
> > structures according to NVM Express 1.3d.
> > 
> > NVM Express 1.3d requires the following additional features:
> >   - addition of the Namespace Identification Descriptor List (CNS 03h)
> > for the Identify command
> >   - support for returning Command Sequence Error if a Set Features
> > command is submitted for the Number of Queues feature after any I/O
> > queues have been created.
> >   - The addition of the Log Specific Field (LSP) in the Get Log Page
> > command.
> 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 57 ---
> >  hw/block/nvme.h   |  1 +
> >  hw/block/trace-events |  3 ++-
> >  include/block/nvme.h  | 20 ++-
> >  4 files changed, 71 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 900732bb2f38..4acfc85b56a2 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -9,7 +9,7 @@
> >   */
> >  
> >  /**
> > - * Reference Specification: NVM Express 1.2.1
> > + * Reference Specification: NVM Express 1.3d
> >   *
> >   *   https://nvmexpress.org/resources/specifications/
> >   */
> > @@ -43,7 +43,7 @@
> >  #include "trace.h"
> >  #include "nvme.h"
> >  
> > -#define NVME_SPEC_VER 0x00010201
> > +#define NVME_SPEC_VER 0x00010300
> >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> >  #define NVME_TEMPERATURE 0x143
> >  #define NVME_TEMPERATURE_WARNING 0x157
> > @@ -735,6 +735,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, 
> > NvmeRequest *req)
> >  uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> >  uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> >  uint8_t  lid = dw10 & 0xff;
> > +uint8_t  lsp = (dw10 >> 8) & 0xf;
> >  uint8_t  rae = (dw10 >> 15) & 0x1;
> >  uint32_t numdl, numdu;
> >  uint64_t off, lpol, lpou;
> > @@ -752,7 +753,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, 
> > NvmeRequest *req)
> >  return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> >  
> > -trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
> > +trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off);
> >  
> >  switch (lid) {
> >  case NVME_LOG_ERROR_INFO:
> > @@ -863,6 +864,8 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd 
> > *cmd)
> >  cq = g_malloc0(sizeof(*cq));
> >  nvme_init_cq(cq, n, prp1, cqid, vector, qsize + 1,
> >  NVME_CQ_FLAGS_IEN(qflags));
> Code alignment on that '('
> > +
> > +n->qs_created = true;
> Should be done also at nvme_create_sq

No, because you can't create a SQ without a matching CQ:

if (unlikely(!cqid || nvme_check_cqid(n, cqid))) {
trace_nvme_dev_err_invalid_create_sq_cqid(cqid);
return NVME_INVALID_CQID | NVME_DNR;
}


So if there is a matching cq, then qs_created = true.

> >  return NVME_SUCCESS;
> >  }
> >  
> > @@ -924,6 +927,47 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, 
> > NvmeIdentify *c)
> >  return ret;
> >  }
> >  
> > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > +{
> > +static const int len = 4096;
> The spec caps the Identify payload size to 4K,
> thus this should go to nvme.h

Done.

> > +
> > +struct ns_descr {
> > +uint8_t nidt;
> > +uint8_t nidl;
> > +uint8_t rsvd2[2];
> > +uint8_t nid[16];
> > +};
> This is also part of the spec, thus should
> move to nvme.h
> 

Done - and cleaned up.

> > +
> > +uint32_t nsid = le32_to_cpu(c->nsid);
> > +uint64_t prp1 = le64_to_cpu(c->prp1);
> > +uint64_t prp2 = le64_to_cpu(c->prp2);
> > +
> > +struct ns_descr *list;
> > +uint16_t ret;
> > +
> > +trace_nvme_dev_identify_ns_descr_list(nsid);
> > +
> > +if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > +trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > +return NVME_INVALID_NSID | NVME_DNR;
> > +}
> > +
> > +/*
> > + * Because the NGUID and EUI64 fields are 0 in the Identify Namespace 
> > data
> > + * structure, a Namespace UUID (nidt = 0x3) must be reported in the
> > + * Namespace Identification Descriptor. Add a very basic Namespace UUID
> > + * here.
> Some per namespace uuid qemu property will be very nice to have to have a 
> uuid that
> is at least somewhat unique.
> Linux kernel I think might complain if it detects namespaces with duplicate 
> uuids.

It will be "unique" per controller (because it's just the namespace id).
The spec also says that it should be fixed for the lifetime of the
namespace, but I'm not sure how to ensure that without keeping that
state on disk somehow. I have a solution for this in a later series, but
for now, I think this is ok.

But since we actually support multiple controllers, there certainly is
an issue here. Maybe we can 

Re: [PATCH v5 14/26] nvme: make sure ncqr and nsqr is valid

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 12:30, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > 0xffff is not an allowed value for NCQR and NSQR in Set Features on
> > Number of Queues.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 4 
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 30c5b3e7a67d..900732bb2f38 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1133,6 +1133,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >  break;
> >  case NVME_NUMBER_OF_QUEUES:
> > +if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) ==
> > 0xffff) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> Very minor nitpick: since this spec requirement is not obvious, a 
> quote/reference to the spec
> would be nice to have here. 
> 

Added.
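
Something along these lines (sketch; I believe the relevant part of the
spec is NVM Express 1.3, Section 5.21.1.7, "Number of Queues"):

    case NVME_NUMBER_OF_QUEUES:
        /*
         * NVMe v1.3, Section 5.21.1.7: 0xffff is not an allowed value for
         * NCQR and NSQR.
         */
        if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) == 0xffff) {
            return NVME_INVALID_FIELD | NVME_DNR;
        }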

> > +
> >  trace_nvme_dev_setfeat_numq((dw11 & 0xffff) + 1,
> >  ((dw11 >> 16) & 0xffff) + 1, n->params.num_queues - 1,
> >  n->params.num_queues - 1);
> 
> Reviewed-by: Maxim Levitsky 
> 
> Best regards,
>   Maxim Levitsky
> 



Re: [PATCH v5 12/26] nvme: add missing mandatory features

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 12:27, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Add support for returning a reasonable response to Get/Set Features of
> > mandatory features.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 57 ---
> >  hw/block/trace-events |  2 ++
> >  include/block/nvme.h  |  3 ++-
> >  3 files changed, 58 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index a186d95df020..3267ee2de47a 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1008,7 +1008,15 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  uint32_t result;
> >  
> > +trace_nvme_dev_getfeat(nvme_cid(req), dw10);
> > +
> >  switch (dw10) {
> > +case NVME_ARBITRATION:
> > +result = cpu_to_le32(n->features.arbitration);
> > +break;
> > +case NVME_POWER_MANAGEMENT:
> > +result = cpu_to_le32(n->features.power_mgmt);
> > +break;
> >  case NVME_TEMPERATURE_THRESHOLD:
> >  result = 0;
> >  
> > @@ -1029,6 +1037,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  break;
> >  }
> >  
> > +break;
> > +case NVME_ERROR_RECOVERY:
> > +result = cpu_to_le32(n->features.err_rec);
> >  break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  result = blk_enable_write_cache(n->conf.blk);
> 
> This is existing code but still like to point out that endianess conversion 
> is missing.

Fixed.

> Also we need to think if we need to do some flush if the write cache is 
> disabled.
> I don't know yet that area well enough.
> 

Looking at the block layer code, disabling the write cache just sets a
flag; subsequent requests will then carry BDRV_REQ_FUA, but anything
already sitting in the cache is not flushed. So to make sure that data
is actually durable when the host disables the cache, let's do an
explicit flush.
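
Something along these lines (an untested sketch of the idea, not the
final patch):

    case NVME_VOLATILE_WRITE_CACHE:
        if (!(dw11 & 0x1) && blk_enable_write_cache(n->conf.blk)) {
            /* host is turning the cache off; push out anything still cached */
            blk_flush(n->conf.blk);
        }

        blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
        break;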

> > @@ -1041,6 +1052,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  break;
> >  case NVME_TIMESTAMP:
> >  return nvme_get_feature_timestamp(n, cmd);
> > +case NVME_INTERRUPT_COALESCING:
> > +result = cpu_to_le32(n->features.int_coalescing);
> > +break;
> > +case NVME_INTERRUPT_VECTOR_CONF:
> > +if ((dw11 & 0xffff) > n->params.num_queues) {
> Looks like it should be >= since interrupt vector is not zero based.

Fixed in other patch.

> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
> > +break;
> > +case NVME_WRITE_ATOMICITY:
> > +result = cpu_to_le32(n->features.write_atomicity);
> > +break;
> >  case NVME_ASYNCHRONOUS_EVENT_CONF:
> >  result = cpu_to_le32(n->features.async_config);
> >  break;
> > @@ -1076,6 +1100,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> >  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  
> > +trace_nvme_dev_setfeat(nvme_cid(req), dw10, dw11);
> > +
> >  switch (dw10) {
> >  case NVME_TEMPERATURE_THRESHOLD:
> >  if (NVME_TEMP_TMPSEL(dw11)) {
> > @@ -1116,6 +1142,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  case NVME_ASYNCHRONOUS_EVENT_CONF:
> >  n->features.async_config = dw11;
> >  break;
> > +case NVME_ARBITRATION:
> > +case NVME_POWER_MANAGEMENT:
> > +case NVME_ERROR_RECOVERY:
> > +case NVME_INTERRUPT_COALESCING:
> > +case NVME_INTERRUPT_VECTOR_CONF:
> > +case NVME_WRITE_ATOMICITY:
> > +return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
> >  default:
> >  trace_nvme_dev_err_invalid_setfeat(dw10);
> >  return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1689,6 +1722,21 @@ static void nvme_init_state(NvmeCtrl *n)
> >  n->temperature = NVME_TEMPERATURE;
> >  n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> >  n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > +
> > +/*
> > + * There is no limit on the number of commands that the controller may
> > + * launch at one time from a particular Submission Queue.
> > + */
> > +n->features.arbitration = 0x7;
> A nice #define in nvme.h stating that 0x7 means no burst limit would be nice.
> 

Done.

> > +
> > +n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
> > +sizeof(*n->features.int_vector_config));
> > +
> > +/* disable coalescing (not supported) */
> > +for (int i = 0; i < n->params.num_queues; i++) {
> > +n->features.int_vector_config[i] = i | (1 << 16);
> Same here

Done.
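
That is, roughly (the macro name below is only illustrative):

    #define NVME_INTVC_NOCOALESCING (0x1 << 16)

    n->features.int_vector_config[i] = i | NVME_INTVC_NOCOALESCING;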

> > +}
> > +
> >  n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
> >  }
> >  
> > @@ -1782,15 +1830,17 @@ static void 

Re: [PATCH v5 10/26] nvme: add support for the get log page command

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 11:35, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Add support for the Get Log Page command and basic implementations of
> > the mandatory Error Information, SMART / Health Information and Firmware
> > Slot Information log pages.
> > 
> > In violation of the specification, the SMART / Health Information log
> > page does not persist information over the lifetime of the controller
> > because the device has no place to store such persistent state.
> Yea, not the end of the world.
> > 
> > Note that the LPA field in the Identify Controller data structure
> > intentionally has bit 0 cleared because there is no namespace specific
> > information in the SMART / Health information log page.
> Makes sense.
> > 
> > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > Section 5.10 ("Get Log Page command").
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 122 +-
> >  hw/block/nvme.h   |  10 
> >  hw/block/trace-events |   2 +
> >  include/block/nvme.h  |   2 +-
> >  4 files changed, 134 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f72348344832..468c36918042 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -569,6 +569,123 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd 
> > *cmd)
> >  return NVME_SUCCESS;
> >  }
> >  
> > +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > buf_len,
> > +uint64_t off, NvmeRequest *req)
> > +{
> > +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +uint32_t nsid = le32_to_cpu(cmd->nsid);
> > +
> > +uint32_t trans_len;
> > +time_t current_ms;
> > +uint64_t units_read = 0, units_written = 0, read_commands = 0,
> > +write_commands = 0;
> > +NvmeSmartLog smart;
> > +BlockAcctStats *s;
> > +
> > +if (nsid && nsid != 0xffffffff) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +s = blk_get_stats(n->conf.blk);
> > +
> > +units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > +units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > +read_commands = s->nr_ops[BLOCK_ACCT_READ];
> > +write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> > +
> > +if (off > sizeof(smart)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +trans_len = MIN(sizeof(smart) - off, buf_len);
> > +
> > +memset(&smart, 0x0, sizeof(smart));
> > +
> > +smart.data_units_read[0] = cpu_to_le64(units_read / 1000);
> > +smart.data_units_written[0] = cpu_to_le64(units_written / 1000);
> > +smart.host_read_commands[0] = cpu_to_le64(read_commands);
> > +smart.host_write_commands[0] = cpu_to_le64(write_commands);
> > +
> > +smart.temperature[0] = n->temperature & 0xff;
> > +smart.temperature[1] = (n->temperature >> 8) & 0xff;
> > +
> > +if ((n->temperature > n->features.temp_thresh_hi) ||
> > +(n->temperature < n->features.temp_thresh_low)) {
> > +smart.critical_warning |= NVME_SMART_TEMPERATURE;
> > +}
> > +
> > +current_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > +smart.power_on_hours[0] = cpu_to_le64(
> > +(((current_ms - n->starttime_ms) / 1000) / 60) / 60);
> > +
> > +return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > +prp2);
> > +}
> Looks OK.
> > +
> > +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > buf_len,
> > +uint64_t off, NvmeRequest *req)
> > +{
> > +uint32_t trans_len;
> > +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +NvmeFwSlotInfoLog fw_log;
> > +
> > +if (off > sizeof(fw_log)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +memset(&fw_log, 0, sizeof(NvmeFwSlotInfoLog));
> > +
> > +trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > +
> > +return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > +prp2);
> > +}
> Looks OK
> > +
> > +static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > +uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > +uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> > +uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> > +uint8_t  lid = dw10 & 0xff;
> > +uint8_t  rae = (dw10 >> 15) & 0x1;
> > +uint32_t numdl, numdu;
> > +uint64_t off, lpol, lpou;
> > +size_t   len;
> > +
> > +numdl = (dw10 >> 16);
> > +numdu = (dw11 & 0xffff);
> > +lpol = dw12;
> > +lpou = dw13;
> > +
> > +len = (((numdu << 16) | numdl) + 1) << 2;
> > +off = (lpou << 32ULL) | lpol;
> > +
> > +if (off & 0x3) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> 
> Good. 
> Note that there are plenty of other places in the driver 

Re: [PATCH v5 09/26] nvme: add temperature threshold feature

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 11:31, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > It might seem wierd to implement this feature for an emulated device,
> > but it is mandatory to support and the feature is useful for testing
> > asynchronous event request support, which will be added in a later
> > patch.
> 
> Absolutely but as the old saying is, rules are rules.
> At least, to the defense of the spec, making this mandatory
> forced the vendors to actually report some statistics about
> the device in neutral format as opposed to yet another
> vendor proprietary thing (I am talking about SMART log page).
> 
> > 
> > Signed-off-by: Klaus Jensen 
> 
> I noticed that you sign off some patches with your @samsung.com email,
> and some with @cnexlabs.com
> Is there a reason for that?

Yeah. Some of this code was made while I was at CNEX Labs. I've since
moved to Samsung. But credit where credit's due.

> 
> 
> > ---
> >  hw/block/nvme.c  | 50 
> >  hw/block/nvme.h  |  2 ++
> >  include/block/nvme.h |  7 ++-
> >  3 files changed, 58 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 81514eaef63a..f72348344832 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -45,6 +45,9 @@
> >  
> >  #define NVME_SPEC_VER 0x00010201
> >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> > +#define NVME_TEMPERATURE 0x143
> > +#define NVME_TEMPERATURE_WARNING 0x157
> > +#define NVME_TEMPERATURE_CRITICAL 0x175
> >  
> >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> >  do { \
> > @@ -798,9 +801,31 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl 
> > *n, NvmeCmd *cmd)
> >  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest 
> > *req)
> >  {
> >  uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > +uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  uint32_t result;
> >  
> >  switch (dw10) {
> > +case NVME_TEMPERATURE_THRESHOLD:
> > +result = 0;
> > +
> > +/*
> > + * The controller only implements the Composite Temperature 
> > sensor, so
> > + * return 0 for all other sensors.
> > + */
> > +if (NVME_TEMP_TMPSEL(dw11)) {
> > +break;
> > +}
> > +
> > +switch (NVME_TEMP_THSEL(dw11)) {
> > +case 0x0:
> > +result = cpu_to_le16(n->features.temp_thresh_hi);
> > +break;
> > +case 0x1:
> > +result = cpu_to_le16(n->features.temp_thresh_low);
> > +break;
> > +}
> > +
> > +break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  result = blk_enable_write_cache(n->conf.blk);
> >  trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
> > @@ -845,6 +870,23 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  
> >  switch (dw10) {
> > +case NVME_TEMPERATURE_THRESHOLD:
> > +if (NVME_TEMP_TMPSEL(dw11)) {
> > +break;
> > +}
> > +
> > +switch (NVME_TEMP_THSEL(dw11)) {
> > +case 0x0:
> > +n->features.temp_thresh_hi = NVME_TEMP_TMPTH(dw11);
> > +break;
> > +case 0x1:
> > +n->features.temp_thresh_low = NVME_TEMP_TMPTH(dw11);
> > +break;
> > +default:
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >  break;
> > @@ -1366,6 +1408,9 @@ static void nvme_init_state(NvmeCtrl *n)
> >  n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >  n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> >  n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > +
> > +n->temperature = NVME_TEMPERATURE;
> 
> This appears not to be used in the patch.
> I think you should move that to the next patch that
> adds the get log page support.
> 

Fixed.

> > +n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> >  }
> >  
> >  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > @@ -1447,6 +1492,11 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  id->acl = 3;
> >  id->frmw = 7 << 1;
> >  id->lpa = 1 << 0;
> > +
> > +/* recommended default value (~70 C) */
> > +id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> > +id->cctemp = cpu_to_le16(NVME_TEMPERATURE_CRITICAL);
> > +
> >  id->sqes = (0x6 << 4) | 0x6;
> >  id->cqes = (0x4 << 4) | 0x4;
> >  id->nn = cpu_to_le32(n->num_namespaces);
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index a867bdfabafd..1518f32557a3 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -108,6 +108,7 @@ typedef struct NvmeCtrl {
> >  uint64_tirq_status;
> >  uint64_thost_timestamp; /* Timestamp sent by the 
> > host */
> >  

Re: [PATCH v5 08/26] nvme: refactor device realization

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 11:27, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > This patch splits up nvme_realize into multiple individual functions,
> > each initializing a different subset of the device.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 175 +++-
> >  hw/block/nvme.h |  21 ++
> >  2 files changed, 133 insertions(+), 63 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index e1810260d40b..81514eaef63a 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -44,6 +44,7 @@
> >  #include "nvme.h"
> >  
> >  #define NVME_SPEC_VER 0x00010201
> > +#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> >  
> >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> >  do { \
> > @@ -1325,67 +1326,106 @@ static const MemoryRegionOps nvme_cmb_ops = {
> >  },
> >  };
> >  
> > -static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > +static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
> >  {
> > -NvmeCtrl *n = NVME(pci_dev);
> > -NvmeIdCtrl *id = &n->id_ctrl;
> > -
> > -int i;
> > -int64_t bs_size;
> > -uint8_t *pci_conf;
> > -
> > -if (!n->params.num_queues) {
> > -error_setg(errp, "num_queues can't be zero");
> > -return;
> > -}
> > +NvmeParams *params = >params;
> >  
> >  if (!n->conf.blk) {
> > -error_setg(errp, "drive property not set");
> > -return;
> > +error_setg(errp, "nvme: block backend not configured");
> > +return 1;
> As a matter of taste, negative values indicate error, and 0 is the success 
> value.
> In Linux kernel this is even an official rule.
> >  }

Fixed.

> >  
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg(errp, "could not get backing file size");
> > -return;
> > +if (!params->serial) {
> > +error_setg(errp, "nvme: serial not configured");
> > +return 1;
> >  }
> >  
> > -if (!n->params.serial) {
> > -error_setg(errp, "serial property not set");
> > -return;
> > +if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
> > +error_setg(errp, "nvme: invalid queue configuration");
> Maybe something like "nvme: invalid queue count specified, should be between 
> 1 and ..."?
> > +return 1;
> >  }

Fixed.

> > +
> > +return 0;
> > +}
> > +
> > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > +{
> >  blkconf_blocksizes(&n->conf);
> >  if (!blkconf_apply_backend_options(&n->conf, 
> > blk_is_read_only(n->conf.blk),
> > -   false, errp)) {
> > -return;
> > +false, errp)) {
> > +return 1;
> >  }
> >  
> > -pci_conf = pci_dev->config;
> > -pci_conf[PCI_INTERRUPT_PIN] = 1;
> > -pci_config_set_prog_interface(pci_dev->config, 0x2);
> > -pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> > -pcie_endpoint_cap_init(pci_dev, 0x80);
> > +return 0;
> > +}
> >  
> > +static void nvme_init_state(NvmeCtrl *n)
> > +{
> >  n->num_namespaces = 1;
> >  n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
> 
> Isn't that wrong?
> First 4K of mmio (0x1000) is the registers, and that is followed by the 
> doorbells,
> and each doorbell takes 8 bytes (assuming regular doorbell stride).
> so n->params.num_queues + 1 should be total number of queues, thus the 0x1004 
> should be 0x1000 IMHO.
> I might miss some rounding magic here though.
> 

Yeah. I think you are right. It all becomes slightly more fishy due to
the num_queues device parameter being 1's based and accounting for the
admin queue pair.

But in get/set features, the value has to be 0's based and only account
for the I/O queues, so we need to subtract 2 from the value. It's
confusing all around.
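
To make it concrete: with num_queues=64 the controller has the admin
queue pair plus 63 I/O queue pairs, but the value reported in Get
Features (Number of Queues) has to be 62 (0's based, I/O queues only),
i.e. num_queues - 2.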

Since the admin queue pair isn't really optional, I think it would be
better to introduce a new max_ioqpairs parameter that is 1's based,
counts the number of queue pairs and only accounts for the I/O
queues.

I guess we need to keep the num_queues parameter around for
compatibility.

The doorbells are only 4 bytes btw, but the calculation still looks
wrong. With a max_ioqpairs parameter in place, the reg_size should be

pow2ceil(0x1008 + 2 * (n->params.max_ioqpairs) * 4)

Right? That's 0x1000 for the core registers, 8 bytes for the sq/cq
doorbells of the admin queue pair, and then room for the I/O queue
pairs.

I added a patch for this in v6.
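
Just to sanity check the arithmetic with a made-up example,
max_ioqpairs=64 gives

    reg_size = pow2ceil(0x1008 + 2 * 64 * 4)
             = pow2ceil(0x1208)
             = 0x2000

that is, an 8K BAR for the admin queue pair plus 64 I/O queue pairs.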

> > -n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> > -
> >  n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >  n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> >  n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > +}
> >  
> > -memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
> > -  "nvme", n->reg_size);
> > +static void nvme_init_cmb(NvmeCtrl *n, 

Re: [PATCH RESEND v2] block/nvme: introduce PMR support from NVMe 1.4 spec

2020-03-12 Thread Klaus Birkelund Jensen
On Mar 11 15:54, Andrzej Jakowski wrote:
> On 3/11/20 2:20 AM, Stefan Hajnoczi wrote:
> > Please try:
> > 
> >   $ git grep pmem
> > 
> > backends/hostmem-file.c is the backend that can be used and the
> > pmem_persist() API can be used to flush writes.
> 
> I've reworked this patch into hostmem-file type of backend.
> From simple tests in virtual machine: writing to PMR region
> and then reading from it after VM power cycle I have observed that
> there is no persistency.
> 
> I guess that persistent behavior can be achieved if memory backend file
> resides on actual persistent memory in VMM. I haven't found mechanism to
> persist memory backend file when it resides in the file system on block
> storage. My original mmap + msync based solution worked well there.
> I believe that main problem with mmap was with "ifdef _WIN32" that made it 
> platform specific and w/o it patchew CI complained. 
> Is there a way that I could rework mmap + msync solution so it would fit
> into qemu design?
> 

Hi Andrzej,

Thanks for working on this!

FWIW, I have implemented other stuff for the NVMe device that requires
persistent storage (e.g. LBA allocation tracking for DULBE support). I
used the approach of adding an additional blockdev and simply use the
qemu block layer. This would also make it work on WIN32. And if we just
set bit 0 in PMRWBM and disable the write cache on the blockdev we
should be good on the durability requirements.

Unfortunately, I do not see (or know, maybe Stefan has an idea?) an easy
way of using the MemoryRegionOps nicely with async block backend i/o. so
we either have to use blocking I/O or fire and forget aio. Or, we can
maybe keep bit 1 set in PMRWBM and force a blocking blk_flush on PMRSTS
read.

Finally, a thing to consider is that this is adding an optional NVMe 1.4
feature to an already frankenstein device that doesn't even implement
mandatory v1.2. I think that bumping the NVMe version to 1.4 is out of
the question until we actually implement it fully wrt. mandatory
features. My patchset brings the device up to v1.3 and I have v1.4 ready
for posting, so I think we can get there.


Klaus



Re: [PATCH v5 01/26] nvme: rename trace events to nvme_dev

2020-02-12 Thread Klaus Birkelund Jensen
On Feb 12 11:08, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Change the prefix of all nvme device related trace events to 'nvme_dev'
> > to not clash with trace events from the nvme block driver.
> > 

Hi Maxim,

Thank you very much for your thorough reviews! Utterly appreciated!

I'll start going through your suggested changes. There is a bit of work
to do on splitting patches into refactoring and bugfixes, but I can
definitely see the reason for this, so I'll get to work.

You mention the alignment with split lines alot. I actually thought I
was following CODING_STYLE.rst (which allows a single 4 space indent for
functions, but not statements such as if/else and while/for). But since
hw/block/nvme.c is originally written in the style of aligning with the
opening paranthesis I'm in the wrong here, so I will of course amend
it. Should have done that from the beginning, it's just my personal
taste shining through ;)


Thanks again,
Klaus



Re: [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:47, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:51:42AM +0100, Klaus Jensen wrote:
> > Hi,
> > 
> > 
> > Changes since v4
> >  - Changed vendor and device id to use a Red Hat allocated one. For
> >backwards compatibility add the 'x-use-intel-id' nvme device
> >parameter. This is off by default but is added as a machine compat
> >property to be true for machine types <= 4.2.
> > 
> >  - SGL mapping code has been refactored.
> 
> Looking pretty good to me. For the series beyond the individually
> reviewed patches:
> 
> Acked-by: Keith Busch 
> 
> If you need to send a v5, you may add my tag to the patches that are not
> substaintially modified if you like.

I'll send a v6 with the changes to "nvme: make lba data size
configurable". It won't be substantially changed, I will just only
accept 9 and 12 as valid values for lbads.

Thanks for the Ack's and Reviews Keith!


Klaus


Re: [PATCH v5 24/26] nvme: change controller pci id

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:35, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:06AM +0100, Klaus Jensen wrote:
> > There are two reasons for changing this:
> > 
> >   1. The nvme device currently uses an internal Intel device id.
> > 
> >   2. Since commits "nvme: fix write zeroes offset and count" and "nvme:
> >  support multiple namespaces" the controller device no longer has
> >  the quirks that the Linux kernel think it has.
> > 
> >  As the quirks are applied based on pci vendor and device id, change
> >  them to get rid of the quirks.
> > 
> > To keep backward compatibility, add a new 'x-use-intel-id' parameter to
> > the nvme device to force use of the Intel vendor and device id. This is
> > off by default but add a compat property to set this for machines 4.2
> > and older.
> > 
> > Signed-off-by: Klaus Jensen 
> 
> Yay, thank you for following through on getting this identifier assigned.
> 
> Reviewed-by: Keith Busch 

This is technically not "officially" sanctioned yet, but I got an
indication from Gerd that we are good to proceed with this.



Re: [PATCH v5 22/26] nvme: support multiple namespaces

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:31, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:04AM +0100, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> 
> I like this feature a lot, thanks for doing it.
> 
> Reviewed-by: Keith Busch 
> 
> > @@ -1256,18 +1272,24 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, 
> > NvmeCmd *cmd, uint8_t rae,
> >  uint64_t units_read = 0, units_written = 0, read_commands = 0,
> >  write_commands = 0;
> >  NvmeSmartLog smart;
> > -BlockAcctStats *s;
> >  
> >  if (nsid && nsid != 0xffffffff) {
> >  return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> 
> This is totally optional, but worth mentioning: this patch makes it
> possible to remove this check and allow per-namespace smart logs. The
> ID_CTRL.LPA would need to updated to reflect that if you wanted to
> go that route.

Yeah, I thought about that, but with NVMe v1.4 support arriving in a
later series, there is no longer any namespace specific stuff in the
log page anyway.

The spec isn't really clear on what the preferred behavior for a 1.4
compliant device is. Either

  1. LPA bit 0 set and just return the same page for each namespace, or
  2. LPA bit 0 unset and fail when an NSID is set.



Re: [PATCH v5 26/26] nvme: make lba data size configurable

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:43, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:08AM +0100, Klaus Jensen wrote:
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme-ns.c | 2 +-
> >  hw/block/nvme-ns.h | 4 +++-
> >  hw/block/nvme.c| 1 +
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 0e5be44486f4..981d7101b8f2 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -18,7 +18,7 @@ static int nvme_ns_init(NvmeNamespace *ns)
> >  {
> >  NvmeIdNs *id_ns = >id_ns;
> >  
> > -id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->lbaf[0].ds = ns->params.lbads;
> >  id_ns->nuse = id_ns->ncap = id_ns->nsze =
> >  cpu_to_le64(nvme_ns_nlbas(ns));
> >  
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index b564bac25f6d..f1fe4db78b41 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -7,10 +7,12 @@
> >  
> >  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> >  DEFINE_PROP_DRIVE("drive", _state, blk), \
> > -DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > +DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> > +DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)
> 
> I think we need to validate the parameter is between 9 and 12 before
> trusting it can be used safely.
> 
> Alternatively, add supported formats to the lbaf array and let the host
> decide on a live system with the 'format' command.

The device does not yet support Format NVM, but we have a patch ready
for that to be submitted with a new series when this is merged.

For now, while it does not support Format, I will change this patch such
that it defaults to 9 (BDRV_SECTOR_BITS) and only accepts 12 as an
alternative (while keeping the number of available formats at 1).
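
Roughly like this (an illustrative sketch of the constraint check, not
the actual patch; the error wording is made up):

    if (ns->params.lbads != 9 && ns->params.lbads != 12) {
        error_setg(errp, "unsupported lbads (supported values: 9 and 12)");
        return -1;
    }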


Re: [PATCH v4 20/24] nvme: add support for scatter gather lists

2020-01-13 Thread Klaus Birkelund Jensen
On Jan  9 11:44, Beata Michalska wrote:
> Hi Klaus,
> 
> On Thu, 19 Dec 2019 at 13:09, Klaus Jensen  wrote:
> > @@ -73,7 +73,12 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr 
> > addr)
> >
> >  static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> > -if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> > +hwaddr hi = addr + size;
> > +if (hi < addr) {
> 
> What is the actual use case for that ?

This was for detecting wrap around in the unsigned addition. I found
that nvme_map_sgl does not check if addr + size is out of bounds (which
it should). With that in place this check is belt and braces, so I might
remove it.

> > +static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
> > +QEMUIOVector *iov, NvmeSglDescriptor *segment, uint64_t nsgld,
> > +uint32_t *len, bool is_cmb, NvmeRequest *req)
> > +{
> > +dma_addr_t addr, trans_len;
> > +uint16_t status;
> > +
> > +for (int i = 0; i < nsgld; i++) {
> > +if (NVME_SGL_TYPE(segment[i].type) != SGL_DESCR_TYPE_DATA_BLOCK) {
> > +trace_nvme_dev_err_invalid_sgl_descriptor(nvme_cid(req),
> > +NVME_SGL_TYPE(segment[i].type));
> > +return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
> > +}
> > +
> > +if (*len == 0) {
> > +if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
> > +
> > trace_nvme_dev_err_invalid_sgl_excess_length(nvme_cid(req));
> > +return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > +}
> > +
> > +break;
> > +}
> > +
> > +addr = le64_to_cpu(segment[i].addr);
> > +trans_len = MIN(*len, le64_to_cpu(segment[i].len));
> > +
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +/*
> > + * All data and metadata, if any, associated with a particular
> > + * command shall be located in either the CMB or host memory. 
> > Thus,
> > + * if an address if found to be in the CMB and we have already
> 
> s/address if/address is ?

Fixed, thanks.

> > +static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector 
> > *iov,
> > +NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
> > +{
> > +const int MAX_NSGLD = 256;
> > +
> > +NvmeSglDescriptor segment[MAX_NSGLD];
> > +uint64_t nsgld;
> > +uint16_t status;
> > +bool is_cmb = false;
> > +bool sgl_in_cmb = false;
> > +hwaddr addr = le64_to_cpu(sgl.addr);
> > +
> > +trace_nvme_dev_map_sgl(nvme_cid(req), NVME_SGL_TYPE(sgl.type), 
> > req->nlb, len);
> > +
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +is_cmb = true;
> > +
> > +qemu_iovec_init(iov, 1);
> > +} else {
> > +pci_dma_sglist_init(qsg, &n->parent_obj, 1);
> > +}
> > +
> > +/*
> > + * If the entire transfer can be described with a single data block it 
> > can
> > + * be mapped directly.
> > + */
> > +if (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_DATA_BLOCK) {
> > +status = nvme_map_sgl_data(n, qsg, iov, &sgl, 1, &len, is_cmb, 
> > req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +goto out;
> > +}
> > +
> > +/*
> > + * If the segment is located in the CMB, the submission queue of the
> > + * request must also reside there.
> > + */
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
> > +return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +}
> > +
> > +sgl_in_cmb = true;
> 
> Why not combining this with the condition few lines above
> for the nvme_addr_is_cmb ? Also is the sgl_in_cmb really needed ?
> If the address is from CMB, that  implies the queue is also there,
> otherwise we wouldn't progress beyond this point. Isn't is_cmb sufficient ?
> 

You are right, there is no need for sgl_in_cmb.

But checking if the queue is in the cmb only needs to be done if the
descriptor in DPTR is *not* a "singleton" data block. But I think I can
refactor it to be slightly nicer, or at least be more specific in the
comments.

> > +}
> > +
> > +while (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_SEGMENT) {
> > +bool addr_is_cmb;
> > +
> > +nsgld = le64_to_cpu(sgl.len) / sizeof(NvmeSglDescriptor);
> > +
> > +/* read the segment in chunks of 256 descriptors (4k) */
> > +while (nsgld > MAX_NSGLD) {
> > +if (nvme_addr_read(n, addr, segment, sizeof(segment))) {
> > +trace_nvme_dev_err_addr_read(addr);
> > +status = NVME_DATA_TRANSFER_ERROR;
> > +goto unmap;
> > +}
> > +
> > +status = nvme_map_sgl_data(n, qsg, iov, segment, MAX_NSGLD, 
> > &len,
> > +is_cmb, req);
> 
> This will probably fail if there is a BitBucket Descriptor on the way (?)
> 

nvme_map_sgl_data will error out on any descriptors different from
"DATA_BLOCK". So I think we are 

Re: [PATCH v4 17/24] nvme: allow multiple aios per command

2020-01-13 Thread Klaus Birkelund Jensen
On Jan  9 11:40, Beata Michalska wrote:
> Hi Klaus,
> 
> On Thu, 19 Dec 2019 at 13:09, Klaus Jensen  wrote:
> > +static NvmeAIO *nvme_aio_new(BlockBackend *blk, int64_t offset, size_t len,
> > +QEMUSGList *qsg, QEMUIOVector *iov, NvmeRequest *req,
> > +NvmeAIOCompletionFunc *cb)
> 
> Minor: The indentation here (and in a few other places across the patchset)
> does not seem right . And maybe inline ?

I tried to follow the style in CODING_STYLE.rst for "Multiline Indent",
but how the style is for function definition is a bit underspecified.

I can change it to align with the opening paranthesis. I just found the
"one indent" more readable for these long function definitions.

> Also : seems that there are cases when some of the parameters are
> not required (NULL) , maybe having a simplified version for those cases
> might be useful ?
> 

True. Actually - at this point in the series there are no users of the
NvmeAIOCompletionFunc. It is preparatory for other patches I have in the
pipeline. But I'll clean it up.

> > +static void nvme_aio_cb(void *opaque, int ret)
> > +{
> > +NvmeAIO *aio = opaque;
> > +NvmeRequest *req = aio->req;
> > +
> > +BlockBackend *blk = aio->blk;
> > +BlockAcctCookie *acct = &aio->acct;
> > +BlockAcctStats *stats = blk_get_stats(blk);
> > +
> > +Error *local_err = NULL;
> > +
> > +trace_nvme_dev_aio_cb(nvme_cid(req), aio, blk_name(blk), aio->offset,
> > +nvme_aio_opc_str(aio), req);
> > +
> > +if (req) {
> > +QTAILQ_REMOVE(&req->aio_tailq, aio, tailq_entry);
> > +}
> > +
> >  if (!ret) {
> > -block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
> > -req->status = NVME_SUCCESS;
> > +block_acct_done(stats, acct);
> > +
> > +if (aio->cb) {
> > +aio->cb(aio, aio->cb_arg);
> 
> We are dropping setting status to SUCCESS here,
> is that expected ?

Yes, that is on purpose. nvme_aio_cb is called for *each* issued AIO and
we do not want to overwrite a previously set error status with a success
(if one aio in the request fails even though others succeed, it should
not go unnoticed). Note that NVME_SUCCESS is the default setting in the
request, so if no one sets an error code we are still good.
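
So the (simplified) shape of the completion path is something like the
following; the exact error code is just a stand-in:

    if (!ret) {
        /*
         * note: no req->status = NVME_SUCCESS here; that could hide an
         * error already recorded by another aio of the same request
         */
        block_acct_done(stats, acct);
    } else {
        block_acct_failed(stats, acct);
        if (req) {
            req->status = NVME_INTERNAL_DEV_ERROR;
        }
    }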

> Also the aio callback will not get
> called case failure and it probably should ?
> 

I tried both but ended up with just not calling it on failure. But I
think that in the future some AIO callbacks might want to take a
different action if the request failed, so I'll add it back in and add
the aio return value (ret) to the callback function definition.


Thanks,
Klaus


Re: [PATCH v4 19/24] nvme: handle dma errors

2020-01-13 Thread Klaus Birkelund Jensen
On Jan  9 11:35, Beata Michalska wrote:
> Hi Klaus,
> 

Hi Beata,

Your reviews are, as always, much appreciated! Thanks!

> On Thu, 19 Dec 2019 at 13:09, Klaus Jensen  wrote:
> > @@ -1595,7 +1611,12 @@ static void nvme_process_sq(void *opaque)
> >
> >  while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
> >  addr = sq->dma_addr + sq->head * n->sqe_size;
> > -nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd));
> > +if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
> > +trace_nvme_dev_err_addr_read(addr);
> > +timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> > +100 * SCALE_MS);
> > +break;
> > +}
> 
> Is there a chance we will end up repeatedly triggering the read error here
> as this will come back to the same memory location each time (the sq->head
> is not moving here) ?
> 

Absolutely, and that was the point. Not being able to read the
submission queue is pretty bad, so the device just keeps retrying every
100 ms. This is the same for when writing to the completion queue fails.

But... It would probably be prudent to track how long it has been since
a successful DMA transfer was done and timeout, shutting down the
device. Say maybe after 60 seconds. I'll try to add something like that.
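
Roughly along these lines (just a sketch; the last_dma_ok_ms field is
made up and the shutdown helper is only indicative):

    if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
        trace_nvme_dev_err_addr_read(addr);

        if (qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) - n->last_dma_ok_ms >
                60 * 1000) {
            /* nothing has gone through for a minute; give up */
            nvme_clear_ctrl(n);
            break;
        }

        timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
            100 * SCALE_MS);
        break;
    }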


Thanks,
Klaus


Re: [PATCH v4 22/24] nvme: bump controller pci device id

2019-12-19 Thread Klaus Birkelund Jensen
On Dec 20 02:46, Keith Busch wrote:
> On Thu, Dec 19, 2019 at 06:24:57PM +0100, Klaus Birkelund Jensen wrote:
> > On Dec 20 01:16, Keith Busch wrote:
> > > On Thu, Dec 19, 2019 at 02:09:19PM +0100, Klaus Jensen wrote:
> > > > @@ -2480,7 +2480,7 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice 
> > > > *pci_dev)
> > > >  pci_conf[PCI_INTERRUPT_PIN] = 1;
> > > >  pci_config_set_prog_interface(pci_conf, 0x2);
> > > >  pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> > > > -pci_config_set_device_id(pci_conf, 0x5845);
> > > > +pci_config_set_device_id(pci_conf, 0x5846);
> > > >  pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
> > > >  pcie_endpoint_cap_init(pci_dev, 0x80);
> > > 
> > > We can't just pick a number here, these are supposed to be assigned by the
> > > vendor. A day will come when I will be in trouble for using the existing
> > > identifier: I found out to late it was supposed to be for internal use
> > > only as it was never officially reserved, so lets not make the same
> > > mistake for some future device.
> > > 
> > 
> > Makes sense. And there is no "QEMU" vendor, is there?
> 
> I'm not sure if we can use this, but there is a PCI_VENDOR_ID_QEMU,
> 0x1234, defined in include/hw/pci/pci.h.
> 

Maybe it's possible to use PCI_VENDOR_ID_REDHAT?



Re: [PATCH v4 21/24] nvme: support multiple namespaces

2019-12-19 Thread Klaus Birkelund Jensen
On Dec 19 16:11, Michal Prívozník wrote:
> On 12/19/19 2:09 PM, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> 
> Klaus, just to make sure I understand correctly, this implements
> multiple namespaces for *emulated* NVMe, right? I'm asking because I
> just merged libvirt patches to support:
> 
> -drive
> file.driver=nvme,file.device=:01:00.0,file.namespace=1,format=raw,if=none,id=drive-virtio-disk0
> -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
> 
> and seeing these patches made me doubt my design. But if your patches
> touch emulated NVMe only, then libvirt's fine because it doesn't expose
> that just yet.
> 
> Michal
> 

Hi Michal,

Yes, this is only for the emulated nvme controller.



Re: [PATCH v4 22/24] nvme: bump controller pci device id

2019-12-19 Thread Klaus Birkelund Jensen
On Dec 20 01:16, Keith Busch wrote:
> On Thu, Dec 19, 2019 at 02:09:19PM +0100, Klaus Jensen wrote:
> > @@ -2480,7 +2480,7 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice 
> > *pci_dev)
> >  pci_conf[PCI_INTERRUPT_PIN] = 1;
> >  pci_config_set_prog_interface(pci_conf, 0x2);
> >  pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> > -pci_config_set_device_id(pci_conf, 0x5845);
> > +pci_config_set_device_id(pci_conf, 0x5846);
> >  pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
> >  pcie_endpoint_cap_init(pci_dev, 0x80);
> 
> We can't just pick a number here, these are supposed to be assigned by the
> vendor. A day will come when I will be in trouble for using the existing
> identifier: I found out to late it was supposed to be for internal use
> only as it was never officially reserved, so lets not make the same
> mistake for some future device.
> 

Makes sense. And there is no "QEMU" vendor, is there? But it would be
really nice to get rid of the quirks.



Re: [PATCH v2 15/20] nvme: add support for scatter gather lists

2019-11-26 Thread Klaus Birkelund
On Mon, Nov 25, 2019 at 02:10:37PM +, Beata Michalska wrote:
> On Mon, 25 Nov 2019 at 06:21, Klaus Birkelund  wrote:
> >
> > On Tue, Nov 12, 2019 at 03:25:18PM +, Beata Michalska wrote:
> > > Hi Klaus,
> > >
> > > On Tue, 15 Oct 2019 at 11:57, Klaus Jensen  wrote:
> > > >
> > > > +#define NVME_CMD_FLAGS_FUSE(flags) (flags & 0x3)
> > > > +#define NVME_CMD_FLAGS_PSDT(flags) ((flags >> 6) & 0x3)
> > > Minor: This one is slightly misleading - as per the naming and it's usage:
> > > the PSDT is a field name and as such does not imply using SGLs
> > > and it is being used to verify if given command is actually using
> > > SGLs.
> > >
> >
> > Ah, is this because I do
> >
> >   if (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
> >
> > in the code? That is, just checks for it not being zero? The value of
> > the PRP or SGL for Data Transfer (PSDT) field *does* specify if the
> > command uses SGLs or not. 0x0: PRPs, 0x1 SGL for data, 0x10: SGLs for
> > both data and metadata. Would you prefer the condition was more
> > explicit?
> >
> Yeah, it is just not obvious( at least to me)  without referencing the spec
> that non-zero value implies SGL usage. Guess a comment would be helpful
> but that is not major.
> 
 
Nah. That's a good point. I have changed it to use a switch on the value.
This technically also fixes a bug because the above would accept 0x3 as
a valid value and interpret it as SGL use.
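
That is, roughly (an illustrative sketch; it assumes prp1, prp2, sgl and
len have already been pulled out of the command):

    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
    case 0x0:
        /* PRP1/PRP2 */
        return nvme_map_prp(n, qsg, prp1, prp2, len, req);
    case 0x1: /* SGL for data */
    case 0x2: /* SGLs for both data and metadata */
        return nvme_map_sgl(n, qsg, sgl, len, req);
    default:
        /* 0x3 is reserved and is now rejected explicitly */
        return NVME_INVALID_FIELD | NVME_DNR;
    }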


Klaus



Re: [PATCH v2 12/20] nvme: bump supported specification version to 1.3

2019-11-26 Thread Klaus Birkelund
On Mon, Nov 25, 2019 at 12:13:15PM +, Beata Michalska wrote:
> On Mon, 18 Nov 2019 at 09:48, Klaus Birkelund  wrote:
> >
> > On Tue, Nov 12, 2019 at 03:05:06PM +, Beata Michalska wrote:
> > > Hi Klaus,
> > >
> > > On Tue, 15 Oct 2019 at 11:52, Klaus Jensen  wrote:
> > > >
> > > > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > > > +{
> > > > +static const int len = 4096;
> > > > +
> > > > +struct ns_descr {
> > > > +uint8_t nidt;
> > > > +uint8_t nidl;
> > > > +uint8_t rsvd2[2];
> > > > +uint8_t nid[16];
> > > > +};
> > > > +
> > > > +uint32_t nsid = le32_to_cpu(c->nsid);
> > > > +uint64_t prp1 = le64_to_cpu(c->prp1);
> > > > +uint64_t prp2 = le64_to_cpu(c->prp2);
> > > > +
> > > > +struct ns_descr *list;
> > > > +uint16_t ret;
> > > > +
> > > > +trace_nvme_identify_ns_descr_list(nsid);
> > > > +
> > > > +if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > > > +trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
> > > > +return NVME_INVALID_NSID | NVME_DNR;
> > > > +}
> > > > +
> > > In theory this should abort the command for inactive NSIDs as well.
> > > But I guess this will come later on.
> > >
> >
> > At this point in the series, the device does not support multiple
> > namespaces anyway and num_namespaces is always 1. But this has also been
> > reported seperately in relation the patch adding multiple namespaces and
> > is fixed in v3.
> >
> > > > +list = g_malloc0(len);
> > > > +list->nidt = 0x3;
> > > > +list->nidl = 0x10;
> > > > +*(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> > > > +
> > > Might be worth to add some comment here -> as per the NGUID/EUI64 format.
> > > Also those are not specified currently in the namespace identity data 
> > > structure.
> > >
> >
> > I'll add a comment for why the Namespace UUID is set to this value here.
> > The NGUID/EUI64 fields are not set in the namespace identity data
> > structure as they are not required. See the descriptions of NGUID and
> > EUI64. Here for NGUID:
> >
> > "The controller shall specify a globally unique namespace identifier
> > in this field, the EUI64 field, or a Namespace UUID in the Namespace
> > Identification Descriptor..."
> >
> > Here, I chose to provide it in the Namespace Identification Descriptor
> > (by setting `list->nidt = 0x3`).
> >
> > > > +ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
> > > > +g_free(list);
> > > > +return ret;
> > > > +}
> > > > +
> > > >  static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> > > >  {
> > > >  NvmeIdentify *c = (NvmeIdentify *)cmd;
> > > > @@ -934,7 +978,9 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd 
> > > > *cmd)
> > > >  case 0x01:
> > > >  return nvme_identify_ctrl(n, c);
> > > >  case 0x02:
> > > > -return nvme_identify_nslist(n, c);
> > > > +return nvme_identify_ns_list(n, c);
> > > > +case 0x03:
> > > > +return nvme_identify_ns_descr_list(n, cmd);
> > > >  default:
> > > >  trace_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
> > > >  return NVME_INVALID_FIELD | NVME_DNR;
> > > > @@ -1101,6 +1147,14 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > > > NvmeCmd *cmd, NvmeRequest *req)
> > > >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> > > >  break;
> > > >  case NVME_NUMBER_OF_QUEUES:
> > > > +if (n->qs_created > 2) {
> > > > +return NVME_CMD_SEQ_ERROR | NVME_DNR;
> > > > +}
> > > > +
> > > I am not sure this is entirely correct as the spec says:
> > > "if any I/O Submission and/or Completion Queues (...)"
> > > so it might be enough to have a single queue created
> > > for this command to be valid.
> > > Also I think that the condition here is to make sure that the number
> > > of queues requested is bei

Re: [PATCH v2 15/20] nvme: add support for scatter gather lists

2019-11-24 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:25:18PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:57, Klaus Jensen  wrote:
> > +static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
> > +NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
> > +{
> > +const int MAX_NSGLD = 256;
> > +
> > +NvmeSglDescriptor segment[MAX_NSGLD];
> > +uint64_t nsgld;
> > +uint16_t status;
> > +bool sgl_in_cmb = false;
> > +hwaddr addr = le64_to_cpu(sgl.addr);
> > +
> > +trace_nvme_map_sgl(req->cid, NVME_SGL_TYPE(sgl.type), req->nlb, len);
> > +
> > +pci_dma_sglist_init(qsg, &n->parent_obj, 1);
> > +
> > +/*
> > + * If the entire transfer can be described with a single data block it 
> > can
> > + * be mapped directly.
> > + */
> > +if (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_DATA_BLOCK) {
> > +status = nvme_map_sgl_data(n, qsg, &sgl, 1, &len, req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +goto out;
> > +}
> > +
> > +/*
> > + * If the segment is located in the CMB, the submission queue of the
> > + * request must also reside there.
> > + */
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
> > +return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +}
> > +
> > +sgl_in_cmb = true;
> > +}
> > +
> > +while (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_SEGMENT) {
> > +bool addr_is_cmb;
> > +
> > +nsgld = le64_to_cpu(sgl.len) / sizeof(NvmeSglDescriptor);
> > +
> > +/* read the segment in chunks of 256 descriptors (4k) */
> > +while (nsgld > MAX_NSGLD) {
> > +nvme_addr_read(n, addr, segment, sizeof(segment));
> Is there any chance this will go outside the CMB?
> 

Yes, there certainly was a chance of that. This has been fixed in a
general way for both nvme_map_sgl and nvme_map_sgl_data.

> > +
> > +status = nvme_map_sgl_data(n, qsg, segment, MAX_NSGLD, &len, 
> > req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +nsgld -= MAX_NSGLD;
> > +addr += MAX_NSGLD * sizeof(NvmeSglDescriptor);
> > +}
> > +
> > +nvme_addr_read(n, addr, segment, nsgld * 
> > sizeof(NvmeSglDescriptor));
> > +
> > +sgl = segment[nsgld - 1];
> > +addr = le64_to_cpu(sgl.addr);
> > +
> > +/* an SGL is allowed to end with a Data Block in a regular Segment 
> > */
> > +if (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_DATA_BLOCK) {
> > +status = nvme_map_sgl_data(n, qsg, segment, nsgld, &len, req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +goto out;
> > +}
> > +
> > +/* do not map last descriptor */
> > +status = nvme_map_sgl_data(n, qsg, segment, nsgld - 1, &len, req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +/*
> > + * If the next segment is in the CMB, make sure that the sgl was
> > + * already located there.
> > + */
> > +addr_is_cmb = nvme_addr_is_cmb(n, addr);
> > +if ((sgl_in_cmb && !addr_is_cmb) || (!sgl_in_cmb && addr_is_cmb)) {
> > +status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +goto unmap;
> > +}
> > +}
> > +
> > +/*
> > + * If the segment did not end with a Data Block or a Segment 
> > descriptor, it
> > + * must be a Last Segment descriptor.
> > + */
> > +if (NVME_SGL_TYPE(sgl.type) != SGL_DESCR_TYPE_LAST_SEGMENT) {
> > +trace_nvme_err_invalid_sgl_descriptor(req->cid,
> > +NVME_SGL_TYPE(sgl.type));
> > +return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
> Shouldn't we handle a case here that requires calling unmap ?

Woops. Fixed.

> > +static uint16_t nvme_dma_read_sgl(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > +NvmeSglDescriptor sgl, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +QEMUSGList qsg;
> > +uint16_t err = NVME_SUCCESS;
> > +
> Very minor: Mixing convention: status vs error
> 

Fixed by proxy in another refactor.

> >
> > +#define NVME_CMD_FLAGS_FUSE(flags) (flags & 0x3)
> > +#define NVME_CMD_FLAGS_PSDT(flags) ((flags >> 6) & 0x3)
> Minor: This one is slightly misleading - as per the naming and it's usage:
> the PSDT is a field name and as such does not imply using SGLs
> and it is being used to verify if given command is actually using
> SGLs.
> 

Ah, is this because I do

  if (NVME_CMD_FLAGS_PSDT(cmd->flags)) {

in the code? That is, just checks for it not being zero? The value of
the PRP or SGL for Data Transfer (PSDT) field *does* specify if the
command uses SGLs or not. 0x0: PRPs, 0x1 SGL for data, 0x10: SGLs for
both data and metadata. Would you prefer the condition was more
explicit?


Thanks!
Klaus



Re: [PATCH v2 14/20] nvme: allow multiple aios per command

2019-11-21 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:25:06PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:55, Klaus Jensen  wrote:
> > @@ -341,19 +344,18 @@ static uint16_t nvme_dma_write_prp(NvmeCtrl *n, 
> > uint8_t *ptr, uint32_t len,
> Any reason why the nvme_dma_write_prp is missing the changes applied
> to nvme_dma_read_prp ?
> 

This was addressed by proxy through changes to the previous patch
(by combining the read/write functions).

> > +case NVME_AIO_OPC_WRITE_ZEROES:
> > +block_acct_start(stats, acct, aio->iov.size, BLOCK_ACCT_WRITE);
> > +aio->aiocb = blk_aio_pwrite_zeroes(aio->blk, aio->offset,
> > +aio->iov.size, BDRV_REQ_MAY_UNMAP, nvme_aio_cb, aio);
> Minor: aio->blk  => blk
> 

Thanks. Fixed this in a couple of other places as well.

> > @@ -621,8 +880,11 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
> >  sq = n->sq[qid];
> >  while (!QTAILQ_EMPTY(&sq->out_req_list)) {
> >  req = QTAILQ_FIRST(&sq->out_req_list);
> > -assert(req->aiocb);
> > -blk_aio_cancel(req->aiocb);
> > +while (!QTAILQ_EMPTY(&req->aio_tailq)) {
> > +aio = QTAILQ_FIRST(&req->aio_tailq);
> > +assert(aio->aiocb);
> > +blk_aio_cancel(aio->aiocb);
> What about releasing memory associated with given aio ?

I believe the callback is still called when cancelled? That should take
care of it. Or have I misunderstood that? At least for the DMAAIOCBs it
is.

> > +struct NvmeAIO {
> > +NvmeRequest *req;
> > +
> > +NvmeAIOOp   opc;
> > +int64_t offset;
> > +BlockBackend*blk;
> > +BlockAIOCB  *aiocb;
> > +BlockAcctCookie acct;
> > +
> > +NvmeAIOCompletionFunc *cb;
> > +void  *cb_arg;
> > +
> > +QEMUSGList   *qsg;
> > +QEMUIOVector iov;
> 
> There is a bit of inconsistency on the ownership of IOVs and SGLs.
> SGLs now seem to be owned by request whereas IOVs by the aio.
> WOuld be good to have that unified or documented at least.
> 

Fixed this. The NvmeAIO only holds pointers now.

> > +#define NVME_REQ_TRANSFER_DMA  0x1
> This one does not seem to be used 
> 

I have dropped the flags and reverted to a simple req->is_cmb as that is
all that is really needed.




Re: [PATCH v2 13/20] nvme: refactor prp mapping

2019-11-20 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:23:43PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:57, Klaus Jensen  wrote:
> >
> > Instead of handling both QSGs and IOVs in multiple places, simply use
> > QSGs everywhere by assuming that the request does not involve the
> > controller memory buffer (CMB). If the request is found to involve the
> > CMB, convert the QSG to an IOV and issue the I/O. The QSG is converted
> > to an IOV by the dma helpers anyway, so the CMB path is not unfairly
> > affected by this simplifying change.
> >
> 
> Out of curiosity, in how many cases the SG list will have to
> be converted to IOV ? Does that justify creating the SG list in vain ?
> 

You got me wondering. Only using QSGs does not really remove much
complexity, so I readded the direct use of IOVs for the CMB path. There
is no harm in that.

> > +static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
> > +uint64_t prp2, uint32_t len, NvmeRequest *req)
> >  {
> >  hwaddr trans_len = n->page_size - (prp1 % n->page_size);
> >  trans_len = MIN(len, trans_len);
> >  int num_prps = (len >> n->page_bits) + 1;
> > +uint16_t status = NVME_SUCCESS;
> > +bool prp_list_in_cmb = false;
> > +
> > +trace_nvme_map_prp(req->cid, req->cmd.opcode, trans_len, len, prp1, 
> > prp2,
> > +num_prps);
> >
> >  if (unlikely(!prp1)) {
> >  trace_nvme_err_invalid_prp();
> >  return NVME_INVALID_FIELD | NVME_DNR;
> > -} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
> > -   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
> > -qsg->nsg = 0;
> > -qemu_iovec_init(iov, num_prps);
> > -qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], 
> > trans_len);
> > -} else {
> > -pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
> > -qemu_sglist_add(qsg, prp1, trans_len);
> >  }
> > +
> > +if (nvme_addr_is_cmb(n, prp1)) {
> > +req->is_cmb = true;
> > +}
> > +
> This seems to be used here and within read/write functions which are calling
> this one. Maybe there is a nicer way to track that instead of passing
> the request
> from multiple places ?
> 

Hmm. Whether or not the command reads/writes from the CMB is really only
something you can determine by looking at the PRPs (which is done in
nvme_map_prp), so I think this is the right way to track it. Or do you
have something else in mind?

> > +pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
> > +qemu_sglist_add(qsg, prp1, trans_len);
> > +
> >  len -= trans_len;
> >  if (len) {
> >  if (unlikely(!prp2)) {
> >  trace_nvme_err_invalid_prp2_missing();
> > +status = NVME_INVALID_FIELD | NVME_DNR;
> >  goto unmap;
> >  }
> > +
> >  if (len > n->page_size) {
> >  uint64_t prp_list[n->max_prp_ents];
> >  uint32_t nents, prp_trans;
> >  int i = 0;
> >
> > +if (nvme_addr_is_cmb(n, prp2)) {
> > +prp_list_in_cmb = true;
> > +}
> > +
> >  nents = (len + n->page_size - 1) >> n->page_bits;
> >  prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > -nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
> > +nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> >  while (len != 0) {
> > +bool addr_is_cmb;
> >  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
> >
> >  if (i == n->max_prp_ents - 1 && len > n->page_size) {
> >  if (unlikely(!prp_ent || prp_ent & (n->page_size - 
> > 1))) {
> >  trace_nvme_err_invalid_prplist_ent(prp_ent);
> > +status = NVME_INVALID_FIELD | NVME_DNR;
> > +goto unmap;
> > +}
> > +
> > +addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
> > +if ((prp_list_in_cmb && !addr_is_cmb) ||
> > +(!prp_list_in_cmb && addr_is_cmb)) {
> 
> Minor: Same condition (based on different vars) is being used in
> multiple places. Might be worth to move it outside and just pass in
> the needed values.
> 

I'm really not sure what I was smoking when writing those conditions.
It's just `var != nvme_addr_is_cmb(n, prp_ent)`. I fixed that. No need
to pull it out I think.

> >  static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > -   uint64_t prp1, uint64_t prp2)
> > +uint64_t prp1, uint64_t prp2, NvmeRequest *req)
> >  {
> >  QEMUSGList qsg;
> > -QEMUIOVector iov;
> >  uint16_t status = NVME_SUCCESS;
> >
> > -if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> > -return NVME_INVALID_FIELD | NVME_DNR;
> > +status = nvme_map_prp(n, &qsg, prp1, prp2, len, req);
> > +if (status) {
> > +return status;
> >  }
> > -

Re: [PATCH v2 08/20] nvme: add support for the get log page command

2019-11-19 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:04:52PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> 
> On Tue, 15 Oct 2019 at 11:45, Klaus Jensen  wrote:
> > +if (!nsid || (nsid != 0xffffffff && nsid > n->num_namespaces)) {
> > +trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
> > +return NVME_INVALID_NSID | NVME_DNR;
> > +}
> > +
> The LAP '0' bit is cleared now - which means there is no support
> for per-namespace data. So in theory, if that was the aim, this condition
> should check for the values different than 0x0 and 0x and either
> abort the command or treat that as request for controller specific data.
> 

This is fixed in v3 (that is, it just checks for values different from
0x0 and 0xffffffff).
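
Roughly like this (sketch; the exact status code used in v3 may differ):

    uint32_t nsid = le32_to_cpu(cmd->nsid);

    /* no namespace specific log pages are implemented, so only accept
     * nsid 0x0 or the broadcast value 0xffffffff */
    if (nsid && nsid != 0xffffffff) {
        return NVME_INVALID_NSID | NVME_DNR;
    }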

> > +switch (lid) {
> > +case NVME_LOG_ERROR_INFO:
> > +return nvme_error_info(n, cmd, len, off, req);
> > +case NVME_LOG_SMART_INFO:
> > +return nvme_smart_info(n, cmd, len, off, req);
> > +case NVME_LOG_FW_SLOT_INFO:
> > +return nvme_fw_log_info(n, cmd, len, off, req);
> > +default:
> > +trace_nvme_err_invalid_log_page(req->cid, lid);
> > +return NVME_INVALID_LOG_ID | NVME_DNR;
> 
> The spec mentions the Invalid Field in Command case when processing a
> command with an unsupported log id.
> 

Thanks. Fixed!

> > +}
> > +}
> > +
> >  static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
> >  {
> >  n->cq[cq->cqid] = NULL;
> > @@ -812,6 +944,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  uint32_t result;
> >
> >  switch (dw10) {
> > +case NVME_TEMPERATURE_THRESHOLD:
> > +result = cpu_to_le32(n->features.temp_thresh);
> > +break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  result = blk_enable_write_cache(n->conf.blk);
> >  trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
> > @@ -856,6 +991,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >
> >  switch (dw10) {
> > +case NVME_TEMPERATURE_THRESHOLD:
> > +n->features.temp_thresh = dw11;
> > +break;
> > +
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >  break;
> > @@ -884,6 +1023,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  return nvme_del_sq(n, cmd);
> >  case NVME_ADM_CMD_CREATE_SQ:
> >  return nvme_create_sq(n, cmd);
> > +case NVME_ADM_CMD_GET_LOG_PAGE:
> > +return nvme_get_log(n, cmd, req);
> >  case NVME_ADM_CMD_DELETE_CQ:
> >  return nvme_del_cq(n, cmd);
> >  case NVME_ADM_CMD_CREATE_CQ:
> > @@ -923,6 +1064,7 @@ static void nvme_process_sq(void *opaque)
> >  QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
> >  memset(>cqe, 0, sizeof(req->cqe));
> >  req->cqe.cid = cmd.cid;
> > +req->cid = le16_to_cpu(cmd.cid);
> 
> If I haven't missed anything this is being used only in one place
> for tracing - is it really worth to duplicate the cid here ?
> 

At this point in the series, yes - it is only used once. But it will be
used extensively for tracing in the later patches.

> > -id->lpa = 1 << 0;
> > +id->lpa = 1 << 2;
> 
> This sets the bit that states support for GLP command but clears the one
> that states support for per-namespace SMART/Health data - is that expected ?
> 

Yes, clearing the bit for per-namespace SMART/Health log page
information is intentional. There is no namespace specific information
defined in the namespace so the global and per-namespace log page
contains the same information.



Re: [PATCH v2 09/20] nvme: add support for the asynchronous event request command

2019-11-19 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:04:59PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:49, Klaus Jensen  wrote:
> > @@ -1188,6 +1326,9 @@ static int nvme_start_ctrl(NvmeCtrl *n)
> >
> >  nvme_set_timestamp(n, 0ULL);
> >
> > +n->aer_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_aers, n);
> > +QTAILQ_INIT(&n->aer_queue);
> > +
> 
> Is the timer really needed here ? The CEQ can be posted either when requested
> by host through AER, if there are any pending events, or once the
> event is triggered
> and there are active AER's.
> 

I guess you are right. I mostly cribbed this from Keith's tree, but I
see no reason to keep the timer.

Keith, do you have any comments on this?

> > @@ -1380,6 +1521,13 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr 
> > addr, int val)
> > "completion queue doorbell write"
> > " for nonexistent queue,"
> > " sqid=%"PRIu32", ignoring", qid);
> > +
> > +if (n->outstanding_aers) {
> > +nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
> > +NVME_AER_INFO_ERR_INVALID_DB_REGISTER,
> > +NVME_LOG_ERROR_INFO);
> > +}
> > +
> This one (as well as cases below) might not be entirely right
> according to the spec. If given event is enabled for asynchronous
> reporting the controller should retain that even. In this case, the event
> will be ignored as there is no pending request.
> 

I understand these notifications to be special cases (i.e. they cannot
be enabled/disabled through the Asynchronous Event Configuration
feature). See Section 4.1 of NVM Express 1.2.1. The spec specifically
says that "... and an Asynchronous Event Request command is outstanding,
...".

> > @@ -1591,6 +1759,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  id->ver = cpu_to_le32(0x00010201);
> >  id->oacs = cpu_to_le16(0);
> >  id->acl = 3;
> > +id->aerl = n->params.aerl;
> 
> What about the configuration for the asynchronous events ?
> 

It will default to an AEC vector of 0 (everything disabled).


K



Re: [PATCH v2 12/20] nvme: bump supported specification version to 1.3

2019-11-18 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:05:06PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:52, Klaus Jensen  wrote:
> >
> > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > +{
> > +static const int len = 4096;
> > +
> > +struct ns_descr {
> > +uint8_t nidt;
> > +uint8_t nidl;
> > +uint8_t rsvd2[2];
> > +uint8_t nid[16];
> > +};
> > +
> > +uint32_t nsid = le32_to_cpu(c->nsid);
> > +uint64_t prp1 = le64_to_cpu(c->prp1);
> > +uint64_t prp2 = le64_to_cpu(c->prp2);
> > +
> > +struct ns_descr *list;
> > +uint16_t ret;
> > +
> > +trace_nvme_identify_ns_descr_list(nsid);
> > +
> > +if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > +trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
> > +return NVME_INVALID_NSID | NVME_DNR;
> > +}
> > +
> In theory this should abort the command for inactive NSIDs as well.
> But I guess this will come later on.
> 

At this point in the series, the device does not support multiple
namespaces anyway and num_namespaces is always 1. But this has also been
reported separately in relation to the patch adding multiple namespaces and
is fixed in v3.

> > +list = g_malloc0(len);
> > +list->nidt = 0x3;
> > +list->nidl = 0x10;
> > +*(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> > +
> Might be worth to add some comment here -> as per the NGUID/EUI64 format.
> Also those are not specified currently in the namespace identity data 
> structure.
> 

I'll add a comment for why the Namespace UUID is set to this value here.
The NGUID/EUI64 fields are not set in the namespace identity data
structure as they are not required. See the descriptions of NGUID and
EUI64. Here for NGUID:

"The controller shall specify a globally unique namespace identifier
in this field, the EUI64 field, or a Namespace UUID in the Namespace
Identification Descriptor..."

Here, I chose to provide it in the Namespace Identification Descriptor
(by setting `list->nidt = 0x3`).
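
Something along these lines (sketch of the commented code):

    /*
     * NIDT 0x3 is the Namespace UUID descriptor (NIDL 0x10, i.e. 16
     * bytes). Because neither EUI64 nor NGUID is set in the Identify
     * Namespace data structure, this descriptor provides the required
     * globally unique identifier; the NSID is simply embedded in the
     * last four bytes.
     */
    list->nidt = 0x3;
    list->nidl = 0x10;
    *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);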

> > +ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
> > +g_free(list);
> > +return ret;
> > +}
> > +
> >  static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> >  {
> >  NvmeIdentify *c = (NvmeIdentify *)cmd;
> > @@ -934,7 +978,9 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> >  case 0x01:
> >  return nvme_identify_ctrl(n, c);
> >  case 0x02:
> > -return nvme_identify_nslist(n, c);
> > +return nvme_identify_ns_list(n, c);
> > +case 0x03:
> > +return nvme_identify_ns_descr_list(n, cmd);
> >  default:
> >  trace_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
> >  return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1101,6 +1147,14 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >  break;
> >  case NVME_NUMBER_OF_QUEUES:
> > +if (n->qs_created > 2) {
> > +return NVME_CMD_SEQ_ERROR | NVME_DNR;
> > +}
> > +
> I am not sure this is entirely correct as the spec says:
> "if any I/O Submission and/or Completion Queues (...)"
> so it might be enough to have a single queue created
> for this command to be valid.
> Also I think that the condition here is to make sure that the number
> of queues requested is being set once at init phase. Currently this will
> allow the setting to happen if there is no active queue -> so at any
> point of time (provided the condition mentioned). I might be wrong here
> but it seems that what we need is a single status saying any queue
> has been created prior to the Set Feature command at all
> 

Internally, the admin queue pair is counted in qs_created, which is the
reason for checking if it is above 2. The admin queues are created when the
controller is enabled (mmio write to the EN register in CC).

I'll add a comment about that - I see why it is unclear.
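
Roughly what I have in mind (sketch):

    case NVME_NUMBER_OF_QUEUES:
        /*
         * The admin submission and completion queues are included in
         * qs_created, so a value above 2 means that I/O queues have
         * already been created and the feature can no longer be
         * changed.
         */
        if (n->qs_created > 2) {
            return NVME_CMD_SEQ_ERROR | NVME_DNR;
        }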

> 
> Small note: this patch seems to be introducing more changes
> than specified in the commit message and especially the subject. Might
> be worth to extend it a bit.
> 

You are right. I'll split it up.



Re: [PATCH v2 06/20] nvme: add support for the abort command

2019-11-18 Thread Klaus Birkelund
On Fri, Nov 15, 2019 at 11:56:00AM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Wed, 13 Nov 2019 at 06:12, Klaus Birkelund  wrote:
> >
> > On Tue, Nov 12, 2019 at 03:04:38PM +, Beata Michalska wrote:
> > > Hi Klaus
> > >
> >
> > Hi Beata,
> >
> > Thank you very much for your thorough reviews! I'll start going through
> > them one by one :) You might have seen that I've posted a v3, but I will
> > make sure to consolidate between v2 and v3!
> >
> > > On Tue, 15 Oct 2019 at 11:41, Klaus Jensen  wrote:
> > > >
> > > > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > > > Section 5.1 ("Abort command").
> > > >
> > > > The Abort command is a best effort command; for now, the device always
> > > > fails to abort the given command.
> > > >
> > > > Signed-off-by: Klaus Jensen 
> > > > ---
> > > >  hw/block/nvme.c | 16 
> > > >  1 file changed, 16 insertions(+)
> > > >
> > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > > index daa2367b0863..84e4f2ea7a15 100644
> > > > --- a/hw/block/nvme.c
> > > > +++ b/hw/block/nvme.c
> > > > @@ -741,6 +741,18 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd 
> > > > *cmd)
> > > >  }
> > > >  }
> > > >
> > > > +static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > > +{
> > > > +uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
> > > > +
> > > > +req->cqe.result = 1;
> > > > +if (nvme_check_sqid(n, sqid)) {
> > > > +return NVME_INVALID_FIELD | NVME_DNR;
> > > > +}
> > > > +
> > > Shouldn't we validate the CID as well ?
> > >
> >
> > According to the specification it is "implementation specific if/when a
> > controller chooses to complete the command when the command to abort is
> > not found".
> >
> > I'm interpreting this to mean that, yes, an invalid command identifier
> > could be given in the command, but this implementation does not care
> > about that.
> >
> > I still think the controller should check the validity of the submission
> > queue identifier though. It is a general invariant that the sqid should
> > be valid.
> >
> > > > +return NVME_SUCCESS;
> > > > +}
> > > > +
> > > >  static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
> > > >  {
> > > >  trace_nvme_setfeat_timestamp(ts);
> > > > @@ -859,6 +871,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > > > NvmeCmd *cmd, NvmeRequest *req)
> > > >  trace_nvme_err_invalid_setfeat(dw10);
> > > >  return NVME_INVALID_FIELD | NVME_DNR;
> > > >  }
> > > > +
> > > >  return NVME_SUCCESS;
> > > >  }
> > > >
> > > > @@ -875,6 +888,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd 
> > > > *cmd, NvmeRequest *req)
> > > >  return nvme_create_cq(n, cmd);
> > > >  case NVME_ADM_CMD_IDENTIFY:
> > > >  return nvme_identify(n, cmd);
> > > > +case NVME_ADM_CMD_ABORT:
> > > > +return nvme_abort(n, cmd, req);
> > > >  case NVME_ADM_CMD_SET_FEATURES:
> > > >  return nvme_set_feature(n, cmd, req);
> > > >  case NVME_ADM_CMD_GET_FEATURES:
> > > > @@ -1388,6 +1403,7 @@ static void nvme_realize(PCIDevice *pci_dev, 
> > > > Error **errp)
> > > >  id->ieee[2] = 0xb3;
> > > >  id->ver = cpu_to_le32(0x00010201);
> > > >  id->oacs = cpu_to_le16(0);
> > > > +id->acl = 3;
> > > So we are setting the max number of concurrent commands
> > > but there is no logic to enforce that and wrap up with the
> > > status suggested by specification.
> > >
> >
> > That is true, but because the controller always completes the Abort
> > command immediately this cannot happen. If the controller did try to
> > abort executing commands, the Abort command would need to linger in the
> > controller state until a completion queue entry is posted for the
> > command to be aborted before the completion queue entry can be posted
> > for the Abort command. This takes up resources in the controller and is
> > the reason for the Abort Command Limit.
> >
> > You could argue that we should set ACL to 0 then, but the specification
> > recommends a value of 3 and I do not see any harm in conveying a
> > "reasonable", though inconsequential, value.
> 
> Could we potentially add some comment describing the above?
> 

Yes, absolutely! :)
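
Something like this (just a sketch of the comment):

    /*
     * The Abort command is best effort and this implementation always
     * completes it immediately without aborting anything, so the Abort
     * Command Limit can never be exceeded. The value 3 is simply the
     * recommendation from the specification.
     */
    id->acl = 3;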


Klaus



Re: [PATCH v2 19/20] nvme: make lba data size configurable

2019-11-12 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:24:00PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:50, Klaus Jensen  wrote:
> >  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> > -DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > +DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> > +DEFINE_PROP_UINT8("lbads", _state, _props.lbads, 9)
> >
> Could we actually use BDRV_SECTOR_BITS instead of magic numbers?
> 
 
Yes, better. Fixed in two places.
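
That is, roughly (sketch of the two spots):

    DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)

and, for example, where the identify data structure is initialized:

    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;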



Re: [PATCH v2 04/20] nvme: populate the mandatory subnqn and ver fields

2019-11-12 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:04:45PM +, Beata Michalska wrote:
> Hi Klaus
> 
> On Tue, 15 Oct 2019 at 11:42, Klaus Jensen  wrote:
> > +n->bar.vs = 0x00010201;
> 
> Very minor:
> 
> The version number is being set twice in the patch series already.
> And it is being set in two places.
> It might be worth to make a #define out of it so that only one
> needs to be changed.
> 

I think you are right. I'll do that.
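
Something like this (sketch; the macro name is just a placeholder):

    #define NVME_SPEC_VER 0x00010201

    id->ver = cpu_to_le32(NVME_SPEC_VER);
    n->bar.vs = NVME_SPEC_VER;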



Re: [PATCH v2 06/20] nvme: add support for the abort command

2019-11-12 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:04:38PM +, Beata Michalska wrote:
> Hi Klaus
> 

Hi Beata,

Thank you very much for your thorough reviews! I'll start going through
them one by one :) You might have seen that I've posted a v3, but I will
make sure to consolidate between v2 and v3!

> On Tue, 15 Oct 2019 at 11:41, Klaus Jensen  wrote:
> >
> > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > Section 5.1 ("Abort command").
> >
> > The Abort command is a best effort command; for now, the device always
> > fails to abort the given command.
> >
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 16 
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index daa2367b0863..84e4f2ea7a15 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -741,6 +741,18 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd 
> > *cmd)
> >  }
> >  }
> >
> > +static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
> > +
> > +req->cqe.result = 1;
> > +if (nvme_check_sqid(n, sqid)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> Shouldn't we validate the CID as well ?
> 

According to the specification it is "implementation specific if/when a
controller chooses to complete the command when the command to abort is
not found".

I'm interpreting this to mean that, yes, an invalid command identifier
could be given in the command, but this implementation does not care
about that.

I still think the controller should check the validity of the submission
queue identifier though. It is a general invariant that the sqid should
be valid.

> > +return NVME_SUCCESS;
> > +}
> > +
> >  static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
> >  {
> >  trace_nvme_setfeat_timestamp(ts);
> > @@ -859,6 +871,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  trace_nvme_err_invalid_setfeat(dw10);
> >  return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> > +
> >  return NVME_SUCCESS;
> >  }
> >
> > @@ -875,6 +888,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  return nvme_create_cq(n, cmd);
> >  case NVME_ADM_CMD_IDENTIFY:
> >  return nvme_identify(n, cmd);
> > +case NVME_ADM_CMD_ABORT:
> > +return nvme_abort(n, cmd, req);
> >  case NVME_ADM_CMD_SET_FEATURES:
> >  return nvme_set_feature(n, cmd, req);
> >  case NVME_ADM_CMD_GET_FEATURES:
> > @@ -1388,6 +1403,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >  id->ieee[2] = 0xb3;
> >  id->ver = cpu_to_le32(0x00010201);
> >  id->oacs = cpu_to_le16(0);
> > +id->acl = 3;
> So we are setting the max number of concurrent commands
> but there is no logic to enforce that and wrap up with the
> status suggested by specification.
> 

That is true, but because the controller always completes the Abort
command immediately this cannot happen. If the controller did try to
abort executing commands, the Abort command would need to linger in the
controller state until a completion queue entry is posted for the
command to be aborted before the completion queue entry can be posted
for the Abort command. This takes up resources in the controller and is
the reason for the Abort Command Limit.

You could argue that we should set ACL to 0 then, but the specification
recommends a value of 3 and I do not see any harm in conveying a
"reasonable", though inconsequential, value.



Re: [PATCH 1/1] pci: pass along the return value of dma_memory_rw

2019-11-11 Thread Klaus Birkelund
On Mon, Nov 11, 2019 at 05:16:41AM -0500, Michael S. Tsirkin wrote:
> On Mon, Nov 11, 2019 at 10:30:07AM +0100, Klaus Birkelund wrote:
> > On Thu, Oct 24, 2019 at 01:13:36AM +0200, Philippe Mathieu-Daudé wrote:
> > > On 10/11/19 9:01 AM, Klaus Jensen wrote:
> > > > Some might actually care about the return value of dma_memory_rw. So
> > > > let us pass it along instead of ignoring it.
> > > > 
> > > > Signed-off-by: Klaus Jensen 
> > > > ---
> > > >   include/hw/pci/pci.h | 3 +--
> > > >   1 file changed, 1 insertion(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > > > index f3f0ffd5fb78..4e95bb847857 100644
> > > > --- a/include/hw/pci/pci.h
> > > > +++ b/include/hw/pci/pci.h
> > > > @@ -779,8 +779,7 @@ static inline AddressSpace 
> > > > *pci_get_address_space(PCIDevice *dev)
> > > >   static inline int pci_dma_rw(PCIDevice *dev, dma_addr_t addr,
> > > >void *buf, dma_addr_t len, DMADirection 
> > > > dir)
> > > >   {
> > > > -dma_memory_rw(pci_get_address_space(dev), addr, buf, len, dir);
> > > > -return 0;
> > > > +return dma_memory_rw(pci_get_address_space(dev), addr, buf, len, 
> > > > dir);
> > > >   }
> > > >   static inline int pci_dma_read(PCIDevice *dev, dma_addr_t addr,
> > > > 
> > > 
> > > Reviewed-by: Philippe Mathieu-Daudé 
> > 
> > Gentle ping on this.
> > 
> > This fix is required for the nvme device to start passing some of the
> > nasty tests from blktests that flips bus mastering while doing I/O.
> > 
> > 
> > Cheers,
> > Klaus
> 
> So I looked and it does not seem like anyone at all checks the
> return value. While this makes the patch safe, how come it
> helps anyone at all?

I have a series[1] under review for the nvme device (I should have
mentioned that). There is a certain test (block/011) from blktests[2],
that disables the PCI device while doing I/O by flipping the bus master
register. QEMU correctly understands this which causes the dma_memory_rw
calls to fail while the device is not a bus master. For the NVMe device
to pass that test, it needs to know that it could not do the DMA,
otherwise it will just think that a completion queue entry was
successfully posted or data was correctly read.
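
As a sketch of how the nvme device then reacts (not the final patch; it
assumes the nvme_addr_is_cmb helper from the series and an int-returning
nvme_addr_read):

    static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
    {
        if (nvme_addr_is_cmb(n, addr)) {
            memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
            return 0;
        }

        /* with the pci.h fix this reports the failure when bus mastering
         * is disabled instead of silently returning 0 */
        return pci_dma_read(&n->parent_obj, addr, buf, size);
    }
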

> Maybe this is just infrastructure to allow checks in the
> future, in this case do we need this for the freeze?
> Or can it wait until after the release?
> 

The series I'm mentioning won't be going into the release, so yeah - it
can surely wait. I was just pinging to make sure someone would pick it
up at some point :)
 
[1]: https://patchwork.kernel.org/cover/11190045/
[2]: https://github.com/osandov/blktests



Re: [PATCH 1/1] pci: pass along the return value of dma_memory_rw

2019-11-11 Thread Klaus Birkelund
On Thu, Oct 24, 2019 at 01:13:36AM +0200, Philippe Mathieu-Daudé wrote:
> On 10/11/19 9:01 AM, Klaus Jensen wrote:
> > Some might actually care about the return value of dma_memory_rw. So
> > let us pass it along instead of ignoring it.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   include/hw/pci/pci.h | 3 +--
> >   1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index f3f0ffd5fb78..4e95bb847857 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -779,8 +779,7 @@ static inline AddressSpace 
> > *pci_get_address_space(PCIDevice *dev)
> >   static inline int pci_dma_rw(PCIDevice *dev, dma_addr_t addr,
> >void *buf, dma_addr_t len, DMADirection dir)
> >   {
> > -dma_memory_rw(pci_get_address_space(dev), addr, buf, len, dir);
> > -return 0;
> > +return dma_memory_rw(pci_get_address_space(dev), addr, buf, len, dir);
> >   }
> >   static inline int pci_dma_read(PCIDevice *dev, dma_addr_t addr,
> > 
> 
> Reviewed-by: Philippe Mathieu-Daudé 

Gentle ping on this.

This fix is required for the nvme device to start passing some of the
nasty tests from blktests that flips bus mastering while doing I/O.


Cheers,
Klaus



Re: [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-11-04 Thread Klaus Birkelund
On Mon, Nov 04, 2019 at 08:46:29AM +, Ross Lagerwall wrote:
> On 8/23/19 9:10 AM, Klaus Birkelund wrote:
> > On Thu, Aug 22, 2019 at 02:18:05PM +0100, Ross Lagerwall wrote:
> >> On 7/5/19 8:23 AM, Klaus Birkelund Jensen wrote:
> >>
> >> I tried this patch series by installing Windows with a single NVME
> >> controller having two namespaces. QEMU crashed in get_feature /
> >> NVME_VOLATILE_WRITE_CACHE because req->ns was NULL.
> >>
> > 
> > Hi Ross,
> > 
> > Good catch!
> > 
> >> nvme_get_feature / nvme_set_feature look wrong to me since I can't see how
> >> req->ns would have been set. Should they have similar code to nvme_io_cmd 
> >> to
> >> set req->ns from cmd->nsid?
> > 
> > Definitely. I will fix that for v2.
> > 
> >>
> >> After working around this issue everything else seemed to be working well.
> >> Thanks for your work on this patch series.
> >>
> > 
> > And thank you for trying out my patches!
> > 
> 
> One more thing... it doesn't handle inactive namespaces properly so if you
> have two namespaces with e.g. nsid=1 and nsid=3 QEMU ends up crashing in
> certain situations. The patch below adds support for inactive namespaces.
> 
> Still hoping to see a v2 some day :-)
> 
 
Hi Ross,

v2[1] is actually out, but only CC'ed Paul. Sorry about that! It fixes
the support for discontiguous nsid's, but does not handle inactive
namespaces correctly in identify.

I'll incorporate that in a v3 along with a couple of other fixes I did.

Thanks!


  [1]: https://patchwork.kernel.org/cover/11190045/



Re: [PATCH v2 00/20] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-10-28 Thread Klaus Birkelund
On Tue, Oct 15, 2019 at 12:38:40PM +0200, Klaus Jensen wrote:
> Hi,
> 
> (Quick note to Fam): most of this series is irrelevant to you as the
> maintainer of the nvme block driver, but patch "nvme: add support for
> scatter gather lists" touches block/nvme.c due to changes in the shared
> NvmeCmd struct.
> 
> Anyway, v2 comes with a good bunch of changes. Compared to v1[1], I have
> squashed some commits in the beginning of the series and heavily
> refactored "nvme: support multiple block requests per request" into the
> new commit "nvme: allow multiple aios per command".
> 
> I have also removed the original implementation of the Abort command
> (commit "nvme: add support for the abort command") as it is currently
> too tricky to test reliably. It has been replaced by a stub that,
> besides a trivial sanity check, just fails to abort the given command.
> *Some* implementation of the Abort command is mandatory, but given the
> "best effort" nature of the command this is acceptable for now. When the
> device gains support for arbitration it should be less tricky to test.
> 
> The support for multiple namespaces is now backwards compatible. The
> nvme device still accepts a 'drive' parameter, but for multiple
> namespaces the use of 'nvme-ns' devices are required. I also integrated
> some feedback from Paul so the device supports non-consecutive namespace
> ids.
> 
> I have also added some new commits at the end:
> 
>   - "nvme: bump controller pci device id" makes sure the Linux kernel
> doesn't apply any quirks to the controller that it no longer has.
>   - "nvme: handle dma errors" won't actually do anything before this[2]
> fix to include/hw/pci/pci.h is merged. With these two patches added,
> the device reliably passes some additional nasty tests from blktests
> (block/011 "disable PCI device while doing I/O" and block/019 "break
> PCI link device while doing I/O"). Before this patch, block/011
> would pass from time to time if you were lucky, but would at least
> mess up the controller pretty badly, causing a reset in the best
> case.
> 
> 
>   [1]: https://patchwork.kernel.org/project/qemu-devel/list/?series=142383
>   [2]: https://patchwork.kernel.org/patch/11184911/
> 
> 
> Klaus Jensen (20):
>   nvme: remove superfluous breaks
>   nvme: move device parameters to separate struct
>   nvme: add missing fields in the identify controller data structure
>   nvme: populate the mandatory subnqn and ver fields
>   nvme: allow completion queues in the cmb
>   nvme: add support for the abort command
>   nvme: refactor device realization
>   nvme: add support for the get log page command
>   nvme: add support for the asynchronous event request command
>   nvme: add logging to error information log page
>   nvme: add missing mandatory features
>   nvme: bump supported specification version to 1.3
>   nvme: refactor prp mapping
>   nvme: allow multiple aios per command
>   nvme: add support for scatter gather lists
>   nvme: support multiple namespaces
>   nvme: bump controller pci device id
>   nvme: remove redundant NvmeCmd pointer parameter
>   nvme: make lba data size configurable
>   nvme: handle dma errors
> 
>  block/nvme.c   |   18 +-
>  hw/block/Makefile.objs |2 +-
>  hw/block/nvme-ns.c |  139 +++
>  hw/block/nvme-ns.h |   60 ++
>  hw/block/nvme.c| 1863 +---
>  hw/block/nvme.h|  219 -
>  hw/block/trace-events  |   37 +-
>  include/block/nvme.h   |  132 ++-
>  8 files changed, 2094 insertions(+), 376 deletions(-)
>  create mode 100644 hw/block/nvme-ns.c
>  create mode 100644 hw/block/nvme-ns.h
> 
> -- 
> 2.23.0
> 

Gentle ping on this.

I'm aware that this is a lot to go through, but I would like to know if
anyone has had a chance to look at it?


https://patchwork.kernel.org/project/qemu-devel/list/?series=187637




Re: [PATCH] nvme: fix NSSRS offset in CAP register

2019-10-23 Thread Klaus Birkelund
On Wed, Oct 23, 2019 at 11:26:57AM -0400, John Snow wrote:
> 
> 
> On 10/23/19 3:33 AM, Klaus Jensen wrote:
> > Fix the offset of the NSSRS field the CAP register.
> 
> From NVME 1.4, section 3 ("Controller Registers"), subsection 3.1.1
> ("Offset 0h: CAP – Controller Capabilities") CAP_NSSRS_SHIFT is bit 36,
> not 33.
> 
> > 
> > Signed-off-by: Klaus Jensen 
> > Reported-by: Javier Gonzalez 
> > ---
> >  include/block/nvme.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 3ec8efcc435e..fa15b51c33bb 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -23,7 +23,7 @@ enum NvmeCapShift {
> >  CAP_AMS_SHIFT  = 17,
> >  CAP_TO_SHIFT   = 24,
> >  CAP_DSTRD_SHIFT= 32,
> > -CAP_NSSRS_SHIFT= 33,
> > +CAP_NSSRS_SHIFT= 36,
> >  CAP_CSS_SHIFT  = 37,
> >  CAP_MPSMIN_SHIFT   = 48,
> >  CAP_MPSMAX_SHIFT   = 52,
> > 
> 
> I like updating commit messages with spec references; if it can be
> updated that would be nice.
> 
> Regardless:
> 
> Reviewed-by: John Snow 
> 

Sounds good. Can the committer squash that in?


Cheers,
Klaus



Re: [PATCH 0/1] pci: pass along the return value of dma_memory_rw

2019-10-23 Thread Klaus Birkelund
On Fri, Oct 11, 2019 at 09:01:40AM +0200, Klaus Jensen wrote:
> Hi,
> 
> While working on fixing the emulated nvme device to pass more tests in
> the blktests suite, I discovered that the pci_dma_rw function ignores
> the return value of dma_memory_rw.
> 
> The nvme device needs to handle DMA errors gracefully in order to pass
> the block/011 test ("disable PCI device while doing I/O") in the
> blktests suite. This is only possible if the device knows if the DMA
> transfer was successful or not.
> 
> I can't see what the reason for ignoring the return value would be. But
> if there is a good reason, please enlighten me :)
> 
> 
> Klaus Jensen (1):
>   pci: pass along the return value of dma_memory_rw
> 
>  include/hw/pci/pci.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> -- 
> 2.23.0
> 

Gentle ping.

https://patchwork.kernel.org/patch/11184911/



Re: [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-08-23 Thread Klaus Birkelund
On Thu, Aug 22, 2019 at 02:18:05PM +0100, Ross Lagerwall wrote:
> On 7/5/19 8:23 AM, Klaus Birkelund Jensen wrote:
> 
> I tried this patch series by installing Windows with a single NVME
> controller having two namespaces. QEMU crashed in get_feature /
> NVME_VOLATILE_WRITE_CACHE because req->ns was NULL.
> 

Hi Ross,

Good catch!

> nvme_get_feature / nvme_set_feature look wrong to me since I can't see how
> req->ns would have been set. Should they have similar code to nvme_io_cmd to
> set req->ns from cmd->nsid?

Definitely. I will fix that for v2.

> 
> After working around this issue everything else seemed to be working well.
> Thanks for your work on this patch series.
> 

And thank you for trying out my patches!


Cheers,
Klaus



Re: [Qemu-devel] [Qemu-block] [RFC, v1] Namespace Management Support

2019-07-09 Thread Klaus Birkelund
On Mon, Jul 08, 2019 at 03:52:29PM -0700, Matt Fitzpatrick wrote:
> Hey Klaus,
> 
> Sorry for the late reply!  I finally found this message amid the pile of
> emails Qemu dumped on me.
> 
> I don't know what the right answer is here... NVMe is designed in a way
> where you *do* "carve up" the flash into logical groupings and the nvme
> firmware decides on how that's done. Those logical groupings can be attached
> to different controllers(which we don't have here yet?) after init, but
> that's a problem for future us I guess? But that's all stuff you already
> know.
> 

Yeah, I haven't started worrying about that ;)

> The "nvme-nvm" solution might be the right approach, but I'm a bit hesitant
> on the idea of growing tnvmcap...
> 
> I can't think of any way to create namespaces on the fly and not have it use
> some single existing block backend, unless we defined a range of block
> images on qemu start and namespace create/attach only uses one image up to
> and including it's max size per namespace? That might work, and I think
> that's what you suggested (or at least is similar to), though it could be
> pretty wasteful. It wouldn't offer a "true" namespace management support,
> but could be close enough.
> 

Having an emulated device that supports namespace management would be
very useful for testing software, but yeah, I have a hard time seeing
how we can make that fit with the current "QEMU model".

> I'm in the middle of going through the patch you posted. Nice job!  I'm glad
> to see more people adding enhancements. It was pretty stale for years.
> 

Thanks for looking at it, I know it's a lot to go through ;)

> -Matt
> On 7/5/19 12:50 AM, Klaus Birkelund wrote:
> > On Tue, Jul 02, 2019 at 10:39:36AM -0700, Matt Fitzpatrick wrote:
> > > Adding namespace management support to the nvme device. Namespace creation
> > > requires contiguous block space for a simple method of allocation.
> > > 
> > > I wrote this a few years ago based on Keith's fork and nvmeqemu fork and
> > > have recently re-synced with the latest trunk.  Some data structures in
> > > nvme.h are a bit more filled out than strictly necessary as this is also 
> > > the
> > > base for sr-iov and IOD patches to be submitted later.
> > > 
> > Hi Matt,
> > 
> > Nice! I'm always happy when new features for the nvme device are posted!
> > 
> > I'll be happy to review it, but I won't start going through it in
> > detail because I believe the approach to supporting multiple namespaces
> > is flawed. We had a recent discussion on this and I also got some
> > unrelated patches rejected due to implementing it similarly by carving
> > up the image.
> > 
> > I have posted a long series that includes a patch for multiple
> > namespaces. It is implemented by introducing a fresh `nvme-ns` device
> > model that represents a namespace and attaches to a bus created by the
> > parent `nvme` controller device.
> > 
> > The core issue is that a qemu image /should/ be attachable to other
> > devices (say ide) and not strictly tied to the one device model. Thus,
> > we cannot just shove a bunch of namespaces into a single image.
> > 
> > But, in light of your patch, I'm not convinced that my implementation is
> > the correct solution. Maybe the abstraction should not be an `nvme-ns`
> > device, but a `nvme-nvm` device that when attached changes TNVMCAP and
> > UNVMCAP? Maybe you have some input for this? Or we could have both and
> > dynamically create the nvme-ns devices on top of nvme-nvm devices. I
> > think it would still require a 1-to-1 mapping, but it could be a way to
> > support the namespace management capability.
> > 
> > 
> > Cheers,
> > Klaus
> > 
> 

Hi Kevin,

This highlights another situation where the "1 image to 1 block device"
model doesn't fit that well with NVMe. Especially with the introduction
of "NVM Sets" in NVMe 1.4. It would be very nice to introduce a
'nvme-nvmset' device model that adds an NVM Set which the controller can
then create namespaces in.

Is it completely unacceptable for a device to use the image in such a
way that it would not make sense (aka present the same block device)
when attached to another device (ide, ...)?

I really have a hard time seeing how we could support these features
without violating the '1 image to 1 block device" model.


Cheers,
Klaus



Re: [Qemu-devel] [Qemu-block] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Klaus Birkelund
On Fri, Jul 05, 2019 at 02:49:29PM +0100, Daniel P. Berrangé wrote:
> On Fri, Jul 05, 2019 at 03:36:17PM +0200, Klaus Birkelund wrote:
> > On Fri, Jul 05, 2019 at 09:23:33AM +0200, Klaus Birkelund Jensen wrote:
> > > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > > device model. The nvme device creates a bus named from the device name
> > > ('id'). The nvme-ns devices then connect to this and registers
> > > themselves with the nvme device.
> > > 
> > > This changes how an nvme device is created. Example with two namespaces:
> > > 
> > >   -drive file=nvme0n1.img,if=none,id=disk1
> > >   -drive file=nvme0n2.img,if=none,id=disk2
> > >   -device nvme,serial=deadbeef,id=nvme0
> > >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> > >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > > 
> > > A maximum of 256 namespaces can be configured.
> > > 
> >  
> > Well that was embarrassing.
> > 
> > This patch breaks nvme-test.c. Which I obviously did not run.
> > 
> > In my defense, the test doesn't do much currently, but I'll of course
> > fix the test for v2.
> 
> That highlights a more serious problem.  This series changes the syntax
> for configuring the nvme device in a way that is not backwards compatible.
> So anyone who is using QEMU with NVME will be broken when they upgrade
> to the next QEMU release.
> 
> I understand why you wanted to restructure things to have a separate
> nvme-ns device, but there needs to be some backcompat support in there
> for the existing syntax to avoid breaking current users IMHO.
> 
 
Hi Daniel,

I raised this issue previously. I suggested that we keep the drive
property for the nvme device and only accept either that or an nvme-ns
device to be configured (but not both).

That would keep backward compatibility, but enforce the use of nvme-ns
for any setup that requires multiple namespaces.
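
As a sketch, the check could live where an nvme-ns device registers with
the controller (exact member names aside):

    static int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns,
        Error **errp)
    {
        /* the legacy 'drive' property and nvme-ns devices are mutually
         * exclusive */
        if (n->conf.blk) {
            error_setg(errp, "nvme: use either the 'drive' property or "
                "nvme-ns devices, not both");
            return 1;
        }

        ...
    }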

Would that work?

Cheers,
Klaus



Re: [Qemu-devel] [Qemu-block] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Klaus Birkelund
On Fri, Jul 05, 2019 at 09:23:33AM +0200, Klaus Birkelund Jensen wrote:
> This adds support for multiple namespaces by introducing a new 'nvme-ns'
> device model. The nvme device creates a bus named from the device name
> ('id'). The nvme-ns devices then connect to this and register
> themselves with the nvme device.
> 
> This changes how an nvme device is created. Example with two namespaces:
> 
>   -drive file=nvme0n1.img,if=none,id=disk1
>   -drive file=nvme0n2.img,if=none,id=disk2
>   -device nvme,serial=deadbeef,id=nvme0
>   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
>   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> 
> A maximum of 256 namespaces can be configured.
> 
 
Well that was embarrassing.

This patch breaks nvme-test.c. Which I obviously did not run.

In my defense, the test doesn't do much currently, but I'll of course
fix the test for v2.



Re: [Qemu-devel] [Qemu-block] [RFC, v1] Namespace Management Support

2019-07-05 Thread Klaus Birkelund
On Tue, Jul 02, 2019 at 10:39:36AM -0700, Matt Fitzpatrick wrote:
> Adding namespace management support to the nvme device. Namespace creation
> requires contiguous block space for a simple method of allocation.
> 
> I wrote this a few years ago based on Keith's fork and nvmeqemu fork and
> have recently re-synced with the latest trunk.  Some data structures in
> nvme.h are a bit more filled out than strictly necessary as this is also the
> base for sr-iov and IOD patches to be submitted later.
> 

Hi Matt,

Nice! I'm always happy when new features for the nvme device are posted!

I'll be happy to review it, but I won't start going through it in
detail because I believe the approach to supporting multiple namespaces
is flawed. We had a recent discussion on this and I also got some
unrelated patches rejected due to implementing it similarly by carving
up the image.

I have posted a long series that includes a patch for multiple
namespaces. It is implemented by introducing a fresh `nvme-ns` device
model that represents a namespace and attaches to a bus created by the
parent `nvme` controller device.

The core issue is that a qemu image /should/ be attachable to other
devices (say ide) and not strictly tied to the one device model. Thus,
we cannot just shove a bunch of namespaces into a single image.

But, in light of your patch, I'm not convinced that my implementation is
the correct solution. Maybe the abstraction should not be an `nvme-ns`
device, but a `nvme-nvm` device that when attached changes TNVMCAP and
UNVMCAP? Maybe you have some input for this? Or we could have both and
dynamically create the nvme-ns devices on top of nvme-nvm devices. I
think it would still require a 1-to-1 mapping, but it could be a way to
support the namespace management capability.


Cheers,
Klaus



[Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Klaus Birkelund Jensen
This adds support for multiple namespaces by introducing a new 'nvme-ns'
device model. The nvme device creates a bus named from the device name
('id'). The nvme-ns devices then connect to this and register
themselves with the nvme device.

This changes how an nvme device is created. Example with two namespaces:

  -drive file=nvme0n1.img,if=none,id=disk1
  -drive file=nvme0n2.img,if=none,id=disk2
  -device nvme,serial=deadbeef,id=nvme0
  -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
  -device nvme-ns,drive=disk2,bus=nvme0,nsid=2

A maximum of 256 namespaces can be configured.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/Makefile.objs |   2 +-
 hw/block/nvme-ns.c | 139 +
 hw/block/nvme-ns.h |  35 +
 hw/block/nvme.c| 169 -
 hw/block/nvme.h|  29 ---
 hw/block/trace-events  |   1 +
 6 files changed, 255 insertions(+), 120 deletions(-)
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
index f5f643f0cc06..d44a2f4b780d 100644
--- a/hw/block/Makefile.objs
+++ b/hw/block/Makefile.objs
@@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
 common-obj-$(CONFIG_XEN) += xen-block.o
 common-obj-$(CONFIG_ECC) += ecc.o
 common-obj-$(CONFIG_ONENAND) += onenand.o
-common-obj-$(CONFIG_NVME_PCI) += nvme.o
+common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
 
 obj-$(CONFIG_SH4) += tc58128.o
 
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
new file mode 100644
index ..11b594467991
--- /dev/null
+++ b/hw/block/nvme-ns.c
@@ -0,0 +1,139 @@
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
+#include "hw/block/block.h"
+#include "hw/pci/msix.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+
+#include "hw/qdev-core.h"
+
+#include "nvme.h"
+#include "nvme-ns.h"
+
+static uint64_t nvme_ns_calc_blks(NvmeNamespace *ns)
+{
+return ns->size / nvme_ns_lbads_bytes(ns);
+}
+
+static void nvme_ns_init_identify(NvmeIdNs *id_ns)
+{
+id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+}
+
+static int nvme_ns_init(NvmeNamespace *ns)
+{
+uint64_t ns_blks;
+NvmeIdNs *id_ns = &ns->id_ns;
+
+nvme_ns_init_identify(id_ns);
+
+ns_blks = nvme_ns_calc_blks(ns);
+id_ns->nuse = id_ns->ncap = id_ns->nsze = cpu_to_le64(ns_blks);
+
+return 0;
+}
+
+static int nvme_ns_init_blk(NvmeNamespace *ns, NvmeIdCtrl *id, Error **errp)
+{
+blkconf_blocksizes(&ns->conf);
+
+if (!blkconf_apply_backend_options(&ns->conf,
+blk_is_read_only(ns->conf.blk), false, errp)) {
+return 1;
+}
+
+ns->size = blk_getlength(ns->conf.blk);
+if (ns->size < 0) {
+error_setg_errno(errp, -ns->size, "blk_getlength");
+return 1;
+}
+
+if (!blk_enable_write_cache(ns->conf.blk)) {
+id->vwc = 0;
+}
+
+return 0;
+}
+
+static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
+{
+if (!ns->conf.blk) {
+error_setg(errp, "nvme-ns: block backend not configured");
+return 1;
+}
+
+return 0;
+}
+
+
+static void nvme_ns_realize(DeviceState *dev, Error **errp)
+{
+NvmeNamespace *ns = NVME_NS(dev);
+BusState *s = qdev_get_parent_bus(dev);
+NvmeCtrl *n = NVME(s->parent);
+Error *local_err = NULL;
+
+if (nvme_ns_check_constraints(ns, &local_err)) {
+error_propagate_prepend(errp, local_err,
+"nvme_ns_check_constraints: ");
+return;
+}
+
+if (nvme_ns_init_blk(ns, &n->id_ctrl, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_ns_init_blk: ");
+return;
+}
+
+nvme_ns_init(ns);
+if (nvme_register_namespace(n, ns, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_register_namespace: ");
+return;
+}
+}
+
+static Property nvme_ns_props[] = {
+DEFINE_BLOCK_PROPERTIES(NvmeNamespace, conf),
+DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
+DEFINE_PROP_END_OF_LIST(),
+};
+
+static void nvme_ns_class_init(ObjectClass *oc, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(oc);
+
+set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
+
+dc->bus_type = TYPE_NVME_BUS;
+dc->realize = nvme_ns_realize;
+dc->props = nvme_ns_props;
+dc->desc = "virtual nvme namespace";
+}
+
+static void nvme_ns_instance_init(Object *obj)
+{
+NvmeNamespace *ns = NVME_NS(obj);
+char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
+
+device_add_bootindex_property(obj, &ns->conf.bootindex, "bootindex",
+bootindex, DEVICE(obj), &error_abort);
+
+g_free(bootindex

[Qemu-devel] [PATCH 02/16] nvme: move device parameters to separate struct

2019-07-05 Thread Klaus Birkelund Jensen
Move device configuration parameters to separate struct to make it
explicit what is configurable and what is set internally.

Also, clean up some includes.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 54 +++--
 hw/block/nvme.h | 16 ---
 2 files changed, 38 insertions(+), 32 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 28ebaf1368b1..a3f83f3c2135 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -27,18 +27,14 @@
 
 #include "qemu/osdep.h"
 #include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
 #include "hw/block/block.h"
-#include "hw/hw.h"
 #include "hw/pci/msix.h"
-#include "hw/pci/pci.h"
 #include "sysemu/sysemu.h"
-#include "qapi/error.h"
-#include "qapi/visitor.h"
 #include "sysemu/block-backend.h"
+#include "qapi/error.h"
 
-#include "qemu/log.h"
-#include "qemu/module.h"
-#include "qemu/cutils.h"
 #include "trace.h"
 #include "nvme.h"
 
@@ -63,12 +59,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->num_queues && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->num_queues && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -630,7 +626,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_nvme_err_invalid_create_cq_addr(prp1);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
-if (unlikely(vector > n->num_queues)) {
+if (unlikely(vector > n->params.num_queues)) {
 trace_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -782,7 +778,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 
16));
+result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 trace_nvme_getfeat_numq(result);
 break;
 case NVME_TIMESTAMP:
@@ -827,9 +824,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_NUMBER_OF_QUEUES:
 trace_nvme_setfeat_numq((dw11 & 0x) + 1,
 ((dw11 >> 16) & 0x) + 1,
-n->num_queues - 1, n->num_queues - 1);
-req->cqe.result =
-cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+n->params.num_queues - 1,
+n->params.num_queues - 1);
+req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 break;
 
 case NVME_TIMESTAMP:
@@ -903,12 +901,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
 blk_drain(n->conf.blk);
 
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->sq[i] != NULL) {
 nvme_free_sq(n->sq[i], n);
 }
 }
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->cq[i] != NULL) {
 nvme_free_cq(n->cq[i], n);
 }
@@ -1311,7 +1309,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 int64_t bs_size;
 uint8_t *pci_conf;
 
-if (!n->num_queues) {
+if (!n->params.num_queues) {
 error_setg(errp, "num_queues can't be zero");
 return;
 }
@@ -1327,7 +1325,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-if (!n->serial) {
+if (!n->params.serial) {
 error_setg(errp, "serial property not set");
 return;
 }
@@ -1344,24 +1342,24 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 pcie_endpoint_cap_init(pci_dev, 0x80);
 
 n->num_namespaces = 1;
-n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
+n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
-n->sq = g_new0(NvmeSQueue *, n->num_queues);
-n->cq = g_new0(NvmeCQueue *, n->

[Qemu-devel] [PATCH 13/16] nvme: simplify dma/cmb mappings

2019-07-05 Thread Klaus Birkelund Jensen
Instead of handling both QSGs and IOVs in multiple places, simply use
QSGs everywhere by assuming that the request does not involve the
controller memory buffer (CMB). If the request is found to involve the
CMB, convert the QSG to an IOV and issue the I/O. The QSG is converted
to an IOV by the dma helpers anyway, so the CMB path is not unfairly
affected by this simplifying change.

As a side-effect, this patch also allows PRPs to be located in the CMB.
The logic ensures that if some of the PRP is in the CMB, all of it must
be located there.
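
The conversion itself amounts to something like this (sketch; the helper
name is illustrative and the iovec is assumed to be initialized by the
caller):

    static void dma_to_cmb(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov)
    {
        for (int i = 0; i < qsg->nsg; i++) {
            void *addr = &n->cmbuf[qsg->sg[i].base - n->ctrl_mem.addr];
            qemu_iovec_add(iov, addr, qsg->sg[i].len);
        }
    }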

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 277 --
 hw/block/nvme.h   |   3 +-
 hw/block/trace-events |   1 +
 include/block/nvme.h  |   1 +
 4 files changed, 187 insertions(+), 95 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 8ad95fdfa261..02888dbfdbc1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -55,14 +55,21 @@
 
 static void nvme_process_sq(void *opaque);
 
+static inline uint8_t nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
+{
+return n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size));
+}
+
 static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
-if (n->cmbsz && addr >= n->ctrl_mem.addr &&
-addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+if (nvme_addr_is_cmb(n, addr)) {
memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
-} else {
-pci_dma_read(&n->parent_obj, addr, buf, size);
+
+return;
 }
+
+pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
 static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
@@ -151,139 +158,200 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue 
*cq)
 }
 }
 
-static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
- uint64_t prp2, uint32_t len, NvmeCtrl *n)
+static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
+uint64_t prp2, uint32_t len, NvmeRequest *req)
 {
 hwaddr trans_len = n->page_size - (prp1 % n->page_size);
 trans_len = MIN(len, trans_len);
 int num_prps = (len >> n->page_bits) + 1;
+uint16_t status = NVME_SUCCESS;
+bool prp_list_in_cmb = false;
+
+trace_nvme_map_prp(req->cmd.opcode, trans_len, len, prp1, prp2, num_prps);
 
 if (unlikely(!prp1)) {
 trace_nvme_err_invalid_prp();
 return NVME_INVALID_FIELD | NVME_DNR;
-} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
-   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
-qsg->nsg = 0;
-qemu_iovec_init(iov, num_prps);
-qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], 
trans_len);
-} else {
-pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
-qemu_sglist_add(qsg, prp1, trans_len);
 }
+
+if (nvme_addr_is_cmb(n, prp1)) {
+req->is_cmb = true;
+}
+
+pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+qemu_sglist_add(qsg, prp1, trans_len);
+
 len -= trans_len;
 if (len) {
 if (unlikely(!prp2)) {
 trace_nvme_err_invalid_prp2_missing();
+status = NVME_INVALID_FIELD | NVME_DNR;
 goto unmap;
 }
+
 if (len > n->page_size) {
 uint64_t prp_list[n->max_prp_ents];
 uint32_t nents, prp_trans;
 int i = 0;
 
+if (nvme_addr_is_cmb(n, prp2)) {
+prp_list_in_cmb = true;
+}
+
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
+nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
 while (len != 0) {
+bool addr_is_cmb;
 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
 
 if (i == n->max_prp_ents - 1 && len > n->page_size) {
 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
 trace_nvme_err_invalid_prplist_ent(prp_ent);
+status = NVME_INVALID_FIELD | NVME_DNR;
+goto unmap;
+}
+
+addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
+if ((prp_list_in_cmb && !addr_is_cmb) ||
+(!prp_list_in_cmb && addr_is_cmb)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
 goto unmap;
 }
 
 i = 0;
 nents = (len + n->page_size - 1) >> n->page_bits;
   

[Qemu-devel] [PATCH 08/16] nvme: refactor device realization

2019-07-05 Thread Klaus Birkelund Jensen
Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 196 ++--
 hw/block/nvme.h |  11 +++
 2 files changed, 152 insertions(+), 55 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4b9ff51868c0..eb6af6508e2d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -38,6 +38,7 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -1365,66 +1366,105 @@ static const MemoryRegionOps nvme_cmb_ops = {
 },
 };
 
-static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
-NvmeCtrl *n = NVME(pci_dev);
-NvmeIdCtrl *id = &n->id_ctrl;
-NvmeIdNs *id_ns = &n->namespace.id_ns;
-
-int64_t bs_size;
-uint8_t *pci_conf;
-
-if (!n->params.num_queues) {
-error_setg(errp, "num_queues can't be zero");
-return;
-}
+NvmeParams *params = &n->params;
 
 if (!n->conf.blk) {
-error_setg(errp, "drive property not set");
-return;
+error_setg(errp, "nvme: block backend not configured");
+return 1;
 }
 
-bs_size = blk_getlength(n->conf.blk);
-if (bs_size < 0) {
-error_setg(errp, "could not get backing file size");
-return;
+if (!params->serial) {
+error_setg(errp, "nvme: serial not configured");
+return 1;
 }
 
-if (!n->params.serial) {
-error_setg(errp, "serial property not set");
-return;
+if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
+error_setg(errp, "nvme: invalid queue configuration");
+return 1;
 }
+
+return 0;
+}
+
+static int nvme_init_blk(NvmeCtrl *n, Error **errp)
+{
blkconf_blocksizes(&n->conf);
if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
-   false, errp)) {
-return;
+false, errp)) {
+return 1;
 }
 
-pci_conf = pci_dev->config;
-pci_conf[PCI_INTERRUPT_PIN] = 1;
-pci_config_set_prog_interface(pci_dev->config, 0x2);
-pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
-pcie_endpoint_cap_init(pci_dev, 0x80);
+return 0;
+}
 
+static void nvme_init_state(NvmeCtrl *n)
+{
 n->num_namespaces = 1;
 n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
-n->ns_size = bs_size / (uint64_t)n->num_namespaces;
-
 n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
 n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
+}
 
-memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
-  "nvme", n->reg_size);
+static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
+NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
+
+NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
+NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
+
+n->cmbloc = n->bar.cmbloc;
+n->cmbsz = n->bar.cmbsz;
+
+n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
+"nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
+PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
+PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
+}
+
+static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+uint8_t *pci_conf = pci_dev->config;
+
+pci_conf[PCI_INTERRUPT_PIN] = 1;
+pci_config_set_prog_interface(pci_conf, 0x2);
+pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
+pci_config_set_device_id(pci_conf, 0x5845);
+pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+pcie_endpoint_cap_init(pci_dev, 0x80);
+
+memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
+n->reg_size);
 pci_register_bar(pci_dev, 0,
 PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
&n->iomem);
 msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
 
+if (n->params.cmb_size_mb) {
+nvme_init_cmb(n, pci_dev);
+}
+}
+
+static void nvme_init_ctrl(NvmeCtrl *n)
+{
+NvmeIdCtrl *id = &n->id_ctrl;
+NvmeParams *params = &n->params;
+uint8_t *pci_conf = n->parent_obj.config;
+
 id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
  

[Qemu-devel] [PATCH 15/16] nvme: support scatter gather lists

2019-07-05 Thread Klaus Birkelund Jensen
For now, support the Data Block, Segment and Last Segment descriptor
types.

See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").

Signed-off-by: Klaus Birkelund Jensen 
---
 block/nvme.c  |  18 +-
 hw/block/nvme.c   | 390 +++---
 hw/block/nvme.h   |   6 +
 hw/block/trace-events |   3 +
 include/block/nvme.h  |  64 ++-
 5 files changed, 410 insertions(+), 71 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 73ed5fa75f2e..907a610633f2 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -438,7 +438,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
 }
-cmd.prp1 = cpu_to_le64(iova);
+cmd.dptr.prp.prp1 = cpu_to_le64(iova);
 
 if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
 error_setg(errp, "Failed to identify controller");
@@ -512,7 +512,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_CQ,
-.prp1 = cpu_to_le64(q->cq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
 .cdw11 = cpu_to_le32(0x3),
 };
@@ -523,7 +523,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_SQ,
-.prp1 = cpu_to_le64(q->sq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
 .cdw11 = cpu_to_le32(0x1 | (n << 16)),
 };
@@ -858,16 +858,16 @@ try_map:
 case 0:
 abort();
 case 1:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = 0;
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = 0;
 break;
 case 2:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = pagelist[1];
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = pagelist[1];
 break;
 default:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + 
sizeof(uint64_t));
 break;
 }
 trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b285119fd29a..6bf62952dd13 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -273,6 +273,198 @@ unmap:
 return status;
 }
 
+static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor *segment, uint64_t nsgld, uint32_t *len,
+NvmeRequest *req)
+{
+dma_addr_t addr, trans_len;
+
+for (int i = 0; i < nsgld; i++) {
+if (NVME_SGL_TYPE(segment[i].type) != SGL_DESCR_TYPE_DATA_BLOCK) {
+trace_nvme_err_invalid_sgl_descriptor(req->cqe.cid,
+NVME_SGL_TYPE(segment[i].type));
+return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
+}
+
+if (*len == 0) {
+if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
+trace_nvme_err_invalid_sgl_excess_length(req->cqe.cid);
+return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+}
+
+break;
+}
+
+addr = le64_to_cpu(segment[i].addr);
+trans_len = MIN(*len, le64_to_cpu(segment[i].len));
+
+if (nvme_addr_is_cmb(n, addr)) {
+/*
+ * All data and metadata, if any, associated with a particular
+ * command shall be located in either the CMB or host memory. Thus,
+ * if an address is found to be in the CMB and we have already
+ * mapped data that is in host memory, the use is invalid.
+ */
+if (!req->is_cmb && qsg->size) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+
+req->is_cmb = true;
+} else {
+/*
+ * Similarly, if the address does not reference the CMB, but we
+ * have already established that the request has data or metadata
+ * in the CMB, the use is invalid.
+ */
+if (req->is_cmb) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+}
+
+qemu_sglist_add(qsg, addr, trans_len);
+
+*len -= trans_len;
+}
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
+{
+const int MAX_NSGLD = 256;
+
+NvmeSglDescriptor segment[MAX_NSGLD];
+uint64_t nsgld;
+uint16_t status;
+bool sgl_in_cmb = false;
+hwaddr addr = le64_to_cpu(sgl.addr);
+
+trace_nvm

[Qemu-devel] [PATCH 14/16] nvme: support multiple block requests per request

2019-07-05 Thread Klaus Birkelund Jensen
Currently, the device only issues a single block backend request per
NVMe request, but as we move towards supporting metadata (and
discontiguous vector requests supported by OpenChannel 2.0) it will be
required to issue multiple block backend requests per NVMe request.

With this patch the NVMe device is ready for that.
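
A minimal, self-contained sketch of the completion pattern this enables
(names are illustrative, not the patch's NvmeBlockBackendRequest
machinery): a logical request fans out into several backend requests and
only completes when the last one finishes.

#include <stdio.h>

struct logical_req {
    int outstanding;    /* backend requests still in flight */
    int status;
};

static void backend_cb(struct logical_req *req, int ret)
{
    if (ret) {
        req->status = ret;
    }
    if (--req->outstanding == 0) {
        printf("logical request complete, status %d\n", req->status);
    }
}

int main(void)
{
    struct logical_req req = { .outstanding = 3, .status = 0 };

    for (int i = 0; i < 3; i++) {
        backend_cb(&req, 0);    /* pretend three backend I/Os complete */
    }
    return 0;
}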

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 322 --
 hw/block/nvme.h   |  49 +--
 hw/block/trace-events |   3 +
 3 files changed, 290 insertions(+), 84 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 02888dbfdbc1..b285119fd29a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -25,6 +25,8 @@
  *  Default: 64
  *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
  *  Default: 0 (disabled)
+ *   mdts= : Maximum Data Transfer Size (power of two)
+ *  Default: 7
  */
 
 #include "qemu/osdep.h"
@@ -319,10 +321,9 @@ static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
 uint64_t prp1, uint64_t prp2, NvmeRequest *req)
 {
-QEMUSGList qsg;
 uint16_t err = NVME_SUCCESS;
 
-err = nvme_map_prp(n, &qsg, prp1, prp2, len, req);
+err = nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
 if (err) {
 return err;
 }
@@ -330,8 +331,8 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 if (req->is_cmb) {
 QEMUIOVector iov;
 
-qemu_iovec_init(&iov, qsg.nsg);
-dma_to_cmb(n, &qsg, &iov);
+qemu_iovec_init(&iov, req->qsg.nsg);
+dma_to_cmb(n, &req->qsg, &iov);
 
 if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
 trace_nvme_err_invalid_dma();
@@ -343,17 +344,86 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 goto out;
 }
 
-if (unlikely(dma_buf_read(ptr, len, &qsg))) {
+if (unlikely(dma_buf_read(ptr, len, &req->qsg))) {
 trace_nvme_err_invalid_dma();
 err = NVME_INVALID_FIELD | NVME_DNR;
 }
 
 out:
-qemu_sglist_destroy(&qsg);
+qemu_sglist_destroy(&req->qsg);
 
 return err;
 }
 
+static void nvme_blk_req_destroy(NvmeBlockBackendRequest *blk_req)
+{
+if (blk_req->iov.nalloc) {
+qemu_iovec_destroy(&blk_req->iov);
+}
+
+g_free(blk_req);
+}
+
+static void nvme_blk_req_put(NvmeCtrl *n, NvmeBlockBackendRequest *blk_req)
+{
+nvme_blk_req_destroy(blk_req);
+}
+
+static NvmeBlockBackendRequest *nvme_blk_req_get(NvmeCtrl *n, NvmeRequest *req,
+QEMUSGList *qsg)
+{
+NvmeBlockBackendRequest *blk_req = g_malloc0(sizeof(*blk_req));
+
+blk_req->req = req;
+
+if (qsg) {
+blk_req->qsg = qsg;
+}
+
+return blk_req;
+}
+
+static uint16_t nvme_blk_setup(NvmeCtrl *n, NvmeNamespace *ns, QEMUSGList *qsg,
+NvmeRequest *req)
+{
+NvmeBlockBackendRequest *blk_req = nvme_blk_req_get(n, req, qsg);
+if (!blk_req) {
+NVME_GUEST_ERR(nvme_err_internal_dev_error, "nvme_blk_req_get: %s",
+"could not allocate memory");
+return NVME_INTERNAL_DEV_ERROR;
+}
+
+blk_req->slba = req->slba;
+blk_req->nlb = req->nlb;
+blk_req->blk_offset = req->slba * nvme_ns_lbads_bytes(ns);
+
+QTAILQ_INSERT_TAIL(&req->blk_req_tailq, blk_req, tailq_entry);
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+NvmeNamespace *ns = req->ns;
+uint16_t err;
+
+uint32_t len = req->nlb * nvme_ns_lbads_bytes(ns);
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+err = nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
+if (err) {
+return err;
+}
+
+err = nvme_blk_setup(n, ns, &req->qsg, req);
+if (err) {
+return err;
+}
+
+return NVME_SUCCESS;
+}
+
 static void nvme_post_cqes(void *opaque)
 {
 NvmeCQueue *cq = opaque;
@@ -388,6 +458,10 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 {
 assert(cq->cqid == req->sq->cqid);
 
+if (req->qsg.nalloc) {
+qemu_sglist_destroy(&req->qsg);
+}
+
 trace_nvme_enqueue_req_completion(req->cqe.cid, cq->cqid);
 QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
 QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
@@ -471,130 +545,224 @@ static void nvme_process_aers(void *opaque)
 
 static void nvme_rw_cb(void *opaque, int ret)
 {
-NvmeRequest *req = opaque;
+NvmeBlockBackendRequest *blk_req = opaque;
+NvmeRequest *req = blk_req->req;
 NvmeSQueue *sq = req->sq;
 NvmeCtrl *n = sq->ctrl;
 NvmeCQueue *cq = n->cq[sq->cqid];
 
+QTAILQ_REMOVE(&req->blk_req_tailq, blk_req, tailq_entry);
+
+trace_nvme_rw_cb(req->cqe.cid, req->

[Qemu-devel] [PATCH 05/16] nvme: populate the mandatory subnqn and ver fields

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1 or later. See NVM
Express 1.2.1, Section 5.11 ("Identify command"), Figure 90 and Section
7.9 ("NVMe Qualified Names").

This also bumps the supported version to 1.2.1.
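
For reference, 0x00010201 encodes major 1, minor 2, tertiary 1 in the VS
register layout. The subnqn ends up as the standard UUID-based NQN; a
standalone illustration (the buffer and UUID string here are stand-ins,
not the device's fields):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char subnqn[256] = "nqn.2014-08.org.nvmexpress:uuid:";
    const char *uuid = "12345678-1234-1234-1234-123456789abc"; /* stand-in */

    strcat(subnqn, uuid);
    printf("%s\n", subnqn);
    return 0;
}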

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ce2e5365385b..3c392dc336a8 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1364,12 +1364,18 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 id->ieee[0] = 0x00;
 id->ieee[1] = 0x02;
 id->ieee[2] = 0xb3;
+id->ver = cpu_to_le32(0x00010201);
 id->oacs = cpu_to_le16(0);
 id->frmw = 7 << 1;
 id->sqes = (0x6 << 4) | 0x6;
 id->cqes = (0x4 << 4) | 0x4;
 id->nn = cpu_to_le32(n->num_namespaces);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
+
+strcpy((char *) id->subnqn, "nqn.2014-08.org.nvmexpress:uuid:");
+qemu_uuid_unparse(&qemu_uuid,
+(char *) id->subnqn + strlen((char *) id->subnqn));
+
 id->psd[0].mp = cpu_to_le16(0x9c4);
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
@@ -1384,7 +1390,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CAP_SET_CSS(n->bar.cap, 1);
 NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
-n->bar.vs = 0x00010200;
+n->bar.vs = 0x00010201;
 n->bar.intmc = n->bar.intms = 0;
 
 if (n->params.cmb_size_mb) {
-- 
2.20.1




[Qemu-devel] [PATCH 11/16] nvme: add missing mandatory Features

2019-07-05 Thread Klaus Birkelund Jensen
Add support for returning a reasonable response to Get/Set Features for
the mandatory features.
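
One of the added features is Interrupt Vector Configuration; a small
sketch of the per-queue dword the patch builds (my reading of the field
layout: bits 15:0 carry the interrupt vector, bit 16 disables coalescing;
not copied from the patch):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    for (uint32_t iv = 0; iv < 4; iv++) {
        uint32_t ivc = iv | (1u << 16);     /* vector iv, coalescing off */
        printf("vector %u -> 0x%05x\n", iv, ivc);
    }
    return 0;
}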

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 49 ---
 hw/block/trace-events |  2 ++
 include/block/nvme.h  |  3 ++-
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 93f5dff197e0..8259dd7c1d6c 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -860,13 +860,24 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, 
NvmeCmd *cmd)
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
+uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 uint32_t result;
 
+trace_nvme_getfeat(dw10);
+
 switch (dw10) {
+case NVME_ARBITRATION:
+result = cpu_to_le32(n->features.arbitration);
+break;
+case NVME_POWER_MANAGEMENT:
+result = cpu_to_le32(n->features.power_mgmt);
+break;
 case NVME_TEMPERATURE_THRESHOLD:
 result = cpu_to_le32(n->features.temp_thresh);
 break;
 case NVME_ERROR_RECOVERY:
+result = cpu_to_le32(n->features.err_rec);
+break;
 case NVME_VOLATILE_WRITE_CACHE:
 result = blk_enable_write_cache(n->conf.blk);
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -878,6 +889,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
+case NVME_INTERRUPT_COALESCING:
+result = cpu_to_le32(n->features.int_coalescing);
+break;
+case NVME_INTERRUPT_VECTOR_CONF:
+if ((dw11 & 0xffff) > n->params.num_queues) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
+break;
+case NVME_WRITE_ATOMICITY:
+result = cpu_to_le32(n->features.write_atomicity);
+break;
 case NVME_ASYNCHRONOUS_EVENT_CONF:
 result = cpu_to_le32(n->features.async_config);
 break;
@@ -913,6 +937,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
+trace_nvme_setfeat(dw10, dw11);
+
 switch (dw10) {
 case NVME_TEMPERATURE_THRESHOLD:
 n->features.temp_thresh = dw11;
@@ -937,6 +963,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_ASYNCHRONOUS_EVENT_CONF:
 n->features.async_config = dw11;
 break;
+case NVME_ARBITRATION:
+case NVME_POWER_MANAGEMENT:
+case NVME_ERROR_RECOVERY:
+case NVME_INTERRUPT_COALESCING:
+case NVME_INTERRUPT_VECTOR_CONF:
+case NVME_WRITE_ATOMICITY:
+return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
 default:
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -1693,6 +1726,14 @@ static void nvme_init_state(NvmeCtrl *n)
 n->aer_reqs = g_new0(NvmeRequest *, NVME_AERL + 1);
 n->temperature = NVME_TEMPERATURE;
 n->features.temp_thresh = 0x14d;
+n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
+sizeof(*n->features.int_vector_config));
+
+/* disable coalescing (not supported) */
+for (int i = 0; i < n->params.num_queues; i++) {
+n->features.int_vector_config[i] = i | (1 << 16);
+}
+
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -1769,6 +1810,10 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 id->nn = cpu_to_le32(n->num_namespaces);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
 
+if (blk_enable_write_cache(n->conf.blk)) {
+id->vwc = 1;
+}
+
 strcpy((char *) id->subnqn, "nqn.2014-08.org.nvmexpress:uuid:");
 qemu_uuid_unparse(&qemu_uuid,
 (char *) id->subnqn + strlen((char *) id->subnqn));
@@ -1776,9 +1821,6 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 id->psd[0].mp = cpu_to_le16(0x9c4);
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
-if (blk_enable_write_cache(n->conf.blk)) {
-id->vwc = 1;
-}
 
 n->bar.cap = 0;
 NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
@@ -1876,6 +1918,7 @@ static void nvme_exit(PCIDevice *pci_dev)
 g_free(n->sq);
 g_free(n->elpes);
 g_free(n->aer_reqs);
+g_free(n->features.int_vector_config);
 
 if (n->params.cmb_size_mb) {
 g_free(n->cmbuf);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index ed666bbc94f2..17485bb0375b 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -41,6 +41,8 @@ nvme_del_cq(uint16_t cqid) "deleted compl

[Qemu-devel] [PATCH 06/16] nvme: support completion queue in cmb

2019-07-05 Thread Klaus Birkelund Jensen
While not particularly useful for real workloads, allow completion queues
to be placed in the controller memory buffer. This can be useful for
testing.
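
The core of the change is deciding whether a DMA address falls inside the
CMB window before choosing memcpy over pci_dma_write; a self-contained
sketch of that check with made-up base/size values:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool addr_in_window(uint64_t addr, uint64_t base, uint64_t size)
{
    return addr >= base && addr < base + size;
}

int main(void)
{
    uint64_t cmb_base = 0xfe000000, cmb_size = 2 << 20;  /* assumed values */

    printf("%d\n", addr_in_window(0xfe000100, cmb_base, cmb_size));  /* 1 */
    printf("%d\n", addr_in_window(0x10000000, cmb_base, cmb_size));  /* 0 */
    return 0;
}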

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3c392dc336a8..b31e5ff681bd 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -57,6 +57,16 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 }
 }
 
+static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
+{
+if (n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+memcpy((void *)&n->cmbuf[addr - n->ctrl_mem.addr], buf, size);
+return;
+}
+pci_dma_write(&n->parent_obj, addr, buf, size);
+}
+
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
 return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
@@ -276,6 +286,7 @@ static void nvme_post_cqes(void *opaque)
 
 QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
 NvmeSQueue *sq;
+NvmeCqe *cqe = &req->cqe;
 hwaddr addr;
 
 if (nvme_cq_full(cq)) {
@@ -289,8 +300,7 @@ static void nvme_post_cqes(void *opaque)
 req->cqe.sq_head = cpu_to_le16(sq->head);
 addr = cq->dma_addr + cq->tail * n->cqe_size;
 nvme_inc_cq_tail(cq);
-pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
-sizeof(req->cqe));
+nvme_addr_write(n, addr, (void *) cqe, sizeof(*cqe));
 QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
 }
 if (cq->tail != cq->head) {
@@ -1399,7 +1409,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
 
 NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 1);
 NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
 NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
 NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
-- 
2.20.1




[Qemu-devel] [PATCH 09/16] nvme: support Asynchronous Event Request command

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.2 ("Asynchronous Event Request command").

Modified from Keith's qemu-nvme tree.
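
A self-contained sketch of the per-event-type masking used below: once a
completion for a given event type has been posted, further events of that
type are held back until the host clears it (names are illustrative, not
the patch's):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t aer_mask;

static bool may_post(uint8_t event_type)
{
    if (aer_mask & (1u << event_type)) {
        return false;               /* already posted, not yet cleared */
    }
    aer_mask |= 1u << event_type;
    return true;
}

static void clear_events(uint8_t event_type)
{
    aer_mask &= ~(1u << event_type);
}

int main(void)
{
    printf("%d\n", may_post(1));    /* 1: first event of type 1 is posted */
    printf("%d\n", may_post(1));    /* 0: masked until the host clears it */
    clear_events(1);
    printf("%d\n", may_post(1));    /* 1: can be posted again */
    return 0;
}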

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 88 ++-
 hw/block/nvme.h   |  7 
 hw/block/trace-events |  7 
 3 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index eb6af6508e2d..a20576654f1b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -39,6 +39,7 @@
 #include "nvme.h"
 
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -318,6 +319,51 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_process_aers(void *opaque)
+{
+NvmeCtrl *n = opaque;
+NvmeRequest *req;
+NvmeAerResult *result;
+NvmeAsyncEvent *event, *next;
+
+trace_nvme_process_aers();
+
+QSIMPLEQ_FOREACH_SAFE(event, &n->aer_queue, entry, next) {
+/* can't post cqe if there is nothing to complete */
+if (!n->outstanding_aers) {
+trace_nvme_no_outstanding_aers();
+break;
+}
+
+/* ignore if masked (cqe posted, but event not cleared) */
+if (n->aer_mask & (1 << event->result.event_type)) {
+trace_nvme_aer_masked(event->result.event_type, n->aer_mask);
+continue;
+}
+
+QSIMPLEQ_REMOVE_HEAD(&n->aer_queue, entry);
+
+n->aer_mask |= 1 << event->result.event_type;
+n->aer_mask_queued &= ~(1 << event->result.event_type);
+n->outstanding_aers--;
+
+req = n->aer_reqs[n->outstanding_aers];
+
+result = (NvmeAerResult *) &req->cqe.result;
+result->event_type = event->result.event_type;
+result->event_info = event->result.event_info;
+result->log_page = event->result.log_page;
+g_free(event);
+
+req->status = NVME_SUCCESS;
+
+trace_nvme_aer_post_cqe(result->event_type, result->event_info,
+result->log_page);
+
+nvme_enqueue_req_completion(&n->admin_cq, req);
+}
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
 NvmeRequest *req = opaque;
@@ -796,6 +842,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
+case NVME_ASYNCHRONOUS_EVENT_CONF:
+result = cpu_to_le32(n->features.async_config);
 break;
 default:
 trace_nvme_err_invalid_getfeat(dw10);
@@ -841,11 +889,11 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
 ((n->params.num_queues - 2) << 16));
 break;
-
 case NVME_TIMESTAMP:
 return nvme_set_feature_timestamp(n, cmd);
+case NVME_ASYNCHRONOUS_EVENT_CONF:
+n->features.async_config = dw11;
 break;
-
 default:
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -854,6 +902,22 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 return NVME_SUCCESS;
 }
 
+static uint16_t nvme_aer(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+trace_nvme_aer(req->cqe.cid);
+
+if (n->outstanding_aers > NVME_AERL) {
+trace_nvme_aer_aerl_exceeded();
+return NVME_AER_LIMIT_EXCEEDED;
+}
+
+n->aer_reqs[n->outstanding_aers] = req;
+timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+n->outstanding_aers++;
+
+return NVME_NO_COMPLETE;
+}
+
 static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 NvmeSQueue *sq;
@@ -918,6 +982,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return nvme_set_feature(n, cmd, req);
 case NVME_ADM_CMD_GET_FEATURES:
 return nvme_get_feature(n, cmd, req);
+case NVME_ADM_CMD_ASYNC_EV_REQ:
+return nvme_aer(n, cmd, req);
 case NVME_ADM_CMD_ABORT:
 return nvme_abort(n, cmd, req);
 default:
@@ -963,6 +1029,7 @@ static void nvme_process_sq(void *opaque)
 
 static void nvme_clear_ctrl(NvmeCtrl *n)
 {
+NvmeAsyncEvent *event;
 int i;
 
 blk_drain(n->conf.blk);
@@ -978,8 +1045,19 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 }
 }
 
+if (n->aer_timer) {
+timer_del(n->aer_timer);
+timer_free(n->aer_timer);
+n->aer_timer = NULL;
+}
+while ((event = QSIMPLEQ_FIRST(&n->aer_queue)) != NULL) {
+QSIMPLEQ_REMOVE_HEAD(&n->aer_queue, entry);

[Qemu-devel] [PATCH 10/16] nvme: support Get Log Page command

2019-07-05 Thread Klaus Birkelund Jensen
Add support for the Get Log Page command and stub/dumb implementations
of the mandatory Error Information, SMART/Health Information and
Firmware Slot Information log pages.

Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.10 ("Get Log Page command").
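
Each handler below follows the same offset/length rule: reject offsets
past the end of the page and clamp the transfer to what remains. A small
self-contained sketch of it (illustrative names, not the patch's code):

#include <stdint.h>
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static int log_trans_len(uint64_t page_size, uint64_t off, uint32_t buf_len,
                         uint32_t *trans_len)
{
    if (off > page_size) {
        return -1;                  /* Invalid Field in Command */
    }
    *trans_len = MIN(page_size - off, buf_len);
    return 0;
}

int main(void)
{
    uint32_t len;

    if (log_trans_len(512, 128, 4096, &len) == 0) {
        printf("transfer %u bytes\n", len);     /* 384 */
    }
    return 0;
}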

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 209 ++
 hw/block/nvme.h   |   3 +
 hw/block/trace-events |   3 +
 include/block/nvme.h  |   4 +-
 4 files changed, 217 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a20576654f1b..93f5dff197e0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -39,6 +39,8 @@
 #include "nvme.h"
 
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_TEMPERATURE 0x143
+#define NVME_ELPE 3
 #define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
@@ -319,6 +321,36 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type,
+uint8_t event_info, uint8_t log_page)
+{
+NvmeAsyncEvent *event;
+
+trace_nvme_enqueue_event(event_type, event_info, log_page);
+
+/*
+ * Do not enqueue the event if something of this type is already queued.
+ * This bounds the size of the event queue and makes sure it does not grow
+ * indefinitely when events are not processed by the host (i.e. does not
+ * issue any AERs).
+ */
+if (n->aer_mask_queued & (1 << event_type)) {
+return;
+}
+n->aer_mask_queued |= (1 << event_type);
+
+event = g_new(NvmeAsyncEvent, 1);
+event->result = (NvmeAerResult) {
+.event_type = event_type,
+.event_info = event_info,
+.log_page   = log_page,
+};
+
+QSIMPLEQ_INSERT_TAIL(&n->aer_queue, event, entry);
+
+timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+}
+
 static void nvme_process_aers(void *opaque)
 {
 NvmeCtrl *n = opaque;
@@ -831,6 +863,10 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 uint32_t result;
 
 switch (dw10) {
+case NVME_TEMPERATURE_THRESHOLD:
+result = cpu_to_le32(n->features.temp_thresh);
+break;
+case NVME_ERROR_RECOVERY:
 case NVME_VOLATILE_WRITE_CACHE:
 result = blk_enable_write_cache(n->conf.blk);
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -878,6 +914,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
 switch (dw10) {
+case NVME_TEMPERATURE_THRESHOLD:
+n->features.temp_thresh = dw11;
+if (n->features.temp_thresh <= n->temperature) {
+nvme_enqueue_event(n, NVME_AER_TYPE_SMART,
+NVME_AER_INFO_SMART_TEMP_THRESH, NVME_LOG_SMART_INFO);
+}
+break;
 case NVME_VOLATILE_WRITE_CACHE:
 blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
 break;
@@ -902,6 +945,137 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 return NVME_SUCCESS;
 }
 
+static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
+{
+n->aer_mask &= ~(1 << event_type);
+if (!QSIMPLEQ_EMPTY(&n->aer_queue)) {
+timer_mod(n->aer_timer,
+qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+}
+}
+
+static uint16_t nvme_error_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
+uint32_t buf_len, uint64_t off, NvmeRequest *req)
+{
+uint32_t trans_len;
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+if (off > sizeof(*n->elpes) * (NVME_ELPE + 1)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(*n->elpes) * (NVME_ELPE + 1) - off, buf_len);
+
+if (!rae) {
+nvme_clear_events(n, NVME_AER_TYPE_ERROR);
+}
+
+return nvme_dma_read_prp(n, (uint8_t *) n->elpes + off, trans_len, prp1,
+prp2);
+}
+
+static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
+uint32_t buf_len, uint64_t off, NvmeRequest *req)
+{
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+uint32_t trans_len;
+time_t current_ms;
+NvmeSmartLog smart;
+
+if (cmd->nsid != 0 && cmd->nsid != 0xffffffff) {
+trace_nvme_err(req->cqe.cid, "smart log not supported for namespace",
+NVME_INVALID_FIELD | NVME_DNR);
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+if (off > sizeof(smart)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(smart) - off, buf_len);
+
+memset(, 0x0, sizeof(s
