Re: [Qemu-block] [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-08-23 Thread Klaus Birkelund
On Thu, Aug 22, 2019 at 02:18:05PM +0100, Ross Lagerwall wrote:
> On 7/5/19 8:23 AM, Klaus Birkelund Jensen wrote:
> 
> I tried this patch series by installing Windows with a single NVME
> controller having two namespaces. QEMU crashed in get_feature /
> NVME_VOLATILE_WRITE_CACHE because req->ns was NULL.
> 

Hi Ross,

Good catch!

> nvme_get_feature / nvme_set_feature look wrong to me since I can't see how
> req->ns would have been set. Should they have similar code to nvme_io_cmd to
> set req->ns from cmd->nsid?

Definitely. I will fix that for v2.
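
For reference, a minimal sketch of the lookup I have in mind (the
namespaces array and field names mirror the rest of this series and are
assumptions here, not final code); nvme_set_feature would get the same
treatment:

    static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
    {
        uint32_t nsid = le32_to_cpu(cmd->nsid);

        /* resolve req->ns from the NSID, as nvme_io_cmd already does */
        if (nsid && nsid != 0xffffffff) {
            if (unlikely(nsid > n->num_namespaces)) {
                return NVME_INVALID_NSID | NVME_DNR;
            }

            req->ns = &n->namespaces[nsid - 1];
        }

        /* ... existing switch on the feature id follows ... */
        return NVME_SUCCESS;
    }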

> 
> After working around this issue everything else seemed to be working well.
> Thanks for your work on this patch series.
> 

And thank you for trying out my patches!


Cheers,
Klaus



Re: [PATCH] nvme: fix NSSRS offset in CAP register

2019-10-23 Thread Klaus Birkelund
On Wed, Oct 23, 2019 at 11:26:57AM -0400, John Snow wrote:
> 
> 
> On 10/23/19 3:33 AM, Klaus Jensen wrote:
> > Fix the offset of the NSSRS field in the CAP register.
> 
> From NVME 1.4, section 3 ("Controller Registers"), subsection 3.1.1
> ("Offset 0h: CAP – Controller Capabilities") CAP_NSSRS_SHIFT is bit 36,
> not 33.
> 
> > 
> > Signed-off-by: Klaus Jensen 
> > Reported-by: Javier Gonzalez 
> > ---
> >  include/block/nvme.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 3ec8efcc435e..fa15b51c33bb 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -23,7 +23,7 @@ enum NvmeCapShift {
> >  CAP_AMS_SHIFT  = 17,
> >  CAP_TO_SHIFT   = 24,
> >  CAP_DSTRD_SHIFT= 32,
> > -CAP_NSSRS_SHIFT= 33,
> > +CAP_NSSRS_SHIFT= 36,
> >  CAP_CSS_SHIFT  = 37,
> >  CAP_MPSMIN_SHIFT   = 48,
> >  CAP_MPSMAX_SHIFT   = 52,
> > 
> 
> I like updating commit messages with spec references; if it can be
> updated that would be nice.
> 
> Regardless:
> 
> Reviewed-by: John Snow 
> 

Sounds good. Can the committer squash that in?


Cheers,
Klaus



Re: [PATCH v2 00/20] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-10-27 Thread Klaus Birkelund
On Tue, Oct 15, 2019 at 12:38:40PM +0200, Klaus Jensen wrote:
> Hi,
> 
> (Quick note to Fam): most of this series is irrelevant to you as the
> maintainer of the nvme block driver, but patch "nvme: add support for
> scatter gather lists" touches block/nvme.c due to changes in the shared
> NvmeCmd struct.
> 
> Anyway, v2 comes with a good bunch of changes. Compared to v1[1], I have
> squashed some commits in the beginning of the series and heavily
> refactored "nvme: support multiple block requests per request" into the
> new commit "nvme: allow multiple aios per command".
> 
> I have also removed the original implementation of the Abort command
> (commit "nvme: add support for the abort command") as it is currently
> too tricky to test reliably. It has been replaced by a stub that,
> besides a trivial sanity check, just fails to abort the given command.
> *Some* implementation of the Abort command is mandatory, but given the
> "best effort" nature of the command this is acceptable for now. When the
> device gains support for arbitration it should be less tricky to test.
> 
> The support for multiple namespaces is now backwards compatible. The
> nvme device still accepts a 'drive' parameter, but for multiple
> namespaces the use of 'nvme-ns' devices is required. I also integrated
> some feedback from Paul so the device supports non-consecutive namespace
> ids.
> 
> I have also added some new commits at the end:
> 
>   - "nvme: bump controller pci device id" makes sure the Linux kernel
> doesn't apply any quirks to the controller for issues it no longer has.
>   - "nvme: handle dma errors" won't actually do anything before this[2]
> fix to include/hw/pci/pci.h is merged. With these two patches added,
> the device reliably passes some additional nasty tests from blktests
> (block/011 "disable PCI device while doing I/O" and block/019 "break
> PCI link device while doing I/O"). Before this patch, block/011
> would pass from time to time if you were lucky, but would at least
> mess up the controller pretty badly, causing a reset in the best
> case.
> 
> 
>   [1]: https://patchwork.kernel.org/project/qemu-devel/list/?series=142383
>   [2]: https://patchwork.kernel.org/patch/11184911/
> 
> 
> Klaus Jensen (20):
>   nvme: remove superfluous breaks
>   nvme: move device parameters to separate struct
>   nvme: add missing fields in the identify controller data structure
>   nvme: populate the mandatory subnqn and ver fields
>   nvme: allow completion queues in the cmb
>   nvme: add support for the abort command
>   nvme: refactor device realization
>   nvme: add support for the get log page command
>   nvme: add support for the asynchronous event request command
>   nvme: add logging to error information log page
>   nvme: add missing mandatory features
>   nvme: bump supported specification version to 1.3
>   nvme: refactor prp mapping
>   nvme: allow multiple aios per command
>   nvme: add support for scatter gather lists
>   nvme: support multiple namespaces
>   nvme: bump controller pci device id
>   nvme: remove redundant NvmeCmd pointer parameter
>   nvme: make lba data size configurable
>   nvme: handle dma errors
> 
>  block/nvme.c   |   18 +-
>  hw/block/Makefile.objs |2 +-
>  hw/block/nvme-ns.c |  139 +++
>  hw/block/nvme-ns.h |   60 ++
>  hw/block/nvme.c| 1863 +---
>  hw/block/nvme.h|  219 -
>  hw/block/trace-events  |   37 +-
>  include/block/nvme.h   |  132 ++-
>  8 files changed, 2094 insertions(+), 376 deletions(-)
>  create mode 100644 hw/block/nvme-ns.c
>  create mode 100644 hw/block/nvme-ns.h
> 
> -- 
> 2.23.0
> 

Gentle ping on this.

I'm aware that this is a lot to go through, but I would like to know if
anyone has had a chance to look at it?


https://patchwork.kernel.org/project/qemu-devel/list/?series=187637




Re: [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-11-04 Thread Klaus Birkelund
On Mon, Nov 04, 2019 at 08:46:29AM +, Ross Lagerwall wrote:
> On 8/23/19 9:10 AM, Klaus Birkelund wrote:
> > On Thu, Aug 22, 2019 at 02:18:05PM +0100, Ross Lagerwall wrote:
> >> On 7/5/19 8:23 AM, Klaus Birkelund Jensen wrote:
> >>
> >> I tried this patch series by installing Windows with a single NVME
> >> controller having two namespaces. QEMU crashed in get_feature /
> >> NVME_VOLATILE_WRITE_CACHE because req->ns was NULL.
> >>
> > 
> > Hi Ross,
> > 
> > Good catch!
> > 
> >> nvme_get_feature / nvme_set_feature look wrong to me since I can't see how
> >> req->ns would have been set. Should they have similar code to nvme_io_cmd 
> >> to
> >> set req->ns from cmd->nsid?
> > 
> > Definitely. I will fix that for v2.
> > 
> >>
> >> After working around this issue everything else seemed to be working well.
> >> Thanks for your work on this patch series.
> >>
> > 
> > And thank you for trying out my patches!
> > 
> 
> One more thing... it doesn't handle inactive namespaces properly so if you
> have two namespaces with e.g. nsid=1 and nsid=3 QEMU ends up crashing in
> certain situations. The patch below adds support for inactive namespaces.
> 
> Still hoping to see a v2 some day :-)
> 
 
Hi Ross,

v2[1] is actually out, but I only CC'ed Paul. Sorry about that! It fixes
the support for discontiguous nsids, but does not handle inactive
namespaces correctly in identify.

I'll incorporate that in a v3 along with a couple of other fixes I did.

Thanks!


  [1]: https://patchwork.kernel.org/cover/11190045/



Re: [PATCH v2 06/20] nvme: add support for the abort command

2019-11-12 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:04:38PM +, Beata Michalska wrote:
> Hi Klaus
> 

Hi Beata,

Thank you very much for your thorough reviews! I'll start going through
them one by one :) You might have seen that I've posted a v3, but I will
make sure to consolidate between v2 and v3!

> On Tue, 15 Oct 2019 at 11:41, Klaus Jensen  wrote:
> >
> > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > Section 5.1 ("Abort command").
> >
> > The Abort command is a best effort command; for now, the device always
> > fails to abort the given command.
> >
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 16 
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index daa2367b0863..84e4f2ea7a15 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -741,6 +741,18 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd 
> > *cmd)
> >  }
> >  }
> >
> > +static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
> > +
> > +req->cqe.result = 1;
> > +if (nvme_check_sqid(n, sqid)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> Shouldn't we validate the CID as well ?
> 

According to the specification it is "implementation specific if/when a
controller chooses to complete the command when the command to abort is
not found".

I'm interpreting this to mean that, yes, an invalid command identifier
could be given in the command, but this implementation does not care
about that.

I still think the controller should check the validity of the submission
queue identifier though. It is a general invariant that the sqid should
be valid.

> > +return NVME_SUCCESS;
> > +}
> > +
> >  static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
> >  {
> >  trace_nvme_setfeat_timestamp(ts);
> > @@ -859,6 +871,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  trace_nvme_err_invalid_setfeat(dw10);
> >  return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> > +
> >  return NVME_SUCCESS;
> >  }
> >
> > @@ -875,6 +888,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  return nvme_create_cq(n, cmd);
> >  case NVME_ADM_CMD_IDENTIFY:
> >  return nvme_identify(n, cmd);
> > +case NVME_ADM_CMD_ABORT:
> > +return nvme_abort(n, cmd, req);
> >  case NVME_ADM_CMD_SET_FEATURES:
> >  return nvme_set_feature(n, cmd, req);
> >  case NVME_ADM_CMD_GET_FEATURES:
> > @@ -1388,6 +1403,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >  id->ieee[2] = 0xb3;
> >  id->ver = cpu_to_le32(0x00010201);
> >  id->oacs = cpu_to_le16(0);
> > +id->acl = 3;
> So we are setting the max number of concurrent commands
> but there is no logic to enforce that and wrap up with the
> status suggested by specification.
> 

That is true, but because the controller always completes the Abort
command immediately this cannot happen. If the controller did try to
abort executing commands, the Abort command would need to linger in the
controller state until a completion queue entry is posted for the
command to be aborted before the completion queue entry can be posted
for the Abort command. This takes up resources in the controller and is
the reason for the Abort Command Limit.

You could argue that we should set ACL to 0 then, but the specification
recommends a value of 3 and I do not see any harm in conveying a
"reasonable", though inconsequential, value.



Re: [PATCH v2 04/20] nvme: populate the mandatory subnqn and ver fields

2019-11-12 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:04:45PM +, Beata Michalska wrote:
> Hi Klaus
> 
> On Tue, 15 Oct 2019 at 11:42, Klaus Jensen  wrote:
> > +n->bar.vs = 0x00010201;
> 
> Very minor:
> 
> The version number is being set twice in the patch series already.
> And it is being set in two places.
> It might be worth to make a #define out of it so that only one
> needs to be changed.
> 

I think you are right. I'll do that.
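
Something along these lines (the macro name is just a placeholder for
this sketch):

    /* single definition of the implemented specification version (1.2.1) */
    #define NVME_SPEC_VER 0x00010201

    /* ... used in the two places that currently hard-code the value: */
    id->ver = cpu_to_le32(NVME_SPEC_VER);
    n->bar.vs = NVME_SPEC_VER;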



Re: [PATCH v2 19/20] nvme: make lba data size configurable

2019-11-12 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:24:00PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:50, Klaus Jensen  wrote:
> >  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> > -DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > +DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> > +DEFINE_PROP_UINT8("lbads", _state, _props.lbads, 9)
> >
> Could we actually use BDRV_SECTOR_BITS instead of magic numbers?
> 
 
Yes, better. Fixed in two places.
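
That is, roughly (BDRV_SECTOR_BITS is 9):

    #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
        DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
        DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)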



Re: [PATCH v2 06/20] nvme: add support for the abort command

2019-11-18 Thread Klaus Birkelund
On Fri, Nov 15, 2019 at 11:56:00AM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Wed, 13 Nov 2019 at 06:12, Klaus Birkelund  wrote:
> >
> > On Tue, Nov 12, 2019 at 03:04:38PM +, Beata Michalska wrote:
> > > Hi Klaus
> > >
> >
> > Hi Beata,
> >
> > Thank you very much for your thorough reviews! I'll start going through
> > them one by one :) You might have seen that I've posted a v3, but I will
> > make sure to consolidate between v2 and v3!
> >
> > > On Tue, 15 Oct 2019 at 11:41, Klaus Jensen  wrote:
> > > >
> > > > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > > > Section 5.1 ("Abort command").
> > > >
> > > > The Abort command is a best effort command; for now, the device always
> > > > fails to abort the given command.
> > > >
> > > > Signed-off-by: Klaus Jensen 
> > > > ---
> > > >  hw/block/nvme.c | 16 
> > > >  1 file changed, 16 insertions(+)
> > > >
> > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > > index daa2367b0863..84e4f2ea7a15 100644
> > > > --- a/hw/block/nvme.c
> > > > +++ b/hw/block/nvme.c
> > > > @@ -741,6 +741,18 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd 
> > > > *cmd)
> > > >  }
> > > >  }
> > > >
> > > > +static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > > +{
> > > > +uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
> > > > +
> > > > +req->cqe.result = 1;
> > > > +if (nvme_check_sqid(n, sqid)) {
> > > > +return NVME_INVALID_FIELD | NVME_DNR;
> > > > +}
> > > > +
> > > Shouldn't we validate the CID as well ?
> > >
> >
> > According to the specification it is "implementation specific if/when a
> > controller chooses to complete the command when the command to abort is
> > not found".
> >
> > I'm interpreting this to mean that, yes, an invalid command identifier
> > could be given in the command, but this implementation does not care
> > about that.
> >
> > I still think the controller should check the validity of the submission
> > queue identifier though. It is a general invariant that the sqid should
> > be valid.
> >
> > > > +return NVME_SUCCESS;
> > > > +}
> > > > +
> > > >  static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
> > > >  {
> > > >  trace_nvme_setfeat_timestamp(ts);
> > > > @@ -859,6 +871,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > > > NvmeCmd *cmd, NvmeRequest *req)
> > > >  trace_nvme_err_invalid_setfeat(dw10);
> > > >  return NVME_INVALID_FIELD | NVME_DNR;
> > > >  }
> > > > +
> > > >  return NVME_SUCCESS;
> > > >  }
> > > >
> > > > @@ -875,6 +888,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd 
> > > > *cmd, NvmeRequest *req)
> > > >  return nvme_create_cq(n, cmd);
> > > >  case NVME_ADM_CMD_IDENTIFY:
> > > >  return nvme_identify(n, cmd);
> > > > +case NVME_ADM_CMD_ABORT:
> > > > +return nvme_abort(n, cmd, req);
> > > >  case NVME_ADM_CMD_SET_FEATURES:
> > > >  return nvme_set_feature(n, cmd, req);
> > > >  case NVME_ADM_CMD_GET_FEATURES:
> > > > @@ -1388,6 +1403,7 @@ static void nvme_realize(PCIDevice *pci_dev, 
> > > > Error **errp)
> > > >  id->ieee[2] = 0xb3;
> > > >  id->ver = cpu_to_le32(0x00010201);
> > > >  id->oacs = cpu_to_le16(0);
> > > > +id->acl = 3;
> > > So we are setting the max number of concurrent commands
> > > but there is no logic to enforce that and wrap up with the
> > > status suggested by specification.
> > >
> >
> > That is true, but because the controller always completes the Abort
> > command immediately this cannot happen. If the controller did try to
> > abort executing commands, the Abort command would need to linger in the
> > controller state until a completion queue entry is posted for the
> > command to be aborted before the completion queue entry can be posted
> > for the Abort command. This takes up resources in the controller and is
> > the reason for the Abort Command Limit.
> >
> > You could argue that we should set ACL to 0 then, but the specification
> > recommends a value of 3 and I do not see any harm in conveying a
> > "reasonable", though inconsequential, value.
> 
> Could we  potentially add some comment describing the above ?
> 

Yes, absolutely! :)
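
For reference, something like this is what I have in mind (wording is a
sketch, not the final comment):

    /*
     * The Abort command is completed immediately and never actually aborts
     * the referenced command, so no Abort ever lingers in controller state
     * waiting for another completion. The Abort Command Limit therefore has
     * no practical effect here; we still report the value of 3 recommended
     * by the specification.
     */
    id->acl = 3;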


Klaus



Re: [PATCH v2 12/20] nvme: bump supported specification version to 1.3

2019-11-18 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:05:06PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:52, Klaus Jensen  wrote:
> >
> > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > +{
> > +static const int len = 4096;
> > +
> > +struct ns_descr {
> > +uint8_t nidt;
> > +uint8_t nidl;
> > +uint8_t rsvd2[2];
> > +uint8_t nid[16];
> > +};
> > +
> > +uint32_t nsid = le32_to_cpu(c->nsid);
> > +uint64_t prp1 = le64_to_cpu(c->prp1);
> > +uint64_t prp2 = le64_to_cpu(c->prp2);
> > +
> > +struct ns_descr *list;
> > +uint16_t ret;
> > +
> > +trace_nvme_identify_ns_descr_list(nsid);
> > +
> > +if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > +trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
> > +return NVME_INVALID_NSID | NVME_DNR;
> > +}
> > +
> In theory this should abort the command for inactive NSIDs as well.
> But I guess this will come later on.
> 

At this point in the series, the device does not support multiple
namespaces anyway and num_namespaces is always 1. But this has also been
reported separately in relation to the patch adding multiple namespaces and
is fixed in v3.

> > +list = g_malloc0(len);
> > +list->nidt = 0x3;
> > +list->nidl = 0x10;
> > +*(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> > +
> Might be worth to add some comment here -> as per the NGUID/EUI64 format.
> Also those are not specified currently in the namespace identity data 
> structure.
> 

I'll add a comment for why the Namespace UUID is set to this value here.
The NGUID/EUI64 fields are not set in the namespace identity data
structure as they are not required. See the descriptions of NGUID and
EUI64. Here for NGUID:

"The controller shall specify a globally unique namespace identifier
in this field, the EUI64 field, or a Namespace UUID in the Namespace
Identification Descriptor..."

Here, I chose to provide it in the Namespace Identification Descriptor
(by setting `list->nidt = 0x3`).
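
Concretely, I am thinking of a comment like this above the assignments
(a sketch, not the final wording):

    /*
     * Provide a single Namespace Identification Descriptor of type 3
     * (Namespace UUID, 16 bytes), derived from the NSID. The spec only
     * requires one of NGUID, EUI64 or the Namespace UUID to be set, so
     * NGUID and EUI64 stay cleared in the identify namespace structure.
     */
    list->nidt = 0x3;
    list->nidl = 0x10;
    *(uint32_t *)&list->nid[12] = cpu_to_be32(nsid);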

> > +ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
> > +g_free(list);
> > +return ret;
> > +}
> > +
> >  static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> >  {
> >  NvmeIdentify *c = (NvmeIdentify *)cmd;
> > @@ -934,7 +978,9 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> >  case 0x01:
> >  return nvme_identify_ctrl(n, c);
> >  case 0x02:
> > -return nvme_identify_nslist(n, c);
> > +return nvme_identify_ns_list(n, c);
> > +case 0x03:
> > +return nvme_identify_ns_descr_list(n, cmd);
> >  default:
> >  trace_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
> >  return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1101,6 +1147,14 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > NvmeCmd *cmd, NvmeRequest *req)
> >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >  break;
> >  case NVME_NUMBER_OF_QUEUES:
> > +if (n->qs_created > 2) {
> > +return NVME_CMD_SEQ_ERROR | NVME_DNR;
> > +}
> > +
> I am not sure this is entirely correct as the spec says:
> "if any I/O Submission and/or Completion Queues (...)"
> so it might be enough to have a single queue created
> for this command to be valid.
> Also I think that the condition here is to make sure that the number
> of queues requested is being set once at init phase. Currently this will
> allow the setting to happen if there is no active queue -> so at any
> point of time (provided the condition mentioned). I might be wrong here
> but it seems that what we need is a single status saying any queue
> has been created prior to the Set Feature command at all
> 

Internally, the admin queue pair is counted in qs_created, which is the
reason for checking if it is above 2. The admin queues are created when the
controller is enabled (mmio write to the EN register in CC).

I'll add a comment about that - I see why it is unclear.
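
Roughly like this (again, a sketch of the comment, not the final wording):

    case NVME_NUMBER_OF_QUEUES:
        /*
         * The spec requires a Command Sequence Error if any I/O queues
         * have been created since the controller was last reset. Note
         * that qs_created also counts the admin submission/completion
         * queue pair (created when the controller is enabled), so any
         * value above 2 means an I/O queue exists.
         */
        if (n->qs_created > 2) {
            return NVME_CMD_SEQ_ERROR | NVME_DNR;
        }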

> 
> Small note: this patch seems to be introducing more changes
> than specified in the commit message and especially the subject. Might
> be worth to extend it a bit.
> 

You are right. I'll split it up.



Re: [PATCH v2 09/20] nvme: add support for the asynchronous event request command

2019-11-19 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:04:59PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:49, Klaus Jensen  wrote:
> > @@ -1188,6 +1326,9 @@ static int nvme_start_ctrl(NvmeCtrl *n)
> >
> >  nvme_set_timestamp(n, 0ULL);
> >
> > +n->aer_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_aers, n);
> > +QTAILQ_INIT(&n->aer_queue);
> > +
> 
> Is the timer really needed here ? The CQE can be posted either when requested
> by host through AER, if there are any pending events, or once the
> event is triggered
> and there are active AER's.
> 

I guess you are right. I mostly cribbed this from Keith's tree, but I
see no reason to keep the timer.

Keith, do you have any comments on this?

> > @@ -1380,6 +1521,13 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr 
> > addr, int val)
> > "completion queue doorbell write"
> > " for nonexistent queue,"
> > " sqid=%"PRIu32", ignoring", qid);
> > +
> > +if (n->outstanding_aers) {
> > +nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
> > +NVME_AER_INFO_ERR_INVALID_DB_REGISTER,
> > +NVME_LOG_ERROR_INFO);
> > +}
> > +
> This one (as well as cases below) might not be entirely right
> according to the spec. If a given event is enabled for asynchronous
> reporting the controller should retain that event. In this case, the event
> will be ignored as there is no pending request.
> 

I understand these notifications to be special cases (i.e. they cannot
be enabled/disabled through the Asynchronous Event Configuration
feature). See Section 4.1 of NVM Express 1.2.1. The spec specifically
says that "... and an Asynchronous Event Request command is outstanding,
...".

> > @@ -1591,6 +1759,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  id->ver = cpu_to_le32(0x00010201);
> >  id->oacs = cpu_to_le16(0);
> >  id->acl = 3;
> > +id->aerl = n->params.aerl;
> 
> What about the configuration for the asynchronous events ?
> 

It will default to an AEC vector of 0 (everything disabled).


K



Re: [PATCH v2 08/20] nvme: add support for the get log page command

2019-11-19 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:04:52PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> 
> On Tue, 15 Oct 2019 at 11:45, Klaus Jensen  wrote:
> > +if (!nsid || (nsid != 0xffffffff && nsid > n->num_namespaces)) {
> > +trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
> > +return NVME_INVALID_NSID | NVME_DNR;
> > +}
> > +
> The LPA '0' bit is cleared now - which means there is no support
> for per-namespace data. So in theory, if that was the aim, this condition
> should check for values different from 0x0 and 0xffffffff and either
> abort the command or treat that as a request for controller specific data.
> 

This is fixed in v3 (that is, it just checks for values different from
0x0 and 0xffffffff).
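
That is, roughly the following (the exact status code used in v3 may
differ; this is a sketch):

    uint32_t nsid = le32_to_cpu(cmd->nsid);

    /* LPA bit 0 is cleared, so only the controller-wide NSID values apply */
    if (nsid && nsid != 0xffffffff) {
        trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
        return NVME_INVALID_NSID | NVME_DNR;
    }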

> > +switch (lid) {
> > +case NVME_LOG_ERROR_INFO:
> > +return nvme_error_info(n, cmd, len, off, req);
> > +case NVME_LOG_SMART_INFO:
> > +return nvme_smart_info(n, cmd, len, off, req);
> > +case NVME_LOG_FW_SLOT_INFO:
> > +return nvme_fw_log_info(n, cmd, len, off, req);
> > +default:
> > +trace_nvme_err_invalid_log_page(req->cid, lid);
> > +return NVME_INVALID_LOG_ID | NVME_DNR;
> 
> The spec mentions the Invalid Field in Command  case processing
> command with an unsupported log id.
> 

Thanks. Fixed!

> > +}
> > +}
> > +
> >  static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
> >  {
> >  n->cq[cq->cqid] = NULL;
> > @@ -812,6 +944,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  uint32_t result;
> >
> >  switch (dw10) {
> > +case NVME_TEMPERATURE_THRESHOLD:
> > +result = cpu_to_le32(n->features.temp_thresh);
> > +break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  result = blk_enable_write_cache(n->conf.blk);
> >  trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
> > @@ -856,6 +991,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >
> >  switch (dw10) {
> > +case NVME_TEMPERATURE_THRESHOLD:
> > +n->features.temp_thresh = dw11;
> > +break;
> > +
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >  break;
> > @@ -884,6 +1023,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  return nvme_del_sq(n, cmd);
> >  case NVME_ADM_CMD_CREATE_SQ:
> >  return nvme_create_sq(n, cmd);
> > +case NVME_ADM_CMD_GET_LOG_PAGE:
> > +return nvme_get_log(n, cmd, req);
> >  case NVME_ADM_CMD_DELETE_CQ:
> >  return nvme_del_cq(n, cmd);
> >  case NVME_ADM_CMD_CREATE_CQ:
> > @@ -923,6 +1064,7 @@ static void nvme_process_sq(void *opaque)
> >  QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
> >  memset(&req->cqe, 0, sizeof(req->cqe));
> >  req->cqe.cid = cmd.cid;
> > +req->cid = le16_to_cpu(cmd.cid);
> 
> If I haven't missed anything this is being used only in one place
> for tracing - is it really worth to duplicate the cid here ?
> 

At this point in the series, yes - it is only used once. But it will be
used extensively for tracing in the later patches.

> > -id->lpa = 1 << 0;
> > +id->lpa = 1 << 2;
> 
> This sets the bit that states support for GLP command but clears the one
> that states support for per-namespace SMART/Health data - is that expected ?
> 

Yes, clearing the bit for per-namespace SMART/Health log page
information is intentional. There is no namespace-specific information
defined in the SMART/Health log page, so the global and per-namespace
log pages contain the same information.



Re: [PATCH v2 13/20] nvme: refactor prp mapping

2019-11-20 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:23:43PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:57, Klaus Jensen  wrote:
> >
> > Instead of handling both QSGs and IOVs in multiple places, simply use
> > QSGs everywhere by assuming that the request does not involve the
> > controller memory buffer (CMB). If the request is found to involve the
> > CMB, convert the QSG to an IOV and issue the I/O. The QSG is converted
> > to an IOV by the dma helpers anyway, so the CMB path is not unfairly
> > affected by this simplifying change.
> >
> 
> Out of curiosity, in how many cases the SG list will have to
> be converted to IOV ? Does that justify creating the SG list in vain ?
> 

You got me wondering. Only using QSGs does not really remove much
complexity, so I re-added the direct use of IOVs for the CMB path. There
is no harm in that.
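
The re-added IOV path then boils down to something like this when adding
a mapping (nvme_addr_to_cmb is a hypothetical helper returning a pointer
into n->cmbuf for a CMB address):

    static void nvme_map_add(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
                             hwaddr addr, size_t len)
    {
        if (nvme_addr_is_cmb(n, addr)) {
            /* the CMB is backed by host memory, so add directly to the IOV */
            qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len);
        } else {
            /* everything else goes through the DMA scatter/gather list */
            qemu_sglist_add(qsg, addr, len);
        }
    }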

> > +static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
> > +uint64_t prp2, uint32_t len, NvmeRequest *req)
> >  {
> >  hwaddr trans_len = n->page_size - (prp1 % n->page_size);
> >  trans_len = MIN(len, trans_len);
> >  int num_prps = (len >> n->page_bits) + 1;
> > +uint16_t status = NVME_SUCCESS;
> > +bool prp_list_in_cmb = false;
> > +
> > +trace_nvme_map_prp(req->cid, req->cmd.opcode, trans_len, len, prp1, 
> > prp2,
> > +num_prps);
> >
> >  if (unlikely(!prp1)) {
> >  trace_nvme_err_invalid_prp();
> >  return NVME_INVALID_FIELD | NVME_DNR;
> > -} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
> > -   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
> > -qsg->nsg = 0;
> > -qemu_iovec_init(iov, num_prps);
> > -qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], 
> > trans_len);
> > -} else {
> > -pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
> > -qemu_sglist_add(qsg, prp1, trans_len);
> >  }
> > +
> > +if (nvme_addr_is_cmb(n, prp1)) {
> > +req->is_cmb = true;
> > +}
> > +
> This seems to be used here and within read/write functions which are calling
> this one. Maybe there is a nicer way to track that instead of passing
> the request
> from multiple places ?
> 

Hmm. Whether or not the command reads/writes from the CMB is really only
something you can determine by looking at the PRPs (which is done in
nvme_map_prp), so I think this is the right way to track it. Or do you
have something else in mind?

> > +pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
> > +qemu_sglist_add(qsg, prp1, trans_len);
> > +
> >  len -= trans_len;
> >  if (len) {
> >  if (unlikely(!prp2)) {
> >  trace_nvme_err_invalid_prp2_missing();
> > +status = NVME_INVALID_FIELD | NVME_DNR;
> >  goto unmap;
> >  }
> > +
> >  if (len > n->page_size) {
> >  uint64_t prp_list[n->max_prp_ents];
> >  uint32_t nents, prp_trans;
> >  int i = 0;
> >
> > +if (nvme_addr_is_cmb(n, prp2)) {
> > +prp_list_in_cmb = true;
> > +}
> > +
> >  nents = (len + n->page_size - 1) >> n->page_bits;
> >  prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > -nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
> > +nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> >  while (len != 0) {
> > +bool addr_is_cmb;
> >  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
> >
> >  if (i == n->max_prp_ents - 1 && len > n->page_size) {
> >  if (unlikely(!prp_ent || prp_ent & (n->page_size - 
> > 1))) {
> >  trace_nvme_err_invalid_prplist_ent(prp_ent);
> > +status = NVME_INVALID_FIELD | NVME_DNR;
> > +goto unmap;
> > +}
> > +
> > +addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
> > +if ((prp_list_in_cmb && !addr_is_cmb) ||
> > +(!prp_list_in_cmb && addr_is_cmb)) {
> 
> Minor: Same condition (based on different vars) is being used in
> multiple places. Might be worth to move it outside and just pass in
> the needed values.
> 

I'm really not sure what I was smoking when writing those conditions.
It's just `var != nvme_addr_is_cmb(n, prp_ent)`. I fixed that. No need
to pull it out I think.

> >  static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > -   uint64_t prp1, uint64_t prp2)
> > +uint64_t prp1, uint64_t prp2, NvmeRequest *req)
> >  {
> >  QEMUSGList qsg;
> > -QEMUIOVector iov;
> >  uint16_t status = NVME_SUCCESS;
> >
> > -if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> > -return NVME_INVALID_FIELD | NVME_DNR;
> > +status = nvme_map_prp(n, &qsg, prp1, prp2, len, req);
> > +if (status) {
> > +return status;
>

Re: [PATCH v2 14/20] nvme: allow multiple aios per command

2019-11-21 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:25:06PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:55, Klaus Jensen  wrote:
> > @@ -341,19 +344,18 @@ static uint16_t nvme_dma_write_prp(NvmeCtrl *n, 
> > uint8_t *ptr, uint32_t len,
> Any reason why the nvme_dma_write_prp is missing the changes applied
> to nvme_dma_read_prp ?
> 

This was addressed by proxy through changes to the previous patch
(by combining the read/write functions).

> > +case NVME_AIO_OPC_WRITE_ZEROES:
> > +block_acct_start(stats, acct, aio->iov.size, BLOCK_ACCT_WRITE);
> > +aio->aiocb = blk_aio_pwrite_zeroes(aio->blk, aio->offset,
> > +aio->iov.size, BDRV_REQ_MAY_UNMAP, nvme_aio_cb, aio);
> Minor: aio->blk  => blk
> 

Thanks. Fixed this in a couple of other places as well.

> > @@ -621,8 +880,11 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
> >  sq = n->sq[qid];
> >  while (!QTAILQ_EMPTY(&sq->out_req_list)) {
> >  req = QTAILQ_FIRST(&sq->out_req_list);
> > -assert(req->aiocb);
> > -blk_aio_cancel(req->aiocb);
> > +while (!QTAILQ_EMPTY(&req->aio_tailq)) {
> > +aio = QTAILQ_FIRST(&req->aio_tailq);
> > +assert(aio->aiocb);
> > +blk_aio_cancel(aio->aiocb);
> What about releasing memory associated with given aio ?

I believe the callback is still called when cancelled? That should take
care of it. Or have I misunderstood that? At least for the DMAAIOCBs it
is.

> > +struct NvmeAIO {
> > +NvmeRequest *req;
> > +
> > +NvmeAIOOp   opc;
> > +int64_t offset;
> > +BlockBackend*blk;
> > +BlockAIOCB  *aiocb;
> > +BlockAcctCookie acct;
> > +
> > +NvmeAIOCompletionFunc *cb;
> > +void  *cb_arg;
> > +
> > +QEMUSGList   *qsg;
> > +QEMUIOVector iov;
> 
> There is a bit of inconsistency on the ownership of IOVs and SGLs.
> SGLs now seem to be owned by request whereas IOVs by the aio.
> Would be good to have that unified or documented at least.
> 

Fixed this. The NvmeAIO only holds pointers now.

> > +#define NVME_REQ_TRANSFER_DMA  0x1
> This one does not seem to be used 
> 

I have dropped the flags and reverted to a simple req->is_cmb as that is
all that is really needed.




Re: [PATCH v2 15/20] nvme: add support for scatter gather lists

2019-11-24 Thread Klaus Birkelund
On Tue, Nov 12, 2019 at 03:25:18PM +, Beata Michalska wrote:
> Hi Klaus,
> 
> On Tue, 15 Oct 2019 at 11:57, Klaus Jensen  wrote:
> > +static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
> > +NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
> > +{
> > +const int MAX_NSGLD = 256;
> > +
> > +NvmeSglDescriptor segment[MAX_NSGLD];
> > +uint64_t nsgld;
> > +uint16_t status;
> > +bool sgl_in_cmb = false;
> > +hwaddr addr = le64_to_cpu(sgl.addr);
> > +
> > +trace_nvme_map_sgl(req->cid, NVME_SGL_TYPE(sgl.type), req->nlb, len);
> > +
> > +pci_dma_sglist_init(qsg, &n->parent_obj, 1);
> > +
> > +/*
> > + * If the entire transfer can be described with a single data block it 
> > can
> > + * be mapped directly.
> > + */
> > +if (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_DATA_BLOCK) {
> > +status = nvme_map_sgl_data(n, qsg, &sgl, 1, &len, req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +goto out;
> > +}
> > +
> > +/*
> > + * If the segment is located in the CMB, the submission queue of the
> > + * request must also reside there.
> > + */
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
> > +return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +}
> > +
> > +sgl_in_cmb = true;
> > +}
> > +
> > +while (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_SEGMENT) {
> > +bool addr_is_cmb;
> > +
> > +nsgld = le64_to_cpu(sgl.len) / sizeof(NvmeSglDescriptor);
> > +
> > +/* read the segment in chunks of 256 descriptors (4k) */
> > +while (nsgld > MAX_NSGLD) {
> > +nvme_addr_read(n, addr, segment, sizeof(segment));
> Is there any chance this will go outside the CMB?
> 

Yes, there certainly was a chance of that. This has been fixed in a
general way for both nvme_map_sgl and nvme_map_sgl_data.
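
For reference, the general fix is essentially a bounds check in
nvme_addr_read so the CMB fast path is only taken when the whole range
lies inside it (a sketch; otherwise it falls back to the normal PCI DMA
read):

    static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
    {
        hwaddr hi = addr + size - 1;

        if (n->cmbsz && nvme_addr_is_cmb(n, addr) && nvme_addr_is_cmb(n, hi)) {
            memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
            return;
        }

        pci_dma_read(&n->parent_obj, addr, buf, size);
    }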

> > +
> > +status = nvme_map_sgl_data(n, qsg, segment, MAX_NSGLD, &len, 
> > req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +nsgld -= MAX_NSGLD;
> > +addr += MAX_NSGLD * sizeof(NvmeSglDescriptor);
> > +}
> > +
> > +nvme_addr_read(n, addr, segment, nsgld * 
> > sizeof(NvmeSglDescriptor));
> > +
> > +sgl = segment[nsgld - 1];
> > +addr = le64_to_cpu(sgl.addr);
> > +
> > +/* an SGL is allowed to end with a Data Block in a regular Segment 
> > */
> > +if (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_DATA_BLOCK) {
> > +status = nvme_map_sgl_data(n, qsg, segment, nsgld, &len, req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +goto out;
> > +}
> > +
> > +/* do not map last descriptor */
> > +status = nvme_map_sgl_data(n, qsg, segment, nsgld - 1, &len, req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +/*
> > + * If the next segment is in the CMB, make sure that the sgl was
> > + * already located there.
> > + */
> > +addr_is_cmb = nvme_addr_is_cmb(n, addr);
> > +if ((sgl_in_cmb && !addr_is_cmb) || (!sgl_in_cmb && addr_is_cmb)) {
> > +status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +goto unmap;
> > +}
> > +}
> > +
> > +/*
> > + * If the segment did not end with a Data Block or a Segment 
> > descriptor, it
> > + * must be a Last Segment descriptor.
> > + */
> > +if (NVME_SGL_TYPE(sgl.type) != SGL_DESCR_TYPE_LAST_SEGMENT) {
> > +trace_nvme_err_invalid_sgl_descriptor(req->cid,
> > +NVME_SGL_TYPE(sgl.type));
> > +return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
> Shouldn't we handle a case here that requires calling unmap ?

Woops. Fixed.

> > +static uint16_t nvme_dma_read_sgl(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > +NvmeSglDescriptor sgl, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +QEMUSGList qsg;
> > +uint16_t err = NVME_SUCCESS;
> > +
> Very minor: Mixing convention: status vs error
> 

Fixed by proxy in another refactor.

> >
> > +#define NVME_CMD_FLAGS_FUSE(flags) (flags & 0x3)
> > +#define NVME_CMD_FLAGS_PSDT(flags) ((flags >> 6) & 0x3)
> Minor: This one is slightly misleading - as per the naming and its usage:
> the PSDT is a field name and as such does not imply using SGLs
> and it is being used to verify if given command is actually using
> SGLs.
> 

Ah, is this because I do

  if (NVME_CMD_FLAGS_PSDT(cmd->flags)) {

in the code? That is, just checks for it not being zero? The value of
the PRP or SGL for Data Transfer (PSDT) field *does* specify if the
command uses SGLs or not. 0x0: PRPs, 0x1: SGLs for data, 0x2: SGLs for
both data and metadata. Would you prefer the condition was more
explicit?


Thanks!
Klaus



Re: [PATCH v2 12/20] nvme: bump supported specification version to 1.3

2019-11-26 Thread Klaus Birkelund
On Mon, Nov 25, 2019 at 12:13:15PM +, Beata Michalska wrote:
> On Mon, 18 Nov 2019 at 09:48, Klaus Birkelund  wrote:
> >
> > On Tue, Nov 12, 2019 at 03:05:06PM +, Beata Michalska wrote:
> > > Hi Klaus,
> > >
> > > On Tue, 15 Oct 2019 at 11:52, Klaus Jensen  wrote:
> > > >
> > > > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > > > +{
> > > > +static const int len = 4096;
> > > > +
> > > > +struct ns_descr {
> > > > +uint8_t nidt;
> > > > +uint8_t nidl;
> > > > +uint8_t rsvd2[2];
> > > > +uint8_t nid[16];
> > > > +};
> > > > +
> > > > +uint32_t nsid = le32_to_cpu(c->nsid);
> > > > +uint64_t prp1 = le64_to_cpu(c->prp1);
> > > > +uint64_t prp2 = le64_to_cpu(c->prp2);
> > > > +
> > > > +struct ns_descr *list;
> > > > +uint16_t ret;
> > > > +
> > > > +trace_nvme_identify_ns_descr_list(nsid);
> > > > +
> > > > +if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > > > +trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
> > > > +return NVME_INVALID_NSID | NVME_DNR;
> > > > +}
> > > > +
> > > In theory this should abort the command for inactive NSIDs as well.
> > > But I guess this will come later on.
> > >
> >
> > At this point in the series, the device does not support multiple
> > namespaces anyway and num_namespaces is always 1. But this has also been
> > reported separately in relation to the patch adding multiple namespaces and
> > is fixed in v3.
> >
> > > > +list = g_malloc0(len);
> > > > +list->nidt = 0x3;
> > > > +list->nidl = 0x10;
> > > > +*(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> > > > +
> > > Might be worth to add some comment here -> as per the NGUID/EUI64 format.
> > > Also those are not specified currently in the namespace identity data 
> > > structure.
> > >
> >
> > I'll add a comment for why the Namespace UUID is set to this value here.
> > The NGUID/EUI64 fields are not set in the namespace identity data
> > structure as they are not required. See the descriptions of NGUID and
> > EUI64. Here for NGUID:
> >
> > "The controller shall specify a globally unique namespace identifier
> > in this field, the EUI64 field, or a Namespace UUID in the Namespace
> > Identification Descriptor..."
> >
> > Here, I chose to provide it in the Namespace Identification Descriptor
> > (by setting `list->nidt = 0x3`).
> >
> > > > +ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
> > > > +g_free(list);
> > > > +return ret;
> > > > +}
> > > > +
> > > >  static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> > > >  {
> > > >  NvmeIdentify *c = (NvmeIdentify *)cmd;
> > > > @@ -934,7 +978,9 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd 
> > > > *cmd)
> > > >  case 0x01:
> > > >  return nvme_identify_ctrl(n, c);
> > > >  case 0x02:
> > > > -return nvme_identify_nslist(n, c);
> > > > +return nvme_identify_ns_list(n, c);
> > > > +case 0x03:
> > > > +return nvme_identify_ns_descr_list(n, cmd);
> > > >  default:
> > > >  trace_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
> > > >  return NVME_INVALID_FIELD | NVME_DNR;
> > > > @@ -1101,6 +1147,14 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
> > > > NvmeCmd *cmd, NvmeRequest *req)
> > > >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> > > >  break;
> > > >  case NVME_NUMBER_OF_QUEUES:
> > > > +if (n->qs_created > 2) {
> > > > +return NVME_CMD_SEQ_ERROR | NVME_DNR;
> > > > +}
> > > > +
> > > I am not sure this is entirely correct as the spec says:
> > > "if any I/O Submission and/or Completion Queues (...)"
> > > so it might be enough to have a single queue created
> > > for this command to be valid.
> > > Also I think that the condition here is to make sure that the number
> > > of queues r

Re: [PATCH v2 15/20] nvme: add support for scatter gather lists

2019-11-26 Thread Klaus Birkelund
On Mon, Nov 25, 2019 at 02:10:37PM +, Beata Michalska wrote:
> On Mon, 25 Nov 2019 at 06:21, Klaus Birkelund  wrote:
> >
> > On Tue, Nov 12, 2019 at 03:25:18PM +, Beata Michalska wrote:
> > > Hi Klaus,
> > >
> > > On Tue, 15 Oct 2019 at 11:57, Klaus Jensen  wrote:
> > > >
> > > > +#define NVME_CMD_FLAGS_FUSE(flags) (flags & 0x3)
> > > > +#define NVME_CMD_FLAGS_PSDT(flags) ((flags >> 6) & 0x3)
> > > Minor: This one is slightly misleading - as per the naming and its usage:
> > > the PSDT is a field name and as such does not imply using SGLs
> > > and it is being used to verify if given command is actually using
> > > SGLs.
> > >
> >
> > Ah, is this because I do
> >
> >   if (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
> >
> > in the code? That is, just checks for it not being zero? The value of
> > the PRP or SGL for Data Transfer (PSDT) field *does* specify if the
> > command uses SGLs or not. 0x0: PRPs, 0x1: SGLs for data, 0x2: SGLs for
> > both data and metadata. Would you prefer the condition was more
> > explicit?
> >
> Yeah, it is just not obvious (at least to me) without referencing the spec
> that a non-zero value implies SGL usage. Guess a comment would be helpful
> but that is not major.
> 
 
Nah, that's a good point. I have changed it to use a switch on the value.
This technically also fixes a bug because the above would accept 0x3 as
a valid value and interpret it as SGL use.
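
The dispatch now looks roughly like this (the helper signatures follow
the rest of the series but are assumptions for this sketch):

    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
    case 0x0:
        /* PRP Entry 1 / PRP Entry 2 */
        return nvme_map_prp(n, qsg, prp1, prp2, len, req);
    case 0x1:
        /* SGL for data; metadata pointer is a contiguous buffer */
    case 0x2:
        /* SGLs for both data and metadata */
        return nvme_map_sgl(n, qsg, sgl, len, req);
    default:
        /* 0x3 is reserved */
        return NVME_INVALID_FIELD | NVME_DNR;
    }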


Klaus



Re: [Qemu-block] [PATCH] nvme: add Get/Set Feature Timestamp support

2019-05-13 Thread Klaus Birkelund
On Fri, Apr 05, 2019 at 03:41:17PM -0600, Kenneth Heitke wrote:
> Signed-off-by: Kenneth Heitke 
> ---
>  hw/block/nvme.c   | 120 +-
>  hw/block/nvme.h   |   3 ++
>  hw/block/trace-events |   2 +
>  include/block/nvme.h  |   2 +
>  4 files changed, 125 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 7caf92532a..e775e89299 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -219,6 +219,30 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, 
> QEMUIOVector *iov, uint64_t prp1,
>  return NVME_INVALID_FIELD | NVME_DNR;
>  }
>  
> +static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> +   uint64_t prp1, uint64_t prp2)
> +{
> +QEMUSGList qsg;
> +QEMUIOVector iov;
> +uint16_t status = NVME_SUCCESS;
> +
> +if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +if (qsg.nsg > 0) {
> +if (dma_buf_write(ptr, len, &qsg)) {
> +status = NVME_INVALID_FIELD | NVME_DNR;
> +}
> +qemu_sglist_destroy(&qsg);
> +} else {
> +if (qemu_iovec_from_buf(&iov, 0, ptr, len) != len) {

This should be `qemu_iovec_to_buf`.

> +status = NVME_INVALID_FIELD | NVME_DNR;
> +}
> +qemu_iovec_destroy(&iov);
> +}
> +return status;
> +}
> +
>  static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
>  uint64_t prp1, uint64_t prp2)
>  {
> @@ -678,7 +702,6 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
> NvmeIdentify *c)
>  return ret;
>  }
>  
> -
>  static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
>  {
>  NvmeIdentify *c = (NvmeIdentify *)cmd;
> @@ -696,6 +719,63 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
>  }
>  }
>  
> +static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
> +{
> +n->host_timestamp = ts;
> +n->timestamp_set_qemu_clock_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +}
> +
> +static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
> +{
> +uint64_t current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +uint64_t elapsed_time = current_time - n->timestamp_set_qemu_clock_ms;
> +
> +union nvme_timestamp {
> +struct {
> +uint64_t timestamp:48;
> +uint64_t sync:1;
> +uint64_t origin:3;
> +uint64_t rsvd1:12;
> +};
> +uint64_t all;
> +};
> +
> +union nvme_timestamp ts;
> +ts.all = 0;
> +
> +/*
> + * If the sum of the Timestamp value set by the host and the elapsed
> + * time exceeds 2^48, the value returned should be reduced modulo 2^48.
> + */
> > +ts.timestamp = (n->host_timestamp + elapsed_time) & 0xffffffffffff;
> +
> +/* If the host timestamp is non-zero, set the timestamp origin */
> +ts.origin = n->host_timestamp ? 0x01 : 0x00;
> +
> +return ts.all;
> +}
> +
> +static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
> +{
> +uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +
> +uint64_t timestamp = nvme_get_timestamp(n);
> +
> +if (!(n->oncs & NVME_ONCS_TIMESTAMP)) {

Any particular reason we want to sometimes not support this? Could we
just do without this check?

> +trace_nvme_err_invalid_getfeat(dw10);
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +trace_nvme_getfeat_timestamp(timestamp);
> +
> +timestamp = cpu_to_le64(timestamp);
> +
> +return nvme_dma_read_prp(n, (uint8_t *)&timestamp,
> + sizeof(timestamp), prp1, prp2);
> +}
> +
>  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
>  uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> @@ -710,6 +790,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
> *cmd, NvmeRequest *req)
>  result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 
> 16));
>  trace_nvme_getfeat_numq(result);
>  break;
> +case NVME_TIMESTAMP:
> +return nvme_get_feature_timestamp(n, cmd);
> +break;
>  default:
>  trace_nvme_err_invalid_getfeat(dw10);
>  return NVME_INVALID_FIELD | NVME_DNR;
> @@ -719,6 +802,31 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
> *cmd, NvmeRequest *req)
>  return NVME_SUCCESS;
>  }
>  
> +static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
> +{
> +uint16_t ret;
> +uint64_t timestamp;
> +uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +
> +if (!(n->oncs & NVME_ONCS_TIMESTAMP)) {

Any particular reason we want to sometimes not support this? Could we
just do without this check?

> +trace_nvme_err_invalid_setfeat(dw10);
> +r

Re: [Qemu-block] [PATCH] nvme: add Get/Set Feature Timestamp support

2019-05-16 Thread Klaus Birkelund
Hi Kenneth,

On Thu, May 16, 2019 at 05:24:47PM -0600, Heitke, Kenneth wrote:
> Hi Klaus, thank you for your review. I have one comment inline
> 
> On 5/14/2019 12:02 AM, Klaus Birkelund wrote:
> > On Fri, Apr 05, 2019 at 03:41:17PM -0600, Kenneth Heitke wrote:
> > > Signed-off-by: Kenneth Heitke 
> > > ---
> > >   hw/block/nvme.c   | 120 +-
> > >   hw/block/nvme.h   |   3 ++
> > >   hw/block/trace-events |   2 +
> > >   include/block/nvme.h  |   2 +
> > >   4 files changed, 125 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index 7caf92532a..e775e89299 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -219,6 +219,30 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, 
> > > QEMUIOVector *iov, uint64_t prp1,
> > >   return NVME_INVALID_FIELD | NVME_DNR;
> > >   }
> > > +static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t 
> > > len,
> > > +   uint64_t prp1, uint64_t prp2)
> > > +{
> > > +QEMUSGList qsg;
> > > +QEMUIOVector iov;
> > > +uint16_t status = NVME_SUCCESS;
> > > +
> > > +if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> > > +return NVME_INVALID_FIELD | NVME_DNR;
> > > +}
> > > +if (qsg.nsg > 0) {
> > > +if (dma_buf_write(ptr, len, &qsg)) {
> > > +status = NVME_INVALID_FIELD | NVME_DNR;
> > > +}
> > > +qemu_sglist_destroy(&qsg);
> > > +} else {
> > > +if (qemu_iovec_from_buf(&iov, 0, ptr, len) != len) {
> > 
> > This should be `qemu_iovec_to_buf`.
> > 
> 
> This function is transferring data from the "host" to the device so I
> believe I am using the correct function.
> 

Exactly, but this means that you need to populate `ptr` with data
described by the prps, hence dma_buf_*write* and qemu_iovec_*to*_buf. In
this case `ptr` is set to the address of the uint64_t timestamp, and
that is what we need to write to.
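
In other words, the CMB branch of nvme_dma_write_prp should copy *into*
ptr, roughly (a sketch of the corrected branch):

    } else {
        /* host -> device: fill ptr with the data described by the PRPs */
        if (qemu_iovec_to_buf(&iov, 0, ptr, len) != len) {
            status = NVME_INVALID_FIELD | NVME_DNR;
        }
        qemu_iovec_destroy(&iov);
    }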



Re: [Qemu-block] [PATCH] nvme: add Get/Set Feature Timestamp support

2019-05-16 Thread Klaus Birkelund
On Fri, May 17, 2019 at 07:35:04AM +0200, Klaus Birkelund wrote:
> Hi Kenneth,
> 
> On Thu, May 16, 2019 at 05:24:47PM -0600, Heitke, Kenneth wrote:
> > Hi Klaus, thank you for your review. I have one comment inline
> > 
> > On 5/14/2019 12:02 AM, Klaus Birkelund wrote:
> > > On Fri, Apr 05, 2019 at 03:41:17PM -0600, Kenneth Heitke wrote:
> > > > Signed-off-by: Kenneth Heitke 
> > > > ---
> > > >   hw/block/nvme.c   | 120 +-
> > > >   hw/block/nvme.h   |   3 ++
> > > >   hw/block/trace-events |   2 +
> > > >   include/block/nvme.h  |   2 +
> > > >   4 files changed, 125 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > > index 7caf92532a..e775e89299 100644
> > > > --- a/hw/block/nvme.c
> > > > +++ b/hw/block/nvme.c
> > > > @@ -219,6 +219,30 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, 
> > > > QEMUIOVector *iov, uint64_t prp1,
> > > >   return NVME_INVALID_FIELD | NVME_DNR;
> > > >   }
> > > > +static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t 
> > > > len,
> > > > +   uint64_t prp1, uint64_t prp2)
> > > > +{
> > > > +QEMUSGList qsg;
> > > > +QEMUIOVector iov;
> > > > +uint16_t status = NVME_SUCCESS;
> > > > +
> > > > +if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> > > > +return NVME_INVALID_FIELD | NVME_DNR;
> > > > +}
> > > > +if (qsg.nsg > 0) {
> > > > +if (dma_buf_write(ptr, len, &qsg)) {
> > > > +status = NVME_INVALID_FIELD | NVME_DNR;
> > > > +}
> > > > +qemu_sglist_destroy(&qsg);
> > > > +} else {
> > > > +if (qemu_iovec_from_buf(&iov, 0, ptr, len) != len) {
> > > 
> > > This should be `qemu_iovec_to_buf`.
> > > 
> > 
> > This function is transferring data from the "host" to the device so I
> > believe I am using the correct function.
> > 
> 
> Exactly, but this means that you need to populate `ptr` with data
> described by the prps, hence dma_buf_*write* and qemu_iovec_*to*_buf. In
> this case `ptr` is set to the address of the uint64_t timestamp, and
> that is what we need to write to.
> 

I was going to argue with the fact that nvme_dma_read_prp uses
qemu_iovec_from_buf. But it uses _to_buf which as far as I can tell is
also wrong.



Re: [Qemu-block] [PATCH] nvme: add Get/Set Feature Timestamp support

2019-05-18 Thread Klaus Birkelund
On Fri, May 17, 2019 at 07:49:18PM -0600, Heitke, Kenneth wrote:
> > > > > > +if (qemu_iovec_from_buf(&iov, 0, ptr, len) != len) {
> > > > > 
> > > > > This should be `qemu_iovec_to_buf`.
> > > > > 
> > > > 
> > > > This function is transferring data from the "host" to the device so I
> > > > believe I am using the correct function.
> > > > 
> > > 
> > > Exactly, but this means that you need to populate `ptr` with data
> > > described by the prps, hence dma_buf_*write* and qemu_iovec_*to*_buf. In
> > > this case `ptr` is set to the address of the uint64_t timestamp, and
> > > that is what we need to write to.
> > > 
> > 
> > I was going to argue with the fact that nvme_dma_read_prp uses
> > qemu_iovec_from_buf. But it uses _to_buf which as far as I can tell is
> > also wrong.
> > 
> 
> Okay, I'm onboard. You're correct. I'll update my patch and re-submit. I can
> also submit a patch to fix nvme_dma_read_prp() unless you or someone else
> wants to.
> 

Hi Kenneth,

The `nvme_dma_read_prp` case is actually already fixed in one of the
patches I sent yesterday ("nvme: simplify PRP mappings"), but I'll
submit it as a separate patch.

Cheers



Re: [Qemu-block] [PATCH 0/8] nvme: v1.3, sgls, metadata and new 'ocssd' device

2019-05-20 Thread Klaus Birkelund
On Mon, May 20, 2019 at 03:01:24PM +0200, Kevin Wolf wrote:
> Am 17.05.2019 um 10:42 hat Klaus Birkelund Jensen geschrieben:
> > Hi,
> > 
> > This series of patches contains a number of refactorings to the emulated
> > nvme device, adds additional features, such as support for metadata and
> > scatter gather lists, and bumps the supported NVMe version to 1.3.
> > Lastly, it contains a new 'ocssd' device.
> > 
> > The motivation for the first seven patches is to set everything up for
> > the final patch that adds a new 'ocssd' device and associated block
> > driver that implements the OpenChannel 2.0 specification[1]. Many of us
> > in the OpenChannel comunity have used a qemu fork[2] for emulation of
> > OpenChannel devices. The fork is itself based on Keith's qemu-nvme
> > tree[3] and we recently merged mainline qemu into it, but the result is
> > still a "hybrid" nvme device that supports both conventional nvme and
> > the OCSSD 2.0 spec through a 'dialect' mechanism. Merging instead of
> > rebasing also created a pretty messy commit history and my efforts to
> > try and rebase our work onto mainline were getting hairy to say the
> > least. And I was never really happy with the dialect approach anyway.
> > 
> > I have instead prepared this series of fresh patches that incrementally
> > adds additional features to the nvme device to bring it into shape for
> > finally introducing a new (and separate) 'ocssd' device that emulates an
> > OpenChannel 2.0 device by reusing core functionality from the nvme
> > device. Providing a separate ocssd device ensures that no ocssd specific
> > stuff creeps into the nvme device.
> > 
> > The ocssd device is backed by a new 'ocssd' block driver that holds
> > internal meta data and keeps state permanent across power cycles. In the
> > future I think we could use the same approach for the nvme device to
> > keep internal metadata such as utilization and deallocated blocks.
> 
> A backend driver that is specific for a guest device model (i.e. the
> device model requires this driver, and the backend is useless without
> the device) sounds like a very questionable design.
> 
> Metadata like OcssdFormatHeader that is considered part of the image
> data, which means that the _actual_ image content without metadata isn't
> directly accessible any more feels like a bad idea, too. Simple things
> like what a resize operation means (change only the actual disk size as
> usual, or is the new size disk + metadata?) become confusing. Attaching
> an image to a different device becomes impossible.
> 
> The block format driver doesn't seem to actually add much functionality
> to a specially crafted raw image: It provides a convenient way to create
> such special images and it dumps some values in 'qemu-img info', but the
> actual interpretation of the data is left to the device model.
> 
> Looking at the options it does provide, my impression is that these
> should really be qdev properties, and the place to store them
> persistently is something like the libvirt XML. The device doesn't
> change any of the values, so there is nothing that QEMU actually needs
> to store. What you invented is a one-off way to pass a config file to a
> device, but only for one specific device type.
> 
> I think this needs to use a much more standard approach to be mergable.
> 
> Markus (CCed) as the maintainer for the configuration mechanisms may
> have an opinion on this, too.

Hi Kevin,

Thank you for going through my motivations. I see what you mean. And
yes, the main reason I did it like that was for the convenience of being
able to `qemu-img create` the image. I'll reconsider how to do this.

> 
> > For now, the nvme device does not support the Deallocated and
> > Unwritten Logical Block Error (DULBE) feature or the Data Set
> > Management command as this would require such support.
> 
> Doesn't bdrv_co_block_status() provide all the information you need for
> that?
> 

That does look useful. I'll look into it.

Thanks!



Re: [Qemu-block] [Qemu-devel] [PATCH 8/8] nvme: add an OpenChannel 2.0 NVMe device (ocssd)

2019-05-20 Thread Klaus Birkelund
On Mon, May 20, 2019 at 11:45:00AM -0500, Eric Blake wrote:
> On 5/17/19 3:42 AM, Klaus Birkelund Jensen wrote:
> > This adds a new 'ocssd' block device that emulates an OpenChannel 2.0
> > device. The device is backed by a new 'ocssd' block backend that is
> > based on the raw format driver but includes a header that holds the
> > device geometry and write data requirements. This new block backend is
> > special in that the size is not specified explicitly but in terms of
> > sector size, number of chunks, number of parallel units, etc. This
> > called for the addition of the `no_size_required` field in `struct
> > BlockDriver` to not fail image creation when the size parameter is
> > missing.
> > 
> > The ocssd device is an individual device but shares a lot of code with
> > the nvme device. Thus, some core functionality of nvme/nvme.c has been
> > exported for use by nvme/ocssd.c.
> > 
> > Thank you to the following people for their contributions to the
> > original qemu-nvme (github.com/OpenChannelSSD/qemu-nvme) implementation.
> > 
> >   Matias Bjørling 
> >   Javier González 
> >   Simon Andreas Frimann Lund 
> >   Hans Holmberg 
> >   Jesper Devantier 
> >   Young Tack Jin 
> > 
> > Signed-off-by: Klaus Birkelund Jensen 
> > ---
> >  MAINTAINERS |   14 +-
> >  Makefile.objs   |1 +
> >  block.c |2 +-
> >  block/Makefile.objs |2 +-
> >  block/nvme.c|2 +-
> >  block/ocssd.c   |  690 
> >  hw/block/Makefile.objs  |2 +-
> >  hw/block/{ => nvme}/nvme.c  |  192 ++-
> >  hw/block/nvme/ocssd.c   | 2647 +++
> >  hw/block/nvme/ocssd.h   |  140 ++
> >  hw/block/nvme/trace-events  |  136 ++
> >  hw/block/trace-events   |  109 --
> >  include/block/block_int.h   |3 +
> >  include/block/nvme.h|   12 +-
> >  include/block/ocssd.h   |  231 +++
> >  {hw => include/hw}/block/nvme.h |   61 +
> >  include/hw/pci/pci_ids.h|2 +
> >  qapi/block-core.json|   47 +-
> >  18 files changed, 4121 insertions(+), 172 deletions(-)
> >  create mode 100644 block/ocssd.c
> >  rename hw/block/{ => nvme}/nvme.c (94%)
> >  create mode 100644 hw/block/nvme/ocssd.c
> >  create mode 100644 hw/block/nvme/ocssd.h
> >  create mode 100644 hw/block/nvme/trace-events
> >  create mode 100644 include/block/ocssd.h
> >  rename {hw => include/hw}/block/nvme.h (63%)
> 
> Feels big; are you sure this can't be split into smaller pieces to ease
> review?
> 

I know, but I'm not sure how to meaningfully split it up. Would you
prefer that I move the files in a separate commit? The changes to
nvme.{c,h} mostly amount to removing static from functions and adding
prototypes to the header file so the ocssd device can use them. The
commit is meant to be restricted to just adding the ocssd device; any
features and additions required in the nvme device are added in previous
commits.

> I'm focusing just on the qapi portions:
> 

Thank you for the review of that, but it looks like this will all be
dropped from a v2 (see mail from Kevin), because it's simply bad
design to have the driver and device depend so closely on each other.


Thanks,
Klaus



Re: [Qemu-block] [PATCH] nvme: fix copy direction in DMA reads going to CMB

2019-05-20 Thread Klaus Birkelund
On Mon, May 20, 2019 at 09:05:57AM -0600, Keith Busch wrote:
> On Sat, May 18, 2019 at 09:39:05AM +0200, Klaus Birkelund Jensen wrote:
> > `nvme_dma_read_prp` erronously used `qemu_iovec_*to*_buf` instead of
> > `qemu_iovec_*from*_buf` when the request involved the controller memory
> > buffer.
> > 
> > Signed-off-by: Klaus Birkelund Jensen 
> 
> I was wondering how this mistake got by for so long, and it looks like
> the only paths here require an admin command with dev->host transfer
> to CMB. That's just not done in any host implementation I'm aware of
> since it'd make it more difficult to use for no particular gain AFAICS,
> so I'd be curious to hear if you have a legit implementation doing this.
> 

I'm just trying to get the device to be as compliant as possible, but I
don't know why you would have any reason to direct, say, a Get Features
transfer to the CMB.
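
For context, the gist of the fix (illustration only, not the exact hunk):
nvme_dma_read_prp() moves data from the device to the host, and when the
PRPs resolve to the CMB the transfer is a plain copy into the iovec that
maps the CMB, so the data has to flow *from* the local buffer *into* the
iovec:

    if (qsg.nsg == 0) {
        /* PRPs resolved to the CMB; copy from the local buffer into it */
        if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
            status = NVME_INVALID_FIELD | NVME_DNR;
        }
        qemu_iovec_destroy(&iov);
    }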



Re: [Qemu-block] [PATCH 0/8] nvme: v1.3, sgls, metadata and new 'ocssd' device

2019-05-21 Thread Klaus Birkelund
On Tue, May 21, 2019 at 10:01:15AM +0200, Kevin Wolf wrote:
> Am 20.05.2019 um 21:34 hat Klaus Birkelund geschrieben:
> > On Mon, May 20, 2019 at 03:01:24PM +0200, Kevin Wolf wrote:
> > > Am 17.05.2019 um 10:42 hat Klaus Birkelund Jensen geschrieben:
> > > > Hi,
> > > > 
> > > > This series of patches contains a number of refactorings to the emulated
> > > > nvme device, adds additional features, such as support for metadata and
> > > > scatter gather lists, and bumps the supported NVMe version to 1.3.
> > > > Lastly, it contains a new 'ocssd' device.
> > > > 
> > > > The motivation for the first seven patches is to set everything up for
> > > > the final patch that adds a new 'ocssd' device and associated block
> > > > driver that implements the OpenChannel 2.0 specification[1]. Many of us
> > > > in the OpenChannel comunity have used a qemu fork[2] for emulation of
> > > > OpenChannel devices. The fork is itself based on Keith's qemu-nvme
> > > > tree[3] and we recently merged mainline qemu into it, but the result is
> > > > still a "hybrid" nvme device that supports both conventional nvme and
> > > > the OCSSD 2.0 spec through a 'dialect' mechanism. Merging instead of
> > > > rebasing also created a pretty messy commit history and my efforts to
> > > > try and rebase our work onto mainline was getting hairy to say the
> > > > least. And I was never really happy with the dialect approach anyway.
> > > > 
> > > > I have instead prepared this series of fresh patches that incrementally
> > > > adds additional features to the nvme device to bring it into shape for
> > > > finally introducing a new (and separate) 'ocssd' device that emulates an
> > > > OpenChannel 2.0 device by reusing core functionality from the nvme
> > > > device. Providing a separate ocssd device ensures that no ocssd specific
> > > > stuff creeps into the nvme device.
> > > > 
> > > > The ocssd device is backed by a new 'ocssd' block driver that holds
> > > > internal meta data and keeps state permanent across power cycles. In the
> > > > future I think we could use the same approach for the nvme device to
> > > > keep internal metadata such as utilization and deallocated blocks.
> > > 
> > > A backend driver that is specific for a guest device model (i.e. the
> > > device model requires this driver, and the backend is useless without
> > > the device) sounds like a very questionable design.
> > > 
> > > Metadata like OcssdFormatHeader that is considered part of the image
> > > data, which means that the _actual_ image content without metadata isn't
> > > directly accessible any more feels like a bad idea, too. Simple things
> > > like what a resize operation means (change only the actual disk size as
> > > usual, or is the new size disk + metadata?) become confusing. Attaching
> > > an image to a different device becomes impossible.
> > > 
> > > The block format driver doesn't seem to actually add much functionality
> > > to a specially crafted raw image: It provides a convenient way to create
> > > such special images and it dumps some values in 'qemu-img info', but the
> > > actual interpretation of the data is left to the device model.
> > > 
> > > Looking at the options it does provide, my impression is that these
> > > should really be qdev properties, and the place to store them
> > > persistently is something like the libvirt XML. The device doesn't
> > > change any of the values, so there is nothing that QEMU actually needs
> > > to store. What you invented is a one-off way to pass a config file to a
> > > device, but only for one specific device type.
> > > 
> > > I think this needs to use a much more standard approach to be mergable.
> > > 
> > > Markus (CCed) as the maintainer for the configuration mechanisms may
> > > have an opinion on this, too.
> > > 
> > > > For now, the nvme device does not support the Deallocated and
> > > > Unwritten Logical Block Error (DULBE) feature or the Data Set
> > > > Management command as this would require such support.
> > > 
> > > Doesn't bdrv_co_block_status() provide all the information you need for
> > > that?
> > 
> > Is it wrong for a device to store such "internal" metadata on

Re: [Qemu-block] [PATCH 5/8] nvme: add support for metadata

2019-05-21 Thread Klaus Birkelund
On Fri, May 17, 2019 at 10:42:31AM +0200, Klaus Birkelund Jensen wrote:
> The new `ms` parameter may be used to indicate the number of metadata
> bytes provided per LBA.
> 
> Signed-off-by: Klaus Birkelund Jensen 
> ---
>  hw/block/nvme.c | 31 +--
>  hw/block/nvme.h | 11 ++-
>  2 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index c514f93f3867..675967a596d1 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -33,6 +33,8 @@
>   *   num_ns=  : Namespaces to make out of the backing storage,
>   *   Default:1
>   *   num_queues=  : Number of possible IO Queues, Default:64
> + *   ms=  : Number of metadata bytes provided per LBA,
> + *   Default:0
>   *   cmb_size_mb= : Size of CMB in MBs, Default:0
>   *
>   * Parameters will be verified against conflicting capabilities and 
> attributes
> @@ -386,6 +388,8 @@ static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, 
> NvmeRequest *req)
>  
>  uint32_t unit_len = nvme_ns_lbads_bytes(ns);
>  uint32_t len = req->nlb * unit_len;
> +uint32_t meta_unit_len = nvme_ns_ms(ns);
> +uint32_t meta_len = req->nlb * meta_unit_len;
>  uint64_t prp1 = le64_to_cpu(cmd->prp1);
>  uint64_t prp2 = le64_to_cpu(cmd->prp2);
>  
> @@ -399,6 +403,19 @@ static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, 
> NvmeRequest *req)
>  return err;
>  }
>  
> +qsg.nsg = 0;
> +qsg.size = 0;
> +
> +if (cmd->mptr && n->params.ms) {
> +qemu_sglist_add(&qsg, le64_to_cpu(cmd->mptr), meta_len);
> +
> +err = nvme_blk_setup(n, ns, &qsg, ns->blk_offset_md, meta_unit_len,
> +req);
> +if (err) {
> +return err;
> +}
> +}
> +
>  qemu_sglist_destroy(&qsg);
>  
>  return NVME_SUCCESS;
> @@ -1902,6 +1919,11 @@ static int nvme_check_constraints(NvmeCtrl *n, Error 
> **errp)
>  return 1;
>  }
>  
> +if (params->ms && !is_power_of_2(params->ms)) {
> +error_setg(errp, "nvme: invalid metadata configuration");
> +return 1;
> +}
> +
>  return 0;
>  }
>  
> @@ -2066,17 +2088,20 @@ static void nvme_init_ctrl(NvmeCtrl *n)
>  
>  static uint64_t nvme_ns_calc_blks(NvmeCtrl *n, NvmeNamespace *ns)
>  {
> -return n->ns_size / nvme_ns_lbads_bytes(ns);
> +return n->ns_size / (nvme_ns_lbads_bytes(ns) + nvme_ns_ms(ns));
>  }
>  
>  static void nvme_ns_init_identify(NvmeCtrl *n, NvmeIdNs *id_ns)
>  {
> +NvmeParams *params = &n->params;
> +
>  id_ns->nlbaf = 0;
>  id_ns->flbas = 0;
> -id_ns->mc = 0;
> +id_ns->mc = params->ms ? 0x2 : 0;
>  id_ns->dpc = 0;
>  id_ns->dps = 0;
>  id_ns->lbaf[0].lbads = BDRV_SECTOR_BITS;
> +id_ns->lbaf[0].ms = params->ms;
>  }
>  
>  static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> @@ -2086,6 +2111,8 @@ static int nvme_init_namespace(NvmeCtrl *n, 
> NvmeNamespace *ns, Error **errp)
>  nvme_ns_init_identify(n, id_ns);
>  
>  ns->ns_blks = nvme_ns_calc_blks(n, ns);
> +ns->blk_offset_md = ns->blk_offset + nvme_ns_lbads_bytes(ns) * 
> ns->ns_blks;
> +
>  id_ns->nuse = id_ns->ncap = id_ns->nsze = cpu_to_le64(ns->ns_blks);
>  
>  return 0;
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 711ca249eac5..81ee0c5173d5 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -8,13 +8,15 @@
>  DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
>  DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
>  DEFINE_PROP_UINT32("num_ns", _state, _props.num_ns, 1), \
> -DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7)
> +DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7), \
> +DEFINE_PROP_UINT8("ms", _state, _props.ms, 0)
>  
>  typedef struct NvmeParams {
>  char *serial;
>  uint32_t num_queues;
>  uint32_t num_ns;
>  uint8_t  mdts;
> +uint8_t  ms;
>  uint32_t cmb_size_mb;
>  } NvmeParams;
>  
> @@ -91,6 +93,7 @@ typedef struct NvmeNamespace {
>  uint32_tid;
>  uint64_tns_blks;
>  uint64_tblk_offset;
> +uint64_tblk_offset_md;
>  } NvmeNamespace;
>  
>  #define TYPE_NVME "nvme"
> @@ -154,4 +157,10 @@ static inline size_t nvme_ns_lbads_bytes(NvmeNamespace 
> *ns)
>  

Re: [Qemu-block] [PATCH v2] nvme: add Get/Set Feature Timestamp support

2019-05-27 Thread Klaus Birkelund
imestamp), prp1, prp2);
> +if (ret != NVME_SUCCESS) {
> +return ret;
> +}
> +
> +nvme_set_timestamp(n, timestamp);
> +
> +return NVME_SUCCESS;
> +}
> +
>  static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
>  uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> @@ -735,6 +830,11 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> *cmd, NvmeRequest *req)
>  req->cqe.result =
>  cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
>  break;
> +
> +case NVME_TIMESTAMP:
> +return nvme_set_feature_timestamp(n, cmd);
> +break;
> +
>  default:
>  trace_nvme_err_invalid_setfeat(dw10);
>  return NVME_INVALID_FIELD | NVME_DNR;
> @@ -907,6 +1007,8 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>  nvme_init_sq(&n->admin_sq, n, n->bar.asq, 0, 0,
>  NVME_AQA_ASQS(n->bar.aqa) + 1);
>  
> +nvme_set_timestamp(n, 0ULL);
> +
>  return 0;
>  }
>  
> @@ -1270,7 +1372,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> **errp)
>  id->sqes = (0x6 << 4) | 0x6;
>  id->cqes = (0x4 << 4) | 0x4;
>  id->nn = cpu_to_le32(n->num_namespaces);
> -id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS);
> +id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
>  id->psd[0].mp = cpu_to_le16(0x9c4);
>  id->psd[0].enlat = cpu_to_le32(0x10);
>  id->psd[0].exlat = cpu_to_le32(0x4);
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 56c9d4b4b1..d7277e72b7 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -69,6 +69,7 @@ typedef struct NvmeCtrl {
>  uint16_tmax_prp_ents;
>  uint16_tcqe_size;
>  uint16_tsqe_size;
> +uint16_toncs;

Looks like this unused member snuck its way into the patch. But I see no
harm in it being there.

>  uint32_treg_size;
>  uint32_tnum_namespaces;
>  uint32_tnum_queues;
> @@ -79,6 +80,8 @@ typedef struct NvmeCtrl {
>  uint32_tcmbloc;
>  uint8_t *cmbuf;
>  uint64_tirq_status;
> +uint64_thost_timestamp; /* Timestamp sent by the 
> host */
> +uint64_ttimestamp_set_qemu_clock_ms;/* QEMU clock time */
>  
>  char*serial;
>  NvmeNamespace   *namespaces;
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index b92039a573..97a17838ed 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -46,6 +46,8 @@ nvme_identify_nslist(uint16_t ns) "identify namespace list, 
> nsid=%"PRIu16""
>  nvme_getfeat_vwcache(const char* result) "get feature volatile write cache, 
> result=%s"
>  nvme_getfeat_numq(int result) "get feature number of queues, result=%d"
>  nvme_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested 
> cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> +nvme_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
> +nvme_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
>  nvme_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt 
> mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
>  nvme_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt 
> mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
>  nvme_mmio_cfg(uint64_t data) "wrote MMIO, config controller 
> config=0x%"PRIx64""
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 849a6f3fa3..3ec8efcc43 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -581,6 +581,7 @@ enum NvmeIdCtrlOncs {
>  NVME_ONCS_WRITE_ZEROS   = 1 << 3,
>  NVME_ONCS_FEATURES  = 1 << 4,
>  NVME_ONCS_RESRVATIONS   = 1 << 5,
> +NVME_ONCS_TIMESTAMP = 1 << 6,
>  };
>  
>  #define NVME_CTRL_SQES_MIN(sqes) ((sqes) & 0xf)
> @@ -622,6 +623,7 @@ enum NvmeFeatureIds {
>  NVME_INTERRUPT_VECTOR_CONF  = 0x9,
>  NVME_WRITE_ATOMICITY= 0xa,
>  NVME_ASYNCHRONOUS_EVENT_CONF= 0xb,
> +NVME_TIMESTAMP  = 0xe,
>  NVME_SOFTWARE_PROGRESS_MARKER   = 0x80
>  };
>  
> -- 
> 2.17.1
> 

Reviewed-by: Klaus Birkelund Jensen 



Re: [Qemu-block] [PATCH v2] nvme: add Get/Set Feature Timestamp support

2019-06-04 Thread Klaus Birkelund
On Mon, Jun 03, 2019 at 09:30:53AM -0600, Heitke, Kenneth wrote:
> 
> 
> On 6/3/2019 5:14 AM, Kevin Wolf wrote:
> > Am 28.05.2019 um 08:18 hat Klaus Birkelund geschrieben:
> > > On Mon, May 20, 2019 at 11:40:30AM -0600, Kenneth Heitke wrote:
> > > > Signed-off-by: Kenneth Heitke 
> > 
> > > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > > > index 56c9d4b4b1..d7277e72b7 100644
> > > > --- a/hw/block/nvme.h
> > > > +++ b/hw/block/nvme.h
> > > > @@ -69,6 +69,7 @@ typedef struct NvmeCtrl {
> > > >   uint16_tmax_prp_ents;
> > > >   uint16_tcqe_size;
> > > >   uint16_tsqe_size;
> > > > +uint16_toncs;
> > > 
> > > Looks like this unused member snuck its way into the patch. But I see no
> > > harm in it being there.
> > 
> > Good catch. I'll just remove it again from my branch.
> > 
> > > > +static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
> > > > +{
> > > > +trace_nvme_setfeat_timestamp(ts);
> > > > +
> > > > +n->host_timestamp = le64_to_cpu(ts);
> > > > +n->timestamp_set_qemu_clock_ms = 
> > > > qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > > > +}
> > > > +
> > > > +static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
> > > > +{
> > > > +uint64_t current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > 
> > Here I wonder why we use QEMU_CLOCK_REALTIME in a device emulation.
> > Wouldn't QEMU_CLOCK_VIRTUAL make more sense?
> > 
> 
> QEMU_CLOCK_VIRTUAL probably would make more sense. When I was reading
> through the differences I wasn't really sure what to pick. iven that this is
> the time within the device's context, the virtual time seems more correct.
> 
 
I thought about this too when I reviewed, but came to the conclusion
that REALTIME was correct. The timestamp is basically a value that the
host stores in the controller. When the host uses Get Features to get
the current time, it would expect it to match the progression of its
own wall clock, right? If I understand REALTIME vs VIRTUAL correctly,
using VIRTUAL it would go way out of sync.

Klaus



Re: [Qemu-block] [PATCH v2] nvme: add Get/Set Feature Timestamp support

2019-06-04 Thread Klaus Birkelund
On Tue, Jun 04, 2019 at 10:46:45AM +0200, Kevin Wolf wrote:
> Am 04.06.2019 um 10:28 hat Klaus Birkelund geschrieben:
> > On Mon, Jun 03, 2019 at 09:30:53AM -0600, Heitke, Kenneth wrote:
> > > 
> > > 
> > > On 6/3/2019 5:14 AM, Kevin Wolf wrote:
> > > > Am 28.05.2019 um 08:18 hat Klaus Birkelund geschrieben:
> > > > > On Mon, May 20, 2019 at 11:40:30AM -0600, Kenneth Heitke wrote:
> > > > > > Signed-off-by: Kenneth Heitke 
> > > > 
> > > > > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > > > > > index 56c9d4b4b1..d7277e72b7 100644
> > > > > > --- a/hw/block/nvme.h
> > > > > > +++ b/hw/block/nvme.h
> > > > > > @@ -69,6 +69,7 @@ typedef struct NvmeCtrl {
> > > > > >   uint16_tmax_prp_ents;
> > > > > >   uint16_tcqe_size;
> > > > > >   uint16_tsqe_size;
> > > > > > +uint16_toncs;
> > > > > 
> > > > > Looks like this unused member snuck its way into the patch. But I see 
> > > > > no
> > > > > harm in it being there.
> > > > 
> > > > Good catch. I'll just remove it again from my branch.
> > > > 
> > > > > > +static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
> > > > > > +{
> > > > > > +trace_nvme_setfeat_timestamp(ts);
> > > > > > +
> > > > > > +n->host_timestamp = le64_to_cpu(ts);
> > > > > > +n->timestamp_set_qemu_clock_ms = 
> > > > > > qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > > > > > +}
> > > > > > +
> > > > > > +static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
> > > > > > +{
> > > > > > +uint64_t current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> > > > 
> > > > Here I wonder why we use QEMU_CLOCK_REALTIME in a device emulation.
> > > > Wouldn't QEMU_CLOCK_VIRTUAL make more sense?
> > > > 
> > > 
> > > QEMU_CLOCK_VIRTUAL probably would make more sense. When I was reading
> > > through the differences I wasn't really sure what to pick. iven that this 
> > > is
> > > the time within the device's context, the virtual time seems more correct.
> > > 
> >  
> > I thought about this too when I reviewed, but came to the conclusion
> > that REALTIME was correct. The timestamp is basically a value that the
> > host stores in the controller. When the host uses Get Features to get
> > the the current time it would expect it to match the progression for its
> > own wall clockright? If I understand REALTIME vs VIRTUAL correctly,
> > using VIRTUAL, it would go way out of sync.
> 
> Which two things would go out of sync with VIRTUAL?
> 
> Not an expert on clocks myself, but I think the main question is what
> happens to the clock while the VM is stopped. REALTIME continues running
> where as VIRTUAL is stopped. If we expose REALTIME measurements to the
> guest, the time passed may look a lot longer than what the guest's clock
> actually says. So this is the thing I am worried would go out of sync
> with REALTIME.
> 

OK, fair point.

Thinking about this some more, I agree that VIRTUAL is more correct. An
application should never track elapsed time using real wall clock time,
but rather some monotonic clock that is oblivious to, say, NTP adjustments.
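
For reference, a minimal sketch of the VIRTUAL-based bookkeeping, reusing
the host_timestamp and timestamp_set_qemu_clock_ms fields from the patch
and ignoring the synch/origin bits of the Timestamp data structure:

    static void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
    {
        n->host_timestamp = le64_to_cpu(ts);
        n->timestamp_set_qemu_clock_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
    }

    static uint64_t nvme_get_timestamp(const NvmeCtrl *n)
    {
        /* elapsed guest-visible time since the host last set the value */
        uint64_t elapsed = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) -
                           n->timestamp_set_qemu_clock_ms;

        return cpu_to_le64(n->host_timestamp + elapsed);
    }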

Klaus



Re: [Qemu-block] [PATCH] nvme: do not advertise support for unsupported arbitration mechanism

2019-06-16 Thread Klaus Birkelund
On Fri, Jun 14, 2019 at 10:39:27PM +0200, Max Reitz wrote:
> On 06.06.19 11:25, Klaus Birkelund Jensen wrote:
> > The device mistakenly reports that the Weighted Round Robin with Urgent
> > Priority Class arbitration mechanism is supported.
> > 
> > It is not.
> 
> I believe you based on the fact that there is no “weight” or “priority”
> anywhere in nvme.c, and that it does not evaluate the Arbitration
> Mechanism Selected field.
> 

I'm not sure whether you want me to change the commit message. Feel free
to change it if you want to ;)

> > Signed-off-by: Klaus Birkelund Jensen 
> > ---
> >  hw/block/nvme.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 30e50f7a3853..415b4641d6b4 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1383,7 +1383,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >  n->bar.cap = 0;
> >  NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
> >  NVME_CAP_SET_CQR(n->bar.cap, 1);
> > -NVME_CAP_SET_AMS(n->bar.cap, 1);
> 
> I suppose the better way would be to pass 0, so it is more explicit, I
> think.
> 
> (Just removing it looks like it may have just been forgotten.)
> 

Not explicitly setting it to zero aligns with how the other fields in
CAP are also left out if kept at zero. If we explicitly set it to zero I
think we should also set all the other fields that way (DSTRD, NSSRS,
etc.).


Klaus



[Qemu-block] [RFC] nvme: how to support multiple namespaces

2019-06-17 Thread Klaus Birkelund
Hi all,

I'm thinking about how to support multiple namespaces in the NVMe
device. My first idea was to add a "namespaces" property array to the
device that references blockdevs, but as Laszlo writes below, this might
not be the best idea. It also makes it troublesome to add per-namespace
parameters (which is something I will be required to do for other
reasons). Some of you might remember my first attempt at this that
included adding a new block driver (derived from raw) that could be
given certain parameters that would then be stored in the image. But I
understand that this is a no-go, and I can see why.

I guess the optimal way would be for the parameters to be something
like:

   -blockdev raw,node-name=blk_ns1,file.driver=file,file.filename=blk_ns1.img
   -blockdev raw,node-name=blk_ns2,file.driver=file,file.filename=blk_ns2.img
   -device nvme-ns,drive=blk_ns1,ns-specific-options (nsfeat,mc,dlfeat)...
   -device nvme-ns,drive=blk_ns2,...
   -device nvme,...

My question is how to state the parent/child relationship between the
nvme and nvme-ns devices. I've been looking at how ide and virtio do
this, and maybe a "bus" is the right way to go?

Can anyone give any advice as to how to proceed? I have a functioning
patch that adds multiple namespaces, but it uses the "namespaces" array
method and I don't think that is the right approach.

I've copied my initial discussion with Laszlo below.


Cheers,
Klaus


On Wed, Jun 05, 2019 at 07:09:43PM +0200, Laszlo Ersek wrote:
> On 06/05/19 15:44, Klaus Birkelund wrote:
> > On Tue, Jun 04, 2019 at 06:52:38PM +0200, Laszlo Ersek wrote:
> >> Hi Klaus,
> >>
> >> On 06/04/19 14:59, Klaus Birkelund wrote:
> >>> Hi Laszlo,
> >>>
> >>> I'm implementing multiple namespace support for the NVMe device in QEMU
> >>> and I'm not sure how to handle the bootindex property.
> >>>
> >>> Your commit message from a907ec52cc1a provides great insight, but do you
> >>> have any recommendations to how the bootindex property should be
> >>> handled?
> >>>
> >>> Multiple namespaces work by having multiple -blockdevs and then using
> >>> the property array functionality to reference a list of blockdevs from
> >>> the nvme device:
> >>>
> >>> -device nvme,serial=,len-namespaces=1,namespace[0]=
> >>>
> >>> A bootindex property would be global to the device. Should it just
> >>> always default to the first namespace? I'm really unsure about how the
> >>> firmware handles it.
> >>>
> >>> Hope you can shed some light on this.
> >>
> >> this is getting quite seriously into QOM and QEMU options, so I
> >> definitely suggest to take this to the list, because I'm not an expert
> >> in all that, at all :)
> >>
> >> Based on a re-reading of the commit (which I have *completely* forgotten
> >> about by now!), and based on your description, my opinion is that
> >> introducing the "namespace" property to the "nvme" device as an array is
> >> a bad fit. Because, as you say, a single device may only take a single
> >> bootindex property. If it suffices to designate at most one namespace
> >> for booting purposes, then I *guess* an extra property can be
> >> introduced, to state *which* namespace the bootindex property should
> >> apply to (and the rest of the namespaces will be ignored for that
> >> purpose). However, if it's necessary to add at least two namespaces to
> >> the boot order, then the namespaces will have to be split to distinct
> >> "-device" options.
> >>
> >> My impression is that the "namespace" property isn't upstream yet; i.e.
> >> it is your work in progress. As a "QOM noob" I would suggest introducing
> >> a new device model, called "nvme-namespace". This could have its own
> >> "bootindex" property. On the "nvme" device model's level, the currently
> >> existing "bootindex" property would become mutually exclusive with the
> >> "nvme" device having "nvme-namespace" child devices. The parent-child
> >> relationship could be expressed from either direction, i.e. either the
> >> "nvme" parent device could reference the children with the "namespace"
> >> array property (it wouldn't refer to s but to the IDs of
> >> "nvme-namespace" devices), or the "nvme-namespace" devices could
> >> reference the parent "nvme"

Re: [Qemu-block] [Qemu-devel] [RFC] nvme: how to support multiple namespaces

2019-06-24 Thread Klaus Birkelund
On Thu, Jun 20, 2019 at 05:37:24PM +0200, Laszlo Ersek wrote:
> On 06/17/19 10:12, Klaus Birkelund wrote:
> > Hi all,
> > 
> > I'm thinking about how to support multiple namespaces in the NVMe
> > device. My first idea was to add a "namespaces" property array to the
> > device that references blockdevs, but as Laszlo writes below, this might
> > not be the best idea. It also makes it troublesome to add per-namespace
> > parameters (which is something I will be required to do for other
> > reasons). Some of you might remember my first attempt at this that
> > included adding a new block driver (derived from raw) that could be
> > given certain parameters that would then be stored in the image. But I
> > understand that this is a no-go, and I can see why.
> > 
> > I guess the optimal way would be such that the parameters was something
> > like:
> > 
> >-blockdev 
> > raw,node-name=blk_ns1,file.driver=file,file.filename=blk_ns1.img
> >-blockdev 
> > raw,node-name=blk_ns2,file.driver=file,file.filename=blk_ns2.img
> >-device nvme-ns,drive=blk_ns1,ns-specific-options (nsfeat,mc,dlfeat)...
> >-device nvme-ns,drive=blk_ns2,...
> >-device nvme,...
> > 
> > My question is how to state the parent/child relationship between the
> > nvme and nvme-ns devices. I've been looking at how ide and virtio does
> > this, and maybe a "bus" is the right way to go?
> 
> I've added Markus to the address list, because of this question. No
> other (new) comments from me on the thread starter at this time, just
> keeping the full context.
> 

Hi all,

I've successfully implemented this by introducing a new 'nvme-ns' device
model. The nvme device creates a bus named from the device id ('id'
parameter) and the nvme-ns devices are then registered on it.

This results in an nvme device being created like this (example with two
namespaces):

  -drive file=nvme0n1.img,if=none,id=disk1
  -drive file=nvme0n2.img,if=none,id=disk2
  -device nvme,serial=deadbeef,id=nvme0
  -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
  -device nvme-ns,drive=disk2,bus=nvme0,nsid=2

How does that look as a way forward?
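
To make the wiring a bit more concrete, here is a rough sketch of the
qdev side. The type and field names (TYPE_NVME_BUS, NvmeBus, n->bus) are
my own and not necessarily what will end up in the patch:

    #define TYPE_NVME_BUS "nvme-bus"

    typedef struct NvmeBus {
        BusState parent_bus;
    } NvmeBus;

    /* in the nvme controller's realize(): create the bus, named after the
     * controller's qdev id so that -device nvme-ns,bus=nvme0 can find it */
    qbus_create_inplace(&n->bus, sizeof(n->bus), TYPE_NVME_BUS,
                        DEVICE(pci_dev), DEVICE(pci_dev)->id);

    /* in the nvme-ns class_init(): declare which bus the device plugs into */
    static void nvme_ns_class_init(ObjectClass *oc, void *data)
    {
        DeviceClass *dc = DEVICE_CLASS(oc);

        dc->bus_type = TYPE_NVME_BUS;
        dc->realize = nvme_ns_realize; /* registers the nsid with the
                                          controller reached via the bus */
    }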

Cheers,
Klaus



Re: [Qemu-block] [Qemu-devel] [RFC] nvme: how to support multiple namespaces

2019-06-25 Thread Klaus Birkelund
On Mon, Jun 24, 2019 at 12:18:45PM +0200, Kevin Wolf wrote:
> Am 24.06.2019 um 10:01 hat Klaus Birkelund geschrieben:
> > On Thu, Jun 20, 2019 at 05:37:24PM +0200, Laszlo Ersek wrote:
> > > On 06/17/19 10:12, Klaus Birkelund wrote:
> > > > Hi all,
> > > > 
> > > > I'm thinking about how to support multiple namespaces in the NVMe
> > > > device. My first idea was to add a "namespaces" property array to the
> > > > device that references blockdevs, but as Laszlo writes below, this might
> > > > not be the best idea. It also makes it troublesome to add per-namespace
> > > > parameters (which is something I will be required to do for other
> > > > reasons). Some of you might remember my first attempt at this that
> > > > included adding a new block driver (derived from raw) that could be
> > > > given certain parameters that would then be stored in the image. But I
> > > > understand that this is a no-go, and I can see why.
> > > > 
> > > > I guess the optimal way would be such that the parameters was something
> > > > like:
> > > > 
> > > >-blockdev 
> > > > raw,node-name=blk_ns1,file.driver=file,file.filename=blk_ns1.img
> > > >-blockdev 
> > > > raw,node-name=blk_ns2,file.driver=file,file.filename=blk_ns2.img
> > > >-device nvme-ns,drive=blk_ns1,ns-specific-options 
> > > > (nsfeat,mc,dlfeat)...
> > > >-device nvme-ns,drive=blk_ns2,...
> > > >-device nvme,...
> > > > 
> > > > My question is how to state the parent/child relationship between the
> > > > nvme and nvme-ns devices. I've been looking at how ide and virtio does
> > > > this, and maybe a "bus" is the right way to go?
> > > 
> > > I've added Markus to the address list, because of this question. No
> > > other (new) comments from me on the thread starter at this time, just
> > > keeping the full context.
> > > 
> > 
> > Hi all,
> > 
> > I've succesfully implemented this by introducing a new 'nvme-ns' device
> > model. The nvme device creates a bus named from the device id ('id'
> > parameter) and the nvme-ns devices are then registered on this.
> > 
> > This results in an nvme device being creates like this (two namespaces
> > example):
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > How does that look as a way forward?
> 
> This looks very similar to what other devices do (one bus controller
> that has multiple devices on its but), so I like it.
> 
> The thing that is special here is that -device nvme is already a block
> device by itself that can take a drive property. So how does this play
> together? Can I choose to either specify a drive directly for the nvme
> device or nvme-ns devices, but when I do both, I will get an error? What
> happens if I don't specify a drive for nvme, but also don't add nvme-ns
> devices?
> 

Hi Kevin,

Yes, the nvme device is already a block device. My current patch removes
that property from the nvme device. I guess this breaks backward
compatibility. We could accept a drive for the nvme device only if no
nvme-ns devices are configured and connected on the bus.

I'm not entirely sure on the spec, but my gut tells me that an nvme
device without any namespaces is technically a valid device, although it
is a bit useless.

I will post my patch (as part of a larger series) and we can discuss it
there.

Thanks for the feedback!

Klaus



Re: [Qemu-block] [Qemu-devel] [RFC] nvme: how to support multiple namespaces

2019-06-25 Thread Klaus Birkelund
On Tue, Jun 25, 2019 at 07:51:29AM +0200, Markus Armbruster wrote:
> Laszlo Ersek  writes:
> 
> > On 06/24/19 12:18, Kevin Wolf wrote:
> >> Am 24.06.2019 um 10:01 hat Klaus Birkelund geschrieben:
> >>> On Thu, Jun 20, 2019 at 05:37:24PM +0200, Laszlo Ersek wrote:
> >>>> On 06/17/19 10:12, Klaus Birkelund wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I'm thinking about how to support multiple namespaces in the NVMe
> >>>>> device. My first idea was to add a "namespaces" property array to the
> >>>>> device that references blockdevs, but as Laszlo writes below, this might
> >>>>> not be the best idea. It also makes it troublesome to add per-namespace
> >>>>> parameters (which is something I will be required to do for other
> >>>>> reasons). Some of you might remember my first attempt at this that
> >>>>> included adding a new block driver (derived from raw) that could be
> >>>>> given certain parameters that would then be stored in the image. But I
> >>>>> understand that this is a no-go, and I can see why.
> >>>>>
> >>>>> I guess the optimal way would be such that the parameters was something
> >>>>> like:
> >>>>>
> >>>>>-blockdev 
> >>>>> raw,node-name=blk_ns1,file.driver=file,file.filename=blk_ns1.img
> >>>>>-blockdev 
> >>>>> raw,node-name=blk_ns2,file.driver=file,file.filename=blk_ns2.img
> >>>>>-device nvme-ns,drive=blk_ns1,ns-specific-options 
> >>>>> (nsfeat,mc,dlfeat)...
> >>>>>-device nvme-ns,drive=blk_ns2,...
> >>>>>-device nvme,...
> >>>>>
> >>>>> My question is how to state the parent/child relationship between the
> >>>>> nvme and nvme-ns devices. I've been looking at how ide and virtio does
> >>>>> this, and maybe a "bus" is the right way to go?
> >>>>
> >>>> I've added Markus to the address list, because of this question. No
> >>>> other (new) comments from me on the thread starter at this time, just
> >>>> keeping the full context.
> >>>>
> >>>
> >>> Hi all,
> >>>
> >>> I've succesfully implemented this by introducing a new 'nvme-ns' device
> >>> model. The nvme device creates a bus named from the device id ('id'
> >>> parameter) and the nvme-ns devices are then registered on this.
> >>>
> >>> This results in an nvme device being creates like this (two namespaces
> >>> example):
> >>>
> >>>   -drive file=nvme0n1.img,if=none,id=disk1
> >>>   -drive file=nvme0n2.img,if=none,id=disk2
> >>>   -device nvme,serial=deadbeef,id=nvme0
> >>>   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >>>   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> >>>
> >>> How does that look as a way forward?
> >> 
> >> This looks very similar to what other devices do (one bus controller
> >> that has multiple devices on its but), so I like it.
> 
> Devices can be wired together without a bus intermediary.  You
> definitely want a bus when the physical connection you model has one.
> If not, a bus may be useful anyway, say because it provides a convenient
> way to encapsulate the connection model, or to support -device bus=...
> 
 
I'm not sure how to wire it together without the bus abstraction, so
I'll stick with the bus for now. It *is* extremely convenient!

Cheers,
Klaus



Re: [Qemu-block] [Qemu-devel] [RFC] nvme: how to support multiple namespaces

2019-06-26 Thread Klaus Birkelund
On Wed, Jun 26, 2019 at 12:14:15PM +0200, Paolo Bonzini wrote:
> On 26/06/19 06:46, Markus Armbruster wrote:
> >> I'm not sure how to wire it together without the bus abstraction? So
> >> I'll stick with the bus for now. It *is* extremely convenient!
> > 
> > As far as I can tell offhand, a common use of bus-less connections
> > between devices is wiring together composite devices.  Example:
> > 
> > static void designware_pcie_host_init(Object *obj)
> > {
> > DesignwarePCIEHost *s = DESIGNWARE_PCIE_HOST(obj);
> > DesignwarePCIERoot *root = &s->root;
> > 
> > object_initialize_child(obj, "root",  root, sizeof(*root),
> > TYPE_DESIGNWARE_PCIE_ROOT, &error_abort, 
> > NULL);
> > qdev_prop_set_int32(DEVICE(root), "addr", PCI_DEVFN(0, 0));
> > qdev_prop_set_bit(DEVICE(root), "multifunction", false);
> > }
> > 
> > This creates a TYPE_DESIGNWARE_PCIE_ROOT device "within" the
> > TYPE_DESIGNWARE_PCIE_HOST device.
> > 
> > Bus-less connections between separate devices (i.e. neither device is a
> > part of the other) are also possible.  But I'm failing at grep right
> > now.  Here's an example for connecting a device to a machine:
> > 
> > static void mch_realize(PCIDevice *d, Error **errp)
> > {
> > int i;
> > MCHPCIState *mch = MCH_PCI_DEVICE(d);
> > 
> > [...]
> > object_property_add_const_link(qdev_get_machine(), "smram",
> >OBJECT(&mch->smram), &error_abort);
> > [...]
> > }
> 
> This is a link to a memory region.  A connection to a separate device
> can be found in hw/dma/xilinx_axidma.c and hw/net/xilinx_axienet.c,
> where you have
> 
>  data stream <> data stream
>/\
>dmaenet
>\/
>  control stream <--> control stream
> 
> where the horizontal links in the middle are set up by board code, while
> the diagonal lines on the side are set up by device code.
> 
> > Paolo, can you provide guidance on when to use a bus, and when not to?
> 
> I would definitely use a bus if 1) it is common for the user (and not
> for machine code) to set up the connection 2) the relationship is
> parent-child.  Link properties are basically unused on the command line,
> and it only makes sense to make something different if the connection is
> some kind of graph so bus-child does not cut it.
> 

Definitely looks like the bus is the way to go. The controller/namespace
relationship is strictly parent-child.

Thanks both of you for the advice!


Klaus



Re: [Qemu-block] [RFC,v1] Namespace Management Support

2019-07-05 Thread Klaus Birkelund
On Tue, Jul 02, 2019 at 10:39:36AM -0700, Matt Fitzpatrick wrote:
> Adding namespace management support to the nvme device. Namespace creation
> requires contiguous block space for a simple method of allocation.
> 
> I wrote this a few years ago based on Keith's fork and nvmeqemu fork and
> have recently re-synced with the latest trunk.  Some data structures in
> nvme.h are a bit more filled out that strictly necessary as this is also the
> base for sr-iov and IOD patched to be submitted later.
> 

Hi Matt,

Nice! I'm always happy when new features for the nvme device is posted!

I'll be happy to review it, but I won't start going through it in
detail because I believe the approach to supporting multiple namespaces
is flawed. We had a recent discussion on this, and I also got some
unrelated patches rejected for implementing it similarly, by carving
up the image.

I have posted a long series that includes a patch for multiple
namespaces. It is implemented by introducing a fresh `nvme-ns` device
model that represents a namespace and attaches to a bus created by the
parent `nvme` controller device.

The core issue is that a qemu image /should/ be attachable to other
devices (say ide) and not strictly tied to one device model. Thus,
we cannot just shove a bunch of namespaces into a single image.

But, in light of your patch, I'm not convinced that my implementation is
the correct solution. Maybe the abstraction should not be an `nvme-ns`
device, but an `nvme-nvm` device that, when attached, changes TNVMCAP and
UNVMCAP? Do you have any input on this? Or we could have both and
dynamically create the nvme-ns devices on top of nvme-nvm devices. I
think it would still require a 1-to-1 mapping, but it could be a way to
support the namespace management capability.


Cheers,
Klaus



Re: [Qemu-block] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Klaus Birkelund
On Fri, Jul 05, 2019 at 09:23:33AM +0200, Klaus Birkelund Jensen wrote:
> This adds support for multiple namespaces by introducing a new 'nvme-ns'
> device model. The nvme device creates a bus named from the device name
> ('id'). The nvme-ns devices then connect to this and registers
> themselves with the nvme device.
> 
> This changes how an nvme device is created. Example with two namespaces:
> 
>   -drive file=nvme0n1.img,if=none,id=disk1
>   -drive file=nvme0n2.img,if=none,id=disk2
>   -device nvme,serial=deadbeef,id=nvme0
>   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
>   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> 
> A maximum of 256 namespaces can be configured.
> 
 
Well, that was embarrassing.

This patch breaks nvme-test.c, which I obviously did not run.

In my defense, the test doesn't do much currently, but I'll of course
fix the test for v2.



Re: [Qemu-block] [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Klaus Birkelund
On Fri, Jul 05, 2019 at 02:49:29PM +0100, Daniel P. Berrangé wrote:
> On Fri, Jul 05, 2019 at 03:36:17PM +0200, Klaus Birkelund wrote:
> > On Fri, Jul 05, 2019 at 09:23:33AM +0200, Klaus Birkelund Jensen wrote:
> > > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > > device model. The nvme device creates a bus named from the device name
> > > ('id'). The nvme-ns devices then connect to this and registers
> > > themselves with the nvme device.
> > > 
> > > This changes how an nvme device is created. Example with two namespaces:
> > > 
> > >   -drive file=nvme0n1.img,if=none,id=disk1
> > >   -drive file=nvme0n2.img,if=none,id=disk2
> > >   -device nvme,serial=deadbeef,id=nvme0
> > >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> > >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > > 
> > > A maximum of 256 namespaces can be configured.
> > > 
> >  
> > Well that was embarrasing.
> > 
> > This patch breaks nvme-test.c. Which I obviously did not run.
> > 
> > In my defense, the test doesn't do much currently, but I'll of course
> > fix the test for v2.
> 
> That highlights a more serious problem.  This series changes the syntx
> for configuring the nvme device in a way that is not backwards compatible.
> So anyone who is using QEMU with NVME will be broken when they upgrade
> to the next QEMU release.
> 
> I understand why you wanted to restructure things to have a separate
> nvme-ns device, but there needs to be some backcompat support in there
> for the existing syntax to avoid breaking current users IMHO.
> 
 
Hi Daniel,

I raised this issue previously. I suggested that we keep the drive
property for the nvme device and only accept either that or an nvme-ns
device to be configured (but not both).

That would keep backward compatibility, but enforce the use of nvme-ns
for any setup that requires multiple namespaces.

Would that work?
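
Something along these lines is what I have in mind for the controller's
realize() (sketch only, the helper names are made up):

    if (n->conf.blk) {
        /* legacy configuration: -device nvme,drive=... */
        if (nvme_num_namespaces(n) > 0) {
            error_setg(errp, "the 'drive' property cannot be combined with "
                             "nvme-ns devices");
            return;
        }

        /* expose the legacy drive as namespace 1 */
        nvme_register_namespace(n, n->conf.blk, 1, errp);
    }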

Cheers,
Klaus



Re: [Qemu-block] [Qemu-devel] [RFC, v1] Namespace Management Support

2019-07-08 Thread Klaus Birkelund
On Mon, Jul 08, 2019 at 03:52:29PM -0700, Matt Fitzpatrick wrote:
> Hey Klaus,
> 
> Sorry for the late reply!  I finally found this message amid the pile of
> emails Qemu dumped on me.
> 
> I don't know what the right answer is here... NVMe is designed in a way
> where you *do* "carve up" the flash into logical groupings and the nvme
> firmware decides on how that's done. Those logical groupings can be attached
> to different controllers(which we don't have here yet?) after init, but
> that's a problem for future us I guess?But that's all stuff you already
> know.
> 

Yeah, I haven't started worrying about that ;)

> The "nvme-nvm" solution might be the right approach, but I'm a bit hesitant
> on the idea of growing tnvmcap...
> 
> I can't think of any way to create namespaces on the fly and not have it use
> some single existing block backend, unless we defined a range of block
> images on qemu start and namespace create/attach only uses one image up to
> and including it's max size per namespace? That might work, and I think
> that's what you suggested (or at least is similar to), though it could be
> pretty wasteful. It wouldn't offer a "true" namespace management support,
> but could be close enough.
> 

Having an emulated device that supports namespace management would be
very useful for testing software, but yeah, I have a hard time seeing
how we can make that fit with the current "QEMU model".

> I'm in the middle of going through the patch you posted. Nice job!  I'm glad
> to see more people adding enhancements. It was pretty stale for years.
> 

Thanks for looking at it, I know it's a lot to go through ;)

> -Matt
> On 7/5/19 12:50 AM, Klaus Birkelund wrote:
> > On Tue, Jul 02, 2019 at 10:39:36AM -0700, Matt Fitzpatrick wrote:
> > > Adding namespace management support to the nvme device. Namespace creation
> > > requires contiguous block space for a simple method of allocation.
> > > 
> > > I wrote this a few years ago based on Keith's fork and nvmeqemu fork and
> > > have recently re-synced with the latest trunk.  Some data structures in
> > > nvme.h are a bit more filled out that strictly necessary as this is also 
> > > the
> > > base for sr-iov and IOD patched to be submitted later.
> > > 
> > Hi Matt,
> > 
> > Nice! I'm always happy when new features for the nvme device is posted!
> > 
> > I'll be happy to review it, but I won't start going through it in
> > details because I believe the approach to supporting multiple namespaces
> > is flawed. We had a recent discussion on this and I also got some
> > unrelated patches rejected due to implementing it similarly by carving
> > up the image.
> > 
> > I have posted a long series that includes a patch for multiple
> > namespaces. It is implemented by introducing a fresh `nvme-ns` device
> > model that represents a namespace and attaches to a bus created by the
> > parent `nvme` controller device.
> > 
> > The core issue is that a qemu image /should/ be attachable to other
> > devices (say ide) and not strictly tied to the one device model. Thus,
> > we cannot just shove a bunch of namespaces into a single image.
> > 
> > But, in light of your patch, I'm not convinced that my implementation is
> > the correct solution. Maybe the abstraction should not be an `nvme-ns`
> > device, but a `nvme-nvm` device that when attached changes TNVMCAP and
> > UNVMCAP? Maybe you have some input for this? Or we could have both and
> > dynamically create the nvme-ns devices on top of nvme-nvm devices. I
> > think it would still require a 1-to-1 mapping, but it could be a way to
> > support the namespace management capability.
> > 
> > 
> > Cheers,
> > Klaus
> > 
> 

Hi Kevin,

This highlights another situation where the "1 image to 1 block device"
model doesn't fit that well with NVMe, especially with the introduction
of "NVM Sets" in NVMe 1.4. It would be very nice to introduce an
'nvme-nvmset' device model that adds an NVM Set which the controller can
then create namespaces in.

Is it completely unacceptable for a device to use the image in such a
way that it would no longer make sense (i.e., no longer present the same
block device) when attached to another device (ide, ...)?

I really have a hard time seeing how we could support these features
without violating the "1 image to 1 block device" model.


Cheers,
Klaus



Re: [PATCH v1] nvme: indicate CMB support through controller capabilities register

2020-04-07 Thread Klaus Birkelund Jensen
On Apr  1 11:42, Andrzej Jakowski wrote:
> This patch sets CMBS bit in controller capabilities register when user
> configures NVMe driver with CMB support, so capabilites are correctly reported
> to guest OS.
> 
> Signed-off-by: Andrzej Jakowski 
> ---
>  hw/block/nvme.c  | 2 ++
>  include/block/nvme.h | 4 
>  2 files changed, 6 insertions(+)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index d28335cbf3..986803398f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1393,6 +1393,8 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> **errp)
>  n->bar.intmc = n->bar.intms = 0;
>  
>  if (n->cmb_size_mb) {
> +/* Contoller capabilities */
> +NVME_CAP_SET_CMBS(n->bar.cap, 1);
>  
>  NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
>  NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 8fb941c653..561891b140 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -27,6 +27,7 @@ enum NvmeCapShift {
>  CAP_CSS_SHIFT  = 37,
>  CAP_MPSMIN_SHIFT   = 48,
>  CAP_MPSMAX_SHIFT   = 52,
> +CAP_CMB_SHIFT  = 57,
>  };
>  
>  enum NvmeCapMask {
> @@ -39,6 +40,7 @@ enum NvmeCapMask {
>  CAP_CSS_MASK   = 0xff,
>  CAP_MPSMIN_MASK= 0xf,
>  CAP_MPSMAX_MASK= 0xf,
> +CAP_CMB_MASK   = 0x1,
>  };
>  
>  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
> @@ -69,6 +71,8 @@ enum NvmeCapMask {
> << 
> CAP_MPSMIN_SHIFT)
>  #define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & 
> CAP_MPSMAX_MASK)\
>  << 
> CAP_MPSMAX_SHIFT)
> +#define NVME_CAP_SET_CMBS(cap, val) (cap |= (uint64_t)(val & CAP_CMB_MASK)\
> +<< CAP_CMB_SHIFT)
>  
>  enum NvmeCcShift {
>  CC_EN_SHIFT = 0,
> -- 
> 2.21.1
> 

Looks good.

Reviewed-by: Klaus Jensen 



Re: [PATCH v6 14/42] nvme: add missing mandatory features

2020-04-08 Thread Klaus Birkelund Jensen
On Mar 31 12:39, Maxim Levitsky wrote:
> On Tue, 2020-03-31 at 07:41 +0200, Klaus Birkelund Jensen wrote:
> > On Mar 25 12:41, Maxim Levitsky wrote:
> > > BTW the user of the device doesn't have to have 1:1 mapping between qid 
> > > and msi interrupt index,
> > > in fact when MSI is not used, all the queues will map to the same vector, 
> > > which will be interrupt 0
> > > from point of view of the device IMHO.
> > > So it kind of makes sense IMHO to have num_irqs or something, even if it 
> > > technically equals to number of queues.
> > > 
> > 
> > Yeah, but the device will still *support* the N IVs, so they can still
> > be configured even though they will not be used. So I don't think we
> > need to introduce an additional parameter?
> 
> Yes and no.
> I wasn't thinking to add a new parameter for number of supporter interrupt 
> vectors,
> but just to have an internal variable to represent it so that we could 
> support in future
> case where these are not equal.
> 
> Also from point of view of validating the users of this virtual nvme drive, I 
> think it kind
> of makes sense to allow having less supported IRQ vectors than IO queues, so 
> to check
> how userspace copes with it. It is valid after all to have same interrupt 
> vector shared between
> multiple queues.
> 

I see that this could be useful for testing, but I think we can defer
that to a later patch. Would you be okay with that for now?

> In fact in theory (but that would complicate the implementation greatly) we 
> should even support
> case when number of submission queues is not equal to number of completion 
> queues. Yes nobody does in real hardware,
> and at least Linux nvme driver hard assumes 1:1 SQ/CQ mapping but still.
> 

It is not the hardware that decides this, and I believe that there
definitely are applications that choose to associate multiple SQs with
a single CQ. The CQ is an attribute of the SQ, and the interrupt vector
of the CQ is specified in the create command. I believe this is already
supported.
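
To spell that out, here is a host-side sketch (field names follow the
NvmeCreateCq and NvmeCreateSq structs in include/block/nvme.h; PRPs
omitted): the host creates one CQ with whatever interrupt vector it
likes and then points any number of Create I/O Submission Queue commands
at it via the CQID field, so an N:1 SQ-to-CQ mapping needs no special
handling in the device:

    NvmeCreateCq create_cq = {
        .opcode     = NVME_ADM_CMD_CREATE_CQ,
        .cqid       = cpu_to_le16(1),
        .qsize      = cpu_to_le16(63),          /* 0's based: 64 entries */
        .cq_flags   = cpu_to_le16(0x3),         /* PC | IEN */
        .irq_vector = cpu_to_le16(1),           /* one vector for both SQs */
    };

    NvmeCreateSq create_sq1 = {
        .opcode   = NVME_ADM_CMD_CREATE_SQ,
        .sqid     = cpu_to_le16(1),
        .qsize    = cpu_to_le16(63),
        .sq_flags = cpu_to_le16(0x1),           /* physically contiguous */
        .cqid     = cpu_to_le16(1),             /* completions go to CQ 1 */
    };

    NvmeCreateSq create_sq2 = create_sq1;
    create_sq2.sqid = cpu_to_le16(2);           /* second SQ, same CQ */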

> My nvme-mdev doesn't make this assumpiton (and neither any assumptions on 
> interrupt vector counts) 
> and allows the user to have any SQ/CQ mapping as far as the spec allows
> (but it does hardcode maximum number of SQ/CQ supported)
> 
> BTW, I haven't looked at that but we should check that the virtual nvme drive 
> can cope with using legacy
> interrupt (that is MSI disabled) - nvme-mdev does support this and was tested 
> with it.
> 

Yes, this is definitely not very well tested.

If you insist on all of the above being implemented, then I will do it,
but I would rather defer this to later patches as this series is already
pretty large ;)



Re: [PATCH v6 32/42] nvme: allow multiple aios per command

2020-04-08 Thread Klaus Birkelund Jensen
On Mar 31 12:10, Maxim Levitsky wrote:
> On Tue, 2020-03-31 at 07:47 +0200, Klaus Birkelund Jensen wrote:
> > On Mar 25 12:57, Maxim Levitsky wrote:
> > > On Mon, 2020-03-16 at 07:29 -0700, Klaus Jensen wrote:
> > > > @@ -516,10 +613,10 @@ static inline uint16_t nvme_check_prinfo(NvmeCtrl 
> > > > *n, NvmeNamespace *ns,
> > > >  return NVME_SUCCESS;
> > > >  }
> > > >  
> > > > -static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace 
> > > > *ns,
> > > > - uint64_t slba, uint32_t nlb,
> > > > - NvmeRequest *req)
> > > > +static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
> > > > + uint32_t nlb, NvmeRequest 
> > > > *req)
> > > >  {
> > > > +NvmeNamespace *ns = req->ns;
> > > >  uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
> > > 
> > > This should go to the patch that added nvme_check_bounds as well
> > > 
> > 
> > We can't really, because the NvmeRequest does not hold a reference to
> > the namespace as a struct member at that point. This is also an issue
> > with the nvme_check_prinfo function above.
> 
> I see it now. The changes to NvmeRequest together with this are a good 
> candidate
> to split from this patch to get this patch to size that is easy to review.
> 

I'm factoring those changes and other stuff out into separate patches!





Re: [PATCH v7 10/48] nvme: remove redundant cmbloc/cmbsz members

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:10, Philippe Mathieu-Daudé wrote:
> 
> "hw/block/nvme.h" should not pull in "block/nvme.h", both should include a
> common "hw/block/nvme_spec.h" (or better named). Not related to this patch
> although.
> 

Hmm. It does pull in "include/block/nvme.h", which is basically the
"nvme_spec.h" you are talking about. That file holds all spec-related
structs that are shared between the VFIO-based nvme driver
(block/nvme.c) and the emulated device (hw/block/nvme.c).

Isn't that what is intended?



Re: [PATCH v7 12/48] nvme: add temperature threshold feature

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:19, Philippe Mathieu-Daudé wrote:
> On 4/15/20 7:51 AM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > It might seem wierd to implement this feature for an emulated device,
> 
> 'weird'

Thanks, fixed :)

> 
> > but it is mandatory to support and the feature is useful for testing
> > asynchronous event request support, which will be added in a later
> > patch.
> 
> Which patch? I can't find how you set the temperature in this series.
> 

The temperature itself cannot be changed, but the thresholds can be set
with the Set Features command (and that can then trigger AERs). That is
added in "nvme: add temperature threshold feature" and "nvme: add support
for the asynchronous event request command", respectively.

There is a test in SPDK that does this.
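
For reference, a sketch of the Set Features side (not the exact patch;
temp_thresh_hi and temp_thresh_low are assumed controller fields). Per
the spec, CDW11 carries the threshold in bits 15:0, the sensor select in
bits 19:16 and the threshold type in bits 21:20; this case goes in
nvme_set_feature()'s switch on the feature id:

    case NVME_TEMPERATURE_THRESHOLD:
        /* CDW11: TMPTH in [15:0], TMPSEL in [19:16], THSEL in [21:20] */
        if (((dw11 >> 16) & 0xf) != 0) {
            break;                      /* only the composite sensor here */
        }

        switch ((dw11 >> 20) & 0x3) {
        case 0:                         /* over temperature threshold */
            n->features.temp_thresh_hi = dw11 & 0xffff;
            break;
        case 1:                         /* under temperature threshold */
            n->features.temp_thresh_low = dw11 & 0xffff;
            break;
        default:
            return NVME_INVALID_FIELD | NVME_DNR;
        }

        break;

Crossing either threshold is then what the AER patch uses to queue a
SMART/health critical warning event.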




Re: [PATCH v7 11/48] nvme: refactor device realization

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:14, Philippe Mathieu-Daudé wrote:
> Hi Klaus,
> 
> This patch is a pain to review... Could you split it? I'd use one trivial
> patch for each function extracted from nvme_realize().
> 

Understood, I will split it up!



Re: [PATCH v7 12/48] nvme: add temperature threshold feature

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:24, Klaus Birkelund Jensen wrote:
> On Apr 15 09:19, Philippe Mathieu-Daudé wrote:
> > On 4/15/20 7:51 AM, Klaus Jensen wrote:
> > > From: Klaus Jensen 
> > > 
> > > It might seem wierd to implement this feature for an emulated device,
> > 
> > 'weird'
> 
> Thanks, fixed :)
> 
> > 
> > > but it is mandatory to support and the feature is useful for testing
> > > asynchronous event request support, which will be added in a later
> > > patch.
> > 
> > Which patch? I can't find how you set the temperature in this series.
> > 
> 
> The temperature cannot be changed, but the thresholds can with the Set
> Features command (and that can then trigger AERs). That is added in
> "nvme: add temperature threshold feature" and "nvme: add support for the
> asynchronous event request command" respectively.
> 
> There is a test in SPDK that does this.
> 

Oh, I think I misunderstood you.

No, setting the temperature was moved to the "nvme: add support for the
get log page command" patch since that is the patch that actually uses
it. This was on request by Maxim in an earlier review.



Re: [PATCH v7 06/48] nvme: refactor nvme_addr_read

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:03, Philippe Mathieu-Daudé wrote:
> On 4/15/20 7:50 AM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Pull the controller memory buffer check to its own function. The check
> > will be used on its own in later patches.
> > 
> > Signed-off-by: Klaus Jensen 
> > Acked-by: Keith Busch 
> > ---
> >   hw/block/nvme.c | 16 
> >   1 file changed, 12 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 622103c42d0a..02d3dde90842 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -52,14 +52,22 @@
> >   static void nvme_process_sq(void *opaque);
> > +static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> 
> 'inline' not really necessary here.
> 

Fixed.




Re: [PATCH v7 45/48] nvme: support multiple namespaces

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:38, Philippe Mathieu-Daudé wrote:
> On 4/15/20 7:51 AM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >-drive file=nvme0n1.img,if=none,id=disk1
> >-drive file=nvme0n2.img,if=none,id=disk2
> >-device nvme,serial=deadbeef,id=nvme0
> >-device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >-device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > Reviewed-by: Keith Busch 
> > ---
> >   hw/block/Makefile.objs |   2 +-
> >   hw/block/nvme-ns.c | 157 +++
> >   hw/block/nvme-ns.h |  60 +++
> >   hw/block/nvme.c| 233 +++--
> >   hw/block/nvme.h|  47 -
> >   hw/block/trace-events  |   8 +-
> >   6 files changed, 396 insertions(+), 111 deletions(-)
> >   create mode 100644 hw/block/nvme-ns.c
> >   create mode 100644 hw/block/nvme-ns.h
> > 
> > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> > index 4b4a2b338dc4..d9141d6a4b9b 100644
> > --- a/hw/block/Makefile.objs
> > +++ b/hw/block/Makefile.objs
> > @@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
> >   common-obj-$(CONFIG_XEN) += xen-block.o
> >   common-obj-$(CONFIG_ECC) += ecc.o
> >   common-obj-$(CONFIG_ONENAND) += onenand.o
> > -common-obj-$(CONFIG_NVME_PCI) += nvme.o
> > +common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
> >   common-obj-$(CONFIG_SWIM) += swim.o
> >   common-obj-$(CONFIG_SH4) += tc58128.o
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > new file mode 100644
> > index ..bd64d4a94632
> > --- /dev/null
> > +++ b/hw/block/nvme-ns.c
> > @@ -0,0 +1,157 @@
> 
> Missing copyright + license.
> 

Fixed.

> > +
> > +switch (n->conf.wce) {
> > +case ON_OFF_AUTO_ON:
> > +n->features.volatile_wc = 1;
> > +break;
> > +case ON_OFF_AUTO_OFF:
> > +n->features.volatile_wc = 0;
> 
> Missing 'break'?
> 

Ouch. Fixed.

> > +case ON_OFF_AUTO_AUTO:
> > +n->features.volatile_wc = blk_enable_write_cache(ns->blk);
> > +break;
> > +default:
> > +abort();
> > +}
> > +
> > +blk_set_enable_write_cache(ns->blk, n->features.volatile_wc);
> > +
> > +return 0;
> > +}
> > +
> > +static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
> > +{
> > +if (!ns->blk) {
> > +error_setg(errp, "block backend not configured");
> > +return -1;
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > +{
> > +if (nvme_ns_check_constraints(ns, errp)) {
> > +return -1;
> > +}
> > +
> > +if (nvme_ns_init_blk(n, ns, &n->id_ctrl, errp)) {
> > +return -1;
> > +}
> > +
> > +nvme_ns_init(ns);
> > +if (nvme_register_namespace(n, ns, errp)) {
> > +return -1;
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +static void nvme_ns_realize(DeviceState *dev, Error **errp)
> > +{
> > +NvmeNamespace *ns = NVME_NS(dev);
> > +BusState *s = qdev_get_parent_bus(dev);
> > +NvmeCtrl *n = NVME(s->parent);
> > +Error *local_err = NULL;
> > +
> > +if (nvme_ns_setup(n, ns, &local_err)) {
> > +error_propagate_prepend(errp, local_err,
> > +"could not setup namespace: ");
> > +return;
> > +}
> > +}
> > +
> > +static Property nvme_ns_props[] = {
> > +DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
> > +DEFINE_PROP_END_OF_LIST(),
> > +};
> > +
> > +static void nvme_ns_class_init(ObjectClass *oc, void *data)
> > +{
> > +DeviceClass *dc = DEVICE_CLASS(oc);
> > +
> > +set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
> > +
> > +dc->bus_type = TYPE_NVME_BUS;
> > +dc->realize = nvme_ns_realize;
> > +device_class_set_props(dc, nvme_ns_props);
> > +dc->desc = "virtual nvme namespace";
> 
> "Virtual NVMe namespace"?
> 

Fixed.

> > +}
> > +
> > +static void nvme_ns_instance_init(Object *obj)
> > +{
> > +NvmeNamespace *ns = NVME_NS(obj);
> > +char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
> > +
> > +device_add_bootindex_property(obj, &ns->bootindex, "bootindex",
> > +  bootindex, DEVICE(obj), &error_abort);
> > +
> > +g_free(bootindex);
> > +}
> > +
> > +static const TypeInfo nvme_ns_info = {
> > +.name = TYPE_NVME_NS,
> > +.

Re: [PATCH v7 11/48] nvme: refactor device realization

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 09:55, Philippe Mathieu-Daudé wrote:
> On 4/15/20 9:25 AM, Klaus Birkelund Jensen wrote:
> > On Apr 15 09:14, Philippe Mathieu-Daudé wrote:
> > > Hi Klaus,
> > > 
> > > This patch is a pain to review... Could you split it? I'd use one trivial
> > > patch for each function extracted from nvme_realize().
> > > 
> > 
> > Understood, I will split it up!
> 
> Thanks, that will help the review.
> 
> As this series is quite big, I recommend you to split it, so part of it can
> get merged quicker and you don't have to carry tons of patches that scare
> reviewers/maintainers.
> 
> Suggestions:
> 
> - 1: cleanups/refactors
> - 2: support v1.3
> - 3: more refactors, strengthening code
> - 4: improve DMA & S/G
> - 5: support for multiple NS
> - 6: tests for multiple NS feature
> - 7: tests bus unplug/replug (idea)
> 
> Or less :)
> 

Okay, good idea. Thanks.



Re: [PATCH 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 12:38, Philippe Mathieu-Daudé wrote:
> On 4/15/20 12:24 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.c | 47 ++-
> >   1 file changed, 26 insertions(+), 21 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f0989cbb4335..08f7ae0a48b3 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1359,13 +1359,35 @@ static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> >   return 0;
> >   }
> > +static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error 
> > **errp)
> > +{
> > +int64_t bs_size;
> > +NvmeIdNs *id_ns = &ns->id_ns;
> > +
> > +bs_size = blk_getlength(n->conf.blk);
> > +if (bs_size < 0) {
> > +error_setg_errno(errp, -bs_size, "could not get backing file 
> > size");
> > +return -1;
> > +}
> > +
> > +n->ns_size = bs_size;
> > +
> > +id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
> > +
> > +/* no thin provisioning */
> > +id_ns->ncap = id_ns->nsze;
> > +id_ns->nuse = id_ns->ncap;
> > +
> > +return 0;
> > +}
> > +
> >   static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >   {
> >   NvmeCtrl *n = NVME(pci_dev);
> >   NvmeIdCtrl *id = &n->id_ctrl;
> >   int i;
> > -int64_t bs_size;
> >   uint8_t *pci_conf;
> >   if (nvme_check_constraints(n, errp)) {
> > @@ -1374,12 +1396,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   nvme_init_state(n);
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg(errp, "could not get backing file size");
> > -return;
> > -}
> > -
> >   if (nvme_init_blk(n, errp)) {
> >   return;
> >   }
> > @@ -1390,8 +1406,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> >   pcie_endpoint_cap_init(pci_dev, 0x80);
> > -n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> 
> I'm not sure this line belong to this patch.
> 
 
It does. It is already there in the middle of the realize function. It
is moved to nvme_init_namespace as

  n->ns_size = bs_size;

since only a single namespace can be configured anyway. I will remove
the for-loop that initializes multiple namespaces as well.



Re: [PATCH 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 12:53, Klaus Birkelund Jensen wrote:
> On Apr 15 12:38, Philippe Mathieu-Daudé wrote:
> > 
> > I'm not sure this line belong to this patch.
> > 
>  
> It does. It is already there in the middle of the realize function. It
> is moved to nvme_init_namespace as
> 
>   n->ns_size = bs_size;
> 
> since only a single namespace can be configured anyway. I will remove
> the for-loop that initializes multiple namespaces as well.

I'm gonna backtrack on that. Removing that for loop is just noise I
think.



Re: [PATCH 11/16] nvme: factor out block backend setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 12:52, Philippe Mathieu-Daudé wrote:
> On 4/15/20 12:24 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.c | 15 ---
> >   1 file changed, 12 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index e67f578fbf79..f0989cbb4335 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1348,6 +1348,17 @@ static void nvme_init_state(NvmeCtrl *n)
> >   n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
> >   }
> > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > +{
> > +blkconf_blocksizes(&n->conf);
> > +if (!blkconf_apply_backend_options(&n->conf, 
> > blk_is_read_only(n->conf.blk),
> > +   false, errp)) {
> > +return -1;
> > +}
> > +
> > +return 0;
> 
> I'm not sure this is a correct usage of the 'propagating errors' API (see
> CODING_STYLE.rst and include/qapi/error.h), I'd expect this function to
> return void, and use a local_error & error_propagate() in nvme_realize().
> 
> However this works, so:
> Reviewed-by: Philippe Mathieu-Daudé 
> 

So, I get that and did use the propagate functionality earlier. But I
still used the int return. I'm not sure about the style if returning
void - should I check if errp is now non-NULL? Point is that I need to
return early since the later calls could fail if previous calls did not
complete successfully.



Re: [PATCH 11/16] nvme: factor out block backend setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 13:02, Klaus Birkelund Jensen wrote:
> On Apr 15 12:52, Philippe Mathieu-Daudé wrote:
> > On 4/15/20 12:24 PM, Klaus Jensen wrote:
> > > From: Klaus Jensen 
> > > 
> > > Signed-off-by: Klaus Jensen 
> > > ---
> > >   hw/block/nvme.c | 15 ---
> > >   1 file changed, 12 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index e67f578fbf79..f0989cbb4335 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -1348,6 +1348,17 @@ static void nvme_init_state(NvmeCtrl *n)
> > >   n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
> > >   }
> > > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > > +{
> > > +blkconf_blocksizes(&n->conf);
> > > +if (!blkconf_apply_backend_options(&n->conf, 
> > > blk_is_read_only(n->conf.blk),
> > > +   false, errp)) {
> > > +return -1;
> > > +}
> > > +
> > > +return 0;
> > 
> > I'm not sure this is a correct usage of the 'propagating errors' API (see
> > CODING_STYLE.rst and include/qapi/error.h), I'd expect this function to
> > return void, and use a local_error & error_propagate() in nvme_realize().
> > 
> > However this works, so:
> > Reviewed-by: Philippe Mathieu-Daudé 
> > 
> 
> So, I get that and did use the propagate functionality earlier. But I
> still used the int return. I'm not sure about the style if returning
> void - should I check if errp is now non-NULL? Point is that I need to
> return early since the later calls could fail if previous calls did not
> complete successfully.

Nevermind, I got it. I've changed it to propagate it correctly.
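For the archives, the pattern I ended up with looks roughly like this
(a sketch of the void-returning style, not the exact diff):

    static void nvme_init_blk(NvmeCtrl *n, Error **errp)
    {
        blkconf_blocksizes(&n->conf);
        blkconf_apply_backend_options(&n->conf,
                                      blk_is_read_only(n->conf.blk),
                                      false, errp);
    }

    static void nvme_realize(PCIDevice *pci_dev, Error **errp)
    {
        NvmeCtrl *n = NVME(pci_dev);
        Error *local_err = NULL;

        nvme_init_blk(n, &local_err);
        if (local_err) {
            /* return early; the later setup steps depend on this one */
            error_propagate(errp, local_err);
            return;
        }

        /* remaining setup elided */
    }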



Re: [PATCH v2 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 15:16, Philippe Mathieu-Daudé wrote:
> On 4/15/20 3:01 PM, Klaus Jensen wrote:
> > From: Klaus Jensen 
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >   hw/block/nvme.c | 46 ++
> >   1 file changed, 26 insertions(+), 20 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index d5244102252c..2b007115c302 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1358,6 +1358,27 @@ static void nvme_init_blk(NvmeCtrl *n, Error **errp)
> > false, errp);
> >   }
> > +static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error 
> > **errp)
> > +{
> > +int64_t bs_size;
> > +NvmeIdNs *id_ns = &ns->id_ns;
> > +
> > +bs_size = blk_getlength(n->conf.blk);
> > +if (bs_size < 0) {
> > +error_setg_errno(errp, -bs_size, "could not get backing file 
> > size");
> > +return;
> > +}
> > +
> > +n->ns_size = bs_size;
> > +
> > +id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
> > +
> > +/* no thin provisioning */
> > +id_ns->ncap = id_ns->nsze;
> > +id_ns->nuse = id_ns->ncap;
> > +}
> > +
> >   static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >   {
> >   NvmeCtrl *n = NVME(pci_dev);
> > @@ -1365,7 +1386,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   Error *err = NULL;
> >   int i;
> > -int64_t bs_size;
> >   uint8_t *pci_conf;
> >   nvme_check_constraints(n, &err);
> > @@ -1376,12 +1396,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   nvme_init_state(n);
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg(errp, "could not get backing file size");
> > -return;
> > -}
> > -
> >   nvme_init_blk(n, &err);
> >   if (err) {
> >   error_propagate(errp, err);
> > @@ -1394,8 +1408,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
> > **errp)
> >   pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> >   pcie_endpoint_cap_init(pci_dev, 0x80);
> > -n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> 
> Valid because currently 'n->num_namespaces' = 1, OK.
> 
> Reviewed-by: Philippe Mathieu-Daudé 
> 
 
Thank you for the reviews Philippe and the suggesting that I split up
the series :)

I'll get the v1.3 series ready next.



Re: [PATCH v2 13/16] nvme: factor out namespace setup

2020-04-15 Thread Klaus Birkelund Jensen
On Apr 15 15:26, Philippe Mathieu-Daudé wrote:
> On 4/15/20 3:20 PM, Klaus Birkelund Jensen wrote:
> > 
> > I'll get the v1.3 series ready next.
> > 
> 
> Cool. What really matters (to me) is seeing tests. If we can merge tests
> (without multiple namespaces) before the rest of your series, even better.
> Tests give reviewers/maintainers confidence that code isn't breaking ;)
> 

The patches that I contribute have been pretty extensively tested by
various means in a "host setting" (e.g. blktests and some internal
tools), which really exercise the device by doing heavy I/O, testing for
compliance and also just being mean to it (e.g. tripping bus mastering
while doing I/O).

Don't misunderstand me as trying to weasel my way out of writing tests,
but I just want to understand the scope of the tests that you are
looking for? I believe (hope!) that you are not asking me to implement a
user-space NVMe driver in the test, so I assume the tests should verify
more low-level details?



Re: [PATCH v2 00/16] nvme: refactoring and cleanups

2020-04-19 Thread Klaus Birkelund Jensen
On Apr 15 15:01, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Changes since v1
> 
> * nvme: fix pci doorbell size calculation
>   - added some defines and a better comment (Philippe)
> 
> * nvme: rename trace events to pci_nvme
>   - changed the prefix from nvme_dev to pci_nvme (Philippe)
> 
> * nvme: add max_ioqpairs device parameter
>   - added a deprecation comment. I doubt this will go in until 5.1, so
> changed it to "deprecated from 5.1" (Philippe)
> 
> * nvme: factor out property/constraint checks
> * nvme: factor out block backend setup
>   - changed to return void and propagate errors in proper QEMU style
> (Philippe)
> 
> * nvme: add namespace helpers
>   - use the helper immediately (Philippe)
> 
> * nvme: factor out pci setup
>   - removed setting of vendor and device id which is already inherited
> from nvme_class_init() (Philippe)
> 
> * nvme: factor out cmb setup
>   - add lost comment (Philippe)
> 
> 
> Klaus Jensen (16):
>   nvme: fix pci doorbell size calculation
>   nvme: rename trace events to pci_nvme
>   nvme: remove superfluous breaks
>   nvme: move device parameters to separate struct
>   nvme: use constants in identify
>   nvme: refactor nvme_addr_read
>   nvme: add max_ioqpairs device parameter
>   nvme: remove redundant cmbloc/cmbsz members
>   nvme: factor out property/constraint checks
>   nvme: factor out device state setup
>   nvme: factor out block backend setup
>   nvme: add namespace helpers
>   nvme: factor out namespace setup
>   nvme: factor out pci setup
>   nvme: factor out cmb setup
>   nvme: factor out controller identify setup
> 
>  hw/block/nvme.c   | 433 --
>  hw/block/nvme.h   |  36 +++-
>  hw/block/trace-events | 172 -
>  include/block/nvme.h  |   8 +
>  4 files changed, 372 insertions(+), 277 deletions(-)
> 
> -- 
> 2.26.0
> 

Hi Keith,

You have acked most of this previously, but not in its most recent
state. Since a good bunch of the refactoring patches have been split up
and changed, only a small subset of the patches still carry your
Acked-by.

The 'nvme: fix pci doorbell size calculation' and 'nvme: add
max_ioqpairs device parameter' are new since your ack and given their
nature a review from you would be nice :)


Thanks,
Klaus



Re: [PATCH v2 00/16] nvme: refactoring and cleanups

2020-04-20 Thread Klaus Birkelund Jensen
On Apr 21 02:38, Keith Busch wrote:
> The series looks good to me.
> 
> Reviewed-by: Keith Busch 

Thanks for the review Keith!

Kevin, should I rebase this on block-next? I think it might have some
conflicts with the PMR patch that went in previously.

Philippe, then I can also change the *err to *local_err ;)



Re: [PATCH v2 00/16] nvme: refactoring and cleanups

2020-04-21 Thread Klaus Birkelund Jensen
On Apr 21 19:24, Maxim Levitsky wrote:
> Should I also review the V7 series or I should wait for V8 which will
> not include these cleanups?
 
Hi Maxim,

Just wait for another series - I don't think I will post a v8, I will
chop up the series into smaller ones instead.

Most patches will hopefully not change too much, so should keep your
Reviewed-by's ;)


Thanks,
Klaus



Re: [PATCH v4 22/24] nvme: bump controller pci device id

2019-12-19 Thread Klaus Birkelund Jensen
On Dec 20 01:16, Keith Busch wrote:
> On Thu, Dec 19, 2019 at 02:09:19PM +0100, Klaus Jensen wrote:
> > @@ -2480,7 +2480,7 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice 
> > *pci_dev)
> >  pci_conf[PCI_INTERRUPT_PIN] = 1;
> >  pci_config_set_prog_interface(pci_conf, 0x2);
> >  pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> > -pci_config_set_device_id(pci_conf, 0x5845);
> > +pci_config_set_device_id(pci_conf, 0x5846);
> >  pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
> >  pcie_endpoint_cap_init(pci_dev, 0x80);
> 
> We can't just pick a number here, these are supposed to be assigned by the
> vendor. A day will come when I will be in trouble for using the existing
> identifier: I found out to late it was supposed to be for internal use
> only as it was never officially reserved, so lets not make the same
> mistake for some future device.
> 

Makes sense. And there is no "QEMU" vendor, is there? But it would be
really nice to get rid of the quirks.



Re: [PATCH v4 21/24] nvme: support multiple namespaces

2019-12-19 Thread Klaus Birkelund Jensen
On Dec 19 16:11, Michal Prívozník wrote:
> On 12/19/19 2:09 PM, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> 
> Klaus, just to make sure I understand correctly, this implements
> multiple namespaces for *emulated* NVMe, right? I'm asking because I
> just merged libvirt patches to support:
> 
> -drive
> file.driver=nvme,file.device=:01:00.0,file.namespace=1,format=raw,if=none,id=drive-virtio-disk0
> -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
> 
> and seeing these patches made me doubt my design. But if your patches
> touch emulated NVMe only, then libvirt's fine because it doesn't expose
> that just yet.
> 
> Michal
> 

Hi Michal,

Yes, this is only for the emulated nvme controller.



Re: [PATCH v4 22/24] nvme: bump controller pci device id

2019-12-19 Thread Klaus Birkelund Jensen
On Dec 20 02:46, Keith Busch wrote:
> On Thu, Dec 19, 2019 at 06:24:57PM +0100, Klaus Birkelund Jensen wrote:
> > On Dec 20 01:16, Keith Busch wrote:
> > > On Thu, Dec 19, 2019 at 02:09:19PM +0100, Klaus Jensen wrote:
> > > > @@ -2480,7 +2480,7 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice 
> > > > *pci_dev)
> > > >  pci_conf[PCI_INTERRUPT_PIN] = 1;
> > > >  pci_config_set_prog_interface(pci_conf, 0x2);
> > > >  pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> > > > -pci_config_set_device_id(pci_conf, 0x5845);
> > > > +pci_config_set_device_id(pci_conf, 0x5846);
> > > >  pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
> > > >  pcie_endpoint_cap_init(pci_dev, 0x80);
> > > 
> > > We can't just pick a number here, these are supposed to be assigned by the
> > > vendor. A day will come when I will be in trouble for using the existing
> > > identifier: I found out to late it was supposed to be for internal use
> > > only as it was never officially reserved, so lets not make the same
> > > mistake for some future device.
> > > 
> > 
> > Makes sense. And there is no "QEMU" vendor, is there?
> 
> I'm not sure if we can use this, but there is a PCI_VENDOR_ID_QEMU,
> 0x1234, defined in include/hw/pci/pci.h.
> 

Maybe it's possible to use PCI_VENDOR_ID_REDHAT?
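I.e. something along these lines - purely a sketch, and the 0x0010
device id below is a placeholder that would first have to be properly
reserved in QEMU's PCI id registry:

    pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_REDHAT);
    /* placeholder device id - must be allocated/reserved first */
    pci_config_set_device_id(pci_conf, 0x0010);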



Re: [PATCH v4 19/24] nvme: handle dma errors

2020-01-13 Thread Klaus Birkelund Jensen
On Jan  9 11:35, Beata Michalska wrote:
> Hi Klaus,
> 

Hi Beata,

Your reviews are, as always, much appreciated! Thanks!

> On Thu, 19 Dec 2019 at 13:09, Klaus Jensen  wrote:
> > @@ -1595,7 +1611,12 @@ static void nvme_process_sq(void *opaque)
> >
> >  while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
> >  addr = sq->dma_addr + sq->head * n->sqe_size;
> > -nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd));
> > +if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
> > +trace_nvme_dev_err_addr_read(addr);
> > +timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> > +100 * SCALE_MS);
> > +break;
> > +}
> 
> Is there a chance we will end up repeatedly triggering the read error here
> as this will come back to the same memory location each time (the sq->head
> is not moving here) ?
> 

Absolutely, and that was the point. Not being able to read the
submission queue is pretty bad, so the device just keeps retrying every
100 ms. This is the same for when writing to the completion queue fails.

But... it would probably be prudent to track how long it has been since
a successful DMA transfer was done and time out, shutting down the
device - say, after 60 seconds. I'll try to add something like that.
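Something along these lines, perhaps (a sketch only; the
last_dma_ok_ns field and the nvme_ctrl_shutdown() helper are made-up
names, and the CMB path is omitted):

    #define NVME_DMA_ERR_TIMEOUT_NS (60 * NANOSECONDS_PER_SECOND)

    static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
    {
        int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);

        if (pci_dma_read(&n->parent_obj, addr, buf, size)) {
            /*
             * Shut the controller down if no transfer has succeeded
             * for a full minute.
             */
            if (now - n->last_dma_ok_ns > NVME_DMA_ERR_TIMEOUT_NS) {
                nvme_ctrl_shutdown(n);
            }

            return -1;
        }

        n->last_dma_ok_ns = now;

        return 0;
    }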


Thanks,
Klaus


Re: [PATCH v4 17/24] nvme: allow multiple aios per command

2020-01-13 Thread Klaus Birkelund Jensen
On Jan  9 11:40, Beata Michalska wrote:
> Hi Klaus,
> 
> On Thu, 19 Dec 2019 at 13:09, Klaus Jensen  wrote:
> > +static NvmeAIO *nvme_aio_new(BlockBackend *blk, int64_t offset, size_t len,
> > +QEMUSGList *qsg, QEMUIOVector *iov, NvmeRequest *req,
> > +NvmeAIOCompletionFunc *cb)
> 
> Minor: The indentation here (and in a few other places across the patchset)
> does not seem right . And maybe inline ?

I tried to follow the style in CODING_STYLE.rst for "Multiline Indent",
but the style for function definitions is a bit underspecified.

I can change it to align with the opening parenthesis. I just found the
"one indent" more readable for these long function definitions.

> Also : seems that there are cases when some of the parameters are
> not required (NULL) , maybe having a simplified version for those cases
> might be useful ?
> 

True. Actually - at this point in the series there are no users of the
NvmeAIOCompletionFunc. It is preparatory for other patches I have in the
pipeline. But I'll clean it up.

> > +static void nvme_aio_cb(void *opaque, int ret)
> > +{
> > +NvmeAIO *aio = opaque;
> > +NvmeRequest *req = aio->req;
> > +
> > +BlockBackend *blk = aio->blk;
> > +BlockAcctCookie *acct = &aio->acct;
> > +BlockAcctStats *stats = blk_get_stats(blk);
> > +
> > +Error *local_err = NULL;
> > +
> > +trace_nvme_dev_aio_cb(nvme_cid(req), aio, blk_name(blk), aio->offset,
> > +nvme_aio_opc_str(aio), req);
> > +
> > +if (req) {
> > +QTAILQ_REMOVE(&req->aio_tailq, aio, tailq_entry);
> > +}
> > +
> >  if (!ret) {
> > -block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
> > -req->status = NVME_SUCCESS;
> > +block_acct_done(stats, acct);
> > +
> > +if (aio->cb) {
> > +aio->cb(aio, aio->cb_arg);
> 
> We are dropping setting status to SUCCESS here,
> is that expected ?

Yes, that is on purpose. nvme_aio_cb is called for *each* issued AIO and
we do not want to overwrite a previously set error status with a success
(if one aio in the request fails even though others succeed, it should
not go unnoticed). Note that NVME_SUCCESS is the default setting in the
request, so if no one sets an error code we are still good.

> Also the aio callback will not get
> called case failure and it probably should ?
> 

I tried both but ended up with just not calling it on failure, but I
think that in the future some AIO callbacks might want to take a
different action if the request failed, so I'll add it back in and add
the aio return value (ret) to the callback function definition.
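So the callback will end up looking roughly like this (sketch only;
the cb signature with ret is what I intend to do, not the posted code):

    static void nvme_aio_cb(void *opaque, int ret)
    {
        NvmeAIO *aio = opaque;
        NvmeRequest *req = aio->req;

        if (!ret) {
            block_acct_done(blk_get_stats(aio->blk), &aio->acct);
        } else {
            block_acct_failed(blk_get_stats(aio->blk), &aio->acct);

            /*
             * Record the first failure only; a later successful aio
             * must never overwrite an earlier error status.
             */
            if (req && req->status == NVME_SUCCESS) {
                req->status = NVME_INTERNAL_DEV_ERROR;
            }
        }

        /*
         * Always invoke the completion callback and pass the result so
         * it can react differently to failures.
         */
        if (aio->cb) {
            aio->cb(aio, aio->cb_arg, ret);
        }

        /* remaining completion handling elided */
    }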


Thanks,
Klaus


Re: [PATCH v4 20/24] nvme: add support for scatter gather lists

2020-01-13 Thread Klaus Birkelund Jensen
On Jan  9 11:44, Beata Michalska wrote:
> Hi Klaus,
> 
> On Thu, 19 Dec 2019 at 13:09, Klaus Jensen  wrote:
> > @@ -73,7 +73,12 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr 
> > addr)
> >
> >  static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> > -if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> > +hwaddr hi = addr + size;
> > +if (hi < addr) {
> 
> What is the actual use case for that ?

This was for detecting wrap-around in the unsigned addition. I found
that nvme_map_sgl does not check if addr + size is out of bounds (which
it should). With that in place this check is belt and braces, so I might
remove it.
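For reference, it is just the usual unsigned overflow idiom:

    /* true if addr + size wraps around the hwaddr space */
    static inline bool nvme_addr_overflows(hwaddr addr, size_t size)
    {
        return addr + size < addr;
    }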

> > +static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
> > +QEMUIOVector *iov, NvmeSglDescriptor *segment, uint64_t nsgld,
> > +uint32_t *len, bool is_cmb, NvmeRequest *req)
> > +{
> > +dma_addr_t addr, trans_len;
> > +uint16_t status;
> > +
> > +for (int i = 0; i < nsgld; i++) {
> > +if (NVME_SGL_TYPE(segment[i].type) != SGL_DESCR_TYPE_DATA_BLOCK) {
> > +trace_nvme_dev_err_invalid_sgl_descriptor(nvme_cid(req),
> > +NVME_SGL_TYPE(segment[i].type));
> > +return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
> > +}
> > +
> > +if (*len == 0) {
> > +if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
> > +
> > trace_nvme_dev_err_invalid_sgl_excess_length(nvme_cid(req));
> > +return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > +}
> > +
> > +break;
> > +}
> > +
> > +addr = le64_to_cpu(segment[i].addr);
> > +trans_len = MIN(*len, le64_to_cpu(segment[i].len));
> > +
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +/*
> > + * All data and metadata, if any, associated with a particular
> > + * command shall be located in either the CMB or host memory. 
> > Thus,
> > + * if an address if found to be in the CMB and we have already
> 
> s/address if/address is ?

Fixed, thanks.

> > +static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector 
> > *iov,
> > +NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
> > +{
> > +const int MAX_NSGLD = 256;
> > +
> > +NvmeSglDescriptor segment[MAX_NSGLD];
> > +uint64_t nsgld;
> > +uint16_t status;
> > +bool is_cmb = false;
> > +bool sgl_in_cmb = false;
> > +hwaddr addr = le64_to_cpu(sgl.addr);
> > +
> > +trace_nvme_dev_map_sgl(nvme_cid(req), NVME_SGL_TYPE(sgl.type), 
> > req->nlb, len);
> > +
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +is_cmb = true;
> > +
> > +qemu_iovec_init(iov, 1);
> > +} else {
> > +pci_dma_sglist_init(qsg, &n->parent_obj, 1);
> > +}
> > +
> > +/*
> > + * If the entire transfer can be described with a single data block it 
> > can
> > + * be mapped directly.
> > + */
> > +if (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_DATA_BLOCK) {
> > +status = nvme_map_sgl_data(n, qsg, iov, &sgl, 1, &len, is_cmb, 
> > req);
> > +if (status) {
> > +goto unmap;
> > +}
> > +
> > +goto out;
> > +}
> > +
> > +/*
> > + * If the segment is located in the CMB, the submission queue of the
> > + * request must also reside there.
> > + */
> > +if (nvme_addr_is_cmb(n, addr)) {
> > +if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
> > +return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +}
> > +
> > +sgl_in_cmb = true;
> 
> Why not combining this with the condition few lines above
> for the nvme_addr_is_cmb ? Also is the sgl_in_cmb really needed ?
> If the address is from CMB, that  implies the queue is also there,
> otherwise we wouldn't progress beyond this point. Isn't is_cmb sufficient ?
> 

You are right, there is no need for sgl_in_cmb.

But checking if the queue is in the cmb only needs to be done if the
descriptor in DPTR is *not* a "singleton" data block. But I think I can
refactor it to be slightly nicer, or at least be more specific in the
comments.
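I.e. something like this (a sketch of the refactored check, reusing the
names from the patch):

    /*
     * A single data block descriptor references no further
     * descriptors, so the location of the submission queue only
     * matters when DPTR points to an actual segment.
     */
    if (NVME_SGL_TYPE(sgl.type) != SGL_DESCR_TYPE_DATA_BLOCK &&
        nvme_addr_is_cmb(n, addr) &&
        !nvme_addr_is_cmb(n, req->sq->dma_addr)) {
        return NVME_INVALID_USE_OF_CMB | NVME_DNR;
    }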

> > +}
> > +
> > +while (NVME_SGL_TYPE(sgl.type) == SGL_DESCR_TYPE_SEGMENT) {
> > +bool addr_is_cmb;
> > +
> > +nsgld = le64_to_cpu(sgl.len) / sizeof(NvmeSglDescriptor);
> > +
> > +/* read the segment in chunks of 256 descriptors (4k) */
> > +while (nsgld > MAX_NSGLD) {
> > +if (nvme_addr_read(n, addr, segment, sizeof(segment))) {
> > +trace_nvme_dev_err_addr_read(addr);
> > +status = NVME_DATA_TRANSFER_ERROR;
> > +goto unmap;
> > +}
> > +
> > +status = nvme_map_sgl_data(n, qsg, iov, segment, MAX_NSGLD, 
> > &len,
> > +is_cmb, req);
> 
> This will probably fail if there is a BitBucket Descriptor on the way (?)
> 

nvme_map_sgl_data will error out on any descriptors different from
"DATA_BLOCK". So I

[Qemu-block] [PATCH 4/8] nvme: allow multiple i/o's per request

2019-05-17 Thread Klaus Birkelund Jensen
Introduce a new NvmeBlockBackendRequest and move the QEMUSGList and
QEMUIOVector from the NvmeRequest.

This is in preparation for metadata support and makes it easier to
handle multiple block backend requests to different offsets.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 319 --
 hw/block/nvme.h   |  47 +--
 hw/block/trace-events |   2 +
 3 files changed, 286 insertions(+), 82 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 453213f9abb4..c514f93f3867 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -322,6 +322,88 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 return err;
 }
 
+static void nvme_blk_req_destroy(NvmeBlockBackendRequest *blk_req)
+{
+if (blk_req->qsg.nalloc) {
+qemu_sglist_destroy(&blk_req->qsg);
+}
+
+if (blk_req->iov.nalloc) {
+qemu_iovec_destroy(&blk_req->iov);
+}
+
+g_free(blk_req);
+}
+
+static void nvme_blk_req_put(NvmeCtrl *n, NvmeBlockBackendRequest *blk_req)
+{
+nvme_blk_req_destroy(blk_req);
+}
+
+static NvmeBlockBackendRequest *nvme_blk_req_get(NvmeCtrl *n, NvmeRequest *req,
+QEMUSGList *qsg)
+{
+NvmeBlockBackendRequest *blk_req = g_malloc0(sizeof(*blk_req));
+
+blk_req->req = req;
+
+if (qsg) {
+pci_dma_sglist_init(&blk_req->qsg, &n->parent_obj, qsg->nsg);
+memcpy(blk_req->qsg.sg, qsg->sg, qsg->nsg * 
sizeof(ScatterGatherEntry));
+
+blk_req->qsg.nsg = qsg->nsg;
+blk_req->qsg.size = qsg->size;
+}
+
+return blk_req;
+}
+
+static uint16_t nvme_blk_setup(NvmeCtrl *n, NvmeNamespace *ns, QEMUSGList *qsg,
+uint64_t blk_offset, uint32_t unit_len, NvmeRequest *req)
+{
+NvmeBlockBackendRequest *blk_req = nvme_blk_req_get(n, req, qsg);
+if (!blk_req) {
+NVME_GUEST_ERR(nvme_err_internal_dev_error, "nvme_blk_req_get: %s",
+"could not allocate memory");
+return NVME_INTERNAL_DEV_ERROR;
+}
+
+blk_req->slba = req->slba;
+blk_req->nlb = req->nlb;
+blk_req->blk_offset = blk_offset + req->slba * unit_len;
+
+QTAILQ_INSERT_TAIL(&req->blk_req_tailq, blk_req, tailq_entry);
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+NvmeNamespace *ns = req->ns;
+uint16_t err;
+
+QEMUSGList qsg;
+
+uint32_t unit_len = nvme_ns_lbads_bytes(ns);
+uint32_t len = req->nlb * unit_len;
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+err = nvme_map_prp(n, &qsg, prp1, prp2, len, req);
+if (err) {
+return err;
+}
+
+err = nvme_blk_setup(n, ns, &qsg, ns->blk_offset, unit_len, req);
+if (err) {
+return err;
+}
+
+qemu_sglist_destroy(&qsg);
+
+return NVME_SUCCESS;
+}
+
 static void nvme_post_cqe(NvmeCQueue *cq, NvmeRequest *req)
 {
 NvmeCtrl *n = cq->ctrl;
@@ -447,114 +529,190 @@ static void nvme_process_aers(void *opaque)
 
 static void nvme_rw_cb(void *opaque, int ret)
 {
-NvmeRequest *req = opaque;
+NvmeBlockBackendRequest *blk_req = opaque;
+NvmeRequest *req = blk_req->req;
 NvmeSQueue *sq = req->sq;
 NvmeCtrl *n = sq->ctrl;
 NvmeCQueue *cq = n->cq[sq->cqid];
+NvmeNamespace *ns = req->ns;
+
+QTAILQ_REMOVE(&req->blk_req_tailq, blk_req, tailq_entry);
+
+trace_nvme_rw_cb(req->cqe.cid, ns->id);
 
 if (!ret) {
-block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
-req->status = NVME_SUCCESS;
+block_acct_done(blk_get_stats(n->conf.blk), &blk_req->acct);
 } else {
-block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
-req->status = NVME_INTERNAL_DEV_ERROR;
+block_acct_failed(blk_get_stats(n->conf.blk), &blk_req->acct);
+NVME_GUEST_ERR(nvme_err_internal_dev_error, "block request failed: %s",
+strerror(-ret));
+req->status = NVME_INTERNAL_DEV_ERROR | NVME_DNR;
 }
 
-if (req->qsg.nalloc) {
-qemu_sglist_destroy(&req->qsg);
-}
-if (req->iov.nalloc) {
-qemu_iovec_destroy(&req->iov);
+if (QTAILQ_EMPTY(&req->blk_req_tailq)) {
+nvme_enqueue_req_completion(cq, req);
 }
 
-nvme_enqueue_req_completion(cq, req);
+nvme_blk_req_put(n, blk_req);
 }
 
-static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
-NvmeRequest *req)
+static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
-block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
+NvmeBlockBackendRequest *blk_req = nvme_blk_req_get(n, req, NULL);
+if (!blk_req) {
+NVME_GUEST_ERR(nvme_err_internal_dev_error, "nvme_b

[Qemu-block] [PATCH 0/8] nvme: v1.3, sgls, metadata and new 'ocssd' device

2019-05-17 Thread Klaus Birkelund Jensen
Hi,

This series of patches contains a number of refactorings to the emulated
nvme device, adds additional features, such as support for metadata and
scatter gather lists, and bumps the supported NVMe version to 1.3.
Lastly, it contains a new 'ocssd' device.

The motivation for the first seven patches is to set everything up for
the final patch that adds a new 'ocssd' device and associated block
driver that implements the OpenChannel 2.0 specification[1]. Many of us
in the OpenChannel community have used a qemu fork[2] for emulation of
OpenChannel devices. The fork is itself based on Keith's qemu-nvme
tree[3] and we recently merged mainline qemu into it, but the result is
still a "hybrid" nvme device that supports both conventional nvme and
the OCSSD 2.0 spec through a 'dialect' mechanism. Merging instead of
rebasing also created a pretty messy commit history and my efforts to
try to rebase our work onto mainline were getting hairy to say the
least. And I was never really happy with the dialect approach anyway.

I have instead prepared this series of fresh patches that incrementally
adds additional features to the nvme device to bring it into shape for
finally introducing a new (and separate) 'ocssd' device that emulates an
OpenChannel 2.0 device by reusing core functionality from the nvme
device. Providing a separate ocssd device ensures that no ocssd specific
stuff creeps into the nvme device.

The ocssd device is backed by a new 'ocssd' block driver that holds
internal meta data and keeps state permanent across power cycles. In the
future I think we could use the same approach for the nvme device to
keep internal metadata such as utilization and deallocated blocks. For
now, the nvme device does not support the Deallocated and Unwritten
Logical Block Error (DULBE) feature or the Data Set Management command
as this would require such support.

I have tried to make the patches to the nvme device in this series as
digestible as possible, but I understand that commit 310fcd5965e5 ("nvme:
bump supported spec to 1.3") is pretty huge. I can try to chop it up if
required, but the changes pretty much needs to be done in bulk to
actually implement v1.3.

This version was recently used to find a bug in use of SGLs in the Linux
kernel, so I believe there is some value in introducing these new
features. As for the ocssd device I believe that it is time it is
included upstream and not kept separately. I have knowledge of at least
one other qemu fork implementing OCSSD 2.0 used by the SPDK team and I
think we could all benefit from using a common implementation. The ocssd
device is feature complete with respect to the OCSSD 2.0 spec (mandatory
as well as optional features).

  [1]: http://lightnvm.io/docs/OCSSD-2_0-20180129.pdf
  [2]: https://github.com/OpenChannelSSD/qemu-nvme
  [3]: http://git.infradead.org/users/kbusch/qemu-nvme.git


Klaus Birkelund Jensen (8):
  nvme: move device parameters to separate struct
  nvme: bump supported spec to 1.3
  nvme: simplify PRP mappings
  nvme: allow multiple i/o's per request
  nvme: add support for metadata
  nvme: add support for scatter gather lists
  nvme: keep a copy of the NVMe command in request
  nvme: add an OpenChannel 2.0 NVMe device (ocssd)

 MAINTAINERS|   14 +-
 Makefile.objs  |1 +
 block.c|2 +-
 block/Makefile.objs|2 +-
 block/nvme.c   |   20 +-
 block/ocssd.c  |  690 ++
 hw/block/Makefile.objs |2 +-
 hw/block/nvme.c| 1405 ---
 hw/block/nvme.h|   92 --
 hw/block/nvme/nvme.c   | 2485 +
 hw/block/nvme/ocssd.c  | 2647 
 hw/block/nvme/ocssd.h  |  140 ++
 hw/block/nvme/trace-events |  136 ++
 hw/block/trace-events  |   91 --
 include/block/block_int.h  |3 +
 include/block/nvme.h   |  152 ++-
 include/block/ocssd.h  |  231 
 include/hw/block/nvme.h|  233 
 include/hw/pci/pci_ids.h   |2 +
 qapi/block-core.json   |   47 +-
 20 files changed, 6774 insertions(+), 1621 deletions(-)
 create mode 100644 block/ocssd.c
 delete mode 100644 hw/block/nvme.c
 delete mode 100644 hw/block/nvme.h
 create mode 100644 hw/block/nvme/nvme.c
 create mode 100644 hw/block/nvme/ocssd.c
 create mode 100644 hw/block/nvme/ocssd.h
 create mode 100644 hw/block/nvme/trace-events
 create mode 100644 include/block/ocssd.h
 create mode 100644 include/hw/block/nvme.h

-- 
2.21.0



[Qemu-block] [PATCH 1/8] nvme: move device parameters to separate struct

2019-05-17 Thread Klaus Birkelund Jensen
Move device configuration parameters to separate struct to make it
explicit what is configurable and what is set internally.

Also, clean up some includes.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 53 +++--
 hw/block/nvme.h | 16 ---
 2 files changed, 38 insertions(+), 31 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 7caf92532a09..b689c0776e72 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -27,17 +27,14 @@
 
 #include "qemu/osdep.h"
 #include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
 #include "hw/block/block.h"
-#include "hw/hw.h"
 #include "hw/pci/msix.h"
-#include "hw/pci/pci.h"
 #include "sysemu/sysemu.h"
-#include "qapi/error.h"
-#include "qapi/visitor.h"
 #include "sysemu/block-backend.h"
+#include "qapi/error.h"
 
-#include "qemu/log.h"
-#include "qemu/cutils.h"
 #include "trace.h"
 #include "nvme.h"
 
@@ -62,12 +59,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->num_queues && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->num_queues && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -605,7 +602,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_nvme_err_invalid_create_cq_addr(prp1);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
-if (unlikely(vector > n->num_queues)) {
+if (unlikely(vector > n->params.num_queues)) {
 trace_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -707,7 +704,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 
16));
+result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 trace_nvme_getfeat_numq(result);
 break;
 default:
@@ -731,9 +729,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_NUMBER_OF_QUEUES:
 trace_nvme_setfeat_numq((dw11 & 0x) + 1,
 ((dw11 >> 16) & 0x) + 1,
-n->num_queues - 1, n->num_queues - 1);
-req->cqe.result =
-cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+n->params.num_queues - 1,
+n->params.num_queues - 1);
+req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
+  ((n->params.num_queues - 2) << 16));
 break;
 default:
 trace_nvme_err_invalid_setfeat(dw10);
@@ -802,12 +801,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
 blk_drain(n->conf.blk);
 
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->sq[i] != NULL) {
 nvme_free_sq(n->sq[i], n);
 }
 }
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->cq[i] != NULL) {
 nvme_free_cq(n->cq[i], n);
 }
@@ -1208,7 +1207,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 int64_t bs_size;
 uint8_t *pci_conf;
 
-if (!n->num_queues) {
+if (!n->params.num_queues) {
 error_setg(errp, "num_queues can't be zero");
 return;
 }
@@ -1224,7 +1223,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-if (!n->serial) {
+if (!n->params.serial) {
 error_setg(errp, "serial property not set");
 return;
 }
@@ -1241,25 +1240,25 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 pcie_endpoint_cap_init(pci_dev, 0x80);
 
 n->num_namespaces = 1;
-n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
+n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
 n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
-  

[Qemu-block] [PATCH 7/8] nvme: keep a copy of the NVMe command in request

2019-05-17 Thread Klaus Birkelund Jensen
Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 4 ++--
 hw/block/nvme.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 81201a8b4834..5cd593806701 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -184,7 +184,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, 
uint64_t prp1,
 int num_prps = (len >> n->page_bits) + 1;
 uint16_t status = NVME_SUCCESS;
 
-trace_nvme_map_prp(req->cmd_opcode, trans_len, len, prp1, prp2, num_prps);
+trace_nvme_map_prp(req->cmd.opcode, trans_len, len, prp1, prp2, num_prps);
 
 if (unlikely(!prp1)) {
 trace_nvme_err_invalid_prp();
@@ -1559,7 +1559,7 @@ static void nvme_init_req(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 memset(&req->cqe, 0, sizeof(req->cqe));
 req->cqe.cid = le16_to_cpu(cmd->cid);
 
-req->cmd_opcode = cmd->opcode;
+memcpy(&req->cmd, cmd, sizeof(NvmeCmd));
 req->is_cmb = false;
 
 req->status = NVME_SUCCESS;
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 70f4781a1b61..7e1e026d90e6 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -52,7 +52,7 @@ typedef struct NvmeRequest {
 uint16_t status;
 bool is_cmb;
 bool is_write;
-uint8_t  cmd_opcode;
+NvmeCmd  cmd;
 
 QTAILQ_HEAD(, NvmeBlockBackendRequest) blk_req_tailq;
 QTAILQ_ENTRY(NvmeRequest)entry;
@@ -143,7 +143,7 @@ typedef struct NvmeCtrl {
 
 static inline bool nvme_rw_is_write(NvmeRequest *req)
 {
-return req->cmd_opcode == NVME_CMD_WRITE;
+return req->cmd.opcode == NVME_CMD_WRITE;
 }
 
 static inline bool nvme_is_error(uint16_t status, uint16_t err)
-- 
2.21.0




[Qemu-block] [PATCH 5/8] nvme: add support for metadata

2019-05-17 Thread Klaus Birkelund Jensen
The new `ms` parameter may be used to indicate the number of metadata
bytes provided per LBA.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 31 +--
 hw/block/nvme.h | 11 ++-
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index c514f93f3867..675967a596d1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -33,6 +33,8 @@
  *   num_ns=  : Namespaces to make out of the backing storage,
  *   Default:1
  *   num_queues=  : Number of possible IO Queues, Default:64
+ *   ms=  : Number of metadata bytes provided per LBA,
+ *   Default:0
  *   cmb_size_mb= : Size of CMB in MBs, Default:0
  *
  * Parameters will be verified against conflicting capabilities and attributes
@@ -386,6 +388,8 @@ static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 
 uint32_t unit_len = nvme_ns_lbads_bytes(ns);
 uint32_t len = req->nlb * unit_len;
+uint32_t meta_unit_len = nvme_ns_ms(ns);
+uint32_t meta_len = req->nlb * meta_unit_len;
 uint64_t prp1 = le64_to_cpu(cmd->prp1);
 uint64_t prp2 = le64_to_cpu(cmd->prp2);
 
@@ -399,6 +403,19 @@ static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return err;
 }
 
+qsg.nsg = 0;
+qsg.size = 0;
+
+if (cmd->mptr && n->params.ms) {
+qemu_sglist_add(&qsg, le64_to_cpu(cmd->mptr), meta_len);
+
+err = nvme_blk_setup(n, ns, &qsg, ns->blk_offset_md, meta_unit_len,
+req);
+if (err) {
+return err;
+}
+}
+
 qemu_sglist_destroy(&qsg);
 
 return NVME_SUCCESS;
@@ -1902,6 +1919,11 @@ static int nvme_check_constraints(NvmeCtrl *n, Error 
**errp)
 return 1;
 }
 
+if (params->ms && !is_power_of_2(params->ms)) {
+error_setg(errp, "nvme: invalid metadata configuration");
+return 1;
+}
+
 return 0;
 }
 
@@ -2066,17 +2088,20 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 
 static uint64_t nvme_ns_calc_blks(NvmeCtrl *n, NvmeNamespace *ns)
 {
-return n->ns_size / nvme_ns_lbads_bytes(ns);
+return n->ns_size / (nvme_ns_lbads_bytes(ns) + nvme_ns_ms(ns));
 }
 
 static void nvme_ns_init_identify(NvmeCtrl *n, NvmeIdNs *id_ns)
 {
+NvmeParams *params = &n->params;
+
 id_ns->nlbaf = 0;
 id_ns->flbas = 0;
-id_ns->mc = 0;
+id_ns->mc = params->ms ? 0x2 : 0;
 id_ns->dpc = 0;
 id_ns->dps = 0;
 id_ns->lbaf[0].lbads = BDRV_SECTOR_BITS;
+id_ns->lbaf[0].ms = params->ms;
 }
 
 static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -2086,6 +2111,8 @@ static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace 
*ns, Error **errp)
 nvme_ns_init_identify(n, id_ns);
 
 ns->ns_blks = nvme_ns_calc_blks(n, ns);
+ns->blk_offset_md = ns->blk_offset + nvme_ns_lbads_bytes(ns) * ns->ns_blks;
+
 id_ns->nuse = id_ns->ncap = id_ns->nsze = cpu_to_le64(ns->ns_blks);
 
 return 0;
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 711ca249eac5..81ee0c5173d5 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -8,13 +8,15 @@
 DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
 DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
 DEFINE_PROP_UINT32("num_ns", _state, _props.num_ns, 1), \
-DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7)
+DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7), \
+DEFINE_PROP_UINT8("ms", _state, _props.ms, 0)
 
 typedef struct NvmeParams {
 char *serial;
 uint32_t num_queues;
 uint32_t num_ns;
 uint8_t  mdts;
+uint8_t  ms;
 uint32_t cmb_size_mb;
 } NvmeParams;
 
@@ -91,6 +93,7 @@ typedef struct NvmeNamespace {
 uint32_tid;
 uint64_tns_blks;
 uint64_tblk_offset;
+uint64_tblk_offset_md;
 } NvmeNamespace;
 
 #define TYPE_NVME "nvme"
@@ -154,4 +157,10 @@ static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
 return 1 << nvme_ns_lbads(ns);
 }
 
+static inline uint16_t nvme_ns_ms(NvmeNamespace *ns)
+{
+NvmeIdNs *id = &ns->id_ns;
+return le16_to_cpu(id->lbaf[NVME_ID_NS_FLBAS_INDEX(id->flbas)].ms);
+}
+
 #endif /* HW_NVME_H */
-- 
2.21.0




[Qemu-block] [PATCH 3/8] nvme: simplify PRP mappings

2019-05-17 Thread Klaus Birkelund Jensen
Instead of handling both QSGs and IOVs in multiple places, simply use
QSGs everywhere by assuming that the request does not involve the
controller memory buffer (CMB). If the request is found to involve the
CMB, convert the QSG to an IOV and issue the I/O.

The QSG is converted to an IOV by the dma helpers anyway, so it is not
like the CMB path is unfairly affected by this simplifying change.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 205 +++---
 hw/block/nvme.h   |   3 +-
 hw/block/trace-events |   1 +
 include/block/nvme.h  |   1 +
 4 files changed, 138 insertions(+), 72 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 65dfc04f71e5..453213f9abb4 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -71,14 +71,21 @@
 
 static void nvme_process_sq(void *opaque);
 
+static inline uint8_t nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
+{
+return n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size));
+}
+
 static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
-if (n->cmbsz && addr >= n->ctrl_mem.addr &&
-addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+if (nvme_addr_is_cmb(n, addr)) {
 memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
-} else {
-pci_dma_read(&n->parent_obj, addr, buf, size);
+
+return;
 }
+
+pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
 static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
@@ -167,31 +174,48 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
 }
 }
 
-static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
- uint64_t prp2, uint32_t len, NvmeCtrl *n)
+static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
+uint64_t prp2, uint32_t len, NvmeRequest *req)
 {
 hwaddr trans_len = n->page_size - (prp1 % n->page_size);
 trans_len = MIN(len, trans_len);
 int num_prps = (len >> n->page_bits) + 1;
+uint16_t status = NVME_SUCCESS;
+
+trace_nvme_map_prp(req->cmd_opcode, trans_len, len, prp1, prp2, num_prps);
 
 if (unlikely(!prp1)) {
 trace_nvme_err_invalid_prp();
 return NVME_INVALID_FIELD | NVME_DNR;
-} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
-   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
-qsg->nsg = 0;
-qemu_iovec_init(iov, num_prps);
-qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], 
trans_len);
+}
+
+if (nvme_addr_is_cmb(n, prp1)) {
+NvmeSQueue *sq = req->sq;
+if (!nvme_addr_is_cmb(n, sq->dma_addr)) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+
+req->is_cmb = true;
 } else {
-pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
-qemu_sglist_add(qsg, prp1, trans_len);
+req->is_cmb = false;
 }
+
+pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+qemu_sglist_add(qsg, prp1, trans_len);
+
 len -= trans_len;
 if (len) {
 if (unlikely(!prp2)) {
 trace_nvme_err_invalid_prp2_missing();
+status = NVME_INVALID_FIELD | NVME_DNR;
 goto unmap;
 }
+
+if (req->is_cmb && !nvme_addr_is_cmb(n, prp2)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+goto unmap;
+}
+
 if (len > n->page_size) {
 uint64_t prp_list[n->max_prp_ents];
 uint32_t nents, prp_trans;
@@ -203,79 +227,99 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, 
QEMUIOVector *iov, uint64_t prp1,
 while (len != 0) {
 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
 
+if (req->is_cmb && !nvme_addr_is_cmb(n, prp_ent)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+goto unmap;
+}
+
 if (i == n->max_prp_ents - 1 && len > n->page_size) {
 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
 trace_nvme_err_invalid_prplist_ent(prp_ent);
+status = NVME_INVALID_FIELD | NVME_DNR;
 goto unmap;
 }
 
 i = 0;
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp_ent, (void *)prp_list,
-prp_trans);
+nvme_addr_read(n, prp_ent, (void *)prp_list, prp_trans);
  

[Qemu-block] [PATCH 6/8] nvme: add support for scatter gather lists

2019-05-17 Thread Klaus Birkelund Jensen
Add partial SGL support. For now, only support a single data block or
last segment descriptor. This is in line with what, for instance, SPDK
currently supports.

Signed-off-by: Klaus Birkelund Jensen 
---
 block/nvme.c  |  18 ++--
 hw/block/nvme.c   | 242 +-
 hw/block/nvme.h   |   6 ++
 hw/block/trace-events |   1 +
 include/block/nvme.h  |  81 +-
 5 files changed, 285 insertions(+), 63 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 0684bbd077dd..12d98c0d0be6 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -437,7 +437,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
 }
-cmd.prp1 = cpu_to_le64(iova);
+cmd.dptr.prp.prp1 = cpu_to_le64(iova);
 
 if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
 error_setg(errp, "Failed to identify controller");
@@ -511,7 +511,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_CQ,
-.prp1 = cpu_to_le64(q->cq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0x)),
 .cdw11 = cpu_to_le32(0x3),
 };
@@ -522,7 +522,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_SQ,
-.prp1 = cpu_to_le64(q->sq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0x)),
 .cdw11 = cpu_to_le32(0x1 | (n << 16)),
 };
@@ -857,16 +857,16 @@ try_map:
 case 0:
 abort();
 case 1:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = 0;
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = 0;
 break;
 case 2:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = pagelist[1];
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = pagelist[1];
 break;
 default:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + 
sizeof(uint64_t));
 break;
 }
 trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 675967a596d1..81201a8b4834 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -279,6 +279,96 @@ unmap:
 return status;
 }
 
+static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
+{
+NvmeSglDescriptor *sgl_descriptors;
+uint64_t nsgld;
+uint16_t status = NVME_SUCCESS;
+
+trace_nvme_map_sgl(req->cqe.cid, le64_to_cpu(sgl.generic.type), req->nlb,
+len);
+
+int cmb = 0;
+
+switch (le64_to_cpu(sgl.generic.type)) {
+case SGL_DESCR_TYPE_DATA_BLOCK:
+sgl_descriptors = &sgl;
+nsgld = 1;
+
+break;
+
+case SGL_DESCR_TYPE_LAST_SEGMENT:
+sgl_descriptors = g_malloc0(le64_to_cpu(sgl.unkeyed.len));
+nsgld = le64_to_cpu(sgl.unkeyed.len) / sizeof(NvmeSglDescriptor);
+
+if (nvme_addr_is_cmb(n, sgl.addr)) {
+cmb = 1;
+}
+
+nvme_addr_read(n, le64_to_cpu(sgl.addr), sgl_descriptors,
+le64_to_cpu(sgl.unkeyed.len));
+
+break;
+
+default:
+return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
+}
+
+if (nvme_addr_is_cmb(n, le64_to_cpu(sgl_descriptors[0].addr))) {
+if (!cmb) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+goto maybe_free;
+}
+
+req->is_cmb = true;
+} else {
+if (cmb) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+goto maybe_free;
+}
+
+req->is_cmb = false;
+}
+
+pci_dma_sglist_init(qsg, &n->parent_obj, nsgld);
+
+for (int i = 0; i < nsgld; i++) {
+uint64_t addr;
+uint32_t trans_len;
+
+if (len == 0) {
+if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
+status = NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+qemu_sglist_destroy(qsg);
+goto maybe_free;
+}
+
+break;
+}
+
+addr = le64_to_cpu(sgl_descriptors[i].addr);
+trans_len = MIN(len, le64_to_cpu(sgl_descriptors[i].unkeyed.len));
+
+if (req->is_cmb && !nvme_addr_is_cmb(n, addr)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+qemu_sglist_destroy(qsg);
+goto maybe_free;
+}
+
+qemu_sglist_add(qsg, addr, trans_len);
+
+len -= trans_len;

[Qemu-block] [PATCH 2/8] nvme: bump supported spec to 1.3

2019-05-17 Thread Klaus Birkelund Jensen
Bump the supported NVMe version to 1.3. To do so, this patch adds a
number of missing 'Mandatory' features from the spec:

  * Support for returning a Namespace Identification Descriptor List in
the Identify command (CNS 03h).
  * Support for the Asynchronous Event Request command.
  * Support for the Get Log Page command and the mandatory Error
Information, Smart / Health Information and Firmware Slot
Information log pages.
  * Support for the Abort command.

As a side-effect, this bump also fixes support for multiple namespaces.

The implementation of AER, Get Log Page and Abort commands has been
imported and slightly modified from Keith's qemu-nvme tree[1]. Thanks!

  [1]: http://git.infradead.org/users/kbusch/qemu-nvme.git

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 792 --
 hw/block/nvme.h   |  31 +-
 hw/block/trace-events |  16 +-
 include/block/nvme.h  |  58 +++-
 4 files changed, 783 insertions(+), 114 deletions(-)
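
As a quick reference (not part of the patch): both the VS register and the
Identify Controller VER field encode the version as major.minor.tertiary,
so 1.3.0 reads back as 0x00010300. A minimal decode sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* VS / VER layout: bits 31:16 major, 15:08 minor, 07:00 tertiary. */
    static void print_nvme_version(uint32_t vs)
    {
        printf("NVMe %u.%u.%u\n",
               (vs >> 16) & 0xffff, (vs >> 8) & 0xff, vs & 0xff);
    }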

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b689c0776e72..65dfc04f71e5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,17 +9,35 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.3d, 1.2, 1.1, 1.0e
  *
  *  http://www.nvmexpress.org/resources/
  */
 
 /**
  * Usage: add options:
- *  -drive file=,if=none,id=
- *  -device nvme,drive=,serial=,id=, \
- *  cmb_size_mb=, \
- *  num_queues=
+ * -drive file=,if=none,id=
+ * -device nvme,drive=,serial=,id=
+ *
+ * The "file" option must point to a path to a real file that you will use as
+ * the backing storage for your NVMe device. It must be a non-zero length, as
+ * this will be the disk image that your nvme controller will use to carve up
+ * namespaces for storage.
+ *
+ * Note the "drive" option's "id" name must match the "device nvme" drive's
+ * name to link the block device used for backing storage to the nvme
+ * interface.
+ *
+ * Advanced optional options:
+ *
+ *   num_ns=  : Namespaces to make out of the backing storage,
+ *   Default:1
+ *   num_queues=  : Number of possible IO Queues, Default:64
+ *   cmb_size_mb= : Size of CMB in MBs, Default:0
+ *
+ * Parameters will be verified against conflicting capabilities and attributes
+ * and fail to load if there is a conflict or a configuration the emulated
+ * device is unable to handle.
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -38,6 +56,12 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_TEMPERATURE 0x143
+#define NVME_ELPE 3
+#define NVME_AERL 3
+#define NVME_OP_ABORTED 0xff
+
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
 (trace_##trace)(__VA_ARGS__); \
@@ -57,6 +81,16 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 }
 }
 
+static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
+{
+if (n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+memcpy((void *)&n->cmbuf[addr - n->ctrl_mem.addr], buf, size);
+return;
+}
+pci_dma_write(&n->parent_obj, addr, buf, size);
+}
+
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
 return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
@@ -244,6 +278,24 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 return status;
 }
 
+static void nvme_post_cqe(NvmeCQueue *cq, NvmeRequest *req)
+{
+NvmeCtrl *n = cq->ctrl;
+NvmeSQueue *sq = req->sq;
+NvmeCqe *cqe = &req->cqe;
+uint8_t phase = cq->phase;
+hwaddr addr;
+
+addr = cq->dma_addr + cq->tail * n->cqe_size;
+cqe->status = cpu_to_le16((req->status << 1) | phase);
+cqe->sq_id = cpu_to_le16(sq->sqid);
+cqe->sq_head = cpu_to_le16(sq->head);
+nvme_addr_write(n, addr, (void *) cqe, sizeof(*cqe));
+nvme_inc_cq_tail(cq);
+
+QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
+}
+
 static void nvme_post_cqes(void *opaque)
 {
 NvmeCQueue *cq = opaque;
@@ -251,24 +303,14 @@ static void nvme_post_cqes(void *opaque)
 NvmeRequest *req, *next;
 
 QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
-NvmeSQueue *sq;
-hwaddr addr;
-
 if (nvme_cq_full(cq)) {
 break;
 }
 
 QTAILQ_REMOVE(&cq->req_list, req, entry);
-sq = req->sq;
-req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
-req->cqe.sq_id = cpu_to_l

[Qemu-block] [PATCH] nvme: fix copy direction in DMA reads going to CMB

2019-05-18 Thread Klaus Birkelund Jensen
`nvme_dma_read_prp` erroneously used `qemu_iovec_*to*_buf` instead of
`qemu_iovec_*from*_buf` when the request involved the controller memory
buffer.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
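
To make the direction explicit (illustrative note only, the hunk below is
the whole fix): qemu_iovec_from_buf() copies from a flat buffer into the
iovec, while qemu_iovec_to_buf() copies out of the iovec into a flat
buffer. A controller-to-host transfer whose destination is the CMB-backed
iovec therefore needs the *from* variant; the snippets mirror the code
being fixed.

    /* DMA read (controller -> host): fill the CMB-backed iovec from ptr */
    if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
        status = NVME_INVALID_FIELD | NVME_DNR;
    }

    /* DMA write (host -> controller) goes the other way */
    if (unlikely(qemu_iovec_to_buf(&iov, 0, ptr, len) != len)) {
        status = NVME_INVALID_FIELD | NVME_DNR;
    }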

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 7caf92532a09..63a5b58849fb 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -238,7 +238,7 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 }
 qemu_sglist_destroy(&qsg);
 } else {
-if (unlikely(qemu_iovec_to_buf(&iov, 0, ptr, len) != len)) {
+if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
 trace_nvme_err_invalid_dma();
 status = NVME_INVALID_FIELD | NVME_DNR;
 }
-- 
2.21.0




[Qemu-block] [PATCH] nvme: do not advertise support for unsupported arbitration mechanism

2019-06-06 Thread Klaus Birkelund Jensen
The device mistakenly reports that the Weighted Round Robin with Urgent
Priority Class arbitration mechanism is supported.

It is not.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 30e50f7a3853..415b4641d6b4 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1383,7 +1383,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 n->bar.cap = 0;
 NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
 NVME_CAP_SET_CQR(n->bar.cap, 1);
-NVME_CAP_SET_AMS(n->bar.cap, 1);
 NVME_CAP_SET_TO(n->bar.cap, 0xf);
 NVME_CAP_SET_CSS(n->bar.cap, 1);
 NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
-- 
2.21.0




[Qemu-block] [PATCH 01/16] nvme: simplify namespace code

2019-07-05 Thread Klaus Birkelund Jensen
The device model currently only supports a single namespace and also
specifically sets num_namespaces to 1. Take this into account and
simplify the code.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 26 +++---
 hw/block/nvme.h |  2 +-
 2 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 36d6a8bb3a3e..28ebaf1368b1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -424,7 +424,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return NVME_INVALID_NSID | NVME_DNR;
 }
 
-ns = &n->namespaces[nsid - 1];
+ns = &n->namespace;
 switch (cmd->opcode) {
 case NVME_CMD_FLUSH:
 return nvme_flush(n, ns, cmd, req);
@@ -670,7 +670,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify 
*c)
 return NVME_INVALID_NSID | NVME_DNR;
 }
 
-ns = &n->namespaces[nsid - 1];
+ns = &n->namespace;
 
 return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
 prp1, prp2);
@@ -1306,8 +1306,8 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
 NvmeCtrl *n = NVME(pci_dev);
 NvmeIdCtrl *id = &n->id_ctrl;
+NvmeIdNs *id_ns = &n->namespace.id_ns;
 
-int i;
 int64_t bs_size;
 uint8_t *pci_conf;
 
@@ -1347,7 +1347,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
-n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
 n->sq = g_new0(NvmeSQueue *, n->num_queues);
 n->cq = g_new0(NvmeCQueue *, n->num_queues);
 
@@ -1416,20 +1415,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 
 }
 
-for (i = 0; i < n->num_namespaces; i++) {
-NvmeNamespace *ns = &n->namespaces[i];
-NvmeIdNs *id_ns = &ns->id_ns;
-id_ns->nsfeat = 0;
-id_ns->nlbaf = 0;
-id_ns->flbas = 0;
-id_ns->mc = 0;
-id_ns->dpc = 0;
-id_ns->dps = 0;
-id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
-id_ns->ncap  = id_ns->nuse = id_ns->nsze =
-cpu_to_le64(n->ns_size >>
-id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas)].ds);
-}
+id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+id_ns->ncap  = id_ns->nuse = id_ns->nsze =
+cpu_to_le64(n->ns_size >>
+id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)].ds);
 }
 
 static void nvme_exit(PCIDevice *pci_dev)
@@ -1437,7 +1426,6 @@ static void nvme_exit(PCIDevice *pci_dev)
 NvmeCtrl *n = NVME(pci_dev);
 
 nvme_clear_ctrl(n);
-g_free(n->namespaces);
 g_free(n->cq);
 g_free(n->sq);
 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 557194ee1954..40cedb1ec932 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -83,7 +83,7 @@ typedef struct NvmeCtrl {
 uint64_t timestamp_set_qemu_clock_ms; /* QEMU clock time */
 
 char *serial;
-NvmeNamespace   *namespaces;
+NvmeNamespace   namespace;
 NvmeSQueue  **sq;
 NvmeCQueue  **cq;
 NvmeSQueue  admin_sq;
-- 
2.20.1




[Qemu-block] [PATCH 08/16] nvme: refactor device realization

2019-07-05 Thread Klaus Birkelund Jensen
Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 196 ++--
 hw/block/nvme.h |  11 +++
 2 files changed, 152 insertions(+), 55 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4b9ff51868c0..eb6af6508e2d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -38,6 +38,7 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -1365,66 +1366,105 @@ static const MemoryRegionOps nvme_cmb_ops = {
 },
 };
 
-static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
-NvmeCtrl *n = NVME(pci_dev);
-NvmeIdCtrl *id = &n->id_ctrl;
-NvmeIdNs *id_ns = &n->namespace.id_ns;
-
-int64_t bs_size;
-uint8_t *pci_conf;
-
-if (!n->params.num_queues) {
-error_setg(errp, "num_queues can't be zero");
-return;
-}
+NvmeParams *params = &n->params;
 
 if (!n->conf.blk) {
-error_setg(errp, "drive property not set");
-return;
+error_setg(errp, "nvme: block backend not configured");
+return 1;
 }
 
-bs_size = blk_getlength(n->conf.blk);
-if (bs_size < 0) {
-error_setg(errp, "could not get backing file size");
-return;
+if (!params->serial) {
+error_setg(errp, "nvme: serial not configured");
+return 1;
 }
 
-if (!n->params.serial) {
-error_setg(errp, "serial property not set");
-return;
+if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
+error_setg(errp, "nvme: invalid queue configuration");
+return 1;
 }
+
+return 0;
+}
+
+static int nvme_init_blk(NvmeCtrl *n, Error **errp)
+{
 blkconf_blocksizes(&n->conf);
 if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
-   false, errp)) {
-return;
+false, errp)) {
+return 1;
 }
 
-pci_conf = pci_dev->config;
-pci_conf[PCI_INTERRUPT_PIN] = 1;
-pci_config_set_prog_interface(pci_dev->config, 0x2);
-pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
-pcie_endpoint_cap_init(pci_dev, 0x80);
+return 0;
+}
 
+static void nvme_init_state(NvmeCtrl *n)
+{
 n->num_namespaces = 1;
 n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
-n->ns_size = bs_size / (uint64_t)n->num_namespaces;
-
 n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
 n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
+}
 
-memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
-  "nvme", n->reg_size);
+static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
+NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
+
+NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
+NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
+
+n->cmbloc = n->bar.cmbloc;
+n->cmbsz = n->bar.cmbsz;
+
+n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
+"nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
+PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
+PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
+}
+
+static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+uint8_t *pci_conf = pci_dev->config;
+
+pci_conf[PCI_INTERRUPT_PIN] = 1;
+pci_config_set_prog_interface(pci_conf, 0x2);
+pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
+pci_config_set_device_id(pci_conf, 0x5845);
+pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+pcie_endpoint_cap_init(pci_dev, 0x80);
+
+memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
+n->reg_size);
 pci_register_bar(pci_dev, 0,
 PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
 &n->iomem);
 msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
 
+if (n->params.cmb_size_mb) {
+nvme_init_cmb(n, pci_dev);
+}
+}
+
+static void nvme_init_ctrl(NvmeCtrl *n)
+{
+NvmeIdCtrl *id = &n->id_ctrl;
+NvmeParams *params = &n->params;
+uint

[Qemu-block] [PATCH 03/16] nvme: fix lpa field

2019-07-05 Thread Klaus Birkelund Jensen
The Log Page Attributes field in the Identify Controller structure
indicates that the controller supports the SMART / Health Information
log page on a per namespace basis. It does not, given that neither this
log page nor the Get Log Page command is implemented.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a3f83f3c2135..ce2e5365385b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1366,7 +1366,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 id->ieee[2] = 0xb3;
 id->oacs = cpu_to_le16(0);
 id->frmw = 7 << 1;
-id->lpa = 1 << 0;
 id->sqes = (0x6 << 4) | 0x6;
 id->cqes = (0x4 << 4) | 0x4;
 id->nn = cpu_to_le32(n->num_namespaces);
-- 
2.20.1




[Qemu-block] [PATCH 02/16] nvme: move device parameters to separate struct

2019-07-05 Thread Klaus Birkelund Jensen
Move device configuration parameters to separate struct to make it
explicit what is configurable and what is set internally.

Also, clean up some includes.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 54 +++--
 hw/block/nvme.h | 16 ---
 2 files changed, 38 insertions(+), 32 deletions(-)
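
For orientation (a sketch only, not the actual definition): the parameters
gathered into the new struct are the ones referenced as n->params.* in the
hunks below, roughly:

    typedef struct NvmeParams {
        char     *serial;       /* n->params.serial */
        uint32_t num_queues;    /* n->params.num_queues */
        uint32_t cmb_size_mb;   /* n->params.cmb_size_mb */
    } NvmeParams;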

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 28ebaf1368b1..a3f83f3c2135 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -27,18 +27,14 @@
 
 #include "qemu/osdep.h"
 #include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
 #include "hw/block/block.h"
-#include "hw/hw.h"
 #include "hw/pci/msix.h"
-#include "hw/pci/pci.h"
 #include "sysemu/sysemu.h"
-#include "qapi/error.h"
-#include "qapi/visitor.h"
 #include "sysemu/block-backend.h"
+#include "qapi/error.h"
 
-#include "qemu/log.h"
-#include "qemu/module.h"
-#include "qemu/cutils.h"
 #include "trace.h"
 #include "nvme.h"
 
@@ -63,12 +59,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->num_queues && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->num_queues && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -630,7 +626,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
 trace_nvme_err_invalid_create_cq_addr(prp1);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
-if (unlikely(vector > n->num_queues)) {
+if (unlikely(vector > n->params.num_queues)) {
 trace_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -782,7 +778,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 
16));
+result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 trace_nvme_getfeat_numq(result);
 break;
 case NVME_TIMESTAMP:
@@ -827,9 +824,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_NUMBER_OF_QUEUES:
 trace_nvme_setfeat_numq((dw11 & 0xFFFF) + 1,
 ((dw11 >> 16) & 0xFFFF) + 1,
-n->num_queues - 1, n->num_queues - 1);
-req->cqe.result =
-cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+n->params.num_queues - 1,
+n->params.num_queues - 1);
+req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
+((n->params.num_queues - 2) << 16));
 break;
 
 case NVME_TIMESTAMP:
@@ -903,12 +901,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
 blk_drain(n->conf.blk);
 
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->sq[i] != NULL) {
 nvme_free_sq(n->sq[i], n);
 }
 }
-for (i = 0; i < n->num_queues; i++) {
+for (i = 0; i < n->params.num_queues; i++) {
 if (n->cq[i] != NULL) {
 nvme_free_cq(n->cq[i], n);
 }
@@ -1311,7 +1309,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 int64_t bs_size;
 uint8_t *pci_conf;
 
-if (!n->num_queues) {
+if (!n->params.num_queues) {
 error_setg(errp, "num_queues can't be zero");
 return;
 }
@@ -1327,7 +1325,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 return;
 }
 
-if (!n->serial) {
+if (!n->params.serial) {
 error_setg(errp, "serial property not set");
 return;
 }
@@ -1344,24 +1342,24 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 pcie_endpoint_cap_init(pci_dev, 0x80);
 
 n->num_namespaces = 1;
-n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
+n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
 n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
-n->sq = g_new0(NvmeSQueue *, n->num_queues);
-n->cq = g_new0(NvmeCQueue *, 

[Qemu-block] [PATCH 04/16] nvme: add missing fields in identify controller

2019-07-05 Thread Klaus Birkelund Jensen
Not used by the device model but added for completeness. See NVM Express
1.2.1, Section 5.11 ("Identify command"), Figure 90.

Signed-off-by: Klaus Birkelund Jensen 
---
 include/block/nvme.h | 34 +-
 1 file changed, 29 insertions(+), 5 deletions(-)
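
A suggestion rather than part of the patch: when filling in reserved
ranges by hand like this, a compile-time size check catches miscounted
padding, since the Identify Controller data structure must be exactly
4096 bytes (NVM Express 1.2.1, Figure 90):

    QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);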

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 3ec8efcc435e..1b0accd4fe2b 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -543,7 +543,13 @@ typedef struct NvmeIdCtrl {
 uint8_t ieee[3];
 uint8_t cmic;
 uint8_t mdts;
-uint8_t rsvd255[178];
+uint16_t cntlid;
+uint32_t ver;
+uint16_t rtd3r;
+uint32_t rtd3e;
+uint32_t oaes;
+uint32_t ctratt;
+uint8_t rsvd255[156];
 uint16_t oacs;
 uint8_t acl;
 uint8_t aerl;
@@ -551,10 +557,22 @@ typedef struct NvmeIdCtrl {
 uint8_t lpa;
 uint8_t elpe;
 uint8_t npss;
-uint8_t rsvd511[248];
+uint8_t avscc;
+uint8_t apsta;
+uint16_t wctemp;
+uint16_t cctemp;
+uint16_t mtfa;
+uint32_t hmpre;
+uint32_t hmmin;
+uint8_t tnvmcap[16];
+uint8_t unvmcap[16];
+uint32_t rpmbs;
+uint8_t rsvd319[4];
+uint16_t kas;
+uint8_t rsvd511[190];
 uint8_t sqes;
 uint8_t cqes;
-uint16_t rsvd515;
+uint16_t maxcmd;
 uint32_t nn;
 uint16_t oncs;
 uint16_t fuses;
@@ -562,8 +580,14 @@ typedef struct NvmeIdCtrl {
 uint8_t vwc;
 uint16_t awun;
 uint16_t awupf;
-uint8_t rsvd703[174];
-uint8_t rsvd2047[1344];
+uint8_t nvscc;
+uint8_t rsvd531;
+uint16_t acwu;
+uint16_t rsvd535;
+uint32_t sgls;
+uint8_t rsvd767[228];
+uint8_t subnqn[256];
+uint8_t rsvd2047[1024];
 NvmePSD psd[32];
 uint8_t vs[1024];
 } NvmeIdCtrl;
-- 
2.20.1




[Qemu-block] [PATCH 10/16] nvme: support Get Log Page command

2019-07-05 Thread Klaus Birkelund Jensen
Add support for the Get Log Page command and stub/dumb implementations
of the mandatory Error Information, SMART/Health Information and
Firmware Slot Information log pages.

Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.10 ("Get Log Page command").

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 209 ++
 hw/block/nvme.h   |   3 +
 hw/block/trace-events |   3 +
 include/block/nvme.h  |   4 +-
 4 files changed, 217 insertions(+), 2 deletions(-)
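
For illustration only (not part of the patch): how a host might build a
Get Log Page command for the SMART / Health log. The helper name is made
up; the dword layout is per NVMe 1.3 (LID in CDW10 bits 07:00, NUMDL in
bits 31:16, NUMDU in CDW11 bits 15:00, with NUMD being a 0's based dword
count). The sketch assumes the buffer fits in one page so only PRP1 is
needed.

    static void build_smart_log_cmd(NvmeCmd *cmd, uint64_t prp1,
                                    uint32_t buf_len)
    {
        uint32_t numd = buf_len / 4 - 1;

        memset(cmd, 0, sizeof(*cmd));
        cmd->opcode = NVME_ADM_CMD_GET_LOG_PAGE;
        cmd->nsid = cpu_to_le32(0xffffffff);    /* controller-wide log */
        cmd->prp1 = cpu_to_le64(prp1);
        cmd->cdw10 = cpu_to_le32(NVME_LOG_SMART_INFO |
                                 ((numd & 0xffff) << 16));
        cmd->cdw11 = cpu_to_le32(numd >> 16);
    }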

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a20576654f1b..93f5dff197e0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -39,6 +39,8 @@
 #include "nvme.h"
 
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_TEMPERATURE 0x143
+#define NVME_ELPE 3
 #define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
@@ -319,6 +321,36 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type,
+uint8_t event_info, uint8_t log_page)
+{
+NvmeAsyncEvent *event;
+
+trace_nvme_enqueue_event(event_type, event_info, log_page);
+
+/*
+ * Do not enqueue the event if something of this type is already queued.
+ * This bounds the size of the event queue and makes sure it does not grow
+ * indefinitely when events are not processed by the host (i.e. does not
+ * issue any AERs).
+ */
+if (n->aer_mask_queued & (1 << event_type)) {
+return;
+}
+n->aer_mask_queued |= (1 << event_type);
+
+event = g_new(NvmeAsyncEvent, 1);
+event->result = (NvmeAerResult) {
+.event_type = event_type,
+.event_info = event_info,
+.log_page   = log_page,
+};
+
+QSIMPLEQ_INSERT_TAIL(&n->aer_queue, event, entry);
+
+timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+}
+
 static void nvme_process_aers(void *opaque)
 {
 NvmeCtrl *n = opaque;
@@ -831,6 +863,10 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 uint32_t result;
 
 switch (dw10) {
+case NVME_TEMPERATURE_THRESHOLD:
+result = cpu_to_le32(n->features.temp_thresh);
+break;
+case NVME_ERROR_RECOVERY:
 case NVME_VOLATILE_WRITE_CACHE:
 result = blk_enable_write_cache(n->conf.blk);
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -878,6 +914,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
 switch (dw10) {
+case NVME_TEMPERATURE_THRESHOLD:
+n->features.temp_thresh = dw11;
+if (n->features.temp_thresh <= n->temperature) {
+nvme_enqueue_event(n, NVME_AER_TYPE_SMART,
+NVME_AER_INFO_SMART_TEMP_THRESH, NVME_LOG_SMART_INFO);
+}
+break;
 case NVME_VOLATILE_WRITE_CACHE:
 blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
 break;
@@ -902,6 +945,137 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 return NVME_SUCCESS;
 }
 
+static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
+{
+n->aer_mask &= ~(1 << event_type);
+if (!QSIMPLEQ_EMPTY(&n->aer_queue)) {
+timer_mod(n->aer_timer,
+qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+}
+}
+
+static uint16_t nvme_error_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
+uint32_t buf_len, uint64_t off, NvmeRequest *req)
+{
+uint32_t trans_len;
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+if (off > sizeof(*n->elpes) * (NVME_ELPE + 1)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(*n->elpes) * (NVME_ELPE + 1) - off, buf_len);
+
+if (!rae) {
+nvme_clear_events(n, NVME_AER_TYPE_ERROR);
+}
+
+return nvme_dma_read_prp(n, (uint8_t *) n->elpes + off, trans_len, prp1,
+prp2);
+}
+
+static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
+uint32_t buf_len, uint64_t off, NvmeRequest *req)
+{
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+uint32_t trans_len;
+time_t current_ms;
+NvmeSmartLog smart;
+
+if (cmd->nsid != 0 && cmd->nsid != 0xffffffff) {
+trace_nvme_err(req->cqe.cid, "smart log not supported for namespace",
+NVME_INVALID_FIELD | NVME_DNR);
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+if (off > sizeof(smart)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+trans_len = MIN(sizeof(smart) - off, buf_len);
+
+memset(&

[Qemu-block] [PATCH 06/16] nvme: support completion queue in cmb

2019-07-05 Thread Klaus Birkelund Jensen
While not particularly useful in practice, allow completion queues to be
placed in the controller memory buffer; this could be handy for testing.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3c392dc336a8..b31e5ff681bd 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -57,6 +57,16 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void 
*buf, int size)
 }
 }
 
+static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
+{
+if (n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+memcpy((void *)&n->cmbuf[addr - n->ctrl_mem.addr], buf, size);
+return;
+}
+pci_dma_write(&n->parent_obj, addr, buf, size);
+}
+
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
 return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
@@ -276,6 +286,7 @@ static void nvme_post_cqes(void *opaque)
 
 QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
 NvmeSQueue *sq;
+NvmeCqe *cqe = &req->cqe;
 hwaddr addr;
 
 if (nvme_cq_full(cq)) {
@@ -289,8 +300,7 @@ static void nvme_post_cqes(void *opaque)
 req->cqe.sq_head = cpu_to_le16(sq->head);
 addr = cq->dma_addr + cq->tail * n->cqe_size;
 nvme_inc_cq_tail(cq);
-pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
-sizeof(req->cqe));
+nvme_addr_write(n, addr, (void *) cqe, sizeof(*cqe));
 QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
 }
 if (cq->tail != cq->head) {
@@ -1399,7 +1409,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
 
 NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 1);
 NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
 NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
 NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
-- 
2.20.1




[Qemu-block] [PATCH 00/16] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-07-05 Thread Klaus Birkelund Jensen
Matt Fitzpatrick's post ("[RFC,v1] Namespace Management Support") pushed
me to finally get my head out of my a** and post this series.

This is basically a follow-up to my previous series ("nvme: v1.3, sgls,
metadata and new 'ocssd' device"), but I'm not tagging it as a v2
because the patches for metadata and the ocssd device have been dropped.
Instead, this series also includes a patch that enables support for
multiple namespaces in a "proper" way by adding a new 'nvme-ns' device
model such that the "real" nvme device is composed of the 'nvme' device
model (the core controller) and multiple 'nvme-ns' devices that model
the namespaces.

All in all, the patches in this series should be less controversial, but
I know there is a lot to go through. I've kept commit 011de3d531b6
("nvme: refactor device realization") as a single commit; I can chop it
up if any reviewers would prefer that, but the series is already at 16
patches. The refactor patch is basically just code movement.

At a glance, this series:

  - generally fixes up the device to be as close to NVMe 1.3d compliant as
possible (in terms of 'mandatory' features) by:
  - adding proper setting of the SUBNQN and VER fields
  - supporting the Abort command
  - supporting the Asynchronous Event Request command
  - supporting the Get Log Page command
  - providing reasonable stub responses to Get/Set Feature command of
mandatory features
  - adds support for scatter gather lists (SGLs)
  - simplifies DMA/CMB mappings and support PRPs/SGLs in the CMB
  - adds support for multiple block requests per nvme request (this is
useful for future support for metadata, OCSSD 2.0 vector requests
and upcoming zoned namespaces)
  - adds support for multiple namespaces


Thanks to everyone who chipped in on the discussion on multiple
namespaces! You're CC'ed ;)


Klaus Birkelund Jensen (16):
  nvme: simplify namespace code
  nvme: move device parameters to separate struct
  nvme: fix lpa field
  nvme: add missing fields in identify controller
  nvme: populate the mandatory subnqn and ver fields
  nvme: support completion queue in cmb
  nvme: support Abort command
  nvme: refactor device realization
  nvme: support Asynchronous Event Request command
  nvme: support Get Log Page command
  nvme: add missing mandatory Features
  nvme: bump supported NVMe revision to 1.3d
  nvme: simplify dma/cmb mappings
  nvme: support multiple block requests per request
  nvme: support scatter gather lists
  nvme: support multiple namespaces

 block/nvme.c   |   18 +-
 hw/block/Makefile.objs |2 +-
 hw/block/nvme-ns.c |  139 
 hw/block/nvme-ns.h |   35 +
 hw/block/nvme.c| 1629 
 hw/block/nvme.h|   99 ++-
 hw/block/trace-events  |   24 +-
 include/block/nvme.h   |  130 +++-
 8 files changed, 1739 insertions(+), 337 deletions(-)
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

-- 
2.20.1




[Qemu-block] [PATCH 07/16] nvme: support Abort command

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.1 ("Abort command").

Extracted from Keith's qemu-nvme tree. Modified to only consider queued
and not executing commands.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 56 +
 1 file changed, 56 insertions(+)
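
For illustration only (not part of the patch): the host side of this. The
Abort command carries the submission queue id in CDW10 bits 15:00 and the
command id to abort in bits 31:16; a Dword 0 result of 1 in the completion
means nothing was aborted, which matches the 'best effort' handling below.
The helper name is made up.

    static void build_abort_cmd(NvmeCmd *cmd, uint16_t sqid, uint16_t cid)
    {
        memset(cmd, 0, sizeof(*cmd));
        cmd->opcode = NVME_ADM_CMD_ABORT;
        cmd->cdw10 = cpu_to_le32(sqid | ((uint32_t)cid << 16));
    }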

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b31e5ff681bd..4b9ff51868c0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -38,6 +38,7 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
 (trace_##trace)(__VA_ARGS__); \
@@ -848,6 +849,54 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
 }
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+NvmeSQueue *sq;
+NvmeRequest *new;
+uint32_t index = 0;
+uint16_t sqid = cmd->cdw10 & 0xffff;
+uint16_t cid = (cmd->cdw10 >> 16) & 0xffff;
+
+req->cqe.result = 1;
+if (nvme_check_sqid(n, sqid)) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+sq = n->sq[sqid];
+
+/* only consider queued (and not executing) commands for abort */
+while ((sq->head + index) % sq->size != sq->tail) {
+NvmeCmd abort_cmd;
+hwaddr addr;
+
+addr = sq->dma_addr + ((sq->head + index) % sq->size) * n->sqe_size;
+
+nvme_addr_read(n, addr, (void *) &abort_cmd, sizeof(abort_cmd));
+if (abort_cmd.cid == cid) {
+req->cqe.result = 0;
+new = QTAILQ_FIRST(&sq->req_list);
+QTAILQ_REMOVE(&sq->req_list, new, entry);
+QTAILQ_INSERT_TAIL(&sq->out_req_list, new, entry);
+
+memset(&new->cqe, 0, sizeof(new->cqe));
+new->cqe.cid = cid;
+new->status = NVME_CMD_ABORT_REQ;
+
+abort_cmd.opcode = NVME_OP_ABORTED;
+nvme_addr_write(n, addr, (void *) &abort_cmd, sizeof(abort_cmd));
+
+nvme_enqueue_req_completion(n->cq[sq->cqid], new);
+
+return NVME_SUCCESS;
+}
+
+++index;
+}
+
 return NVME_SUCCESS;
 }
 
@@ -868,6 +917,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return nvme_set_feature(n, cmd, req);
 case NVME_ADM_CMD_GET_FEATURES:
 return nvme_get_feature(n, cmd, req);
+case NVME_ADM_CMD_ABORT:
+return nvme_abort(n, cmd, req);
 default:
 trace_nvme_err_invalid_admin_opc(cmd->opcode);
 return NVME_INVALID_OPCODE | NVME_DNR;
@@ -890,6 +941,10 @@ static void nvme_process_sq(void *opaque)
 nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd));
 nvme_inc_sq_head(sq);
 
+if (cmd.opcode == NVME_OP_ABORTED) {
+continue;
+}
+
 req = QTAILQ_FIRST(&sq->req_list);
 QTAILQ_REMOVE(&sq->req_list, req, entry);
 QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
@@ -1376,6 +1431,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 id->ieee[2] = 0xb3;
 id->ver = cpu_to_le32(0x00010201);
 id->oacs = cpu_to_le16(0);
+id->acl = 3;
 id->frmw = 7 << 1;
 id->sqes = (0x6 << 4) | 0x6;
 id->cqes = (0x4 << 4) | 0x4;
-- 
2.20.1




[Qemu-block] [PATCH 11/16] nvme: add missing mandatory Features

2019-07-05 Thread Klaus Birkelund Jensen
Add support for returning a reasonable response to Get/Set Features for
the mandatory features.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 49 ---
 hw/block/trace-events |  2 ++
 include/block/nvme.h  |  3 ++-
 3 files changed, 50 insertions(+), 4 deletions(-)
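
For reference (not part of the patch): the Interrupt Vector Configuration
value handled below packs the vector in bits 15:00 and the coalescing
disable flag in bit 16, which is why the init loop stores i | (1 << 16).
A decode sketch with made-up helper names:

    #include <stdbool.h>
    #include <stdint.h>

    static inline uint16_t ivc_vector(uint32_t val)
    {
        return val & 0xffff;
    }

    static inline bool ivc_coalescing_disabled(uint32_t val)
    {
        return (val >> 16) & 0x1;
    }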

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 93f5dff197e0..8259dd7c1d6c 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -860,13 +860,24 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, 
NvmeCmd *cmd)
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
+uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 uint32_t result;
 
+trace_nvme_getfeat(dw10);
+
 switch (dw10) {
+case NVME_ARBITRATION:
+result = cpu_to_le32(n->features.arbitration);
+break;
+case NVME_POWER_MANAGEMENT:
+result = cpu_to_le32(n->features.power_mgmt);
+break;
 case NVME_TEMPERATURE_THRESHOLD:
 result = cpu_to_le32(n->features.temp_thresh);
 break;
 case NVME_ERROR_RECOVERY:
+result = cpu_to_le32(n->features.err_rec);
+break;
 case NVME_VOLATILE_WRITE_CACHE:
 result = blk_enable_write_cache(n->conf.blk);
 trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -878,6 +889,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
+case NVME_INTERRUPT_COALESCING:
+result = cpu_to_le32(n->features.int_coalescing);
+break;
+case NVME_INTERRUPT_VECTOR_CONF:
+if ((dw11 & 0xffff) > n->params.num_queues) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
+break;
+case NVME_WRITE_ATOMICITY:
+result = cpu_to_le32(n->features.write_atomicity);
+break;
 case NVME_ASYNCHRONOUS_EVENT_CONF:
 result = cpu_to_le32(n->features.async_config);
 break;
@@ -913,6 +937,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
+trace_nvme_setfeat(dw10, dw11);
+
 switch (dw10) {
 case NVME_TEMPERATURE_THRESHOLD:
 n->features.temp_thresh = dw11;
@@ -937,6 +963,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 case NVME_ASYNCHRONOUS_EVENT_CONF:
 n->features.async_config = dw11;
 break;
+case NVME_ARBITRATION:
+case NVME_POWER_MANAGEMENT:
+case NVME_ERROR_RECOVERY:
+case NVME_INTERRUPT_COALESCING:
+case NVME_INTERRUPT_VECTOR_CONF:
+case NVME_WRITE_ATOMICITY:
+return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
 default:
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -1693,6 +1726,14 @@ static void nvme_init_state(NvmeCtrl *n)
 n->aer_reqs = g_new0(NvmeRequest *, NVME_AERL + 1);
 n->temperature = NVME_TEMPERATURE;
 n->features.temp_thresh = 0x14d;
+n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
+sizeof(*n->features.int_vector_config));
+
+/* disable coalescing (not supported) */
+for (int i = 0; i < n->params.num_queues; i++) {
+n->features.int_vector_config[i] = i | (1 << 16);
+}
+
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -1769,6 +1810,10 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 id->nn = cpu_to_le32(n->num_namespaces);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
 
+if (blk_enable_write_cache(n->conf.blk)) {
+id->vwc = 1;
+}
+
 strcpy((char *) id->subnqn, "nqn.2014-08.org.nvmexpress:uuid:");
 qemu_uuid_unparse(&qemu_uuid,
 (char *) id->subnqn + strlen((char *) id->subnqn));
@@ -1776,9 +1821,6 @@ static void nvme_init_ctrl(NvmeCtrl *n)
 id->psd[0].mp = cpu_to_le16(0x9c4);
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
-if (blk_enable_write_cache(n->conf.blk)) {
-id->vwc = 1;
-}
 
 n->bar.cap = 0;
 NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
@@ -1876,6 +1918,7 @@ static void nvme_exit(PCIDevice *pci_dev)
 g_free(n->sq);
 g_free(n->elpes);
 g_free(n->aer_reqs);
+g_free(n->features.int_vector_config);
 
 if (n->params.cmb_size_mb) {
 g_free(n->cmbuf);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index ed666bbc94f2..17485bb0375b 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -41,6 +41,8 @@ nvme_del_cq(uint16_t cqid) "deleted c

[Qemu-block] [PATCH 15/16] nvme: support scatter gather lists

2019-07-05 Thread Klaus Birkelund Jensen
For now, support the Data Block, Segment and Last Segment descriptor
types.

See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").

Signed-off-by: Klaus Birkelund Jensen 
---
 block/nvme.c  |  18 +-
 hw/block/nvme.c   | 390 +++---
 hw/block/nvme.h   |   6 +
 hw/block/trace-events |   3 +
 include/block/nvme.h  |  64 ++-
 5 files changed, 410 insertions(+), 71 deletions(-)
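
An illustrative restatement (not part of the patch) of the length rule
that nvme_map_sgl_data() enforces below: the descriptors must cover at
least the command's data length, and descriptors beyond that length are
only tolerated when the controller reports SGL 'excess length' support in
Identify Controller SGLS (checked via NVME_CTRL_SGLS_EXCESS_LENGTH). A
simplified check, with made-up names:

    #include <stdbool.h>
    #include <stdint.h>

    /* 0 on success, -1 corresponds to Data SGL Length Invalid. */
    static int check_sgl_length(uint32_t mapped, uint32_t cmd_len,
                                bool excess_len_supported)
    {
        if (mapped < cmd_len) {
            return -1;
        }
        if (mapped > cmd_len && !excess_len_supported) {
            return -1;
        }
        return 0;
    }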

diff --git a/block/nvme.c b/block/nvme.c
index 73ed5fa75f2e..907a610633f2 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -438,7 +438,7 @@ static void nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 error_setg(errp, "Cannot map buffer for DMA");
 goto out;
 }
-cmd.prp1 = cpu_to_le64(iova);
+cmd.dptr.prp.prp1 = cpu_to_le64(iova);
 
 if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
 error_setg(errp, "Failed to identify controller");
@@ -512,7 +512,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_CQ,
-.prp1 = cpu_to_le64(q->cq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
 .cdw11 = cpu_to_le32(0x3),
 };
@@ -523,7 +523,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error 
**errp)
 }
 cmd = (NvmeCmd) {
 .opcode = NVME_ADM_CMD_CREATE_SQ,
-.prp1 = cpu_to_le64(q->sq.iova),
+.dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
 .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
 .cdw11 = cpu_to_le32(0x1 | (n << 16)),
 };
@@ -858,16 +858,16 @@ try_map:
 case 0:
 abort();
 case 1:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = 0;
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = 0;
 break;
 case 2:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = pagelist[1];
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = pagelist[1];
 break;
 default:
-cmd->prp1 = pagelist[0];
-cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
+cmd->dptr.prp.prp1 = pagelist[0];
+cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + 
sizeof(uint64_t));
 break;
 }
 trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b285119fd29a..6bf62952dd13 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -273,6 +273,198 @@ unmap:
 return status;
 }
 
+static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor *segment, uint64_t nsgld, uint32_t *len,
+NvmeRequest *req)
+{
+dma_addr_t addr, trans_len;
+
+for (int i = 0; i < nsgld; i++) {
+if (NVME_SGL_TYPE(segment[i].type) != SGL_DESCR_TYPE_DATA_BLOCK) {
+trace_nvme_err_invalid_sgl_descriptor(req->cqe.cid,
+NVME_SGL_TYPE(segment[i].type));
+return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
+}
+
+if (*len == 0) {
+if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
+trace_nvme_err_invalid_sgl_excess_length(req->cqe.cid);
+return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+}
+
+break;
+}
+
+addr = le64_to_cpu(segment[i].addr);
+trans_len = MIN(*len, le64_to_cpu(segment[i].len));
+
+if (nvme_addr_is_cmb(n, addr)) {
+/*
+ * All data and metadata, if any, associated with a particular
+ * command shall be located in either the CMB or host memory. Thus,
+ * if an address is found to be in the CMB and we have already
+ * mapped data that is in host memory, the use is invalid.
+ */
+if (!req->is_cmb && qsg->size) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+
+req->is_cmb = true;
+} else {
+/*
+ * Similarly, if the address does not reference the CMB, but we
+ * have already established that the request has data or metadata
+ * in the CMB, the use is invalid.
+ */
+if (req->is_cmb) {
+return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+}
+}
+
+qemu_sglist_add(qsg, addr, trans_len);
+
+*len -= trans_len;
+}
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg,
+NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
+{
+const int MAX_NSGLD = 256;
+
+NvmeSglDescriptor segment[MAX_NSGLD];
+uint64_t nsgld;
+uint16_t status;
+bool sgl_in_cmb = false;
+hwaddr addr = le64_to_cpu(sgl.addr);
+
+t

[Qemu-block] [PATCH 05/16] nvme: populate the mandatory subnqn and ver fields

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1 or later. See NVM
Express 1.2.1, Section 5.11 ("Identify command"), Figure 90 and Section
7.9 ("NVMe Qualified Names").

This also bumps the supported version to 1.2.1.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ce2e5365385b..3c392dc336a8 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1364,12 +1364,18 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 id->ieee[0] = 0x00;
 id->ieee[1] = 0x02;
 id->ieee[2] = 0xb3;
+id->ver = cpu_to_le32(0x00010201);
 id->oacs = cpu_to_le16(0);
 id->frmw = 7 << 1;
 id->sqes = (0x6 << 4) | 0x6;
 id->cqes = (0x4 << 4) | 0x4;
 id->nn = cpu_to_le32(n->num_namespaces);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
+
+strcpy((char *) id->subnqn, "nqn.2014-08.org.nvmexpress:uuid:");
+qemu_uuid_unparse(&qemu_uuid,
+(char *) id->subnqn + strlen((char *) id->subnqn));
+
 id->psd[0].mp = cpu_to_le16(0x9c4);
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
@@ -1384,7 +1390,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 NVME_CAP_SET_CSS(n->bar.cap, 1);
 NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
-n->bar.vs = 0x00010200;
+n->bar.vs = 0x00010201;
 n->bar.intmc = n->bar.intms = 0;
 
 if (n->params.cmb_size_mb) {
-- 
2.20.1




[Qemu-block] [PATCH 09/16] nvme: support Asynchronous Event Request command

2019-07-05 Thread Klaus Birkelund Jensen
Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.2 ("Asynchronous Event Request command").

Modified from Keith's qemu-nvme tree.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 88 ++-
 hw/block/nvme.h   |  7 
 hw/block/trace-events |  7 
 3 files changed, 100 insertions(+), 2 deletions(-)
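
For reference (not part of the patch): the AER completion reports the
event in Dword 0 -- event type in bits 02:00, event information in bits
15:08 and the log page to read (and thereby clear the event) in bits
23:16, which is what NvmeAerResult maps onto. A host-side decode sketch
with a made-up name:

    static void decode_aer_dw0(uint32_t dw0, uint8_t *type, uint8_t *info,
                               uint8_t *log_page)
    {
        *type = dw0 & 0x7;
        *info = (dw0 >> 8) & 0xff;
        *log_page = (dw0 >> 16) & 0xff;
    }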

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index eb6af6508e2d..a20576654f1b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -39,6 +39,7 @@
 #include "nvme.h"
 
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -318,6 +319,51 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_process_aers(void *opaque)
+{
+NvmeCtrl *n = opaque;
+NvmeRequest *req;
+NvmeAerResult *result;
+NvmeAsyncEvent *event, *next;
+
+trace_nvme_process_aers();
+
+QSIMPLEQ_FOREACH_SAFE(event, &n->aer_queue, entry, next) {
+/* can't post cqe if there is nothing to complete */
+if (!n->outstanding_aers) {
+trace_nvme_no_outstanding_aers();
+break;
+}
+
+/* ignore if masked (cqe posted, but event not cleared) */
+if (n->aer_mask & (1 << event->result.event_type)) {
+trace_nvme_aer_masked(event->result.event_type, n->aer_mask);
+continue;
+}
+
+QSIMPLEQ_REMOVE_HEAD(&n->aer_queue, entry);
+
+n->aer_mask |= 1 << event->result.event_type;
+n->aer_mask_queued &= ~(1 << event->result.event_type);
+n->outstanding_aers--;
+
+req = n->aer_reqs[n->outstanding_aers];
+
+result = (NvmeAerResult *) &req->cqe.result;
+result->event_type = event->result.event_type;
+result->event_info = event->result.event_info;
+result->log_page = event->result.log_page;
+g_free(event);
+
+req->status = NVME_SUCCESS;
+
+trace_nvme_aer_post_cqe(result->event_type, result->event_info,
+result->log_page);
+
+nvme_enqueue_req_completion(&n->admin_cq, req);
+}
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
 NvmeRequest *req = opaque;
@@ -796,6 +842,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 break;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, cmd);
+case NVME_ASYNCHRONOUS_EVENT_CONF:
+result = cpu_to_le32(n->features.async_config);
 break;
 default:
 trace_nvme_err_invalid_getfeat(dw10);
@@ -841,11 +889,11 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
 ((n->params.num_queues - 2) << 16));
 break;
-
 case NVME_TIMESTAMP:
 return nvme_set_feature_timestamp(n, cmd);
+case NVME_ASYNCHRONOUS_EVENT_CONF:
+n->features.async_config = dw11;
 break;
-
 default:
 trace_nvme_err_invalid_setfeat(dw10);
 return NVME_INVALID_FIELD | NVME_DNR;
@@ -854,6 +902,22 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 return NVME_SUCCESS;
 }
 
+static uint16_t nvme_aer(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+trace_nvme_aer(req->cqe.cid);
+
+if (n->outstanding_aers > NVME_AERL) {
+trace_nvme_aer_aerl_exceeded();
+return NVME_AER_LIMIT_EXCEEDED;
+}
+
+n->aer_reqs[n->outstanding_aers] = req;
+timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+n->outstanding_aers++;
+
+return NVME_NO_COMPLETE;
+}
+
 static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 NvmeSQueue *sq;
@@ -918,6 +982,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, 
NvmeRequest *req)
 return nvme_set_feature(n, cmd, req);
 case NVME_ADM_CMD_GET_FEATURES:
 return nvme_get_feature(n, cmd, req);
+case NVME_ADM_CMD_ASYNC_EV_REQ:
+return nvme_aer(n, cmd, req);
 case NVME_ADM_CMD_ABORT:
 return nvme_abort(n, cmd, req);
 default:
@@ -963,6 +1029,7 @@ static void nvme_process_sq(void *opaque)
 
 static void nvme_clear_ctrl(NvmeCtrl *n)
 {
+NvmeAsyncEvent *event;
 int i;
 
 blk_drain(n->conf.blk);
@@ -978,8 +1045,19 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 }
 }
 
+if (n->aer_timer) {
+timer_del(n->aer_timer);
+timer_free(n->aer_timer);
+n->aer_timer = NULL;
+}
+while ((event = QSIMPLEQ_FIRST(&n->aer_queue)) != NULL) {
+QSI

[Qemu-block] [PATCH 13/16] nvme: simplify dma/cmb mappings

2019-07-05 Thread Klaus Birkelund Jensen
Instead of handling both QSGs and IOVs in multiple places, simply use
QSGs everywhere by assuming that the request does not involve the
controller memory buffer (CMB). If the request is found to involve the
CMB, convert the QSG to an IOV and issue the I/O. The QSG is converted
to an IOV by the dma helpers anyway, so the CMB path is not unfairly
affected by this simplifying change.

As a side-effect, this patch also allows PRPs to be located in the CMB.
The logic ensures that if some of the PRP is in the CMB, all of it must
be located there.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 277 --
 hw/block/nvme.h   |   3 +-
 hw/block/trace-events |   1 +
 include/block/nvme.h  |   1 +
 4 files changed, 187 insertions(+), 95 deletions(-)
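
Conceptually, the QSG-to-IOV conversion mentioned above amounts to turning
each (base, len) scatter/gather element into an iovec entry that points
straight into the CMB backing buffer; a sketch of such a helper (the
patch's actual dma_to_cmb() may differ in detail):

    static void dma_to_cmb_sketch(NvmeCtrl *n, QEMUSGList *qsg,
                                  QEMUIOVector *iov)
    {
        for (int i = 0; i < qsg->nsg; i++) {
            void *addr = &n->cmbuf[qsg->sg[i].base - n->ctrl_mem.addr];
            qemu_iovec_add(iov, addr, qsg->sg[i].len);
        }
    }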

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 8ad95fdfa261..02888dbfdbc1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -55,14 +55,21 @@
 
 static void nvme_process_sq(void *opaque);
 
+static inline uint8_t nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
+{
+return n->cmbsz && addr >= n->ctrl_mem.addr &&
+addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size));
+}
+
 static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
-if (n->cmbsz && addr >= n->ctrl_mem.addr &&
-addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
+if (nvme_addr_is_cmb(n, addr)) {
 memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
-} else {
-pci_dma_read(&n->parent_obj, addr, buf, size);
+
+return;
 }
+
+pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
 static void nvme_addr_write(NvmeCtrl *n, hwaddr addr, void *buf, int size)
@@ -151,139 +158,200 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue 
*cq)
 }
 }
 
-static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
- uint64_t prp2, uint32_t len, NvmeCtrl *n)
+static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, uint64_t prp1,
+uint64_t prp2, uint32_t len, NvmeRequest *req)
 {
 hwaddr trans_len = n->page_size - (prp1 % n->page_size);
 trans_len = MIN(len, trans_len);
 int num_prps = (len >> n->page_bits) + 1;
+uint16_t status = NVME_SUCCESS;
+bool prp_list_in_cmb = false;
+
+trace_nvme_map_prp(req->cmd.opcode, trans_len, len, prp1, prp2, num_prps);
 
 if (unlikely(!prp1)) {
 trace_nvme_err_invalid_prp();
 return NVME_INVALID_FIELD | NVME_DNR;
-} else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
-   prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
-qsg->nsg = 0;
-qemu_iovec_init(iov, num_prps);
-qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], 
trans_len);
-} else {
-pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
-qemu_sglist_add(qsg, prp1, trans_len);
 }
+
+if (nvme_addr_is_cmb(n, prp1)) {
+req->is_cmb = true;
+}
+
+pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
+qemu_sglist_add(qsg, prp1, trans_len);
+
 len -= trans_len;
 if (len) {
 if (unlikely(!prp2)) {
 trace_nvme_err_invalid_prp2_missing();
+status = NVME_INVALID_FIELD | NVME_DNR;
 goto unmap;
 }
+
 if (len > n->page_size) {
 uint64_t prp_list[n->max_prp_ents];
 uint32_t nents, prp_trans;
 int i = 0;
 
+if (nvme_addr_is_cmb(n, prp2)) {
+prp_list_in_cmb = true;
+}
+
 nents = (len + n->page_size - 1) >> n->page_bits;
 prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
+nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
 while (len != 0) {
+bool addr_is_cmb;
 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
 
 if (i == n->max_prp_ents - 1 && len > n->page_size) {
 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
 trace_nvme_err_invalid_prplist_ent(prp_ent);
+status = NVME_INVALID_FIELD | NVME_DNR;
+goto unmap;
+}
+
+addr_is_cmb = nvme_addr_is_cmb(n, prp_ent);
+if ((prp_list_in_cmb && !addr_is_cmb) ||
+(!prp_list_in_cmb && addr_is_cmb)) {
+status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
 goto unmap;
 }
 
 i = 0;
 nents = (len + n->page_size 

[Qemu-block] [PATCH 14/16] nvme: support multiple block requests per request

2019-07-05 Thread Klaus Birkelund Jensen
Currently, the device only issues a single block backend request per
NVMe request, but as we move towards supporting metadata (and
discontiguous vector requests supported by OpenChannel 2.0) it will be
required to issue multiple block backend requests per NVMe request.

With this patch the NVMe device is ready for that.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 322 --
 hw/block/nvme.h   |  49 +--
 hw/block/trace-events |   3 +
 3 files changed, 290 insertions(+), 84 deletions(-)
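
For reference (not part of the hunks below): MDTS is reported as a power
of two in units of the minimum memory page size (CAP.MPSMIN), so the
default mdts=7 with a 4 KiB minimum page size caps a single command at
512 KiB of data. A value of 0 means no limit is reported; the sketch
assumes mdts > 0.

    static size_t max_data_transfer_bytes(uint8_t mdts, size_t mpsmin_bytes)
    {
        return mpsmin_bytes << mdts;   /* e.g. 4096 << 7 == 512 KiB */
    }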

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 02888dbfdbc1..b285119fd29a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -25,6 +25,8 @@
  *  Default: 64
  *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
  *  Default: 0 (disabled)
+ *   mdts= : Maximum Data Transfer Size (power of two)
+ *  Default: 7
  */
 
 #include "qemu/osdep.h"
@@ -319,10 +321,9 @@ static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
 uint64_t prp1, uint64_t prp2, NvmeRequest *req)
 {
-QEMUSGList qsg;
 uint16_t err = NVME_SUCCESS;
 
-err = nvme_map_prp(n, &qsg, prp1, prp2, len, req);
+err = nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
 if (err) {
 return err;
 }
@@ -330,8 +331,8 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 if (req->is_cmb) {
 QEMUIOVector iov;
 
-qemu_iovec_init(&iov, qsg.nsg);
-dma_to_cmb(n, &qsg, &iov);
+qemu_iovec_init(&iov, req->qsg.nsg);
+dma_to_cmb(n, &req->qsg, &iov);
 
 if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
 trace_nvme_err_invalid_dma();
@@ -343,17 +344,86 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t 
*ptr, uint32_t len,
 goto out;
 }
 
-if (unlikely(dma_buf_read(ptr, len, &qsg))) {
+if (unlikely(dma_buf_read(ptr, len, &req->qsg))) {
 trace_nvme_err_invalid_dma();
 err = NVME_INVALID_FIELD | NVME_DNR;
 }
 
 out:
-qemu_sglist_destroy(&qsg);
+qemu_sglist_destroy(&req->qsg);
 
 return err;
 }
 
+static void nvme_blk_req_destroy(NvmeBlockBackendRequest *blk_req)
+{
+if (blk_req->iov.nalloc) {
+qemu_iovec_destroy(&blk_req->iov);
+}
+
+g_free(blk_req);
+}
+
+static void nvme_blk_req_put(NvmeCtrl *n, NvmeBlockBackendRequest *blk_req)
+{
+nvme_blk_req_destroy(blk_req);
+}
+
+static NvmeBlockBackendRequest *nvme_blk_req_get(NvmeCtrl *n, NvmeRequest *req,
+QEMUSGList *qsg)
+{
+NvmeBlockBackendRequest *blk_req = g_malloc0(sizeof(*blk_req));
+
+blk_req->req = req;
+
+if (qsg) {
+blk_req->qsg = qsg;
+}
+
+return blk_req;
+}
+
+static uint16_t nvme_blk_setup(NvmeCtrl *n, NvmeNamespace *ns, QEMUSGList *qsg,
+NvmeRequest *req)
+{
+NvmeBlockBackendRequest *blk_req = nvme_blk_req_get(n, req, qsg);
+if (!blk_req) {
+NVME_GUEST_ERR(nvme_err_internal_dev_error, "nvme_blk_req_get: %s",
+"could not allocate memory");
+return NVME_INTERNAL_DEV_ERROR;
+}
+
+blk_req->slba = req->slba;
+blk_req->nlb = req->nlb;
+blk_req->blk_offset = req->slba * nvme_ns_lbads_bytes(ns);
+
+QTAILQ_INSERT_TAIL(&req->blk_req_tailq, blk_req, tailq_entry);
+
+return NVME_SUCCESS;
+}
+
+static uint16_t nvme_blk_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+NvmeNamespace *ns = req->ns;
+uint16_t err;
+
+uint32_t len = req->nlb * nvme_ns_lbads_bytes(ns);
+uint64_t prp1 = le64_to_cpu(cmd->prp1);
+uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+err = nvme_map_prp(n, &req->qsg, prp1, prp2, len, req);
+if (err) {
+return err;
+}
+
+err = nvme_blk_setup(n, ns, &req->qsg, req);
+if (err) {
+return err;
+}
+
+return NVME_SUCCESS;
+}
+
 static void nvme_post_cqes(void *opaque)
 {
 NvmeCQueue *cq = opaque;
@@ -388,6 +458,10 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 {
 assert(cq->cqid == req->sq->cqid);
 
+if (req->qsg.nalloc) {
+qemu_sglist_destroy(&req->qsg);
+}
+
 trace_nvme_enqueue_req_completion(req->cqe.cid, cq->cqid);
 QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
 QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
@@ -471,130 +545,224 @@ static void nvme_process_aers(void *opaque)
 
 static void nvme_rw_cb(void *opaque, int ret)
 {
-NvmeRequest *req = opaque;
+NvmeBlockBackendRequest *blk_req = opaque;
+NvmeRequest *req = blk_req->req;
 NvmeSQueue *sq = req->sq;
 NvmeCtrl *

[Qemu-block] [PATCH 16/16] nvme: support multiple namespaces

2019-07-05 Thread Klaus Birkelund Jensen
This adds support for multiple namespaces by introducing a new 'nvme-ns'
device model. The nvme device creates a bus named from the device name
('id'). The nvme-ns devices then connect to this and register
themselves with the nvme device.

This changes how an nvme device is created. Example with two namespaces:

  -drive file=nvme0n1.img,if=none,id=disk1
  -drive file=nvme0n2.img,if=none,id=disk2
  -device nvme,serial=deadbeef,id=nvme0
  -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
  -device nvme-ns,drive=disk2,bus=nvme0,nsid=2

A maximum of 256 namespaces can be configured.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/Makefile.objs |   2 +-
 hw/block/nvme-ns.c | 139 +
 hw/block/nvme-ns.h |  35 +
 hw/block/nvme.c| 169 -
 hw/block/nvme.h|  29 ---
 hw/block/trace-events  |   1 +
 6 files changed, 255 insertions(+), 120 deletions(-)
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
index f5f643f0cc06..d44a2f4b780d 100644
--- a/hw/block/Makefile.objs
+++ b/hw/block/Makefile.objs
@@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
 common-obj-$(CONFIG_XEN) += xen-block.o
 common-obj-$(CONFIG_ECC) += ecc.o
 common-obj-$(CONFIG_ONENAND) += onenand.o
-common-obj-$(CONFIG_NVME_PCI) += nvme.o
+common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
 
 obj-$(CONFIG_SH4) += tc58128.o
 
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
new file mode 100644
index ..11b594467991
--- /dev/null
+++ b/hw/block/nvme-ns.c
@@ -0,0 +1,139 @@
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
+#include "hw/block/block.h"
+#include "hw/pci/msix.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+
+#include "hw/qdev-core.h"
+
+#include "nvme.h"
+#include "nvme-ns.h"
+
+static uint64_t nvme_ns_calc_blks(NvmeNamespace *ns)
+{
+return ns->size / nvme_ns_lbads_bytes(ns);
+}
+
+static void nvme_ns_init_identify(NvmeIdNs *id_ns)
+{
+id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+}
+
+static int nvme_ns_init(NvmeNamespace *ns)
+{
+uint64_t ns_blks;
+NvmeIdNs *id_ns = &ns->id_ns;
+
+nvme_ns_init_identify(id_ns);
+
+ns_blks = nvme_ns_calc_blks(ns);
+id_ns->nuse = id_ns->ncap = id_ns->nsze = cpu_to_le64(ns_blks);
+
+return 0;
+}
+
+static int nvme_ns_init_blk(NvmeNamespace *ns, NvmeIdCtrl *id, Error **errp)
+{
+blkconf_blocksizes(&ns->conf);
+
+if (!blkconf_apply_backend_options(&ns->conf,
+blk_is_read_only(ns->conf.blk), false, errp)) {
+return 1;
+}
+
+ns->size = blk_getlength(ns->conf.blk);
+if (ns->size < 0) {
+error_setg_errno(errp, -ns->size, "blk_getlength");
+return 1;
+}
+
+if (!blk_enable_write_cache(ns->conf.blk)) {
+id->vwc = 0;
+}
+
+return 0;
+}
+
+static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
+{
+if (!ns->conf.blk) {
+error_setg(errp, "nvme-ns: block backend not configured");
+return 1;
+}
+
+return 0;
+}
+
+
+static void nvme_ns_realize(DeviceState *dev, Error **errp)
+{
+NvmeNamespace *ns = NVME_NS(dev);
+BusState *s = qdev_get_parent_bus(dev);
+NvmeCtrl *n = NVME(s->parent);
+Error *local_err = NULL;
+
+if (nvme_ns_check_constraints(ns, &local_err)) {
+error_propagate_prepend(errp, local_err,
+"nvme_ns_check_constraints: ");
+return;
+}
+
+if (nvme_ns_init_blk(ns, &n->id_ctrl, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_ns_init_blk: ");
+return;
+}
+
+nvme_ns_init(ns);
+if (nvme_register_namespace(n, ns, &local_err)) {
+error_propagate_prepend(errp, local_err, "nvme_register_namespace: ");
+return;
+}
+}
+
+static Property nvme_ns_props[] = {
+DEFINE_BLOCK_PROPERTIES(NvmeNamespace, conf),
+DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
+DEFINE_PROP_END_OF_LIST(),
+};
+
+static void nvme_ns_class_init(ObjectClass *oc, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(oc);
+
+set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
+
+dc->bus_type = TYPE_NVME_BUS;
+dc->realize = nvme_ns_realize;
+dc->props = nvme_ns_props;
+dc->desc = "virtual nvme namespace";
+}
+
+static void nvme_ns_instance_init(Object *obj)
+{
+NvmeNamespace *ns = NVME_NS(obj);
+char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
+
+device_add_bootindex_property(obj, &ns->conf.booti

[Qemu-block] [PATCH 12/16] nvme: bump supported NVMe revision to 1.3d

2019-07-05 Thread Klaus Birkelund Jensen
Add the new Namespace Identification Descriptor List (CNS 03h) and track
creation of queues to enable the controller to return Command Sequence
Error if Set Features is called for Number of Queues after any queues
have been created.

Signed-off-by: Klaus Birkelund Jensen 
---
 hw/block/nvme.c   | 84 ---
 hw/block/nvme.h   |  1 +
 hw/block/trace-events |  4 ++-
 include/block/nvme.h  | 30 +---
 4 files changed, 102 insertions(+), 17 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 8259dd7c1d6c..8ad95fdfa261 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,20 +9,22 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.3d, 1.2, 1.1, 1.0e
  *
  *  http://www.nvmexpress.org/resources/
  */
 
 /**
  * Usage: add options:
- *  -drive file=,if=none,id=
- *  -device nvme,drive=,serial=,id=, \
- *  cmb_size_mb=, \
- *  num_queues=
+ * -drive file=,if=none,id=
+ * -device nvme,drive=,serial=,id=
  *
- * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
- * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
+ * Advanced optional options:
+ *
+ *   num_queues=  : Maximum number of IO Queues.
+ *  Default: 64
+ *   cmb_size_mb= : Size of Controller Memory Buffer in MBs.
+ *  Default: 0 (disabled)
  */
 
 #include "qemu/osdep.h"
@@ -43,6 +45,7 @@
 #define NVME_ELPE 3
 #define NVME_AERL 3
 #define NVME_OP_ABORTED 0xff
+
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
 (trace_##trace)(__VA_ARGS__); \
@@ -316,6 +319,8 @@ static void nvme_post_cqes(void *opaque)
 static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 {
 assert(cq->cqid == req->sq->cqid);
+
+trace_nvme_enqueue_req_completion(req->cqe.cid, cq->cqid);
 QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
 QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
 timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
@@ -534,6 +539,7 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
 if (sq->sqid) {
 g_free(sq);
 }
+n->qs_created--;
 }
 
 static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -600,6 +606,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
 cq = n->cq[cqid];
 QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
 n->sq[sqid] = sq;
+n->qs_created++;
 }
 
 static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -649,6 +656,7 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
 if (cq->cqid) {
 g_free(cq);
 }
+n->qs_created--;
 }
 
 static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -689,6 +697,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
 msix_vector_use(&n->parent_obj, cq->vector);
 n->cq[cqid] = cq;
 cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+n->qs_created++;
 }
 
 static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
@@ -762,7 +771,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
 prp1, prp2);
 }
 
-static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
+static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
 {
 static const int data_len = 4 * KiB;
 uint32_t min_nsid = le32_to_cpu(c->nsid);
@@ -772,7 +781,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
 uint16_t ret;
 int i, j = 0;
 
-trace_nvme_identify_nslist(min_nsid);
+trace_nvme_identify_ns_list(min_nsid);
 
 list = g_malloc0(data_len);
 for (i = 0; i < n->num_namespaces; i++) {
@@ -789,6 +798,47 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
 return ret;
 }
 
+static uint16_t nvme_identify_ns_descriptor_list(NvmeCtrl *n, NvmeCmd *c)
+{
+static const int data_len = 4 * KiB;
+
+/*
+ * The device model does not have anywhere to store a persistent UUID, so
+ * conjure up something that is reproducible. We generate an UUID of the
+ * form "----", where nsid is similar to, say,
+ * 0001.
+ */
+struct ns_descr {
+uint8_t nidt;
+uint8_t nidl;
+uint8_t rsvd[14];
+uint32_t nid;
+};
+
+uint32_t nsid = le32_to_cpu(c->nsid);
+uint64_t prp1 = le64_to_cpu(c->prp1);
+uint64_t prp2 = le64_to_cpu(c->prp2);
+
+struct ns_descr *list;
+uint16_t ret;
+
+trace_nvme_identify_ns_descriptor_list(nsid);
+
+if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
+return NVME_INVALID_NSID | NVME_DNR;
+}
+
+list = g_malloc0(data_len);
+   

Re: [PATCH v5 26/26] nvme: make lba data size configurable

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:43, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:08AM +0100, Klaus Jensen wrote:
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme-ns.c | 2 +-
> >  hw/block/nvme-ns.h | 4 +++-
> >  hw/block/nvme.c| 1 +
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 0e5be44486f4..981d7101b8f2 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -18,7 +18,7 @@ static int nvme_ns_init(NvmeNamespace *ns)
> >  {
> >  NvmeIdNs *id_ns = &ns->id_ns;
> >  
> > -id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->lbaf[0].ds = ns->params.lbads;
> >  id_ns->nuse = id_ns->ncap = id_ns->nsze =
> >  cpu_to_le64(nvme_ns_nlbas(ns));
> >  
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index b564bac25f6d..f1fe4db78b41 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -7,10 +7,12 @@
> >  
> >  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> >  DEFINE_PROP_DRIVE("drive", _state, blk), \
> > -DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > +DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> > +DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)
> 
> I think we need to validate the parameter is between 9 and 12 before
> trusting it can be used safely.
> 
> Alternatively, add supported formats to the lbaf array and let the host
> decide on a live system with the 'format' command.

The device does not yet support Format NVM, but we have a patch ready
for that to be submitted with a new series when this is merged.

For now, while it does not support Format, I will change this patch such
that it defaults to 9 (BDRV_SECTOR_BITS) and only accepts 12 as an
alternative (while always keeping the number of formats available to 1).
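
A minimal sketch of what that check could look like (the parameter name
matches the lbads property above, but the exact wording and placement
are assumptions on my part, not the v6 patch):

    static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
    {
        /* only 512 byte (2^9) and 4096 byte (2^12) LBAs are accepted */
        if (ns->params.lbads != 9 && ns->params.lbads != 12) {
            error_setg(errp, "unsupported lbads (supported: 9, 12)");
            return 1;
        }

        return 0;
    }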


Re: [PATCH v5 22/26] nvme: support multiple namespaces

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:31, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:04AM +0100, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> 
> I like this feature a lot, thanks for doing it.
> 
> Reviewed-by: Keith Busch 
> 
> > @@ -1256,18 +1272,24 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, 
> > NvmeCmd *cmd, uint8_t rae,
> >  uint64_t units_read = 0, units_written = 0, read_commands = 0,
> >  write_commands = 0;
> >  NvmeSmartLog smart;
> > -BlockAcctStats *s;
> >  
> >  if (nsid && nsid != 0x) {
> >  return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> 
> This is totally optional, but worth mentioning: this patch makes it
> possible to remove this check and allow per-namespace smart logs. The
> ID_CTRL.LPA would need to be updated to reflect that if you wanted to
> go that route.

Yeah, I thought about that, but with NVMe v1.4 support arriving in a
later series, there are no longer any namespace specific stuff in the
log page anyway.

The spec isn't really clear on what the preferred behavior for a 1.4
compliant device is. Either

  1. LPA bit 0 set and just return the same page for each namespace or,
  2. LPA bit 0 unset and fail when NSID is set
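
Expressed as code, the two options roughly amount to the following
(sketch only; the field and status code names follow what is already
used in hw/block/nvme.c):

    /* option 1: LPA bit 0 set, same page returned regardless of NSID */
    id->lpa |= 1 << 0;

    /* option 2: LPA bit 0 clear, reject namespace-scoped requests */
    if (nsid && nsid != 0xffffffff) {
        return NVME_INVALID_FIELD | NVME_DNR;
    }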



Re: [PATCH v5 24/26] nvme: change controller pci id

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:35, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:06AM +0100, Klaus Jensen wrote:
> > There are two reasons for changing this:
> > 
> >   1. The nvme device currently uses an internal Intel device id.
> > 
> >   2. Since commits "nvme: fix write zeroes offset and count" and "nvme:
> >  support multiple namespaces" the controller device no longer has
> >  the quirks that the Linux kernel think it has.
> > 
> >  As the quirks are applied based on pci vendor and device id, change
> >  them to get rid of the quirks.
> > 
> > To keep backward compatibility, add a new 'x-use-intel-id' parameter to
> > the nvme device to force use of the Intel vendor and device id. This is
> > off by default but add a compat property to set this for machines 4.2
> > and older.
> > 
> > Signed-off-by: Klaus Jensen 
> 
> Yay, thank you for following through on getting this identifier assigned.
> 
> Reviewed-by: Keith Busch 

This is technically not "officially" sanctioned yet, but I got an
indication from Gerd that we are good to proceed with this.



Re: [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2020-02-05 Thread Klaus Birkelund Jensen
On Feb  5 01:47, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:51:42AM +0100, Klaus Jensen wrote:
> > Hi,
> > 
> > 
> > Changes since v4
> >  - Changed vendor and device id to use a Red Hat allocated one. For
> >backwards compatibility add the 'x-use-intel-id' nvme device
> >parameter. This is off by default but is added as a machine compat
> >property to be true for machine types <= 4.2.
> > 
> >  - SGL mapping code has been refactored.
> 
> Looking pretty good to me. For the series beyond the individually
> reviewed patches:
> 
> Acked-by: Keith Busch 
> 
> If you need to send a v5, you may add my tag to the patches that are not
> substaintially modified if you like.

I'll send a v6 with the changes to "nvme: make lba data size
configurable". It won't be substantially changed, I will just only
accept 9 and 12 as valid values for lbads.

Thanks for the Ack's and Reviews Keith!


Klaus


Re: [PATCH v5 01/26] nvme: rename trace events to nvme_dev

2020-02-12 Thread Klaus Birkelund Jensen
On Feb 12 11:08, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Change the prefix of all nvme device related trace events to 'nvme_dev'
> > to not clash with trace events from the nvme block driver.
> > 

Hi Maxim,

Thank you very much for your thorough reviews! Utterly appreciated!

I'll start going through your suggested changes. There is a bit of work
to do on splitting patches into refactoring and bugfixes, but I can
definitely see the reason for this, so I'll get to work.

You mention the alignment with split lines a lot. I actually thought I
was following CODING_STYLE.rst (which allows a single 4 space indent for
functions, but not statements such as if/else and while/for). But since
hw/block/nvme.c is originally written in the style of aligning with the
opening parenthesis I'm in the wrong here, so I will of course amend
it. Should have done that from the beginning, it's just my personal
taste shining through ;)


Thanks again,
Klaus



Re: [PATCH RESEND v2] block/nvme: introduce PMR support from NVMe 1.4 spec

2020-03-11 Thread Klaus Birkelund Jensen
On Mar 11 15:54, Andrzej Jakowski wrote:
> On 3/11/20 2:20 AM, Stefan Hajnoczi wrote:
> > Please try:
> > 
> >   $ git grep pmem
> > 
> > backends/hostmem-file.c is the backend that can be used and the
> > pmem_persist() API can be used to flush writes.
> 
> I've reworked this patch into hostmem-file type of backend.
> From simple tests in virtual machine: writing to PMR region
> and then reading from it after VM power cycle I have observed that
> there is no persistency.
> 
> I guess that persistent behavior can be achieved if memory backend file
> resides on actual persistent memory in VMM. I haven't found mechanism to
> persist memory backend file when it resides in the file system on block
> storage. My original mmap + msync based solution worked well there.
> I believe that the main problem with mmap was with "ifdef _WIN32" that made it 
> platform specific and w/o it patchew CI complained. 
> Is there a way that I could rework mmap + msync solution so it would fit
> into qemu design?
> 

Hi Andrzej,

Thanks for working on this!

FWIW, I have implemented other stuff for the NVMe device that requires
persistent storage (e.g. LBA allocation tracking for DULBE support). I
used the approach of adding an additional blockdev and simply use the
qemu block layer. This would also make it work on WIN32. And if we just
set bit 0 in PMRWBM and disable the write cache on the blockdev we
should be good on the durability requirements.

Unfortunately, I do not see (or know, maybe Stefan has an idea?) an easy
way of using the MemoryRegionOps nicely with async block backend I/O, so
we either have to use blocking I/O or fire-and-forget AIO. Or, we can
maybe keep bit 1 set in PMRWBM and force a blocking blk_flush on PMRSTS
read.
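
As a very rough sketch of that last idea (the PMRSTS offset, the field
name and the helper below are illustrative assumptions, not code from
this patch):

    static uint64_t nvme_pmr_read(void *opaque, hwaddr addr, unsigned size)
    {
        NvmeCtrl *n = opaque;

        if (addr == 0xe08) { /* PMRSTS offset within the BAR0 registers */
            /*
             * With PMRWBM bit 1 set, a read of PMRSTS must ensure that
             * prior writes to the PMR have been made persistent.
             */
            blk_flush(n->pmr_blk); /* hypothetical backing blockdev */
        }

        return nvme_pmr_reg_read(n, addr, size); /* hypothetical helper */
    }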

Finally, a thing to consider is that this is adding an optional NVMe 1.4
feature to an already frankenstein device that doesn't even implement
mandatory v1.2. I think that bumping the NVMe version to 1.4 is out of
the question until we actually implement it fully wrt. mandatory
features. My patchset brings the device up to v1.3 and I have v1.4 ready
for posting, so I think we can get there.


Klaus



Re: [PATCH v5 08/26] nvme: refactor device realization

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 11:27, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > This patch splits up nvme_realize into multiple individual functions,
> > each initializing a different subset of the device.
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c | 175 +++-
> >  hw/block/nvme.h |  21 ++
> >  2 files changed, 133 insertions(+), 63 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index e1810260d40b..81514eaef63a 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -44,6 +44,7 @@
> >  #include "nvme.h"
> >  
> >  #define NVME_SPEC_VER 0x00010201
> > +#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> >  
> >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> >  do { \
> > @@ -1325,67 +1326,106 @@ static const MemoryRegionOps nvme_cmb_ops = {
> >  },
> >  };
> >  
> > -static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > +static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
> >  {
> > -NvmeCtrl *n = NVME(pci_dev);
> > -NvmeIdCtrl *id = &n->id_ctrl;
> > -
> > -int i;
> > -int64_t bs_size;
> > -uint8_t *pci_conf;
> > -
> > -if (!n->params.num_queues) {
> > -error_setg(errp, "num_queues can't be zero");
> > -return;
> > -}
> > +NvmeParams *params = &n->params;
> >  
> >  if (!n->conf.blk) {
> > -error_setg(errp, "drive property not set");
> > -return;
> > +error_setg(errp, "nvme: block backend not configured");
> > +return 1;
> As a matter of taste, negative values indicate error, and 0 is the success 
> value.
> In Linux kernel this is even an official rule.
> >  }

Fixed.

> >  
> > -bs_size = blk_getlength(n->conf.blk);
> > -if (bs_size < 0) {
> > -error_setg(errp, "could not get backing file size");
> > -return;
> > +if (!params->serial) {
> > +error_setg(errp, "nvme: serial not configured");
> > +return 1;
> >  }
> >  
> > -if (!n->params.serial) {
> > -error_setg(errp, "serial property not set");
> > -return;
> > +if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
> > +error_setg(errp, "nvme: invalid queue configuration");
> Maybe something like "nvme: invalid queue count specified, should be between 
> 1 and ..."?
> > +return 1;
> >  }

Fixed.

> > +
> > +return 0;
> > +}
> > +
> > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > +{
> >  blkconf_blocksizes(&n->conf);
> >  if (!blkconf_apply_backend_options(&n->conf, 
> > blk_is_read_only(n->conf.blk),
> > -   false, errp)) {
> > -return;
> > +false, errp)) {
> > +return 1;
> >  }
> >  
> > -pci_conf = pci_dev->config;
> > -pci_conf[PCI_INTERRUPT_PIN] = 1;
> > -pci_config_set_prog_interface(pci_dev->config, 0x2);
> > -pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> > -pcie_endpoint_cap_init(pci_dev, 0x80);
> > +return 0;
> > +}
> >  
> > +static void nvme_init_state(NvmeCtrl *n)
> > +{
> >  n->num_namespaces = 1;
> >  n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
> 
> Isn't that wrong?
> First 4K of mmio (0x1000) is the registers, and that is followed by the 
> doorbells,
> and each doorbell takes 8 bytes (assuming regular doorbell stride).
> so n->params.num_queues + 1 should be total number of queues, thus the 0x1004 
> should be 0x1000 IMHO.
> I might miss some rounding magic here though.
> 

Yeah. I think you are right. It all becomes slightly more fishy due to
the num_queues device parameter being 1's based and accounts for the
admin queue pair.

But in get/set features, the value has to be 0's based and only account
for the I/O queues, so we need to subtract 2 from the value. It's
confusing all around.

Since the admin queue pair isn't really optional I think it would be
better that we introduce a new max_ioqpairs parameter that is 1's
based, counts the number of pairs and obviously only accounts for the
I/O queues.

I guess we need to keep the num_queues parameter around for
compatibility.

The doorbells are only 4 bytes btw, but the calculation still looks
wrong. With a max_ioqpairs parameter in place, the reg_size should be

pow2ceil(0x1008 + 2 * (n->params.max_ioqpairs) * 4)

Right? That's 0x1000 for the core registers, 8 bytes for the sq/cq
doorbells for the admin queue pair, and then room for the i/o queue
pairs.

I added a patch for this in v6.
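
For reference, the calculation I have in mind looks something like this
(max_ioqpairs being the new parameter; a sketch, not the v6 patch
verbatim):

    /*
     * 0x1000 bytes of controller registers, a 4 byte SQ doorbell and a
     * 4 byte CQ doorbell for the admin queue pair (0x1008 in total),
     * and then 8 bytes per I/O queue pair.
     */
    n->reg_size = pow2ceil(0x1008 + 2 * n->params.max_ioqpairs * 4);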

> > -n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> > -
> >  n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >  n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> >  n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > +}
> >  
> > -memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
> > -  "nvme", n->reg_size);
> > +static void nvme_in

Re: [PATCH v5 09/26] nvme: add temperature threshold feature

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 11:31, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > It might seem weird to implement this feature for an emulated device,
> > but it is mandatory to support and the feature is useful for testing
> > asynchronous event request support, which will be added in a later
> > patch.
> 
> Absolutely but as the old saying is, rules are rules.
> At least, to the defense of the spec, making this mandatory
> forced the vendors to actually report some statistics about
> the device in neutral format as opposed to yet another
> vendor proprietary thing (I am talking about SMART log page).
> 
> > 
> > Signed-off-by: Klaus Jensen 
> 
> I noticed that you sign off some patches with your @samsung.com email,
> and some with @cnexlabs.com
> Is there a reason for that?

Yeah. Some of this code was written while I was at CNEX Labs. I've since
moved to Samsung. But credit where credit's due.

> 
> 
> > ---
> >  hw/block/nvme.c  | 50 
> >  hw/block/nvme.h  |  2 ++
> >  include/block/nvme.h |  7 ++-
> >  3 files changed, 58 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 81514eaef63a..f72348344832 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -45,6 +45,9 @@
> >  
> >  #define NVME_SPEC_VER 0x00010201
> >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> > +#define NVME_TEMPERATURE 0x143
> > +#define NVME_TEMPERATURE_WARNING 0x157
> > +#define NVME_TEMPERATURE_CRITICAL 0x175
> >  
> >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> >  do { \
> > @@ -798,9 +801,31 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl 
> > *n, NvmeCmd *cmd)
> >  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest 
> > *req)
> >  {
> >  uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > +uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  uint32_t result;
> >  
> >  switch (dw10) {
> > +case NVME_TEMPERATURE_THRESHOLD:
> > +result = 0;
> > +
> > +/*
> > + * The controller only implements the Composite Temperature 
> > sensor, so
> > + * return 0 for all other sensors.
> > + */
> > +if (NVME_TEMP_TMPSEL(dw11)) {
> > +break;
> > +}
> > +
> > +switch (NVME_TEMP_THSEL(dw11)) {
> > +case 0x0:
> > +result = cpu_to_le16(n->features.temp_thresh_hi);
> > +break;
> > +case 0x1:
> > +result = cpu_to_le16(n->features.temp_thresh_low);
> > +break;
> > +}
> > +
> > +break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  result = blk_enable_write_cache(n->conf.blk);
> >  trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
> > @@ -845,6 +870,23 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd 
> > *cmd, NvmeRequest *req)
> >  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  
> >  switch (dw10) {
> > +case NVME_TEMPERATURE_THRESHOLD:
> > +if (NVME_TEMP_TMPSEL(dw11)) {
> > +break;
> > +}
> > +
> > +switch (NVME_TEMP_THSEL(dw11)) {
> > +case 0x0:
> > +n->features.temp_thresh_hi = NVME_TEMP_TMPTH(dw11);
> > +break;
> > +case 0x1:
> > +n->features.temp_thresh_low = NVME_TEMP_TMPTH(dw11);
> > +break;
> > +default:
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +break;
> >  case NVME_VOLATILE_WRITE_CACHE:
> >  blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >  break;
> > @@ -1366,6 +1408,9 @@ static void nvme_init_state(NvmeCtrl *n)
> >  n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >  n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> >  n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > +
> > +n->temperature = NVME_TEMPERATURE;
> 
> This appears not to be used in the patch.
> I think you should move that to the next patch that
> adds the get log page support.
> 

Fixed.

> > +n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> >  }
> >  
> >  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > @@ -1447,6 +1492,11 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >  id->acl = 3;
> >  id->frmw = 7 << 1;
> >  id->lpa = 1 << 0;
> > +
> > +/* recommended default value (~70 C) */
> > +id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> > +id->cctemp = cpu_to_le16(NVME_TEMPERATURE_CRITICAL);
> > +
> >  id->sqes = (0x6 << 4) | 0x6;
> >  id->cqes = (0x4 << 4) | 0x4;
> >  id->nn = cpu_to_le32(n->num_namespaces);
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index a867bdfabafd..1518f32557a3 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -108,6 +108,7 @@ typedef struct NvmeCtrl {
> >  uint64_tirq_status;
> >  uint64_thost_timestamp; /* Timestamp sent by the 
> > host */
> >  ui

Re: [PATCH v5 10/26] nvme: add support for the get log page command

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 11:35, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Add support for the Get Log Page command and basic implementations of
> > the mandatory Error Information, SMART / Health Information and Firmware
> > Slot Information log pages.
> > 
> > In violation of the specification, the SMART / Health Information log
> > page does not persist information over the lifetime of the controller
> > because the device has no place to store such persistent state.
> Yea, not the end of the world.
> > 
> > Note that the LPA field in the Identify Controller data structure
> > intentionally has bit 0 cleared because there is no namespace specific
> > information in the SMART / Health information log page.
> Makes sense.
> > 
> > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > Section 5.10 ("Get Log Page command").
> > 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/nvme.c   | 122 +-
> >  hw/block/nvme.h   |  10 
> >  hw/block/trace-events |   2 +
> >  include/block/nvme.h  |   2 +-
> >  4 files changed, 134 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f72348344832..468c36918042 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -569,6 +569,123 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd 
> > *cmd)
> >  return NVME_SUCCESS;
> >  }
> >  
> > +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > buf_len,
> > +uint64_t off, NvmeRequest *req)
> > +{
> > +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +uint32_t nsid = le32_to_cpu(cmd->nsid);
> > +
> > +uint32_t trans_len;
> > +time_t current_ms;
> > +uint64_t units_read = 0, units_written = 0, read_commands = 0,
> > +write_commands = 0;
> > +NvmeSmartLog smart;
> > +BlockAcctStats *s;
> > +
> > +if (nsid && nsid != 0x) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +s = blk_get_stats(n->conf.blk);
> > +
> > +units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > +units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > +read_commands = s->nr_ops[BLOCK_ACCT_READ];
> > +write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> > +
> > +if (off > sizeof(smart)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +trans_len = MIN(sizeof(smart) - off, buf_len);
> > +
> > +memset(&smart, 0x0, sizeof(smart));
> > +
> > +smart.data_units_read[0] = cpu_to_le64(units_read / 1000);
> > +smart.data_units_written[0] = cpu_to_le64(units_written / 1000);
> > +smart.host_read_commands[0] = cpu_to_le64(read_commands);
> > +smart.host_write_commands[0] = cpu_to_le64(write_commands);
> > +
> > +smart.temperature[0] = n->temperature & 0xff;
> > +smart.temperature[1] = (n->temperature >> 8) & 0xff;
> > +
> > +if ((n->temperature > n->features.temp_thresh_hi) ||
> > +(n->temperature < n->features.temp_thresh_low)) {
> > +smart.critical_warning |= NVME_SMART_TEMPERATURE;
> > +}
> > +
> > +current_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > +smart.power_on_hours[0] = cpu_to_le64(
> > +(((current_ms - n->starttime_ms) / 1000) / 60) / 60);
> > +
> > +return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > +prp2);
> > +}
> Looks OK.
> > +
> > +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t 
> > buf_len,
> > +uint64_t off, NvmeRequest *req)
> > +{
> > +uint32_t trans_len;
> > +uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +NvmeFwSlotInfoLog fw_log;
> > +
> > +if (off > sizeof(fw_log)) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +memset(&fw_log, 0, sizeof(NvmeFwSlotInfoLog));
> > +
> > +trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > +
> > +return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > +prp2);
> > +}
> Looks OK
> > +
> > +static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > +uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > +uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> > +uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> > +uint8_t  lid = dw10 & 0xff;
> > +uint8_t  rae = (dw10 >> 15) & 0x1;
> > +uint32_t numdl, numdu;
> > +uint64_t off, lpol, lpou;
> > +size_t   len;
> > +
> > +numdl = (dw10 >> 16);
> > +numdu = (dw11 & 0x);
> > +lpol = dw12;
> > +lpou = dw13;
> > +
> > +len = (((numdu << 16) | numdl) + 1) << 2;
> > +off = (lpou << 32ULL) | lpol;
> > +
> > +if (off & 0x3) {
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> 
> Good. 
> Note that there are plenty of other place

Re: [PATCH v5 22/26] nvme: support multiple namespaces

2020-03-16 Thread Klaus Birkelund Jensen
On Feb 12 14:34, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> Very reasonable way to do it. 
> > 
> > Signed-off-by: Klaus Jensen 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  hw/block/Makefile.objs |   2 +-
> >  hw/block/nvme-ns.c | 158 +++
> >  hw/block/nvme-ns.h |  60 +++
> >  hw/block/nvme.c| 235 +
> >  hw/block/nvme.h|  47 -
> >  hw/block/trace-events  |   6 +-
> >  6 files changed, 389 insertions(+), 119 deletions(-)
> >  create mode 100644 hw/block/nvme-ns.c
> >  create mode 100644 hw/block/nvme-ns.h
> > 
> > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> > index 28c2495a00dc..45f463462f1e 100644
> > --- a/hw/block/Makefile.objs
> > +++ b/hw/block/Makefile.objs
> > @@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
> >  common-obj-$(CONFIG_XEN) += xen-block.o
> >  common-obj-$(CONFIG_ECC) += ecc.o
> >  common-obj-$(CONFIG_ONENAND) += onenand.o
> > -common-obj-$(CONFIG_NVME_PCI) += nvme.o
> > +common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
> >  common-obj-$(CONFIG_SWIM) += swim.o
> >  
> >  obj-$(CONFIG_SH4) += tc58128.o
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > new file mode 100644
> > index ..0e5be44486f4
> > --- /dev/null
> > +++ b/hw/block/nvme-ns.c
> > @@ -0,0 +1,158 @@
> > +#include "qemu/osdep.h"
> > +#include "qemu/units.h"
> > +#include "qemu/cutils.h"
> > +#include "qemu/log.h"
> > +#include "hw/block/block.h"
> > +#include "hw/pci/msix.h"
> Do you need this include?

No, I needed hw/pci/pci.h instead :)

> > +#include "sysemu/sysemu.h"
> > +#include "sysemu/block-backend.h"
> > +#include "qapi/error.h"
> > +
> > +#include "hw/qdev-properties.h"
> > +#include "hw/qdev-core.h"
> > +
> > +#include "nvme.h"
> > +#include "nvme-ns.h"
> > +
> > +static int nvme_ns_init(NvmeNamespace *ns)
> > +{
> > +NvmeIdNs *id_ns = &ns->id_ns;
> > +
> > +id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +id_ns->nuse = id_ns->ncap = id_ns->nsze =
> > +cpu_to_le64(nvme_ns_nlbas(ns));
> Nitpick: To be honest I don't really like that chain assignment, 
> especially since it forces to wrap the line, but that is just my
> personal taste.

Fixed, and also added a comment as to why they are the same.
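
Something along these lines (the comment wording here is mine, not a
quote from the updated patch):

    /*
     * There is no thin provisioning in the device, so the namespace
     * size, capacity and utilization are always identical.
     */
    id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
    id_ns->ncap = id_ns->nsze;
    id_ns->nuse = id_ns->ncap;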

> > +
> > +return 0;
> > +}
> > +
> > +static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, NvmeIdCtrl *id,
> > +Error **errp)
> > +{
> > +uint64_t perm, shared_perm;
> > +
> > +Error *local_err = NULL;
> > +int ret;
> > +
> > +perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
> > +shared_perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
> > +BLK_PERM_GRAPH_MOD;
> > +
> > +ret = blk_set_perm(ns->blk, perm, shared_perm, &local_err);
> > +if (ret) {
> > +error_propagate_prepend(errp, local_err, "blk_set_perm: ");
> > +return ret;
> > +}
> 
> You should consider using blkconf_apply_backend_options.
> Take a look at for example virtio_blk_device_realize.
> That will give you support for read only block devices as well.

So, yeah. There is a reason for this. And I will add that as a comment,
but I will write it here for posterity.

The problem is when the nvme-ns device starts getting more than just a
single drive attached (I have patches ready that will add a "metadata"
and a "state" drive). The blkconf_ functions work on a BlockConf that
embeds a BlockBackend, so you can't have one BlockConf with multiple
BlockBackends. That is why I'm essentially copying the "good parts" of
the blkconf_apply_backend_options code here.
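
To illustrate the direction (the "metadata" and "state" properties and
their field names are hypothetical placeholders, not part of this
series), each backend ends up as its own drive property on the nvme-ns
device instead of going through a single BlockConf:

    static Property nvme_ns_props[] = {
        DEFINE_PROP_DRIVE("drive", NvmeNamespace, blk),
        DEFINE_PROP_DRIVE("metadata", NvmeNamespace, mblk), /* hypothetical */
        DEFINE_PROP_DRIVE("state", NvmeNamespace, sblk),    /* hypothetical */
        DEFINE_PROP_END_OF_LIST(),
    };
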

> 
> I personally only once grazed the area of block permissions,
> so I prefer someone from the block layer to review this as well.
> 
> > +
> > +ns->size = blk_getlength(ns->blk);
> > +if (ns->size < 0) {
> > +error_setg_errno(errp, -ns->size, "blk_getlength");
> > +return 1;
> > +}
> > +
> > +switch (n->conf.wce) {
> > +case ON_OFF_AUTO_ON:
> > +n->features.volatile_wc = 1;
> > +break;
> > +case ON_OFF_AUTO_OFF:
> > +n->features.volatile_wc = 0;
> > +case ON
