Re: [PATCH v3] hw/nvme: clean up CC register write logic

2022-06-08 Thread Łukasz Gieryk
On Tue, Jun 07, 2022 at 01:23:20PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> The SRIOV series exposed an issue with how CC register writes are
> handled and how CSTS is set in response to that. Specifically, after
> applying the SRIOV series, the controller could end up in a state with
> CC.EN set to '1' but with CSTS.RDY cleared to '0', causing drivers to
> expect CSTS.RDY to transition to '1' but timing out.
> 
> Clean this up.
> 
> Signed-off-by: Klaus Jensen 
> Reviewed-by: Lukasz Maniak 
> ---
> v3:
>   * clear intms/intmc/cc regardless of reset type
> 
>  hw/nvme/ctrl.c | 38 --
>  1 file changed, 16 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 658584d417fe..a558f5cb29c1 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -6190,10 +6190,15 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
>  
>  if (pci_is_vf(pci_dev)) {
>  sctrl = nvme_sctrl(n);
> +
>  stl_le_p(&n->bar.csts, sctrl->scs ? 0 : NVME_CSTS_FAILED);
>  } else {
>  stl_le_p(&n->bar.csts, 0);
>  }
> +
> +stl_le_p(&n->bar.intms, 0);
> +stl_le_p(&n->bar.intmc, 0);
> +stl_le_p(&n->bar.cc, 0);
>  }
>  
>  static void nvme_ctrl_shutdown(NvmeCtrl *n)
> @@ -6405,20 +6410,21 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>  nvme_irq_check(n);
>  break;
>  case NVME_REG_CC:
> +stl_le_p(&n->bar.cc, data);
> +
>  trace_pci_nvme_mmio_cfg(data & 0xffffffff);
>  
> -/* Windows first sends data, then sends enable bit */
> -if (!NVME_CC_EN(data) && !NVME_CC_EN(cc) &&
> -!NVME_CC_SHN(data) && !NVME_CC_SHN(cc))
> -{
> -cc = data;
> +if (NVME_CC_SHN(data) && !(NVME_CC_SHN(cc))) {
> +trace_pci_nvme_mmio_shutdown_set();
> +nvme_ctrl_shutdown(n);
> +csts &= ~(CSTS_SHST_MASK << CSTS_SHST_SHIFT);
> +csts |= NVME_CSTS_SHST_COMPLETE;
> +} else if (!NVME_CC_SHN(data) && NVME_CC_SHN(cc)) {
> +trace_pci_nvme_mmio_shutdown_cleared();
> +csts &= ~(CSTS_SHST_MASK << CSTS_SHST_SHIFT);
>  }
>  
>  if (NVME_CC_EN(data) && !NVME_CC_EN(cc)) {
> -cc = data;
> -
> -/* flush CC since nvme_start_ctrl() needs the value */
> -stl_le_p(&n->bar.cc, cc);
>  if (unlikely(nvme_start_ctrl(n))) {
>  trace_pci_nvme_err_startfail();
>  csts = NVME_CSTS_FAILED;
> @@ -6429,22 +6435,10 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>  } else if (!NVME_CC_EN(data) && NVME_CC_EN(cc)) {
>  trace_pci_nvme_mmio_stopped();
>  nvme_ctrl_reset(n, NVME_RESET_CONTROLLER);
> -cc = 0;
> -csts &= ~NVME_CSTS_READY;
> -}
>  
> -if (NVME_CC_SHN(data) && !(NVME_CC_SHN(cc))) {
> -trace_pci_nvme_mmio_shutdown_set();
> -nvme_ctrl_shutdown(n);
> -cc = data;
> -csts |= NVME_CSTS_SHST_COMPLETE;
> -} else if (!NVME_CC_SHN(data) && NVME_CC_SHN(cc)) {
> -trace_pci_nvme_mmio_shutdown_cleared();
> -csts &= ~NVME_CSTS_SHST_COMPLETE;
> -cc = data;
> +break;
>  }
>  
> -stl_le_p(&n->bar.cc, cc);
>  stl_le_p(&n->bar.csts, csts);
>  
>  break;
> -- 
> 2.36.1

Reviewed-by: Łukasz Gieryk 




Re: [PATCH v2] hw/nvme: clean up CC register write logic

2022-06-07 Thread Łukasz Gieryk
On Fri, Jun 03, 2022 at 10:24:51PM +0200, Klaus Jensen wrote:
> On Jun  1 15:28, Lukasz Maniak wrote:
> > On Wed, May 25, 2022 at 09:35:24AM +0200, Klaus Jensen wrote:
> > > 
> > > +stl_le_p(>bar.intms, 0);
> > > +stl_le_p(>bar.intmc, 0);
> > > +stl_le_p(>bar.cc, 0);
> > 
> > Looks fine, though it seems the NVMe spec says the above registers
> > should be cleared during each reset for VF as well.
> > 
> 
> Aren't the values of all other registers than CSTS just undefined? (NVMe
> v2.0b, Section 8.26.3)

My 2 cents –

When VF is online:
- Both Controller Reset (CR) and PCIe Function Level Reset (FLR) can be
  issued to given VF
- Both resets shall return all (except specific) Nvme registers of given
  VF to their reset values (3.7.2)

When VF is offline:
- CR cannot be issued (only CSTS is defined, writes to CC are dropped),
  so it doesn’t need an explicit if-branch
- FLR is allowed as it’s a part of the procedure to bring VF online
  (mentioned in 8.26.3)
- At least FLR shall reset the registers for VF

So I agree with the other Lukasz's suggestion. I would also clear
intms/intmc/cc for both: VF and PF reset paths, regardless of the actual
reset type.
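
To make that concrete, here is a minimal sketch of how the two trigger
paths could funnel into the shared reset routine (names follow the
series; the wrapper functions are our illustration, not code from the
patch):

/* Sketch: both triggers end up in nvme_ctrl_reset(), which - as argued
 * above - clears INTMS/INTMC/CC unconditionally on both paths. */
static void nvme_pci_function_level_reset(PCIDevice *pci_dev)
{
    NvmeCtrl *n = NVME(pci_dev);

    /* PCIe FLR: returns the VF's registers to their reset values (3.7.2) */
    nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
}

static void nvme_cc_en_cleared(NvmeCtrl *n)
{
    /* Controller Reset (CC.EN 1 -> 0): does not touch other VFs' state */
    nvme_ctrl_reset(n, NVME_RESET_CONTROLLER);
}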




Re: [PATCH v7 12/12] hw/acpi: Make the PCI hot-plug aware of SR-IOV

2022-04-04 Thread Łukasz Gieryk
On Thu, Mar 31, 2022 at 02:38:41PM +0200, Igor Mammedov wrote:
> it's unclear what's being hotplugged and unplugged, it would be better if
> you included the QEMU CLI and relevant qmp/monitor commands to reproduce it.

Qemu CLI:
-
-device pcie-root-port,slot=0,id=rp0
-device nvme-subsys,id=subsys0
-device nvme,id=nvme0,bus=rp0,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,sriov_vq_flexible=2,sriov_vi_flexible=1

Guest OS:
-
sudo nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
sudo nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0
echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset
sleep 1
echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0
sleep 2
echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind

Qemu monitor:
-
device_del nvme0
 



Re: [PATCH v5 13/15] hw/nvme: Add support for the Virtualization Management command

2022-03-09 Thread Łukasz Gieryk
On Tue, Mar 01, 2022 at 02:07:08PM +0100, Klaus Jensen wrote:
> On Feb 17 18:45, Lukasz Maniak wrote:
> > From: Łukasz Gieryk 
> > 
> > With the new command one can:
> >  - assign flexible resources (queues, interrupts) to primary and
> >secondary controllers,
> >  - toggle the online/offline state of given controller.
> > 
> 
> QEMU segfaults (or asserts depending on the wind blowing) if the SR-IOV
> enabled device is hotplugged after being configured (i.e. follow the
> docs for a simple setup and then do a `device_del ` in the
> monitor). I suspect this is related to freeing the queues and something
> getting double-freed.
> 

I’ve finally found some time to look at the issue.

Long story short: the hot-plug mechanism deletes all VFs without the PF
knowing, then PF tries to reset and delete all the already non-existing
devices.

I have a solution for the problem, but there’s a high chance it’s not
the correct one. I’m still reading through the specs, as my knowledge in
the area of hot-plug/ACPI is quite limited.

Soon we will release the next patch set, with the fix included. I hope
the ACPI maintainers will chime in then. Till that happens, this is the
summary of my findings:

1) The current SR-IOV implementation assumes it’s the PF that creates
   and deletes VFs.
2) It’s a design decision (the Nvme device at least) for the VFs to be
   of the same class as PF. Effectively, they share the dc->hotpluggable
   value.
3) When a VF is created, it’s added as a child node to PF’s PCI bus
   slot.
4) Monitor/device_del triggers the ACPI mechanism. The implementation is
   not aware of SR/IOV and ejects PF’s PCI slot, directly unrealizing all
   hot-pluggable (!acpi_pcihp_pc_no_hotplug) child nodes.
5) VFs are unrealized directly, and it doesn’t work well with (1).
   SR/IOV structures are not updated, so when it’s PF’s turn to be
   unrealized, it works on stale pointers to already-deleted VFs.

My proposed ‘fix’ is to make the PCI ACPI code aware of SR/IOV:


diff --git a/hw/acpi/pcihp.c b/hw/acpi/pcihp.c
index f4d706e47d..090bdb8e74 100644
--- a/hw/acpi/pcihp.c
+++ b/hw/acpi/pcihp.c
@@ -196,8 +196,12 @@ static bool acpi_pcihp_pc_no_hotplug(AcpiPciHpState *s, PCIDevice *dev)
  * ACPI doesn't allow hotplug of bridge devices.  Don't allow
  * hot-unplug of bridge devices unless they were added by hotplug
  * (and so, not described by acpi).
+ *
+ * Don't allow hot-unplug of SR-IOV Virtual Functions, as they
+ * will be removed implicitly, when Physical Function is unplugged.
  */
-return (pc->is_bridge && !dev->qdev.hotplugged) || !dc->hotpluggable;
+return (pc->is_bridge && !dev->qdev.hotplugged) || !dc->hotpluggable ||
+   pci_is_vf(dev);
 }




Re: [PATCH v3 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)

2022-01-26 Thread Łukasz Gieryk
I'm sorry for the delayed response. We (I and the other Lukasz) somehow
had hoped that Knut, the original author of this patch, would have
responded.

Let me address your questions, to the best of my knowledge.
  
> > -static pcibus_t pci_bar_address(PCIDevice *d,
> > -int reg, uint8_t type, pcibus_t size)
> > +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
> > +uint8_t type, pcibus_t size)
> > +{
> > +pcibus_t new_addr;
> > +if (!pci_is_vf(d)) {
> > +int bar = pci_bar(d, reg);
> > +if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > +new_addr = pci_get_quad(d->config + bar);
> > +} else {
> > +new_addr = pci_get_long(d->config + bar);
> > +}
> > +} else {
> > +PCIDevice *pf = d->exp.sriov_vf.pf;
> > +uint16_t sriov_cap = pf->exp.sriov_cap;
> > +int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
> > +uint16_t vf_offset = pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
> > +uint16_t vf_stride = pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
> > +uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) / vf_stride;
> > +
> > +if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > +new_addr = pci_get_quad(pf->config + bar);
> > +} else {
> > +new_addr = pci_get_long(pf->config + bar);
> > +}
> > +new_addr += vf_num * size;
> > +}
> > +if (reg != PCI_ROM_SLOT) {
> > +/* Preserve the rom enable bit */
> > +new_addr &= ~(size - 1);
> 
> This comment puzzles me. How does clearing low bits preserve
> any bits? Looks like this will clear low bits if any.
> 

I think the comment applies to (reg != PCI_ROM_SLOT), i.e., the bits are
cleared for BARs, but not for expansion ROM. I agree the placement of this
comment is slightly misleading. We will move it up and rephrase slightly.
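
For instance, the relocated comment could read as follows (our wording,
a sketch rather than the final v4 text):

if (reg != PCI_ROM_SLOT) {
    /* BAR base addresses are size-aligned, so clear the low bits.
     * The expansion ROM BAR is skipped: its bit 0 is the ROM enable
     * bit, which must be preserved. */
    new_addr &= ~(size - 1);
}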
 
> > +pcibus_t pci_bar_address(PCIDevice *d,
> > + int reg, uint8_t type, pcibus_t size)
> >  {
> >  pcibus_t new_addr, last_addr;
> > -int bar = pci_bar(d, reg);
> >  uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
> >  Object *machine = qdev_get_machine();
> >  ObjectClass *oc = object_get_class(machine);
> > @@ -1309,7 +1363,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
> >  if (!(cmd & PCI_COMMAND_IO)) {
> >  return PCI_BAR_UNMAPPED;
> >  }
> > -new_addr = pci_get_long(d->config + bar) & ~(size - 1);
> > +new_addr = pci_config_get_bar_addr(d, reg, type, size);
> >  last_addr = new_addr + size - 1;
> >  /* Check if 32 bit BAR wraps around explicitly.
> >   * TODO: make priorities correct and remove this work around.
> > @@ -1324,11 +1378,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
> >  if (!(cmd & PCI_COMMAND_MEMORY)) {
> >  return PCI_BAR_UNMAPPED;
> >  }
> > -if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > -new_addr = pci_get_quad(d->config + bar);
> > -} else {
> > -new_addr = pci_get_long(d->config + bar);
> > -}
> > +new_addr = pci_config_get_bar_addr(d, reg, type, size);
> >  /* the ROM slot has a specific enable bit */
> >  if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
> 
> And in fact here we check the low bit and handle it specially.

The code seems correct to me. The bit is preserved for the ROM case.

> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index d7d73a31e4..182a225054 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -446,6 +446,11 @@ void pcie_cap_slot_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
> >  PCIDevice *pci_dev = PCI_DEVICE(dev);
> >  uint32_t lnkcap = pci_get_long(exp_cap + PCI_EXP_LNKCAP);
> >  
> > +if(pci_is_vf(pci_dev)) {
> > +/* We don't want to change any state in hotplug_dev for SR/IOV virtual functions */
> > +return;
> > +}
> > +
> 
> Coding style violation here.  And pls document the why not the what.
> E.g. IIRC the reason is that VFs don't have an express capability,
> right?

I think the reason is that virtual functions don’t exist physically, so
they cannot be individually disconnected. Only PF should respond to
hotplug events, implicitly disconnecting (thus: destroying) all child
VFs.

Anyway, we will update this comment to state *why* and add the missing
space.
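
Something along these lines, as a sketch of the reworded comment (not
the final v4 wording):

if (pci_is_vf(pci_dev)) {
    /* SR/IOV virtual functions have no physical presence in a slot:
     * they are created and destroyed by the PF alone, so the hotplug
     * controller must not track any slot state for them. */
    return;
}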

V4 with the mentioned changes will happen really soon.




Re: [PATCH v2 00/15] hw/nvme: SR-IOV with Virtualization Enhancements

2021-12-20 Thread Łukasz Gieryk
On Mon, Dec 20, 2021 at 08:12:06AM +0100, Klaus Jensen wrote:
> On Nov 25 15:15, Łukasz Gieryk wrote:
> > On Wed, Nov 24, 2021 at 09:03:06AM +0100, Klaus Jensen wrote:
> > > Hi Lukasz,
> > > 
> > > I've been through this. I have a couple of review comments, but overall
> > > looks good for inclusion in nvme-next. Would be nice to get this in
> > > early in the cycle so it can mature there for 7.0.
> > 
> > We (I’m speaking on behalf of the other Lukasz) are really happy to
> > read that. We will do our best to make it happen.
> > 
> 
> Keith, do you have any comments on this series - otherwise I'd like to
> stage this for nvme-next come January.
> 

Please wait with the staging - we are about to release the v3.

We have a few fixes that will make the device more compliant with the
NVMe and SR-IOV specs.





Re: [PATCH v2 00/15] hw/nvme: SR-IOV with Virtualization Enhancements

2021-11-25 Thread Łukasz Gieryk
On Wed, Nov 24, 2021 at 09:03:06AM +0100, Klaus Jensen wrote:
> Hi Lukasz,
> 
> I've been through this. I have a couple of review comments, but overall
> looks good for inclusion in nvme-next. Would be nice to get this in
> early in the cycle so it can mature there for 7.0.

We (I’m speaking on behalf of the other Lukasz) are really happy to
read that. We will do our best to make it happen.

> 
> I'd like that we mark this support experimental, so we can easily do
> some changes to how parameters work since I'm not sure we completely
> agree on that yet.
> 
> By the way, in the future, please add me and Keith as CCs on the entire
> series so we get CC'ed on replies to the cover-letter ;)
> 

> > List of known gaps and nice-to-haves:
> > 
> > 1) Interaction of secondary controllers with namespaces is not 100%
> > following the spec
> > 
> > The limitation: VF has to be visible on the PCI bus first, and only then
> > such VF can have a namespace attached.
> > 
> 
> Looking at the spec I'm not even sure what the expected behavior is
> supposed to be, can you elaborate? I rebased this on latest, and with
> Hannes changes, shared namespaces will be attached by default, which
> seems to be reasonable.

An example flow:

# Release flexible resources from PF (assuming it’s /dev/nvme0)
nvme virt-mgmt -c 0 -r 0 -n 0 -a 1 /dev/nvme0
nvme virt-mgmt -c 0 -r 1 -n 0 -a 1 /dev/nvme0
echo 1 > /sys/class/nvme/nvme0/reset_controller
# Bind sane minimums to VF1 (cntlid=1) and set it online
nvme virt-mgmt -c 1 -r 0 -n 5 -a 8 /dev/nvme0
nvme virt-mgmt -c 1 -r 1 -n 5 -a 8 /dev/nvme0
nvme virt-mgmt -c 1 -a 9 /dev/nvme0
# Enable 2 VFs
echo 2 > /sys/bus/pci/devices//sriov_numvfs
# PF, VF1 and VF2 are visible on PCI
lspci | grep Non-Volatile
# The NVMe driver is bound to PF and VF1 (the only online VF)
nvme list -v
# VFs shall eventually not support Ns Management/Attachment commands,
# and namespaces should be attached to VFs (i.e., their secondary
# controllers) through the PF.
# A namespace can be attached to VF1, VF2
nvme attach-ns /dev/nvme0 -c 1 -n X
nvme attach-ns /dev/nvme0 -c 2 -n X
# According to the spec this should also succeed, but today it won’t
nvme attach-ns /dev/nvme0 -c 3 -n X

VF3’s NvmeCtrl object is not yet allocated, so today there’s nothing
for nvme_subsys_ctrl() to return for cntlid=3, besides NULL (the
current behavior) or SUBSYS_SLOT_RSVD.

The relevant use cases:
 - an admin can configure disabled VFs,
 - information about attached namespaces persists when VFs are disabled,
are not that critical, but of course it’s a discrepancy from what a
real device can handle.

In my opinion, to handle the cases correctly, information about attached
namespaces could be moved to subsystem. Could you share your thoughts
whether such approach would make sense?
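
To make the question concrete, a purely hypothetical sketch of what
keeping the attachment state in the subsystem could look like; none of
these names exist in the series:

#define NVME_MAX_CONTROLLERS 256
#define NVME_MAX_NAMESPACES  256

/* Hypothetical: attachment tracked per (cntlid, nsid) in the subsystem,
 * so it persists even while a VF's NvmeCtrl object is not allocated. */
typedef struct NvmeSubsysAttachMap {
    bool attached[NVME_MAX_CONTROLLERS][NVME_MAX_NAMESPACES + 1];
} NvmeSubsysAttachMap;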




Re: [PATCH v2 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers

2021-11-25 Thread Łukasz Gieryk
On Wed, Nov 24, 2021 at 03:26:30PM +0100, Łukasz Gieryk wrote:
> On Wed, Nov 24, 2021 at 09:04:31AM +0100, Klaus Jensen wrote:
> > On Nov 16 16:34, Łukasz Gieryk wrote:
> > > With four new properties:
> > >  - sriov_v{i,q}_flexible,
> > >  - sriov_max_v{i,q}_per_vf,
> > > one can configure the number of available flexible resources, as well as
> > > the limits. The primary and secondary controller capability structures
> > > are initialized accordingly.
> > > 
> > > Since the number of available queues (interrupts) now varies between
> > > VF/PF, BAR size calculation is also adjusted.
> > > 
> > > Signed-off-by: Łukasz Gieryk 
> > > ---
> > >  hw/nvme/ctrl.c   | 138 ---
> > >  hw/nvme/nvme.h   |   4 ++
> > >  include/block/nvme.h |   5 ++
> > >  3 files changed, 140 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > index f8f5dfe204..f589ffde59 100644
> > > --- a/hw/nvme/ctrl.c
> > > +++ b/hw/nvme/ctrl.c
> > > @@ -6358,13 +6444,40 @@ static void nvme_init_state(NvmeCtrl *n)
> > >  n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > >  n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
> > >  
> > > -list->numcntl = cpu_to_le16(n->params.sriov_max_vfs);
> > > -for (i = 0; i < n->params.sriov_max_vfs; i++) {
> > > +list->numcntl = cpu_to_le16(max_vfs);
> > > +for (i = 0; i < max_vfs; i++) {
> > >  sctrl = >sec[i];
> > >  sctrl->pcid = cpu_to_le16(n->cntlid);
> > >  }
> > >  
> > >  cap->cntlid = cpu_to_le16(n->cntlid);
> > > +cap->crt = NVME_CRT_VQ | NVME_CRT_VI;
> > > +
> > > +if (pci_is_vf(>parent_obj)) {
> > > +cap->vqprt = cpu_to_le16(1 + n->conf_ioqpairs);
> > > +} else {
> > > +cap->vqprt = cpu_to_le16(1 + n->params.max_ioqpairs -
> > > + n->params.sriov_vq_flexible);
> > > +cap->vqfrt = cpu_to_le32(n->params.sriov_vq_flexible);
> > > +cap->vqrfap = cap->vqfrt;
> > > +cap->vqgran = cpu_to_le16(NVME_VF_RES_GRANULARITY);
> > > +cap->vqfrsm = n->params.sriov_max_vq_per_vf ?
> > > +cpu_to_le16(n->params.sriov_max_vq_per_vf) :
> > > +cap->vqprt;
> > 
> > That this defaults to VQPRT doesn't seem right. It should default to
> > VQFRT. Does not make sense to report a maximum number of assignable
> > flexible resources that are bigger than the number of flexible resources
> > available.
> 
> I’ve explained in one of the v1 threads why I think using the current
> default is better than VQPRT.
> 
> What you’ve noticed is indeed an inconvenience, but it’s – at least in
> my opinion – part of the design. What matters is the current number of
> unassigned flexible resources. It may be lower than VQFRSM due to
> multiple reasons:
>  1) resources are bound to PF, 
>  2) resources are bound to other VFs,
>  3) resources simply don’t exist (not baked in silicon: VQFRT < VQFRSM).
> 
> If 1) and 2) are allowed to happen, and the user must be aware of that,
> then why 3) shouldn’t?
> 

I’ve done some more thinking, and now I’m not happy with my version, nor
the suggested VQPRT.

How about using this formula instead?

v{q,i}frsm = sriov_max_v{q,i}_per_vf ? sriov_max_v{q,i}_per_vf :
             floor(sriov_v{q,i}_flexible / sriov_max_vfs)

v{q,i}frsm would end up with values similar/proportional to those
reported by an actual SR-IOV-capable device available on the market.
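
In code, the proposed default could look like the following sketch (VQ
shown, VI analogous; vqfrt is set from sriov_vq_flexible earlier in
nvme_init_state, and C integer division already floors):

cap->vqfrsm = n->params.sriov_max_vq_per_vf ?
    cpu_to_le16(n->params.sriov_max_vq_per_vf) :
    cpu_to_le16(n->params.sriov_vq_flexible / n->params.sriov_max_vfs);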




Re: [PATCH v2 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers

2021-11-24 Thread Łukasz Gieryk
On Wed, Nov 24, 2021 at 09:04:31AM +0100, Klaus Jensen wrote:
> On Nov 16 16:34, Łukasz Gieryk wrote:
> > With four new properties:
> >  - sriov_v{i,q}_flexible,
> >  - sriov_max_v{i,q}_per_vf,
> > one can configure the number of available flexible resources, as well as
> > the limits. The primary and secondary controller capability structures
> > are initialized accordingly.
> > 
> > Since the number of available queues (interrupts) now varies between
> > VF/PF, BAR size calculation is also adjusted.
> > 
> > Signed-off-by: Łukasz Gieryk 
> > ---
> >  hw/nvme/ctrl.c   | 138 ---
> >  hw/nvme/nvme.h   |   4 ++
> >  include/block/nvme.h |   5 ++
> >  3 files changed, 140 insertions(+), 7 deletions(-)
> > 
> > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > index f8f5dfe204..f589ffde59 100644
> > --- a/hw/nvme/ctrl.c
> > +++ b/hw/nvme/ctrl.c
> > @@ -6358,13 +6444,40 @@ static void nvme_init_state(NvmeCtrl *n)
> >  n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> >  n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
> >  
> > -list->numcntl = cpu_to_le16(n->params.sriov_max_vfs);
> > -for (i = 0; i < n->params.sriov_max_vfs; i++) {
> > +list->numcntl = cpu_to_le16(max_vfs);
> > +for (i = 0; i < max_vfs; i++) {
> >  sctrl = >sec[i];
> >  sctrl->pcid = cpu_to_le16(n->cntlid);
> >  }
> >  
> >  cap->cntlid = cpu_to_le16(n->cntlid);
> > +cap->crt = NVME_CRT_VQ | NVME_CRT_VI;
> > +
> > +if (pci_is_vf(>parent_obj)) {
> > +cap->vqprt = cpu_to_le16(1 + n->conf_ioqpairs);
> > +} else {
> > +cap->vqprt = cpu_to_le16(1 + n->params.max_ioqpairs -
> > + n->params.sriov_vq_flexible);
> > +cap->vqfrt = cpu_to_le32(n->params.sriov_vq_flexible);
> > +cap->vqrfap = cap->vqfrt;
> > +cap->vqgran = cpu_to_le16(NVME_VF_RES_GRANULARITY);
> > +cap->vqfrsm = n->params.sriov_max_vq_per_vf ?
> > +cpu_to_le16(n->params.sriov_max_vq_per_vf) :
> > +cap->vqprt;
> 
> That this defaults to VQPRT doesn't seem right. It should default to
> VQFRT. Does not make sense to report a maximum number of assignable
> flexible resources that are bigger than the number of flexible resources
> available.

I’ve explained in one of the v1 threads why I think using the current
default is better than VQPRT.

What you’ve noticed is indeed an inconvenience, but it’s – at least in
my opinion – part of the design. What matters is the current number of
unassigned flexible resources. It may be lower than VQFRSM due to
multiple reasons:
 1) resources are bound to PF, 
 2) resources are bound to other VFs,
 3) resources simply don’t exist (not baked in silicon: VQFRT < VQFRSM).

If 1) and 2) are allowed to happen, and the user must be aware of that,
then why 3) shouldn’t?




Re: [PATCH v2 13/15] hw/nvme: Add support for the Virtualization Management command

2021-11-24 Thread Łukasz Gieryk
On Wed, Nov 24, 2021 at 09:06:23AM +0100, Klaus Jensen wrote:
> On Nov 16 16:34, Łukasz Gieryk wrote:
> > With the new Virtualization Management command one can:
> >  - assign flexible resources (queues, interrupts) to primary and
> >secondary controllers,
> >  - toggle the online/offline state of given controller.
> > 
> > Signed-off-by: Łukasz Gieryk 
> > ---
> >  hw/nvme/ctrl.c   | 204 +++
> >  hw/nvme/nvme.h   |  16 
> >  hw/nvme/trace-events |   3 +
> >  include/block/nvme.h |  17 
> >  4 files changed, 240 insertions(+)
> > 
> > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > index f589ffde59..9d0432a2e5 100644
> > --- a/hw/nvme/ctrl.c
> > +++ b/hw/nvme/ctrl.c
> 
> [... snip]
> 
> > +static uint16_t nvme_assign_virt_res_to_sec(NvmeCtrl *n, NvmeRequest *req,
> > +uint16_t cntlid, uint8_t rt, 
> > int nr)
> > +{
> > +int limit = rt ? n->params.sriov_max_vi_per_vf :
> > + n->params.sriov_max_vq_per_vf;
> 
> If these parameters are left at the default, limit is 0 and the check
> below fails.
> 
> [... snip]
> 
> > +if (nr > limit) {
> > +return NVME_INVALID_NUM_RESOURCES | NVME_DNR;
> > +}

Indeed, my bad.

All the tests I have at hand set the parameters explicitly, so the
problem has slipped through. I’ve manually tested only the negative
scenarios, where Qemu refuses to start.
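
One possible fix - an assumption on our side, not the released patch -
is to take the limit from the already-initialized capability structure,
where the 0 defaults have been resolved:

/* Sketch: vqfrsm/vifrsm already encode the effective per-VF maximum */
int limit = le16_to_cpu(rt ? n->pri_ctrl_cap.vifrsm
                           : n->pri_ctrl_cap.vqfrsm);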




Re: [PATCH v2 08/15] hw/nvme: Implement the Function Level Reset

2021-11-17 Thread Łukasz Gieryk
On Tue, Nov 16, 2021 at 01:28:19PM -0800, Keith Busch wrote:
> On Tue, Nov 16, 2021 at 04:34:39PM +0100, Łukasz Gieryk wrote:
> >  if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
> > -pcie_sriov_pf_disable_vfs(&n->parent_obj);
> > +if (rst != NVME_RESET_CONTROLLER) {
> > +pcie_sriov_pf_disable_vfs(&n->parent_obj);
> 
> Shouldn't this be 'if (rst == NVME_RESET_FUNCTION)'?

The NVMe Spec lists five possible reset types (triggers). According
to my understanding, only the Controller Reset doesn’t affect the VFs'
state, hence the '!='.
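
For reference, the series models the triggers with a two-value enum
(sketched below); every trigger other than a Controller Reset takes the
function-level path, which is why the '!=' covers them all:

typedef enum NvmeResetType {
    NVME_RESET_FUNCTION   = 0, /* FLR and the other function-wide triggers */
    NVME_RESET_CONTROLLER = 1, /* CC.EN transitioning from 1 to 0 */
} NvmeResetType;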




[PATCH v2 11/15] hw/nvme: Calculate BAR attributes in a function

2021-11-16 Thread Łukasz Gieryk
An NVMe device with SR-IOV capability calculates the BAR size
differently for PF and VF, so it makes sense to extract the common code
to a separate function.

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c | 45 +++--
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 2f3b6b5a82..f8f5dfe204 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6409,6 +6409,34 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
 memory_region_set_enabled(&n->pmr.dev->mr, false);
 }
 
+static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
+  unsigned *msix_table_offset,
+  unsigned *msix_pba_offset)
+{
+uint64_t bar_size, msix_table_size, msix_pba_size;
+
+bar_size = sizeof(NvmeBar) + 2 * total_queues * NVME_DB_SIZE;
+bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
+
+if (msix_table_offset) {
+*msix_table_offset = bar_size;
+}
+
+msix_table_size = PCI_MSIX_ENTRY_SIZE * total_irqs;
+bar_size += msix_table_size;
+bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
+
+if (msix_pba_offset) {
+*msix_pba_offset = bar_size;
+}
+
+msix_pba_size = QEMU_ALIGN_UP(total_irqs, 64) / 8;
+bar_size += msix_pba_size;
+
+bar_size = pow2ceil(bar_size);
+return bar_size;
+}
+
 static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
 uint64_t bar_size)
 {
@@ -6448,7 +6476,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
 uint8_t *pci_conf = pci_dev->config;
-uint64_t bar_size, msix_table_size, msix_pba_size;
+uint64_t bar_size;
 unsigned msix_table_offset, msix_pba_offset;
 int ret;
 
@@ -6474,19 +6502,8 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 }
 
 /* add one to max_ioqpairs to account for the admin queue pair */
-bar_size = sizeof(NvmeBar) +
-   2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE;
-bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
-msix_table_offset = bar_size;
-msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize;
-
-bar_size += msix_table_size;
-bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
-msix_pba_offset = bar_size;
-msix_pba_size = QEMU_ALIGN_UP(n->params.msix_qsize, 64) / 8;
-
-bar_size += msix_pba_size;
-bar_size = pow2ceil(bar_size);
+bar_size = nvme_bar_size(n->params.max_ioqpairs + 1, n->params.msix_qsize,
+ &msix_table_offset, &msix_pba_offset);
 
 memory_region_init(&n->bar0, OBJECT(n), "nvme-bar0", bar_size);
 memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
-- 
2.25.1




[PATCH v2 07/15] hw/nvme: Add support for Secondary Controller List

2021-11-16 Thread Łukasz Gieryk
From: Lukasz Maniak 

Introduce handling for Secondary Controller List (Identify command with
CNS value of 15h).

Secondary controller ids are unique in the subsystem, hence they are
reserved by it upon initialization of the primary controller to the
number of sriov_max_vfs.

ID reservation requires the addition of an intermediate controller slot
state, so the reserved controller has the address 0xFFFF.
A secondary controller is in the reserved state when it has no virtual
function assigned, but its primary controller is realized.
Secondary controller reservations are released to NULL when its primary
controller is unregistered.
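
A sketch of the resulting slot-state convention (the marker value
mirrors subsys.c in this series; treat it as illustrative):

#define SUBSYS_SLOT_RSVD ((void *)0xFFFF)

/* subsys->ctrls[cntlid] is one of:
 *   NULL             - the slot is free
 *   SUBSYS_SLOT_RSVD - reserved for a secondary controller whose VF is
 *                      not (yet) realized
 *   anything else    - a pointer to the realized NvmeCtrl
 */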

Signed-off-by: Lukasz Maniak 
---
 hw/nvme/ctrl.c   | 56 +
 hw/nvme/ns.c |  2 +-
 hw/nvme/nvme.h   | 18 +++
 hw/nvme/subsys.c | 74 ++--
 hw/nvme/trace-events |  1 +
 include/block/nvme.h | 20 
 6 files changed, 161 insertions(+), 10 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 44734a74f9..961161ba8e 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -161,6 +161,7 @@
 #include "qemu/error-report.h"
 #include "qemu/log.h"
 #include "qemu/units.h"
+#include "qemu/range.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "sysemu/sysemu.h"
@@ -4545,6 +4546,14 @@ static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, NvmeRequest *req)
 return nvme_c2h(n, (uint8_t *)&n->pri_ctrl_cap, sizeof(NvmePriCtrlCap), req);
 }
 
+static uint16_t nvme_identify_sec_ctrl_list(NvmeCtrl *n, NvmeRequest *req)
+{
+trace_pci_nvme_identify_sec_ctrl_list(le16_to_cpu(n->pri_ctrl_cap.cntlid),
+  n->sec_ctrl_list.numcntl);
+
+return nvme_c2h(n, (uint8_t *)&n->sec_ctrl_list, sizeof(NvmeSecCtrlList), req);
+}
+
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
  bool active)
 {
@@ -4765,6 +4774,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
 return nvme_identify_ctrl_list(n, req, false);
 case NVME_ID_CNS_PRIMARY_CTRL_CAP:
 return nvme_identify_pri_ctrl_cap(n, req);
+case NVME_ID_CNS_SECONDARY_CTRL_LIST:
+return nvme_identify_sec_ctrl_list(n, req);
 case NVME_ID_CNS_CS_NS:
 return nvme_identify_ns_csi(n, req, true);
 case NVME_ID_CNS_CS_NS_PRESENT:
@@ -6316,6 +6327,9 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 static void nvme_init_state(NvmeCtrl *n)
 {
 NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
+NvmeSecCtrlList *list = &n->sec_ctrl_list;
+NvmeSecCtrlEntry *sctrl;
+int i;
 
 /* add one to max_ioqpairs to account for the admin queue pair */
 n->reg_size = pow2ceil(sizeof(NvmeBar) +
@@ -6327,6 +6341,12 @@ static void nvme_init_state(NvmeCtrl *n)
 n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
 n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
 
+list->numcntl = cpu_to_le16(n->params.sriov_max_vfs);
+for (i = 0; i < n->params.sriov_max_vfs; i++) {
+sctrl = &list->sec[i];
+sctrl->pcid = cpu_to_le16(n->cntlid);
+}
+
 cap->cntlid = cpu_to_le16(n->cntlid);
 }
 
@@ -6755,6 +6775,41 @@ static void nvme_set_smart_warning(Object *obj, Visitor *v, const char *name,
 }
 }
 
+static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
+  uint32_t val, int len)
+{
+NvmeCtrl *n = NVME(dev);
+NvmeSecCtrlEntry *sctrl;
+uint16_t sriov_cap = dev->exp.sriov_cap;
+uint32_t off = address - sriov_cap;
+int i, num_vfs;
+
+if (!sriov_cap) {
+return;
+}
+
+if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
+num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
+
+for (i = 0; i < num_vfs; i++) {
+sctrl = &n->sec_ctrl_list.sec[i];
+
+if (val & PCI_SRIOV_CTRL_VFE) {
+sctrl->vfn = cpu_to_le16(i + 1);
+} else {
+sctrl->vfn = 0;
+}
+}
+}
+}
+
+static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
+  uint32_t val, int len)
+{
+nvme_sriov_pre_write_ctrl(dev, address, val, len);
+pci_default_write_config(dev, address, val, len);
+}
+
 static const VMStateDescription nvme_vmstate = {
 .name = "nvme",
 .unmigratable = 1,
@@ -6766,6 +6821,7 @@ static void nvme_class_init(ObjectClass *oc, void *data)
 PCIDeviceClass *pc = PCI_DEVICE_CLASS(oc);
 
 pc->realize = nvme_realize;
+pc->config_write = nvme_pci_write_config;
 pc->exit = nvme_exit;
 pc->class_id = PCI_CLASS_STORAGE_EXPRESS;
 pc->revision = 2;
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index b7cf1494e7..c70aed8c66 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -517,7 +517,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) {
  

[PATCH v2 15/15] hw/nvme: Update the initialization place for the AER queue

2021-11-16 Thread Łukasz Gieryk
This patch updates the initialization place for the AER queue, so it’s
initialized once, at controller initialization, and not every time the
controller is enabled.

While the original version works for a non-SR-IOV device, as it’s hard
to interact with the controller if it’s not enabled, the multiple
reinitialization is not necessarily correct.

With the SR/IOV feature enabled a segfault can happen: a VF can have its
controller disabled, while a namespace can still be attached to the
controller through the parent PF. An event generated in such case ends
up on an uninitialized queue.

While it’s an interesting question whether a VF should support AER in
the first place, I don’t think it must be answered today.
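
A sketch of the failing sequence described above (step comments only;
the function names follow the existing code):

/* 1. The VF controller is disabled, so nvme_start_ctrl() - and with it
 *    the old QTAILQ_INIT(&n->aer_queue) - never runs for it.
 * 2. A namespace is attached through the parent PF and an event is
 *    raised for the VF via nvme_enqueue_event().
 * 3. The event lands on the uninitialized aer_queue - segfault.
 * Initializing the queue once, in nvme_init_ctrl(), avoids this. */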

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 9d0432a2e5..7d41318961 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5980,8 +5980,6 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 
 nvme_set_timestamp(n, 0ULL);
 
-QTAILQ_INIT(&n->aer_queue);
-
 nvme_select_iocs(n);
 
 return 0;
@@ -6960,6 +6958,8 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 id->cmic |= NVME_CMIC_MULTI_CTRL;
 }
 
+QTAILQ_INIT(&n->aer_queue);
+
 NVME_CAP_SET_MQES(cap, 0x7ff);
 NVME_CAP_SET_CQR(cap, 1);
 NVME_CAP_SET_TO(cap, 0xf);
-- 
2.25.1




[PATCH v2 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers

2021-11-16 Thread Łukasz Gieryk
With four new properties:
 - sriov_v{i,q}_flexible,
 - sriov_max_v{i,q}_per_vf,
one can configure the number of available flexible resources, as well as
the limits. The primary and secondary controller capability structures
are initialized accordingly.

Since the number of available queues (interrupts) now varies between
VF/PF, BAR size calculation is also adjusted.

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c   | 138 ---
 hw/nvme/nvme.h   |   4 ++
 include/block/nvme.h |   5 ++
 3 files changed, 140 insertions(+), 7 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f8f5dfe204..f589ffde59 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -36,6 +36,10 @@
  *  zoned.zasl=, \
  *  zoned.auto_transition=, \
  *  sriov_max_vfs= \
+ *  sriov_vq_flexible= \
+ *  sriov_vi_flexible= \
+ *  sriov_max_vi_per_vf= \
+ *  sriov_max_vq_per_vf= \
  *  subsys=
  *  -device nvme-ns,drive=,bus=,nsid=,\
  *  zoned=, \
@@ -113,6 +117,26 @@
  *   enables reporting of both SR-IOV and ARI capabilities by the NVMe device.
  *   Virtual function controllers will not report SR-IOV capability.
  *
+ * - `sriov_vq_flexible`
+ *   Indicates the total number of flexible queue resources assignable to all
+ *   the secondary controllers. Implicitly sets the number of PF-private
+ *   resources to (max_ioqpairs - sriov_vq_flexible).
+ *
+ * - `sriov_vi_flexible`
+ *   Indicates the total number of flexible interrupt resources assignable to
+ *   all the secondary controllers. Implicitly sets the number of PF-private
+ *   resources to (msix_qsize - sriov_vi_flexible).
+ *
+ * - `sriov_max_vi_per_vf`
+ *   Indicates the maximum number of virtual interrupt resources assignable
+ *   to a secondary controller. The default 0 resolves to the number of private
+ *   interrupt resources configured for PF.
+ *
+ * - `sriov_max_vq_per_vf`
+ *   Indicates the maximum number of virtual queue resources assignable to
+ *   a secondary controller. The default 0 resolves to the number of private
+ *   queue resources configured for PF.
+ *
  * nvme namespace device parameters
  * 
  * - `shared`
@@ -185,6 +209,7 @@
 #define NVME_NUM_FW_SLOTS 1
 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
 #define NVME_MAX_VFS 127
+#define NVME_VF_RES_GRANULARITY 1
 #define NVME_VF_OFFSET 0x1
 #define NVME_VF_STRIDE 1
 
@@ -6338,6 +6363,58 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 error_setg(errp, "PMR is not supported with SR-IOV");
 return;
 }
+
+if (!params->sriov_vq_flexible || !params->sriov_vi_flexible) {
+error_setg(errp, "both sriov_vq_flexible and sriov_vi_flexible"
+   " must be set for the use of SR-IOV");
+return;
+}
+
+if (params->sriov_vq_flexible < params->sriov_max_vfs * 2) {
+error_setg(errp, "sriov_vq_flexible must be greater than or equal"
+   " to %d (sriov_max_vfs * 2) and be a multiple of %d"
+   " (sriov_max_vfs)", params->sriov_max_vfs * 2,
+   params->sriov_max_vfs);
+return;
+}
+
+if (params->max_ioqpairs < params->sriov_vq_flexible + 2) {
+error_setg(errp, "sriov_vq_flexible - max_ioqpairs (PF-private"
+   " queue resources) must be greater than or equal to 2");
+return;
+}
+
+if (params->sriov_vi_flexible < params->sriov_max_vfs) {
+error_setg(errp, "sriov_vi_flexible must be greater than or equal"
+   " to %d (sriov_max_vfs) and be a multiple of %d"
+   " (sriov_max_vfs)", params->sriov_max_vfs,
+   params->sriov_max_vfs);
+return;
+}
+
+if (params->msix_qsize < params->sriov_vi_flexible + 1) {
+error_setg(errp, "sriov_vi_flexible - msix_qsize (PF-private"
+   " interrupt resources) must be greater than or equal"
+   " to 1");
+return;
+}
+
+if (params->sriov_max_vi_per_vf &&
+(params->sriov_max_vi_per_vf - 1) % NVME_VF_RES_GRANULARITY) {
+error_setg(errp, "sriov_max_vi_per_vf must meet:"
+   " (X - 1) %% %d == 0 and X >= 1",
+   NVME_VF_RES_GRANULARITY);
+return;
+}
+
+if (params->sriov_max_vq_per_vf &&
+(params->sriov_max_vq_per_vf < 2 ||
+ (params->sriov_max_vq_per_vf - 1) % NVME_VF_RES_GRANULARITY)) {

[PATCH v2 13/15] hw/nvme: Add support for the Virtualization Management command

2021-11-16 Thread Łukasz Gieryk
With the new Virtualization Management command one can:
 - assign flexible resources (queues, interrupts) to primary and
   secondary controllers,
 - toggle the online/offline state of given controller.

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c   | 204 +++
 hw/nvme/nvme.h   |  16 
 hw/nvme/trace-events |   3 +
 include/block/nvme.h |  17 
 4 files changed, 240 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f589ffde59..9d0432a2e5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -257,6 +257,7 @@ static const uint32_t nvme_cse_acs[256] = {
 [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
+[NVME_ADM_CMD_VIRT_MNGMT]   = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_FORMAT_NVM]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 };
 
@@ -288,6 +289,7 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
 };
 
 static void nvme_process_sq(void *opaque);
+static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst);
 
 static uint16_t nvme_sqid(NvmeRequest *req)
 {
@@ -5517,6 +5519,167 @@ out:
 return status;
 }
 
+static void nvme_get_virt_res_num(NvmeCtrl *n, uint8_t rt, int *num_total,
+  int *num_prim, int *num_sec)
+{
+*num_total = le32_to_cpu(rt ? n->pri_ctrl_cap.vifrt : n->pri_ctrl_cap.vqfrt);
+*num_prim = le16_to_cpu(rt ? n->pri_ctrl_cap.virfap : n->pri_ctrl_cap.vqrfap);
+*num_sec = le16_to_cpu(rt ? n->pri_ctrl_cap.virfa : n->pri_ctrl_cap.vqrfa);
+}
+
+static uint16_t nvme_assign_virt_res_to_prim(NvmeCtrl *n, NvmeRequest *req,
+ uint16_t cntlid, uint8_t rt, int nr)
+{
+int num_total, num_prim, num_sec;
+
+if (cntlid != n->cntlid) {
+return NVME_INVALID_CTRL_ID | NVME_DNR;
+}
+
+nvme_get_virt_res_num(n, rt, &num_total, &num_prim, &num_sec);
+
+if (nr > num_total) {
+return NVME_INVALID_NUM_RESOURCES | NVME_DNR;
+}
+
+if (nr > num_total - num_sec) {
+return NVME_INVALID_RESOURCE_ID | NVME_DNR;
+}
+
+if (rt) {
+n->pri_ctrl_cap.virfap = cpu_to_le16(nr);
+} else {
+n->pri_ctrl_cap.vqrfap = cpu_to_le16(nr);
+}
+
+req->cqe.result = cpu_to_le32(nr);
+return req->status;
+}
+
+static void nvme_update_virt_res(NvmeCtrl *n, NvmeSecCtrlEntry *sctrl,
+ uint8_t rt, int nr)
+{
+int prev_nr, prev_total;
+
+if (rt) {
+prev_nr = le16_to_cpu(sctrl->nvi);
+prev_total = le32_to_cpu(n->pri_ctrl_cap.virfa);
+sctrl->nvi = cpu_to_le16(nr);
+n->pri_ctrl_cap.virfa = cpu_to_le32(prev_total + nr - prev_nr);
+} else {
+prev_nr = le16_to_cpu(sctrl->nvq);
+prev_total = le32_to_cpu(n->pri_ctrl_cap.vqrfa);
+sctrl->nvq = cpu_to_le16(nr);
+n->pri_ctrl_cap.vqrfa = cpu_to_le32(prev_total + nr - prev_nr);
+}
+}
+
+static uint16_t nvme_assign_virt_res_to_sec(NvmeCtrl *n, NvmeRequest *req,
+uint16_t cntlid, uint8_t rt, int nr)
+{
+int limit = rt ? n->params.sriov_max_vi_per_vf :
+ n->params.sriov_max_vq_per_vf;
+int num_total, num_prim, num_sec, num_free, diff;
+NvmeSecCtrlEntry *sctrl;
+
+sctrl = nvme_sctrl_for_cntlid(n, cntlid);
+if (!sctrl) {
+return NVME_INVALID_CTRL_ID | NVME_DNR;
+}
+
+if (sctrl->scs) {
+return NVME_INVALID_SEC_CTRL_STATE | NVME_DNR;
+}
+
+if (nr > limit) {
+return NVME_INVALID_NUM_RESOURCES | NVME_DNR;
+}
+
+nvme_get_virt_res_num(n, rt, &num_total, &num_prim, &num_sec);
+num_free = num_total - num_prim - num_sec;
+diff = nr - le16_to_cpu(rt ? sctrl->nvi : sctrl->nvq);
+
+if (diff > num_free) {
+return NVME_INVALID_RESOURCE_ID | NVME_DNR;
+}
+
+nvme_update_virt_res(n, sctrl, rt, nr);
+req->cqe.result = cpu_to_le32(nr);
+
+return req->status;
+}
+
+static uint16_t nvme_virt_set_state(NvmeCtrl *n, uint16_t cntlid, bool online)
+{
+NvmeCtrl *sn = NULL;
+NvmeSecCtrlEntry *sctrl;
+
+sctrl = nvme_sctrl_for_cntlid(n, cntlid);
+if (!sctrl) {
+return NVME_INVALID_CTRL_ID | NVME_DNR;
+}
+
+if (sctrl->vfn) {
+sn = NVME(pcie_sriov_get_vf_at_index(&n->parent_obj,
+ le16_to_cpu(sctrl->vfn) - 1));
+}
+
+if (online) {
+if (!NVME_CC_EN(ldl_le_p(&sn->bar.cc)) || !sctrl->nvi ||
+(le16_to_cpu(sctrl->nvq) < 2)) {
+return NVME_INVALID_SEC_CTRL_STATE | NVME_DNR;
+}
+
+if (!sctrl->scs) {
+sctrl->scs = 0x1;
+if (sn) {
+nvme_ctrl_reset(sn, NVME_RESET_CONTROLLER);
+}
+}
+} else {

[PATCH v2 02/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt

2021-11-16 Thread Łukasz Gieryk
From: Knut Omang 

Add a small intro + minimal documentation for how to
implement SR/IOV support for an emulated device.

Signed-off-by: Knut Omang 
---
 docs/pcie_sriov.txt | 115 
 1 file changed, 115 insertions(+)
 create mode 100644 docs/pcie_sriov.txt

diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
new file mode 100644
index 00..f5e891e1d4
--- /dev/null
+++ b/docs/pcie_sriov.txt
@@ -0,0 +1,115 @@
+PCI SR/IOV EMULATION SUPPORT
+
+
+Description
+===
+SR/IOV (Single Root I/O Virtualization) is an optional extended capability
+of a PCI Express device. It allows a single physical function (PF) to appear as multiple
+virtual functions (VFs) for the main purpose of eliminating software
+overhead in I/O from virtual machines.
+
+Qemu now implements the basic common functionality to enable an emulated device
+to support SR/IOV. Yet no fully implemented device exists in Qemu, but a
+proof-of-concept hack of the Intel igb can be found here:
+
+git://github.com/knuto/qemu.git sriov_patches_v5
+
+Implementation
+==
+Implementing emulation of an SR/IOV capable device typically consists of
+implementing support for two types of device classes; the "normal" physical
+device (PF) and the virtual device (VF). From Qemu's perspective, the VFs are just
+like other devices, except that some of their properties are derived from
+the PF.
+
+A virtual function is different from a physical function in that the BAR
+space for all VFs are defined by the BAR registers in the PFs SR/IOV
+capability. All VFs have the same BARs and BAR sizes.
+
+Accesses to these virtual BARs are then computed as
+
+   <VF BAR start> + <VF number> * <BAR size> + <offset>
+
+From our emulation perspective this means that there is a separate call for
+setting up a BAR for a VF.
+
+1) To enable SR/IOV support in the PF, it must be a PCI Express device so
+   you would need to add a PCI Express capability in the normal PCI
+   capability list. You might also want to add an ARI (Alternative
+   Routing-ID Interpretation) capability to indicate that your device
+   supports functions beyond its "own" function space (0-7),
+   which is necessary to support more than 7 functions, or
+   if functions extends beyond offset 7 because they are placed at an
+   offset > 1 or have stride > 1.
+
+   ...
+   #include "hw/pci/pcie.h"
+   #include "hw/pci/pcie_sriov.h"
+
+   pci_your_pf_dev_realize( ... )
+   {
+  ...
+  int ret = pcie_endpoint_cap_init(d, 0x70);
+  ...
+  pcie_ari_init(d, 0x100, 1);
+  ...
+
+  /* Add and initialize the SR/IOV capability */
+  pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
+   vf_devid, initial_vfs, total_vfs,
+   fun_offset, stride);
+
+  /* Set up individual VF BARs (parameters as for normal BARs) */
+  pcie_sriov_pf_init_vf_bar( ... )
+  ...
+   }
+
+   For cleanup, you simply call:
+
+  pcie_sriov_pf_exit(device);
+
+   which will delete all the virtual functions and associated resources.
+
+2) Similarly in the implementation of the virtual function, you need to
+   make it a PCI Express device and add a similar set of capabilities
+   except for the SR/IOV capability. Then you need to set up the VF BARs as
+   subregions of the PFs SR/IOV VF BARs by calling
+   pcie_sriov_vf_register_bar() instead of the normal pci_register_bar() call:
+
+   pci_your_vf_dev_realize( ... )
+   {
+  ...
+  int ret = pcie_endpoint_cap_init(d, 0x60);
+  ...
+  pcie_ari_init(d, 0x100, 1);
+  ...
+  memory_region_init(mr, ... )
+  pcie_sriov_vf_register_bar(d, bar_nr, mr);
+  ...
+   }
+
+Testing on Linux guest
+==
+The easiest is if your device driver supports sysfs based SR/IOV
+enabling. Support for this was added in kernel v.3.8, so not all drivers
+support it yet.
+
+To enable 4 VFs for a device at 01:00.0:
+
+   modprobe yourdriver
+   echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+You should now see 4 VFs with lspci.
+To turn SR/IOV off again - the standard requires you to turn it off before you can enable
+another VF count, and the emulation enforces this:
+
+   echo 0 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+Older drivers typically provide a max_vfs module parameter
+to enable it at load time:
+
+   modprobe yourdriver max_vfs=4
+
+To disable the VFs again then, you simply have to unload the driver:
+
+   rmmod yourdriver
-- 
2.25.1




[PATCH v2 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements

2021-11-16 Thread Łukasz Gieryk
From: Lukasz Maniak 

Signed-off-by: Lukasz Maniak 
---
 docs/system/devices/nvme.rst | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index bff72d1c24..29fe2565b5 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -235,3 +235,38 @@ The virtual namespace device supports DIF- and DIX-based protection information
   to ``1`` to transfer protection information as the first eight bytes of
   metadata. Otherwise, the protection information is transferred as the last
   eight bytes.
+
+Virtualization Enhancements and SR-IOV
+---------------------------------------
+
+The ``nvme`` device supports Single Root I/O Virtualization and Sharing
+along with Virtualization Enhancements. The controller has to be linked to
+an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.
+
+A number of parameters are present:
+
+``sriov_max_vfs`` (default: ``0``)
+  Indicates the maximum number of PCIe virtual functions supported
+  by the controller. Specifying a non-zero value enables reporting of both
+  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
+  by the NVMe device. Virtual function controllers will not report SR-IOV.
+
+``sriov_vq_flexible``
+  Indicates the total number of flexible queue resources assignable to all
+  the secondary controllers. Implicitly sets the number of PF-private
+  resources to (max_ioqpairs - sriov_vq_flexible).
+
+``sriov_vi_flexible``
+  Indicates the total number of flexible interrupt resources assignable to
+  all the secondary controllers. Implicitly sets the number of PF-private
+  resources to (msix_qsize - sriov_vi_flexible).
+
+``sriov_max_vi_per_vf``
+  Indicates the maximum number of virtual interrupt resources assignable
+  to a secondary controller. The default 0 resolves to the number of private
+  interrupt resources configured for PF.
+
+``sriov_max_vq_per_vf``
+  Indicates the maximum number of virtual queue resources assignable to
+  a secondary controller. The default 0 resolves to the number of private
+  queue resources configured for PF.
-- 
2.25.1




[PATCH v2 10/15] hw/nvme: Remove reg_size variable and update BAR0 size calculation

2021-11-16 Thread Łukasz Gieryk
The n->reg_size parameter unnecessarily splits the BAR0 size calculation
into two phases; it is removed to simplify the code.

With all the calculations done in one place, it seems the pow2ceil,
applied originally to reg_size, is unnecessary. The rounding should
happen as the last step, when BAR size includes Nvme registers, queue
registers, and MSIX-related space.

Finally, the size of the mmio memory region is extended to cover the 1st
4KiB padding (see the map below). Access to this range is handled as
interaction with a non-existing queue and generates an error trace, so
actually nothing changes, while the reg_size variable is no longer needed.


--------------------
|  BAR0            |
--------------------

[Nvme Registers]
[Queues]
[power-of-2 padding] - removed in this patch
[4KiB padding (1)  ]
[MSIX TABLE]
[4KiB padding (2)  ]
[MSIX PBA  ]
[power-of-2 padding]

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c | 10 +-
 hw/nvme/nvme.h |  1 -
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index d502917d98..2f3b6b5a82 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6351,9 +6351,6 @@ static void nvme_init_state(NvmeCtrl *n)
 n->conf_ioqpairs = n->params.max_ioqpairs;
 n->conf_msix_qsize = n->params.msix_qsize;
 
-/* add one to max_ioqpairs to account for the admin queue pair */
-n->reg_size = pow2ceil(sizeof(NvmeBar) +
-   2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
 n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
 n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
 n->temperature = NVME_TEMPERATURE;
@@ -6476,7 +6473,10 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 pcie_ari_init(pci_dev, 0x100, 1);
 }
 
-bar_size = QEMU_ALIGN_UP(n->reg_size, 4 * KiB);
+/* add one to max_ioqpairs to account for the admin queue pair */
+bar_size = sizeof(NvmeBar) +
+   2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE;
+bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
 msix_table_offset = bar_size;
 msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize;
 
@@ -6490,7 +6490,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 
 memory_region_init(&n->bar0, OBJECT(n), "nvme-bar0", bar_size);
 memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
-  n->reg_size);
+  msix_table_offset);
 memory_region_add_subregion(&n->bar0, 0, &n->iomem);
 
 if (pci_is_vf(pci_dev)) {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 927890b490..1401ac3904 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -414,7 +414,6 @@ typedef struct NvmeCtrl {
 uint16_tmax_prp_ents;
 uint16_tcqe_size;
 uint16_tsqe_size;
-uint32_treg_size;
 uint32_tmax_q_ents;
 uint8_t outstanding_aers;
 uint32_tirq_status;
-- 
2.25.1




[PATCH v2 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)

2021-11-16 Thread Łukasz Gieryk
From: Knut Omang 

This patch provides the building blocks for creating an SR/IOV
PCIe Extended Capability header and register/unregister
SR/IOV Virtual Functions.

Signed-off-by: Knut Omang 
---
 hw/pci/meson.build  |   1 +
 hw/pci/pci.c|  97 +---
 hw/pci/pcie.c   |   5 +
 hw/pci/pcie_sriov.c | 287 
 hw/pci/trace-events |   5 +
 include/hw/pci/pci.h|  12 +-
 include/hw/pci/pcie.h   |   6 +
 include/hw/pci/pcie_sriov.h |  67 +
 include/qemu/typedefs.h |   2 +
 9 files changed, 456 insertions(+), 26 deletions(-)
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

diff --git a/hw/pci/meson.build b/hw/pci/meson.build
index 5c4bbac817..bcc9c75919 100644
--- a/hw/pci/meson.build
+++ b/hw/pci/meson.build
@@ -5,6 +5,7 @@ pci_ss.add(files(
   'pci.c',
   'pci_bridge.c',
   'pci_host.c',
+  'pcie_sriov.c',
   'shpc.c',
   'slotid_cap.c'
 ))
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e5993c1ef5..1892a7e74c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -239,6 +239,9 @@ int pci_bar(PCIDevice *d, int reg)
 {
 uint8_t type;
 
+/* PCIe virtual functions do not have their own BARs */
+assert(!pci_is_vf(d));
+
 if (reg != PCI_ROM_SLOT)
 return PCI_BASE_ADDRESS_0 + reg * 4;
 
@@ -304,10 +307,30 @@ void pci_device_deassert_intx(PCIDevice *dev)
 }
 }
 
-static void pci_do_device_reset(PCIDevice *dev)
+static void pci_reset_regions(PCIDevice *dev)
 {
 int r;
+if (pci_is_vf(dev)) {
+return;
+}
+
+for (r = 0; r < PCI_NUM_REGIONS; ++r) {
+PCIIORegion *region = &dev->io_regions[r];
+if (!region->size) {
+continue;
+}
+
+if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
+region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+pci_set_quad(dev->config + pci_bar(dev, r), region->type);
+} else {
+pci_set_long(dev->config + pci_bar(dev, r), region->type);
+}
+}
+}
 
+static void pci_do_device_reset(PCIDevice *dev)
+{
 pci_device_deassert_intx(dev);
 assert(dev->irq_state == 0);
 
@@ -323,19 +346,7 @@ static void pci_do_device_reset(PCIDevice *dev)
   pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
   pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE));
 dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
-for (r = 0; r < PCI_NUM_REGIONS; ++r) {
-PCIIORegion *region = &dev->io_regions[r];
-if (!region->size) {
-continue;
-}
-
-if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
-region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
-pci_set_quad(dev->config + pci_bar(dev, r), region->type);
-} else {
-pci_set_long(dev->config + pci_bar(dev, r), region->type);
-}
-}
+pci_reset_regions(dev);
 pci_update_mappings(dev);
 
 msi_reset(dev);
@@ -884,6 +895,15 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp)
 dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
 }
 
+/* With SR/IOV and ARI, a device at function 0 need not be a multifunction
+ * device, as it may just be a VF that ended up with function 0 in
+ * the legacy PCI interpretation. Avoid failing in such cases:
+ */
+if (pci_is_vf(dev) &&
+dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+return;
+}
+
 /*
  * multifunction bit is interpreted in two ways as follows.
  *   - all functions must set the bit to 1.
@@ -1083,6 +1103,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
bus->devices[devfn]->name);
 return NULL;
 } else if (dev->hotplugged &&
+   !pci_is_vf(pci_dev) &&
pci_get_function_0(pci_dev)) {
 error_setg(errp, "PCI: slot %d function 0 already occupied by %s,"
" new func %s cannot be exposed to guest.",
@@ -1191,6 +1212,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 pcibus_t size = memory_region_size(memory);
 uint8_t hdr_type;
 
+assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar */
 assert(region_num >= 0);
 assert(region_num < PCI_NUM_REGIONS);
 assert(is_power_of_2(size));
@@ -1294,11 +1316,43 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
 return pci_dev->io_regions[region_num].addr;
 }
 
-static pcibus_t pci_bar_address(PCIDevice *d,
-int reg, uint8_t type, pcibus_t size)
+static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
+uint8_t type, pcibus_t size)
+{
+pcibus_t new_addr;
+if (!pci_is_vf(d)) {
+int bar = pci_bar(d, reg);
+if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+new_addr = 

[PATCH v2 00/15] hw/nvme: SR-IOV with Virtualization Enhancements

2021-11-16 Thread Łukasz Gieryk
This is the updated version of SR-IOV support for the NVMe device.

Changes since v1:
 - Dropped the "pcie: Set default and supported MaxReadReq to 512" patch.
   The original author agrees it may be no longer needed for recent
   kernels.
 - Dropped the "pcie: Add callback preceding SR-IOV VFs update" patch.
   A customized pc->config_write callback is used instead.
 - Split the "hw/nvme: Calculate BAR attributes in a function” patch into
   cleanup and update parts.
 - Reworked the configuration options related to SR-IOV.
 - Fixed nvme_update_msixcap_ts() for platform without MSIX support.
 - Updated handling of SUBSYS_SLOT_RSVD values in nvme_subsys_ctrl().
 - Updated error codes returned from the Virtualization Management
   command (DNR is set).
 - Updated typedef/enum names mismatch.
 - Few other minor tweaks and improvements.

List of known gaps and nice-to-haves:

1) Interaction of secondary controllers with namespaces is not 100%
following the spec

The limitation: VF has to be visible on the PCI bus first, and only then
such VF can have a namespace attached.

The problem is that the mapping between controller and attached
namespaces is stored in the NvmeCtrl object, and unrealized VF doesn’t
have this object allocated. There are multiple ways to address the
issue, but none of those we’ve considered so far is elegant. The fact,
that the namespace-related code area is busy (pending patches, some
rework?), doesn’t help either.


2) VFs report and support the same features as the parent PF

For security reasons, the user of a VF should not be allowed to interact
with other VFs. Essentially, VFs should not announce and support:
Namespace Management, Attachment, the corresponding Identify commands,
and maybe other features as well.


3) PMR and CMB must be disabled when SR-IOV is active

The main concern is whether PMR/CMB should even be implemented for a
device that supports SR-IOV. If the answer is yes, then another question
arises: how should the feature behave? Simulating different devices may
require different implementations.

It's too early to get into such details, so we’ve decided to disallow
both features altogether if SR-IOV is enabled.


4) The Private Resources Mode

The NVMe Spec defines Flexible Resources as an optional feature. VFs can
alternatively support a fixed number of queues/interrupts.

This SR-IOV implementation supports Flexible Resources with the
Virtualization Management command, and a fallback to Private Resources
is not available. Support for such a fallback, if there’s demand, can be
implemented later.


5) Support for Namespace Management command

A device with virtualization enhancements must support the Namespace
Management command. The command is not strictly necessary to use
SR-IOV, but for a more complete set of features you may want to
cherry-pick this patch:
https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg03107.html
together with this fix:
https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06734.html


Knut Omang (2):
  pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt

Lukasz Maniak (4):
  hw/nvme: Add support for SR-IOV
  hw/nvme: Add support for Primary Controller Capabilities
  hw/nvme: Add support for Secondary Controller List
  docs: Add documentation for SR-IOV and Virtualization Enhancements

Łukasz Gieryk (9):
  pcie: Add helpers to the SR/IOV API
  pcie: Add 1.2 version token for the Power Management Capability
  hw/nvme: Implement the Function Level Reset
  hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  hw/nvme: Remove reg_size variable and update BAR0 size calculation
  hw/nvme: Calculate BAR attributes in a function
  hw/nvme: Initialize capability structures for primary/secondary
controllers
  hw/nvme: Add support for the Virtualization Management command
  hw/nvme: Update the initialization place for the AER queue

 docs/pcie_sriov.txt  | 115 +++
 docs/system/devices/nvme.rst |  35 ++
 hw/nvme/ctrl.c   | 634 ---
 hw/nvme/ns.c |   2 +-
 hw/nvme/nvme.h   |  51 ++-
 hw/nvme/subsys.c |  74 +++-
 hw/nvme/trace-events |   6 +
 hw/pci/meson.build   |   1 +
 hw/pci/pci.c |  97 --
 hw/pci/pcie.c|   5 +
 hw/pci/pcie_sriov.c  | 301 +
 hw/pci/trace-events  |   5 +
 include/block/nvme.h |  65 
 include/hw/pci/pci.h |  12 +-
 include/hw/pci/pci_ids.h |   1 +
 include/hw/pci/pci_regs.h|   1 +
 include/hw/pci/pcie.h|   6 +
 include/hw/pci/pcie_sriov.h  |  75 +
 include/qemu/typedefs.h  |   2 +
 19 files changed, 1407 insertions(+), 81 deletions(-)
 create mode 100644 docs/pcie_sriov.txt
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

-- 
2.25.1




[PATCH v2 09/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime

2021-11-16 Thread Łukasz Gieryk
The NVMe device defines two properties: max_ioqpairs, msix_qsize. Having
them as constants is problematic for SR-IOV support.

SR-IOV introduces virtual resources (queues, interrupts) that can be
assigned to PF and its dependent VFs. Each device, following a reset,
should work with the configured number of queues. A single constant is
no longer sufficient to hold the whole state.

This patch tries to solve the problem by introducing additional
variables in NvmeCtrl’s state. The variables for, e.g., managing queues
are therefore organized as:
 - n->params.max_ioqpairs – no changes, constant set by the user
 - n->(mutable_state) – (not a part of this patch) user-configurable,
   specifies the number of queues available _after_ reset
 - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
   n->params.max_ioqpairs; initialized in realize() and updated during
   reset() to reflect the user’s changes to the mutable state

Since the number of available i/o queues and interrupts can change at
runtime, buffers for sq/cqs and the MSIX-related structures are
allocated big enough to handle the limits, to completely avoid
complicated reallocation. A helper function (nvme_update_msixcap_ts)
updates the corresponding capability register to signal configuration
changes.
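
For illustration, updating the MSIX Table Size field boils down to a
short helper; a minimal sketch, assuming QEMU's existing msix/pci config
accessors (the actual code in the patch may differ in details):

    static void nvme_update_msixcap_ts(PCIDevice *pci_dev, uint32_t table_size)
    {
        uint8_t *config;

        if (!msix_present(pci_dev)) {
            return;
        }

        /* MSIX Message Control encodes the table size as N-1 */
        assert(table_size > 0 && table_size <= pci_dev->msix_entries_nr);

        config = pci_dev->config + pci_dev->msix_cap;
        pci_set_word_by_mask(config + PCI_MSIX_FLAGS, PCI_MSIX_FLAGS_QSIZE,
                             table_size - 1);
    }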

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c | 52 ++
 hw/nvme/nvme.h |  2 ++
 2 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index da5b630ae3..d502917d98 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -417,12 +417,12 @@ static bool nvme_nsid_valid(NvmeCtrl *n, uint32_t nsid)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->params.max_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->conf_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->params.max_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->conf_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -4035,8 +4035,7 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_err_invalid_create_sq_cqid(cqid);
 return NVME_INVALID_CQID | NVME_DNR;
 }
-if (unlikely(!sqid || sqid > n->params.max_ioqpairs ||
-n->sq[sqid] != NULL)) {
+if (unlikely(!sqid || sqid > n->conf_ioqpairs || n->sq[sqid] != NULL)) {
 trace_pci_nvme_err_invalid_create_sq_sqid(sqid);
 return NVME_INVALID_QID | NVME_DNR;
 }
@@ -4383,8 +4382,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_create_cq(prp1, cqid, vector, qsize, qflags,
  NVME_CQ_FLAGS_IEN(qflags) != 0);
 
-if (unlikely(!cqid || cqid > n->params.max_ioqpairs ||
-n->cq[cqid] != NULL)) {
+if (unlikely(!cqid || cqid > n->conf_ioqpairs || n->cq[cqid] != NULL)) {
 trace_pci_nvme_err_invalid_create_cq_cqid(cqid);
 return NVME_INVALID_QID | NVME_DNR;
 }
@@ -4400,7 +4398,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
-if (unlikely(vector >= n->params.msix_qsize)) {
+if (unlikely(vector >= n->conf_msix_qsize)) {
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -4981,13 +4979,12 @@ defaults:
 
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = (n->params.max_ioqpairs - 1) |
-((n->params.max_ioqpairs - 1) << 16);
+result = (n->conf_ioqpairs - 1) | ((n->conf_ioqpairs - 1) << 16);
 trace_pci_nvme_getfeat_numq(result);
 break;
 case NVME_INTERRUPT_VECTOR_CONF:
iv = dw11 & 0xffff;
-if (iv >= n->params.max_ioqpairs + 1) {
+if (iv >= n->conf_ioqpairs + 1) {
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
@@ -5142,10 +5139,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
NvmeRequest *req)
 
 trace_pci_nvme_setfeat_numq((dw11 & 0xffff) + 1,
 ((dw11 >> 16) & 0xffff) + 1,
-n->params.max_ioqpairs,
-n->params.max_ioqpairs);
-req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
-  ((n->params.max_ioqpairs - 1) << 16));
+n->conf_ioqpairs,
+n-&

[PATCH v2 08/15] hw/nvme: Implement the Function Level Reset

2021-11-16 Thread Łukasz Gieryk
This patch implements the Function Level Reset, a feature currently not
implemented for the NVMe device, even though it is listed as mandatory
("shall") in the 1.4 spec.

The implementation reuses FLR-related building blocks defined for the
pci-bridge module, and follows the same logic:
- FLR capability is advertised in the PCIE config,
- custom pci_write_config callback detects a write to the trigger
  register and performs the PCI reset,
- which, eventually, calls the custom dc->reset handler.

Depending on reset type, parts of the state should (or should not) be
cleared. To distinguish the type of reset, an additional parameter is
passed to the reset function.
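
For reference, the reset type is just a small enum; a sketch (the
constants below match how the diff uses them, the exact definition may
differ):

    typedef enum NvmeResetType {
        NVME_RESET_FUNCTION   = 0,
        NVME_RESET_CONTROLLER = 1,
    } NvmeResetType;

so that nvme_ctrl_reset() can, e.g., keep VFs enabled across a plain
controller reset while still disabling them on a function-level reset.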

This patch also enables advertisement of the Power Management PCI
capability. The main reason behind it is to announce the no_soft_reset=1
bit, to signal SR-IOV support where each VF can be reset individually.

The implementation purposely ignores writes to the PMCS.PS register,
as even such naïve behavior is enough to correctly handle the D3->D0
transition.

It’s worth noting that the power state transition back to D3, with
all the corresponding side effects, wasn't and still isn't handled
properly.

Signed-off-by: Łukasz Gieryk 
Reviewed-by: Klaus Jensen 
---
 hw/nvme/ctrl.c   | 45 
 hw/nvme/nvme.h   |  5 +
 hw/nvme/trace-events |  1 +
 3 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 961161ba8e..da5b630ae3 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5583,7 +5583,7 @@ static void nvme_process_sq(void *opaque)
 }
 }
 
-static void nvme_ctrl_reset(NvmeCtrl *n)
+static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
 {
 NvmeNamespace *ns;
 int i;
@@ -5615,7 +5615,9 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
 }
 
 if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
-pcie_sriov_pf_disable_vfs(&n->parent_obj);
+if (rst != NVME_RESET_CONTROLLER) {
+pcie_sriov_pf_disable_vfs(&n->parent_obj);
+}
 }
 
 n->aer_queued = 0;
@@ -5849,7 +5851,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
uint64_t data,
 }
 } else if (!NVME_CC_EN(data) && NVME_CC_EN(cc)) {
 trace_pci_nvme_mmio_stopped();
-nvme_ctrl_reset(n);
+nvme_ctrl_reset(n, NVME_RESET_CONTROLLER);
 cc = 0;
 csts &= ~NVME_CSTS_READY;
 }
@@ -6406,6 +6408,28 @@ static void nvme_init_sriov(NvmeCtrl *n, PCIDevice 
*pci_dev, uint16_t offset,
   PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
 }
 
+static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
+{
+Error *err = NULL;
+int ret;
+
+ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, offset,
+ PCI_PM_SIZEOF, &err);
+if (err) {
+error_report_err(err);
+return ret;
+}
+
+pci_set_word(pci_dev->config + offset + PCI_PM_PMC,
+ PCI_PM_CAP_VER_1_2);
+pci_set_word(pci_dev->config + offset + PCI_PM_CTRL,
+ PCI_PM_CTRL_NO_SOFT_RESET);
+pci_set_word(pci_dev->wmask + offset + PCI_PM_CTRL,
+ PCI_PM_CTRL_STATE_MASK);
+
+return 0;
+}
+
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
 uint8_t *pci_conf = pci_dev->config;
@@ -6427,7 +6451,9 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, 
Error **errp)
 }
 
 pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+nvme_add_pm_capability(pci_dev, 0x60);
 pcie_endpoint_cap_init(pci_dev, 0x80);
+pcie_cap_flr_init(pci_dev);
 if (n->params.sriov_max_vfs) {
 pcie_ari_init(pci_dev, 0x100, 1);
 }
@@ -6676,7 +6702,7 @@ static void nvme_exit(PCIDevice *pci_dev)
 NvmeNamespace *ns;
 int i;
 
-nvme_ctrl_reset(n);
+nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
 
 if (n->subsys) {
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
@@ -6775,6 +6801,15 @@ static void nvme_set_smart_warning(Object *obj, Visitor 
*v, const char *name,
 }
 }
 
+static void nvme_pci_reset(DeviceState *qdev)
+{
+PCIDevice *pci_dev = PCI_DEVICE(qdev);
+NvmeCtrl *n = NVME(pci_dev);
+
+trace_pci_nvme_pci_reset();
+nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
+}
+
 static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
   uint32_t val, int len)
 {
@@ -6808,6 +6843,7 @@ static void nvme_pci_write_config(PCIDevice *dev, 
uint32_t address,
 {
 nvme_sriov_pre_write_ctrl(dev, address, val, len);
 pci_default_write_config(dev, address, val, len);
+pcie_cap_flr_write_config(dev, address, val, len);
 }
 
 static const VMStateDescription nvme_vmstate = {
@@ -6830,6 +6866,7 @@ static void nvme_class_init(ObjectClass *oc, void *data)
 dc->desc = "Non-Vola

[PATCH v2 03/15] pcie: Add helpers to the SR/IOV API

2021-11-16 Thread Łukasz Gieryk
Two convenience functions for retrieving:
 - the total number of VFs,
 - the PCIDevice object of the N-th VF.
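
For illustration, a caller could combine the two helpers to walk every
VF of a PF; a hypothetical sketch (example_for_each_vf() is not part of
this series):

    static void example_for_each_vf(PCIDevice *pf)
    {
        uint16_t i, total = pcie_sriov_vf_number_total(pf);

        for (i = 0; i < total; i++) {
            PCIDevice *vf = pcie_sriov_get_vf_at_index(pf, i);
            if (vf) {
                /* operate on the VF, e.g. reset or reconfigure it */
            }
        }
    }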

Signed-off-by: Łukasz Gieryk 
Reviewed-by: Knut Omang 
---
 hw/pci/pcie_sriov.c | 14 ++
 include/hw/pci/pcie_sriov.h |  8 
 2 files changed, 22 insertions(+)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 501a1ff433..3515cb603d 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -280,8 +280,22 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev)
 return dev->exp.sriov_vf.vf_number;
 }
 
+uint16_t pcie_sriov_vf_number_total(PCIDevice *dev)
+{
+assert(!pci_is_vf(dev));
+return dev->exp.sriov_pf.num_vfs;
+}
 
 PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
 {
 return dev->exp.sriov_vf.pf;
 }
+
+PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
+{
+assert(!pci_is_vf(dev));
+if (n < dev->exp.sriov_pf.num_vfs) {
+return dev->exp.sriov_pf.vf[n];
+}
+return NULL;
+}
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 0974f00054..33ed6d0686 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -59,9 +59,17 @@ void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
 /* Get logical VF number of a VF - only valid for VFs */
 uint16_t pcie_sriov_vf_number(PCIDevice *dev);
 
+/* Get the total number of VFs - only valid for PF */
+uint16_t pcie_sriov_vf_number_total(PCIDevice *dev);
+
 /* Get the physical function that owns this VF.
  * Returns NULL if dev is not a virtual function
  */
 PCIDevice *pcie_sriov_get_pf(PCIDevice *dev);
 
+/* Get the n-th VF of this physical function - only valid for PF.
+ * Returns NULL if index is invalid
+ */
+PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n);
+
 #endif /* QEMU_PCIE_SRIOV_H */
-- 
2.25.1




[PATCH v2 06/15] hw/nvme: Add support for Primary Controller Capabilities

2021-11-16 Thread Łukasz Gieryk
From: Lukasz Maniak 

Implementation of Primary Controller Capabilities data
structure (Identify command with CNS value of 14h).

Currently, the command returns only the ID of the primary controller.
Handling of the remaining fields is added in subsequent patches
implementing virtualization enhancements.
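
For a quick sanity check from a Linux guest, the structure can be
fetched with a raw Identify (CNS 14h); something along these lines with
nvme-cli should do (device path illustrative):

    # Identify, CNS=0x14 (Primary Controller Capabilities), 4096-byte read
    nvme admin-passthru /dev/nvme0 --opcode=0x06 --cdw10=0x14 \
        --data-len=4096 --read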

Signed-off-by: Lukasz Maniak 
Reviewed-by: Klaus Jensen 
---
 hw/nvme/ctrl.c   | 22 +-
 hw/nvme/nvme.h   |  2 ++
 hw/nvme/trace-events |  1 +
 include/block/nvme.h | 23 +++
 4 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 04de63f30e..44734a74f9 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4538,6 +4538,13 @@ static uint16_t nvme_identify_ctrl_list(NvmeCtrl *n, 
NvmeRequest *req,
 return nvme_c2h(n, (uint8_t *)list, sizeof(list), req);
 }
 
+static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, NvmeRequest *req)
+{
+trace_pci_nvme_identify_pri_ctrl_cap(le16_to_cpu(n->pri_ctrl_cap.cntlid));
+
+return nvme_c2h(n, (uint8_t *)&n->pri_ctrl_cap, sizeof(NvmePriCtrlCap), 
req);
+}
+
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
  bool active)
 {
@@ -4756,6 +4763,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest 
*req)
 return nvme_identify_ctrl_list(n, req, true);
 case NVME_ID_CNS_CTRL_LIST:
 return nvme_identify_ctrl_list(n, req, false);
+case NVME_ID_CNS_PRIMARY_CTRL_CAP:
+return nvme_identify_pri_ctrl_cap(n, req);
 case NVME_ID_CNS_CS_NS:
 return nvme_identify_ns_csi(n, req, true);
 case NVME_ID_CNS_CS_NS_PRESENT:
@@ -6306,6 +6315,8 @@ static void nvme_check_constraints(NvmeCtrl *n, Error 
**errp)
 
 static void nvme_init_state(NvmeCtrl *n)
 {
+NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
+
 /* add one to max_ioqpairs to account for the admin queue pair */
 n->reg_size = pow2ceil(sizeof(NvmeBar) +
2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
@@ -6315,6 +6326,8 @@ static void nvme_init_state(NvmeCtrl *n)
 n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
 n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
 n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
+
+cap->cntlid = cpu_to_le16(n->cntlid);
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -6614,15 +6627,14 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS,
   &pci_dev->qdev, n->parent_obj.qdev.id);
 
-nvme_init_state(n);
-if (nvme_init_pci(n, pci_dev, errp)) {
-return;
-}
-
 if (nvme_init_subsys(n, errp)) {
 error_propagate(errp, local_err);
 return;
 }
+nvme_init_state(n);
+if (nvme_init_pci(n, pci_dev, errp)) {
+return;
+}
 nvme_init_ctrl(n, pci_dev);
 
 /* setup a namespace if the controller drive property was given */
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 4c8af34b28..81deb45dfb 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -461,6 +461,8 @@ typedef struct NvmeCtrl {
 };
 uint32_t    async_config;
 } features;
+
+NvmePriCtrlCap  pri_ctrl_cap;
 } NvmeCtrl;
 
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index ff6cafd520..1014ebceb6 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -52,6 +52,7 @@ pci_nvme_identify_ctrl(void) "identify controller"
 pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_ctrl_list(uint8_t cns, uint16_t cntid) "cns 0x%"PRIx8" cntid 
%"PRIu16""
+pci_nvme_identify_pri_ctrl_cap(uint16_t cntlid) "identify primary controller 
capabilities cntlid=%"PRIu16""
 pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", 
csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "nsid=%"PRIu16", 
csi=0x%"PRIx8""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index e3bd47bf76..f69bd1d14f 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1017,6 +1017,7 @@ enum NvmeIdCns {
 NVME_ID_CNS_NS_PRESENT= 0x11,
 NVME_ID_CNS_NS_ATTACHED_CTRL_LIST = 0x12,
 NVME_ID_CNS_CTRL_LIST = 0x13,
+NVME_ID_CNS_PRIMARY_CTRL_CAP  = 0x14,
 NVME_ID_CNS_CS_NS_PRESENT_LIST= 0x1a,
 NVME_ID_CNS_CS_NS_PRESENT = 0x1b,
 NVME_ID_CNS_IO_COMMAND_SET= 0x1c,
@@ -1465,6 +1466,27 @@ typedef enum NvmeZoneState {
 NVME_ZONE_STATE_OFFLINE  = 0x0f,
 } NvmeZoneState;
 
+typedef struct QEMU_PACKED NvmePriCtrlCap {
+uint16_t    cntlid;
+uint16_t    portid;
+uint8_t     crt;
+uint8_t     rsvd5[27];
+uint32_t    vqfrt;
+uint32_t    vqrfa;
+uint16_t    vqrfap;
+uint16_t    vqprt;
+

[PATCH v2 05/15] hw/nvme: Add support for SR-IOV

2021-11-16 Thread Łukasz Gieryk
From: Lukasz Maniak 

This patch implements initial support for Single Root I/O Virtualization
on an NVMe device.

Essentially, it allows one to define the maximum number of virtual
functions supported by the NVMe controller via the sriov_max_vfs
parameter.

Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
capability by a physical controller and ARI capability by both the
physical and virtual function devices.

NVMe controllers created via virtual functions functionally mirror the
physical controller, which may not be entirely desirable, so some
consideration is needed on how to limit the capabilities of a VF.

An NVMe subsystem is required for the use of SR-IOV.
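
An illustrative invocation (the guest-side PCI address is hypothetical):

    -device nvme-subsys,id=subsys0 \
    -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=4

and then, inside the guest, VFs are enabled via the standard Linux sysfs
knob:

    echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs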

Signed-off-by: Lukasz Maniak 
---
 hw/nvme/ctrl.c   | 84 ++--
 hw/nvme/nvme.h   |  3 +-
 include/hw/pci/pci_ids.h |  1 +
 3 files changed, 84 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 6a571d18cf..04de63f30e 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -35,6 +35,7 @@
  *  mdts=,vsl=, \
  *  zoned.zasl=, \
  *  zoned.auto_transition=, \
+ *  sriov_max_vfs= \
  *  subsys=
  *  -device nvme-ns,drive=,bus=,nsid=,\
  *  zoned=, \
@@ -106,6 +107,12 @@
  *   transitioned to zone state closed for resource management purposes.
  *   Defaults to 'on'.
  *
+ * - `sriov_max_vfs`
+ *   Indicates the maximum number of PCIe virtual functions supported
+ *   by the controller. The default value is 0. Specifying a non-zero value
+ *   enables reporting of both SR-IOV and ARI capabilities by the NVMe device.
+ *   Virtual function controllers will not report SR-IOV capability.
+ *
  * nvme namespace device parameters
  * 
  * - `shared`
@@ -160,6 +167,7 @@
 #include "sysemu/block-backend.h"
 #include "sysemu/hostmem.h"
 #include "hw/pci/msix.h"
+#include "hw/pci/pcie_sriov.h"
 #include "migration/vmstate.h"
 
 #include "nvme.h"
@@ -175,6 +183,9 @@
 #define NVME_TEMPERATURE_CRITICAL 0x175
 #define NVME_NUM_FW_SLOTS 1
 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
+#define NVME_MAX_VFS 127
+#define NVME_VF_OFFSET 0x1
+#define NVME_VF_STRIDE 1
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -5583,6 +5594,10 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
 g_free(event);
 }
 
+if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
+pcie_sriov_pf_disable_vfs(&n->parent_obj);
+}
+
 n->aer_queued = 0;
 n->outstanding_aers = 0;
 n->qs_created = false;
@@ -6264,6 +6279,29 @@ static void nvme_check_constraints(NvmeCtrl *n, Error 
**errp)
 error_setg(errp, "vsl must be non-zero");
 return;
 }
+
+if (params->sriov_max_vfs) {
+if (!n->subsys) {
+error_setg(errp, "subsystem is required for the use of SR-IOV");
+return;
+}
+
+if (params->sriov_max_vfs > NVME_MAX_VFS) {
+error_setg(errp, "sriov_max_vfs must be between 0 and %d",
+   NVME_MAX_VFS);
+return;
+}
+
+if (params->cmb_size_mb) {
+error_setg(errp, "CMB is not supported with SR-IOV");
+return;
+}
+
+if (n->pmr.dev) {
+error_setg(errp, "PMR is not supported with SR-IOV");
+return;
+}
+}
 }
 
 static void nvme_init_state(NvmeCtrl *n)
@@ -6321,6 +6359,20 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice 
*pci_dev)
 memory_region_set_enabled(&n->pmr.dev->mr, false);
 }
 
+static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
+uint64_t bar_size)
+{
+uint16_t vf_dev_id = n->params.use_intel_id ?
+ PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
+
+pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
+   n->params.sriov_max_vfs, n->params.sriov_max_vfs,
+   NVME_VF_OFFSET, NVME_VF_STRIDE);
+
+pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
+  PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
+}
+
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
 uint8_t *pci_conf = pci_dev->config;
@@ -6335,7 +6387,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, 
Error **errp)
 
 if (n->params.use_intel_id) {
 pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
-pci_config_set_device_id(pci_conf, 0x5845);
+pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_INTEL_NVME);
 } else {
 pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_REDHAT);
 pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_REDHAT_NVME);
@@ -6343,6 +6395,9 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, 
Error **errp)
 
 pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
 pcie_endpoint_cap_init(pci_dev, 0x80);
+ 

[PATCH v2 04/15] pcie: Add 1.2 version token for the Power Management Capability

2021-11-16 Thread Łukasz Gieryk
Signed-off-by: Łukasz Gieryk 
---
 include/hw/pci/pci_regs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
index 77ba64b931..a590140962 100644
--- a/include/hw/pci/pci_regs.h
+++ b/include/hw/pci/pci_regs.h
@@ -4,5 +4,6 @@
 #include "standard-headers/linux/pci_regs.h"
 
 #define  PCI_PM_CAP_VER_1_1 0x0002  /* PCI PM spec ver. 1.1 */
+#define  PCI_PM_CAP_VER_1_2 0x0003  /* PCI PM spec ver. 1.2 */
 
 #endif
-- 
2.25.1




Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers

2021-11-08 Thread Łukasz Gieryk
On Mon, Nov 08, 2021 at 09:25:58AM +0100, Klaus Jensen wrote:
> On Nov  5 15:04, Łukasz Gieryk wrote:
> > On Fri, Nov 05, 2021 at 09:46:28AM +0100, Łukasz Gieryk wrote:
> > > On Thu, Nov 04, 2021 at 04:48:43PM +0100, Łukasz Gieryk wrote:
> > > > On Wed, Nov 03, 2021 at 01:07:31PM +0100, Klaus Jensen wrote:
> > > > > On Oct  7 18:24, Lukasz Maniak wrote:
> > > > > > From: Łukasz Gieryk 
> > > > > > 
> > > > > > With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) 
> > > > > > one
> > > > > > can configure the maximum number of virtual queues and interrupts
> > > > > > assignable to a single virtual device. The primary and secondary
> > > > > > controller capability structures are initialized accordingly.
> > > > > > 
> > > > > > Since the number of available queues (interrupts) now varies between
> > > > > > VF/PF, BAR size calculation is also adjusted.
> > > > > > 
> > > > > 
> > > > > While this patch allows configuring the VQFRSM and VIFRSM fields, it
> > > > > implicitly sets VQFRT and VIFRT (i.e. by setting them to the product 
> > > > > of
> > > > > sriov_max_vi_pervf and max_vfs). Which is just setting it to an upper
> > > > > bound and this removes a testable case for host software (e.g.
> > > > > requesting more flexible resources than what is currently available).
> > > > > 
> > > > > This patch also requires that these parameters are set if 
> > > > > sriov_max_vfs
> > > > > is. I think we can provide better defaults.
> > > > > 
> > > > 
> > > > Originally I considered more params, but ended up coding the simplest,
> > > > user-friendly solution, because I did not like the mess with so many
> > > > parameters, and the flexibility wasn't needed for my use cases. But I do
> > > > agree: others may need the flexibility. Case (FRT < max_vfs * FRSM) is
> > > > valid and resembles an actual device.
> > > > 
> > > > > How about,
> > > > > 
> > > > > 1. if only sriov_max_vfs is set, then all VFs get private resources
> > > > >equal to max_ioqpairs. Like before this patch. This limits the 
> > > > > number
> > > > >of parameters required to get a basic setup going.
> > > > > 
> > > > > 2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
> > > > >10), the difference between that and max_ioqpairs become flexible
> > > > >resources. Also, I'd be just fine with having sriov_v{q,i}_flexible
> > > > >instead and just make the difference become private resources.
> > > > >Potato/potato.
> > > > > 
> > > > >a. in the absence of sriov_max_v{q,i}_per_vf, set them to the 
> > > > > number
> > > > >   of calculated flexible resources.
> > > > > 
> > > > > This probably smells a bit like bikeshedding, but I think this gives
> > > > > more flexibility and better defaults, which helps with verifying host
> > > > > software.
> > > > > 
> > > > > If we can't agree on this now, I suggest we could go ahead and merge 
> > > > > the
> > > > > base functionality (i.e. private resources only) and ruminate some 
> > > > > more
> > > > > about these parameters.
> > > > 
> > > > The problem is that the spec allows VFs to support either only private,
> > > > or only flexible resources.
> > > > 
> > > > At this point I have to admit, that since my use cases for
> > > > QEMU/Nvme/SRIOV require flexible resources, I haven’t paid much
> > > > attention to the case with VFs having private resources. So this SR/IOV
> > > > implementation doesn’t even support such case (max_vX_per_vf != 0).
> > > > 
> > > > Let me summarize the possible config space, and how the current
> > > > parameters (could) map to these (interrupt-related ones omitted):
> > > > 
> > > > Flexible resources not supported (not implemented):
> > > >  - Private resources for PF = max_ioqpairs
> > > >  - Private resources per VF = ?
> > > >  - (error if flexible resources are configured)
> > > > 
> > &g

Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers

2021-11-05 Thread Łukasz Gieryk
On Fri, Nov 05, 2021 at 09:46:28AM +0100, Łukasz Gieryk wrote:
> On Thu, Nov 04, 2021 at 04:48:43PM +0100, Łukasz Gieryk wrote:
> > On Wed, Nov 03, 2021 at 01:07:31PM +0100, Klaus Jensen wrote:
> > > On Oct  7 18:24, Lukasz Maniak wrote:
> > > > From: Łukasz Gieryk 
> > > > 
> > > > With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
> > > > can configure the maximum number of virtual queues and interrupts
> > > > assignable to a single virtual device. The primary and secondary
> > > > controller capability structures are initialized accordingly.
> > > > 
> > > > Since the number of available queues (interrupts) now varies between
> > > > VF/PF, BAR size calculation is also adjusted.
> > > > 
> > > 
> > > While this patch allows configuring the VQFRSM and VIFRSM fields, it
> > > implicitly sets VQFRT and VIFRT (i.e. by setting them to the product of
> > > sriov_max_vi_pervf and max_vfs). Which is just setting it to an upper
> > > bound and this removes a testable case for host software (e.g.
> > > requesting more flexible resources than what is currently available).
> > > 
> > > This patch also requires that these parameters are set if sriov_max_vfs
> > > is. I think we can provide better defaults.
> > > 
> > 
> > Originally I considered more params, but ended up coding the simplest,
> > user-friendly solution, because I did not like the mess with so many
> > parameters, and the flexibility wasn't needed for my use cases. But I do
> > agree: others may need the flexibility. Case (FRT < max_vfs * FRSM) is
> > valid and resembles an actual device.
> > 
> > > How about,
> > > 
> > > 1. if only sriov_max_vfs is set, then all VFs get private resources
> > >equal to max_ioqpairs. Like before this patch. This limits the number
> > >of parameters required to get a basic setup going.
> > > 
> > > 2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
> > >10), the difference between that and max_ioqpairs become flexible
> > >resources. Also, I'd be just fine with having sriov_v{q,i}_flexible
> > >instead and just make the difference become private resources.
> > >Potato/potato.
> > > 
> > >a. in the absence of sriov_max_v{q,i}_per_vf, set them to the number
> > >   of calculated flexible resources.
> > > 
> > > This probably smells a bit like bikeshedding, but I think this gives
> > > more flexibility and better defaults, which helps with verifying host
> > > software.
> > > 
> > > If we can't agree on this now, I suggest we could go ahead and merge the
> > > base functionality (i.e. private resources only) and ruminate some more
> > > about these parameters.
> > 
> > The problem is that the spec allows VFs to support either only private,
> > or only flexible resources.
> > 
> > At this point I have to admit, that since my use cases for
> > QEMU/Nvme/SRIOV require flexible resources, I haven’t paid much
> > attention to the case with VFs having private resources. So this SR/IOV
> > implementation doesn’t even support such case (max_vX_per_vf != 0).
> > 
> > Let me summarize the possible config space, and how the current
> > parameters (could) map to these (interrupt-related ones omitted):
> > 
> > Flexible resources not supported (not implemented):
> >  - Private resources for PF = max_ioqpairs
> >  - Private resources per VF = ?
> >  - (error if flexible resources are configured)
> > 
> > With flexible resources:
> >  - VQPRT, private resources for PF  = max_ioqpairs
> >  - VQFRT, total flexible resources  = max_vq_per_vf * num_vfs
> >  - VQFRSM, maximum assignable per VF= max_vq_per_vf
> >  - VQGRAN, granularity  = #define constant
> >  - (error if private resources per VF are configured)
> > 
> > Since I don’t want to misunderstand your suggestion: could you provide a
> > similar map with your parameters, formulas, and explain how to determine
> > if flexible resources are active? I want to be sure we are on the
> > same page.
> > 
> 
> I’ve just re-read through my email and decided that some bits need
> clarification.
> 
> This implementation supports the “Flexible”-resources-only flavor of
> SR/IOV, while the “Private” also could be supported. Some effort is
> required to support both, and I cannot afford that (at least I cannot
> commit today, n

Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers

2021-11-05 Thread Łukasz Gieryk
On Thu, Nov 04, 2021 at 04:48:43PM +0100, Łukasz Gieryk wrote:
> On Wed, Nov 03, 2021 at 01:07:31PM +0100, Klaus Jensen wrote:
> > On Oct  7 18:24, Lukasz Maniak wrote:
> > > From: Łukasz Gieryk 
> > > 
> > > With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
> > > can configure the maximum number of virtual queues and interrupts
> > > assignable to a single virtual device. The primary and secondary
> > > controller capability structures are initialized accordingly.
> > > 
> > > Since the number of available queues (interrupts) now varies between
> > > VF/PF, BAR size calculation is also adjusted.
> > > 
> > 
> > While this patch allows configuring the VQFRSM and VIFRSM fields, it
> > implicitly sets VQFRT and VIFRT (i.e. by setting them to the product of
> > sriov_max_vi_pervf and max_vfs). Which is just setting it to an upper
> > bound and this removes a testable case for host software (e.g.
> > requesting more flexible resources than what is currently available).
> > 
> > This patch also requires that these parameters are set if sriov_max_vfs
> > is. I think we can provide better defaults.
> > 
> 
> Originally I considered more params, but ended up coding the simplest,
> user-friendly solution, because I did not like the mess with so many
> parameters, and the flexibility wasn't needed for my use cases. But I do
> agree: others may need the flexibility. Case (FRT < max_vfs * FRSM) is
> valid and resembles an actual device.
> 
> > How about,
> > 
> > 1. if only sriov_max_vfs is set, then all VFs get private resources
> >equal to max_ioqpairs. Like before this patch. This limits the number
> >of parameters required to get a basic setup going.
> > 
> > 2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
> >10), the difference between that and max_ioqpairs become flexible
> >resources. Also, I'd be just fine with having sriov_v{q,i}_flexible
> >instead and just make the difference become private resources.
> >Potato/potato.
> > 
> >a. in the absence of sriov_max_v{q,i}_per_vf, set them to the number
> >   of calculated flexible resources.
> > 
> > This probably smells a bit like bikeshedding, but I think this gives
> > more flexibility and better defaults, which helps with verifying host
> > software.
> > 
> > If we can't agree on this now, I suggest we could go ahead and merge the
> > base functionality (i.e. private resources only) and ruminate some more
> > about these parameters.
> 
> The problem is that the spec allows VFs to support either only private,
> or only flexible resources.
> 
> At this point I have to admit, that since my use cases for
> QEMU/Nvme/SRIOV require flexible resources, I haven’t paid much
> attention to the case with VFs having private resources. So this SR/IOV
> implementation doesn’t even support such case (max_vX_per_vf != 0).
> 
> Let me summarize the possible config space, and how the current
> parameters (could) map to these (interrupt-related ones omitted):
> 
> Flexible resources not supported (not implemented):
>  - Private resources for PF = max_ioqpairs
>  - Private resources per VF = ?
>  - (error if flexible resources are configured)
> 
> With flexible resources:
>  - VQPRT, private resources for PF  = max_ioqpairs
>  - VQFRT, total flexible resources  = max_vq_per_vf * num_vfs
>  - VQFRSM, maximum assignable per VF= max_vq_per_vf
>  - VQGRAN, granularity  = #define constant
>  - (error if private resources per VF are configured)
> 
> Since I don’t want to misunderstand your suggestion: could you provide a
> similar map with your parameters, formulas, and explain how to determine
> if flexible resources are active? I want to be sure we are on the
> same page.
> 

I’ve just re-read through my email and decided that some bits need
clarification.

This implementation supports the “Flexible”-resources-only flavor of
SR/IOV, while the “Private” flavor could also be supported. Some effort
is required to support both, and I cannot afford that (at least I cannot
commit to it today, and neither can the other Lukasz).

While I’m ready to rework the Flexible config and prepare it to be
extended later to handle the Private variant, the 2nd version of these
patches will still support the Flexible flavor only.

I will include appropriate TODO/open in the next cover letter.




Re: [PATCH 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers

2021-11-04 Thread Łukasz Gieryk
On Wed, Nov 03, 2021 at 01:07:31PM +0100, Klaus Jensen wrote:
> On Oct  7 18:24, Lukasz Maniak wrote:
> > From: Łukasz Gieryk 
> > 
> > With two new properties (sriov_max_vi_per_vf, sriov_max_vq_per_vf) one
> > can configure the maximum number of virtual queues and interrupts
> > assignable to a single virtual device. The primary and secondary
> > controller capability structures are initialized accordingly.
> > 
> > Since the number of available queues (interrupts) now varies between
> > VF/PF, BAR size calculation is also adjusted.
> > 
> 
> While this patch allows configuring the VQFRSM and VIFRSM fields, it
> implicitly sets VQFRT and VIFRT (i.e. by setting them to the product of
> sriov_max_vi_pervf and max_vfs). Which is just setting it to an upper
> bound and this removes a testable case for host software (e.g.
> requesting more flexible resources than what is currently available).
> 
> This patch also requires that these parameters are set if sriov_max_vfs
> is. I think we can provide better defaults.
> 

Originally I considered more params, but ended up coding the simplest,
user-friendly solution, because I did not like the mess with so many
parameters, and the flexibility wasn't needed for my use cases. But I do
agree: others may need the flexibility. Case (FRT < max_vfs * FRSM) is
valid and resembles an actual device.

> How about,
> 
> 1. if only sriov_max_vfs is set, then all VFs get private resources
>equal to max_ioqpairs. Like before this patch. This limits the number
>of parameters required to get a basic setup going.
> 
> 2. if sriov_v{q,i}_private is set (I suggested this parameter in patch
>10), the difference between that and max_ioqpairs become flexible
>resources. Also, I'd be just fine with having sriov_v{q,i}_flexible
>instead and just make the difference become private resources.
>Potato/potato.
> 
>a. in the absence of sriov_max_v{q,i}_per_vf, set them to the number
>   of calculated flexible resources.
> 
> This probably smells a bit like bikeshedding, but I think this gives
> more flexibility and better defaults, which helps with verifying host
> software.
> 
> If we can't agree on this now, I suggest we could go ahead and merge the
> base functionality (i.e. private resources only) and ruminate some more
> about these parameters.

The problem is that the spec allows VFs to support either only private
or only flexible resources.

At this point I have to admit that, since my use cases for
QEMU/NVMe/SR-IOV require flexible resources, I haven’t paid much
attention to the case of VFs having private resources. So this SR/IOV
implementation doesn’t even support such a case (max_vX_per_vf != 0).

Let me summarize the possible config space, and how the current
parameters (could) map to these (interrupt-related ones omitted):

Flexible resources not supported (not implemented):
 - Private resources for PF = max_ioqpairs
 - Private resources per VF = ?
 - (error if flexible resources are configured)

With flexible resources:
 - VQPRT, private resources for PF  = max_ioqpairs
 - VQFRT, total flexible resources  = max_vq_per_vf * num_vfs
 - VQFRSM, maximum assignable per VF= max_vq_per_vf
 - VQGRAN, granularity  = #define constant
 - (error if private resources per VF are configured)
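
To make the mapping concrete (numbers only for illustration): with
max_ioqpairs=10, sriov_max_vfs=4 and max_vq_per_vf=4, the above yields
VQPRT=10, VQFRT=4*4=16, and VQFRSM=4.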

Since I don’t want to misunderstand your suggestion: could you provide a
similar map with your parameters, formulas, and explain how to determine
if flexible resources are active? I want to be sure we are on the
same page.

-- 
Regards,
Łukasz



Re: [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime

2021-10-21 Thread Łukasz Gieryk
On Wed, Oct 20, 2021 at 09:06:06PM +0200, Klaus Jensen wrote:
> On Oct  7 18:24, Lukasz Maniak wrote:
> > From: Łukasz Gieryk 
> > 
> > The Nvme device defines two properties: max_ioqpairs, msix_qsize. Having
> > them as constants is problematic for SR-IOV support.
> > 
> > The SR-IOV feature introduces virtual resources (queues, interrupts)
> > that can be assigned to PF and its dependent VFs. Each device, following
> > a reset, should work with the configured number of queues. A single
> > constant is no longer sufficient to hold the whole state.
> > 
> > This patch tries to solve the problem by introducing additional
> > variables in NvmeCtrl’s state. The variables for, e.g., managing queues
> > are therefore organized as:
> > 
> >  - n->params.max_ioqpairs – no changes, constant set by the user.
> > 
> >  - n->max_ioqpairs - (new) value derived from n->params.* in realize();
> >  constant through device’s lifetime.
> > 
> >  - n->(mutable_state) – (not a part of this patch) user-configurable,
> > specifies number of queues available _after_
> > reset.
> > 
> >  - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
> >   n->params.max_ioqpairs; initialized in realize()
> >   and updated during reset() to reflect user’s
> >   changes to the mutable state.
> > 
> > Since the number of available i/o queues and interrupts can change in
> > runtime, buffers for sq/cqs and the MSIX-related structures are
> > allocated big enough to handle the limits, to completely avoid the
> > complicated reallocation. A helper function (nvme_update_msixcap_ts)
> > updates the corresponding capability register, to signal configuration
> > changes.
> > 
> > Signed-off-by: Łukasz Gieryk 
> 
> Instead of this, how about adding new parameters, say, sriov_vi_private
> and sriov_vq_private. Then, max_ioqpairs and msix_qsize are still the
> "physical" limits and the new parameters just reserve some for the
> primary controller, the rest being available for flexsible resources.

Compare your configuration:

max_ioqpairs = 26
sriov_max_vfs= 4
sriov_vq_private = 10

with mine:

max_ioqpairs= 10
sriov_max_vfs   = 4
sriov_max_vq_per_vf = 4

In your version, if I wanted to change max_vfs but keep the same number
of flexible resources per VF, then I would have to do some math and
update max_ioqpairs. And then I would also have to adjust the other,
interrupt-related parameter, as it's also affected. In my opinion
that's quite inconvenient.
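
(For the record, both describe the same device: in my version the
flexible pool is 4 * 4 = 16 queues on top of 10 private ones, 26 in
total; in yours the same 16 fall out of 26 - 10.)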
 
Now, even if I changed the semantics of the params, I would still need
most of this patch. (Let’s keep the discussion about whether the max_*
fields are necessary in the other thread.)

Without virtualization, the maximum number of queues is constant. The
user (i.e., the nvme kernel driver) can only query this value (e.g., 10)
and needs to respect this limit.

With virtualization, the flexible resources kick in. Let's continue with
the sample numbers defined earlier (10 private + 16 flexible resources).

1) The device boots; all 16 flexible queues are assigned to the primary
   controller.
2) The nvme kernel driver queries the limit (10+16=26) and can create/use
   up to this many queues.
3) The user, via the Virtualization Management command, unbinds some (let's
   say 2) of the flexible queues from the primary controller and assigns
   them to a secondary controller.
4) After reset, the Physical Function Device reports a different limit
   (24), and when the Virtual Device shows up, it will report 1 (the admin
   queue consumed the other resource).

So I need an additional variable in the state to store the intermediate
limit (24 or 1), as none of the existing params holds the correct value,
and all the places that validate limits must work on that value.




Re: [PATCH 10/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime

2021-10-18 Thread Łukasz Gieryk
On Mon, Oct 18, 2021 at 12:06:22PM +0200, Philippe Mathieu-Daudé wrote:
> On 10/7/21 18:24, Lukasz Maniak wrote:
> > From: Łukasz Gieryk 
> > 
> > The Nvme device defines two properties: max_ioqpairs, msix_qsize. Having
> > them as constants is problematic for SR-IOV support.
> > 
> > The SR-IOV feature introduces virtual resources (queues, interrupts)
> > that can be assigned to PF and its dependent VFs. Each device, following
> > a reset, should work with the configured number of queues. A single
> > constant is no longer sufficient to hold the whole state.
> > 
> > This patch tries to solve the problem by introducing additional
> > variables in NvmeCtrl’s state. The variables for, e.g., managing queues
> > are therefore organized as:
> > 
> >  - n->params.max_ioqpairs – no changes, constant set by the user.
> > 
> >  - n->max_ioqpairs - (new) value derived from n->params.* in realize();
> >  constant through device’s lifetime.
> > 
> >  - n->(mutable_state) – (not a part of this patch) user-configurable,
> > specifies number of queues available _after_
> > reset.
> > 
> >  - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
> >   n->params.max_ioqpairs; initialized in realize()
> >   and updated during reset() to reflect user’s
> >   changes to the mutable state.
> > 
> > Since the number of available i/o queues and interrupts can change in
> > runtime, buffers for sq/cqs and the MSIX-related structures are
> > allocated big enough to handle the limits, to completely avoid the
> > complicated reallocation. A helper function (nvme_update_msixcap_ts)
> > updates the corresponding capability register, to signal configuration
> > changes.
> > 
> > Signed-off-by: Łukasz Gieryk 
> > ---
> >  hw/nvme/ctrl.c | 62 +-
> >  hw/nvme/nvme.h |  4 
> >  2 files changed, 45 insertions(+), 21 deletions(-)
> 
> > @@ -6322,11 +6334,17 @@ static void nvme_init_state(NvmeCtrl *n)
> >  NvmeSecCtrlEntry *sctrl;
> >  int i;
> >  
> > +n->max_ioqpairs = n->params.max_ioqpairs;
> > +n->conf_ioqpairs = n->max_ioqpairs;
> > +
> > +n->max_msix_qsize = n->params.msix_qsize;
> > +n->conf_msix_qsize = n->max_msix_qsize;
> 
> From an developer perspective, the API becomes confusing.
> Most fields from NvmeParams are exposed via QMP, such max_ioqpairs.

Hi Philippe,

I’m not sure I understand your concern. The NvmeParams stays as it was,
so the interaction with QMP stays unchanged. Sure, if QMP allows
updating NvmeParams at runtime (I’m guessing, as I’m not really
familiar with the feature), then the NVMe device will no longer
respond to those changes. But n->conf_ioqpairs is not meant to be
altered via QEMU’s interfaces, but rather through the NVMe protocol, by
the guest OS kernel/user.

Could you explain how the changes are going to break (or make more
confusing) the interaction with QMP?

> I'm not sure we need 2 distinct fields. Maybe simply reorganize
> to not reset these values in the DeviceReset handler?

The idea was to calculate the max value once and use it in multiple
places later. The actual calculations are in the following 12/15 patch
(I’m also including the code below), so indeed, the intended use case
is not so obvious.

if (pci_is_vf(&n->parent_obj)) {
n->max_ioqpairs = n->params.sriov_max_vq_per_vf - 1;
} else {
n->max_ioqpairs = n->params.max_ioqpairs +
  n->params.sriov_max_vfs * n->params.sriov_max_vq_per_vf;
}

But as I think more about the problem, the max_* fields may indeed not
be necessary. I could calculate max_msix_qsize in the only place it’s
used, and turn the above snippet for max_ioqpairs into a function. The
downside is that the code for calculating the maximums is no longer
grouped together.
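
A sketch of such a function (the name is made up here):

    static uint32_t nvme_max_ioqpairs(NvmeCtrl *n)
    {
        if (pci_is_vf(&n->parent_obj)) {
            return n->params.sriov_max_vq_per_vf - 1;
        }

        return n->params.max_ioqpairs +
               n->params.sriov_max_vfs * n->params.sriov_max_vq_per_vf;
    }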

> Also, with this series we should consider implementing the migration
> state (nvme_vmstate).

I wasn’t aware of this feature. I have to do some reading first.

> > diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> > index 9fbb0a70b5..65383e495c 100644
> > --- a/hw/nvme/nvme.h
> > +++ b/hw/nvme/nvme.h
> > @@ -420,6 +420,10 @@ typedef struct NvmeCtrl {
> >  uint64_t    starttime_ms;
> >  uint16_t    temperature;
> >  uint8_t     smart_critical_warning;
> > +uint32_t    max_msix_qsize;  /* Derived from params.msix_qsize */
> > +uint32_t    conf_msix_qsize; /* Configured limit */
> > +uint32_t    max_ioqpairs;    /* Derived from params.max_ioqpairs */
> > +uint32_t    conf_ioqpairs;   /* Configured limit */
> >  
> 

-- 
Regards,
Łukasz Gieryk