Re: [PATCH 2/2] qapi/block: Restrict vhost-user-blk to CONFIG_VHOST_USER_BLK_SERVER
Philippe Mathieu-Daudé writes:

> Do not list vhost-user-blk in BlockExportType
> when CONFIG_VHOST_USER_BLK_SERVER is disabled.
>
> Fixes: 90fc91d50b7 ("convert vhost-user-blk server to block export API")

My immediate reaction was "what exactly is broken before this patch?"
I think it's introspection: query-qmp-schema has vhost-user-blk even
though it's not actually available. Let's spell that out.

> Signed-off-by: Philippe Mathieu-Daudé
> ---
>  qapi/block-export.json | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/qapi/block-export.json b/qapi/block-export.json
> index c1b92ce1c1c..6bc29a75dc0 100644
> --- a/qapi/block-export.json
> +++ b/qapi/block-export.json
> @@ -277,7 +277,8 @@
>  # Since: 4.2
>  ##
>  { 'enum': 'BlockExportType',
> -  'data': [ 'nbd', 'vhost-user-blk',
> +  'data': [ 'nbd',
> +            { 'name': 'vhost-user-blk', 'if': 'CONFIG_VHOST_USER_BLK_SERVER' },
>              { 'name': 'fuse', 'if': 'CONFIG_FUSE' } ] }
>
>  ##

Doesn't compile when I configure --disable-vhost-user. Fix:

diff --git a/qapi/block-export.json b/qapi/block-export.json
index 6bc29a75dc..f9ce79a974 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -320,7 +320,8 @@
   'discriminator': 'type',
   'data': {
       'nbd': 'BlockExportOptionsNbd',
-      'vhost-user-blk': 'BlockExportOptionsVhostUserBlk',
+      'vhost-user-blk': { 'type': 'BlockExportOptionsVhostUserBlk',
+                          'if': 'CONFIG_VHOST_USER_BLK_SERVER' },
       'fuse': { 'type': 'BlockExportOptionsFuse',
                 'if': 'CONFIG_FUSE' } } }
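To illustrate why the enum member and the union branch have to carry the same condition, here is a minimal C sketch of what conditional schema members amount to once they reach compiled code. It is not the generated QAPI output: every `Mock*`/`mock_*`/`MOCK_*` name is hypothetical, and `MOCK_CONFIG_VHOST_USER_BLK_SERVER` merely stands in for the real build option. With the option left undefined the member does not exist at all, so any unguarded reference to it would fail to compile, and a name lookup simply does not find it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical build switch; a real build would get it from configure.
 * It is left undefined here to model --disable-vhost-user.
 */
/* #define MOCK_CONFIG_VHOST_USER_BLK_SERVER 1 */

typedef enum {
    MOCK_EXPORT_TYPE_NBD,
#if defined(MOCK_CONFIG_VHOST_USER_BLK_SERVER)
    MOCK_EXPORT_TYPE_VHOST_USER_BLK,
#endif
    MOCK_EXPORT_TYPE__MAX,
} MockExportType;

/* Name table used for parsing and for introspection-style listings. */
static const char *const mock_export_type_names[MOCK_EXPORT_TYPE__MAX] = {
    [MOCK_EXPORT_TYPE_NBD] = "nbd",
#if defined(MOCK_CONFIG_VHOST_USER_BLK_SERVER)
    [MOCK_EXPORT_TYPE_VHOST_USER_BLK] = "vhost-user-blk",
#endif
};

/* Returns the enum value for name, or -1 if it is not compiled in. */
static int mock_export_type_parse(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < MOCK_EXPORT_TYPE__MAX; i++) {
        if (strcmp(mock_export_type_names[i], name) == 0) {
            return i;
        }
    }
    return -1;
}

/*
 * Union-branch dispatch. The case label needs the same guard as the enum
 * member; without it this function stops compiling when the option is off.
 */
static const char *mock_export_options_type(MockExportType type)
{
    switch (type) {
    case MOCK_EXPORT_TYPE_NBD:
        return "BlockExportOptionsNbd";
#if defined(MOCK_CONFIG_VHOST_USER_BLK_SERVER)
    case MOCK_EXPORT_TYPE_VHOST_USER_BLK:
        return "BlockExportOptionsVhostUserBlk";
#endif
    default:
        return NULL;
    }
}
```

Defining the mock switch makes both the name and the branch reappear together, which is the property the follow-up hunk restores for the union.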
Re: [PATCH 5/5] dma: Let ld*_pci_dma() propagate MemTxResult
On 12/18/21 7:10 AM, Philippe Mathieu-Daudé wrote: ld*_dma() returns a MemTxResult type. Do not discard it, return it to the caller. Update the few callers. Signed-off-by: Philippe Mathieu-Daudé --- include/hw/pci/pci.h | 17 - hw/audio/intel-hda.c | 2 +- hw/net/eepro100.c| 25 ++--- hw/net/tulip.c | 16 hw/scsi/megasas.c| 21 - hw/scsi/mptsas.c | 16 +++- hw/scsi/vmw_pvscsi.c | 16 ++-- 7 files changed, 60 insertions(+), 53 deletions(-) diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h index c90cecc85c0..5b36334a28a 100644 --- a/include/hw/pci/pci.h +++ b/include/hw/pci/pci.h @@ -850,15 +850,14 @@ static inline MemTxResult pci_dma_write(PCIDevice *dev, dma_addr_t addr, DMA_DIRECTION_FROM_DEVICE, MEMTXATTRS_UNSPECIFIED); } -#define PCI_DMA_DEFINE_LDST(_l, _s, _bits) \ -static inline uint##_bits##_t ld##_l##_pci_dma(PCIDevice *dev, \ - dma_addr_t addr, \ - MemTxAttrs attrs) \ -{ \ -uint##_bits##_t val; \ -ld##_l##_dma(pci_get_address_space(dev), addr, &val, attrs); \ -return val; \ -} \ +#define PCI_DMA_DEFINE_LDST(_l, _s, _bits) \ +static inline MemTxResult ld##_l##_pci_dma(PCIDevice *dev, \ + dma_addr_t addr, \ + uint##_bits##_t *val, \ + MemTxAttrs attrs) \ +{ \ +return ld##_l##_dma(pci_get_address_space(dev), addr, val, attrs); \ +} \ static inline MemTxResult st##_s##_pci_dma(PCIDevice *dev, \ dma_addr_t addr, \ uint##_bits##_t val, \ diff --git a/hw/audio/intel-hda.c b/hw/audio/intel-hda.c index e34b7ab0e92..2b55d521503 100644 --- a/hw/audio/intel-hda.c +++ b/hw/audio/intel-hda.c @@ -335,7 +335,7 @@ static void intel_hda_corb_run(IntelHDAState *d) rp = (d->corb_rp + 1) & 0xff; addr = intel_hda_addr(d->corb_lbase, d->corb_ubase); -verb = ldl_le_pci_dma(&d->pci, addr + 4 * rp, MEMTXATTRS_UNSPECIFIED); +ldl_le_pci_dma(&d->pci, addr + 4 * rp, &verb, MEMTXATTRS_UNSPECIFIED); d->corb_rp = rp; dprint(d, 2, "%s: [rp 0x%x] verb 0x%08x\n", __func__, rp, verb); diff --git a/hw/net/eepro100.c b/hw/net/eepro100.c index eb82e9cb118..679f52f80f1 100644 --- 
a/hw/net/eepro100.c +++ b/hw/net/eepro100.c @@ -769,18 +769,16 @@ static void tx_command(EEPRO100State *s) } else { /* Flexible mode. */ uint8_t tbd_count = 0; +uint32_t tx_buffer_address; +uint16_t tx_buffer_size; +uint16_t tx_buffer_el; + if (s->has_extended_tcb_support && !(s->configuration[6] & BIT(4))) { /* Extended Flexible TCB. */ for (; tbd_count < 2; tbd_count++) { -uint32_t tx_buffer_address = ldl_le_pci_dma(&s->dev, -tbd_address, -attrs); -uint16_t tx_buffer_size = lduw_le_pci_dma(&s->dev, - tbd_address + 4, - attrs); -uint16_t tx_buffer_el = lduw_le_pci_dma(&s->dev, -tbd_address + 6, -attrs); +ldl_le_pci_dma(&s->dev, tbd_address, &tx_buffer_address, attrs); +lduw_le_pci_dma(&s->dev, tbd_address + 4, &tx_buffer_size, attrs); +lduw_le_pci_dma(&s->dev, tbd_address + 6, &tx_buffer_el, attrs); tbd_address += 8; TRACE(RXTX, logout ("TBD (extended flexible mode): buffer address 0x%08x, size 0x%04x\n", @@ -796,12 +794,9 @@ static void tx_command(EEPRO100State *s) } tbd_address = tbd_array; for (; tbd_count < s->tx.tbd_count; tbd_count++) { -uint32_t tx_buffer_address = ldl_le_pci_dma(&s->dev, tbd_address, -attrs); -uint16_t tx_buffer_size = lduw_le_pci_dma(&s->dev, tbd_address + 4, - attrs); -uint16_t tx_buffer_el = lduw_le_pci_dma(&s->dev, tbd_address + 6, -attrs); +ldl_le_pci_dma(&s->dev, tbd_address, &tx_buffer_address, attrs); +lduw_le_pci_dma(&s-
Re: [PATCH 4/5] dma: Let st*_pci_dma() propagate MemTxResult
On 12/18/21 7:10 AM, Philippe Mathieu-Daudé wrote:
> st*_dma() returns a MemTxResult type. Do not discard
> it, return it to the caller.
>
> Signed-off-by: Philippe Mathieu-Daudé
> ---
>  include/hw/pci/pci.h | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Richard Henderson

r~
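As a rough illustration of what "return it to the caller" buys, the sketch below contrasts a wrapper that drops the store result with one that propagates it. Nothing here is QEMU code: `MockMemTxResult`, `mock_ram`, and all `mock_*` functions are hypothetical stand-ins, and the "old" void shape is inferred from the commit message rather than quoted from the tree.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mock stand-ins; the real MemTxResult and MEMTX_* live in QEMU headers. */
typedef uint32_t MockMemTxResult;
#define MOCK_MEMTX_OK    0u
#define MOCK_MEMTX_ERROR 1u

#define MOCK_RAM_SIZE 16
static uint8_t mock_ram[MOCK_RAM_SIZE];

/* Keeps the bounds subtraction below from wrapping around. */
static_assert(MOCK_RAM_SIZE >= sizeof(uint32_t), "mock RAM too small");

/* Mock of the lower-level store: fails when the access leaves mock RAM. */
static MockMemTxResult mock_stl_dma(uint64_t addr, uint32_t val)
{
    if (addr > MOCK_RAM_SIZE - sizeof(val)) {
        return MOCK_MEMTX_ERROR;
    }
    memcpy(&mock_ram[addr], &val, sizeof(val));
    return MOCK_MEMTX_OK;
}

/* Old shape (assumed): the wrapper returns void, so a failure is invisible. */
static void mock_stl_pci_dma_old(uint64_t addr, uint32_t val)
{
    (void)mock_stl_dma(addr, val);
}

/* New shape: the wrapper hands the result to the caller unchanged. */
static MockMemTxResult mock_stl_pci_dma(uint64_t addr, uint32_t val)
{
    return mock_stl_dma(addr, val);
}

/* A caller can now react to a failed store. Returns 0 on success, -1 on error. */
static int mock_post_status(uint64_t status_addr, uint32_t status)
{
    if (mock_stl_pci_dma(status_addr, status) != MOCK_MEMTX_OK) {
        return -1;
    }
    return 0;
}
```

Callers that do not care can still ignore the return value, so the change is backwards compatible at the call sites.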
Re: [PATCH 3/5] dma: Let ld*_pci_dma() take MemTxAttrs argument
On 12/18/21 7:10 AM, Philippe Mathieu-Daudé wrote:
> Let devices specify transaction attributes when calling
> ld*_pci_dma().
>
> Keep the default MEMTXATTRS_UNSPECIFIED in the few callers.
>
> Signed-off-by: Philippe Mathieu-Daudé
> ---
>  include/hw/pci/pci.h |  6 +++---
>  hw/audio/intel-hda.c |  2 +-
>  hw/net/eepro100.c    | 19 +--
>  hw/net/tulip.c       | 18 ++
>  hw/scsi/megasas.c    | 16 ++--
>  hw/scsi/mptsas.c     | 10 ++
>  hw/scsi/vmw_pvscsi.c |  3 ++-
>  hw/usb/hcd-xhci.c    |  1 +
>  8 files changed, 46 insertions(+), 29 deletions(-)

Reviewed-by: Richard Henderson

r~
Re: [PATCH 2/5] dma: Let st*_pci_dma() take MemTxAttrs argument
On 12/18/21 7:10 AM, Philippe Mathieu-Daudé wrote:
> Let devices specify transaction attributes when calling
> st*_pci_dma().
>
> Keep the default MEMTXATTRS_UNSPECIFIED in the few callers.
>
> Signed-off-by: Philippe Mathieu-Daudé
> ---
>  include/hw/pci/pci.h | 11 ++-
>  hw/audio/intel-hda.c | 10 ++
>  hw/net/eepro100.c    | 29 ++---
>  hw/net/tulip.c       | 18 ++
>  hw/scsi/megasas.c    | 15 ++-
>  hw/scsi/vmw_pvscsi.c |  3 ++-
>  6 files changed, 52 insertions(+), 34 deletions(-)

Reviewed-by: Richard Henderson

r~
Re: [PATCH 1/5] hw/scsi/megasas: Use uint32_t for reply queue head/tail values
On 12/18/21 7:10 AM, Philippe Mathieu-Daudé wrote:
> While the reply queue values fit in 16-bit, they are accessed
> as 32-bit:
>
>   661:s->reply_queue_head = ldl_le_pci_dma(pcid, s->producer_pa);
>   662:s->reply_queue_head %= MEGASAS_MAX_FRAMES;
>   663:s->reply_queue_tail = ldl_le_pci_dma(pcid, s->consumer_pa);
>   664:s->reply_queue_tail %= MEGASAS_MAX_FRAMES;
>
> Having:
>
>   41:#define MEGASAS_MAX_FRAMES 2048 /* Firmware limit at 65535 */
>
> In order to update the ld/st*_pci_dma() API to pass the address
> of the value to access, it is simpler to have the head/tail declared
> as 32-bit values. Replace the uint16_t by uint32_t, wasting 4 bytes in
> the MegasasState structure.
>
> Signed-off-by: Philippe Mathieu-Daudé
> ---
>  hw/scsi/megasas.c    | 4 ++--
>  hw/scsi/trace-events | 8
>  2 files changed, 6 insertions(+), 6 deletions(-)

Acked-by: Richard Henderson

r~
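The reasoning above can be made concrete with a small sketch. Once the load helper writes through a `uint32_t *` instead of returning the value, the destination field has to be 32 bits wide; a 16-bit field would need a temporary at every call site. The code below is only a model under that assumption: `MockQueueState`, `mock_ldl_le`, and `mock_refresh_head` are hypothetical names, and `MOCK_MAX_FRAMES` mirrors the 2048 quoted in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_FRAMES 2048u  /* mirrors MEGASAS_MAX_FRAMES quoted above */

/* Head and tail declared as 32-bit so their address can be passed directly. */
typedef struct {
    uint32_t reply_queue_head;
    uint32_t reply_queue_tail;
} MockQueueState;

/*
 * Mock 32-bit little-endian load through an out pointer.
 * Returns 0 on success, -1 if the access would leave the buffer.
 */
static int mock_ldl_le(const uint8_t *mem, size_t len, size_t off,
                       uint32_t *val)
{
    assert(mem != NULL && val != NULL);
    if (len < 4 || off > len - 4) {
        return -1;
    }
    *val = (uint32_t)mem[off] |
           (uint32_t)mem[off + 1] << 8 |
           (uint32_t)mem[off + 2] << 16 |
           (uint32_t)mem[off + 3] << 24;
    return 0;
}

/* Load the producer index straight into the state field, then wrap it. */
static int mock_refresh_head(MockQueueState *s, const uint8_t *mem,
                             size_t len, size_t producer_off)
{
    if (mock_ldl_le(mem, len, producer_off, &s->reply_queue_head) != 0) {
        return -1;
    }
    s->reply_queue_head %= MOCK_MAX_FRAMES;
    return 0;
}
```

After the modulo the stored value always fits in 16 bits, which is why the wider type costs only the extra bytes mentioned in the commit message and changes no behaviour.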
[PATCH v3 13/15] hw/nvme: Add support for the Virtualization Management command
From: Łukasz Gieryk With the new command one can: - assign flexible resources (queues, interrupts) to primary and secondary controllers, - toggle the online/offline state of given controller. Signed-off-by: Łukasz Gieryk --- hw/nvme/ctrl.c | 253 ++- hw/nvme/nvme.h | 20 hw/nvme/trace-events | 3 + include/block/nvme.h | 17 +++ 4 files changed, 291 insertions(+), 2 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index e43773b525..e21c60fee8 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -188,6 +188,7 @@ #include "qemu/error-report.h" #include "qemu/log.h" #include "qemu/units.h" +#include "qemu/range.h" #include "qapi/error.h" #include "qapi/visitor.h" #include "sysemu/sysemu.h" @@ -259,6 +260,7 @@ static const uint32_t nvme_cse_acs[256] = { [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP, [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP, [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC, +[NVME_ADM_CMD_VIRT_MNGMT] = NVME_CMD_EFF_CSUPP, [NVME_ADM_CMD_FORMAT_NVM] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, }; @@ -290,6 +292,7 @@ static const uint32_t nvme_cse_iocs_zoned[256] = { }; static void nvme_process_sq(void *opaque); +static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst); static uint16_t nvme_sqid(NvmeRequest *req) { @@ -5539,6 +5542,164 @@ out: return status; } +static void nvme_get_virt_res_num(NvmeCtrl *n, uint8_t rt, int *num_total, + int *num_prim, int *num_sec) +{ +*num_total = le32_to_cpu(rt ? n->pri_ctrl_cap.vifrt : n->pri_ctrl_cap.vqfrt); +*num_prim = le16_to_cpu(rt ? n->pri_ctrl_cap.virfap : n->pri_ctrl_cap.vqrfap); +*num_sec = le16_to_cpu(rt ? 
n->pri_ctrl_cap.virfa : n->pri_ctrl_cap.vqrfa); +} + +static uint16_t nvme_assign_virt_res_to_prim(NvmeCtrl *n, NvmeRequest *req, + uint16_t cntlid, uint8_t rt, int nr) +{ +int num_total, num_prim, num_sec; + +if (cntlid != n->cntlid) { +return NVME_INVALID_CTRL_ID | NVME_DNR; +} + +nvme_get_virt_res_num(n, rt, &num_total, &num_prim, &num_sec); + +if (nr > num_total) { +return NVME_INVALID_NUM_RESOURCES | NVME_DNR; +} + +if (nr > num_total - num_sec) { +return NVME_INVALID_RESOURCE_ID | NVME_DNR; +} + +if (rt) { +n->next_pri_ctrl_cap.virfap = cpu_to_le16(nr); +} else { +n->next_pri_ctrl_cap.vqrfap = cpu_to_le16(nr); +} + +req->cqe.result = cpu_to_le32(nr); +return req->status; +} + +static void nvme_update_virt_res(NvmeCtrl *n, NvmeSecCtrlEntry *sctrl, + uint8_t rt, int nr) +{ +int prev_nr, prev_total; + +if (rt) { +prev_nr = le16_to_cpu(sctrl->nvi); +prev_total = le32_to_cpu(n->pri_ctrl_cap.virfa); +sctrl->nvi = cpu_to_le16(nr); +n->pri_ctrl_cap.virfa = cpu_to_le32(prev_total + nr - prev_nr); +} else { +prev_nr = le16_to_cpu(sctrl->nvq); +prev_total = le32_to_cpu(n->pri_ctrl_cap.vqrfa); +sctrl->nvq = cpu_to_le16(nr); +n->pri_ctrl_cap.vqrfa = cpu_to_le32(prev_total + nr - prev_nr); +} +} + +static uint16_t nvme_assign_virt_res_to_sec(NvmeCtrl *n, NvmeRequest *req, +uint16_t cntlid, uint8_t rt, int nr) +{ +int num_total, num_prim, num_sec, num_free, diff, limit; +NvmeSecCtrlEntry *sctrl; + +sctrl = nvme_sctrl_for_cntlid(n, cntlid); +if (!sctrl) { +return NVME_INVALID_CTRL_ID | NVME_DNR; +} + +if (sctrl->scs) { +return NVME_INVALID_SEC_CTRL_STATE | NVME_DNR; +} + +limit = le16_to_cpu(rt ? n->pri_ctrl_cap.vifrsm : n->pri_ctrl_cap.vqfrsm); +if (nr > limit) { +return NVME_INVALID_NUM_RESOURCES | NVME_DNR; +} + +nvme_get_virt_res_num(n, rt, &num_total, &num_prim, &num_sec); +num_free = num_total - num_prim - num_sec; +diff = nr - le16_to_cpu(rt ? 
sctrl->nvi : sctrl->nvq); + +if (diff > num_free) { +return NVME_INVALID_RESOURCE_ID | NVME_DNR; +} + +nvme_update_virt_res(n, sctrl, rt, nr); +req->cqe.result = cpu_to_le32(nr); + +return req->status; +} + +static uint16_t nvme_virt_set_state(NvmeCtrl *n, uint16_t cntlid, bool online) +{ +NvmeCtrl *sn = NULL; +NvmeSecCtrlEntry *sctrl; +int vf_index; + +sctrl = nvme_sctrl_for_cntlid(n, cntlid); +if (!sctrl) { +return NVME_INVALID_CTRL_ID | NVME_DNR; +} + +if (!pci_is_vf(&n->parent_obj)) { +vf_index = le16_to_cpu(sctrl->vfn) - 1; +sn = NVME(pcie_sriov_get_vf_at_index(&n->parent_obj, vf_index)); +} + +if (online) { +if (!sctrl->nvi || (le16_to_cpu(sctrl->nvq) < 2) || !sn) { +return NVME_INVALID_SEC_CTRL_STATE | NVME_DNR; +} + +if (!sctrl->scs) { +sctrl->scs = 0x1; +nvme_ctrl_reset(
[PATCH v3 10/15] hw/nvme: Remove reg_size variable and update BAR0 size calculation
From: Łukasz Gieryk

The n->reg_size parameter unnecessarily splits the BAR0 size calculation
in two phases; removed to simplify the code.

With all the calculations done in one place, it seems the pow2ceil,
applied originally to reg_size, is unnecessary. The rounding should
happen as the last step, when BAR size includes Nvme registers, queue
registers, and MSIX-related space.

Finally, the size of the mmio memory region is extended to cover the 1st
4KiB padding (see the map below). Access to this range is handled as
interaction with a non-existing queue and generates an error trace, so
actually nothing changes, while the reg_size variable is no longer needed.

    |      BAR0        |
    [Nvme Registers    ]
    [Queues            ]
    [power-of-2 padding] - removed in this patch
    [4KiB padding (1)  ]
    [MSIX TABLE        ]
    [4KiB padding (2)  ]
    [MSIX PBA          ]
    [power-of-2 padding]

Signed-off-by: Łukasz Gieryk
---
 hw/nvme/ctrl.c | 10 +-
 hw/nvme/nvme.h |  1 -
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index de463450b6..a4b11b201a 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6370,9 +6370,6 @@ static void nvme_init_state(NvmeCtrl *n)
     n->conf_ioqpairs = n->params.max_ioqpairs;
     n->conf_msix_qsize = n->params.msix_qsize;
 
-    /* add one to max_ioqpairs to account for the admin queue pair */
-    n->reg_size = pow2ceil(sizeof(NvmeBar) +
-                           2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
     n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
     n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
     n->temperature = NVME_TEMPERATURE;
@@ -6496,7 +6493,10 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         pcie_ari_init(pci_dev, 0x100, 1);
     }
 
-    bar_size = QEMU_ALIGN_UP(n->reg_size, 4 * KiB);
+    /* add one to max_ioqpairs to account for the admin queue pair */
+    bar_size = sizeof(NvmeBar) +
+               2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE;
+    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
     msix_table_offset = bar_size;
     msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize;
 
@@ -6510,7 +6510,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 
     memory_region_init(&n->bar0, OBJECT(n), "nvme-bar0", bar_size);
     memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
-                          n->reg_size);
+                          msix_table_offset);
     memory_region_add_subregion(&n->bar0, 0, &n->iomem);
 
     if (pci_is_vf(pci_dev)) {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 927890b490..1401ac3904 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -414,7 +414,6 @@ typedef struct NvmeCtrl {
     uint16_t    max_prp_ents;
     uint16_t    cqe_size;
     uint16_t    sqe_size;
-    uint32_t    reg_size;
     uint32_t    max_q_ents;
     uint8_t     outstanding_aers;
     uint32_t    irq_status;
-- 
2.25.1
[PATCH v3 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
From: Łukasz Gieryk With four new properties: - sriov_v{i,q}_flexible, - sriov_max_v{i,q}_per_vf, one can configure the number of available flexible resources, as well as the limits. The primary and secondary controller capability structures are initialized accordingly. Since the number of available queues (interrupts) now varies between VF/PF, BAR size calculation is also adjusted. Signed-off-by: Łukasz Gieryk --- hw/nvme/ctrl.c | 138 --- hw/nvme/nvme.h | 4 ++ include/block/nvme.h | 5 ++ 3 files changed, 140 insertions(+), 7 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index a26abaea36..e43773b525 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -36,6 +36,10 @@ * zoned.zasl=, \ * zoned.auto_transition=, \ * sriov_max_vfs= \ + * sriov_vq_flexible= \ + * sriov_vi_flexible= \ + * sriov_max_vi_per_vf= \ + * sriov_max_vq_per_vf= \ * subsys= * -device nvme-ns,drive=,bus=,nsid=,\ * zoned=, \ @@ -113,6 +117,29 @@ * enables reporting of both SR-IOV and ARI capabilities by the NVMe device. * Virtual function controllers will not report SR-IOV capability. * + * NOTE: Single Root I/O Virtualization support is experimental. + * All the related parameters may be subject to change. + * + * - `sriov_vq_flexible` + * Indicates the total number of flexible queue resources assignable to all + * the secondary controllers. Implicitly sets the number of primary + * controller's private resources to `(max_ioqpairs - sriov_vq_flexible)`. + * + * - `sriov_vi_flexible` + * Indicates the total number of flexible interrupt resources assignable to + * all the secondary controllers. Implicitly sets the number of primary + * controller's private resources to `(msix_qsize - sriov_vi_flexible)`. + * + * - `sriov_max_vi_per_vf` + * Indicates the maximum number of virtual interrupt resources assignable + * to a secondary controller. The default 0 resolves to + * `(sriov_vi_flexible / sriov_max_vfs)`. 
+ * + * - `sriov_max_vq_per_vf` + * Indicates the maximum number of virtual queue resources assignable to + * a secondary controller. The default 0 resolves to + * `(sriov_vq_flexible / sriov_max_vfs)`. + * * nvme namespace device parameters * * - `shared` @@ -184,6 +211,7 @@ #define NVME_NUM_FW_SLOTS 1 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB) #define NVME_MAX_VFS 127 +#define NVME_VF_RES_GRANULARITY 1 #define NVME_VF_OFFSET 0x1 #define NVME_VF_STRIDE 1 @@ -6357,6 +6385,54 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) error_setg(errp, "PMR is not supported with SR-IOV"); return; } + +if (!params->sriov_vq_flexible || !params->sriov_vi_flexible) { +error_setg(errp, "both sriov_vq_flexible and sriov_vi_flexible" + " must be set for the use of SR-IOV"); +return; +} + +if (params->sriov_vq_flexible < params->sriov_max_vfs * 2) { +error_setg(errp, "sriov_vq_flexible must be greater than or equal" + " to %d (sriov_max_vfs * 2)", params->sriov_max_vfs * 2); +return; +} + +if (params->max_ioqpairs < params->sriov_vq_flexible + 2) { +error_setg(errp, "max_ioqpairs - sriov_vq_flexible (PF-private" + " queue resources) must be greater than or equal to 2"); +return; +} + +if (params->sriov_vi_flexible < params->sriov_max_vfs) { +error_setg(errp, "sriov_vi_flexible must be greater than or equal" + " to %d (sriov_max_vfs)", params->sriov_max_vfs); +return; +} + +if (params->msix_qsize < params->sriov_vi_flexible + 1) { +error_setg(errp, "msix_qsize - sriov_vi_flexible (PF-private" + " interrupt resources) must be greater than or equal" + " to 1"); +return; +} + +if (params->sriov_max_vi_per_vf && +(params->sriov_max_vi_per_vf - 1) % NVME_VF_RES_GRANULARITY) { +error_setg(errp, "sriov_max_vi_per_vf must meet:" + " (X - 1) %% %d == 0 and X >= 1", + NVME_VF_RES_GRANULARITY); +return; +} + +if (params->sriov_max_vq_per_vf && +(params->sriov_max_vq_per_vf < 2 || + (params->sriov_max_vq_per_vf - 1) % NVME_VF_RES_GRANULARITY)) { +error_setg(errp, 
"sriov_max_vq_per_vf must meet:" + " (X - 1) %% %d == 0 and X >= 2", + NVME_VF_RES_GRANULARITY); +return; +} } } @@ -6365,10 +6441,19 @@ static void nvme_init_state(NvmeCtrl *n) NvmePriCtrlCap *cap = &n->pri_ctrl_cap; NvmeS
[PATCH v3 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements
Signed-off-by: Lukasz Maniak
---
 docs/system/devices/nvme.rst | 36
 1 file changed, 36 insertions(+)

diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index b5acb2a9c1..166a11abc6 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -239,3 +239,39 @@ The virtual namespace device supports DIF- and DIX-based protection information
 to ``1`` to transfer protection information as the first eight bytes of
 metadata. Otherwise, the protection information is transferred as the last eight
 bytes.
+
+Virtualization Enhancements and SR-IOV (Experimental Support)
+-------------------------------------------------------------
+
+The ``nvme`` device supports Single Root I/O Virtualization and Sharing
+along with Virtualization Enhancements. The controller has to be linked to
+an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.
+
+A number of parameters are present (**please note that they may be
+subject to change**):
+
+``sriov_max_vfs`` (default: ``0``)
+  Indicates the maximum number of PCIe virtual functions supported
+  by the controller. Specifying a non-zero value enables reporting of both
+  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
+  by the NVMe device. Virtual function controllers will not report SR-IOV.
+
+``sriov_vq_flexible``
+  Indicates the total number of flexible queue resources assignable to all
+  the secondary controllers. Implicitly sets the number of primary
+  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.
+
+``sriov_vi_flexible``
+  Indicates the total number of flexible interrupt resources assignable to
+  all the secondary controllers. Implicitly sets the number of primary
+  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.
+
+``sriov_max_vi_per_vf`` (default: ``0``)
+  Indicates the maximum number of virtual interrupt resources assignable
+  to a secondary controller. The default ``0`` resolves to
+  ``(sriov_vi_flexible / sriov_max_vfs)``.
+
+``sriov_max_vq_per_vf`` (default: ``0``)
+  Indicates the maximum number of virtual queue resources assignable to
+  a secondary controller. The default ``0`` resolves to
+  ``(sriov_vq_flexible / sriov_max_vfs)``.
-- 
2.25.1
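The parameter descriptions above imply some arithmetic that is easy to get wrong when picking values, so here is a small sketch of it. It models only a subset of the checks that the series adds to nvme_check_constraints() in patch 12/15 (the granularity checks are left out) plus the documented "default 0" resolution. `MockSriovParams`, `mock_sriov_check`, and `mock_resolve_per_vf` are hypothetical names, not QEMU API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the device properties named in the docs. */
typedef struct {
    uint32_t max_ioqpairs;
    uint32_t msix_qsize;
    uint32_t sriov_max_vfs;
    uint32_t sriov_vq_flexible;
    uint32_t sriov_vi_flexible;
    uint32_t sriov_max_vq_per_vf;  /* 0 means "derive from the totals" */
    uint32_t sriov_max_vi_per_vf;  /* 0 means "derive from the totals" */
} MockSriovParams;

/* Returns NULL when the combination is acceptable, else a short reason. */
static const char *mock_sriov_check(const MockSriovParams *p)
{
    assert(p != NULL);
    if (p->sriov_max_vfs == 0) {
        return NULL;  /* SR-IOV not requested, nothing to check */
    }
    if (!p->sriov_vq_flexible || !p->sriov_vi_flexible) {
        return "both flexible totals must be set";
    }
    if (p->sriov_vq_flexible < p->sriov_max_vfs * 2) {
        return "need at least 2 flexible queues per VF";
    }
    if (p->max_ioqpairs < p->sriov_vq_flexible + 2) {
        return "PF must keep at least 2 private queues";
    }
    if (p->sriov_vi_flexible < p->sriov_max_vfs) {
        return "need at least 1 flexible interrupt per VF";
    }
    if (p->msix_qsize < p->sriov_vi_flexible + 1) {
        return "PF must keep at least 1 private interrupt";
    }
    if (p->sriov_max_vq_per_vf && p->sriov_max_vq_per_vf < 2) {
        return "per-VF queue limit must be at least 2";
    }
    return NULL;
}

/* Documented default: 0 resolves to (flexible total / number of VFs). */
static uint32_t mock_resolve_per_vf(uint32_t per_vf, uint32_t flexible,
                                    uint32_t max_vfs)
{
    if (per_vf != 0 || max_vfs == 0) {
        return per_vf;
    }
    return flexible / max_vfs;
}
```

For example, 4 VFs with 8 flexible queues and 4 flexible interrupts need `max_ioqpairs` of at least 10 and `msix_qsize` of at least 5, and the per-VF limits then resolve to 2 queues and 1 interrupt.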
[PATCH v3 11/15] hw/nvme: Calculate BAR attributes in a function
From: Łukasz Gieryk

An NVMe device with SR-IOV capability calculates the BAR size
differently for PF and VF, so it makes sense to extract the common code
to a separate function.

Signed-off-by: Łukasz Gieryk
---
 hw/nvme/ctrl.c | 45 +++--
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index a4b11b201a..a26abaea36 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6429,6 +6429,34 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
     memory_region_set_enabled(&n->pmr.dev->mr, false);
 }
 
+static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
+                              unsigned *msix_table_offset,
+                              unsigned *msix_pba_offset)
+{
+    uint64_t bar_size, msix_table_size, msix_pba_size;
+
+    bar_size = sizeof(NvmeBar) + 2 * total_queues * NVME_DB_SIZE;
+    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
+
+    if (msix_table_offset) {
+        *msix_table_offset = bar_size;
+    }
+
+    msix_table_size = PCI_MSIX_ENTRY_SIZE * total_irqs;
+    bar_size += msix_table_size;
+    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
+
+    if (msix_pba_offset) {
+        *msix_pba_offset = bar_size;
+    }
+
+    msix_pba_size = QEMU_ALIGN_UP(total_irqs, 64) / 8;
+    bar_size += msix_pba_size;
+
+    bar_size = pow2ceil(bar_size);
+    return bar_size;
+}
+
 static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
                             uint64_t bar_size)
 {
@@ -6468,7 +6496,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
     uint8_t *pci_conf = pci_dev->config;
-    uint64_t bar_size, msix_table_size, msix_pba_size;
+    uint64_t bar_size;
     unsigned msix_table_offset, msix_pba_offset;
     int ret;
 
@@ -6494,19 +6522,8 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
     }
 
     /* add one to max_ioqpairs to account for the admin queue pair */
-    bar_size = sizeof(NvmeBar) +
-               2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE;
-    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
-    msix_table_offset = bar_size;
-    msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize;
-
-    bar_size += msix_table_size;
-    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
-    msix_pba_offset = bar_size;
-    msix_pba_size = QEMU_ALIGN_UP(n->params.msix_qsize, 64) / 8;
-
-    bar_size += msix_pba_size;
-    bar_size = pow2ceil(bar_size);
+    bar_size = nvme_bar_size(n->params.max_ioqpairs + 1, n->params.msix_qsize,
+                             &msix_table_offset, &msix_pba_offset);
 
     memory_region_init(&n->bar0, OBJECT(n), "nvme-bar0", bar_size);
     memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
-- 
2.25.1
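For readers who want to check the layout by hand, the following standalone sketch repeats the arithmetic of nvme_bar_size() outside of QEMU. The constants are assumptions made for the example only: 4096 bytes for the register block (standing in for sizeof(NvmeBar)), 4 bytes per doorbell, and 16 bytes per MSI-X table entry. All `mock_*`/`MOCK_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; QEMU takes them from its own headers. */
#define MOCK_NVME_BAR_REGS_SIZE 4096u  /* stands in for sizeof(NvmeBar) */
#define MOCK_NVME_DB_SIZE       4u     /* bytes per doorbell register */
#define MOCK_MSIX_ENTRY_SIZE    16u    /* bytes per MSI-X table entry */
#define MOCK_4KIB               4096u

/* Round n up to the next multiple of m. */
static uint64_t mock_align_up(uint64_t n, uint64_t m)
{
    assert(m != 0);
    return (n + m - 1) / m * m;
}

/* Smallest power of two that is >= n (returns 1 for n == 0). */
static uint64_t mock_pow2ceil(uint64_t n)
{
    uint64_t p = 1;

    assert(n <= (UINT64_C(1) << 63));  /* otherwise the loop cannot end */
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Same steps as the helper in the patch; offsets are optional outputs. */
static uint64_t mock_nvme_bar_size(unsigned total_queues, unsigned total_irqs,
                                   uint64_t *msix_table_offset,
                                   uint64_t *msix_pba_offset)
{
    uint64_t bar_size;

    /* registers plus one SQ and one CQ doorbell per queue pair */
    bar_size = MOCK_NVME_BAR_REGS_SIZE +
               2 * (uint64_t)total_queues * MOCK_NVME_DB_SIZE;
    bar_size = mock_align_up(bar_size, MOCK_4KIB);
    if (msix_table_offset) {
        *msix_table_offset = bar_size;
    }

    bar_size += (uint64_t)MOCK_MSIX_ENTRY_SIZE * total_irqs;
    bar_size = mock_align_up(bar_size, MOCK_4KIB);
    if (msix_pba_offset) {
        *msix_pba_offset = bar_size;
    }

    /* pending-bit array: one bit per vector, rounded to 64-bit words */
    bar_size += mock_align_up(total_irqs, 64) / 8;

    /* the power-of-two rounding happens once, as the very last step */
    return mock_pow2ceil(bar_size);
}
```

Under these assumptions, 65 queue pairs (64 I/O plus admin) and 65 vectors give an MSI-X table at offset 8192, a PBA at offset 12288, and a 16 KiB BAR.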
[PATCH v3 15/15] hw/nvme: Update the initialization place for the AER queue
From: Łukasz Gieryk

This patch updates the initialization place for the AER queue, so it’s
initialized once, at controller initialization, and not every time the
controller is enabled.

While the original version works for a non-SR-IOV device, as it’s hard
to interact with the controller if it’s not enabled, the multiple
reinitialization is not necessarily correct.

With the SR-IOV feature enabled a segfault can happen: a VF can have its
controller disabled, while a namespace can still be attached to the
controller through the parent PF. An event generated in such a case ends
up on an uninitialized queue.

While it’s an interesting question whether a VF should support AER in
the first place, I don’t think it must be answered today.

Signed-off-by: Łukasz Gieryk
---
 hw/nvme/ctrl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index e21c60fee8..23280f501f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6023,8 +6023,6 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 
     nvme_set_timestamp(n, 0ULL);
 
-    QTAILQ_INIT(&n->aer_queue);
-
     nvme_select_iocs(n);
 
     return 0;
@@ -7001,6 +6999,8 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
         id->cmic |= NVME_CMIC_MULTI_CTRL;
     }
 
+    QTAILQ_INIT(&n->aer_queue);
+
     NVME_CAP_SET_MQES(cap, 0x7ff);
     NVME_CAP_SET_CQR(cap, 1);
     NVME_CAP_SET_TO(cap, 0xf);
-- 
2.25.1
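The failure mode described above can be modelled with a hand-rolled tail queue. This is a simplified sketch, not QEMU's QTAILQ, and it assumes the enclosing controller state starts out zero-filled; `MockCtrl`, `MockEventQueue`, and the `mock_*` functions are hypothetical. The point is that a head which was never initialized has no valid tail pointer, so an append before the first "enable" has nowhere to store the element, whereas initializing once at controller setup makes appends safe regardless of the enabled state.

```c
#include <assert.h>
#include <stddef.h>

typedef struct MockEvent {
    int type;
    struct MockEvent *next;
} MockEvent;

/* Classic tail-queue head: 'last' is the address of the pointer to fill next. */
typedef struct {
    MockEvent *first;
    MockEvent **last;
} MockEventQueue;

typedef struct {
    int enabled;
    MockEventQueue aer_queue;
} MockCtrl;

static void mock_queue_init(MockEventQueue *q)
{
    q->first = NULL;
    q->last = &q->first;
}

static void mock_queue_insert_tail(MockEventQueue *q, MockEvent *e)
{
    /*
     * A zero-filled head that was never initialized has last == NULL;
     * without this check the store below would dereference NULL.
     */
    assert(q->last != NULL);
    e->next = NULL;
    *q->last = e;
    q->last = &e->next;
}

static size_t mock_queue_len(const MockEventQueue *q)
{
    size_t n = 0;

    for (const MockEvent *e = q->first; e != NULL; e = e->next) {
        n++;
    }
    return n;
}

/* One-time setup: the queue is valid from here on, enabled or not. */
static void mock_ctrl_init(MockCtrl *c)
{
    c->enabled = 0;
    mock_queue_init(&c->aer_queue);
}

/* Enabling no longer touches the queue, so queued events survive it. */
static void mock_ctrl_start(MockCtrl *c)
{
    c->enabled = 1;
}
```

A side effect of the one-time initialization in this model is that events queued while the controller is disabled are still present after it is enabled.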
[PATCH v3 09/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
From: Łukasz Gieryk The NVMe device defines two properties: max_ioqpairs, msix_qsize. Having them as constants is problematic for SR-IOV support. SR-IOV introduces virtual resources (queues, interrupts) that can be assigned to PF and its dependent VFs. Each device, following a reset, should work with the configured number of queues. A single constant is no longer sufficient to hold the whole state. This patch tries to solve the problem by introducing additional variables in NvmeCtrl’s state. The variables for, e.g., managing queues are therefore organized as: - n->params.max_ioqpairs – no changes, constant set by the user - n->(mutable_state) – (not a part of this patch) user-configurable, specifies number of queues available _after_ reset - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’ n->params.max_ioqpairs; initialized in realize() and updated during reset() to reflect user’s changes to the mutable state Since the number of available i/o queues and interrupts can change in runtime, buffers for sq/cqs and the MSIX-related structures are allocated big enough to handle the limits, to completely avoid the complicated reallocation. A helper function (nvme_update_msixcap_ts) updates the corresponding capability register, to signal configuration changes. Signed-off-by: Łukasz Gieryk --- hw/nvme/ctrl.c | 52 ++ hw/nvme/nvme.h | 2 ++ 2 files changed, 38 insertions(+), 16 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 9e83b4dd76..de463450b6 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -416,12 +416,12 @@ static bool nvme_nsid_valid(NvmeCtrl *n, uint32_t nsid) static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid) { -return sqid < n->params.max_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1; +return sqid < n->conf_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1; } static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid) { -return cqid < n->params.max_ioqpairs + 1 && n->cq[cqid] != NULL ? 
0 : -1; +return cqid < n->conf_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1; } static void nvme_inc_cq_tail(NvmeCQueue *cq) @@ -4034,8 +4034,7 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req) trace_pci_nvme_err_invalid_create_sq_cqid(cqid); return NVME_INVALID_CQID | NVME_DNR; } -if (unlikely(!sqid || sqid > n->params.max_ioqpairs || -n->sq[sqid] != NULL)) { +if (unlikely(!sqid || sqid > n->conf_ioqpairs || n->sq[sqid] != NULL)) { trace_pci_nvme_err_invalid_create_sq_sqid(sqid); return NVME_INVALID_QID | NVME_DNR; } @@ -4387,8 +4386,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req) trace_pci_nvme_create_cq(prp1, cqid, vector, qsize, qflags, NVME_CQ_FLAGS_IEN(qflags) != 0); -if (unlikely(!cqid || cqid > n->params.max_ioqpairs || -n->cq[cqid] != NULL)) { +if (unlikely(!cqid || cqid > n->conf_ioqpairs || n->cq[cqid] != NULL)) { trace_pci_nvme_err_invalid_create_cq_cqid(cqid); return NVME_INVALID_QID | NVME_DNR; } @@ -4404,7 +4402,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req) trace_pci_nvme_err_invalid_create_cq_vector(vector); return NVME_INVALID_IRQ_VECTOR | NVME_DNR; } -if (unlikely(vector >= n->params.msix_qsize)) { +if (unlikely(vector >= n->conf_msix_qsize)) { trace_pci_nvme_err_invalid_create_cq_vector(vector); return NVME_INVALID_IRQ_VECTOR | NVME_DNR; } @@ -5000,13 +4998,12 @@ defaults: break; case NVME_NUMBER_OF_QUEUES: -result = (n->params.max_ioqpairs - 1) | -((n->params.max_ioqpairs - 1) << 16); +result = (n->conf_ioqpairs - 1) | ((n->conf_ioqpairs - 1) << 16); trace_pci_nvme_getfeat_numq(result); break; case NVME_INTERRUPT_VECTOR_CONF: iv = dw11 & 0x; -if (iv >= n->params.max_ioqpairs + 1) { +if (iv >= n->conf_ioqpairs + 1) { return NVME_INVALID_FIELD | NVME_DNR; } @@ -5161,10 +5158,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req) trace_pci_nvme_setfeat_numq((dw11 & 0x) + 1, ((dw11 >> 16) & 0x) + 1, -n->params.max_ioqpairs, -n->params.max_ioqpairs); -req->cqe.result = 
cpu_to_le32((n->params.max_ioqpairs - 1) | - ((n->params.max_ioqpairs - 1) << 16)); +n->conf_ioqpairs, +n->conf_ioqpairs); +req->cqe.result = cpu_to_le32((n->conf_ioqpairs - 1) | + ((n->conf_ioqpairs - 1) << 16)); break; case NVME_ASYNC
[PATCH v3 07/15] hw/nvme: Add support for Secondary Controller List
Introduce handling for Secondary Controller List (Identify command with CNS value of 15h). Secondary controller ids are unique in the subsystem, hence they are reserved by it upon initialization of the primary controller to the number of sriov_max_vfs. ID reservation requires the addition of an intermediate controller slot state, so the reserved controller has the address 0x. A secondary controller is in the reserved state when it has no virtual function assigned, but its primary controller is realized. Secondary controller reservations are released to NULL when its primary controller is unregistered. Signed-off-by: Lukasz Maniak --- hw/nvme/ctrl.c | 35 + hw/nvme/ns.c | 2 +- hw/nvme/nvme.h | 18 +++ hw/nvme/subsys.c | 75 ++-- hw/nvme/trace-events | 1 + include/block/nvme.h | 20 6 files changed, 141 insertions(+), 10 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 651e1f2fa2..eaca12df57 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -4550,6 +4550,29 @@ static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, NvmeRequest *req) return nvme_c2h(n, (uint8_t *)&n->pri_ctrl_cap, sizeof(NvmePriCtrlCap), req); } +static uint16_t nvme_identify_sec_ctrl_list(NvmeCtrl *n, NvmeRequest *req) +{ +NvmeIdentify *c = (NvmeIdentify *)&req->cmd; +uint16_t pri_ctrl_id = le16_to_cpu(n->pri_ctrl_cap.cntlid); +uint16_t min_id = le16_to_cpu(c->ctrlid); +uint8_t num_sec_ctrl = n->sec_ctrl_list.numcntl; +NvmeSecCtrlList list = {0}; +uint8_t i; + +for (i = 0; i < num_sec_ctrl; i++) { +if (n->sec_ctrl_list.sec[i].scid >= min_id) { +list.numcntl = num_sec_ctrl - i; +memcpy(&list.sec, n->sec_ctrl_list.sec + i, + list.numcntl * sizeof(NvmeSecCtrlEntry)); +break; +} +} + +trace_pci_nvme_identify_sec_ctrl_list(pri_ctrl_id, list.numcntl); + +return nvme_c2h(n, (uint8_t *)&list, sizeof(list), req); +} + static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req, bool active) { @@ -4770,6 +4793,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req) return 
nvme_identify_ctrl_list(n, req, false); case NVME_ID_CNS_PRIMARY_CTRL_CAP: return nvme_identify_pri_ctrl_cap(n, req); +case NVME_ID_CNS_SECONDARY_CTRL_LIST: +return nvme_identify_sec_ctrl_list(n, req); case NVME_ID_CNS_CS_NS: return nvme_identify_ns_csi(n, req, true); case NVME_ID_CNS_CS_NS_PRESENT: @@ -6321,6 +6346,9 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) static void nvme_init_state(NvmeCtrl *n) { NvmePriCtrlCap *cap = &n->pri_ctrl_cap; +NvmeSecCtrlList *list = &n->sec_ctrl_list; +NvmeSecCtrlEntry *sctrl; +int i; /* add one to max_ioqpairs to account for the admin queue pair */ n->reg_size = pow2ceil(sizeof(NvmeBar) + @@ -6332,6 +6360,13 @@ static void nvme_init_state(NvmeCtrl *n) n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL); n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1); +list->numcntl = cpu_to_le16(n->params.sriov_max_vfs); +for (i = 0; i < n->params.sriov_max_vfs; i++) { +sctrl = &list->sec[i]; +sctrl->pcid = cpu_to_le16(n->cntlid); +sctrl->vfn = cpu_to_le16(i + 1); +} + cap->cntlid = cpu_to_le16(n->cntlid); } diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c index 8b5f98c761..e7a54ac572 100644 --- a/hw/nvme/ns.c +++ b/hw/nvme/ns.c @@ -511,7 +511,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp) for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) { NvmeCtrl *ctrl = subsys->ctrls[i]; -if (ctrl) { +if (ctrl && ctrl != SUBSYS_SLOT_RSVD) { nvme_attach_ns(ctrl, ns); } } diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h index 81deb45dfb..2157a7b95f 100644 --- a/hw/nvme/nvme.h +++ b/hw/nvme/nvme.h @@ -43,6 +43,7 @@ typedef struct NvmeBus { #define TYPE_NVME_SUBSYS "nvme-subsys" #define NVME_SUBSYS(obj) \ OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS) +#define SUBSYS_SLOT_RSVD (void *)0x typedef struct NvmeSubsystem { DeviceState parent_obj; @@ -67,6 +68,10 @@ static inline NvmeCtrl *nvme_subsys_ctrl(NvmeSubsystem *subsys, return NULL; } +if (subsys->ctrls[cntlid] == SUBSYS_SLOT_RSVD) { +return NULL; +} + return 
subsys->ctrls[cntlid]; } @@ -463,6 +468,7 @@ typedef struct NvmeCtrl { } features; NvmePriCtrlCap pri_ctrl_cap; +NvmeSecCtrlList sec_ctrl_list; } NvmeCtrl; static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid) @@ -497,6 +503,18 @@ static inline uint16_t nvme_cid(NvmeRequest *req) return le16_to_cpu(req->cqe.cid); } +static inline NvmeSecCtrlEntry *nvme_sctrl(NvmeCtrl *n) +
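The three slot states described in the commit message (free, reserved, realized) can be modelled in isolation. The following is only a sketch under stated assumptions: the Subsys and Ctrl types, the table size and the helper names are hypothetical, and the "reserved" marker here is the address of a private dummy object rather than the fixed pointer value the patch defines as SUBSYS_SLOT_RSVD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CTRLS 8 /* illustrative table size, not QEMU's */

typedef struct Ctrl {
    uint16_t cntlid;
} Ctrl;

/*
 * Hypothetical "reserved" marker: the address of a private dummy object.
 * The patch uses a fixed magic pointer value (SUBSYS_SLOT_RSVD) instead.
 */
static Ctrl slot_rsvd_marker;
#define SLOT_RSVD (&slot_rsvd_marker)

typedef struct Subsys {
    Ctrl *ctrls[MAX_CTRLS]; /* NULL = free, SLOT_RSVD = reserved, else live */
} Subsys;

/* Reserve `count` free slots; returns 0 on success, -1 if too few are free. */
static int subsys_reserve(Subsys *s, int count)
{
    int i, free_slots = 0;

    for (i = 0; i < MAX_CTRLS; i++) {
        free_slots += (s->ctrls[i] == NULL);
    }
    if (free_slots < count) {
        return -1;
    }
    for (i = 0; i < MAX_CTRLS && count > 0; i++) {
        if (s->ctrls[i] == NULL) {
            s->ctrls[i] = SLOT_RSVD;
            count--;
        }
    }
    return 0;
}

/* Release every reservation back to the free (NULL) state. */
static void subsys_release_reserved(Subsys *s)
{
    int i;

    for (i = 0; i < MAX_CTRLS; i++) {
        if (s->ctrls[i] == SLOT_RSVD) {
            s->ctrls[i] = NULL;
        }
    }
}

/* Lookup that hides reserved slots from callers, as nvme_subsys_ctrl() does. */
static Ctrl *subsys_ctrl(const Subsys *s, int cntlid)
{
    if (cntlid < 0 || cntlid >= MAX_CTRLS || s->ctrls[cntlid] == SLOT_RSVD) {
        return NULL;
    }
    return s->ctrls[cntlid];
}
```

Hiding the marker inside the lookup helper is the same design choice the patch makes: callers that only test for NULL keep working, and only code that walks the raw table (such as the namespace attach loop in ns.c) has to compare against the marker explicitly.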
[PATCH v3 04/15] pcie: Add 1.2 version token for the Power Management Capability
From: Łukasz Gieryk

Signed-off-by: Łukasz Gieryk
---
 include/hw/pci/pci_regs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
index 77ba64b931..a590140962 100644
--- a/include/hw/pci/pci_regs.h
+++ b/include/hw/pci/pci_regs.h
@@ -4,5 +4,6 @@
 #include "standard-headers/linux/pci_regs.h"
 
 #define PCI_PM_CAP_VER_1_1 0x0002 /* PCI PM spec ver. 1.1 */
+#define PCI_PM_CAP_VER_1_2 0x0003 /* PCI PM spec ver. 1.2 */
 
 #endif
-- 
2.25.1
[PATCH v3 08/15] hw/nvme: Implement the Function Level Reset
From: Łukasz Gieryk

This patch implements the Function Level Reset, a feature that is
currently not implemented for the NVMe device, although the 1.4 spec
lists it as mandatory ("shall").

The implementation reuses FLR-related building blocks defined for the
pci-bridge module, and follows the same logic:
    - FLR capability is advertised in the PCIE config,
    - custom pci_write_config callback detects a write to the trigger
      register and performs the PCI reset,
    - which, eventually, calls the custom dc->reset handler.

Depending on reset type, parts of the state should (or should not) be
cleared. To distinguish the type of reset, an additional parameter is
passed to the reset function.

This patch also enables advertisement of the Power Management PCI
capability. The main reason behind it is to announce the no_soft_reset=1
bit, to signal SR-IOV support where each VF can be reset individually.

The implementation purposely ignores writes to the PMCS.PS register,
as even such naïve behavior is enough to correctly handle the D3->D0
transition.

It is worth noting that the power state transition back to D3, with
all the corresponding side effects, wasn't and still isn't handled
properly.

Signed-off-by: Łukasz Gieryk Reviewed-by: Klaus Jensen --- hw/nvme/ctrl.c | 52 hw/nvme/nvme.h | 5 + hw/nvme/trace-events | 1 + 3 files changed, 54 insertions(+), 4 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index eaca12df57..9e83b4dd76 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -5602,7 +5602,7 @@ static void nvme_process_sq(void *opaque) } } -static void nvme_ctrl_reset(NvmeCtrl *n) +static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst) { NvmeNamespace *ns; int i; @@ -5634,7 +5634,9 @@ static void nvme_ctrl_reset(NvmeCtrl *n) } if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) { -pcie_sriov_pf_disable_vfs(&n->parent_obj); +if (rst != NVME_RESET_CONTROLLER) { +pcie_sriov_pf_disable_vfs(&n->parent_obj); +} } n->aer_queued = 0; @@ -5868,7 +5870,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data, } } else if (!NVME_CC_EN(data) && NVME_CC_EN(cc)) { trace_pci_nvme_mmio_stopped(); -nvme_ctrl_reset(n); +nvme_ctrl_reset(n, NVME_RESET_CONTROLLER); cc = 0; csts &= ~NVME_CSTS_READY; } @@ -6426,6 +6428,28 @@ static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset, PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size); } +static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset) +{ +Error *err = NULL; +int ret; + +ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, offset, + PCI_PM_SIZEOF, &err); +if (err) { +error_report_err(err); +return ret; +} + +pci_set_word(pci_dev->config + offset + PCI_PM_PMC, + PCI_PM_CAP_VER_1_2); +pci_set_word(pci_dev->config + offset + PCI_PM_CTRL, + PCI_PM_CTRL_NO_SOFT_RESET); +pci_set_word(pci_dev->wmask + offset + PCI_PM_CTRL, + PCI_PM_CTRL_STATE_MASK); + +return 0; +} + static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) { uint8_t *pci_conf = pci_dev->config; @@ -6447,7 +6471,9 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) } pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS); +nvme_add_pm_capability(pci_dev, 0x60); 
pcie_endpoint_cap_init(pci_dev, 0x80); +pcie_cap_flr_init(pci_dev); if (n->params.sriov_max_vfs) { pcie_ari_init(pci_dev, 0x100, 1); } @@ -6696,7 +6722,7 @@ static void nvme_exit(PCIDevice *pci_dev) NvmeNamespace *ns; int i; -nvme_ctrl_reset(n); +nvme_ctrl_reset(n, NVME_RESET_FUNCTION); if (n->subsys) { for (i = 1; i <= NVME_MAX_NAMESPACES; i++) { @@ -6795,6 +6821,22 @@ static void nvme_set_smart_warning(Object *obj, Visitor *v, const char *name, } } +static void nvme_pci_reset(DeviceState *qdev) +{ +PCIDevice *pci_dev = PCI_DEVICE(qdev); +NvmeCtrl *n = NVME(pci_dev); + +trace_pci_nvme_pci_reset(); +nvme_ctrl_reset(n, NVME_RESET_FUNCTION); +} + +static void nvme_pci_write_config(PCIDevice *dev, uint32_t address, + uint32_t val, int len) +{ +pci_default_write_config(dev, address, val, len); +pcie_cap_flr_write_config(dev, address, val, len); +} + static const VMStateDescription nvme_vmstate = { .name = "nvme", .unmigratable = 1, @@ -6806,6 +6848,7 @@ static void nvme_class_init(ObjectClass *oc, void *data) PCIDeviceClass *pc = PCI_DEVICE_CLASS(oc); pc->realize = nvme_realize; +pc->config_write = nvme_pci_write_config; pc->exit = nvme_exit; pc->class_id = PCI_CLASS_STORAG
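The effect of the extra reset-type parameter can be shown on a toy model. This is a minimal sketch, assuming a made-up Model struct and enum names; it only mirrors the one rule visible in the hunk above, namely that a Controller Reset leaves the enabled VFs alone while any other reset of a PF tears them down.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for NVME_RESET_FUNCTION / NVME_RESET_CONTROLLER. */
typedef enum {
    RESET_FUNCTION,
    RESET_CONTROLLER,
} ResetType;

/* Toy controller state; the field names are illustrative only. */
typedef struct Model {
    bool is_vf;
    unsigned max_vfs;
    unsigned enabled_vfs;
    unsigned queued_events;
} Model;

static void model_reset(Model *m, ResetType rst)
{
    /* State that every kind of reset clears. */
    m->queued_events = 0;

    /* Only a PF with SR-IOV configured has VFs to disable. */
    if (!m->is_vf && m->max_vfs) {
        /* A Controller Reset (CC.EN 1 -> 0) must not tear down the VFs. */
        if (rst != RESET_CONTROLLER) {
            m->enabled_vfs = 0;
        }
    }
}
```

Passing the type down, instead of adding a second reset function, keeps the shared cleanup in one place and makes each exception an explicit branch.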
[PATCH v3 06/15] hw/nvme: Add support for Primary Controller Capabilities
Implementation of Primary Controller Capabilities data structure (Identify command with CNS value of 14h). Currently, the command returns only ID of a primary controller. Handling of remaining fields are added in subsequent patches implementing virtualization enhancements. Signed-off-by: Lukasz Maniak Reviewed-by: Klaus Jensen --- hw/nvme/ctrl.c | 22 +- hw/nvme/nvme.h | 2 ++ hw/nvme/trace-events | 1 + include/block/nvme.h | 23 +++ 4 files changed, 43 insertions(+), 5 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 159635c1af..651e1f2fa2 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -4543,6 +4543,13 @@ static uint16_t nvme_identify_ctrl_list(NvmeCtrl *n, NvmeRequest *req, return nvme_c2h(n, (uint8_t *)list, sizeof(list), req); } +static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, NvmeRequest *req) +{ +trace_pci_nvme_identify_pri_ctrl_cap(le16_to_cpu(n->pri_ctrl_cap.cntlid)); + +return nvme_c2h(n, (uint8_t *)&n->pri_ctrl_cap, sizeof(NvmePriCtrlCap), req); +} + static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req, bool active) { @@ -4761,6 +4768,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req) return nvme_identify_ctrl_list(n, req, true); case NVME_ID_CNS_CTRL_LIST: return nvme_identify_ctrl_list(n, req, false); +case NVME_ID_CNS_PRIMARY_CTRL_CAP: +return nvme_identify_pri_ctrl_cap(n, req); case NVME_ID_CNS_CS_NS: return nvme_identify_ns_csi(n, req, true); case NVME_ID_CNS_CS_NS_PRESENT: @@ -6311,6 +6320,8 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) static void nvme_init_state(NvmeCtrl *n) { +NvmePriCtrlCap *cap = &n->pri_ctrl_cap; + /* add one to max_ioqpairs to account for the admin queue pair */ n->reg_size = pow2ceil(sizeof(NvmeBar) + 2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE); @@ -6320,6 +6331,8 @@ static void nvme_init_state(NvmeCtrl *n) n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING; n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL); n->aer_reqs = 
g_new0(NvmeRequest *, n->params.aerl + 1); + +cap->cntlid = cpu_to_le16(n->cntlid); } static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev) @@ -6619,15 +6632,14 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp) qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS, &pci_dev->qdev, n->parent_obj.qdev.id); -nvme_init_state(n); -if (nvme_init_pci(n, pci_dev, errp)) { -return; -} - if (nvme_init_subsys(n, errp)) { error_propagate(errp, local_err); return; } +nvme_init_state(n); +if (nvme_init_pci(n, pci_dev, errp)) { +return; +} nvme_init_ctrl(n, pci_dev); /* setup a namespace if the controller drive property was given */ diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h index 4c8af34b28..81deb45dfb 100644 --- a/hw/nvme/nvme.h +++ b/hw/nvme/nvme.h @@ -461,6 +461,8 @@ typedef struct NvmeCtrl { }; uint32_tasync_config; } features; + +NvmePriCtrlCap pri_ctrl_cap; } NvmeCtrl; static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid) diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events index ff6cafd520..1014ebceb6 100644 --- a/hw/nvme/trace-events +++ b/hw/nvme/trace-events @@ -52,6 +52,7 @@ pci_nvme_identify_ctrl(void) "identify controller" pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8"" pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32"" pci_nvme_identify_ctrl_list(uint8_t cns, uint16_t cntid) "cns 0x%"PRIx8" cntid %"PRIu16"" +pci_nvme_identify_pri_ctrl_cap(uint16_t cntlid) "identify primary controller capabilities cntlid=%"PRIu16"" pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", csi=0x%"PRIx8"" pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32"" pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "nsid=%"PRIu16", csi=0x%"PRIx8"" diff --git a/include/block/nvme.h b/include/block/nvme.h index e3bd47bf76..f69bd1d14f 100644 --- a/include/block/nvme.h +++ b/include/block/nvme.h @@ -1017,6 +1017,7 @@ enum NvmeIdCns { NVME_ID_CNS_NS_PRESENT= 0x11, NVME_ID_CNS_NS_ATTACHED_CTRL_LIST = 0x12, 
NVME_ID_CNS_CTRL_LIST = 0x13, +NVME_ID_CNS_PRIMARY_CTRL_CAP = 0x14, NVME_ID_CNS_CS_NS_PRESENT_LIST= 0x1a, NVME_ID_CNS_CS_NS_PRESENT = 0x1b, NVME_ID_CNS_IO_COMMAND_SET= 0x1c, @@ -1465,6 +1466,27 @@ typedef enum NvmeZoneState { NVME_ZONE_STATE_OFFLINE = 0x0f, } NvmeZoneState; +typedef struct QEMU_PACKED NvmePriCtrlCap { +uint16_tcntlid; +uint16_tportid; +uint8_t crt; +uint8_t rsvd5[27]; +uint32_tvqfrt; +uint32_tvqrfa; +uint16_tvqrfap; +uint16_tvqprt; +uint16_tvqfr
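The leading fields of the new structure, as far as they are visible in the hunk above, can be checked for their byte offsets, and the little-endian store that cpu_to_le16() performs can be written out by hand. This is a sketch only: PriCtrlCapHead is a hypothetical name covering just the visible fields (the rest of the structure is cut off in this message), and put_le16()/get_le16() are illustrative helpers, not QEMU functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefix of the structure: only the fields shown in the hunk. */
typedef struct PriCtrlCapHead {
    uint16_t cntlid;
    uint16_t portid;
    uint8_t  crt;
    uint8_t  rsvd5[27];
    uint32_t vqfrt;
    uint32_t vqrfa;
    uint16_t vqrfap;
    uint16_t vqprt;
} PriCtrlCapHead;

/* The reserved array pads bytes 5..31, so vqfrt lands on byte 32. */
_Static_assert(offsetof(PriCtrlCapHead, crt) == 4, "crt offset");
_Static_assert(offsetof(PriCtrlCapHead, rsvd5) == 5, "rsvd5 offset");
_Static_assert(offsetof(PriCtrlCapHead, vqfrt) == 32, "vqfrt offset");
_Static_assert(offsetof(PriCtrlCapHead, vqprt) == 42, "vqprt offset");

/* Store a 16-bit value least significant byte first, on any host. */
static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Read back a 16-bit little-endian value. */
static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}
```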
[PATCH v3 05/15] hw/nvme: Add support for SR-IOV
This patch implements initial support for Single Root I/O Virtualization on an NVMe device. Essentially, it allows to define the maximum number of virtual functions supported by the NVMe controller via sriov_max_vfs parameter. Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV capability by a physical controller and ARI capability by both the physical and virtual function devices. NVMe controllers created via virtual functions mirror functionally the physical controller, which may not entirely be the case, thus consideration would be needed on the way to limit the capabilities of the VF. NVMe subsystem is required for the use of SR-IOV. Signed-off-by: Lukasz Maniak --- hw/nvme/ctrl.c | 84 ++-- hw/nvme/nvme.h | 3 +- include/hw/pci/pci_ids.h | 1 + 3 files changed, 84 insertions(+), 4 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 5f573c417b..159635c1af 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -35,6 +35,7 @@ * mdts=,vsl=, \ * zoned.zasl=, \ * zoned.auto_transition=, \ + * sriov_max_vfs= \ * subsys= * -device nvme-ns,drive=,bus=,nsid=,\ * zoned=, \ @@ -106,6 +107,12 @@ * transitioned to zone state closed for resource management purposes. * Defaults to 'on'. * + * - `sriov_max_vfs` + * Indicates the maximum number of PCIe virtual functions supported + * by the controller. The default value is 0. Specifying a non-zero value + * enables reporting of both SR-IOV and ARI capabilities by the NVMe device. + * Virtual function controllers will not report SR-IOV capability. 
+ * * nvme namespace device parameters * * - `shared` @@ -160,6 +167,7 @@ #include "sysemu/block-backend.h" #include "sysemu/hostmem.h" #include "hw/pci/msix.h" +#include "hw/pci/pcie_sriov.h" #include "migration/vmstate.h" #include "nvme.h" @@ -175,6 +183,9 @@ #define NVME_TEMPERATURE_CRITICAL 0x175 #define NVME_NUM_FW_SLOTS 1 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB) +#define NVME_MAX_VFS 127 +#define NVME_VF_OFFSET 0x1 +#define NVME_VF_STRIDE 1 #define NVME_GUEST_ERR(trace, fmt, ...) \ do { \ @@ -5588,6 +5599,10 @@ static void nvme_ctrl_reset(NvmeCtrl *n) g_free(event); } +if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) { +pcie_sriov_pf_disable_vfs(&n->parent_obj); +} + n->aer_queued = 0; n->outstanding_aers = 0; n->qs_created = false; @@ -6269,6 +6284,29 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) error_setg(errp, "vsl must be non-zero"); return; } + +if (params->sriov_max_vfs) { +if (!n->subsys) { +error_setg(errp, "subsystem is required for the use of SR-IOV"); +return; +} + +if (params->sriov_max_vfs > NVME_MAX_VFS) { +error_setg(errp, "sriov_max_vfs must be between 0 and %d", + NVME_MAX_VFS); +return; +} + +if (params->cmb_size_mb) { +error_setg(errp, "CMB is not supported with SR-IOV"); +return; +} + +if (n->pmr.dev) { +error_setg(errp, "PMR is not supported with SR-IOV"); +return; +} +} } static void nvme_init_state(NvmeCtrl *n) @@ -6326,6 +6364,20 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev) memory_region_set_enabled(&n->pmr.dev->mr, false); } +static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset, +uint64_t bar_size) +{ +uint16_t vf_dev_id = n->params.use_intel_id ? 
+ PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME; + +pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id, + n->params.sriov_max_vfs, n->params.sriov_max_vfs, + NVME_VF_OFFSET, NVME_VF_STRIDE); + +pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY | + PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size); +} + static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) { uint8_t *pci_conf = pci_dev->config; @@ -6340,7 +6392,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) if (n->params.use_intel_id) { pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL); -pci_config_set_device_id(pci_conf, 0x5845); +pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_INTEL_NVME); } else { pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_REDHAT); pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_REDHAT_NVME); @@ -6348,6 +6400,9 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS); pcie_endpoint_cap_init(pci_dev, 0x80); +if (n->par
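For orientation, the First VF Offset and VF Stride values passed to pcie_sriov_pf_init() determine the routing ID of each VF. The sketch below uses the rule from the SR-IOV specification (PF routing ID plus offset plus (n - 1) times stride, with VFs numbered from 1) and the constants from this patch; the function name vf_routing_id() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the patch's #defines. */
#define VF_OFFSET 0x1
#define VF_STRIDE 1
#define MAX_VFS   127

/*
 * Routing ID of VF number `vf_number` (1-based) behind the PF at `pf_rid`.
 * Hypothetical helper, for illustration only.
 */
static uint16_t vf_routing_id(uint16_t pf_rid, unsigned vf_number)
{
    return (uint16_t)(pf_rid + VF_OFFSET + (vf_number - 1) * VF_STRIDE);
}
```

With offset 1 and stride 1, the VFs take the routing IDs directly after the PF, so the last of 127 VFs sits 127 above the PF. That range goes well past function number 7, which is presumably why the ARI capability is reported together with SR-IOV.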
[PATCH v3 02/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
From: Knut Omang Add a small intro + minimal documentation for how to implement SR/IOV support for an emulated device. Signed-off-by: Knut Omang --- docs/pcie_sriov.txt | 115 1 file changed, 115 insertions(+) create mode 100644 docs/pcie_sriov.txt diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt new file mode 100644 index 00..f5e891e1d4 --- /dev/null +++ b/docs/pcie_sriov.txt @@ -0,0 +1,115 @@ +PCI SR/IOV EMULATION SUPPORT + + +Description +=== +SR/IOV (Single Root I/O Virtualization) is an optional extended capability +of a PCI Express device. It allows a single physical function (PF) to appear as multiple +virtual functions (VFs) for the main purpose of eliminating software +overhead in I/O from virtual machines. + +Qemu now implements the basic common functionality to enable an emulated device +to support SR/IOV. Yet no fully implemented devices exists in Qemu, but a +proof-of-concept hack of the Intel igb can be found here: + +git://github.com/knuto/qemu.git sriov_patches_v5 + +Implementation +== +Implementing emulation of an SR/IOV capable device typically consists of +implementing support for two types of device classes; the "normal" physical device +(PF) and the virtual device (VF). From Qemu's perspective, the VFs are just +like other devices, except that some of their properties are derived from +the PF. + +A virtual function is different from a physical function in that the BAR +space for all VFs are defined by the BAR registers in the PFs SR/IOV +capability. All VFs have the same BARs and BAR sizes. + +Accesses to these virtual BARs then is computed as + ++ * + + +From our emulation perspective this means that there is a separate call for +setting up a BAR for a VF. + +1) To enable SR/IOV support in the PF, it must be a PCI Express device so + you would need to add a PCI Express capability in the normal PCI + capability list. 
You might also want to add an ARI (Alternative + Routing-ID Interpretation) capability to indicate that your device + supports functions beyond it's "own" function space (0-7), + which is necessary to support more than 7 functions, or + if functions extends beyond offset 7 because they are placed at an + offset > 1 or have stride > 1. + + ... + #include "hw/pci/pcie.h" + #include "hw/pci/pcie_sriov.h" + + pci_your_pf_dev_realize( ... ) + { + ... + int ret = pcie_endpoint_cap_init(d, 0x70); + ... + pcie_ari_init(d, 0x100, 1); + ... + + /* Add and initialize the SR/IOV capability */ + pcie_sriov_pf_init(d, 0x200, "your_virtual_dev", + vf_devid, initial_vfs, total_vfs, + fun_offset, stride); + + /* Set up individual VF BARs (parameters as for normal BARs) */ + pcie_sriov_pf_init_vf_bar( ... ) + ... + } + + For cleanup, you simply call: + + pcie_sriov_pf_exit(device); + + which will delete all the virtual functions and associated resources. + +2) Similarly in the implementation of the virtual function, you need to + make it a PCI Express device and add a similar set of capabilities + except for the SR/IOV capability. Then you need to set up the VF BARs as + subregions of the PFs SR/IOV VF BARs by calling + pcie_sriov_vf_register_bar() instead of the normal pci_register_bar() call: + + pci_your_vf_dev_realize( ... ) + { + ... + int ret = pcie_endpoint_cap_init(d, 0x60); + ... + pcie_ari_init(d, 0x100, 1); + ... + memory_region_init(mr, ... ) + pcie_sriov_vf_register_bar(d, bar_nr, mr); + ... + } + +Testing on Linux guest +== +The easiest is if your device driver supports sysfs based SR/IOV +enabling. Support for this was added in kernel v.3.8, so not all drivers +support it yet. + +To enable 4 VFs for a device at 01:00.0: + + modprobe yourdriver + echo 4 > /sys/bus/pci/devices/:01:00.0/sriov_numvfs + +You should now see 4 VFs with lspci. 
+To turn SR/IOV off again - the standard requires you to turn it off before you can enable +another VF count, and the emulation enforces this: + + echo 0 > /sys/bus/pci/devices/:01:00.0/sriov_numvfs + +Older drivers typically provide a max_vfs module parameter +to enable it at load time: + + modprobe yourdriver max_vfs=4 + +To disable the VFs again then, you simply have to unload the driver: + + rmmod yourdriver -- 2.25.1
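The virtual BAR address computation mentioned in the Implementation section has lost its operands in this copy of the message. A minimal sketch, assuming it reads "VF BAR start + VF number * BAR size + offset" and assuming a zero-based VF index (both are assumptions, as is the helper name vf_bar_addr()):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address of `offset` inside the BAR of the VF with zero-based index
 * `vf_index`, when all VF BARs are laid out back to back starting at
 * `vf_bar_start`. Hypothetical helper, for illustration only.
 */
static uint64_t vf_bar_addr(uint64_t vf_bar_start, unsigned vf_index,
                            uint64_t bar_size, uint64_t offset)
{
    assert(offset < bar_size); /* must stay inside this VF's slice */
    return vf_bar_start + (uint64_t)vf_index * bar_size + offset;
}
```

Because every VF has the same BAR size, one base register in the PF's SR/IOV capability is enough to place all of them.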
[PATCH v3 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
From: Knut Omang This patch provides the building blocks for creating an SR/IOV PCIe Extended Capability header and register/unregister SR/IOV Virtual Functions. Signed-off-by: Knut Omang --- hw/pci/meson.build | 1 + hw/pci/pci.c| 97 +--- hw/pci/pcie.c | 5 + hw/pci/pcie_sriov.c | 287 hw/pci/trace-events | 5 + include/hw/pci/pci.h| 12 +- include/hw/pci/pcie.h | 6 + include/hw/pci/pcie_sriov.h | 67 + include/qemu/typedefs.h | 2 + 9 files changed, 456 insertions(+), 26 deletions(-) create mode 100644 hw/pci/pcie_sriov.c create mode 100644 include/hw/pci/pcie_sriov.h diff --git a/hw/pci/meson.build b/hw/pci/meson.build index 5c4bbac817..bcc9c75919 100644 --- a/hw/pci/meson.build +++ b/hw/pci/meson.build @@ -5,6 +5,7 @@ pci_ss.add(files( 'pci.c', 'pci_bridge.c', 'pci_host.c', + 'pcie_sriov.c', 'shpc.c', 'slotid_cap.c' )) diff --git a/hw/pci/pci.c b/hw/pci/pci.c index e5993c1ef5..1892a7e74c 100644 --- a/hw/pci/pci.c +++ b/hw/pci/pci.c @@ -239,6 +239,9 @@ int pci_bar(PCIDevice *d, int reg) { uint8_t type; +/* PCIe virtual functions do not have their own BARs */ +assert(!pci_is_vf(d)); + if (reg != PCI_ROM_SLOT) return PCI_BASE_ADDRESS_0 + reg * 4; @@ -304,10 +307,30 @@ void pci_device_deassert_intx(PCIDevice *dev) } } -static void pci_do_device_reset(PCIDevice *dev) +static void pci_reset_regions(PCIDevice *dev) { int r; +if (pci_is_vf(dev)) { +return; +} + +for (r = 0; r < PCI_NUM_REGIONS; ++r) { +PCIIORegion *region = &dev->io_regions[r]; +if (!region->size) { +continue; +} + +if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) && +region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) { +pci_set_quad(dev->config + pci_bar(dev, r), region->type); +} else { +pci_set_long(dev->config + pci_bar(dev, r), region->type); +} +} +} +static void pci_do_device_reset(PCIDevice *dev) +{ pci_device_deassert_intx(dev); assert(dev->irq_state == 0); @@ -323,19 +346,7 @@ static void pci_do_device_reset(PCIDevice *dev) pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) | pci_get_word(dev->w1cmask + 
PCI_INTERRUPT_LINE)); dev->config[PCI_CACHE_LINE_SIZE] = 0x0; -for (r = 0; r < PCI_NUM_REGIONS; ++r) { -PCIIORegion *region = &dev->io_regions[r]; -if (!region->size) { -continue; -} - -if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) && -region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) { -pci_set_quad(dev->config + pci_bar(dev, r), region->type); -} else { -pci_set_long(dev->config + pci_bar(dev, r), region->type); -} -} +pci_reset_regions(dev); pci_update_mappings(dev); msi_reset(dev); @@ -884,6 +895,15 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp) dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION; } +/* With SR/IOV and ARI, a device at function 0 need not be a multifunction + * device, as it may just be a VF that ended up with function 0 in + * the legacy PCI interpretation. Avoid failing in such cases: + */ +if (pci_is_vf(dev) && +dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) { +return; +} + /* * multifunction bit is interpreted in two ways as follows. * - all functions must set the bit to 1. 
@@ -1083,6 +1103,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, bus->devices[devfn]->name); return NULL; } else if (dev->hotplugged && + !pci_is_vf(pci_dev) && pci_get_function_0(pci_dev)) { error_setg(errp, "PCI: slot %d function 0 already occupied by %s," " new func %s cannot be exposed to guest.", @@ -1191,6 +1212,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num, pcibus_t size = memory_region_size(memory); uint8_t hdr_type; +assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar */ assert(region_num >= 0); assert(region_num < PCI_NUM_REGIONS); assert(is_power_of_2(size)); @@ -1294,11 +1316,43 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num) return pci_dev->io_regions[region_num].addr; } -static pcibus_t pci_bar_address(PCIDevice *d, -int reg, uint8_t type, pcibus_t size) +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg, +uint8_t type, pcibus_t size) +{ +pcibus_t new_addr; +if (!pci_is_vf(d)) { +int bar = pci_bar(d, reg); +if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) { +new_addr = pci
[PATCH v3 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
This is the version of the patch series that we consider ready for staging.
We do not intend to work on the v4 unless there are major issues.

Changes since v2:
- The documentation mentions that SR-IOV support is still an experimental
  feature.
- The default value activates properly when sriov_max_v{i,q}_per_vf == 0.
- Secondary Controller List (CNS 15h) handles the CDW10.CNTID field.
- Virtual Function Number ("VFN") in Secondary Controller Entry is not
  cleared to zero as the controller goes offline.
- Removed no longer used helper pcie_sriov_vf_number_total.
- Reset other than Controller Reset is necessary to activate (or
  deactivate) flexible resources.
- The v{i,q}rfap fields in Primary Controller Capabilities store the
  currently active number of bound resources, not the number active
  after reset.
- Secondary controller cannot be set online unless the corresponding VF
  is enabled (sriov_numvfs set to at least the secondary controller's VF
  number)

The list of opens and known gaps remains the same as for v2:
https://lists.gnu.org/archive/html/qemu-block/2021-11/msg00423.html

Knut Omang (2):
  pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt

Lukasz Maniak (4):
  hw/nvme: Add support for SR-IOV
  hw/nvme: Add support for Primary Controller Capabilities
  hw/nvme: Add support for Secondary Controller List
  docs: Add documentation for SR-IOV and Virtualization Enhancements

Łukasz Gieryk (9):
  pcie: Add a helper to the SR/IOV API
  pcie: Add 1.2 version token for the Power Management Capability
  hw/nvme: Implement the Function Level Reset
  hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  hw/nvme: Remove reg_size variable and update BAR0 size calculation
  hw/nvme: Calculate BAR attributes in a function
  hw/nvme: Initialize capability structures for primary/secondary
    controllers
  hw/nvme: Add support for the Virtualization Management command
  hw/nvme: Update the initalization place for the AER queue

 docs/pcie_sriov.txt          | 115 ++
 docs/system/devices/nvme.rst |  36 ++
 hw/nvme/ctrl.c               | 665 ---
 hw/nvme/ns.c                 |   2 +-
 hw/nvme/nvme.h               |  55 ++-
 hw/nvme/subsys.c             |  75 +++-
 hw/nvme/trace-events         |   6 +
 hw/pci/meson.build           |   1 +
 hw/pci/pci.c                 |  97 +++--
 hw/pci/pcie.c                |   5 +
 hw/pci/pcie_sriov.c          | 295
 hw/pci/trace-events          |   5 +
 include/block/nvme.h         |  65
 include/hw/pci/pci.h         |  12 +-
 include/hw/pci/pci_ids.h     |   1 +
 include/hw/pci/pci_regs.h    |   1 +
 include/hw/pci/pcie.h        |   6 +
 include/hw/pci/pcie_sriov.h  |  72
 include/qemu/typedefs.h      |   2 +
 19 files changed, 1435 insertions(+), 81 deletions(-)
 create mode 100644 docs/pcie_sriov.txt
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

-- 
2.25.1
[PATCH v3 03/15] pcie: Add a helper to the SR/IOV API
From: Łukasz Gieryk

Convenience function for retrieving the PCIDevice object of the N-th VF.

Signed-off-by: Łukasz Gieryk
Reviewed-by: Knut Omang
---
 hw/pci/pcie_sriov.c         | 10 +-
 include/hw/pci/pcie_sriov.h |  5 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 501a1ff433..be8c907e06 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -280,8 +280,16 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev)
     return dev->exp.sriov_vf.vf_number;
 }
 
-
 PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
 {
     return dev->exp.sriov_vf.pf;
 }
+
+PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
+{
+    assert(!pci_is_vf(dev));
+    if (n < dev->exp.sriov_pf.num_vfs) {
+        return dev->exp.sriov_pf.vf[n];
+    }
+    return NULL;
+}
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 0974f00054..cd2aebd3a6 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -64,4 +64,9 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev);
  */
 PCIDevice *pcie_sriov_get_pf(PCIDevice *dev);
 
+/* Get the n-th VF of this physical function - only valid for PF.
+ * Returns NULL if index is invalid
+ */
+PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n);
+
 #endif /* QEMU_PCIE_SRIOV_H */
-- 
2.25.1
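The lookup pattern of this helper can be tried outside QEMU. The sketch below uses hypothetical Pf and Dev types and a hypothetical function name; unlike the posted helper, which compares the index only against num_vfs, it also rejects a negative index, which is my own addition because the parameter is a signed int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PCIDevice and the PF's SR/IOV state. */
typedef struct Dev {
    int id;
} Dev;

typedef struct Pf {
    uint16_t num_vfs; /* number of currently enabled VFs */
    Dev **vf;         /* array of num_vfs pointers */
} Pf;

/* Return the n-th enabled VF, or NULL when the index is out of range. */
static Dev *pf_get_vf_at_index(const Pf *pf, int n)
{
    if (n < 0 || n >= pf->num_vfs) {
        return NULL;
    }
    return pf->vf[n];
}
```

Returning NULL for an invalid index, instead of asserting, lets a caller probe for a VF that is not enabled yet without a separate count query.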
Re: [RFC PATCH v2 02/14] job.h: categorize fields in struct Job
On 16/12/2021 17:21, Stefan Hajnoczi wrote:
> On Thu, Nov 04, 2021 at 10:53:22AM -0400, Emanuele Giuseppe Esposito wrote:
>> Categorize the fields in struct Job to understand which ones
>> need to be protected by the job mutex and which don't.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito
>> ---
>>  include/qemu/job.h | 57 +++---
>>  1 file changed, 34 insertions(+), 23 deletions(-)
>>
>> diff --git a/include/qemu/job.h b/include/qemu/job.h
>> index ccf7826426..f7036ac6b3 100644
>> --- a/include/qemu/job.h
>> +++ b/include/qemu/job.h
>> @@ -40,27 +40,52 @@ typedef struct JobTxn JobTxn;
>>   * Long-running operation.
>>   */
>>  typedef struct Job {
>> +
>> +    /* Fields set at initialization (job_create), and never modified */
>> +
>>      /** The ID of the job. May be NULL for internal jobs. */
>>      char *id;
>> -    /** The type of this job. */
>> +    /**
>> +     * The type of this job.
>> +     * All callbacks are called with job_mutex *not* held.
>> +     */
>>      const JobDriver *driver;
>> -    /** Reference count of the block job */
>> -    int refcnt;
>> -
>> -    /** Current state; See @JobStatus for details. */
>> -    JobStatus status;
>> -
>>      /** AioContext to run the job coroutine in */
>>      AioContext *aio_context;
>
> "Fields set at initialization (job_create), and never modified" does
> not apply here. blockjob.c:child_job_set_aio_ctx() changes it at
> runtime.

Right. aio_context can theoretically also avoid the job_mutex, if we
make sure that all klass->set_aio_ctx() are under BQL (they are) and
under drains (work in progress).

For now I will protect it with job_lock().

Thank you,
Emanuele
Re: [PATCH v4 0/7] nbd reconnect on open
13.12.2021 18:32, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
>
> The functionality is reviewed, python testing part is not.
>
> I've dropped the patch "qapi: make blockdev-add a coroutine command":
> it's optional, I don't want to slow down the whole series because of
> it.
>
> v4:
> 01-03: wording, add Eric's r-b
> others: small changes, never had an r-b
>
> Vladimir Sementsov-Ogievskiy (7):
>   nbd: allow reconnect on open, with corresponding new options
>   nbd/client-connection: nbd_co_establish_connection(): return real
>     error
>   nbd/client-connection: improve error message of cancelled attempt
>   iotests.py: add qemu_tool_popen()
>   For qemu_io* functions support --image-opts argument, which conflicts
>     with -f argument from qemu_io_args.
>   Add qemu-io Popen constructor wrapper. To be used in the following new
>     test commit.
>   iotests: add nbd-reconnect-on-open test
>
>  qapi/block-core.json                          |  9 ++-
>  block/nbd.c                                   | 45 +++-
>  nbd/client-connection.c                       | 59 ++-
>  tests/qemu-iotests/iotests.py                 | 36 ++
>  .../qemu-iotests/tests/nbd-reconnect-on-open  | 71 +++
>  .../tests/nbd-reconnect-on-open.out           | 11 +++
>  6 files changed, 199 insertions(+), 32 deletions(-)
>  create mode 100755 tests/qemu-iotests/tests/nbd-reconnect-on-open
>  create mode 100644 tests/qemu-iotests/tests/nbd-reconnect-on-open.out

Thanks for reviewing! I do s/6.2/7.0/ fix to patch 1, restore subjects
of patches 5,6 (which were somehow lost in transition v3->v4) and apply
the series to my nbd branch.

-- 
Best regards,
Vladimir
Re: [PATCH v2 0/2] qemu-img convert: Fix sparseness detection
On 17.12.21 at 17:46, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
>
> 01: only update test output rebasing on master
> 02: replaced with my proposed solution.
>
> Kevin Wolf (1):
>   iotests: Test qemu-img convert of zeroed data cluster
>
> Vladimir Sementsov-Ogievskiy (1):
>   qemu-img: make is_allocated_sectors() more efficient
>
>  qemu-img.c                 | 23 +++
>  tests/qemu-iotests/122     |  1 +
>  tests/qemu-iotests/122.out |  2 ++
>  3 files changed, 22 insertions(+), 4 deletions(-)
>

Tested-by: Peter Lieven