[PULL for-6.0 v2 03/10] hw/block/nvme: fix the nsid 'invalid' value
From: Klaus Jensen The `nvme_nsid()` function returns '-1' (h) when the given namespace is NULL. Since h is actually a valid namespace identifier (the "broadcast" value), change this to be '0' since that actually *is* the invalid value. Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch --- hw/block/nvme-ns.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h index 9ab7894fc83e..82340c4b2574 100644 --- a/hw/block/nvme-ns.h +++ b/hw/block/nvme-ns.h @@ -96,7 +96,7 @@ static inline uint32_t nvme_nsid(NvmeNamespace *ns) return ns->params.nsid; } -return -1; +return 0; } static inline bool nvme_ns_shared(NvmeNamespace *ns) -- 2.31.1
[PULL for-6.0 v2 07/10] hw/block/nvme: add missing copyright headers
From: Klaus Jensen Add missing license/copyright headers to the nvme-dif.{c,h} files. Signed-off-by: Klaus Jensen Reviewed-by: Keith Busch --- hw/block/nvme-dif.h | 10 ++ hw/block/nvme-dif.c | 10 ++ 2 files changed, 20 insertions(+) diff --git a/hw/block/nvme-dif.h b/hw/block/nvme-dif.h index 5a8e37c8525b..524faffbd7a0 100644 --- a/hw/block/nvme-dif.h +++ b/hw/block/nvme-dif.h @@ -1,3 +1,13 @@ +/* + * QEMU NVM Express End-to-End Data Protection support + * + * Copyright (c) 2021 Samsung Electronics Co., Ltd. + * + * Authors: + * Klaus Jensen + * Gollu Appalanaidu + */ + #ifndef HW_NVME_DIF_H #define HW_NVME_DIF_H diff --git a/hw/block/nvme-dif.c b/hw/block/nvme-dif.c index e6f04faafb5f..81b0a4cb1382 100644 --- a/hw/block/nvme-dif.c +++ b/hw/block/nvme-dif.c @@ -1,3 +1,13 @@ +/* + * QEMU NVM Express End-to-End Data Protection support + * + * Copyright (c) 2021 Samsung Electronics Co., Ltd. + * + * Authors: + * Klaus Jensen + * Gollu Appalanaidu + */ + #include "qemu/osdep.h" #include "hw/block/block.h" #include "sysemu/dma.h" -- 2.31.1
[PULL for-6.0 v2 05/10] hw/block/nvme: update dmsrl limit on namespace detachment
From: Klaus Jensen The Non-MDTS DMSRL limit must be recomputed when namespaces are detached. Fixes: 645ce1a70cb6 ("hw/block/nvme: support namespace attachment command") Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch --- hw/block/nvme.c | 17 + 1 file changed, 17 insertions(+) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 09c38fb35e0d..0898ece2af31 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -4868,6 +4868,21 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req) return NVME_NO_COMPLETE; } +static void nvme_update_dmrsl(NvmeCtrl *n) +{ +int nsid; + +for (nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) { +NvmeNamespace *ns = nvme_ns(n, nsid); +if (!ns) { +continue; +} + +n->dmrsl = MIN_NON_ZERO(n->dmrsl, +BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1)); +} +} + static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns); static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req) { @@ -4917,6 +4932,8 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req) } nvme_ns_detach(ctrl, ns); + +nvme_update_dmrsl(ctrl); } /* -- 2.31.1
[PULL for-6.0 v2 04/10] hw/block/nvme: fix warning about legacy namespace configuration
From: Klaus Jensen Remove the unused BlockConf from the controller structure and fix the constraint checking to actually check the right BlockConf and issue the warning. Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch --- hw/block/nvme.h | 1 - hw/block/nvme.c | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/hw/block/nvme.h b/hw/block/nvme.h index 9edc86d79e98..8d1806cc942f 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -166,7 +166,6 @@ typedef struct NvmeCtrl { NvmeBar bar; NvmeParams params; NvmeBus bus; -BlockConfconf; uint16_tcntlid; boolqs_created; diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 7244534a89e9..09c38fb35e0d 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -5805,7 +5805,7 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) params->max_ioqpairs = params->num_queues - 1; } -if (n->conf.blk) { +if (n->namespace.blkconf.blk) { warn_report("drive property is deprecated; " "please use an nvme-ns device instead"); } -- 2.31.1
[PULL for-6.0 v2 01/10] hw/block/nvme: fix pi constraint check
From: Klaus Jensen Protection Information can only be enabled if there is at least 8 bytes of metadata. Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch --- hw/block/nvme-ns.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c index 7f8d139a8663..ca04ee1bacfb 100644 --- a/hw/block/nvme-ns.c +++ b/hw/block/nvme-ns.c @@ -394,7 +394,7 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp) return -1; } -if (ns->params.pi && !ns->params.ms) { +if (ns->params.pi && ns->params.ms < 8) { error_setg(errp, "at least 8 bytes of metadata required to enable " "protection information"); return -1; -- 2.31.1
[PULL for-6.0 v2 02/10] hw/block/nvme: fix missing string representation for ns attachment
From: Klaus Jensen Add the missing nvme_adm_opc_str entry for the Namespace Attachment command. Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch --- hw/block/nvme.h | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/block/nvme.h b/hw/block/nvme.h index 5b0031b11db2..9edc86d79e98 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -86,6 +86,7 @@ static inline const char *nvme_adm_opc_str(uint8_t opc) case NVME_ADM_CMD_SET_FEATURES: return "NVME_ADM_CMD_SET_FEATURES"; case NVME_ADM_CMD_GET_FEATURES: return "NVME_ADM_CMD_GET_FEATURES"; case NVME_ADM_CMD_ASYNC_EV_REQ: return "NVME_ADM_CMD_ASYNC_EV_REQ"; +case NVME_ADM_CMD_NS_ATTACHMENT:return "NVME_ADM_CMD_NS_ATTACHMENT"; case NVME_ADM_CMD_FORMAT_NVM: return "NVME_ADM_CMD_FORMAT_NVM"; default:return "NVME_ADM_CMD_UNKNOWN"; } -- 2.31.1
[PULL for-6.0 v2 00/10] emulated nvme fixes for -rc3
From: Klaus Jensen Hi Peter, My apologies that these didn't make it for -rc2! I botched v1, so please pull this v2 instead. The following changes since commit d0d3dd401b70168a353450e031727affee828527: Update version for v6.0.0-rc2 release (2021-04-06 18:34:34 +0100) are available in the Git repository at: git://git.infradead.org/qemu-nvme.git tags/nvme-fixes-2021-04-07-pull-request for you to fetch changes up to 5dd79300df47f07d0e9d6a7bda43b23ff26001dc: hw/block/nvme: fix out-of-bounds read in nvme_subsys_ctrl (2021-04-07 07:27:09 +0200) emulated nvme fixes for -rc3 v2: - added missing patches Klaus Jensen (10): hw/block/nvme: fix pi constraint check hw/block/nvme: fix missing string representation for ns attachment hw/block/nvme: fix the nsid 'invalid' value hw/block/nvme: fix warning about legacy namespace configuration hw/block/nvme: update dmsrl limit on namespace detachment hw/block/nvme: fix handling of private namespaces hw/block/nvme: add missing copyright headers hw/block/nvme: fix ns attachment out-of-bounds read hw/block/nvme: fix assert crash in nvme_subsys_ns hw/block/nvme: fix out-of-bounds read in nvme_subsys_ctrl hw/block/nvme-dif.h| 10 +++ hw/block/nvme-ns.h | 12 ++-- hw/block/nvme-subsys.h | 11 ++-- hw/block/nvme.h| 41 +--- include/block/nvme.h | 1 + hw/block/nvme-dif.c| 10 +++ hw/block/nvme-ns.c | 78 ++ hw/block/nvme-subsys.c | 28 hw/block/nvme.c| 143 + hw/block/trace-events | 1 - 10 files changed, 158 insertions(+), 177 deletions(-) -- 2.31.1
Re: [PATCH 0/6] Add debug interface to kick/call on purpose
在 2021/4/7 上午7:27, Dongli Zhang 写道: This will answer your question that "Can it bypass the masking?". For vhost-scsi, virtio-blk, virtio-scsi and virtio-net, to write to eventfd is not able to bypass masking because masking is to unregister the eventfd. To write to eventfd does not take effect. However, it is possible to bypass masking for vhost-net because vhost-net registered a specific masked_notifier eventfd in order to mask irq. To write to original eventfd still takes effect. We may leave the user to decide whether to write to 'masked_notifier' or original 'guest_notifier' for vhost-net. My fault here. To write to masked_notifier will always be masked:( Only when there's no bug in the qemu. If it is EventNotifier level, we will not care whether the EventNotifier is masked or not. It just provides an interface to write to EventNotifier. Yes. To dump the MSI-x table for both virtio and vfio will help confirm if the vector is masked. That would be helpful as well. It's probably better to extend "info pci" command. Thanks Thank you very much! Dongli Zhang
Re: [PATCH 0/6] Add debug interface to kick/call on purpose
在 2021/4/6 下午4:43, Dongli Zhang 写道: On 4/5/21 6:55 PM, Jason Wang wrote: 在 2021/4/6 上午4:00, Dongli Zhang 写道: On 4/1/21 8:47 PM, Jason Wang wrote: 在 2021/3/30 下午3:29, Dongli Zhang 写道: On 3/28/21 8:56 PM, Jason Wang wrote: 在 2021/3/27 上午5:16, Dongli Zhang 写道: Hi Jason, On 3/26/21 12:24 AM, Jason Wang wrote: 在 2021/3/26 下午1:44, Dongli Zhang 写道: The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to the loss of doorbell kick, e.g., https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!KS3pAU2cKjz4wgI4QSlE-YsJPhPG71nkE5_tGhaOf7mi_xvNxbvKkfn03rk5BNDLSEU$ ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler"). This patch introduces a new debug interface 'DeviceEvent' to DeviceClass to help narrow down if the issue is due to loss of irq/kick. So far the new interface handles only two events: 'call' and 'kick'. Any device (e.g., virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X or legacy IRQ). The 'call' is to inject irq on purpose by admin for a specific device (e.g., vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell on purpose by admin at QEMU/host side for a specific device. This device can be used as a workaround if call/kick is lost due to virtualization software (e.g., kernel or QEMU) issue. We may also implement the interface for VFIO PCI, e.g., to write to VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM on purpose. This is considered future work once the virtio part is done. Below is from live crash analysis. Initially, the queue=2 has count=15 for 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no used available. We suspect this is because vhost-scsi was not notified by VM. In order to narrow down and analyze the issue, we use live crash to dump the current counter of eventfd for queue=2. crash> eventfd_ctx 8f67f6bbe700 struct eventfd_ctx { kref = { refcount = { refs = { counter = 4 } } }, wqh = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, head = { next = 0x8f841dc08e18, prev = 0x8f841dc08e18 } }, count = 15, ---> eventfd is 15 !!! flags = 526336 } Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic with this interface. { "execute": "x-debug-device-event", "arguments": { "dev": "/machine/peripheral/vscsi0", "event": "kick", "queue": 2 } } The counter is increased to 16. Suppose the hang issue is resolved, it indicates something bad is in software that the 'kick' is lost. What do you mean by "software" here? And it looks to me you're testing whether event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not sure how much value could we gain from a dedicated debug interface like this consider there're a lot of exisinting general purpose debugging method like tracing or gdb. I'd say the path from virtio_queue_notify() to event_notifier_set() is only a very small fraction of the process of virtqueue kick which is unlikey to be buggy. Consider usually the ioeventfd will be offloaded to KVM, it's more a chance that something is wrong in setuping ioeventfd instead of here. Irq is even more complicated. Thank you very much! I am not testing whether event_notifier_set() is called by virtio_queue_notify(). The 'software' indicates the data processing and event notification mechanism involved with virtio/vhost PV driver frontend. E.g., while VM is waiting for an extra IRQ, vhost side did not trigger IRQ, suppose vring_need_event() erroneously returns false due to corrupted ring buffer status. So there could be several factors that may block the notification: 1) eventfd bug (ioeventfd vs irqfd) 2) wrong virtqueue state (either driver or device) 3) missing barriers (either driver or device) 4) Qemu bug (irqchip and routing) ... This is not only about whether notification is blocked. It can also be used to help narrow down and understand if there is any suspicious issue in blk-mq/scsi/netdev/napi code. The PV drivers are not only drivers following virtio spec. It is closely related to many of other kernel components. Suppose IO was recovered after we inject an IRQ to vhost-scsi on purpose, we will be able to analyze what may happen along the IO completion path starting from when /where the IRQ is injected ... perhaps the root cause is not with virtio but blk-mq/scsi (this is just an example). In addition, this idea should help for vfio-pci. Suppose the developer for a specific device driver suspects IO/networking hangs because of loss if IRQ,
Re: [PATCH 0/6] Add debug interface to kick/call on purpose
On 4/6/21 1:43 AM, Dongli Zhang wrote: > > > On 4/5/21 6:55 PM, Jason Wang wrote: >> >> 在 2021/4/6 上午4:00, Dongli Zhang 写道: >>> >>> On 4/1/21 8:47 PM, Jason Wang wrote: 在 2021/3/30 下午3:29, Dongli Zhang 写道: > On 3/28/21 8:56 PM, Jason Wang wrote: >> 在 2021/3/27 上午5:16, Dongli Zhang 写道: >>> Hi Jason, >>> >>> On 3/26/21 12:24 AM, Jason Wang wrote: 在 2021/3/26 下午1:44, Dongli Zhang 写道: > The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due > to > the loss of doorbell kick, e.g., > > https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!KS3pAU2cKjz4wgI4QSlE-YsJPhPG71nkE5_tGhaOf7mi_xvNxbvKkfn03rk5BNDLSEU$ > > > > > ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit > fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler"). > > This patch introduces a new debug interface 'DeviceEvent' to > DeviceClass > to help narrow down if the issue is due to loss of irq/kick. So far > the new > interface handles only two events: 'call' and 'kick'. Any device > (e.g., > virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, > MSI-X > or legacy IRQ). > > The 'call' is to inject irq on purpose by admin for a specific device > (e.g., > vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the > doorbell > on purpose by admin at QEMU/host side for a specific device. > > > This device can be used as a workaround if call/kick is lost due to > virtualization software (e.g., kernel or QEMU) issue. > > We may also implement the interface for VFIO PCI, e.g., to write to > VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to > VM > on purpose. This is considered future work once the virtio part is > done. > > > Below is from live crash analysis. Initially, the queue=2 has > count=15 for > 'kick' eventfd_ctx. Suppose there is data in vring avail while there > is no > used available. We suspect this is because vhost-scsi was not > notified by > VM. In order to narrow down and analyze the issue, we use live crash > to > dump the current counter of eventfd for queue=2. > > crash> eventfd_ctx 8f67f6bbe700 > struct eventfd_ctx { > kref = { > refcount = { > refs = { > counter = 4 > } > } > }, > wqh = { > lock = { > { > rlock = { > raw_lock = { > val = { > counter = 0 > } > } > } > } > }, > head = { > next = 0x8f841dc08e18, > prev = 0x8f841dc08e18 > } > }, > count = 15, ---> eventfd is 15 !!! > flags = 526336 > } > > Now we kick the doorbell for vhost-scsi queue=2 on purpose for > diagnostic > with this interface. > > { "execute": "x-debug-device-event", > "arguments": { "dev": "/machine/peripheral/vscsi0", > "event": "kick", "queue": 2 } } > > The counter is increased to 16. Suppose the hang issue is resolved, it > indicates something bad is in software that the 'kick' is lost. What do you mean by "software" here? And it looks to me you're testing whether event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not sure how much value could we gain from a dedicated debug interface like this consider there're a lot of exisinting general purpose debugging method like tracing or gdb. I'd say the path from virtio_queue_notify() to event_notifier_set() is only a very small fraction of the process of virtqueue kick which is unlikey to be buggy. Consider usually the ioeventfd will be offloaded to KVM, it's more a chance that something is wrong in setuping ioeventfd instead of here. Irq is even more complicated. >>> Thank you very much! >>> >>> I am not testing whether event_notifier_set() is called by >>> virtio_queue_notify(). >>> >>> The 'software' indicates the data processing and event notification >>> mechanism >>> involved with virtio/vhost PV driver frontend. E.g.,
[PULL for-6.0 6/7] hw/block/nvme: fix handling of private namespaces
From: Klaus Jensen Prior to this patch, if a private nvme-ns device (that is, a namespace that is not linked to a subsystem) is wired up to an nvme-subsys linked nvme controller device, the device fails to verify that the namespace id is unique within the subsystem. NVM Express v1.4b, Section 6.1.6 ("NSID and Namespace Usage") states that because the device supports Namespace Management, "NSIDs *shall* be unique within the NVM subsystem". Additionally, prior to this patch, private namespaces are not known to the subsystem and the namespace is considered exclusive to the controller with which it is initially wired up to. However, this is not the definition of a private namespace; per Section 1.6.33 ("private namespace"), a private namespace is just a namespace that does not support multipath I/O or namespace sharing, which means "that it is only able to be attached to one controller at a time". Fix this by always allocating namespaces in the subsystem (if one is linked to the controller), regardless of the shared/private status of the namespace. Whether or not the namespace is shareable is controlled by a new `shared` nvme-ns parameter. Finally, this fix allows the nvme-ns `subsys` parameter to be removed, since the `shared` parameter now serves the purpose of attaching the namespace to all controllers in the subsystem upon device realization. It is invalid to have an nvme-ns namespace device with a linked subsystem without the parent nvme controller device also being linked to one and since the nvme-ns devices will unconditionally be "attached" (in QEMU terms that is) to an nvme controller device through an NvmeBus, the nvme-ns namespace device can always get a reference to the subsystem of the controller it is explicitly (using 'bus=' parameter) or implicitly attaching to. Fixes: e570768566b3 ("hw/block/nvme: support for shared namespace in subsystem") Cc: Minwoo Im Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch Reviewed-by: Minwoo Im --- hw/block/nvme-ns.h | 10 +--- hw/block/nvme-subsys.h | 7 +-- hw/block/nvme.h| 39 + include/block/nvme.h | 1 + hw/block/nvme-ns.c | 76 + hw/block/nvme-subsys.c | 28 -- hw/block/nvme.c| 123 ++--- hw/block/trace-events | 1 - 8 files changed, 115 insertions(+), 170 deletions(-) diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h index 82340c4b2574..fb0a41f912e7 100644 --- a/hw/block/nvme-ns.h +++ b/hw/block/nvme-ns.h @@ -29,6 +29,7 @@ typedef struct NvmeZone { typedef struct NvmeNamespaceParams { bool detached; +bool shared; uint32_t nsid; QemuUUID uuid; @@ -60,8 +61,8 @@ typedef struct NvmeNamespace { const uint32_t *iocs; uint8_t csi; uint16_t status; +int attached; -NvmeSubsystem *subsys; QTAILQ_ENTRY(NvmeNamespace) entry; NvmeIdNsZoned *id_ns_zoned; @@ -99,11 +100,6 @@ static inline uint32_t nvme_nsid(NvmeNamespace *ns) return 0; } -static inline bool nvme_ns_shared(NvmeNamespace *ns) -{ -return !!ns->subsys; -} - static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns) { NvmeIdNs *id_ns = >id_ns; @@ -225,7 +221,7 @@ static inline void nvme_aor_dec_active(NvmeNamespace *ns) } void nvme_ns_init_format(NvmeNamespace *ns); -int nvme_ns_setup(NvmeNamespace *ns, Error **errp); +int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp); void nvme_ns_drain(NvmeNamespace *ns); void nvme_ns_shutdown(NvmeNamespace *ns); void nvme_ns_cleanup(NvmeNamespace *ns); diff --git a/hw/block/nvme-subsys.h b/hw/block/nvme-subsys.h index aafa04b84829..24132edd005c 100644 --- a/hw/block/nvme-subsys.h +++ b/hw/block/nvme-subsys.h @@ -14,7 +14,7 @@ OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS) #define NVME_SUBSYS_MAX_CTRLS 32 -#define NVME_SUBSYS_MAX_NAMESPACES 256 +#define NVME_MAX_NAMESPACES 256 typedef struct NvmeCtrl NvmeCtrl; typedef struct NvmeNamespace NvmeNamespace; @@ -24,7 +24,7 @@ typedef struct NvmeSubsystem { NvmeCtrl*ctrls[NVME_SUBSYS_MAX_CTRLS]; /* Allocated namespaces for this subsystem */ -NvmeNamespace *namespaces[NVME_SUBSYS_MAX_NAMESPACES + 1]; +NvmeNamespace *namespaces[NVME_MAX_NAMESPACES + 1]; struct { char *nqn; @@ -32,7 +32,6 @@ typedef struct NvmeSubsystem { } NvmeSubsystem; int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp); -int nvme_subsys_register_ns(NvmeNamespace *ns, Error **errp); static inline NvmeCtrl *nvme_subsys_ctrl(NvmeSubsystem *subsys, uint32_t cntlid) @@ -54,7 +53,7 @@ static inline NvmeNamespace *nvme_subsys_ns(NvmeSubsystem *subsys, return NULL; } -assert(nsid && nsid <= NVME_SUBSYS_MAX_NAMESPACES); +assert(nsid && nsid <= NVME_MAX_NAMESPACES); return subsys->namespaces[nsid]; } diff --git a/hw/block/nvme.h b/hw/block/nvme.h index
[PULL for-6.0 5/7] hw/block/nvme: update dmsrl limit on namespace detachment
From: Klaus Jensen The Non-MDTS DMSRL limit must be recomputed when namespaces are detached. Fixes: 645ce1a70cb6 ("hw/block/nvme: support namespace attachment command") Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch --- hw/block/nvme.c | 17 + 1 file changed, 17 insertions(+) diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 09c38fb35e0d..0898ece2af31 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -4868,6 +4868,21 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req) return NVME_NO_COMPLETE; } +static void nvme_update_dmrsl(NvmeCtrl *n) +{ +int nsid; + +for (nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) { +NvmeNamespace *ns = nvme_ns(n, nsid); +if (!ns) { +continue; +} + +n->dmrsl = MIN_NON_ZERO(n->dmrsl, +BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1)); +} +} + static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns); static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req) { @@ -4917,6 +4932,8 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req) } nvme_ns_detach(ctrl, ns); + +nvme_update_dmrsl(ctrl); } /* -- 2.31.1
[PULL for-6.0 1/7] hw/block/nvme: fix pi constraint check
From: Klaus Jensen Protection Information can only be enabled if there is at least 8 bytes of metadata. Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch --- hw/block/nvme-ns.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c index 7f8d139a8663..ca04ee1bacfb 100644 --- a/hw/block/nvme-ns.c +++ b/hw/block/nvme-ns.c @@ -394,7 +394,7 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp) return -1; } -if (ns->params.pi && !ns->params.ms) { +if (ns->params.pi && ns->params.ms < 8) { error_setg(errp, "at least 8 bytes of metadata required to enable " "protection information"); return -1; -- 2.31.1
[PULL for-6.0 2/7] hw/block/nvme: fix missing string representation for ns attachment
From: Klaus Jensen Add the missing nvme_adm_opc_str entry for the Namespace Attachment command. Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch --- hw/block/nvme.h | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/block/nvme.h b/hw/block/nvme.h index 5b0031b11db2..9edc86d79e98 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -86,6 +86,7 @@ static inline const char *nvme_adm_opc_str(uint8_t opc) case NVME_ADM_CMD_SET_FEATURES: return "NVME_ADM_CMD_SET_FEATURES"; case NVME_ADM_CMD_GET_FEATURES: return "NVME_ADM_CMD_GET_FEATURES"; case NVME_ADM_CMD_ASYNC_EV_REQ: return "NVME_ADM_CMD_ASYNC_EV_REQ"; +case NVME_ADM_CMD_NS_ATTACHMENT:return "NVME_ADM_CMD_NS_ATTACHMENT"; case NVME_ADM_CMD_FORMAT_NVM: return "NVME_ADM_CMD_FORMAT_NVM"; default:return "NVME_ADM_CMD_UNKNOWN"; } -- 2.31.1
[PULL for-6.0 7/7] hw/block/nvme: add missing copyright headers
From: Klaus Jensen Add missing license/copyright headers to the nvme-dif.{c,h} files. Signed-off-by: Klaus Jensen Reviewed-by: Keith Busch --- hw/block/nvme-dif.h | 10 ++ hw/block/nvme-dif.c | 10 ++ 2 files changed, 20 insertions(+) diff --git a/hw/block/nvme-dif.h b/hw/block/nvme-dif.h index 5a8e37c8525b..524faffbd7a0 100644 --- a/hw/block/nvme-dif.h +++ b/hw/block/nvme-dif.h @@ -1,3 +1,13 @@ +/* + * QEMU NVM Express End-to-End Data Protection support + * + * Copyright (c) 2021 Samsung Electronics Co., Ltd. + * + * Authors: + * Klaus Jensen + * Gollu Appalanaidu + */ + #ifndef HW_NVME_DIF_H #define HW_NVME_DIF_H diff --git a/hw/block/nvme-dif.c b/hw/block/nvme-dif.c index e6f04faafb5f..81b0a4cb1382 100644 --- a/hw/block/nvme-dif.c +++ b/hw/block/nvme-dif.c @@ -1,3 +1,13 @@ +/* + * QEMU NVM Express End-to-End Data Protection support + * + * Copyright (c) 2021 Samsung Electronics Co., Ltd. + * + * Authors: + * Klaus Jensen + * Gollu Appalanaidu + */ + #include "qemu/osdep.h" #include "hw/block/block.h" #include "sysemu/dma.h" -- 2.31.1
[PULL for-6.0 3/7] hw/block/nvme: fix the nsid 'invalid' value
From: Klaus Jensen The `nvme_nsid()` function returns '-1' (h) when the given namespace is NULL. Since h is actually a valid namespace identifier (the "broadcast" value), change this to be '0' since that actually *is* the invalid value. Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch --- hw/block/nvme-ns.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h index 9ab7894fc83e..82340c4b2574 100644 --- a/hw/block/nvme-ns.h +++ b/hw/block/nvme-ns.h @@ -96,7 +96,7 @@ static inline uint32_t nvme_nsid(NvmeNamespace *ns) return ns->params.nsid; } -return -1; +return 0; } static inline bool nvme_ns_shared(NvmeNamespace *ns) -- 2.31.1
[PULL for-6.0 4/7] hw/block/nvme: fix warning about legacy namespace configuration
From: Klaus Jensen Remove the unused BlockConf from the controller structure and fix the constraint checking to actually check the right BlockConf and issue the warning. Signed-off-by: Klaus Jensen Reviewed-by: Gollu Appalanaidu Reviewed-by: Keith Busch --- hw/block/nvme.h | 1 - hw/block/nvme.c | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/hw/block/nvme.h b/hw/block/nvme.h index 9edc86d79e98..8d1806cc942f 100644 --- a/hw/block/nvme.h +++ b/hw/block/nvme.h @@ -166,7 +166,6 @@ typedef struct NvmeCtrl { NvmeBar bar; NvmeParams params; NvmeBus bus; -BlockConfconf; uint16_tcntlid; boolqs_created; diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 7244534a89e9..09c38fb35e0d 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -5805,7 +5805,7 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp) params->max_ioqpairs = params->num_queues - 1; } -if (n->conf.blk) { +if (n->namespace.blkconf.blk) { warn_report("drive property is deprecated; " "please use an nvme-ns device instead"); } -- 2.31.1
[PULL for-6.0 0/7] emulated nvme fixes for -rc3
From: Klaus Jensen Hi Peter, My apologies that these didn't make it for -rc2! The following changes since commit d0d3dd401b70168a353450e031727affee828527: Update version for v6.0.0-rc2 release (2021-04-06 18:34:34 +0100) are available in the Git repository at: git://git.infradead.org/qemu-nvme.git tags/nvme-fixes-2021-04-06-pull-request for you to fetch changes up to 9553f8deebe0e443a0b006aa9d881269fd251a2c: hw/block/nvme: add missing copyright headers (2021-04-06 21:14:25 +0200) emulated nvme fixes for -rc3 Klaus Jensen (7): hw/block/nvme: fix pi constraint check hw/block/nvme: fix missing string representation for ns attachment hw/block/nvme: fix the nsid 'invalid' value hw/block/nvme: fix warning about legacy namespace configuration hw/block/nvme: update dmsrl limit on namespace detachment hw/block/nvme: fix handling of private namespaces hw/block/nvme: add missing copyright headers hw/block/nvme-dif.h| 10 +++ hw/block/nvme-ns.h | 12 ++-- hw/block/nvme-subsys.h | 7 +- hw/block/nvme.h| 41 +--- include/block/nvme.h | 1 + hw/block/nvme-dif.c| 10 +++ hw/block/nvme-ns.c | 78 ++ hw/block/nvme-subsys.c | 28 hw/block/nvme.c| 142 + hw/block/trace-events | 1 - 10 files changed, 156 insertions(+), 174 deletions(-) -- 2.31.1
Re: [PATCH for-6.0 v2 4/8] hw/block/nvme: fix controller namespaces array indexing
On Apr 6 09:28, Klaus Jensen wrote: > On Apr 6 09:01, Philippe Mathieu-Daudé wrote: > > On 4/5/21 7:54 PM, Klaus Jensen wrote: > > > From: Klaus Jensen > > > > > > The controller namespaces array being 0-indexed requires 'nsid - 1' > > > everywhere. Something that is easy to miss. Align the controller > > > namespaces array with the subsystem namespaces array such that both are > > > 1-indexed. > > > > TBH I don't understand the justification. > > Justification is mostly to align with the subsystem device. I like the > '1-indexed' approach better. And the -1 causes Coverity to complain > before the assert was added. > > > Assuming you hit a > > bug and try to protect yourself, maybe now you should also > > check for > > > > assert(n->namespaces[0] == NULL); > > > > somewhere. In nvme_ns() maybe? > > > > That is definitely a state that should always hold, I guess we can do > that, but we do already guard all "insertions" into the namespace array > by an assert on the nsid. Then again, asserting here makes sure that we > don't introduce something else that inserts on this invalid position. > > So, good point, I'll add it. > Then again again. I don't see the reason for the assert. Even if something ends up there by mistake we will never return it. If something ends up there due to new code, that nvme_ns() will always return NULL when nsid is zero and that should weed out the bug easily. I'll update the commit message to make it clear that this is about making both the subsystem and controller namespaces arrays 1-indexed. Them being indexed differently is a recipe for disaster I'd say. In anycase, I it actually a stretch to call this a bug fix, so I'll drop it and queue it up for v6.1. signature.asc Description: PGP signature
Re: [PATCH 0/2] block/rbd: fix memory leaks
On 29.03.21 17:01, Stefano Garzarella wrote: This series fixes two memory leaks, found through valgrind, in the rbd driver. Stefano Garzarella (2): block/rbd: fix memory leak in qemu_rbd_connect() block/rbd: fix memory leak in qemu_rbd_co_create_opts() block/rbd.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) Reviewed-by: Max Reitz I’m not quite sure whether this is fit for 6.0... I think it’s too late for rc2, so I don’t know. Max
Re: [PATCH for-6.0] block/nbd: fix possible use after free of s->connect_thread
06.04.2021 18:51, Vladimir Sementsov-Ogievskiy wrote: If on nbd_close() we detach the thread (in nbd_co_establish_connection_cancel() thr->state becomes CONNECT_THREAD_RUNNING_DETACHED), after that point we should not use s->connect_thread (which is set to NULL), as running thread may free it at any time. Still nbd_co_establish_connection() does exactly this: it saves s->connect_thread to local variable (just for better code style) and use it even after yield point, when thread may be already detached. Fix that. Also check thr to be non-NULL on nbd_co_establish_connection() start for safety. After this patch "case CONNECT_THREAD_RUNNING_DETACHED" becomes impossible in the second switch in nbd_co_establish_connection(). Still, don't add extra abort() just before the release. If it somehow possible to reach this "case:" it won't hurt. Anyway, good refactoring of all this reconnect mess will come soon. Signed-off-by: Vladimir Sementsov-Ogievskiy --- Hi all! I faced a crash, just running 277 iotest in a loop. I can't reproduce it on master, it reproduces only on my branch with nbd reconnect refactorings. Still, it seems very possible that it may crash under some conditions. So I propose this patch for 6.0. It's written so that it's obvious that it will not hurt: pre-patch, on first hunk we'll just crash if thr is NULL, on second hunk it's safe to return -1, and using thr when s->connect_thread is already zeroed is obviously wrong. Ha, occasionally I reinvented what Roman already does in "[PATCH 1/7] block/nbd: avoid touching freed connect_thread". My additional first hunk actually is not needed, as nbd_co_establish_connection is called after if (!nbd_clisent_connecting(s)) { return; }, so we should not be here after nbd_co_establish_connection_cancel(bs, true); which is called with s->state set to NBD_CLIENT_QUIT. So, it would be more honest to take Roman's patch "[PATCH 1/7] block/nbd: avoid touching freed connect_thread" :) Eric, could you take a look? If there no more pending block patches, I can try to send pull-request myself -- Best regards, Vladimir
Re: [PATCH 1/7] block/nbd: avoid touching freed connect_thread
15.03.2021 09:06, Roman Kagan wrote: When the NBD connection is being torn down, the connection thread gets canceled and "detached", meaning it is about to get freed. If this happens while the connection coroutine yielded waiting for the connection thread to complete, when it resumes it may access the invalidated connection thread data. To prevent this, revalidate the ->connect_thread pointer in nbd_co_establish_connection_cancel before using after the the yield. Signed-off-by: Roman Kagan --- block/nbd.c | 9 + 1 file changed, 9 insertions(+) diff --git a/block/nbd.c b/block/nbd.c index c26dc5a54f..447d176b76 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -486,6 +486,15 @@ nbd_co_establish_connection(BlockDriverState *bs, Error **errp) s->wait_connect = true; qemu_coroutine_yield(); +/* + * If nbd_co_establish_connection_cancel had a chance to run it may have + * invalidated ->connect_thread. + */ +thr = s->connect_thread; +if (!thr) { +return -ECONNABORTED; nbd_co_establish_connection() tends to return -1 or 0, not -errno, so -1 is better here. Still it doesn't really matter. +} + qemu_mutex_lock(>mutex); switch (thr->state) { -- Best regards, Vladimir
[PATCH for-6.0] block/nbd: fix possible use after free of s->connect_thread
If on nbd_close() we detach the thread (in nbd_co_establish_connection_cancel() thr->state becomes CONNECT_THREAD_RUNNING_DETACHED), after that point we should not use s->connect_thread (which is set to NULL), as running thread may free it at any time. Still nbd_co_establish_connection() does exactly this: it saves s->connect_thread to local variable (just for better code style) and use it even after yield point, when thread may be already detached. Fix that. Also check thr to be non-NULL on nbd_co_establish_connection() start for safety. After this patch "case CONNECT_THREAD_RUNNING_DETACHED" becomes impossible in the second switch in nbd_co_establish_connection(). Still, don't add extra abort() just before the release. If it somehow possible to reach this "case:" it won't hurt. Anyway, good refactoring of all this reconnect mess will come soon. Signed-off-by: Vladimir Sementsov-Ogievskiy --- Hi all! I faced a crash, just running 277 iotest in a loop. I can't reproduce it on master, it reproduces only on my branch with nbd reconnect refactorings. Still, it seems very possible that it may crash under some conditions. So I propose this patch for 6.0. It's written so that it's obvious that it will not hurt: pre-patch, on first hunk we'll just crash if thr is NULL, on second hunk it's safe to return -1, and using thr when s->connect_thread is already zeroed is obviously wrong. block/nbd.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/nbd.c b/block/nbd.c index c26dc5a54f..1d4668d42d 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -443,6 +443,11 @@ nbd_co_establish_connection(BlockDriverState *bs, Error **errp) BDRVNBDState *s = bs->opaque; NBDConnectThread *thr = s->connect_thread; +if (!thr) { +/* detached */ +return -1; +} + qemu_mutex_lock(>mutex); switch (thr->state) { @@ -486,6 +491,12 @@ nbd_co_establish_connection(BlockDriverState *bs, Error **errp) s->wait_connect = true; qemu_coroutine_yield(); +if (!s->connect_thread) { +/* detached */ +return -1; +} +assert(thr == s->connect_thread); + qemu_mutex_lock(>mutex); switch (thr->state) { -- 2.29.2
Re: [PATCH v2 2/2] block/rbd: Add an escape-aware strchr helper
On 01.04.21 23:01, Connor Kuehl wrote: Sometimes the parser needs to further split a token it has collected from the token input stream. Right now, it does a cursory check to see if the relevant characters appear in the token to determine if it should break it down further. However, qemu_rbd_next_tok() will escape characters as it removes tokens from the token stream and plain strchr() won't. This can make the initial strchr() check slightly misleading since it implies qemu_rbd_next_tok() will find the token and split on it, except the reality is that qemu_rbd_next_tok() will pass over it if it is escaped. Use a custom strchr to avoid mixing escaped and unescaped string operations. Reported-by: Han Han Fixes: https://bugzilla.redhat.com/1873913 Signed-off-by: Connor Kuehl --- block/rbd.c| 20 ++-- tests/qemu-iotests/231 | 4 tests/qemu-iotests/231.out | 3 +++ 3 files changed, 25 insertions(+), 2 deletions(-) diff --git a/block/rbd.c b/block/rbd.c index 9071a00e3f..c0e4d4a952 100644 --- a/block/rbd.c +++ b/block/rbd.c @@ -134,6 +134,22 @@ static char *qemu_rbd_next_tok(char *src, char delim, char **p) return src; } +static char *qemu_rbd_strchr(char *src, char delim) +{ +char *p; + +for (p = src; *p; ++p) { +if (*p == delim) { +return p; +} +if (*p == '\\') { +++p; +} +} + +return NULL; +} + So I thought you could make qemu_rbd_do_next_tok() to do this. (I didn’t say you should, but bear with me.) That would be possible by giving it a new parameter (e.g. @find), and if that is set, return @end if *end == delim after the loop, and NULL otherwise. Now, if you add wrapper functions to make it nice, there’s not much more difference in lines added compared to just adding a new function, but it does mean your function should basically be the same as qemu_rbd_next_tok(), except that no splitting happens, that there is no *p, and that @end is returned instead of @src. So there is one difference, and that is that qemu_rbd_next_tok() has this condition to skip escaped characters: if (*end == '\\' && end[1] != '\0') { where qemu_rbd_strchr() has only: if (*p == '\\') { And I think qemu_rbd_next_tok() is right; if the string in question has a trailing backslash, qemu_rbd_strchr() will ignore the final NUL and continue searching past the end of the string. Max
Re: [PULL for-6.0 0/2] emulated nvme fixes
On Mon, 5 Apr 2021 at 18:34, Klaus Jensen wrote: > > From: Klaus Jensen > > Hi Peter, > > The following changes since commit 25d75c99b2e5941c67049ee776efdb226414f4c6: > > Merge remote-tracking branch 'remotes/xtensa/tags/20210403-xtensa' into > staging (2021-04-04 21:48:45 +0100) > > are available in the Git repository at: > > git://git.infradead.org/qemu-nvme.git tags/nvme-fixes-for-6.0-pull-request > > for you to fetch changes up to 498114b37bc99fddcfc24b92bff7f1323fb32672: > > hw/block/nvme: expose 'bootindex' property (2021-04-05 19:33:04 +0200) > > > emulated nvme fixes > Applied, thanks. Please update the changelog at https://wiki.qemu.org/ChangeLog/6.0 for any user-visible changes. -- PMM
Re: [PATCH 0/6] Add debug interface to kick/call on purpose
On 4/5/21 6:55 PM, Jason Wang wrote: > > 在 2021/4/6 上午4:00, Dongli Zhang 写道: >> >> On 4/1/21 8:47 PM, Jason Wang wrote: >>> 在 2021/3/30 下午3:29, Dongli Zhang 写道: On 3/28/21 8:56 PM, Jason Wang wrote: > 在 2021/3/27 上午5:16, Dongli Zhang 写道: >> Hi Jason, >> >> On 3/26/21 12:24 AM, Jason Wang wrote: >>> 在 2021/3/26 下午1:44, Dongli Zhang 写道: The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to the loss of doorbell kick, e.g., https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!KS3pAU2cKjz4wgI4QSlE-YsJPhPG71nkE5_tGhaOf7mi_xvNxbvKkfn03rk5BNDLSEU$ ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler"). This patch introduces a new debug interface 'DeviceEvent' to DeviceClass to help narrow down if the issue is due to loss of irq/kick. So far the new interface handles only two events: 'call' and 'kick'. Any device (e.g., virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X or legacy IRQ). The 'call' is to inject irq on purpose by admin for a specific device (e.g., vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell on purpose by admin at QEMU/host side for a specific device. This device can be used as a workaround if call/kick is lost due to virtualization software (e.g., kernel or QEMU) issue. We may also implement the interface for VFIO PCI, e.g., to write to VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM on purpose. This is considered future work once the virtio part is done. Below is from live crash analysis. Initially, the queue=2 has count=15 for 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no used available. We suspect this is because vhost-scsi was not notified by VM. In order to narrow down and analyze the issue, we use live crash to dump the current counter of eventfd for queue=2. crash> eventfd_ctx 8f67f6bbe700 struct eventfd_ctx { kref = { refcount = { refs = { counter = 4 } } }, wqh = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, head = { next = 0x8f841dc08e18, prev = 0x8f841dc08e18 } }, count = 15, ---> eventfd is 15 !!! flags = 526336 } Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic with this interface. { "execute": "x-debug-device-event", "arguments": { "dev": "/machine/peripheral/vscsi0", "event": "kick", "queue": 2 } } The counter is increased to 16. Suppose the hang issue is resolved, it indicates something bad is in software that the 'kick' is lost. >>> What do you mean by "software" here? And it looks to me you're testing >>> whether >>> event_notifier_set() is called by virtio_queue_notify() here. If so, >>> I'm not >>> sure how much value could we gain from a dedicated debug interface like >>> this >>> consider there're a lot of exisinting general purpose debugging method >>> like >>> tracing or gdb. I'd say the path from virtio_queue_notify() to >>> event_notifier_set() is only a very small fraction of the process of >>> virtqueue >>> kick which is unlikey to be buggy. Consider usually the ioeventfd will >>> be >>> offloaded to KVM, it's more a chance that something is wrong in setuping >>> ioeventfd instead of here. Irq is even more complicated. >> Thank you very much! >> >> I am not testing whether event_notifier_set() is called by >> virtio_queue_notify(). >> >> The 'software' indicates the data processing and event notification >> mechanism >> involved with virtio/vhost PV driver frontend. E.g., while VM is waiting >> for an >> extra IRQ, vhost side did not trigger IRQ, suppose vring_need_event() >> erroneously returns false due to corrupted ring buffer status. > So there
Re: [PATCH 2/2] block/rbd: fix memory leak in qemu_rbd_co_create_opts()
Stefano Garzarella writes: > When we allocate 'q_namespace', we forgot to set 'has_q_namespace' > to true. This can cause several issues, including a memory leak, > since qapi_free_BlockdevCreateOptions() does not deallocate that > memory, as reported by valgrind: > > 13 bytes in 1 blocks are definitely lost in loss record 7 of 96 > at 0x4839809: malloc (vg_replace_malloc.c:307) > by 0x48CEBB8: g_malloc (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x48E3FE3: g_strdup (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x180010: qemu_rbd_co_create_opts (rbd.c:446) > by 0x1AE72C: bdrv_create_co_entry (block.c:492) > by 0x241902: coroutine_trampoline (coroutine-ucontext.c:173) > by 0x57530AF: ??? (in /usr/lib64/libc-2.32.so) > by 0x1FFEFFFA6F: ??? > > Fix setting 'has_q_namespace' to true when we allocate 'q_namespace'. > > Fixes: 19ae9ae014 ("block/rbd: Add support for ceph namespaces") > Signed-off-by: Stefano Garzarella > --- > block/rbd.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/block/rbd.c b/block/rbd.c > index 24cefcd0dc..f098a89c7b 100644 > --- a/block/rbd.c > +++ b/block/rbd.c > @@ -444,6 +444,7 @@ static int coroutine_fn > qemu_rbd_co_create_opts(BlockDriver *drv, > loc->user= g_strdup(qdict_get_try_str(options, "user")); > loc->has_user= !!loc->user; > loc->q_namespace = g_strdup(qdict_get_try_str(options, "namespace")); > +loc->has_q_namespace = !!loc->q_namespace; > loc->image = g_strdup(qdict_get_try_str(options, "image")); > keypairs = qdict_get_try_str(options, "=keyvalue-pairs"); Reviewed-by: Markus Armbruster
Re: [PATCH 1/2] block/rbd: fix memory leak in qemu_rbd_connect()
Stefano Garzarella writes: > In qemu_rbd_connect(), 'mon_host' is allocated by qemu_rbd_mon_host() > using g_strjoinv(), but it's only freed in the error path, leaking > memory in the success path as reported by valgrind: > > 80 bytes in 4 blocks are definitely lost in loss record 5,028 of 6,516 > at 0x4839809: malloc (vg_replace_malloc.c:307) > by 0x5315BB8: g_malloc (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x532B6FF: g_strjoinv (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x87D07E: qemu_rbd_mon_host (rbd.c:538) > by 0x87D07E: qemu_rbd_connect (rbd.c:562) > by 0x87E1CE: qemu_rbd_open (rbd.c:740) > by 0x840EB1: bdrv_open_driver (block.c:1528) > by 0x8453A9: bdrv_open_common (block.c:1802) > by 0x8453A9: bdrv_open_inherit (block.c:3444) > by 0x8464C2: bdrv_open (block.c:3537) > by 0x8108CD: qmp_blockdev_add (blockdev.c:3569) > by 0x8EA61B: qmp_marshal_blockdev_add (qapi-commands-block-core.c:1086) > by 0x90B528: do_qmp_dispatch_bh (qmp-dispatch.c:131) > by 0x907EA4: aio_bh_poll (async.c:164) > > Fix freeing 'mon_host' also when qemu_rbd_connect() ends correctly. > > Signed-off-by: Stefano Garzarella I believe this Fixes: 0a55679b4a5061f4d74bdb1a0e81611ba3390b00 Reviewed-by: Markus Armbruster
Re: issuing [block-]job-complete to jobs in STANDBY state
On Thu, Apr 01, 2021 at 15:02:35 -0400, John Snow wrote: > Hi; downstream we've run into an issue where VMs under heavy load with many > simultaneously concurrent block jobs running might occasionally flicker into > the STANDBY state, during which time they will be unable to receive JOB > COMPLETE commands. I assume this flicker is due to > child_job_drained_begin(). > > BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1945635 > > It's safe to just retry this operation again, but it may be difficult to > understand WHY the job is paused at the application level, since the flush > event may be asynchronous and unpredictable. > > We could define a transition to allow COMPLETE to be applied to STANDBY > jobs, but is there any risk or drawback to doing so? On QMP's side, we do > know the difference between a temporary pause and a user pause/error pause > (Both use the user_pause flag.) I think qemu should allow that, because otherwise it's like playing whack-a-mole. Libvirt could delay the completiion until the 'standby' state is over and the job is running but that would be racy. Additionally, we could even hit the standby state before the state transition would be processed internally, and the error message is too generic to attempt to decide what to do based on it. (We do want to report real errors)
Re: [PATCH for-6.0 v2 4/8] hw/block/nvme: fix controller namespaces array indexing
On Apr 6 09:01, Philippe Mathieu-Daudé wrote: > On 4/5/21 7:54 PM, Klaus Jensen wrote: > > From: Klaus Jensen > > > > The controller namespaces array being 0-indexed requires 'nsid - 1' > > everywhere. Something that is easy to miss. Align the controller > > namespaces array with the subsystem namespaces array such that both are > > 1-indexed. > > TBH I don't understand the justification. Justification is mostly to align with the subsystem device. I like the '1-indexed' approach better. And the -1 causes Coverity to complain before the assert was added. > Assuming you hit a > bug and try to protect yourself, maybe now you should also > check for > > assert(n->namespaces[0] == NULL); > > somewhere. In nvme_ns() maybe? > That is definitely a state that should always hold, I guess we can do that, but we do already guard all "insertions" into the namespace array by an assert on the nsid. Then again, asserting here makes sure that we don't introduce something else that inserts on this invalid position. So, good point, I'll add it. > > > > Signed-off-by: Klaus Jensen > > Reviewed-by: Gollu Appalanaidu > > --- > > hw/block/nvme.h | 8 > > hw/block/nvme.c | 2 +- > > 2 files changed, 5 insertions(+), 5 deletions(-) > > > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h > > index 9edc86d79e98..c610ab30dc5c 100644 > > --- a/hw/block/nvme.h > > +++ b/hw/block/nvme.h > > @@ -217,7 +217,7 @@ typedef struct NvmeCtrl { > > * Attached namespaces to this controller. If subsys is not given, all > > * namespaces in this list will always be attached. > > */ > > -NvmeNamespace *namespaces[NVME_MAX_NAMESPACES]; > > +NvmeNamespace *namespaces[NVME_MAX_NAMESPACES + 1]; > > NvmeSQueue **sq; > > NvmeCQueue **cq; > > NvmeSQueue admin_sq; > > @@ -232,7 +232,7 @@ static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, > > uint32_t nsid) > > return NULL; > > } > > > > -return n->namespaces[nsid - 1]; > > +return n->namespaces[nsid]; > > } > > > > static inline bool nvme_ns_is_attached(NvmeCtrl *n, NvmeNamespace *ns) > > @@ -253,7 +253,7 @@ static inline void nvme_ns_attach(NvmeCtrl *n, > > NvmeNamespace *ns) > > uint32_t nsid = nvme_nsid(ns); > > assert(nsid && nsid <= NVME_MAX_NAMESPACES); > > > > -n->namespaces[nsid - 1] = ns; > > +n->namespaces[nsid] = ns; > > } > > > > static inline void nvme_ns_detach(NvmeCtrl *n, NvmeNamespace *ns) > > @@ -261,7 +261,7 @@ static inline void nvme_ns_detach(NvmeCtrl *n, > > NvmeNamespace *ns) > > uint32_t nsid = nvme_nsid(ns); > > assert(nsid && nsid <= NVME_MAX_NAMESPACES); > > > > -n->namespaces[nsid - 1] = NULL; > > +n->namespaces[nsid] = NULL; > > } > > > > static inline NvmeCQueue *nvme_cq(NvmeRequest *req) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > index c54ec3c9523c..6d31d5b17a0b 100644 > > --- a/hw/block/nvme.c > > +++ b/hw/block/nvme.c > > @@ -5915,7 +5915,7 @@ int nvme_register_namespace(NvmeCtrl *n, > > NvmeNamespace *ns, Error **errp) > > return -1; > > } > > } else { > > -if (n->namespaces[nsid - 1]) { > > +if (n->namespaces[nsid]) { > > error_setg(errp, "namespace id '%d' is already in use", nsid); > > return -1; > > } > > > > -- One of us - No more doubt, silence or taboo about mental illness. signature.asc Description: PGP signature
Re: [PATCH for-6.0 v2 6/8] hw/block/nvme: update dmsrl limit on namespace detachment
On Apr 6 09:10, Philippe Mathieu-Daudé wrote: > On 4/5/21 7:54 PM, Klaus Jensen wrote: > > From: Klaus Jensen > > > > The Non-MDTS DMSRL limit must be recomputed when namespaces are > > detached. > > > > Fixes: 645ce1a70cb6 ("hw/block/nvme: support namespace attachment command") > > Signed-off-by: Klaus Jensen > > Reviewed-by: Gollu Appalanaidu > > --- > > hw/block/nvme.c | 17 + > > 1 file changed, 17 insertions(+) > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > index de0e726dfdd8..3dc51f407671 100644 > > --- a/hw/block/nvme.c > > +++ b/hw/block/nvme.c > > @@ -4876,6 +4876,21 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest > > *req) > > return NVME_NO_COMPLETE; > > } > > > > +static void __nvme_update_dmrsl(NvmeCtrl *n) > > +{ > > +int nsid; > > + > > +for (nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) { > > +NvmeNamespace *ns = nvme_ns(n, nsid); > > +if (!ns) { > > +continue; > > +} > > + > > +n->dmrsl = MIN_NON_ZERO(n->dmrsl, > > +BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1)); > > +} > > +} > > + > > static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns); > > static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req) > > { > > @@ -4925,6 +4940,8 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, > > NvmeRequest *req) > > } > > > > nvme_ns_detach(ctrl, ns); > > + > > +__nvme_update_dmrsl(ctrl); > > } > > Why the '__' prefix? It doesn't seem clearer (I'm not sure there is > a convention, it makes me think of a internal macro expansion use > for preprocessor). > > There are very few uses of this prefix: > > hw/9pfs/cofs.c:21:static ssize_t __readlink(V9fsState *s, V9fsPath > *path, V9fsString *buf) > hw/block/nvme.c:1683:static uint16_t __nvme_zrm_open(NvmeNamespace *ns, > NvmeZone *zone, > hw/block/nvme.c:1742:static void __nvme_advance_zone_wp(NvmeNamespace > *ns, NvmeZone *zone, > hw/block/nvme.c:5213:static void __nvme_select_ns_iocs(NvmeCtrl *n, > NvmeNamespace *ns) > hw/i386/amd_iommu.c:1160:static int __amdvi_int_remap_msi(AMDVIState *iommu, > hw/intc/s390_flic_kvm.c:255:static int __get_all_irqs(KVMS390FLICState > *flic, > hw/net/rocker/rocker_desc.c:199:static bool > __desc_ring_post_desc(DescRing *ring, int err) > hw/net/sungem.c:766:static uint16_t __sungem_mii_read(SunGEMState *s, > uint8_t phy_addr, > hw/ppc/ppc.c:867:static void __cpu_ppc_store_decr(PowerPCCPU *cpu, > uint64_t *nextp, > hw/s390x/pv.c:25:static int __s390_pv_cmd(uint32_t cmd, const char > *cmdname, void *data) > pc-bios/s390-ccw/cio.c:315:static int __do_cio(SubChannelId schid, > uint32_t ccw_addr, int fmt, Irb *irb) > target/ppc/mmu-hash64.c:170:static void __helper_slbie(CPUPPCState *env, > target_ulong addr, > > Thomas, Eric, is it worth cleaning these and updating the > 'CODESTYLE.rst'? > Yeah ok, I think you are right that there is no clear convention on when to use this or not. I typically just use it for functions that are normally not supposed to be called directly. But I don't even think its consistent in the nvme device. For my sake, we can clean it up, I'll drop it in this case since there is no good reason for it other than my own idea of "style". signature.asc Description: PGP signature
Re: [PATCH for-6.0 v2 3/8] hw/block/nvme: fix the nsid 'invalid' value
On Apr 6 08:53, Philippe Mathieu-Daudé wrote: > Hi Klaus, > > On 4/5/21 7:54 PM, Klaus Jensen wrote: > > From: Klaus Jensen > > > > The `nvme_nsid()` function returns '-1' (h) when the given > > namespace is NULL. Since h is actually a valid namespace > > identifier (the "broadcast" value), change this to be '0' since that > > actually *is* the invalid value. > > > > Signed-off-by: Klaus Jensen > > Reviewed-by: Gollu Appalanaidu > > --- > > hw/block/nvme-ns.h | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h > > index 9ab7894fc83e..82340c4b2574 100644 > > --- a/hw/block/nvme-ns.h > > +++ b/hw/block/nvme-ns.h > > @@ -96,7 +96,7 @@ static inline uint32_t nvme_nsid(NvmeNamespace *ns) > > return ns->params.nsid; > > } > > > > -return -1; > > +return 0; > > For 6.1 can you add a NVME_NSID_INVALID definition along > NVME_NSID_BROADCAST and use it here? > Good idea Philippe, I'll write that up, thanks! signature.asc Description: PGP signature
Re: [PATCH for-6.0 v2 6/8] hw/block/nvme: update dmsrl limit on namespace detachment
On 4/5/21 7:54 PM, Klaus Jensen wrote: > From: Klaus Jensen > > The Non-MDTS DMSRL limit must be recomputed when namespaces are > detached. > > Fixes: 645ce1a70cb6 ("hw/block/nvme: support namespace attachment command") > Signed-off-by: Klaus Jensen > Reviewed-by: Gollu Appalanaidu > --- > hw/block/nvme.c | 17 + > 1 file changed, 17 insertions(+) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index de0e726dfdd8..3dc51f407671 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -4876,6 +4876,21 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req) > return NVME_NO_COMPLETE; > } > > +static void __nvme_update_dmrsl(NvmeCtrl *n) > +{ > +int nsid; > + > +for (nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) { > +NvmeNamespace *ns = nvme_ns(n, nsid); > +if (!ns) { > +continue; > +} > + > +n->dmrsl = MIN_NON_ZERO(n->dmrsl, > +BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1)); > +} > +} > + > static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns); > static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req) > { > @@ -4925,6 +4940,8 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, > NvmeRequest *req) > } > > nvme_ns_detach(ctrl, ns); > + > +__nvme_update_dmrsl(ctrl); > } Why the '__' prefix? It doesn't seem clearer (I'm not sure there is a convention, it makes me think of a internal macro expansion use for preprocessor). There are very few uses of this prefix: hw/9pfs/cofs.c:21:static ssize_t __readlink(V9fsState *s, V9fsPath *path, V9fsString *buf) hw/block/nvme.c:1683:static uint16_t __nvme_zrm_open(NvmeNamespace *ns, NvmeZone *zone, hw/block/nvme.c:1742:static void __nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone, hw/block/nvme.c:5213:static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns) hw/i386/amd_iommu.c:1160:static int __amdvi_int_remap_msi(AMDVIState *iommu, hw/intc/s390_flic_kvm.c:255:static int __get_all_irqs(KVMS390FLICState *flic, hw/net/rocker/rocker_desc.c:199:static bool __desc_ring_post_desc(DescRing *ring, int err) hw/net/sungem.c:766:static uint16_t __sungem_mii_read(SunGEMState *s, uint8_t phy_addr, hw/ppc/ppc.c:867:static void __cpu_ppc_store_decr(PowerPCCPU *cpu, uint64_t *nextp, hw/s390x/pv.c:25:static int __s390_pv_cmd(uint32_t cmd, const char *cmdname, void *data) pc-bios/s390-ccw/cio.c:315:static int __do_cio(SubChannelId schid, uint32_t ccw_addr, int fmt, Irb *irb) target/ppc/mmu-hash64.c:170:static void __helper_slbie(CPUPPCState *env, target_ulong addr, Thomas, Eric, is it worth cleaning these and updating the 'CODESTYLE.rst'? Phil.
Re: [PATCH for-6.0 v2 4/8] hw/block/nvme: fix controller namespaces array indexing
On 4/5/21 7:54 PM, Klaus Jensen wrote: > From: Klaus Jensen > > The controller namespaces array being 0-indexed requires 'nsid - 1' > everywhere. Something that is easy to miss. Align the controller > namespaces array with the subsystem namespaces array such that both are > 1-indexed. TBH I don't understand the justification. Assuming you hit a bug and try to protect yourself, maybe now you should also check for assert(n->namespaces[0] == NULL); somewhere. In nvme_ns() maybe? > > Signed-off-by: Klaus Jensen > Reviewed-by: Gollu Appalanaidu > --- > hw/block/nvme.h | 8 > hw/block/nvme.c | 2 +- > 2 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h > index 9edc86d79e98..c610ab30dc5c 100644 > --- a/hw/block/nvme.h > +++ b/hw/block/nvme.h > @@ -217,7 +217,7 @@ typedef struct NvmeCtrl { > * Attached namespaces to this controller. If subsys is not given, all > * namespaces in this list will always be attached. > */ > -NvmeNamespace *namespaces[NVME_MAX_NAMESPACES]; > +NvmeNamespace *namespaces[NVME_MAX_NAMESPACES + 1]; > NvmeSQueue **sq; > NvmeCQueue **cq; > NvmeSQueue admin_sq; > @@ -232,7 +232,7 @@ static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, > uint32_t nsid) > return NULL; > } > > -return n->namespaces[nsid - 1]; > +return n->namespaces[nsid]; > } > > static inline bool nvme_ns_is_attached(NvmeCtrl *n, NvmeNamespace *ns) > @@ -253,7 +253,7 @@ static inline void nvme_ns_attach(NvmeCtrl *n, > NvmeNamespace *ns) > uint32_t nsid = nvme_nsid(ns); > assert(nsid && nsid <= NVME_MAX_NAMESPACES); > > -n->namespaces[nsid - 1] = ns; > +n->namespaces[nsid] = ns; > } > > static inline void nvme_ns_detach(NvmeCtrl *n, NvmeNamespace *ns) > @@ -261,7 +261,7 @@ static inline void nvme_ns_detach(NvmeCtrl *n, > NvmeNamespace *ns) > uint32_t nsid = nvme_nsid(ns); > assert(nsid && nsid <= NVME_MAX_NAMESPACES); > > -n->namespaces[nsid - 1] = NULL; > +n->namespaces[nsid] = NULL; > } > > static inline NvmeCQueue *nvme_cq(NvmeRequest *req) > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index c54ec3c9523c..6d31d5b17a0b 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -5915,7 +5915,7 @@ int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace > *ns, Error **errp) > return -1; > } > } else { > -if (n->namespaces[nsid - 1]) { > +if (n->namespaces[nsid]) { > error_setg(errp, "namespace id '%d' is already in use", nsid); > return -1; > } >
Re: [PATCH for-6.0 v2 3/8] hw/block/nvme: fix the nsid 'invalid' value
Hi Klaus, On 4/5/21 7:54 PM, Klaus Jensen wrote: > From: Klaus Jensen > > The `nvme_nsid()` function returns '-1' (h) when the given > namespace is NULL. Since h is actually a valid namespace > identifier (the "broadcast" value), change this to be '0' since that > actually *is* the invalid value. > > Signed-off-by: Klaus Jensen > Reviewed-by: Gollu Appalanaidu > --- > hw/block/nvme-ns.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h > index 9ab7894fc83e..82340c4b2574 100644 > --- a/hw/block/nvme-ns.h > +++ b/hw/block/nvme-ns.h > @@ -96,7 +96,7 @@ static inline uint32_t nvme_nsid(NvmeNamespace *ns) > return ns->params.nsid; > } > > -return -1; > +return 0; For 6.1 can you add a NVME_NSID_INVALID definition along NVME_NSID_BROADCAST and use it here? > } > > static inline bool nvme_ns_shared(NvmeNamespace *ns) >
Re: [PATCH for-6.0 v2 7/8] hw/block/nvme: fix handling of private namespaces
On 21-04-05 19:54:51, Klaus Jensen wrote: > From: Klaus Jensen > > Prior to this patch, if a private nvme-ns device (that is, a namespace > that is not linked to a subsystem) is wired up to an nvme-subsys linked > nvme controller device, the device fails to verify that the namespace id > is unique within the subsystem. NVM Express v1.4b, Section 6.1.6 ("NSID > and Namespace Usage") states that because the device supports Namespace > Management, "NSIDs *shall* be unique within the NVM subsystem". > > Additionally, prior to this patch, private namespaces are not known to > the subsystem and the namespace is considered exclusive to the > controller with which it is initially wired up to. However, this is not > the definition of a private namespace; per Section 1.6.33 ("private > namespace"), a private namespace is just a namespace that does not > support multipath I/O or namespace sharing, which means "that it is only > able to be attached to one controller at a time". > > Fix this by always allocating namespaces in the subsystem (if one is > linked to the controller), regardsless of the shared/private status of > the namespace. Whether or not the namespace is shareable is controlled > by a new `shared` nvme-ns parameter. > > Finally, this fix allows the nvme-ns `subsys` parameter to be removed, > since the `shared` parameter now serves the purpose of attaching the > namespace to all controllers in the subsystem upon device realization. > It is invalid to have an nvme-ns namespace device with a linked > subsystem without the parent nvme controller device also being linked to > one and since the nvme-ns devices will unconditionally be "attached" (in > QEMU terms that is) to an nvme controller device through an NvmeBus, the > nvme-ns namespace device can always get a reference to the subsystem of > the controller it is explicitly (using 'bus=' parametr) or implicitly > attaching to. > > Fixes: e570768566b3 ("hw/block/nvme: support for shared namespace in > subsystem") > Cc: Minwoo Im > Signed-off-by: Klaus Jensen > Reviewed-by: Gollu Appalanaidu Reviewed-by: Minwoo Im Thanks for the fix.