Re: [PATCH for-6.1? v2 4/7] jobs: Give Job.force_cancel more meaning

2021-07-26 Thread Eric Blake
On Mon, Jul 26, 2021 at 04:46:10PM +0200, Max Reitz wrote:
> We largely have two cancel modes for jobs:
> 
> First, there is actual cancelling.  The job is terminated as soon as
> possible, without trying to reach a consistent result.
> 
> Second, we have mirror in the READY state.  Technically, the job is not
> really cancelled; this is just a different completion mode.  The job
> can still run for an indefinite amount of time while it tries to reach a
> consistent result.
> 
> We want to be able to clearly distinguish which cancel mode a job is in
> (when it has been cancelled).  We can use Job.force_cancel for this, but
> right now it only reflects cancel requests from the user with
> force=true; clearly, though, jobs that do not even distinguish between
> force=false and force=true are effectively always force-cancelled.
> 
> So this patch has Job.force_cancel signify whether the job will
> terminate as soon as possible (force_cancel=true) or whether it will
> effectively remain running despite being "cancelled"
> (force_cancel=false).
> 
> To this end, we let jobs that provide JobDriver.cancel() tell the
> generic job code whether they will terminate as soon as possible or not,
> and for jobs that do not provide that method we assume they will.
> 
> Signed-off-by: Max Reitz 
> ---
>  include/qemu/job.h | 11 ++-
>  block/backup.c |  3 ++-
>  block/mirror.c | 24 ++--
>  job.c  |  6 +-
>  4 files changed, 35 insertions(+), 9 deletions(-)
>

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PULL for-6.1 0/1] Block patches

2021-07-26 Thread Peter Maydell
On Mon, 26 Jul 2021 at 09:53, Stefan Hajnoczi  wrote:
>
> The following changes since commit a2376507f615495b1d16685449ce0ea78c2caf9d:
>
>   Merge remote-tracking branch 'remotes/bonzini-gitlab/tags/for-upstream' 
> into staging (2021-07-24 11:04:57 +0100)
>
> are available in the Git repository at:
>
>   https://gitlab.com/stefanha/qemu.git tags/block-pull-request
>
> for you to fetch changes up to 15a730e7a3aaac180df72cd5730e0617bcf44a5a:
>
>   block/nvme: Fix VFIO_MAP_DMA failed: No space left on device (2021-07-26 
> 09:38:12 +0100)
>
> 
> Pull request
>
> Phil's block/nvme.c ENOSPC fix for newer Linux kernels that return this errno.
>
> 


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/6.1
for any user-visible changes.

-- PMM



[PULL for-6.1 11/11] tests/qtest/nvme-test: add mmio read test

2021-07-26 Thread Klaus Jensen
From: Klaus Jensen 

Add a regression test for mmio read on big-endian hosts.

Signed-off-by: Klaus Jensen 
Reviewed-by: Gollu Appalanaidu 
---
 tests/qtest/nvme-test.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/tests/qtest/nvme-test.c b/tests/qtest/nvme-test.c
index 47e757d7e2af..f8bafb5d70fb 100644
--- a/tests/qtest/nvme-test.c
+++ b/tests/qtest/nvme-test.c
@@ -67,6 +67,30 @@ static void nvmetest_oob_cmb_test(void *obj, void *data, 
QGuestAllocator *alloc)
 g_assert_cmpint(qpci_io_readl(pdev, bar, cmb_bar_size - 1), !=, 
0x44332211);
 }
 
+static void nvmetest_reg_read_test(void *obj, void *data, QGuestAllocator 
*alloc)
+{
+QNvme *nvme = obj;
+QPCIDevice *pdev = &nvme->dev;
+QPCIBar bar;
+uint32_t cap_lo, cap_hi;
+uint64_t cap;
+
+qpci_device_enable(pdev);
+bar = qpci_iomap(pdev, 0, NULL);
+
+cap_lo = qpci_io_readl(pdev, bar, 0x0);
+g_assert_cmpint(NVME_CAP_MQES(cap_lo), ==, 0x7ff);
+
+cap_hi = qpci_io_readl(pdev, bar, 0x4);
+g_assert_cmpint(NVME_CAP_MPSMAX((uint64_t)cap_hi << 32), ==, 0x4);
+
+cap = qpci_io_readq(pdev, bar, 0x0);
+g_assert_cmpint(NVME_CAP_MQES(cap), ==, 0x7ff);
+g_assert_cmpint(NVME_CAP_MPSMAX(cap), ==, 0x4);
+
+qpci_iounmap(pdev, bar);
+}
+
 static void nvmetest_pmr_reg_test(void *obj, void *data, QGuestAllocator 
*alloc)
 {
 QNvme *nvme = obj;
@@ -142,6 +166,8 @@ static void nvme_register_nodes(void)
  &(QOSGraphTestOptions) {
 .edge.extra_device_opts = "pmrdev=pmr0"
 });
+
+qos_add_test("reg-read", "nvme", nvmetest_reg_read_test, NULL);
 }
 
 libqos_init(nvme_register_nodes);
-- 
2.32.0




[PULL for-6.1 09/11] hw/nvme: fix out-of-bounds reads

2021-07-26 Thread Klaus Jensen
From: Klaus Jensen 

Peter noticed that mmio access may read into the NvmeParams member in
the NvmeCtrl struct.

Fix the bounds check.
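
For illustration, a minimal standalone sketch of the difference between the
two checks (BAR_SIZE and mmio_read() are stand-ins, not the QEMU code): the
old "addr < sizeof(bar)" test lets a 4-byte read at offset 0xffd through even
though it runs one byte past the end, while "addr > sizeof(bar) - size"
rejects it.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define BAR_SIZE 4096   /* assumed 4 KiB register file */

  static uint64_t mmio_read(const uint8_t *bar, uint64_t addr, unsigned size)
  {
      uint64_t val = 0;

      /* reject any access whose last byte lies outside the register file */
      if (addr > BAR_SIZE - size) {
          return 0;   /* the real code also logs a guest error */
      }
      memcpy(&val, bar + addr, size);
      return val;
  }

  int main(void)
  {
      uint8_t bar[BAR_SIZE] = { 0 };

      /* rejected: 0xffd + 4 would overrun the register file by one byte */
      printf("0x%llx\n", (unsigned long long)mmio_read(bar, 0xffd, 4));
      return 0;
  }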

Reported-by: Peter Maydell 
Signed-off-by: Klaus Jensen 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Peter Maydell 
---
 hw/nvme/ctrl.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 23ff71f65c0e..10c2363c1d4d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5969,23 +5969,26 @@ static uint64_t nvme_mmio_read(void *opaque, hwaddr 
addr, unsigned size)
 /* should RAZ, fall through for now */
 }
 
-if (addr < sizeof(n->bar)) {
-/*
- * When PMRWBM bit 1 is set then read from
- * from PMRSTS should ensure prior writes
- * made it to persistent media
- */
-if (addr == NVME_REG_PMRSTS &&
-(NVME_PMRCAP_PMRWBM(n->bar.pmrcap) & 0x02)) {
-memory_region_msync(&n->pmr.dev->mr, 0, n->pmr.dev->size);
-}
-memcpy(&val, ptr + addr, size);
-} else {
+if (addr > sizeof(n->bar) - size) {
 NVME_GUEST_ERR(pci_nvme_ub_mmiord_invalid_ofs,
"MMIO read beyond last register,"
" offset=0x%"PRIx64", returning 0", addr);
+
+return 0;
 }
 
+/*
+ * When PMRWBM bit 1 is set then read from
+ * from PMRSTS should ensure prior writes
+ * made it to persistent media
+ */
+if (addr == NVME_REG_PMRSTS &&
+(NVME_PMRCAP_PMRWBM(n->bar.pmrcap) & 0x02)) {
+memory_region_msync(&n->pmr.dev->mr, 0, n->pmr.dev->size);
+}
+
+memcpy(&val, ptr + addr, size);
+
 return val;
 }
 
-- 
2.32.0




[PULL for-6.1 07/11] hw/nvme: split pmrmsc register into upper and lower

2021-07-26 Thread Klaus Jensen
From: Klaus Jensen 

The specification uses a set of 32 bit PMRMSCL and PMRMSCU registers to
make up the 64 bit logical PMRMSC register.

Make it so.
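
As a standalone sketch of what this enables (struct bar and pmrmsc_logical()
are illustrative names, not the QEMU types), the 64-bit logical value is
simply assembled from the two halves when it is needed, widening the upper
half before the shift:

  #include <stdint.h>
  #include <stdio.h>

  struct bar {
      uint32_t pmrmscl;   /* lower half, as in the patched NvmeBar */
      uint32_t pmrmscu;   /* upper half */
  };

  static uint64_t pmrmsc_logical(const struct bar *b)
  {
      /* widen to 64 bits first; shifting a 32-bit value by 32 is undefined */
      return ((uint64_t)b->pmrmscu << 32) | b->pmrmscl;
  }

  int main(void)
  {
      struct bar b = { .pmrmscl = 0x00001002, .pmrmscu = 0x00000001 };

      printf("0x%016llx\n", (unsigned long long)pmrmsc_logical(&b));
      return 0;
  }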

Signed-off-by: Klaus Jensen 
Reviewed-by: Keith Busch 
---
 include/block/nvme.h | 31 ---
 hw/nvme/ctrl.c   | 10 ++
 2 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 527105fafc0b..84053b68b987 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -26,7 +26,8 @@ typedef struct QEMU_PACKED NvmeBar {
 uint32_tpmrsts;
 uint32_tpmrebs;
 uint32_tpmrswtp;
-uint64_tpmrmsc;
+uint32_tpmrmscl;
+uint32_tpmrmscu;
 uint8_t css[484];
 } NvmeBar;
 
@@ -475,25 +476,25 @@ enum NvmePmrswtpMask {
 #define NVME_PMRSWTP_SET_PMRSWTV(pmrswtp, val)   \
 (pmrswtp |= (uint64_t)(val & PMRSWTP_PMRSWTV_MASK) << 
PMRSWTP_PMRSWTV_SHIFT)
 
-enum NvmePmrmscShift {
-PMRMSC_CMSE_SHIFT   = 1,
-PMRMSC_CBA_SHIFT= 12,
+enum NvmePmrmsclShift {
+PMRMSCL_CMSE_SHIFT   = 1,
+PMRMSCL_CBA_SHIFT= 12,
 };
 
-enum NvmePmrmscMask {
-PMRMSC_CMSE_MASK   = 0x1,
-PMRMSC_CBA_MASK= 0xf,
+enum NvmePmrmsclMask {
+PMRMSCL_CMSE_MASK   = 0x1,
+PMRMSCL_CBA_MASK= 0xf,
 };
 
-#define NVME_PMRMSC_CMSE(pmrmsc)\
-((pmrmsc >> PMRMSC_CMSE_SHIFT)   & PMRMSC_CMSE_MASK)
-#define NVME_PMRMSC_CBA(pmrmsc) \
-((pmrmsc >> PMRMSC_CBA_SHIFT)   & PMRMSC_CBA_MASK)
+#define NVME_PMRMSCL_CMSE(pmrmscl)\
+((pmrmscl >> PMRMSCL_CMSE_SHIFT)   & PMRMSCL_CMSE_MASK)
+#define NVME_PMRMSCL_CBA(pmrmscl) \
+((pmrmscl >> PMRMSCL_CBA_SHIFT)   & PMRMSCL_CBA_MASK)
 
-#define NVME_PMRMSC_SET_CMSE(pmrmsc, val)   \
-(pmrmsc |= (uint64_t)(val & PMRMSC_CMSE_MASK) << PMRMSC_CMSE_SHIFT)
-#define NVME_PMRMSC_SET_CBA(pmrmsc, val)   \
-(pmrmsc |= (uint64_t)(val & PMRMSC_CBA_MASK) << PMRMSC_CBA_SHIFT)
+#define NVME_PMRMSCL_SET_CMSE(pmrmscl, val)   \
+(pmrmscl |= (uint32_t)(val & PMRMSCL_CMSE_MASK) << PMRMSCL_CMSE_SHIFT)
+#define NVME_PMRMSCL_SET_CBA(pmrmscl, val)   \
+(pmrmscl |= (uint32_t)(val & PMRMSCL_CBA_MASK) << PMRMSCL_CBA_SHIFT)
 
 enum NvmeSglDescriptorType {
 NVME_SGL_DESCR_TYPE_DATA_BLOCK  = 0x0,
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 2f0524e12a36..070d9f6a962d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5916,11 +5916,13 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
uint64_t data,
 return;
 }
 
-n->bar.pmrmsc = (n->bar.pmrmsc & ~0xffffffff) | (data & 0xffffffff);
+n->bar.pmrmscl = data;
 n->pmr.cmse = false;
 
-if (NVME_PMRMSC_CMSE(n->bar.pmrmsc)) {
-hwaddr cba = NVME_PMRMSC_CBA(n->bar.pmrmsc) << PMRMSC_CBA_SHIFT;
+if (NVME_PMRMSCL_CMSE(n->bar.pmrmscl)) {
+uint64_t pmrmscu = n->bar.pmrmscu;
+hwaddr cba = (pmrmscu << 32) |
+(NVME_PMRMSCL_CBA(n->bar.pmrmscl) << PMRMSCL_CBA_SHIFT);
 if (cba + int128_get64(n->pmr.dev->mr.size) < cba) {
 NVME_PMRSTS_SET_CBAI(n->bar.pmrsts, 1);
 return;
@@ -5936,7 +5938,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
uint64_t data,
 return;
 }
 
-n->bar.pmrmsc = (n->bar.pmrmsc & 0xffffffff) | (data << 32);
+n->bar.pmrmscu = data;
 return;
 default:
 NVME_GUEST_ERR(pci_nvme_ub_mmiowr_invalid,
-- 
2.32.0




[PULL for-6.1 08/11] hw/nvme: use symbolic names for registers

2021-07-26 Thread Klaus Jensen
From: Klaus Jensen 

Add the NvmeBarRegs enum and use these instead of explicit register
offsets.
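
For illustration, a standalone sketch of the offsetof() pattern (a cut-down
Bar struct, not the full NvmeBar): the enum values are derived from the
struct layout, so REG_CC comes out as 0x14, the literal the old code used.

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  typedef struct Bar {
      uint64_t cap;       /* 0x00 */
      uint32_t vs;        /* 0x08 */
      uint32_t intms;     /* 0x0c */
      uint32_t intmc;     /* 0x10 */
      uint32_t cc;        /* 0x14 */
  } Bar;

  enum BarRegs {
      REG_CAP   = offsetof(Bar, cap),
      REG_VS    = offsetof(Bar, vs),
      REG_INTMS = offsetof(Bar, intms),
      REG_INTMC = offsetof(Bar, intmc),
      REG_CC    = offsetof(Bar, cc),
  };

  int main(void)
  {
      printf("REG_CC = 0x%x\n", (unsigned)REG_CC);   /* prints 0x14 */
      return 0;
  }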

Signed-off-by: Klaus Jensen 
Reviewed-by: Gollu Appalanaidu 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Keith Busch 
---
 include/block/nvme.h | 29 -
 hw/nvme/ctrl.c   | 44 ++--
 2 files changed, 50 insertions(+), 23 deletions(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 84053b68b987..77aae0117494 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -9,7 +9,7 @@ typedef struct QEMU_PACKED NvmeBar {
 uint32_tcc;
 uint8_t rsvd24[4];
 uint32_tcsts;
-uint32_tnssrc;
+uint32_tnssr;
 uint32_taqa;
 uint64_tasq;
 uint64_tacq;
@@ -31,6 +31,33 @@ typedef struct QEMU_PACKED NvmeBar {
 uint8_t css[484];
 } NvmeBar;
 
+enum NvmeBarRegs {
+NVME_REG_CAP = offsetof(NvmeBar, cap),
+NVME_REG_VS  = offsetof(NvmeBar, vs),
+NVME_REG_INTMS   = offsetof(NvmeBar, intms),
+NVME_REG_INTMC   = offsetof(NvmeBar, intmc),
+NVME_REG_CC  = offsetof(NvmeBar, cc),
+NVME_REG_CSTS= offsetof(NvmeBar, csts),
+NVME_REG_NSSR= offsetof(NvmeBar, nssr),
+NVME_REG_AQA = offsetof(NvmeBar, aqa),
+NVME_REG_ASQ = offsetof(NvmeBar, asq),
+NVME_REG_ACQ = offsetof(NvmeBar, acq),
+NVME_REG_CMBLOC  = offsetof(NvmeBar, cmbloc),
+NVME_REG_CMBSZ   = offsetof(NvmeBar, cmbsz),
+NVME_REG_BPINFO  = offsetof(NvmeBar, bpinfo),
+NVME_REG_BPRSEL  = offsetof(NvmeBar, bprsel),
+NVME_REG_BPMBL   = offsetof(NvmeBar, bpmbl),
+NVME_REG_CMBMSC  = offsetof(NvmeBar, cmbmsc),
+NVME_REG_CMBSTS  = offsetof(NvmeBar, cmbsts),
+NVME_REG_PMRCAP  = offsetof(NvmeBar, pmrcap),
+NVME_REG_PMRCTL  = offsetof(NvmeBar, pmrctl),
+NVME_REG_PMRSTS  = offsetof(NvmeBar, pmrsts),
+NVME_REG_PMREBS  = offsetof(NvmeBar, pmrebs),
+NVME_REG_PMRSWTP = offsetof(NvmeBar, pmrswtp),
+NVME_REG_PMRMSCL = offsetof(NvmeBar, pmrmscl),
+NVME_REG_PMRMSCU = offsetof(NvmeBar, pmrmscu),
+};
+
 enum NvmeCapShift {
 CAP_MQES_SHIFT = 0,
 CAP_CQR_SHIFT  = 16,
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 070d9f6a962d..23ff71f65c0e 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5740,7 +5740,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
uint64_t data,
 }
 
 switch (offset) {
-case 0xc:   /* INTMS */
+case NVME_REG_INTMS:
 if (unlikely(msix_enabled(&(n->parent_obj)))) {
 NVME_GUEST_ERR(pci_nvme_ub_mmiowr_intmask_with_msix,
"undefined access to interrupt mask set"
@@ -5752,7 +5752,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
uint64_t data,
 trace_pci_nvme_mmio_intm_set(data & 0x, n->bar.intmc);
 nvme_irq_check(n);
 break;
-case 0x10:  /* INTMC */
+case NVME_REG_INTMC:
 if (unlikely(msix_enabled(&(n->parent_obj)))) {
 NVME_GUEST_ERR(pci_nvme_ub_mmiowr_intmask_with_msix,
"undefined access to interrupt mask clr"
@@ -5764,7 +5764,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
uint64_t data,
 trace_pci_nvme_mmio_intm_clr(data & 0x, n->bar.intmc);
 nvme_irq_check(n);
 break;
-case 0x14:  /* CC */
+case NVME_REG_CC:
 trace_pci_nvme_mmio_cfg(data & 0x);
 /* Windows first sends data, then sends enable bit */
 if (!NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc) &&
@@ -5798,7 +5798,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
uint64_t data,
 n->bar.cc = data;
 }
 break;
-case 0x1c:  /* CSTS */
+case NVME_REG_CSTS:
 if (data & (1 << 4)) {
 NVME_GUEST_ERR(pci_nvme_ub_mmiowr_ssreset_w1c_unsupported,
"attempted to W1C CSTS.NSSRO"
@@ -5809,7 +5809,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
uint64_t data,
" of controller status");
 }
 break;
-case 0x20:  /* NSSR */
+case NVME_REG_NSSR:
 if (data == 0x4e564d65) {
 trace_pci_nvme_ub_mmiowr_ssreset_unsupported();
 } else {
@@ -5817,38 +5817,38 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
uint64_t data,
 return;
 }
 break;
-case 0x24:  /* AQA */
+case NVME_REG_AQA:
 n->bar.aqa = data & 0x;
 trace_pci_nvme_mmio_aqattr(data & 0x);
 break;
-case 0x28:  /* ASQ */
+case NVME_REG_ASQ:
 n->bar.asq = size == 8 ? data :
 (n->bar.asq & ~0xULL) | (data & 0x);
 trace_pci_nvme_mmio_asqaddr(data);
 break;
-case 0x2c:  /* ASQ hi */
+case NVME_REG_ASQ + 4:
 n->bar.asq = (n->bar.asq & 0x) | (data << 32);
 trace_pci_nvme_mmio_asqaddr_hi(data, n->bar.asq

[PULL for-6.1 02/11] hw/nvme: mark nvme-subsys non-hotpluggable

2021-07-26 Thread Klaus Jensen
From: Klaus Jensen 

We currently lack the infrastructure to handle subsystem hotplugging, so
disable it.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Klaus Jensen 
---
 hw/nvme/subsys.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 192223d17ca1..dc7a96862f37 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -61,6 +61,7 @@ static void nvme_subsys_class_init(ObjectClass *oc, void 
*data)
 
 dc->realize = nvme_subsys_realize;
 dc->desc = "Virtual NVMe subsystem";
+dc->hotpluggable = false;
 
 device_class_set_props(dc, nvme_subsystem_props);
 }
-- 
2.32.0




[PULL for-6.1 10/11] hw/nvme: fix mmio read

2021-07-26 Thread Klaus Jensen
From: Klaus Jensen 

The new PMR test unearthed a long-standing issue with MMIO reads on
big-endian hosts.

Fix this by unconditionally storing all controller registers in little
endian.
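
For illustration, a standalone sketch of the idea (hand-rolled byte-wise
helpers standing in for QEMU's stl_le_p()/ldl_le_p(), not the QEMU
implementations): if the register file always holds little-endian bytes and
every access goes through such helpers, an MMIO read returns the same value
on little- and big-endian hosts.

  #include <stdint.h>
  #include <stdio.h>

  static void stl_le(void *p, uint32_t v)
  {
      uint8_t *b = p;

      b[0] = v;
      b[1] = v >> 8;
      b[2] = v >> 16;
      b[3] = v >> 24;
  }

  static uint32_t ldl_le(const void *p)
  {
      const uint8_t *b = p;

      return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
             (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
  }

  int main(void)
  {
      uint8_t bar[64] = { 0 };        /* stand-in for the register file */

      stl_le(&bar[0x1c], 0x00000002); /* e.g. a CSTS bit */
      printf("0x%08x\n", ldl_le(&bar[0x1c]));   /* same on any host */
      return 0;
  }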

Cc: Gollu Appalanaidu 
Reported-by: Peter Maydell 
Signed-off-by: Klaus Jensen 
Reviewed-by: Peter Maydell 
---
 hw/nvme/ctrl.c | 291 +++--
 1 file changed, 162 insertions(+), 129 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 10c2363c1d4d..43dfaeac9f54 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -439,10 +439,12 @@ static uint8_t nvme_sq_empty(NvmeSQueue *sq)
 
 static void nvme_irq_check(NvmeCtrl *n)
 {
+uint32_t intms = ldl_le_p(&n->bar.intms);
+
 if (msix_enabled(&(n->parent_obj))) {
 return;
 }
-if (~n->bar.intms & n->irq_status) {
+if (~intms & n->irq_status) {
 pci_irq_assert(&n->parent_obj);
 } else {
 pci_irq_deassert(&n->parent_obj);
@@ -1289,7 +1291,7 @@ static void nvme_post_cqes(void *opaque)
 if (ret) {
 trace_pci_nvme_err_addr_write(addr);
 trace_pci_nvme_err_cfs();
-n->bar.csts = NVME_CSTS_FAILED;
+stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
 break;
 }
 QTAILQ_REMOVE(&cq->req_list, req, entry);
@@ -4022,7 +4024,7 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_err_invalid_create_sq_sqid(sqid);
 return NVME_INVALID_QID | NVME_DNR;
 }
-if (unlikely(!qsize || qsize > NVME_CAP_MQES(n->bar.cap))) {
+if (unlikely(!qsize || qsize > NVME_CAP_MQES(ldq_le_p(&n->bar.cap)))) {
 trace_pci_nvme_err_invalid_create_sq_size(qsize);
 return NVME_MAX_QSIZE_EXCEEDED | NVME_DNR;
 }
@@ -4208,7 +4210,7 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t 
csi, uint32_t buf_len,
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-switch (NVME_CC_CSS(n->bar.cc)) {
+switch (NVME_CC_CSS(ldl_le_p(&n->bar.cc))) {
 case NVME_CC_CSS_NVM:
 src_iocs = nvme_cse_iocs_nvm;
 /* fall through */
@@ -4370,7 +4372,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_err_invalid_create_cq_cqid(cqid);
 return NVME_INVALID_QID | NVME_DNR;
 }
-if (unlikely(!qsize || qsize > NVME_CAP_MQES(n->bar.cap))) {
+if (unlikely(!qsize || qsize > NVME_CAP_MQES(ldq_le_p(&n->bar.cap)))) {
 trace_pci_nvme_err_invalid_create_cq_size(qsize);
 return NVME_MAX_QSIZE_EXCEEDED | NVME_DNR;
 }
@@ -5163,17 +5165,19 @@ static void nvme_update_dmrsl(NvmeCtrl *n)
 
 static void nvme_select_iocs_ns(NvmeCtrl *n, NvmeNamespace *ns)
 {
+uint32_t cc = ldl_le_p(&n->bar.cc);
+
 ns->iocs = nvme_cse_iocs_none;
 switch (ns->csi) {
 case NVME_CSI_NVM:
-if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
+if (NVME_CC_CSS(cc) != NVME_CC_CSS_ADMIN_ONLY) {
 ns->iocs = nvme_cse_iocs_nvm;
 }
 break;
 case NVME_CSI_ZONED:
-if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_CSI) {
+if (NVME_CC_CSS(cc) == NVME_CC_CSS_CSI) {
 ns->iocs = nvme_cse_iocs_zoned;
-} else if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_NVM) {
+} else if (NVME_CC_CSS(cc) == NVME_CC_CSS_NVM) {
 ns->iocs = nvme_cse_iocs_nvm;
 }
 break;
@@ -5510,7 +5514,7 @@ static void nvme_process_sq(void *opaque)
 if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
 trace_pci_nvme_err_addr_read(addr);
 trace_pci_nvme_err_cfs();
-n->bar.csts = NVME_CSTS_FAILED;
+stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
 break;
 }
 nvme_inc_sq_head(sq);
@@ -5565,8 +5569,6 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
 n->aer_queued = 0;
 n->outstanding_aers = 0;
 n->qs_created = false;
-
-n->bar.cc = 0;
 }
 
 static void nvme_ctrl_shutdown(NvmeCtrl *n)
@@ -5605,7 +5607,12 @@ static void nvme_select_iocs(NvmeCtrl *n)
 
 static int nvme_start_ctrl(NvmeCtrl *n)
 {
-uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
+uint64_t cap = ldq_le_p(&n->bar.cap);
+uint32_t cc = ldl_le_p(&n->bar.cc);
+uint32_t aqa = ldl_le_p(&n->bar.aqa);
+uint64_t asq = ldq_le_p(&n->bar.asq);
+uint64_t acq = ldq_le_p(&n->bar.acq);
+uint32_t page_bits = NVME_CC_MPS(cc) + 12;
 uint32_t page_size = 1 << page_bits;
 
 if (unlikely(n->cq[0])) {
@@ -5616,73 +5623,72 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 trace_pci_nvme_err_startfail_sq();
 return -1;
 }
-if (unlikely(!n->bar.asq)) {
+if (unlikely(!asq)) {
 trace_pci_nvme_err_startfail_nbarasq();
 return -1;
 }
-if (unlikely(!n->bar.acq)) {
+if (unlikely(!acq)) {
 trace_pci_nvme_err_startfail_nbaracq();
 return -1;
 }
-if (unlikely(n->bar.asq & (page_size - 1))) {
-trace_pci_nvme_err_startfail_asq_misal

[PULL for-6.1 06/11] hw/nvme: fix controller hot unplugging

2021-07-26 Thread Klaus Jensen
From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.
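
For illustration, a standalone sketch of the parenting decision (the structs
below are bare stand-ins, not the QEMU types): when the controller is linked
to a subsystem, the namespace hangs off the subsystem's bus and therefore
survives removal of the controller.

  #include <stdio.h>

  typedef struct Bus { const char *owner; } Bus;
  typedef struct Subsystem { Bus bus; } Subsystem;
  typedef struct Ctrl { Bus bus; Subsystem *subsys; } Ctrl;

  /* mirrors the realize-time reparenting this patch adds */
  static Bus *namespace_parent_bus(Ctrl *ctrl)
  {
      return ctrl->subsys ? &ctrl->subsys->bus : &ctrl->bus;
  }

  int main(void)
  {
      Subsystem subsys = { .bus = { "nvme-subsys" } };
      Ctrl linked = { .bus = { "nvme-ctrl" }, .subsys = &subsys };
      Ctrl standalone = { .bus = { "nvme-ctrl" }, .subsys = NULL };

      printf("%s\n", namespace_parent_bus(&linked)->owner);     /* nvme-subsys */
      printf("%s\n", namespace_parent_bus(&standalone)->owner); /* nvme-ctrl */
      return 0;
  }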

Reviewed-by: Hannes Reinecke 
Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 15 ---
 hw/nvme/ctrl.c   | 14 ++
 hw/nvme/ns.c | 18 ++
 hw/nvme/subsys.c |  3 +++
 4 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..83ffabade4cf 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,20 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST 
- 1);
 typedef struct NvmeCtrl NvmeCtrl;
 typedef struct NvmeNamespace NvmeNamespace;
 
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+BusState parent_bus;
+} NvmeBus;
+
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
 OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
 
 typedef struct NvmeSubsystem {
 DeviceState parent_obj;
+NvmeBus bus;
 uint8_t subnqn[256];
 
 NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +373,6 @@ typedef struct NvmeCQueue {
 QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
 
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-BusState parent_bus;
-} NvmeBus;
-
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
 OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index ead7531bde5e..2f0524e12a36 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6527,16 +6527,14 @@ static void nvme_exit(PCIDevice *pci_dev)
 
 nvme_ctrl_reset(n);
 
-for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-ns = nvme_ns(n, i);
-if (!ns) {
-continue;
+if (n->subsys) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ns = nvme_ns(n, i);
+if (ns) {
+ns->attached--;
+}
 }
 
-nvme_ns_cleanup(ns);
-}
-
-if (n->subsys) {
 nvme_subsys_unregister_ctrl(n->subsys, n);
 }
 
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..b7cf1494e75b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -441,6 +441,15 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 }
 }
 
+static void nvme_ns_unrealize(DeviceState *dev)
+{
+NvmeNamespace *ns = NVME_NS(dev);
+
+nvme_ns_drain(ns);
+nvme_ns_shutdown(ns);
+nvme_ns_cleanup(ns);
+}
+
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
 {
 NvmeNamespace *ns = NVME_NS(dev);
@@ -462,6 +471,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
"linked to an nvme-subsys device");
 return;
 }
+} else {
+/*
+ * If this namespace belongs to a subsystem (through a link on the
+ * controller device), reparent the device.
+ */
+if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+return;
+}
 }
 
 if (nvme_ns_setup(ns, errp)) {
@@ -552,6 +569,7 @@ static void nvme_ns_class_init(ObjectClass *oc, void *data)
 
 dc->bus_type = TYPE_NVME_BUS;
 dc->realize = nvme_ns_realize;
+dc->unrealize = nvme_ns_unrealize;
 device_class_set_props(dc, nvme_ns_props);
 dc->desc = "Virtual NVMe namespace";
 }
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 92caa604a280..93c35950d69d 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -50,6 +50,9 @@ static void nvme_subsys_realize(DeviceState *dev, Error 
**errp)
 {
 NvmeSubsystem *subsys = NVME_SUBSYS(dev);
 
+qbus_create_inplace(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev,
+dev->id);
+
 nvme_subsys_setup(subsys);
 }
 
-- 
2.32.0




[PULL for-6.1 05/11] tests/qtest/nvme-test: add persistent memory region test

2021-07-26 Thread Klaus Jensen
From: Gollu Appalanaidu 

This will test the PMR functionality.

Signed-off-by: Gollu Appalanaidu 
Reviewed-by: Klaus Jensen 
[k.jensen: replaced memory-backend-file with memory-backend-ram]
Signed-off-by: Klaus Jensen 
---
 tests/qtest/nvme-test.c | 61 -
 1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/nvme-test.c b/tests/qtest/nvme-test.c
index d32c953a3824..47e757d7e2af 100644
--- a/tests/qtest/nvme-test.c
+++ b/tests/qtest/nvme-test.c
@@ -13,6 +13,7 @@
 #include "libqos/libqtest.h"
 #include "libqos/qgraph.h"
 #include "libqos/pci.h"
+#include "include/block/nvme.h"
 
 typedef struct QNvme QNvme;
 
@@ -66,12 +67,65 @@ static void nvmetest_oob_cmb_test(void *obj, void *data, 
QGuestAllocator *alloc)
 g_assert_cmpint(qpci_io_readl(pdev, bar, cmb_bar_size - 1), !=, 
0x44332211);
 }
 
+static void nvmetest_pmr_reg_test(void *obj, void *data, QGuestAllocator 
*alloc)
+{
+QNvme *nvme = obj;
+QPCIDevice *pdev = &nvme->dev;
+QPCIBar pmr_bar, nvme_bar;
+uint32_t pmrcap, pmrsts;
+
+qpci_device_enable(pdev);
+pmr_bar = qpci_iomap(pdev, 4, NULL);
+
+/* Without enabling PMRCTL, check BAR enablement */
+qpci_io_writel(pdev, pmr_bar, 0, 0xccbbaa99);
+g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), !=, 0x99);
+g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), !=, 0xaa99);
+
+/* Map NVMe Bar Register to Enable the Mem Region */
+nvme_bar = qpci_iomap(pdev, 0, NULL);
+
+pmrcap = qpci_io_readl(pdev, nvme_bar, 0xe00);
+g_assert_cmpint(NVME_PMRCAP_RDS(pmrcap), ==, 0x1);
+g_assert_cmpint(NVME_PMRCAP_WDS(pmrcap), ==, 0x1);
+g_assert_cmpint(NVME_PMRCAP_BIR(pmrcap), ==, 0x4);
+g_assert_cmpint(NVME_PMRCAP_PMRWBM(pmrcap), ==, 0x2);
+g_assert_cmpint(NVME_PMRCAP_CMSS(pmrcap), ==, 0x1);
+
+/* Enable PMRCTRL */
+qpci_io_writel(pdev, nvme_bar, 0xe04, 0x1);
+
+qpci_io_writel(pdev, pmr_bar, 0, 0x44332211);
+g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), ==, 0x11);
+g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), ==, 0x2211);
+g_assert_cmpint(qpci_io_readl(pdev, pmr_bar, 0), ==, 0x44332211);
+
+pmrsts = qpci_io_readl(pdev, nvme_bar, 0xe08);
+g_assert_cmpint(NVME_PMRSTS_NRDY(pmrsts), ==, 0x0);
+
+/* Disable PMRCTRL */
+qpci_io_writel(pdev, nvme_bar, 0xe04, 0x0);
+
+qpci_io_writel(pdev, pmr_bar, 0, 0x88776655);
+g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), !=, 0x55);
+g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), !=, 0x6655);
+g_assert_cmpint(qpci_io_readl(pdev, pmr_bar, 0), !=, 0x88776655);
+
+pmrsts = qpci_io_readl(pdev, nvme_bar, 0xe08);
+g_assert_cmpint(NVME_PMRSTS_NRDY(pmrsts), ==, 0x1);
+
+qpci_iounmap(pdev, nvme_bar);
+qpci_iounmap(pdev, pmr_bar);
+}
+
 static void nvme_register_nodes(void)
 {
 QOSGraphEdgeOptions opts = {
 .extra_device_opts = "addr=04.0,drive=drv0,serial=foo",
 .before_cmd_line = "-drive id=drv0,if=none,file=null-co://,"
-   "file.read-zeroes=on,format=raw",
+   "file.read-zeroes=on,format=raw "
+   "-object memory-backend-ram,id=pmr0,"
+   "share=on,size=8",
 };
 
 add_qpci_address(&opts, &(QPCIAddress) { .devfn = QPCI_DEVFN(4, 0) });
@@ -83,6 +137,11 @@ static void nvme_register_nodes(void)
 qos_add_test("oob-cmb-access", "nvme", nvmetest_oob_cmb_test, 
&(QOSGraphTestOptions) {
 .edge.extra_device_opts = "cmb_size_mb=2"
 });
+
+qos_add_test("pmr-test-access", "nvme", nvmetest_pmr_reg_test,
+ &(QOSGraphTestOptions) {
+.edge.extra_device_opts = "pmrdev=pmr0"
+});
 }
 
 libqos_init(nvme_register_nodes);
-- 
2.32.0




[PULL for-6.1 04/11] hw/nvme: error handling for too many mappings

2021-07-26 Thread Klaus Jensen
From: Padmakar Kalghatgi 

If the number of PRP/SGL mappings exceeds 1024, reads and writes will
fail because of an internal QEMU limitation of max 1024 vectors.
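
For illustration, a standalone sketch of the guard (iov_list and
append_mapping() are illustrative names, not the QEMU code): every
prospective mapping is checked against the vector limit before being added,
so the command fails with a status code instead of tripping over the
limitation deep in the block layer.

  #include <limits.h>   /* IOV_MAX on POSIX systems */
  #include <stdio.h>

  #ifndef IOV_MAX
  #define IOV_MAX 1024
  #endif

  struct iov_list {
      int niov;
  };

  static int append_mapping(struct iov_list *l)
  {
      if (l->niov + 1 > IOV_MAX) {
          return -1;   /* corresponds to an internal device error here */
      }
      l->niov++;
      return 0;
  }

  int main(void)
  {
      struct iov_list l = { .niov = 0 };
      int rc = 0;

      for (int i = 0; i < IOV_MAX + 1 && rc == 0; i++) {
          rc = append_mapping(&l);
      }
      printf("stopped at %d mappings, rc=%d\n", l.niov, rc);
      return 0;
  }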

Signed-off-by: Padmakar Kalghatgi 
Reviewed-by: Klaus Jensen 
[k.jensen: changed the error message to be more generic]
Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c   | 13 +
 hw/nvme/trace-events |  1 +
 2 files changed, 14 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..ead7531bde5e 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -623,6 +623,10 @@ static uint16_t nvme_map_addr(NvmeCtrl *n, NvmeSg *sg, 
hwaddr addr, size_t len)
 return NVME_INVALID_USE_OF_CMB | NVME_DNR;
 }
 
+if (sg->iov.niov + 1 > IOV_MAX) {
+goto max_mappings_exceeded;
+}
+
 if (cmb) {
 return nvme_map_addr_cmb(n, &sg->iov, addr, len);
 } else {
@@ -634,9 +638,18 @@ static uint16_t nvme_map_addr(NvmeCtrl *n, NvmeSg *sg, 
hwaddr addr, size_t len)
 return NVME_INVALID_USE_OF_CMB | NVME_DNR;
 }
 
+if (sg->qsg.nsg + 1 > IOV_MAX) {
+goto max_mappings_exceeded;
+}
+
 qemu_sglist_add(&sg->qsg, addr, len);
 
 return NVME_SUCCESS;
+
+max_mappings_exceeded:
+NVME_GUEST_ERR(pci_nvme_ub_too_many_mappings,
+   "number of mappings exceed 1024");
+return NVME_INTERNAL_DEV_ERROR | NVME_DNR;
 }
 
 static inline bool nvme_addr_is_dma(NvmeCtrl *n, hwaddr addr)
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index f9a1f14e2638..430eeb395b24 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -199,3 +199,4 @@ pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t 
new_head) "completion qu
 pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write 
for nonexistent queue, sqid=%"PRIu32", ignoring"
 pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission 
queue doorbell write value beyond queue size, sqid=%"PRIu32", 
new_head=%"PRIu16", ignoring"
 pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
+pci_nvme_ub_too_many_mappings(void) "too many prp/sgl mappings"
-- 
2.32.0




[PULL for-6.1 01/11] hw/nvme: remove NvmeCtrl parameter from ns setup/check functions

2021-07-26 Thread Klaus Jensen
From: Klaus Jensen 

The nvme_ns_setup and nvme_ns_check_constraints functions should not depend
on the controller state.  Refactor them and remove the NvmeCtrl parameter.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h |  2 +-
 hw/nvme/ctrl.c |  2 +-
 hw/nvme/ns.c   | 37 ++---
 3 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 56f8eceed2ad..0868359a1e86 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -246,7 +246,7 @@ static inline void nvme_aor_dec_active(NvmeNamespace *ns)
 }
 
 void nvme_ns_init_format(NvmeNamespace *ns);
-int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
+int nvme_ns_setup(NvmeNamespace *ns, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_shutdown(NvmeNamespace *ns);
 void nvme_ns_cleanup(NvmeNamespace *ns);
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 629b0d38c2a2..dd1801510032 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6498,7 +6498,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 ns = &n->namespace;
 ns->params.nsid = 1;
 
-if (nvme_ns_setup(n, ns, errp)) {
+if (nvme_ns_setup(ns, errp)) {
 return;
 }
 
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 4275c3db6301..3c4f5b8c714a 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -346,8 +346,7 @@ static void nvme_zoned_ns_shutdown(NvmeNamespace *ns)
 assert(ns->nr_open_zones == 0);
 }
 
-static int nvme_ns_check_constraints(NvmeCtrl *n, NvmeNamespace *ns,
- Error **errp)
+static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
 {
 if (!ns->blkconf.blk) {
 error_setg(errp, "block backend not configured");
@@ -366,20 +365,6 @@ static int nvme_ns_check_constraints(NvmeCtrl *n, 
NvmeNamespace *ns,
 return -1;
 }
 
-if (!n->subsys) {
-if (ns->params.detached) {
-error_setg(errp, "detached requires that the nvme device is "
-   "linked to an nvme-subsys device");
-return -1;
-}
-
-if (ns->params.shared) {
-error_setg(errp, "shared requires that the nvme device is "
-   "linked to an nvme-subsys device");
-return -1;
-}
-}
-
 if (ns->params.zoned) {
 if (ns->params.max_active_zones) {
 if (ns->params.max_open_zones > ns->params.max_active_zones) {
@@ -411,9 +396,9 @@ static int nvme_ns_check_constraints(NvmeCtrl *n, 
NvmeNamespace *ns,
 return 0;
 }
 
-int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+int nvme_ns_setup(NvmeNamespace *ns, Error **errp)
 {
-if (nvme_ns_check_constraints(n, ns, errp)) {
+if (nvme_ns_check_constraints(ns, errp)) {
 return -1;
 }
 
@@ -465,7 +450,21 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 uint32_t nsid = ns->params.nsid;
 int i;
 
-if (nvme_ns_setup(n, ns, errp)) {
+if (!n->subsys) {
+if (ns->params.detached) {
+error_setg(errp, "detached requires that the nvme device is "
+   "linked to an nvme-subsys device");
+return;
+}
+
+if (ns->params.shared) {
+error_setg(errp, "shared requires that the nvme device is "
+   "linked to an nvme-subsys device");
+return;
+}
+}
+
+if (nvme_ns_setup(ns, errp)) {
 return;
 }
 
-- 
2.32.0




[PULL for-6.1 03/11] hw/nvme: unregister controller with subsystem at exit

2021-07-26 Thread Klaus Jensen
From: Klaus Jensen 

Make sure the controller is unregistered from the subsystem when the
device is removed.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 1 +
 hw/nvme/ctrl.c   | 4 
 hw/nvme/subsys.c | 5 +
 3 files changed, 10 insertions(+)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 0868359a1e86..c4065467d877 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -50,6 +50,7 @@ typedef struct NvmeSubsystem {
 } NvmeSubsystem;
 
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp);
+void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n);
 
 static inline NvmeCtrl *nvme_subsys_ctrl(NvmeSubsystem *subsys,
  uint32_t cntlid)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index dd1801510032..90e3ee2b70ee 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6523,6 +6523,10 @@ static void nvme_exit(PCIDevice *pci_dev)
 nvme_ns_cleanup(ns);
 }
 
+if (n->subsys) {
+nvme_subsys_unregister_ctrl(n->subsys, n);
+}
+
 g_free(n->cq);
 g_free(n->sq);
 g_free(n->aer_reqs);
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index dc7a96862f37..92caa604a280 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -32,6 +32,11 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 return cntlid;
 }
 
+void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n)
+{
+subsys->ctrls[n->cntlid] = NULL;
+}
+
 static void nvme_subsys_setup(NvmeSubsystem *subsys)
 {
 const char *nqn = subsys->params.nqn ?
-- 
2.32.0




[PULL for-6.1 00/11] hw/nvme fixes

2021-07-26 Thread Klaus Jensen
From: Klaus Jensen 

Hi Peter,

The following changes since commit 1d6f147f043bece029a795c6eb9d43c1abd909b6:

  Merge remote-tracking branch 'remotes/quic/tags/pull-hex-20210725' into 
staging (2021-07-26 13:36:51 +0100)

are available in the Git repository at:

  git://git.infradead.org/qemu-nvme.git tags/nvme-next-pull-request

for you to fetch changes up to 9631a8ab21679e3d605f7f540dd8c692b9593e02:

  tests/qtest/nvme-test: add mmio read test (2021-07-26 21:09:39 +0200)


hw/nvme fixes

* new PMR test (Gollu Appalanaidu)
* prp/sgl mapping fix (Padmakar Kalghatgi)
* hotplug fixes (me)
* mmio out-of-bound read fix (me)
* big-endian host fixes (me)



Gollu Appalanaidu (1):
  tests/qtest/nvme-test: add persistent memory region test

Klaus Jensen (9):
  hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
  hw/nvme: mark nvme-subsys non-hotpluggable
  hw/nvme: unregister controller with subsystem at exit
  hw/nvme: fix controller hot unplugging
  hw/nvme: split pmrmsc register into upper and lower
  hw/nvme: use symbolic names for registers
  hw/nvme: fix out-of-bounds reads
  hw/nvme: fix mmio read
  tests/qtest/nvme-test: add mmio read test

Padmakar Kalghatgi (1):
  hw/nvme: error handling for too many mappings

 hw/nvme/nvme.h  |  18 +-
 include/block/nvme.h|  60 +--
 hw/nvme/ctrl.c  | 379 +++-
 hw/nvme/ns.c|  55 --
 hw/nvme/subsys.c|   9 +
 tests/qtest/nvme-test.c |  87 -
 hw/nvme/trace-events|   1 +
 7 files changed, 402 insertions(+), 207 deletions(-)

-- 
2.32.0




Re: [PATCH] raw-format: drop WRITE and RESIZE child perms when possible

2021-07-26 Thread Stefan Hajnoczi
On Mon, Jul 26, 2021 at 05:42:47PM +0200, Kevin Wolf wrote:
> Am 26.07.2021 um 16:41 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > 26.07.2021 15:28, Stefan Hajnoczi wrote:
> > > The following command-line fails due to a permissions conflict:
> > > 
> > >$ qemu-storage-daemon \
> > >--blockdev 
> > > driver=nvme,node-name=nvme0,device=:08:00.0,namespace=1 \
> > >--blockdev 
> > > driver=raw,node-name=l1-1,file=nvme0,offset=0,size=1073741824 \
> > >--blockdev 
> > > driver=raw,node-name=l1-2,file=nvme0,offset=1073741824,size=1073741824 \
> > >--nbd-server 
> > > addr.type=unix,addr.path=/tmp/nbd.sock,max-connections=2 \
> > >--export type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on 
> > > \
> > >--export type=nbd,id=nbd-l1-2,node-name=l1-2,name=l1-2,writable=on
> > > 
> > >qemu-storage-daemon: --export 
> > > type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on: Permission 
> > > conflict on node 'nvme0': permissions 'resize' are both required by node 
> > > 'l1-1' (uses node 'nvme0' as 'file' child) and unshared by node 'l1-2' 
> > > (uses node 'nvme0' as 'file' child).
> > > 
> > > The problem is that block/raw-format.c relies on bdrv_default_perms() to
> > > set permissions on the nvme node. The default permissions add RESIZE in
> > > anticipation of a format driver like qcow2 that needs to grow the image
> > > file. This fails because RESIZE is unshared, so we cannot get the RESIZE
> > > permission.
> > > 
> > > Max Reitz pointed out that block/crypto.c already handles this case by
> > > implementing a custom ->bdrv_child_perm() function that adjusts the
> > > result of bdrv_default_perms().
> > > 
> > > This patch takes the same approach in block/raw-format.c so that RESIZE
> > > is only required if it's actually necessary (e.g. the parent is qcow2).
> > > 
> > > Cc: Max Reitz 
> > > Cc: Kevin Wolf 
> > > Signed-off-by: Stefan Hajnoczi 
> > > ---
> > > This is not a bug fix, so I didn't mark it for QEMU 6.1. It's new
> > > behavior that hasn't been supported before. I want to split an NVMe
> > > drive using the raw format's offset=/size= feature.
> > > ---
> > >   block/raw-format.c | 21 -
> > >   1 file changed, 20 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/raw-format.c b/block/raw-format.c
> > > index 7717578ed6..c26f493688 100644
> > > --- a/block/raw-format.c
> > > +++ b/block/raw-format.c
> > > @@ -580,6 +580,25 @@ static void raw_cancel_in_flight(BlockDriverState 
> > > *bs)
> > >   bdrv_cancel_in_flight(bs->file->bs);
> > >   }
> > > +static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,
> > > +   BdrvChildRole role,
> > > +   BlockReopenQueue *reopen_queue,
> > > +   uint64_t parent_perm, uint64_t parent_shared,
> > > +   uint64_t *nperm, uint64_t *nshared)
> > > +{
> > > +bdrv_default_perms(bs, c, role, reopen_queue, parent_perm,
> > > +   parent_shared, nperm, nshared);
> > > +
> > > +/*
> > > + * bdrv_default_perms() may add WRITE and/or RESIZE (see comment in
> > > + * bdrv_default_perms_for_storage() for an explanation) but we only 
> > > need
> > > + * them if they are in parent_perm. Drop WRITE and RESIZE whenever 
> > > possible
> > > + * to avoid permission conflicts.
> > > + */
> > > +*nperm &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
> > > +*nperm |= parent_perm & (BLK_PERM_WRITE | BLK_PERM_RESIZE);
> > > +}
> > > +
> > >   BlockDriver bdrv_raw = {
> > >   .format_name  = "raw",
> > >   .instance_size= sizeof(BDRVRawState),
> > > @@ -588,7 +607,7 @@ BlockDriver bdrv_raw = {
> > >   .bdrv_reopen_commit   = &raw_reopen_commit,
> > >   .bdrv_reopen_abort= &raw_reopen_abort,
> > >   .bdrv_open= &raw_open,
> > > -.bdrv_child_perm  = bdrv_default_perms,
> > > +.bdrv_child_perm  = raw_child_perm,
> > >   .bdrv_co_create_opts  = &raw_co_create_opts,
> > >   .bdrv_co_preadv   = &raw_co_preadv,
> > >   .bdrv_co_pwritev  = &raw_co_pwritev,
> > > 
> > 
> > I think it's OK:
> > 
> > Reviewed-by: Vladimir Sementsov-Ogievskiy 
> > 
> > 
> > Still, did you consider the alternative of making the
> > bdrv_filter_default_perms() function public and just doing
> > ".bdrv_child_perm = bdrv_filter_default_perms," here?
> > 
> > raw_format is not considered to be a filter, but for its permissions I
> > think it works exactly like a filter.
> 
> I had the same thought, but then commit 69dca43d6b6 explicitly made the
> opposite change. I seem to remember that Max never liked raw being
> treated like a filter much.

Additionally:

  static int raw_open(BlockDriverState *bs, QDict *options, int flags,
  Error **errp)
  {
  ...
  /*
   * Without offset and a size limit, this driver behaves very much
   * like a filter.  With any such l

Re: [PATCH] raw-format: drop WRITE and RESIZE child perms when possible

2021-07-26 Thread Kevin Wolf
Am 26.07.2021 um 16:41 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 26.07.2021 15:28, Stefan Hajnoczi wrote:
> > The following command-line fails due to a permissions conflict:
> > 
> >$ qemu-storage-daemon \
> >--blockdev 
> > driver=nvme,node-name=nvme0,device=:08:00.0,namespace=1 \
> >--blockdev 
> > driver=raw,node-name=l1-1,file=nvme0,offset=0,size=1073741824 \
> >--blockdev 
> > driver=raw,node-name=l1-2,file=nvme0,offset=1073741824,size=1073741824 \
> >--nbd-server 
> > addr.type=unix,addr.path=/tmp/nbd.sock,max-connections=2 \
> >--export type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on \
> >--export type=nbd,id=nbd-l1-2,node-name=l1-2,name=l1-2,writable=on
> > 
> >qemu-storage-daemon: --export 
> > type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on: Permission 
> > conflict on node 'nvme0': permissions 'resize' are both required by node 
> > 'l1-1' (uses node 'nvme0' as 'file' child) and unshared by node 'l1-2' 
> > (uses node 'nvme0' as 'file' child).
> > 
> > The problem is that block/raw-format.c relies on bdrv_default_perms() to
> > set permissions on the nvme node. The default permissions add RESIZE in
> > anticipation of a format driver like qcow2 that needs to grow the image
> > file. This fails because RESIZE is unshared, so we cannot get the RESIZE
> > permission.
> > 
> > Max Reitz pointed out that block/crypto.c already handles this case by
> > implementing a custom ->bdrv_child_perm() function that adjusts the
> > result of bdrv_default_perms().
> > 
> > This patch takes the same approach in block/raw-format.c so that RESIZE
> > is only required if it's actually necessary (e.g. the parent is qcow2).
> > 
> > Cc: Max Reitz 
> > Cc: Kevin Wolf 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> > This is not a bug fix, so I didn't mark it for QEMU 6.1. It's new
> > behavior that hasn't been supported before. I want to split an NVMe
> > drive using the raw format's offset=/size= feature.
> > ---
> >   block/raw-format.c | 21 -
> >   1 file changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/raw-format.c b/block/raw-format.c
> > index 7717578ed6..c26f493688 100644
> > --- a/block/raw-format.c
> > +++ b/block/raw-format.c
> > @@ -580,6 +580,25 @@ static void raw_cancel_in_flight(BlockDriverState *bs)
> >   bdrv_cancel_in_flight(bs->file->bs);
> >   }
> > +static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,
> > +   BdrvChildRole role,
> > +   BlockReopenQueue *reopen_queue,
> > +   uint64_t parent_perm, uint64_t parent_shared,
> > +   uint64_t *nperm, uint64_t *nshared)
> > +{
> > +bdrv_default_perms(bs, c, role, reopen_queue, parent_perm,
> > +   parent_shared, nperm, nshared);
> > +
> > +/*
> > + * bdrv_default_perms() may add WRITE and/or RESIZE (see comment in
> > + * bdrv_default_perms_for_storage() for an explanation) but we only 
> > need
> > + * them if they are in parent_perm. Drop WRITE and RESIZE whenever 
> > possible
> > + * to avoid permission conflicts.
> > + */
> > +*nperm &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
> > +*nperm |= parent_perm & (BLK_PERM_WRITE | BLK_PERM_RESIZE);
> > +}
> > +
> >   BlockDriver bdrv_raw = {
> >   .format_name  = "raw",
> >   .instance_size= sizeof(BDRVRawState),
> > @@ -588,7 +607,7 @@ BlockDriver bdrv_raw = {
> >   .bdrv_reopen_commit   = &raw_reopen_commit,
> >   .bdrv_reopen_abort= &raw_reopen_abort,
> >   .bdrv_open= &raw_open,
> > -.bdrv_child_perm  = bdrv_default_perms,
> > +.bdrv_child_perm  = raw_child_perm,
> >   .bdrv_co_create_opts  = &raw_co_create_opts,
> >   .bdrv_co_preadv   = &raw_co_preadv,
> >   .bdrv_co_pwritev  = &raw_co_pwritev,
> > 
> 
> I think it's OK:
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> 
> 
> Still, did you consider the alternative of making the
> bdrv_filter_default_perms() function public and just doing
> ".bdrv_child_perm = bdrv_filter_default_perms," here?
> 
> raw_format is not considered to be a filter, but for its permissions I
> think it works exactly like a filter.

I had the same thought, but then commit 69dca43d6b6 explicitly made the
opposite change. I seem to remember that Max never liked raw being
treated like a filter much.

Kevin




Re: [PATCH for-6.1? v2 2/7] mirror: Drop s->synced

2021-07-26 Thread Eric Blake
On Mon, Jul 26, 2021 at 04:46:08PM +0200, Max Reitz wrote:
> As of HEAD^, there is no meaning to s->synced other than whether the job
> is READY or not.  job_is_ready() gives us that information, too.
> 
> Suggested-by: Vladimir Sementsov-Ogievskiy 
> Signed-off-by: Max Reitz 
> ---
>  block/mirror.c | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v6 1/5] hw/nvme: split pmrmsc register into upper and lower

2021-07-26 Thread Keith Busch
On Wed, Jul 21, 2021 at 09:48:32AM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> The specification uses a set of 32 bit PMRMSCL and PMRMSCU registers to
> make up the 64 bit logical PMRMSC register.
> 
> Make it so.

Looks good.

Reviewed-by: Keith Busch 



Re: [PATCH v6 1/5] hw/nvme: split pmrmsc register into upper and lower

2021-07-26 Thread Klaus Jensen
Keith, Appala, any chance one of you could review this? This really
needs to get to an -rc sooner rather than later :)

On Jul 21 09:48, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> The specification uses a set of 32 bit PMRMSCL and PMRMSCU registers to
> make up the 64 bit logical PMRMSC register.
> 
> Make it so.
> 
> Signed-off-by: Klaus Jensen 
> ---
>  include/block/nvme.h | 31 ---
>  hw/nvme/ctrl.c   | 10 ++
>  2 files changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 527105fafc0b..84053b68b987 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -26,7 +26,8 @@ typedef struct QEMU_PACKED NvmeBar {
>  uint32_tpmrsts;
>  uint32_tpmrebs;
>  uint32_tpmrswtp;
> -uint64_tpmrmsc;
> +uint32_tpmrmscl;
> +uint32_tpmrmscu;
>  uint8_t css[484];
>  } NvmeBar;
>  
> @@ -475,25 +476,25 @@ enum NvmePmrswtpMask {
>  #define NVME_PMRSWTP_SET_PMRSWTV(pmrswtp, val)   \
>  (pmrswtp |= (uint64_t)(val & PMRSWTP_PMRSWTV_MASK) << 
> PMRSWTP_PMRSWTV_SHIFT)
>  
> -enum NvmePmrmscShift {
> -PMRMSC_CMSE_SHIFT   = 1,
> -PMRMSC_CBA_SHIFT= 12,
> +enum NvmePmrmsclShift {
> +PMRMSCL_CMSE_SHIFT   = 1,
> +PMRMSCL_CBA_SHIFT= 12,
>  };
>  
> -enum NvmePmrmscMask {
> -PMRMSC_CMSE_MASK   = 0x1,
> -PMRMSC_CBA_MASK= 0xf,
> +enum NvmePmrmsclMask {
> +PMRMSCL_CMSE_MASK   = 0x1,
> +PMRMSCL_CBA_MASK= 0xf,
>  };
>  
> -#define NVME_PMRMSC_CMSE(pmrmsc)\
> -((pmrmsc >> PMRMSC_CMSE_SHIFT)   & PMRMSC_CMSE_MASK)
> -#define NVME_PMRMSC_CBA(pmrmsc) \
> -((pmrmsc >> PMRMSC_CBA_SHIFT)   & PMRMSC_CBA_MASK)
> +#define NVME_PMRMSCL_CMSE(pmrmscl)\
> +((pmrmscl >> PMRMSCL_CMSE_SHIFT)   & PMRMSCL_CMSE_MASK)
> +#define NVME_PMRMSCL_CBA(pmrmscl) \
> +((pmrmscl >> PMRMSCL_CBA_SHIFT)   & PMRMSCL_CBA_MASK)
>  
> -#define NVME_PMRMSC_SET_CMSE(pmrmsc, val)   \
> -(pmrmsc |= (uint64_t)(val & PMRMSC_CMSE_MASK) << PMRMSC_CMSE_SHIFT)
> -#define NVME_PMRMSC_SET_CBA(pmrmsc, val)   \
> -(pmrmsc |= (uint64_t)(val & PMRMSC_CBA_MASK) << PMRMSC_CBA_SHIFT)
> +#define NVME_PMRMSCL_SET_CMSE(pmrmscl, val)   \
> +(pmrmscl |= (uint32_t)(val & PMRMSCL_CMSE_MASK) << PMRMSCL_CMSE_SHIFT)
> +#define NVME_PMRMSCL_SET_CBA(pmrmscl, val)   \
> +(pmrmscl |= (uint32_t)(val & PMRMSCL_CBA_MASK) << PMRMSCL_CBA_SHIFT)
>  
>  enum NvmeSglDescriptorType {
>  NVME_SGL_DESCR_TYPE_DATA_BLOCK  = 0x0,
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 2f0524e12a36..070d9f6a962d 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -5916,11 +5916,13 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr 
> offset, uint64_t data,
>  return;
>  }
>  
> -n->bar.pmrmsc = (n->bar.pmrmsc & ~0xffffffff) | (data & 0xffffffff);
> +n->bar.pmrmscl = data;
>  n->pmr.cmse = false;
>  
> -if (NVME_PMRMSC_CMSE(n->bar.pmrmsc)) {
> -hwaddr cba = NVME_PMRMSC_CBA(n->bar.pmrmsc) << PMRMSC_CBA_SHIFT;
> +if (NVME_PMRMSCL_CMSE(n->bar.pmrmscl)) {
> +uint64_t pmrmscu = n->bar.pmrmscu;
> +hwaddr cba = (pmrmscu << 32) |
> +(NVME_PMRMSCL_CBA(n->bar.pmrmscl) << PMRMSCL_CBA_SHIFT);
>  if (cba + int128_get64(n->pmr.dev->mr.size) < cba) {
>  NVME_PMRSTS_SET_CBAI(n->bar.pmrsts, 1);
>  return;
> @@ -5936,7 +5938,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
> uint64_t data,
>  return;
>  }
>  
> -n->bar.pmrmsc = (n->bar.pmrmsc & 0xffffffff) | (data << 32);
> +n->bar.pmrmscu = data;
>  return;
>  default:
>  NVME_GUEST_ERR(pci_nvme_ub_mmiowr_invalid,
> -- 
> 2.32.0
> 

-- 
One of us - No more doubt, silence or taboo about mental illness.




Re: [PATCH for-6.1? v2 1/9] nbd/server: Mark variable unused in nbd_negotiate_meta_queries

2021-07-26 Thread Eric Blake
On Sun, Jul 25, 2021 at 02:24:08AM -1000, Richard Henderson wrote:
> From clang-13:
> nbd/server.c:976:22: error: variable 'bitmaps' set but not used \
> [-Werror,-Wunused-but-set-variable]
> 
> which is incorrect; see https://bugs.llvm.org/show_bug.cgi?id=3888.
> 
> Cc: qemu-block@nongnu.org
> Cc: Eric Blake 
> Cc: Vladimir Sementsov-Ogievskiy 
> Signed-off-by: Richard Henderson 
> ---
>  nbd/server.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index b60ebc3ab6..3927f7789d 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -973,7 +973,8 @@ static int nbd_negotiate_meta_queries(NBDClient *client,
>  {
>  int ret;
>  g_autofree char *export_name = NULL;
> -g_autofree bool *bitmaps = NULL;
> +/* Mark unused to work around https://bugs.llvm.org/show_bug.cgi?id=3888 
> */
> +g_autofree G_GNUC_UNUSED bool *bitmaps = NULL;

Reviewed-by: Eric Blake 

I'm not sure this one patch warrants a pull request by itself, but I'm
not opposed to including it in 6.1 if anything also turns up affecting
NBD.  If someone wants to pick up the entire series, that would work
too.  Otherwise, I can queue this individual patch through my NBD tree
for 6.2.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




[PATCH for-6.1? v2 5/7] job: Add job_cancel_requested()

2021-07-26 Thread Max Reitz
Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_requested() as the general variant, which returns true for any
jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.
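
For illustration, a standalone model of the two predicates (struct job below
is a stand-in, not the QEMU Job): a mirror job soft-cancelled in the READY
phase has cancelled=true but force_cancel=false, so job_cancel_requested()
is true while job_is_cancelled() is not.

  #include <stdbool.h>
  #include <stdio.h>

  struct job {
      bool cancelled;     /* a cancel was requested, in any mode */
      bool force_cancel;  /* the job will terminate as soon as possible */
  };

  static bool job_is_cancelled(const struct job *job)
  {
      return job->cancelled && job->force_cancel;
  }

  static bool job_cancel_requested(const struct job *job)
  {
      return job->cancelled;
  }

  int main(void)
  {
      struct job soft = { .cancelled = true, .force_cancel = false };

      printf("is_cancelled=%d cancel_requested=%d\n",
             job_is_cancelled(&soft), job_cancel_requested(&soft));
      return 0;
  }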

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
 include/qemu/job.h |  8 +++-
 block/mirror.c | 10 --
 job.c  |  7 ++-
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
 /** Returns true if the job should not be visible to the management layer. */
 bool job_is_internal(Job *job);
 
-/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
 bool job_is_cancelled(Job *job);
 
+/**
+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);
+
 /** Returns whether the job is in a completed state. */
 bool job_is_completed(Job *job);
 
diff --git a/block/mirror.c b/block/mirror.c
index e93631a9f6..72e02fa34e 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -936,7 +936,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 /* Transition to the READY state and wait for complete. */
 job_transition_to_ready(&s->common.job);
 s->actively_synced = true;
-while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
+while (!job_cancel_requested(&s->common.job) && !s->should_complete) {
 job_yield(&s->common.job);
 }
 s->common.job.cancelled = false;
@@ -1043,7 +1043,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 }
 
 should_complete = s->should_complete ||
-job_is_cancelled(&s->common.job);
+job_cancel_requested(&s->common.job);
 cnt = bdrv_get_dirty_count(s->dirty_bitmap);
 }
 
@@ -1087,7 +1087,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
   delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
+if (job_is_cancelled(&s->common.job)) {
 break;
 }
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1099,9 +1099,7 @@ immediate_exit:
  * or it was cancelled prematurely so that we do not guarantee that
  * the target is a copy of the source.
  */
-assert(ret < 0 ||
-   (s->common.job.force_cancel &&
-job_is_cancelled(&s->common.job)));
+assert(ret < 0 || job_is_cancelled(&s->common.job));
 assert(need_drain);
 mirror_wait_for_all_io(s);
 }
diff --git a/job.c b/job.c
index e78d893a9c..dba17a680f 100644
--- a/job.c
+++ b/job.c
@@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
 }
 
 bool job_is_cancelled(Job *job)
+{
+return job->cancelled && job->force_cancel;
+}
+
+bool job_cancel_requested(Job *job)
 {
 return job->cancelled;
 }
@@ -1015,7 +1020,7 @@ void job_complete(Job *job, Error **errp)
 if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
 return;
 }
-if (job_is_cancelled(job) || !job->driver->complete) {
+if (job_cancel_requested(job) || !job->driver->complete) {
 error_setg(errp, "The active block job '%s' cannot be completed",
job->id);
 return;
-- 
2.31.1




[PATCH for-6.1? v2 3/7] job: @force parameter for job_cancel_sync{, _all}()

2021-07-26 Thread Max Reitz
Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/qemu/job.h| 10 ++---
 block/replication.c   |  4 +-
 blockdev.c|  4 +-
 job.c | 27 +---
 qemu-nbd.c|  2 +-
 softmmu/runstate.c|  2 +-
 storage-daemon/qemu-storage-daemon.c  |  2 +-
 tests/unit/test-block-iothread.c  |  2 +-
 tests/unit/test-blockjob.c|  2 +-
 tests/qemu-iotests/109.out| 60 +++
 tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
 11 files changed, 61 insertions(+), 56 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 41162ed494..5e8edbc2c8 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -506,19 +506,19 @@ void job_user_cancel(Job *job, bool force, Error **errp);
 
 /**
  * Synchronously cancel the @job.  The completion callback is called
- * before the function returns.  The job may actually complete
- * instead of canceling itself; the circumstances under which this
- * happens depend on the kind of job that is active.
+ * before the function returns.  If @force is false, the job may
+ * actually complete instead of canceling itself; the circumstances
+ * under which this happens depend on the kind of job that is active.
  *
  * Returns the return value from the job if the job actually completed
  * during the call, or -ECANCELED if it was canceled.
  *
  * Callers must hold the AioContext lock of job->aio_context.
  */
-int job_cancel_sync(Job *job);
+int job_cancel_sync(Job *job, bool force);
 
 /** Synchronously cancels all jobs using job_cancel_sync(). */
-void job_cancel_sync_all(void);
+void job_cancel_sync_all(bool force);
 
 /**
  * @job: The job to be completed.
diff --git a/block/replication.c b/block/replication.c
index 32444b9a8f..e7a9327b12 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -149,7 +149,7 @@ static void replication_close(BlockDriverState *bs)
 if (s->stage == BLOCK_REPLICATION_FAILOVER) {
 commit_job = &s->commit_job->job;
 assert(commit_job->aio_context == qemu_get_current_aio_context());
-job_cancel_sync(commit_job);
+job_cancel_sync(commit_job, false);
 }
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
@@ -726,7 +726,7 @@ static void replication_stop(ReplicationState *rs, bool 
failover, Error **errp)
  * disk, secondary disk in backup_job_completed().
  */
 if (s->backup_job) {
-job_cancel_sync(&s->backup_job->job);
+job_cancel_sync(&s->backup_job->job, false);
 }
 
 if (!failover) {
diff --git a/blockdev.c b/blockdev.c
index 3d8ac368a1..aa95918c02 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1848,7 +1848,7 @@ static void drive_backup_abort(BlkActionState *common)
 aio_context = bdrv_get_aio_context(state->bs);
 aio_context_acquire(aio_context);
 
-job_cancel_sync(&state->job->job);
+job_cancel_sync(&state->job->job, true);
 
 aio_context_release(aio_context);
 }
@@ -1949,7 +1949,7 @@ static void blockdev_backup_abort(BlkActionState *common)
 aio_context = bdrv_get_aio_context(state->bs);
 aio_context_acquire(aio_context);
 
-job_cancel_sync(&state->job->job);
+job_cancel_sync(&state->job->job, true);
 
 aio_context_release(aio_context);
 }
diff --git a/job.c b/job.c
index e7a5d28854..9e971d64cf 100644
--- a/job.c
+++ b/job.c
@@ -763,7 +763,12 @@ static void job_completed_txn_abort(Job *job)
 if (other_job != job) {
 ctx = other_job->aio_context;
 aio_context_acquire(ctx);
-job_cancel_async(other_job, false);
+/*
+ * This is a transaction: If one job failed, no result will matter.
+ * Therefore, pass force=true to terminate all other jobs as quickly
+ * as possible.
+ */
+job_cancel_async(other_job, true);
 aio_context_release(ctx);
 }
 }
@@ -964,12 +969,24 @@ static void job_cancel_err(Job *job, Error **errp)
 job_cancel(job, false);
 }
 
-int job_cancel_sync

[PATCH for-6.1? v2 7/7] iotests: Add mirror-ready-cancel-error test

2021-07-26 Thread Max Reitz
Test what happens when there is an I/O error after a mirror job in the
READY phase has been cancelled.

Signed-off-by: Max Reitz 
---
 .../tests/mirror-ready-cancel-error   | 143 ++
 .../tests/mirror-ready-cancel-error.out   |   5 +
 2 files changed, 148 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

diff --git a/tests/qemu-iotests/tests/mirror-ready-cancel-error b/tests/qemu-iotests/tests/mirror-ready-cancel-error
new file mode 100755
index 00..f2dc1f
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-ready-cancel-error
@@ -0,0 +1,143 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when errors occur to a mirror job after it has
+# been cancelled in the READY phase
+#
+# Copyright (C) 2021 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import os
+import iotests
+
+
+image_size = 1 * 1024 * 1024
+source = os.path.join(iotests.test_dir, 'source.img')
+target = os.path.join(iotests.test_dir, 'target.img')
+
+
+class TestMirrorReadyCancelError(iotests.QMPTestCase):
+def setUp(self) -> None:
+assert iotests.qemu_img_create('-f', iotests.imgfmt, source,
+   str(image_size)) == 0
+assert iotests.qemu_img_create('-f', iotests.imgfmt, target,
+   str(image_size)) == 0
+
+self.vm = iotests.VM()
+self.vm.launch()
+
+def tearDown(self) -> None:
+self.vm.shutdown()
+os.remove(source)
+os.remove(target)
+
+def add_blockdevs(self, once: bool) -> None:
+res = self.vm.qmp('blockdev-add',
+  **{'node-name': 'source',
+ 'driver': iotests.imgfmt,
+ 'file': {
+ 'driver': 'file',
+ 'filename': source
+ }})
+self.assert_qmp(res, 'return', {})
+
+# blkdebug notes:
+# Enter state 2 on the first flush, which happens before the
+# job enters the READY state.  The second flush will happen
+# when the job is about to complete, and we want that one to
+# fail.
+res = self.vm.qmp('blockdev-add',
+  **{'node-name': 'target',
+ 'driver': iotests.imgfmt,
+ 'file': {
+ 'driver': 'blkdebug',
+ 'image': {
+ 'driver': 'file',
+ 'filename': target
+ },
+ 'set-state': [{
+ 'event': 'flush_to_disk',
+ 'state': 1,
+ 'new_state': 2
+ }],
+ 'inject-error': [{
+ 'event': 'flush_to_disk',
+ 'once': once,
+ 'immediately': True,
+ 'state': 2
+ }]}})
+self.assert_qmp(res, 'return', {})
+
+def start_mirror(self) -> None:
+res = self.vm.qmp('blockdev-mirror',
+  job_id='mirror',
+  device='source',
+  target='target',
+  filter_node_name='mirror-top',
+  sync='full',
+  on_target_error='stop')
+self.assert_qmp(res, 'return', {})
+
+def cancel_mirror_with_error(self) -> None:
+self.vm.event_wait('BLOCK_JOB_READY')
+
+# Write something so we will not leave the job immediately, but
+# flush first (which will fail, thanks to blkdebug)
+res = self.vm.qmp('human-monitor-command',
+  command_line='qemu-io mirror-top "write 0 64k"')
+self.assert_qmp(res, 'return', '')
+
+# Drain status change events
+while self.vm.event_wait('JOB_STATUS_CHANGE', timeout=0.0) is not None:
+pass
+
+res = self.vm.qmp('block-job-cancel', device='mi

[PATCH for-6.1? v2 6/7] mirror: Check job_is_cancelled() earlier

2021-07-26 Thread Max Reitz
We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

A job being force-cancelled should be treated the same as the job having
failed, so put the check in the same place where we check `s->ret < 0`.
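
A minimal standalone sketch of the failure mode (flag names invented, not the
mirror code itself): any `continue` that runs before the cancellation check
keeps the loop alive for as long as the error persists, no matter how the job
was cancelled.

/* Invented sketch: an error path that `continue`s before looking at the
 * cancel flag never honours the request while the error keeps occurring. */
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    bool flush_fails = true;   /* imagine a persistent target I/O error */
    bool cancelled = true;     /* a force-cancel has already been requested */
    int spins = 0;

    while (spins < 5) {        /* bounded here; the real loop spins forever */
        spins++;

        if (flush_fails) {
            continue;          /* old ordering: retry before the check */
        }

        if (cancelled) {       /* never reached while flushing keeps failing */
            break;
        }
    }

    printf("looped %d times without honouring the cancel request\n", spins);
    return 0;
}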

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---
 block/mirror.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 72e02fa34e..46d1a1e5a2 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -993,7 +993,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_wait_for_any_operation(s, true);
 }
 
-if (s->ret < 0) {
+if (s->ret < 0 || job_is_cancelled(&s->common.job)) {
 ret = s->ret;
 goto immediate_exit;
 }
@@ -1078,8 +1078,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 break;
 }
 
-ret = 0;
-
 if (job_is_ready(&s->common.job) && !should_complete) {
 delay_ns = (s->in_flight == 0 &&
 cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
@@ -1087,9 +1085,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
   delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job)) {
-break;
-}
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
 }
 
-- 
2.31.1




[PATCH for-6.1? v2 1/7] mirror: Keep s->synced on error

2021-07-26 Thread Max Reitz
An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_synced is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/mirror.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 98fc66eabf..d73b704473 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -121,7 +121,6 @@ typedef enum MirrorMethod {
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
 int error)
 {
-s->synced = false;
 s->actively_synced = false;
 if (read) {
 return block_job_error_action(&s->common, s->on_source_error,
-- 
2.31.1




[PATCH for-6.1? v2 4/7] jobs: Give Job.force_cancel more meaning

2021-07-26 Thread Max Reitz
We largely have two cancel modes for jobs:

First, there is actual cancelling.  The job is terminated as soon as
possible, without trying to reach a consistent result.

Second, we have mirror in the READY state.  Technically, the job is not
really cancelled, but it just is a different completion mode.  The job
can still run for an indefinite amount of time while it tries to reach a
consistent result.

We want to be able to clearly distinguish which cancel mode a job is in
(when it has been cancelled).  We can use Job.force_cancel for this, but
right now it only reflects cancel requests from the user with
force=true, but clearly, jobs that do not even distinguish between
force=false and force=true are effectively always force-cancelled.

So this patch has Job.force_cancel signify whether the job will
terminate as soon as possible (force_cancel=true) or whether it will
effectively remain running despite being "cancelled"
(force_cancel=false).

To this end, we let jobs that provide JobDriver.cancel() tell the
generic job code whether they will terminate as soon as possible or not,
and for jobs that do not provide that method we assume they will.
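
A compact standalone sketch of that contract (all names invented; this is not
the QEMU job layer): the driver callback reports whether it will stop doing
I/O immediately, the generic code records the answer in force_cancel, and a
missing callback defaults to immediate termination.

#include <stdbool.h>
#include <stdio.h>

typedef struct ToyJob ToyJob;

typedef struct ToyJobDriver {
    /* returns true if the job will stop doing I/O immediately */
    bool (*cancel)(ToyJob *job, bool force);
} ToyJobDriver;

struct ToyJob {
    const ToyJobDriver *driver;
    bool ready;           /* mirror-style READY state */
    bool cancelled;
    bool force_cancel;
};

static void toy_job_cancel_async(ToyJob *job, bool force)
{
    if (job->driver->cancel) {
        force = job->driver->cancel(job, force);
    } else {
        force = true;     /* no callback: assume immediate termination */
    }
    job->cancelled = true;
    job->force_cancel |= force;
}

/* mirror-like driver: before READY, any cancel is a force-cancel */
static bool toy_mirror_cancel(ToyJob *job, bool force)
{
    return force || !job->ready;
}

static const ToyJobDriver toy_mirror_driver = { .cancel = toy_mirror_cancel };

int main(void)
{
    ToyJob job = { .driver = &toy_mirror_driver, .ready = true };

    toy_job_cancel_async(&job, false);
    printf("READY mirror, force=false -> force_cancel=%d\n", job.force_cancel);

    toy_job_cancel_async(&job, true);
    printf("READY mirror, force=true  -> force_cancel=%d\n", job.force_cancel);
    return 0;
}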

Signed-off-by: Max Reitz 
---
 include/qemu/job.h | 11 ++-
 block/backup.c |  3 ++-
 block/mirror.c | 24 ++--
 job.c  |  6 +-
 4 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 5e8edbc2c8..8aa90f7395 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -253,8 +253,17 @@ struct JobDriver {
 
 /**
  * If the callback is not NULL, it will be invoked in job_cancel_async
+ *
+ * This function must return true if the job will be cancelled
+ * immediately without any further I/O (mandatory if @force is
+ * true), and false otherwise.  This lets the generic job layer
+ * know whether a job has been truly (force-)cancelled, or whether
+ * it is just in a special completion mode (like mirror after
+ * READY).
+ * (If the callback is NULL, the job is assumed to terminate
+ * without I/O.)
  */
-void (*cancel)(Job *job, bool force);
+bool (*cancel)(Job *job, bool force);
 
 
 /** Called when the job is freed */
diff --git a/block/backup.c b/block/backup.c
index bd3614ce70..513e1c8a0b 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -331,11 +331,12 @@ static void coroutine_fn backup_set_speed(BlockJob *job, int64_t speed)
 }
 }
 
-static void backup_cancel(Job *job, bool force)
+static bool backup_cancel(Job *job, bool force)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
 
 bdrv_cancel_in_flight(s->target_bs);
+return true;
 }
 
 static const BlockJobDriver backup_job_driver = {
diff --git a/block/mirror.c b/block/mirror.c
index fcb7b65f93..e93631a9f6 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1087,9 +1087,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
   delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
-if (job_is_cancelled(&s->common.job) &&
-(!job_is_ready(&s->common.job) || s->common.job.force_cancel))
-{
+if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
 break;
 }
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1102,7 +1100,7 @@ immediate_exit:
  * the target is a copy of the source.
  */
 assert(ret < 0 ||
-   ((s->common.job.force_cancel || !job_is_ready(&s->common.job)) &&
+   (s->common.job.force_cancel &&
 job_is_cancelled(&s->common.job)));
 assert(need_drain);
 mirror_wait_for_all_io(s);
@@ -1188,14 +1186,27 @@ static bool mirror_drained_poll(BlockJob *job)
 return !!s->in_flight;
 }
 
-static void mirror_cancel(Job *job, bool force)
+static bool mirror_cancel(Job *job, bool force)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 BlockDriverState *target = blk_bs(s->target);
 
-if (force || !job_is_ready(job)) {
+/*
+ * Before the job is READY, we treat any cancellation like a
+ * force-cancellation.
+ */
+force = force || !job_is_ready(job);
+
+if (force) {
 bdrv_cancel_in_flight(target);
 }
+return force;
+}
+
+static bool commit_active_cancel(Job *job, bool force)
+{
+/* Same as above in mirror_cancel() */
+return force || !job_is_ready(job);
 }
 
 static const BlockJobDriver mirror_job_driver = {
@@ -1225,6 +1236,7 @@ static const BlockJobDriver commit_active_job_driver = {
 .abort  = mirror_abort,
 .pause  = mirror_pause,
 .complete   = mirror_complete,
+.cancel = commit_active_cancel,
 },
 .drained_poll   = mirror_drained_poll,
 };

[PATCH for-6.1? v2 2/7] mirror: Drop s->synced

2021-07-26 Thread Max Reitz
As of HEAD^, there is no meaning to s->synced other than whether the job
is READY or not.  job_is_ready() gives us that information, too.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---
 block/mirror.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index d73b704473..fcb7b65f93 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -56,7 +56,6 @@ typedef struct MirrorBlockJob {
 bool zero_target;
 MirrorCopyMode copy_mode;
 BlockdevOnError on_source_error, on_target_error;
-bool synced;
 /* Set when the target is synced (dirty bitmap is clean, nothing
  * in flight) and the job is running in active mode */
 bool actively_synced;
@@ -936,7 +935,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 if (s->bdev_length == 0) {
 /* Transition to the READY state and wait for complete. */
 job_transition_to_ready(&s->common.job);
-s->synced = true;
 s->actively_synced = true;
 while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
 job_yield(&s->common.job);
@@ -1028,7 +1026,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 should_complete = false;
 if (s->in_flight == 0 && cnt == 0) {
 trace_mirror_before_flush(s);
-if (!s->synced) {
+if (!job_is_ready(&s->common.job)) {
 if (mirror_flush(s) < 0) {
 /* Go check s->ret.  */
 continue;
@@ -1039,7 +1037,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  * the target in a consistent state.
  */
 job_transition_to_ready(&s->common.job);
-s->synced = true;
 if (s->copy_mode != MIRROR_COPY_MODE_BACKGROUND) {
 s->actively_synced = true;
 }
@@ -1083,14 +1080,15 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 
 ret = 0;
 
-if (s->synced && !should_complete) {
+if (job_is_ready(&s->common.job) && !should_complete) {
 delay_ns = (s->in_flight == 0 &&
 cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
 }
-trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
+trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
+  delay_ns);
 job_sleep_ns(&s->common.job, delay_ns);
 if (job_is_cancelled(&s->common.job) &&
-(!s->synced || s->common.job.force_cancel))
+(!job_is_ready(&s->common.job) || s->common.job.force_cancel))
 {
 break;
 }
@@ -1103,8 +1101,9 @@ immediate_exit:
  * or it was cancelled prematurely so that we do not guarantee that
  * the target is a copy of the source.
  */
-assert(ret < 0 || ((s->common.job.force_cancel || !s->synced) &&
-   job_is_cancelled(&s->common.job)));
+assert(ret < 0 ||
+   ((s->common.job.force_cancel || !job_is_ready(&s->common.job)) &&
+job_is_cancelled(&s->common.job)));
 assert(need_drain);
 mirror_wait_for_all_io(s);
 }
@@ -1127,7 +1126,7 @@ static void mirror_complete(Job *job, Error **errp)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 
-if (!s->synced) {
+if (!job_is_ready(job)) {
 error_setg(errp, "The active block job '%s' cannot be completed",
job->id);
 return;
-- 
2.31.1




[PATCH for-6.1? v2 0/7] mirror: Handle errors after READY cancel

2021-07-26 Thread Max Reitz
Hi,

v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00705.html

Changes in v2:
- Added patch 2 (as suggested by Vladimir)
- Patch 4 (ex. 3): Rebase conflicts because of patch 2
- Patch 5 (ex. 4):
  - Rebase conflicts because of patch 2
  - Do not use job_cancel_requested() to determine how a soft-cancelled
job should be completed: A soft-cancelled job should end with
COMPLETED, not CANCELLED, and so job_is_cancelled() is the
appropriate condition there.


git-backport-diff against v1:

Key:
[] : patches are identical
[] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/7:[] [--] 'mirror: Keep s->synced on error'
002/7:[down] 'mirror: Drop s->synced'
003/7:[] [--] 'job: @force parameter for job_cancel_sync{,_all}()'
004/7:[0006] [FC] 'jobs: Give Job.force_cancel more meaning'
005/7:[0011] [FC] 'job: Add job_cancel_requested()'
006/7:[] [-C] 'mirror: Check job_is_cancelled() earlier'
007/7:[] [--] 'iotests: Add mirror-ready-cancel-error test'


Max Reitz (7):
  mirror: Keep s->synced on error
  mirror: Drop s->synced
  job: @force parameter for job_cancel_sync{,_all}()
  jobs: Give Job.force_cancel more meaning
  job: Add job_cancel_requested()
  mirror: Check job_is_cancelled() earlier
  iotests: Add mirror-ready-cancel-error test

 include/qemu/job.h|  29 +++-
 block/backup.c|   3 +-
 block/mirror.c|  47 +++---
 block/replication.c   |   4 +-
 blockdev.c|   4 +-
 job.c |  40 -
 qemu-nbd.c|   2 +-
 softmmu/runstate.c|   2 +-
 storage-daemon/qemu-storage-daemon.c  |   2 +-
 tests/unit/test-block-iothread.c  |   2 +-
 tests/unit/test-blockjob.c|   2 +-
 tests/qemu-iotests/109.out|  60 +++-
 .../tests/mirror-ready-cancel-error   | 143 ++
 .../tests/mirror-ready-cancel-error.out   |   5 +
 tests/qemu-iotests/tests/qsd-jobs.out |   2 +-
 15 files changed, 264 insertions(+), 83 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

-- 
2.31.1




Re: [PATCH] raw-format: drop WRITE and RESIZE child perms when possible

2021-07-26 Thread Vladimir Sementsov-Ogievskiy

26.07.2021 15:28, Stefan Hajnoczi wrote:

The following command-line fails due to a permissions conflict:

   $ qemu-storage-daemon \
   --blockdev driver=nvme,node-name=nvme0,device=:08:00.0,namespace=1 \
   --blockdev driver=raw,node-name=l1-1,file=nvme0,offset=0,size=1073741824 
\
   --blockdev 
driver=raw,node-name=l1-2,file=nvme0,offset=1073741824,size=1073741824 \
   --nbd-server addr.type=unix,addr.path=/tmp/nbd.sock,max-connections=2 \
   --export type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on \
   --export type=nbd,id=nbd-l1-2,node-name=l1-2,name=l1-2,writable=on

   qemu-storage-daemon: --export 
type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on: Permission conflict 
on node 'nvme0': permissions 'resize' are both required by node 'l1-1' (uses 
node 'nvme0' as 'file' child) and unshared by node 'l1-2' (uses node 'nvme0' as 
'file' child).

The problem is that block/raw-format.c relies on bdrv_default_perms() to
set permissions on the nvme node. The default permissions add RESIZE in
anticipation of a format driver like qcow2 that needs to grow the image
file. This fails because RESIZE is unshared, so we cannot get the RESIZE
permission.

Max Reitz pointed out that block/crypto.c already handles this case by
implementing a custom ->bdrv_child_perm() function that adjusts the
result of bdrv_default_perms().

This patch takes the same approach in block/raw-format.c so that RESIZE
is only required if it's actually necessary (e.g. the parent is qcow2).

Cc: Max Reitz 
Cc: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
---
This is not a bug fix, so I didn't mark it for QEMU 6.1. It's new
behavior that hasn't been supported before. I want to split an NVMe
drive using the raw format's offset=/size= feature.
---
  block/raw-format.c | 21 -
  1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/block/raw-format.c b/block/raw-format.c
index 7717578ed6..c26f493688 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -580,6 +580,25 @@ static void raw_cancel_in_flight(BlockDriverState *bs)
  bdrv_cancel_in_flight(bs->file->bs);
  }
  
+static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,

+   BdrvChildRole role,
+   BlockReopenQueue *reopen_queue,
+   uint64_t parent_perm, uint64_t parent_shared,
+   uint64_t *nperm, uint64_t *nshared)
+{
+bdrv_default_perms(bs, c, role, reopen_queue, parent_perm,
+   parent_shared, nperm, nshared);
+
+/*
+ * bdrv_default_perms() may add WRITE and/or RESIZE (see comment in
+ * bdrv_default_perms_for_storage() for an explanation) but we only need
+ * them if they are in parent_perm. Drop WRITE and RESIZE whenever possible
+ * to avoid permission conflicts.
+ */
+*nperm &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
+*nperm |= parent_perm & (BLK_PERM_WRITE | BLK_PERM_RESIZE);
+}
+
  BlockDriver bdrv_raw = {
  .format_name  = "raw",
  .instance_size= sizeof(BDRVRawState),
@@ -588,7 +607,7 @@ BlockDriver bdrv_raw = {
  .bdrv_reopen_commit   = &raw_reopen_commit,
  .bdrv_reopen_abort= &raw_reopen_abort,
  .bdrv_open= &raw_open,
-.bdrv_child_perm  = bdrv_default_perms,
+.bdrv_child_perm  = raw_child_perm,
  .bdrv_co_create_opts  = &raw_co_create_opts,
  .bdrv_co_preadv   = &raw_co_preadv,
  .bdrv_co_pwritev  = &raw_co_pwritev,



I think it's OK:

Reviewed-by: Vladimir Sementsov-Ogievskiy 


Still, did you consider the alternative of making the bdrv_filter_default_perm()
function public and just doing ".bdrv_child_perm = bdrv_filter_default_perm," here?

raw_format is not considered to be a filter, but for its permissions I think it
works exactly like a filter.


--
Best regards,
Vladimir



Re: [PATCH v2] nbd/server: Add --selinux-label option

2021-07-26 Thread Eric Blake
On Fri, Jul 23, 2021 at 11:47:51AM +0100, Daniel P. Berrangé wrote:
> On Fri, Jul 23, 2021 at 11:33:03AM +0100, Richard W.M. Jones wrote:
> > Under SELinux, Unix domain sockets have two labels.  One is on the
> > disk and can be set with commands such as chcon(1).  There is a
> > different label stored in memory (called the process label).  This can
> > only be set by the process creating the socket.  When using SELinux +
> > SVirt and wanting qemu to be able to connect to a qemu-nbd instance,
> > you must set both labels correctly first.
> > 
> > For qemu-nbd the options to set the second label are awkward.  You can
> > create the socket in a wrapper program and then exec into qemu-nbd.
> > Or you could try something with LD_PRELOAD.
> > 
> > This commit adds the ability to set the label straightforwardly on the
> > command line, via the new --selinux-label flag.  (The name of the flag
> > is the same as the equivalent nbdkit option.)
> > 
> > A worked example showing how to use the new option can be found in
> > this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1984938
> > 
> > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1984938
> > Signed-off-by: Richard W.M. Jones 
> > ---
> >  configure |  9 -
> >  meson.build   | 10 +-
> >  meson_options.txt |  3 ++
> >  qemu-nbd.c| 33 +++
> >  tests/docker/dockerfiles/centos8.docker   |  1 +
> >  tests/docker/dockerfiles/fedora.docker|  1 +
> >  tests/docker/dockerfiles/opensuse-leap.docker |  1 +
> >  tests/docker/dockerfiles/ubuntu1804.docker|  1 +
> >  tests/docker/dockerfiles/ubuntu2004.docker|  1 +
> >  9 files changed, 58 insertions(+), 2 deletions(-)
> 
> Reviewed-by: Daniel P. Berrangé 

Thanks. This is a new feature, so it doesn't qualify for inclusion in
6.1, but I'm queuing it through my NBD tree to go in as soon as
upstream reopens for 6.2.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
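
As background on the mechanism described in the quoted commit message: the
in-memory (process) label can only be set by the process creating the socket,
typically with libselinux's setsockcreatecon(). The sketch below is purely
illustrative -- it is not the qemu-nbd implementation, and the label string
and socket path are made up; build with -lselinux on an SELinux-enabled host.

#include <selinux/selinux.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical label; a real setup would pick one matching the VM's
     * SVirt policy. */
    const char *label = "system_u:object_r:svirt_socket_t:s0";

    if (setsockcreatecon(label) < 0) {        /* set the in-memory label */
        perror("setsockcreatecon");
        return 1;
    }

    int fd = socket(AF_UNIX, SOCK_STREAM, 0); /* socket inherits the label */
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, "/tmp/nbd.sock", sizeof(addr.sun_path) - 1);
    unlink(addr.sun_path);                    /* ignore a stale socket file */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        close(fd);
        return 1;
    }

    setsockcreatecon(NULL);                   /* reset to the default label */
    close(fd);
    return 0;
}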




[PATCH] raw-format: drop WRITE and RESIZE child perms when possible

2021-07-26 Thread Stefan Hajnoczi
The following command-line fails due to a permissions conflict:

  $ qemu-storage-daemon \
  --blockdev driver=nvme,node-name=nvme0,device=:08:00.0,namespace=1 \
  --blockdev driver=raw,node-name=l1-1,file=nvme0,offset=0,size=1073741824 \
  --blockdev 
driver=raw,node-name=l1-2,file=nvme0,offset=1073741824,size=1073741824 \
  --nbd-server addr.type=unix,addr.path=/tmp/nbd.sock,max-connections=2 \
  --export type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on \
  --export type=nbd,id=nbd-l1-2,node-name=l1-2,name=l1-2,writable=on

  qemu-storage-daemon: --export 
type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on: Permission conflict 
on node 'nvme0': permissions 'resize' are both required by node 'l1-1' (uses 
node 'nvme0' as 'file' child) and unshared by node 'l1-2' (uses node 'nvme0' as 
'file' child).

The problem is that block/raw-format.c relies on bdrv_default_perms() to
set permissions on the nvme node. The default permissions add RESIZE in
anticipation of a format driver like qcow2 that needs to grow the image
file. This fails because RESIZE is unshared, so we cannot get the RESIZE
permission.

Max Reitz pointed out that block/crypto.c already handles this case by
implementing a custom ->bdrv_child_perm() function that adjusts the
result of bdrv_default_perms().

This patch takes the same approach in block/raw-format.c so that RESIZE
is only required if it's actually necessary (e.g. the parent is qcow2).

Cc: Max Reitz 
Cc: Kevin Wolf 
Signed-off-by: Stefan Hajnoczi 
---
This is not a bug fix, so I didn't mark it for QEMU 6.1. It's new
behavior that hasn't been supported before. I want to split an NVMe
drive using the raw format's offset=/size= feature.
---
 block/raw-format.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/block/raw-format.c b/block/raw-format.c
index 7717578ed6..c26f493688 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -580,6 +580,25 @@ static void raw_cancel_in_flight(BlockDriverState *bs)
 bdrv_cancel_in_flight(bs->file->bs);
 }
 
+static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,
+   BdrvChildRole role,
+   BlockReopenQueue *reopen_queue,
+   uint64_t parent_perm, uint64_t parent_shared,
+   uint64_t *nperm, uint64_t *nshared)
+{
+bdrv_default_perms(bs, c, role, reopen_queue, parent_perm,
+   parent_shared, nperm, nshared);
+
+/*
+ * bdrv_default_perms() may add WRITE and/or RESIZE (see comment in
+ * bdrv_default_perms_for_storage() for an explanation) but we only need
+ * them if they are in parent_perm. Drop WRITE and RESIZE whenever possible
+ * to avoid permission conflicts.
+ */
+*nperm &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
+*nperm |= parent_perm & (BLK_PERM_WRITE | BLK_PERM_RESIZE);
+}
+
 BlockDriver bdrv_raw = {
 .format_name  = "raw",
 .instance_size= sizeof(BDRVRawState),
@@ -588,7 +607,7 @@ BlockDriver bdrv_raw = {
 .bdrv_reopen_commit   = &raw_reopen_commit,
 .bdrv_reopen_abort= &raw_reopen_abort,
 .bdrv_open= &raw_open,
-.bdrv_child_perm  = bdrv_default_perms,
+.bdrv_child_perm  = raw_child_perm,
 .bdrv_co_create_opts  = &raw_co_create_opts,
 .bdrv_co_preadv   = &raw_co_preadv,
 .bdrv_co_pwritev  = &raw_co_pwritev,
-- 
2.31.1



Re: [PATCH for-6.1? 1/6] mirror: Keep s->synced on error

2021-07-26 Thread Vladimir Sementsov-Ogievskiy

22.07.2021 15:26, Max Reitz wrote:

An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_synced is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Max Reitz



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH for-6.1? v2 1/9] nbd/server: Mark variable unused in nbd_negotiate_meta_queries

2021-07-26 Thread Vladimir Sementsov-Ogievskiy

25.07.2021 15:24, Richard Henderson wrote:

 From clang-13:
nbd/server.c:976:22: error: variable 'bitmaps' set but not used \
 [-Werror,-Wunused-but-set-variable]

which is incorrect; see //bugs.llvm.org/show_bug.cgi?id=3888.

Cc:qemu-block@nongnu.org
Cc: Eric Blake
Cc: Vladimir Sementsov-Ogievskiy
Signed-off-by: Richard Henderson


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v6 6/6] hmp: add virtio commands

2021-07-26 Thread Jonah Palmer



On 7/22/21 5:18 AM, Jason Wang wrote:


On 7/21/21 5:11 PM, Jonah Palmer wrote:



On 7/13/21 10:40 PM, Jason Wang wrote:


On 7/12/21 6:35 PM, Jonah Palmer wrote:

+void hmp_virtio_queue_status(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    const char *path = qdict_get_try_str(qdict, "path");
+    int queue = qdict_get_int(qdict, "queue");
+    VirtQueueStatus *s = qmp_x_debug_virtio_queue_status(path, 
queue, &err);

+
+    if (err != NULL) {
+    hmp_handle_error(mon, err);
+    return;
+    }
+
+    monitor_printf(mon, "%s:\n", path);
+    monitor_printf(mon, "  device_type:  %s\n",
+   VirtioType_str(s->device_type));
+    monitor_printf(mon, "  index:    %d\n", 
s->queue_index);

+    monitor_printf(mon, "  inuse:    %d\n", s->inuse);
+    monitor_printf(mon, "  last_avail_idx:   %d (%"PRId64" %% 
%"PRId64")\n",
+   s->last_avail_idx, s->last_avail_idx % 
s->vring_num,

+   s->vring_num);
+    monitor_printf(mon, "  shadow_avail_idx: %d (%"PRId64" %% 
%"PRId64")\n",
+   s->shadow_avail_idx, s->shadow_avail_idx % 
s->vring_num,

+   s->vring_num);
+    monitor_printf(mon, "  used_idx: %d (%"PRId64" %% 
%"PRId64")\n",
+   s->used_idx, s->used_idx % s->vring_num, 
s->vring_num);



The modular information is not the case of packed ring where the 
queue size does not have to be a power of 2.
Doesn't modulo work for any integer, regardless of whether it's a power of 2
or not? Could you clarify this for me?



For a packed ring, the index doesn't increase freely; it's always smaller
than the virtqueue size.


So showing the modulo arithmetic seems useless since the device or 
driver doesn't use modulo for calculating the real offset.


Thanks


I see, got it. Thank you for the explanation.

I should be able to easily determine a packed or split ring via
virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED).
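
For reference, a standalone sketch of the packed-ring index behaviour
described above, following the VIRTIO 1.1 spec (variable names are made up):
the index wraps at the queue size, which need not be a power of two, and a
wrap counter flips instead, so the raw index always stays below the queue
size and a modulo display adds nothing.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint16_t queue_size = 12;      /* deliberately not a power of two */
    uint16_t idx = 0;
    bool wrap_counter = true;

    for (int submitted = 0; submitted < 30; submitted++) {
        idx++;
        if (idx >= queue_size) {         /* wrap: reset index, flip counter */
            idx -= queue_size;
            wrap_counter = !wrap_counter;
        }
    }
    printf("idx=%u wrap_counter=%d (always < %u)\n",
           idx, wrap_counter, queue_size);
    return 0;
}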



Jonah






Thank you,






Re: [PATCH v6 4/6] qmp: add QMP command x-debug-virtio-queue-status

2021-07-26 Thread Jonah Palmer


On 7/22/21 5:22 AM, Jason Wang wrote:


On 7/21/21 4:59 PM, Jonah Palmer wrote:



On 7/13/21 10:37 PM, Jason Wang wrote:


On 7/12/21 6:35 PM, Jonah Palmer wrote:

From: Laurent Vivier 

This new command shows internal status of a VirtQueue.
(vrings and indexes).

Signed-off-by: Laurent Vivier 
Signed-off-by: Jonah Palmer 
---
  hw/virtio/virtio-stub.c |   6 +++
  hw/virtio/virtio.c  |  37 ++
  qapi/virtio.json    | 102 


  3 files changed, 145 insertions(+)

  [Jonah: Added 'device-type' field to VirtQueueStatus and
  qmp command x-debug-virtio-queue-status.]

diff --git a/hw/virtio/virtio-stub.c b/hw/virtio/virtio-stub.c
index ddb592f..3c1bf17 100644
--- a/hw/virtio/virtio-stub.c
+++ b/hw/virtio/virtio-stub.c
@@ -17,3 +17,9 @@ VirtioStatus *qmp_x_debug_virtio_status(const 
char* path, Error **errp)

  {
  return qmp_virtio_unsupported(errp);
  }
+
+VirtQueueStatus *qmp_x_debug_virtio_queue_status(const char *path,
+ uint16_t queue, 
Error **errp)

+{
+    return qmp_virtio_unsupported(errp);
+}
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 81a0ee8..ccd4371 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3935,6 +3935,43 @@ static VirtIODevice 
*virtio_device_find(const char *path)

  return NULL;
  }
  +VirtQueueStatus *qmp_x_debug_virtio_queue_status(const char *path,
+ uint16_t queue, 
Error **errp)

+{
+    VirtIODevice *vdev;
+    VirtQueueStatus *status;
+
+    vdev = virtio_device_find(path);
+    if (vdev == NULL) {
+    error_setg(errp, "Path %s is not a VirtIO device", path);
+    return NULL;
+    }
+
+    if (queue >= VIRTIO_QUEUE_MAX || !virtio_queue_get_num(vdev, 
queue)) {

+    error_setg(errp, "Invalid virtqueue number %d", queue);
+    return NULL;
+    }
+
+    status = g_new0(VirtQueueStatus, 1);
+    status->device_type = qapi_enum_parse(&VirtioType_lookup, 
vdev->name,

+ VIRTIO_TYPE_UNKNOWN, NULL);
+    status->queue_index = vdev->vq[queue].queue_index;
+    status->inuse = vdev->vq[queue].inuse;
+    status->vring_num = vdev->vq[queue].vring.num;
+    status->vring_num_default = vdev->vq[queue].vring.num_default;
+    status->vring_align = vdev->vq[queue].vring.align;
+    status->vring_desc = vdev->vq[queue].vring.desc;
+    status->vring_avail = vdev->vq[queue].vring.avail;
+    status->vring_used = vdev->vq[queue].vring.used;
+    status->last_avail_idx = vdev->vq[queue].last_avail_idx;



As mentioned in previous versions. We need add vhost support 
otherwise the value here is wrong.
Got it. I'll add a case to determine if vhost is active for a given 
device.
So, in the case that vhost is active, should I just not print out the 
value or would I substitute it with

another value (whatever that might be)?



You can query the vhost for those index.

(vhost_get_vring_base())



  Same question for shadow_avail_idx below as well.



It's an implementation specific. I think we can simply not show it if 
vhost is enabled.


Thanks


Ah I see, thank you!

So, it appears to me that it's not very easy to get the struct vhost_dev 
pointer from struct VirtIODevice to indicate whether or not vhost is 
active, e.g. there's no virtio class-independent way to get struct 
vhost_dev.


I was thinking of adding an op/callback function to struct 
VirtioDeviceClass, e.g. bool has_vhost(VirtIODevice *vdev), and 
implement it for each virtio class (net, scsi, blk, etc.).


For example, for virtio-net, maybe it'd be something like:

bool has_vhost(VirtIODevice *vdev) {
  VirtIONet *n = VIRTIO_NET(vdev);
  NetClientState *nc = qemu_get_queue(n->nic);
  return nc->peer ? get_vhost_net(nc->peer) : false;
}

Also, for getting the last_avail_idx, I was also thinking of adding 
another op/callback to struct VirtioDeviceClass, e.g. unsigned int 
get_last_avail_idx(VirtIODevice *vdev, unsigned int vq_idx) that finds 
if vhost is active or not and either gets last_avail_idx from virtio 
directly or through vhost (e.g. 
vhost_dev->vhost_ops->vhost_get_vring_base()).


I wanted to run this by you and get your opinion on this before I 
started implementing it in code. Let me know what you think about this.



Jonah






Jonah




+    status->shadow_avail_idx = vdev->vq[queue].shadow_avail_idx;



The shadow index is something that is implementation specific e.g in 
the case of vhost it's kind of meaningless.


Thanks



+    status->used_idx = vdev->vq[queue].used_idx;
+    status->signalled_used = vdev->vq[queue].signalled_used;
+    status->signalled_used_valid = 
vdev->vq[queue].signalled_used_valid;

+
+    return status;
+}
+
  #define CONVERT_FEATURES(type, map)    \
  ({   \
  type *list = NULL; \
diff --git a/qapi/virtio.json b/qapi/virtio.json
index 78873cd..7007e0c 100644
--- a/qapi/virtio.json
+++ b/q

Re: [PATCH v6 0/6] hmp, qmp: Add some commands to introspect virtio devices

2021-07-26 Thread Jonah Palmer



On 7/22/21 5:16 AM, Jason Wang wrote:


On 7/21/21 4:53 PM, Jonah Palmer wrote:


Hi Jason. My apologies for the delayed response, several work-related 
things came up recently, but they're slowing down now so I'm turning 
my attention these patches to get taken care of.


A few questions and comments below (and in other following patches):


On 7/13/21 10:42 PM, Jason Wang wrote:


On 7/12/21 6:35 PM, Jonah Palmer wrote:
 Dump the information of the head element of the third 
queue of virtio-scsi:


 (qemu) virtio queue-element 
/machine/peripheral-anon/device[3]/virtio-backend 3

 index: 122
 ndescs: 3
 descs: addr 0x7302d000 len 4096 (write), addr 0x3c951763 
len 108 (write, next),

    addr 0x3c951728 len 51 (next)



I think it would be nice if we can show driver area and device area 
as well here.
Sure thing. And I apologize if it's obvious (I'm relatively new to 
virtio), but how can I expose the driver area?



So the spec defines three parts: the device area, the driver area, and 
the descriptor area. And they are all located in the guest memory.



I understand that virtio devices are part of the Qemu process, but I 
also thought that virtio drivers are in the
guest's kernel, which I don't believe I can see into from Qemu (or, 
at least, it's not obvious to me).



It works like how you access the descriptor ring (descriptor area).

Thanks


Oh, I see now! I didn't realize the device area is essentially the used 
ring and the driver area is the avail ring (at least for the split 
virtqueue model). I see this in the virtio spec now.
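
For reference, the three areas map onto the split-ring structures roughly
like this (layout per the VIRTIO 1.1 spec; the optional event-suppression
fields are omitted, and these are the spec's struct names, not QEMU's
internal types):

/* Split virtqueue layout per the VIRTIO 1.1 spec:
 * descriptor area = descriptor table, driver area = available ring,
 * device area = used ring.  All three live in guest memory. */
#include <stdint.h>
#include <stdio.h>

struct virtq_desc {            /* descriptor area (descriptor table) */
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct virtq_avail {           /* driver area (available ring) */
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[];           /* queue-size entries */
};

struct virtq_used_elem {
    uint32_t id;               /* index of the used descriptor chain head */
    uint32_t len;              /* bytes written into the chain by the device */
};

struct virtq_used {            /* device area (used ring) */
    uint16_t flags;
    uint16_t idx;
    struct virtq_used_elem ring[];
};

int main(void)
{
    printf("desc %zu bytes, avail header %zu bytes, used header %zu bytes\n",
           sizeof(struct virtq_desc), sizeof(struct virtq_avail),
           sizeof(struct virtq_used));
    return 0;
}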



Thank you!






Jonah


Thanks







[PULL for-6.1 1/1] block/nvme: Fix VFIO_MAP_DMA failed: No space left on device

2021-07-26 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

When the NVMe block driver was introduced (see commit bdd6a90a9e5,
January 2018), Linux VFIO_IOMMU_MAP_DMA ioctl was only returning
-ENOMEM in case of error. The driver was correctly handling the
error path to recycle its volatile IOVA mappings.

To fix CVE-2019-3882, Linux commit 492855939bdb ("vfio/type1: Limit
DMA mappings per container", April 2019) added the -ENOSPC error to
signal the user exhausted the DMA mappings available for a container.

The block driver started to mis-behave:

  qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
  (qemu)
  (qemu) info status
  VM status: paused (io-error)
  (qemu) c
  VFIO_MAP_DMA failed: No space left on device
  (qemu) c
  VFIO_MAP_DMA failed: No space left on device

(The VM is not resumable from here, hence stuck.)

Fix by handling the new -ENOSPC error (when DMA mappings are
exhausted) without any distinction to the current -ENOMEM error,
so we don't change the behavior on old kernels where the CVE-2019-3882
fix is not present.

An easy way to reproduce this bug is to restrict the DMA mapping
limit (65535 by default) when loading the VFIO IOMMU module:

  # modprobe vfio_iommu_type1 dma_entry_limit=666

Cc: qemu-sta...@nongnu.org
Cc: Fam Zheng 
Cc: Maxim Levitsky 
Cc: Alex Williamson 
Reported-by: Michal Prívozník 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210723195843.1032825-1-phi...@redhat.com
Fixes: bdd6a90a9e5 ("block: Add VFIO based NVMe driver")
Buglink: https://bugs.launchpad.net/qemu/+bug/186
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/65
Signed-off-by: Philippe Mathieu-Daudé 
Signed-off-by: Stefan Hajnoczi 
---
 block/nvme.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index 2b5421e7aa..e8dbbc2317 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1030,7 +1030,29 @@ try_map:
 r = qemu_vfio_dma_map(s->vfio,
   qiov->iov[i].iov_base,
   len, true, &iova);
+if (r == -ENOSPC) {
+/*
+ * In addition to the -ENOMEM error, the VFIO_IOMMU_MAP_DMA
+ * ioctl returns -ENOSPC to signal the user exhausted the DMA
+ * mappings available for a container since Linux kernel commit
+ * 492855939bdb ("vfio/type1: Limit DMA mappings per container",
+ * April 2019, see CVE-2019-3882).
+ *
+ * This block driver already handles this error path by checking
+ * for the -ENOMEM error, so we directly replace -ENOSPC by
+ * -ENOMEM. Beside, -ENOSPC has a specific meaning for blockdev
+ * coroutines: it triggers BLOCKDEV_ON_ERROR_ENOSPC and
+ * BLOCK_ERROR_ACTION_STOP which stops the VM, asking the operator
+ * to add more storage to the blockdev. Not something we can do
+ * easily with an IOMMU :)
+ */
+r = -ENOMEM;
+}
 if (r == -ENOMEM && retry) {
+/*
+ * We exhausted the DMA mappings available for our container:
+ * recycle the volatile IOVA mappings.
+ */
 retry = false;
 trace_nvme_dma_flush_queue_wait(s);
 if (s->dma_map_count) {
-- 
2.31.1



[PULL for-6.1 0/1] Block patches

2021-07-26 Thread Stefan Hajnoczi
The following changes since commit a2376507f615495b1d16685449ce0ea78c2caf9d:

  Merge remote-tracking branch 'remotes/bonzini-gitlab/tags/for-upstream' into 
staging (2021-07-24 11:04:57 +0100)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 15a730e7a3aaac180df72cd5730e0617bcf44a5a:

  block/nvme: Fix VFIO_MAP_DMA failed: No space left on device (2021-07-26 
09:38:12 +0100)


Pull request

Phil's block/nvme.c ENOSPC fix for newer Linux kernels that return this errno.



Philippe Mathieu-Daudé (1):
  block/nvme: Fix VFIO_MAP_DMA failed: No space left on device

 block/nvme.c | 22 ++
 1 file changed, 22 insertions(+)

-- 
2.31.1



Re: [PATCH-for-6.1 v3] block/nvme: Fix VFIO_MAP_DMA failed: No space left on device

2021-07-26 Thread Stefan Hajnoczi
On Fri, Jul 23, 2021 at 09:58:43PM +0200, Philippe Mathieu-Daudé wrote:
> When the NVMe block driver was introduced (see commit bdd6a90a9e5,
> January 2018), Linux VFIO_IOMMU_MAP_DMA ioctl was only returning
> -ENOMEM in case of error. The driver was correctly handling the
> error path to recycle its volatile IOVA mappings.
> 
> To fix CVE-2019-3882, Linux commit 492855939bdb ("vfio/type1: Limit
> DMA mappings per container", April 2019) added the -ENOSPC error to
> signal the user exhausted the DMA mappings available for a container.
> 
> The block driver started to mis-behave:
> 
>   qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
>   (qemu)
>   (qemu) info status
>   VM status: paused (io-error)
>   (qemu) c
>   VFIO_MAP_DMA failed: No space left on device
>   (qemu) c
>   VFIO_MAP_DMA failed: No space left on device
> 
> (The VM is not resumable from here, hence stuck.)
> 
> Fix by handling the new -ENOSPC error (when DMA mappings are
> exhausted) without any distinction to the current -ENOMEM error,
> so we don't change the behavior on old kernels where the CVE-2019-3882
> fix is not present.
> 
> An easy way to reproduce this bug is to restrict the DMA mapping
> limit (65535 by default) when loading the VFIO IOMMU module:
> 
>   # modprobe vfio_iommu_type1 dma_entry_limit=666
> 
> Cc: qemu-sta...@nongnu.org
> Cc: Fam Zheng 
> Cc: Maxim Levitsky 
> Cc: Alex Williamson 
> Reported-by: Michal Prívozník 
> Fixes: bdd6a90a9e5 ("block: Add VFIO based NVMe driver")
> Buglink: https://bugs.launchpad.net/qemu/+bug/186
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/65
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> v3: Reworded (Fam)
> v2: KISS checking both errors undistinguishedly (Maxim)
> ---
>  block/nvme.c | 22 ++
>  1 file changed, 22 insertions(+)

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan




Re: [PATCH for-6.1? 1/6] mirror: Keep s->synced on error

2021-07-26 Thread Max Reitz

On 22.07.21 18:25, Vladimir Sementsov-Ogievskiy wrote:

22.07.2021 15:26, Max Reitz wrote:

An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_synced is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Max Reitz 
---
  block/mirror.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 98fc66eabf..d73b704473 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -121,7 +121,6 @@ typedef enum MirrorMethod {
  static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool 
read,

  int error)
  {
-    s->synced = false;
  s->actively_synced = false;
  if (read) {
  return block_job_error_action(&s->common, s->on_source_error,



Looked through.. Yes, it seems s->synced is used as "is ready". Isn't it
better to drop s->synced altogether and use job_is_ready() instead?


Sounds good, though I think for the change to be clear, I’d like to keep 
this patch and then drop s->synced on top.


Max

Hmm, s->actively_synced used only for assertion in 
active_write_settle().. That's not wrong, just interesting.





Re: [PATCH for-6.1? 4/6] job: Add job_cancel_requested()

2021-07-26 Thread Max Reitz

On 22.07.21 19:58, Vladimir Sementsov-Ogievskiy wrote:

22.07.2021 15:26, Max Reitz wrote:

Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) 


You have to repeat that this "cancel" is not "cancel".

So, the whole problem is that mirror has a feature whereby a cancel in the
READY state does not cancel but performs a specific kind of completion.


You are trying to handle this correctly at the generic layer..

Did you consider instead just dropping the feature from the generic layer,
so that all *cancel* functions always do a force-cancel? Then the internal
implementation becomes a lot clearer.


Yes, I considered that, and I’ve decided against it (for now), because 
such a change would obviously be an incompatible change.  It would 
require a deprecation period, and so we would need to fix this bug now 
anyway.


But we have to support the qmp block-job-cancel of READY mirror (and 
commit) with force=false.


We can do it as an exclusion in qmp_block_job_cancel, something like:

if (job is mirror or commit AND it's ready AND force = false)
   mirror_soft_cancel(...);
else
   job_cancel(...);


I didn’t consider such a hack, though.  I don’t like it.  If we think 
that we should change our approach because mirror’s soft cancel is 
actually a completion mode, and the current situation is too confusing, 
such a change should be user-visible, too.  (I think there was this idea 
of having job-specific flags or parameters you could change at runtime, 
and so you’d just change the “pivot” parameter between true or false.)


Also, I don’t know whether this would really make anything “a lot” 
easier.  After this series job_is_cancelled() already tells the true 
story, so all we could really change is to drop force_cancel and unify 
the “s->should_complete || job_cancel_requested()” conditions in 
block/mirror.c into one variable.  So when I considered making cancel 
exclusively force-cancel jobs, I thought it wouldn’t actually be worth 
it in practice.



may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_request() as the general variant, which returns true for any
jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz 
---


[..]


--- a/job.c
+++ b/job.c
@@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
  }
    bool job_is_cancelled(Job *job)
+{
+    return job->cancelled && job->force_cancel;
+}
+
+bool job_cancel_requested(Job *job)
  {
  return job->cancelled;
  }
@@ -650,7 +655,7 @@ static void job_conclude(Job *job)
    static void job_update_rc(Job *job)
  {
-    if (!job->ret && job_is_cancelled(job)) {
+    if (!job->ret && job_cancel_requested(job)) {


Why not job_is_cancelled() here?

So in case of mirror other kind of completion we set ret to -ECANCELED?


I thought the return value is a user-visible thing, so I left it as-is.

Seems I was wrong, more below.


  job->ret = -ECANCELED;
  }
  if (job->ret) {
@@ -704,7 +709,7 @@ static int job_finalize_single(Job *job)
    /* Emit events only if we actually started */
  if (job_started(job)) {
-    if (job_is_cancelled(job)) {
+    if (job_cancel_requested(job)) {
  job_event_cancelled(job);


Same question here.. Shouldn't mirror report COMPLETED event in case 
of not-force cancelled in READY state?


Same here, I thought this is user-visible, nothing internal, so I should 
leave it as-is.


Now I see that cancelling mirror post-READY indeed should result in a 
COMPLETED event.  So I’m actually not exactly sure how mirror does that, 
despite this code here (which functionally isn’t changed by this patch), 
but it’s absolutely true that job_is_cancelled() would be more 
appropriate here.


(No iotest failed, so I thought this change was right.  Well.)


  } else {
  job_event_completed(job);
@@ -1015,7 +1020,7 @@ void job_complete(Job *job, Error **errp)
  if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
  return;
  }
-    if (job_is_cancelled(job) || !job->driver->complete) {
+    if (job_cancel_requested(job) || !job-