[PATCH v3 00/11] block/nvme: Rework error reporting

2021-09-02 Thread Philippe Mathieu-Daudé
(Series fully reviewed)

Hi,

This series contains various patches sent last year with review
comments addressed, a few more cleanups, and a new patch which
removes the spurious "VFIO_MAP_DMA failed: No space left on device"
error popping up since commit 15a730e7a. (The failure is the expected
behavior, which is why we retry the same call after flushing the
DMA mappings.)

Since v2:
- qemu_vfio_find_[fixed/temp]_iova return bool (Klaus)
- Add Klaus's R-b

Since v1:
- Addressed Klaus review comments (cleaner Error* handling)
- Add Klaus's R-b

Regards,

Phil.

Philippe Mathieu-Daudé (11):
  block/nvme: Use safer trace format string
  util/vfio-helpers: Let qemu_vfio_verify_mappings() use error_report()
  util/vfio-helpers: Replace qemu_mutex_lock() calls with QEMU_LOCK_GUARD
  util/vfio-helpers: Remove unreachable code in qemu_vfio_dma_map()
  block/nvme: Have nvme_create_queue_pair() report errors consistently
  util/vfio-helpers: Pass Error handle to qemu_vfio_dma_map()
  util/vfio-helpers: Extract qemu_vfio_water_mark_reached()
  util/vfio-helpers: Use error_setg in qemu_vfio_find_[fixed/temp]_iova
  util/vfio-helpers: Simplify qemu_vfio_dma_map() returning directly
  util/vfio-helpers: Let qemu_vfio_do_mapping() propagate Error
  block/nvme: Only report VFIO error on failed retry

 include/qemu/vfio-helpers.h |  2 +-
 block/nvme.c                | 29 +++
 util/vfio-helpers.c         | 99 -
 block/trace-events          |  2 +-
 4 files changed, 76 insertions(+), 56 deletions(-)

-- 
2.31.1





[PATCH v3 02/11] util/vfio-helpers: Let qemu_vfio_verify_mappings() use error_report()

2021-09-02 Thread Philippe Mathieu-Daudé
Instead of displaying the error on stderr, use error_report()
which also reports to the monitor.

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 util/vfio-helpers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
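
For reference, error_report() (from "qemu/error-report.h") takes
printf-style arguments without a trailing newline and routes the
message to the current monitor when one exists, falling back to
stderr otherwise. A minimal sketch (not part of the patch):

    #include "qemu/osdep.h"
    #include "qemu/error-report.h"

    static bool check_sorted(const int *v, int n)
    {
        for (int i = 0; i < n - 1; i++) {
            if (v[i] > v[i + 1]) {
                /* note: no '\n', unlike the old fprintf(stderr, ...) */
                error_report("item %d not sorted!", i);
                return false;
            }
        }
        return true;
    }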

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 95b86e6..1d149136299 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -660,13 +660,13 @@ static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
 if (QEMU_VFIO_DEBUG) {
 for (i = 0; i < s->nr_mappings - 1; ++i) {
 if (!(s->mappings[i].host < s->mappings[i + 1].host)) {
-fprintf(stderr, "item %d not sorted!\n", i);
+error_report("item %d not sorted!", i);
 qemu_vfio_dump_mappings(s);
 return false;
 }
 if (!(s->mappings[i].host + s->mappings[i].size <=
   s->mappings[i + 1].host)) {
-fprintf(stderr, "item %d overlap with next!\n", i);
+error_report("item %d overlap with next!", i);
 qemu_vfio_dump_mappings(s);
 return false;
 }
-- 
2.31.1




[PATCH v3 05/11] block/nvme: Have nvme_create_queue_pair() report errors consistently

2021-09-02 Thread Philippe Mathieu-Daudé
nvme_create_queue_pair() does not return a boolean value (indicating
an eventual error) but a pointer, and is inconsistent in how it fills
the error handle. To fulfill callers' expectations, always set an
error message on failure.

Reported-by: Auger Eric 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 3 +++
 1 file changed, 3 insertions(+)
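
The callers' expectation in question is the usual Error API contract:
a NULL return must come with *errp set. A caller-side sketch
(hypothetical wrapper, simplified from nvme_init()):

    static bool nvme_init_admin_queue(BDRVNVMeState *s, AioContext *ctx,
                                      Error **errp)
    {
        s->queues[INDEX_ADMIN] = nvme_create_queue_pair(s, ctx, 0,
                                                        NVME_QUEUE_SIZE,
                                                        errp);
        /* with this patch, NULL now implies *errp has been set */
        return s->queues[INDEX_ADMIN] != NULL;
    }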

diff --git a/block/nvme.c b/block/nvme.c
index e8dbbc23177..0786c501e46 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -220,6 +220,7 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState 
*s,
 
 q = g_try_new0(NVMeQueuePair, 1);
 if (!q) {
+error_setg(errp, "Cannot allocate queue pair");
 return NULL;
 }
 trace_nvme_create_queue_pair(idx, q, size, aio_context,
@@ -228,6 +229,7 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState 
*s,
   qemu_real_host_page_size);
 q->prp_list_pages = qemu_try_memalign(qemu_real_host_page_size, bytes);
 if (!q->prp_list_pages) {
+error_setg(errp, "Cannot allocate PRP page list");
 goto fail;
 }
 memset(q->prp_list_pages, 0, bytes);
@@ -239,6 +241,7 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState 
*s,
 r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages, bytes,
   false, &prp_list_iova);
 if (r) {
+error_setg_errno(errp, -r, "Cannot map buffer for DMA");
 goto fail;
 }
 q->free_req_head = -1;
-- 
2.31.1




[PATCH v3 01/11] block/nvme: Use safer trace format string

2021-09-02 Thread Philippe Mathieu-Daudé
Fix when building with -Wshorten-64-to-32:

  warning: implicit conversion loses integer precision: 'unsigned long' to 
'int' [-Wshorten-64-to-32]

Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/trace-events | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
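
The warning fires because the trace point declared its size parameter
as 'unsigned' while nvme_create_queue_pair() passes a size_t. A
minimal sketch (not QEMU code) of the same conversion:

    #include <stddef.h>
    #include <stdio.h>

    static void trace_size(unsigned size)   /* 32-bit parameter */
    {
        printf("size %u\n", size);
    }

    int main(void)
    {
        size_t size = 42;    /* 64-bit 'unsigned long' on LP64 hosts */
        trace_size(size);    /* clang: -Wshorten-64-to-32 fires here */
        printf("size %zu\n", size);  /* the fix: size_t plus %zu */
        return 0;
    }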

diff --git a/block/trace-events b/block/trace-events
index b3d2b1e62cb..f4f1267c8c0 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -156,7 +156,7 @@ nvme_dsm(void *s, uint64_t offset, uint64_t bytes) "s %p 
offset 0x%"PRIx64" byte
 nvme_dsm_done(void *s, uint64_t offset, uint64_t bytes, int ret) "s %p offset 
0x%"PRIx64" bytes %"PRId64" ret %d"
 nvme_dma_map_flush(void *s) "s %p"
 nvme_free_req_queue_wait(void *s, unsigned q_index) "s %p q #%u"
-nvme_create_queue_pair(unsigned q_index, void *q, unsigned size, void 
*aio_context, int fd) "index %u q %p size %u aioctx %p fd %d"
+nvme_create_queue_pair(unsigned q_index, void *q, size_t size, void 
*aio_context, int fd) "index %u q %p size %zu aioctx %p fd %d"
 nvme_free_queue_pair(unsigned q_index, void *q) "index %u q %p"
 nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s 
%p cmd %p req %p qiov %p entries %d"
 nvme_cmd_map_qiov_pages(void *s, int i, uint64_t page) "s %p page[%d] 
0x%"PRIx64
-- 
2.31.1




[PATCH v3 03/11] util/vfio-helpers: Replace qemu_mutex_lock() calls with QEMU_LOCK_GUARD

2021-09-02 Thread Philippe Mathieu-Daudé
Simplify the qemu_vfio_dma_[un]map() handlers by replacing their
paired qemu_mutex_lock()/qemu_mutex_unlock() calls with the
QEMU_LOCK_GUARD macro.

Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 util/vfio-helpers.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)
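
For reference, QEMU_LOCK_GUARD() (from "qemu/lockable.h") takes the
lock immediately and releases it automatically on every path out of
the enclosing scope, which is what makes the early 'return' in the
unmap path below safe. A minimal sketch (not part of the patch):

    #include "qemu/osdep.h"
    #include "qemu/lockable.h"

    static QemuMutex lock;   /* initialised elsewhere with qemu_mutex_init() */
    static int counter;

    static int counter_read(void)
    {
        QEMU_LOCK_GUARD(&lock);   /* unlocked at any scope exit */
        if (counter < 0) {
            return 0;             /* no goto/unlock dance needed */
        }
        return counter;
    }

The sibling macro WITH_QEMU_LOCK_GUARD(&lock) { ... } scopes the lock
to a sub-block instead of the rest of the function.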

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 1d149136299..d956866b4e9 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -735,7 +735,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t 
size,
 assert(QEMU_PTR_IS_ALIGNED(host, qemu_real_host_page_size));
 assert(QEMU_IS_ALIGNED(size, qemu_real_host_page_size));
 trace_qemu_vfio_dma_map(s, host, size, temporary, iova);
-qemu_mutex_lock(&s->lock);
+QEMU_LOCK_GUARD(&s->lock);
 mapping = qemu_vfio_find_mapping(s, host, &index);
 if (mapping) {
 iova0 = mapping->iova + ((uint8_t *)host - (uint8_t *)mapping->host);
@@ -778,7 +778,6 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t 
size,
 *iova = iova0;
 }
 out:
-qemu_mutex_unlock(&s->lock);
 return ret;
 }
 
@@ -813,14 +812,12 @@ void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host)
 }
 
 trace_qemu_vfio_dma_unmap(s, host);
-qemu_mutex_lock(&s->lock);
+QEMU_LOCK_GUARD(&s->lock);
 m = qemu_vfio_find_mapping(s, host, &index);
 if (!m) {
-goto out;
+return;
 }
 qemu_vfio_undo_mapping(s, m, NULL);
-out:
-qemu_mutex_unlock(&s->lock);
 }
 
 static void qemu_vfio_reset(QEMUVFIOState *s)
-- 
2.31.1




[PATCH v3 04/11] util/vfio-helpers: Remove unreachable code in qemu_vfio_dma_map()

2021-09-02 Thread Philippe Mathieu-Daudé
qemu_vfio_add_mapping() returns a pointer to an indexed entry in the
pre-allocated QEMUVFIOState::mappings[] array, and thus cannot be
NULL. Remove the pointless check.

Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 util/vfio-helpers.c | 4 
 1 file changed, 4 deletions(-)
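
The reasoning: a pointer formed from the address of an array element
is never NULL. A simplified sketch (hypothetical types and sizes) of
the shape of qemu_vfio_add_mapping():

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        void *host;
        size_t size;
    } IOVAMapping;

    typedef struct {
        IOVAMapping mappings[64];    /* pre-allocated storage */
        int nr_mappings;
    } State;

    static IOVAMapping *add_mapping(State *s, int index)
    {
        IOVAMapping *insert = &s->mappings[index];

        memmove(insert + 1, insert,
                sizeof(*insert) * (s->nr_mappings - index));
        s->nr_mappings++;
        return insert;    /* never NULL, so checking it is dead code */
    }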

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index d956866b4e9..e7909222cfd 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -751,10 +751,6 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t 
size,
 }
 
 mapping = qemu_vfio_add_mapping(s, host, size, index + 1, iova0);
-if (!mapping) {
-ret = -ENOMEM;
-goto out;
-}
 assert(qemu_vfio_verify_mappings(s));
 ret = qemu_vfio_do_mapping(s, host, size, iova0);
 if (ret) {
-- 
2.31.1




[PATCH v3 10/11] util/vfio-helpers: Let qemu_vfio_do_mapping() propagate Error

2021-09-02 Thread Philippe Mathieu-Daudé
Pass qemu_vfio_do_mapping() an Error* argument so it can propagate
any error to callers. Replace error_report(), which only reports to
the monitor, with the more generic error_setg_errno().

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 util/vfio-helpers.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
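
For reference, error_setg_errno() works like error_setg() but appends
the strerror() text for the given errno, replacing the manual
strerror() in the old error_report() call. A minimal sketch (not part
of the patch):

    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include <sys/ioctl.h>

    static int do_map(int container, unsigned long request, void *arg,
                      Error **errp)
    {
        if (ioctl(container, request, arg)) {
            /* e.g. "VFIO_MAP_DMA failed: No space left on device" */
            error_setg_errno(errp, errno, "VFIO_MAP_DMA failed");
            return -errno;
        }
        return 0;
    }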

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 66148bd381c..00a80431a09 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -610,7 +610,7 @@ static IOVAMapping *qemu_vfio_add_mapping(QEMUVFIOState *s,
 
 /* Do the DMA mapping with VFIO. */
 static int qemu_vfio_do_mapping(QEMUVFIOState *s, void *host, size_t size,
-uint64_t iova)
+uint64_t iova, Error **errp)
 {
 struct vfio_iommu_type1_dma_map dma_map = {
 .argsz = sizeof(dma_map),
@@ -622,7 +622,7 @@ static int qemu_vfio_do_mapping(QEMUVFIOState *s, void 
*host, size_t size,
 trace_qemu_vfio_do_mapping(s, host, iova, size);
 
 if (ioctl(s->container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
-error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
+error_setg_errno(errp, errno, "VFIO_MAP_DMA failed");
 return -errno;
 }
 return 0;
@@ -772,7 +772,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t 
size,
 
 mapping = qemu_vfio_add_mapping(s, host, size, index + 1, iova0);
 assert(qemu_vfio_verify_mappings(s));
-ret = qemu_vfio_do_mapping(s, host, size, iova0);
+ret = qemu_vfio_do_mapping(s, host, size, iova0, errp);
 if (ret < 0) {
 qemu_vfio_undo_mapping(s, mapping, NULL);
 return ret;
@@ -782,7 +782,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t 
size,
 if (!qemu_vfio_find_temp_iova(s, size, &iova0, errp)) {
 return -ENOMEM;
 }
-ret = qemu_vfio_do_mapping(s, host, size, iova0);
+ret = qemu_vfio_do_mapping(s, host, size, iova0, errp);
 if (ret < 0) {
 return ret;
 }
-- 
2.31.1




[PATCH v3 07/11] util/vfio-helpers: Extract qemu_vfio_water_mark_reached()

2021-09-02 Thread Philippe Mathieu-Daudé
Extract qemu_vfio_water_mark_reached() for readability,
and have it provide an error hint in its Error* handle.

Suggested-by: Klaus Jensen 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 util/vfio-helpers.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 77cdec845d9..306b5a83e48 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -721,6 +721,21 @@ qemu_vfio_find_temp_iova(QEMUVFIOState *s, size_t size, 
uint64_t *iova)
 return -ENOMEM;
 }
 
+/**
+ * qemu_vfio_water_mark_reached:
+ *
+ * Returns %true if high watermark has been reached, %false otherwise.
+ */
+static bool qemu_vfio_water_mark_reached(QEMUVFIOState *s, size_t size,
+ Error **errp)
+{
+if (s->high_water_mark - s->low_water_mark + 1 < size) {
+error_setg(errp, "iova exhausted (water mark reached)");
+return true;
+}
+return false;
+}
+
 /* Map [host, host + size) area into a contiguous IOVA address space, and store
  * the result in @iova if not NULL. The caller need to make sure the area is
  * aligned to page size, and mustn't overlap with existing mapping areas (split
@@ -742,7 +757,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t 
size,
 if (mapping) {
 iova0 = mapping->iova + ((uint8_t *)host - (uint8_t *)mapping->host);
 } else {
-if (s->high_water_mark - s->low_water_mark + 1 < size) {
+if (qemu_vfio_water_mark_reached(s, size, errp)) {
 ret = -ENOMEM;
 goto out;
 }
-- 
2.31.1




[PATCH v3 08/11] util/vfio-helpers: Use error_setg in qemu_vfio_find_[fixed/temp]_iova

2021-09-02 Thread Philippe Mathieu-Daudé
Both qemu_vfio_find_fixed_iova() and qemu_vfio_find_temp_iova()
return an errno which is unused (or overwritten). Have them propagate
any errors to callers, returning a boolean (which is what the
Error API recommends; see commit e3fe3988d78 "error: Document Error
API usage rules" for the rationale).

Suggested-by: Klaus Jensen 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 util/vfio-helpers.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)
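
The convention from commit e3fe3988d78 is that a function taking an
Error ** returns a value that says unambiguously whether it failed,
typically bool. A minimal sketch (hypothetical failure condition):

    #include "qemu/osdep.h"
    #include "qapi/error.h"

    static bool find_range(size_t size, uint64_t *iova, Error **errp)
    {
        if (size == 0) {                  /* hypothetical failure case */
            error_setg(errp, "fixed iova range not found");
            return false;                 /* false <=> *errp is set */
        }
        *iova = 0x10000;                  /* hypothetical result */
        return true;
    }

Callers then test the return value directly, as the hunks below do
with 'if (!qemu_vfio_find_fixed_iova(...))', instead of comparing an
errno against zero.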

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 306b5a83e48..b93a3d35787 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -677,8 +677,8 @@ static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
 return true;
 }
 
-static int
-qemu_vfio_find_fixed_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
+static bool qemu_vfio_find_fixed_iova(QEMUVFIOState *s, size_t size,
+  uint64_t *iova, Error **errp)
 {
 int i;
 
@@ -693,14 +693,16 @@ qemu_vfio_find_fixed_iova(QEMUVFIOState *s, size_t size, 
uint64_t *iova)
 s->usable_iova_ranges[i].end - s->low_water_mark + 1 == 0) {
 *iova = s->low_water_mark;
 s->low_water_mark += size;
-return 0;
+return true;
 }
 }
-return -ENOMEM;
+error_setg(errp, "fixed iova range not found");
+
+return false;
 }
 
-static int
-qemu_vfio_find_temp_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
+static bool qemu_vfio_find_temp_iova(QEMUVFIOState *s, size_t size,
+ uint64_t *iova, Error **errp)
 {
 int i;
 
@@ -715,10 +717,12 @@ qemu_vfio_find_temp_iova(QEMUVFIOState *s, size_t size, 
uint64_t *iova)
 s->high_water_mark - s->usable_iova_ranges[i].start + 1 == 0) {
 *iova = s->high_water_mark - size;
 s->high_water_mark = *iova;
-return 0;
+return true;
 }
 }
-return -ENOMEM;
+error_setg(errp, "temporary iova range not found");
+
+return false;
 }
 
 /**
@@ -762,7 +766,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t 
size,
 goto out;
 }
 if (!temporary) {
-if (qemu_vfio_find_fixed_iova(s, size, &iova0)) {
+if (!qemu_vfio_find_fixed_iova(s, size, &iova0, errp)) {
 ret = -ENOMEM;
 goto out;
 }
@@ -776,7 +780,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t 
size,
 }
 qemu_vfio_dump_mappings(s);
 } else {
-if (qemu_vfio_find_temp_iova(s, size, &iova0)) {
+if (!qemu_vfio_find_temp_iova(s, size, &iova0, errp)) {
 ret = -ENOMEM;
 goto out;
 }
-- 
2.31.1




[PATCH v3 09/11] util/vfio-helpers: Simplify qemu_vfio_dma_map() returning directly

2021-09-02 Thread Philippe Mathieu-Daudé
To simplify qemu_vfio_dma_map():
- reduce 'ret' (returned value) scope by returning errno directly,
- remove the goto 'out' label.

Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 util/vfio-helpers.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index b93a3d35787..66148bd381c 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -748,7 +748,6 @@ static bool qemu_vfio_water_mark_reached(QEMUVFIOState *s, 
size_t size,
 int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
   bool temporary, uint64_t *iova, Error **errp)
 {
-int ret = 0;
 int index;
 IOVAMapping *mapping;
 uint64_t iova0;
@@ -761,32 +760,31 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, 
size_t size,
 if (mapping) {
 iova0 = mapping->iova + ((uint8_t *)host - (uint8_t *)mapping->host);
 } else {
+int ret;
+
 if (qemu_vfio_water_mark_reached(s, size, errp)) {
-ret = -ENOMEM;
-goto out;
+return -ENOMEM;
 }
 if (!temporary) {
 if (!qemu_vfio_find_fixed_iova(s, size, &iova0, errp)) {
-ret = -ENOMEM;
-goto out;
+return -ENOMEM;
 }
 
 mapping = qemu_vfio_add_mapping(s, host, size, index + 1, iova0);
 assert(qemu_vfio_verify_mappings(s));
 ret = qemu_vfio_do_mapping(s, host, size, iova0);
-if (ret) {
+if (ret < 0) {
 qemu_vfio_undo_mapping(s, mapping, NULL);
-goto out;
+return ret;
 }
 qemu_vfio_dump_mappings(s);
 } else {
 if (!qemu_vfio_find_temp_iova(s, size, &iova0, errp)) {
-ret = -ENOMEM;
-goto out;
+return -ENOMEM;
 }
 ret = qemu_vfio_do_mapping(s, host, size, iova0);
-if (ret) {
-goto out;
+if (ret < 0) {
+return ret;
 }
 }
 }
@@ -794,8 +792,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t 
size,
 if (iova) {
 *iova = iova0;
 }
-out:
-return ret;
+return 0;
 }
 
 /* Reset the high watermark and free all "temporary" mappings. */
-- 
2.31.1




[PATCH v3 06/11] util/vfio-helpers: Pass Error handle to qemu_vfio_dma_map()

2021-09-02 Thread Philippe Mathieu-Daudé
Currently qemu_vfio_dma_map() displays errors on stderr. When using
the management interface, this information is simply lost. Pass
qemu_vfio_dma_map() an Error** handle so it can propagate the error
to callers.

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 include/qemu/vfio-helpers.h |  2 +-
 block/nvme.c                | 22 +++---
 util/vfio-helpers.c         | 10 ++
 3 files changed, 18 insertions(+), 16 deletions(-)
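
Two Error API helpers carry the propagation here: error_prepend()
adds context in front of an already-set *errp, and error_reportf_err()
reports a local Error with a prefix and frees it. A sketch of both
caller patterns (map_buffer() is a hypothetical stand-in for
qemu_vfio_dma_map()):

    #include "qemu/osdep.h"
    #include "qapi/error.h"

    int map_buffer(void *host, size_t size, Error **errp);

    static bool init_queue(void *host, size_t size, Error **errp)
    {
        if (map_buffer(host, size, errp) < 0) {
            error_prepend(errp, "Cannot map queue: ");  /* add context */
            return false;
        }
        return true;
    }

    static void register_buf(void *host, size_t size)
    {
        Error *local_err = NULL;

        if (map_buffer(host, size, &local_err) < 0) {
            /* report to monitor/stderr and free local_err */
            error_reportf_err(local_err, "nvme_register_buf failed: ");
        }
    }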

diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
index 4491c8e1a6e..bde9495b254 100644
--- a/include/qemu/vfio-helpers.h
+++ b/include/qemu/vfio-helpers.h
@@ -18,7 +18,7 @@ typedef struct QEMUVFIOState QEMUVFIOState;
 QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp);
 void qemu_vfio_close(QEMUVFIOState *s);
 int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
-  bool temporary, uint64_t *iova_list);
+  bool temporary, uint64_t *iova_list, Error **errp);
 int qemu_vfio_dma_reset_temporary(QEMUVFIOState *s);
 void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host);
 void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
diff --git a/block/nvme.c b/block/nvme.c
index 0786c501e46..80546b0babd 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -176,12 +176,11 @@ static bool nvme_init_queue(BDRVNVMeState *s, NVMeQueue 
*q,
 return false;
 }
 memset(q->queue, 0, bytes);
-r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, false, &q->iova);
+r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, false, &q->iova, errp);
 if (r) {
-error_setg(errp, "Cannot map queue");
-return false;
+error_prepend(errp, "Cannot map queue: ");
 }
-return true;
+return r == 0;
 }
 
 static void nvme_free_queue_pair(NVMeQueuePair *q)
@@ -239,9 +238,9 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState 
*s,
 qemu_co_queue_init(&q->free_req_queue);
 q->completion_bh = aio_bh_new(aio_context, nvme_process_completion_bh, q);
 r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages, bytes,
-  false, &prp_list_iova);
+  false, &prp_list_iova, errp);
 if (r) {
-error_setg_errno(errp, -r, "Cannot map buffer for DMA");
+error_prepend(errp, "Cannot map buffer for DMA: ");
 goto fail;
 }
 q->free_req_head = -1;
@@ -534,9 +533,9 @@ static bool nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 error_setg(errp, "Cannot allocate buffer for identify response");
 goto out;
 }
-r = qemu_vfio_dma_map(s->vfio, id, id_size, true, &iova);
+r = qemu_vfio_dma_map(s->vfio, id, id_size, true, &iova, errp);
 if (r) {
-error_setg(errp, "Cannot map buffer for DMA");
+error_prepend(errp, "Cannot map buffer for DMA: ");
 goto out;
 }
 
@@ -1032,7 +1031,7 @@ static coroutine_fn int 
nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd,
 try_map:
 r = qemu_vfio_dma_map(s->vfio,
   qiov->iov[i].iov_base,
-  len, true, &iova);
+  len, true, &iova, NULL);
 if (r == -ENOSPC) {
 /*
  * In addition to the -ENOMEM error, the VFIO_IOMMU_MAP_DMA
@@ -1524,14 +1523,15 @@ static void nvme_aio_unplug(BlockDriverState *bs)
 static void nvme_register_buf(BlockDriverState *bs, void *host, size_t size)
 {
 int ret;
+Error *local_err = NULL;
 BDRVNVMeState *s = bs->opaque;
 
-ret = qemu_vfio_dma_map(s->vfio, host, size, false, NULL);
+ret = qemu_vfio_dma_map(s->vfio, host, size, false, NULL, &local_err);
 if (ret) {
 /* FIXME: we may run out of IOVA addresses after repeated
  * bdrv_register_buf/bdrv_unregister_buf, because nvme_vfio_dma_unmap
  * doesn't reclaim addresses for fixed mappings. */
-error_report("nvme_register_buf failed: %s", strerror(-ret));
+error_reportf_err(local_err, "nvme_register_buf failed: ");
 }
 }
 
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index e7909222cfd..77cdec845d9 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -463,13 +463,15 @@ static void qemu_vfio_ram_block_added(RAMBlockNotifier 
*n, void *host,
   size_t size, size_t max_size)
 {
 QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+Error *local_err = NULL;
 int ret;
 
 trace_qemu_vfio_ram_block_added(s, host, max_size);
-ret = qemu_vfio_dma_map(s, host, max_size, false, NULL);
+ret = qemu_vfio_dma_map(s, host, max_size, false, NULL, &local_err);
 if (ret) {
-error_report("qemu_vfio_dma_map(%p, %zu) failed: %s", host, max_size,
- strerror(-ret));
+error_reportf_err(local_err,
+  "qemu_vfio_dma_map(%p, %zu) failed: ",

[PATCH v3 11/11] block/nvme: Only report VFIO error on failed retry

2021-09-02 Thread Philippe Mathieu-Daudé
We expect the first qemu_vfio_dma_map() call to fail (indicating
DMA mapping exhaustion, see commit 15a730e7a3a). Do not report the
first failure as an error, since we are going to flush the mappings
and retry.

This removes spurious error message displayed on the monitor:

  (qemu) c
  (qemu) qemu-kvm: VFIO_MAP_DMA failed: No space left on device
  (qemu) info status
  VM status: running

Reported-by: Tingting Mao 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
---
 block/nvme.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)
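
Distilled, the change makes the first mapping attempt silent
(errp == NULL) and only captures an Error on the retry that follows
the flush. A simplified sketch (try_dma_map() and flush_temporary()
are hypothetical stand-ins for the qemu_vfio_* calls):

    #include "qemu/osdep.h"
    #include "qapi/error.h"

    int try_dma_map(void *buf, size_t len, Error **errp);
    int flush_temporary(void);

    static int map_with_retry(void *buf, size_t len)
    {
        Error *local_err = NULL;
        /* first attempt: failure is expected, so pass errp == NULL */
        int r = try_dma_map(buf, len, NULL);

        if (r == -ENOSPC) {
            /* expected exhaustion (commit 15a730e7a3a): flush, retry */
            if (flush_temporary() < 0) {
                return -EIO;
            }
            /* only the retry's error is worth reporting */
            r = try_dma_map(buf, len, &local_err);
        }
        if (local_err) {
            error_reportf_err(local_err, "Cannot map buffer for DMA: ");
        }
        return r;
    }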

diff --git a/block/nvme.c b/block/nvme.c
index 80546b0babd..abfe305baf2 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1019,6 +1019,7 @@ static coroutine_fn int 
nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd,
 uint64_t *pagelist = req->prp_list_page;
 int i, j, r;
 int entries = 0;
+Error *local_err = NULL, **errp = NULL;
 
 assert(qiov->size);
 assert(QEMU_IS_ALIGNED(qiov->size, s->page_size));
@@ -1031,7 +1032,7 @@ static coroutine_fn int 
nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd,
 try_map:
 r = qemu_vfio_dma_map(s->vfio,
   qiov->iov[i].iov_base,
-  len, true, &iova, NULL);
+  len, true, &iova, errp);
 if (r == -ENOSPC) {
 /*
  * In addition to the -ENOMEM error, the VFIO_IOMMU_MAP_DMA
@@ -1066,6 +1067,8 @@ try_map:
 goto fail;
 }
 }
+errp = &local_err;
+
 goto try_map;
 }
 if (r) {
@@ -1109,6 +1112,9 @@ fail:
  * because they are already mapped before calling this function; for
  * temporary mappings, a later nvme_cmd_(un)map_qiov will reclaim by
  * calling qemu_vfio_dma_reset_temporary when necessary. */
+if (local_err) {
+error_reportf_err(local_err, "Cannot map buffer for DMA: ");
+}
 return r;
 }
 
-- 
2.31.1




Re: [PATCH v3 08/11] util/vfio-helpers: Use error_setg in qemu_vfio_find_[fixed/temp]_iova

2021-09-02 Thread Klaus Jensen
On Sep  2 09:00, Philippe Mathieu-Daudé wrote:
> Both qemu_vfio_find_fixed_iova() and qemu_vfio_find_temp_iova()
> return an errno which is unused (or overwritten). Have them propagate
> eventual errors to callers, returning a boolean (which is what the
> Error API recommends, see commit e3fe3988d78 "error: Document Error
> API usage rules" for rationale).
> 
> Suggested-by: Klaus Jensen 
> Reviewed-by: Klaus Jensen 
> Signed-off-by: Philippe Mathieu-Daudé 

The switch to bool LGTM.

Reviewed and Acked :)




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-02 Thread Leonardo Bras Soares Passos
Hello Daniel,
A few more comments:

On Wed, Sep 1, 2021 at 5:51 AM Daniel P. Berrangé  wrote:
>
> On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > send calls. It does so by avoiding copying user data into kernel 
> > > > buffers.
> > > >
> > > > To make it work, three steps are needed:
> > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > 3 - Process the socket's error queue, dealing with any error
> > >
> > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > >
> > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > from a synchronous call to an asynchronous call.
> > >
> > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > until an asynchronous completion notification has been received from
> > > the socket error queue. These notifications are not required to
> > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > to the buffer if a re-transmit is needed.
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "Page pinning also changes system call semantics. It temporarily
> > >shares the buffer between process and network stack. Unlike with
> > >copying, the process cannot immediately overwrite the buffer
> > >after system call return without possibly modifying the data in
> > >flight. Kernel integrity is not affected, but a buggy program
> > >can possibly corrupt its own data stream."
> > >
> > > AFAICT, the design added in this patch does not provide any way
> > > to honour these requirements around buffer lifetime.
> > >
> > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > way. The buffer lifetime requirements imply need for an API
> > > design that is fundamentally different for asynchronous usage,
> > > with a callback to notify when the write has finished/failed.
> >
> > Regarding buffer reuse - it indeed has a very deep implication on the buffer
> > being available and it's not obvious at all.  Just to mention that the 
> > initial
> > user of this work will make sure all zero copy buffers will be guest pages 
> > only
> > (as it's only used in multi-fd), so they should always be there during the
> > process.
>
> That is not the case when migration is using TLS, because the buffers
> transmitted are the ciphertext buffer, not the plaintext guest page.

That makes sense.
I am still working my way through the migration code, and I ended up
not knowing that part.

I think we can work on KTLS for this case, and implement something
'like zerocopy' for this. I added more details in the previous mail I sent you.


>
> > In short, we may just want to at least having a way to make sure all zero
> > copied buffers are finished using and they're sent after some function 
> > returns
> > (e.g., qio_channel_flush()).  That may require us to do some accounting on 
> > when
> > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> > error queue and keep those information somewhere too.
> >
> > Some other side notes that reached my mind..
> >
> > The qio_channel_writev_full() may not be suitable for async operations, as 
> > the
> > name "full" implies synchronous to me.  So maybe we can add a new helper for
> > zero copy on the channel?
>
> All the APIs that exist today are fundamentally only suitable for sync
> operations. Supporting async correctly will definitely a brand new APIs
> separate from what exists today.

That's great! I was thinking of implementing this as an optional method for
QIOChannel, more details in the previous mail :).

(sorry, I ended up reading/replying the mails in thread order)

>
> Regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>

Best regards,
Leonardo Bras
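
For readers following the thread, the three steps quoted above map
onto roughly the following (a minimal non-QEMU sketch, assuming a
Linux >= 4.14 kernel; error handling trimmed):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <linux/errqueue.h>

    static ssize_t send_zc(int fd, void *buf, size_t len)
    {
        int one = 1;
        /* 1) enable SO_ZEROCOPY once per socket */
        setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

        /* 2) request zerocopy per send; the call is now asynchronous:
         *    'buf' must stay untouched until its completion arrives */
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
        return sendmsg(fd, &msg, MSG_ZEROCOPY);
    }

    static void reap_zc_completions(int fd)
    {
        /* 3) drain the socket error queue for completions */
        char control[128];
        struct msghdr msg = { .msg_control = control,
                              .msg_controllen = sizeof(control) };
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
            return;
        }
        for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm;
             cm = CMSG_NXTHDR(&msg, cm)) {
            struct sock_extended_err serr;
            memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
            if (serr.ee_errno == 0 &&
                serr.ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                /* sends [ee_info, ee_data] completed; their buffers
                 * may be reused.  Notifications can be merged and can
                 * arrive out of order, as discussed above. */
            }
        }
    }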




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Leonardo Bras Soares Passos
Hello Daniel, thanks for the feedback !

On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé  wrote:
>
> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> >
> > Change the send_write() interface of multifd, allowing it to pass down
> > flags for qio_channel_write*().
> >
> > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > other data being sent at the default copying approach.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/multifd-zlib.c | 7 ---
> >  migration/multifd-zstd.c | 7 ---
> >  migration/multifd.c  | 9 ++---
> >  migration/multifd.h  | 3 ++-
> >  4 files changed, 16 insertions(+), 10 deletions(-)
>
> > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> >  }
> >
> >  if (used) {
> > -ret = multifd_send_state->ops->send_write(p, used, 
> > &local_err);
> > +ret = multifd_send_state->ops->send_write(p, used, 
> > MSG_ZEROCOPY,
> > +  &local_err);
>
> I don't think it is valid to unconditionally enable this feature due to the
> resource usage implications
>
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
>
>   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
>if the socket option was not set, the socket exceeds its optmem
>limit or the user exceeds its ulimit on locked pages."

You are correct, I unfortunately missed this part in the docs :(

> The limit on locked pages is something that looks very likely to be
> exceeded unless you happen to be running a QEMU config that already
> implies locked memory (eg PCI assignment)

Do you mean the limit an user has on locking memory?

If so, that makes sense. I remember I needed to set the upper limit of locked
memory for the user before using it, or adding a capability to qemu before.

Maybe an option would be trying to mlock all guest memory before setting
zerocopy=on in qemu code. If it fails, we can print an error message and fall
back to not using zerocopy (following the idea of a new io_async_writev()
I told you in the previous mail).


>
>
> Regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>

Best regards,
Leonardo




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Jason Wang



On 2021/9/1 11:35 PM, Peter Xu wrote:
> On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
>> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
>>> On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
>>>> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
>>>>> Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
>>>>>
>>>>> Change the send_write() interface of multifd, allowing it to pass down
>>>>> flags for qio_channel_write*().
>>>>>
>>>>> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
>>>>> other data being sent at the default copying approach.
>>>>>
>>>>> Signed-off-by: Leonardo Bras 
>>>>> ---
>>>>>  migration/multifd-zlib.c | 7 ---
>>>>>  migration/multifd-zstd.c | 7 ---
>>>>>  migration/multifd.c  | 9 ++---
>>>>>  migration/multifd.h  | 3 ++-
>>>>>  4 files changed, 16 insertions(+), 10 deletions(-)
>>>>> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
>>>>>  }
>>>>>
>>>>>  if (used) {
>>>>> -ret = multifd_send_state->ops->send_write(p, used, &local_err);
>>>>> +ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
>>>>> +  &local_err);
>>>> I don't think it is valid to unconditionally enable this feature due to the
>>>> resource usage implications
>>>>
>>>> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
>>>>
>>>>    "A zerocopy failure will return -1 with errno ENOBUFS. This happens
>>>> if the socket option was not set, the socket exceeds its optmem
>>>> limit or the user exceeds its ulimit on locked pages."
>>>>
>>>> The limit on locked pages is something that looks very likely to be
>>>> exceeded unless you happen to be running a QEMU config that already
>>>> implies locked memory (eg PCI assignment)
>>> Yes it would be great to be a migration capability in parallel to multifd. At
>>> initial phase if it's easy to be implemented on multi-fd only, we can add a
>>> dependency between the caps.  In the future we can remove that dependency when
>>> the code is ready to go without multifd.  Thanks,
>> Also, I'm wondering how zerocopy support interacts with kernel support
>> for kTLS and multipath-TCP, both of which we want to be able to use
>> with migration.
> Copying Jason Wang for net implications between these features on kernel side


Note that the MSG_ZEROCOPY is contributed by Google :)


> and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).


I think they can. Anyway kernel can choose to do datacopy when necessary.

Note that the "zerocopy" is probably not correct here. What's better is
"Enable MSG_ZEROCOPY" since:

1) kernel supports various kinds of zerocopy, for TX, it has supported
sendfile() for many years.
2) MSG_ZEROCOPY is only used for TX but not RX
3) TCP rx zerocopy is only supported via mmap() which requires some
extra configurations e.g 4K MTU, driver support for header split etc.

[1] https://www.youtube.com/watch?v=_ZfiQGWFvg0

Thanks


> From the safe side we may want to only enable one of them until we prove
> they'll work together I guess..
>
> Not a immediate concern as I don't really think any of them is really
> explicitly supported in qemu.
>
> KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> least we need some knob to detect whether kTLS is enabled in gnutls.






Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Leonardo Bras Soares Passos
Hello Peter, thank you for this feedback!

On Tue, Aug 31, 2021 at 5:29 PM Peter Xu  wrote:
> Yes it would be great to be a migration capability in parallel to multifd. At
> initial phase if it's easy to be implemented on multi-fd only, we can add a
> dependency between the caps.  In the future we can remove that dependency when
> the code is ready to go without multifd.  Thanks,

I thought the idea was to use MSG_ZEROCOPY whenever possible, and otherwise
fall back to the default copying mechanism.

In replies to 2/3 I mentioned a new method for the QIOChannel interface
that would fall back to copying, and that may be a way to avoid
needing a capability.

Best regards,
Leonardo




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Leonardo Bras Soares Passos
A few more comments on this one:

On Wed, Sep 1, 2021 at 12:44 PM Daniel P. Berrangé  wrote:
>
> > From the safe side we may want to only enable one of them until we prove
> > they'll work together I guess..
>
> MPTCP is good when we're network limited for migration
>
> KTLS will be good when we're CPU limited on AES for migration,
> which is essentially always when TLS is used.
>
> ZEROCOPY will be good when we're CPU limited for data copy
> on migration, or to reduce the impact on other concurrent
> VMs on the same CPUs.
>
> Ultimately we woudld benefit from all of them at the same
> time, if it were technically possible todo.

Maybe I misunderstood something, but IIRC KTLS kind of does some
'zerocopy'.

I mean, in a usual setup, we would have a buffer A containing the data
to be encrypted, a buffer B that would receive the encrypted data,
and a buffer C, in the kernel, that would receive a copy of buffer B.

With KTLS, buffer A would be passed to the kernel and encrypted
directly into buffer C, avoiding one copy.
It's also possible that buffer A would be copied to a buffer B'
inside the kernel, and then encrypted into buffer C, but I am not
sure that would make sense.

Anyway, other than the B' case, it would make no sense to have
MSG_ZEROCOPY in QIOChannelTLS, so that's something I need to change
in this patchset.

>
> > Not a immediate concern as I don't really think any of them is really
> > explicitly supported in qemu.
>
> QEMU has mptcp support already:
>
>   commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
>   Author: Dr. David Alan Gilbert 
>   Date:   Wed Apr 21 12:28:34 2021 +0100
>
> sockets: Support multipath TCP
>
> Multipath TCP allows combining multiple interfaces/routes into a single
> socket, with very little work for the user/admin.
>
> It's enabled by 'mptcp' on most socket addresses:
>
>./qemu-system-x86_64 -nographic -incoming tcp:0:,mptcp
>
> > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> > gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> > least we need some knob to detect whether kTLS is enabled in gnutls.
>
> It isn't possible for gnutls to transparently enable KTLS, because
> GNUTLS doesn't get to see the actual socket directly

Yes, IIRC it uses gnuTLS with callbacks instead.

> - it'll need
> some work in QEMU to enable it.

Maybe it's overkill, but what if we implement KTLS support directly
in QEMU, and fall back to gnuTLS when it's not available?




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Leonardo Bras Soares Passos
Thanks for contributing Jason!

On Thu, Sep 2, 2021 at 4:23 AM Jason Wang  wrote:
>
>
> On 2021/9/1 11:35 PM, Peter Xu wrote:
> > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> >> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> >>> On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> >>>> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> >>>>> Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> >>>>>
> >>>>> Change the send_write() interface of multifd, allowing it to pass down
> >>>>> flags for qio_channel_write*().
> >>>>>
> >>>>> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> >>>>> other data being sent at the default copying approach.
> >>>>>
> >>>>> Signed-off-by: Leonardo Bras 
> >>>>> ---
> >>>>>   migration/multifd-zlib.c | 7 ---
> >>>>>   migration/multifd-zstd.c | 7 ---
> >>>>>   migration/multifd.c  | 9 ++---
> >>>>>   migration/multifd.h  | 3 ++-
> >>>>>   4 files changed, 16 insertions(+), 10 deletions(-)
> >>>>> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> >>>>>   }
> >>>>>
> >>>>>   if (used) {
> >>>>> -ret = multifd_send_state->ops->send_write(p, used, &local_err);
> >>>>> +ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> >>>>> +  &local_err);
> >>>> I don't think it is valid to unconditionally enable this feature due to the
> >>>> resource usage implications
> >>>>
> >>>> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >>>>
> >>>>    "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> >>>> if the socket option was not set, the socket exceeds its optmem
> >>>> limit or the user exceeds its ulimit on locked pages."
> >>>>
> >>>> The limit on locked pages is something that looks very likely to be
> >>>> exceeded unless you happen to be running a QEMU config that already
> >>>> implies locked memory (eg PCI assignment)
> >>> Yes it would be great to be a migration capability in parallel to 
> >>> multifd. At
> >>> initial phase if it's easy to be implemented on multi-fd only, we can add 
> >>> a
> >>> dependency between the caps.  In the future we can remove that dependency 
> >>> when
> >>> the code is ready to go without multifd.  Thanks,
> >> Also, I'm wondering how zerocopy support interacts with kernel support
> >> for kTLS and multipath-TCP, both of which we want to be able to use
> >> with migration.
> > Copying Jason Wang for net implications between these features on kernel 
> > side
>
>
> Note that the MSG_ZEROCOPY is contributed by Google :)
>
>
> > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
>
>
> I think they can. Anyway kernel can choose to do datacopy when necessary.
>
> Note that the "zerocopy" is probably not correct here. What's better is
> "Enable MSG_ZEROCOPY" since:
>
> 1) kernel supports various kinds of zerocopy, for TX, it has supported
> sendfile() for many years.
> 2) MSG_ZEROCOPY is only used for TX but not RX
> 3) TCP rx zerocopy is only supported via mmap() which requires some
> extra configurations e.g 4K MTU, driver support for header split etc.

RX would be my next challenge :)

>
> [1] https://www.youtube.com/watch?v=_ZfiQGWFvg0

Thank you for sharing!

Best regards,
Leonardo




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Daniel P . Berrangé
On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> Hello Daniel, thanks for the feedback !
> 
> On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé  
> wrote:
> >
> > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > >
> > > Change the send_write() interface of multifd, allowing it to pass down
> > > flags for qio_channel_write*().
> > >
> > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > other data being sent at the default copying approach.
> > >
> > > Signed-off-by: Leonardo Bras 
> > > ---
> > >  migration/multifd-zlib.c | 7 ---
> > >  migration/multifd-zstd.c | 7 ---
> > >  migration/multifd.c  | 9 ++---
> > >  migration/multifd.h  | 3 ++-
> > >  4 files changed, 16 insertions(+), 10 deletions(-)
> >
> > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > >  }
> > >
> > >  if (used) {
> > > -ret = multifd_send_state->ops->send_write(p, used, 
> > > &local_err);
> > > +ret = multifd_send_state->ops->send_write(p, used, 
> > > MSG_ZEROCOPY,
> > > +  &local_err);
> >
> > I don't think it is valid to unconditionally enable this feature due to the
> > resource usage implications
> >
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >
> >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> >if the socket option was not set, the socket exceeds its optmem
> >limit or the user exceeds its ulimit on locked pages."
> 
> You are correct, I unfortunately missed this part in the docs :(
> 
> > The limit on locked pages is something that looks very likely to be
> > exceeded unless you happen to be running a QEMU config that already
> > implies locked memory (eg PCI assignment)
> 
> Do you mean the limit an user has on locking memory?

Yes, by default the limit QEMU sees will be something very small.

> If so, that makes sense. I remember I needed to set the upper limit of locked
> memory for the user before using it, or adding a capability to qemu before.
> 
> Maybe an option would be trying to mlock all guest memory before setting
> zerocopy=on in qemu code. If it fails, we can print an error message and fall
> back to not using zerocopy (following the idea of a new io_async_writev()
> I told you in the previous mail).

Currently ability to lock memory is something that has to be configured
when QEMU starts, and it requires libvirt to grant suitable permissions
to QEMU. Memory locking is generally undesirable because it prevents
memory overcommit. Or rather if you are allowing memory overcommit, then
allowing memory locking is a way to kill your entire host.

I don't think we can unconditionally grant ability to lock arbitrary
guest RAM at startup, just to satisfy a possible desire to use zerocopy
migration later. Granting it at runtime feels questionable as you now
need to track and predict how much locked memory you've allowed, and
also have possible problems with revocation.

Possibly you could unconditionally grant ability to lock a small amount
of guest RAM at startup, but how small can it be, while still making a
useful difference to migration. It would imply we also need to be very
careful with migration to avoid having too large an amount of outstanding
zerocopy requests to exceed the limit.

IOW, the only clear place in which we can use zerocopy, is where we are
already forced to accept the penalty of locked memory at startup. eg when
the guest is using huge pages and no overcommit, or possibly when the guest
is using PCI device assignment, though in the latter I can't remember if
we allow entire of guest RAM to be locked or not.

Overall the memory locking needs look like a significant constraint that
will affect ability to use this feature.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v1 1/3] io: Enable write flags for QIOChannel

2021-09-02 Thread Leonardo Bras Soares Passos
Thanks for the feedback Eric!

On Wed, Sep 1, 2021 at 5:54 PM Eric Blake  wrote:
>
> On Tue, Aug 31, 2021 at 08:02:37AM -0300, Leonardo Bras wrote:
> > Some syscalls used for writing, such as sendmsg(), accept flags that
> > can modify their behavior, even allowing the usage of features such as
> > MSG_ZEROCOPY.
> >
> > Change qio_channel_write*() interface to allow passing down flags,
> > allowing a more flexible use of IOChannel.
> >
> > At first, it's use is enabled for QIOChannelSocket, but can be easily
> > extended to any other QIOChannel implementation.
>
> As a followup to this patch, I wonder if we can also get performance
> improvements by implementing MSG_MORE, and using that in preference to
> corking/uncorking to better indicate that back-to-back short messages
> may behave better when grouped together over the wire.  At least the
> NBD code could make use of it (going off of my experience with the
> libnbd project demonstrating a performance boost when we added
> MSG_MORE support there).

That's interesting!
We could use this patchset for testing that out, as I believe it's easy
to add those flags to the sendmsg() we want to have it enabled.

>
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  chardev/char-io.c   |  2 +-
> >  hw/remote/mpqemu-link.c |  2 +-
> >  include/io/channel.h| 56 -
> >  io/channel-buffer.c |  1 +
> >  io/channel-command.c|  1 +
> >  io/channel-file.c   |  1 +
> >  io/channel-socket.c |  4 ++-
> >  io/channel-tls.c|  1 +
> >  io/channel-websock.c|  1 +
> >  io/channel.c| 53 ++-
> >  migration/rdma.c|  1 +
> >  scsi/pr-manager-helper.c|  2 +-
> >  tests/unit/test-io-channel-socket.c |  1 +
> >  13 files changed, 81 insertions(+), 45 deletions(-)
> >
> > diff --git a/chardev/char-io.c b/chardev/char-io.c
> > index 8ced184160..4ea7b1ee2a 100644
> > --- a/chardev/char-io.c
> > +++ b/chardev/char-io.c
> > @@ -122,7 +122,7 @@ int io_channel_send_full(QIOChannel *ioc,
> >
> >  ret = qio_channel_writev_full(
> >  ioc, &iov, 1,
> > -fds, nfds, NULL);
> > +fds, 0, nfds, NULL);
>
> 0 before nfds here...

Good catch! I will fix that for v2!

>
> >  if (ret == QIO_CHANNEL_ERR_BLOCK) {
> >  if (offset) {
> >  return offset;
> > diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
> > index 7e841820e5..0d13321ef0 100644
> > --- a/hw/remote/mpqemu-link.c
> > +++ b/hw/remote/mpqemu-link.c
> > @@ -69,7 +69,7 @@ bool mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc, 
> > Error **errp)
> >  }
> >
> >  if (!qio_channel_writev_full_all(ioc, send, G_N_ELEMENTS(send),
> > -fds, nfds, errp)) {
> > + fds, nfds, 0, errp)) {
>
> Thanks for fixing the broken indentation.

I kept questioning myself if I was breaking something here :)

>
> ...but after nfds here, so one is wrong; up to this point in a linear
> review, I can't tell which was intended...
>
> > +++ b/include/io/channel.h
> > @@ -104,6 +104,7 @@ struct QIOChannelClass {
> >   size_t niov,
> >   int *fds,
> >   size_t nfds,
> > + int flags,
> >   Error **errp);
>
> ...and finally I see that in general, you wanted to add the argument
> after.  Looks like the change to char-io.c is buggy.

Yeap!

>
> (You can use scripts/git.orderfile as a way to force your patch to
> list the .h file first, to make it easier for linear review).

Nice tip! just added to my qemu gitconfig :)

>
> >  ssize_t (*io_readv)(QIOChannel *ioc,
> >  const struct iovec *iov,
> > @@ -260,6 +261,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
> >  size_t niov,
> >  int *fds,
> >  size_t nfds,
> > +int flags,
> >  Error **errp);
> >
> >  /**
> > @@ -325,6 +327,7 @@ int qio_channel_readv_all(QIOChannel *ioc,
> >   * @ioc: the channel object
> >   * @iov: the array of memory regions to write data from
> >   * @niov: the length of the @iov array
> > + * @flags: optional sending flags
> >   * @errp: pointer to a NULL-initialized error object
> >   *
> >   * Write data to the IO channel, reading it from the
> > @@ -339,10 +342,14 @@ int qio_channel_readv_all(QIOChannel *ioc,
> >   *
> >   * Returns: 0 if all bytes were written, or -1 on error
> >   */
> > -int qio_channel_writev_all(QIOChannel *ioc,
> > -   const struct iovec *iov,
> > -   size_t niov,
> > -   Error **erp);
> > +int

Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-02 Thread Daniel P . Berrangé
On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> Hello Daniel, thank you for the feedback!
> 
> Comments inline.
> 
> On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé  
> wrote:
> >
> > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > send calls. It does so by avoiding copying user data into kernel buffers.
> > >
> > > To make it work, three steps are needed:
> > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > 3 - Process the socket's error queue, dealing with any error
> >
> > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> >
> > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > from a synchronous call to an asynchronous call.
> 
> You are correct. I tried to adapt the io_writev() interface to use MSG_ZEROCOPY in
> a somewhat synchronous way, but returning errp (and sometimes closing the
> channel because of it) was a poor implementation.
> 
> >
> > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > until an asynchronous completion notification has been received from
> > the socket error queue. These notifications are not required to
> > arrive in-order, even for a TCP stream, because the kernel hangs on
> > to the buffer if a re-transmit is needed.
> >
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >
> >   "Page pinning also changes system call semantics. It temporarily
> >shares the buffer between process and network stack. Unlike with
> >copying, the process cannot immediately overwrite the buffer
> >after system call return without possibly modifying the data in
> >flight. Kernel integrity is not affected, but a buggy program
> >can possibly corrupt its own data stream."
> >
> 
> From the above piece of documentation, I gather there is no problem in
> overwriting the buffer, but a corrupted or newer version of the memory may
> be sent instead of the original one. I am pointing this out because there
> are workloads, like page migration, that would not be impacted, given that
> once the buffer is changed, it will dirty the page and it will be re-sent.

The idea of having the buffer overwritten is still seriously worrying
me. I get the idea that in theory the "worst" that should happen is
that we unexpectedly transmit newer guest memory contents. There are
so many edge cases in migration code though that I'm worried there
might be scenarios in which that is still going to cause problems.

The current migration code has a synchronization point at the end of
each iteration of the migration loop. At the end of each iteration,
the source has a guarantee that all pages from the dirty bitmap have
now been sent. With the ZEROCOPY feature we have entirely removed all
synchronization points from the source host wrt memory sending. At
best we know that the pages will have been sent if we get to close()
without seeing errors, unless we're going to explicitly track the
completion messages. The TCP re-transmit semantics are especially
worrying to me, because the re-transmit is liable to send different
data than was in the original lost TCP packet.

Maybe everything is still safe because TCP is a reliable ordered
stream, and thus eventually the dst will get the newest guest
page contents. I'm still very worried that we have code in the
source that implicitly relies on there being some synchronization
points that we've effectively removed.

> > AFAICT, the design added in this patch does not provide any way
> > to honour these requirements around buffer lifetime.
> >
> > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > way. The buffer lifetime requirements imply need for an API
> > design that is fundamentally different for asynchronous usage,
> > with a callback to notify when the write has finished/failed.
> 
> That is a good point.
> Proposing a new optional method like io_async_writev() + an error
> checking mechanism could do the job.
> io_async_writev() could fall-back to io_writev() in cases where it's not
> implemented.
> 
> I am not sure about the error checking yet.
> Options I can see are:
> 1 - A callback, as you suggested, which IIUC would be provided by
> code using the QIOChannel, and would only fix the reported errors,
> leaving the responsibility of checking for errors to the IOChannel code.

A callback is fairly simple, but potentially limits performance. Reading
the kernel docs it seems they intentionally merge notifications to enable
a whole contiguous set to be processed in one go. A callback would
effectively be unmerging them requiring processing one a time. Not
sure how much this matters to our usage.

> 2 - A new method, maybe io_async_errck(), which would be called
> whenever the using code wants to deal with pending errors. It could
> return an array/list of IOVs that 

Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Leonardo Bras Soares Passos
On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé  wrote:
>
> On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel, thanks for the feedback !
> >
> > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > Call qio_channel_set_zerocopy(true) in the start of every multifd 
> > > > thread.
> > > >
> > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > flags for qio_channel_write*().
> > > >
> > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > other data being sent at the default copying approach.
> > > >
> > > > Signed-off-by: Leonardo Bras 
> > > > ---
> > > >  migration/multifd-zlib.c | 7 ---
> > > >  migration/multifd-zstd.c | 7 ---
> > > >  migration/multifd.c  | 9 ++---
> > > >  migration/multifd.h  | 3 ++-
> > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > >
> > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > >  }
> > > >
> > > >  if (used) {
> > > > -ret = multifd_send_state->ops->send_write(p, used, 
> > > > &local_err);
> > > > +ret = multifd_send_state->ops->send_write(p, used, 
> > > > MSG_ZEROCOPY,
> > > > +  &local_err);
> > >
> > > I don't think it is valid to unconditionally enable this feature due to 
> > > the
> > > resource usage implications
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > >if the socket option was not set, the socket exceeds its optmem
> > >limit or the user exceeds its ulimit on locked pages."
> >
> > You are correct, I unfortunately missed this part in the docs :(
> >
> > > The limit on locked pages is something that looks very likely to be
> > > exceeded unless you happen to be running a QEMU config that already
> > > implies locked memory (eg PCI assignment)
> >
> > Do you mean the limit an user has on locking memory?
>
> Yes, by default the limit QEMU sees will be something very small.
>
> > If so, that makes sense. I remember I needed to set the upper limit of 
> > locked
> > memory for the user before using it, or adding a capability to qemu before.
> >
> > Maybe an option would be trying to mlock all guest memory before setting
> > zerocopy=on in qemu code. If it fails, we can print an error message and 
> > fall
> > back to not using zerocopy (following the idea of a new io_async_writev()
> > I told you in the previous mail).
>
> Currently ability to lock memory is something that has to be configured
> when QEMU starts, and it requires libvirt to grant suitable permissions
> to QEMU. Memory locking is generally undesirable because it prevents
> memory overcommit. Or rather if you are allowing memory overcommit, then
> allowing memory locking is a way to kill your entire host.

You mean it's gonna consume too much memory, or something else?

>
> I don't think we can unconditionally grant ability to lock arbitrary
> guest RAM at startup, just to satisfy a possible desire to use zerocopy
> migration later. Granting it at runtime feels questionable as you now
> need to track and predict how much locked memory you've allowed, and
> also have possible problems with revocation.

(I am really new to this, so please forgive me if I am asking dumb or
overly basic questions)

What does revocation mean in this context?
You give the process the ability to lock n MB of memory, and then you take it away?
Why would that happen? Is locked memory a limited resource?

>
> Possibly you could unconditionally grant ability to lock a small amount
> of guest RAM at startup, but how small can it be, while still making a
> useful difference to migration. It would imply we also need to be very
> careful with migration to avoid having too large an amount of outstanding
> zerocopy requests to exceed the limit.

Yeah, having to decide on a value that would be ok to lock is very
complex, given we can migrate with multifd, which can make this value grow
a lot, unless we only allow a few of those fds to really use zerocopy.

>
> IOW, the only clear place in which we can use zerocopy, is where we are
> already forced to accept the penalty of locked memory at startup. eg when
> the guest is using huge pages and no overcommit, or possibly when the guest
> is using PCI device assignment,

It would already be something, given that those scenarios are the ones with
the largest number of pages to migrate. But I understand we could give it a try
in other scenarios.

> though in the latter I can't remember if
> we allow entire of guest RAM to be locked or not.

If I recall correctly from a previous discussion, this was the case at least for
PCI passthrough.

>
> Overall the memory locking needs look like a significant constrai
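
(A rough sketch of the fallback idea floated above, under stated
assumptions: try_enable_zerocopy() and the may_pin_bytes threshold are
hypothetical, not from the series; the check is only a heuristic against
the ENOBUFS/locked-pages failure mode the kernel docs describe.)

#include <stdbool.h>
#include <sys/resource.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60   /* Linux-specific; absent from older headers */
#endif

/* Sketch: enable zerocopy only if the locked-pages limit plausibly
 * covers what migration may pin; otherwise report failure so the
 * caller can fall back to ordinary copying sends. */
static bool try_enable_zerocopy(int fd, size_t may_pin_bytes)
{
    struct rlimit rl;
    int one = 1;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) < 0) {
        return false;
    }
    if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur < may_pin_bytes) {
        return false;   /* likely to hit ENOBUFS mid-migration */
    }
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0;
}

As the discussion above suggests, such a check can only be a heuristic;
the locked-memory budget is really a host-level policy decision.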

Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Daniel P . Berrangé
On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé  wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > Hello Daniel, thanks for the feedback !
> > >
> > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé  
> > > wrote:
> > > >
> > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd 
> > > > > thread.
> > > > >
> > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > flags for qio_channel_write*().
> > > > >
> > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping 
> > > > > the
> > > > > other data being sent at the default copying approach.
> > > > >
> > > > > Signed-off-by: Leonardo Bras 
> > > > > ---
> > > > >  migration/multifd-zlib.c | 7 ---
> > > > >  migration/multifd-zstd.c | 7 ---
> > > > >  migration/multifd.c  | 9 ++---
> > > > >  migration/multifd.h  | 3 ++-
> > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > >
> > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > >  }
> > > > >
> > > > >  if (used) {
> > > > > -ret = multifd_send_state->ops->send_write(p, used, 
> > > > > &local_err);
> > > > > +ret = multifd_send_state->ops->send_write(p, used, 
> > > > > MSG_ZEROCOPY,
> > > > > +  
> > > > > &local_err);
> > > >
> > > > I don't think it is valid to unconditionally enable this feature due to 
> > > > the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >if the socket option was not set, the socket exceeds its optmem
> > > >limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit a user has on locking memory?
> >
> > Yes, by default the limit QEMU sees will be something very small.
> >
> > > If so, that makes sense. I remember I needed to set the upper limit of 
> > > locked
> > > memory for the user before using it, or adding a capability to qemu 
> > > before.
> > >
> > > Maybe an option would be trying to mlock all guest memory before setting
> > > zerocopy=on in qemu code. If it fails, we can print an error message and 
> > > fall
> > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > I told you in the previous mail).
> >
> > Currently ability to lock memory is something that has to be configured
> > when QEMU starts, and it requires libvirt to grant suitable permissions
> > to QEMU. Memory locking is generally undesirable because it prevents
> > memory overcommit. Or rather if you are allowing memory overcommit, then
> > allowing memory locking is a way to kill your entire host.
> 
> You mean it's gonna consume too much memory, or something else?

Essentially yes. 

> > I don't think we can unconditionally grant ability to lock arbitrary
> > guest RAM at startup, just to satisfy a possible desire to use zerocopy
> > migration later. Granting it at runtime feels questionable as you now
> > need to track and predict how much locked memory you've allowed, and
> > also have possible problems with revocation.
> 
> (I am really new to this, so please forgive me if I am asking dumb or
> overly basic questions)
> 
> What does revocation mean in this context?
> You give the process the ability to lock n MB of memory, and then you take it away?
> Why would that happen? Is locked memory a limited resource?

Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
total.

So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
which exceeds physical RAM. Most of the time this may well be fine
as the VMs don't concurrently need their full RAM allocation, and
worst case they'll get pushed to swap as the kernel re-shares
memory in response to load. So perhaps each VM only needs 20 GB
resident at any time, but over time one VM can burst up to 32 GB
and then have 12 GB of it swapped out later when inactive.

But now consider if we allowed 2 of the VMs to lock memory for
purposes of migration. Those 2 VMs can now pin 64 GB of memory
in the worst case, leaving no free memory for the 3rd VM or
for the OS. This will likely take down the entire host, regardless
of swap availability.

IOW, if you are overcommitting RAM you have to be extremely
careful about allowing any VM to lock memory. If yo

Re: [PATCH RFC 0/8] blockdev-replace

2021-09-02 Thread Vladimir Sementsov-Ogievskiy

ping

02.08.2021 21:54, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

As a continuation of "Qemu block filter insertion/removal API"
discussion, here is my proposal of blockdev-replace.

This implementation allows:

- replace children of different parents: BDS, block devices, block
   exports

- automatically replace all parents of a specific BDS, excluding creation
   of loops (like bdrv_replace_node())

- do several replacements in one transaction

It's an untested draft, so you may want to go straight to patch 8 to look
at the QAPI addition.

Vladimir Sementsov-Ogievskiy (8):
   block-backend: blk_root(): drop const specifier on return type
   block: add BlockParentClass class
   block: realize BlockParentClass for BlockDriverState
   block/export: realize BlockParentClass functionality
   qdev: improve find_device_state() to distinguish simple not found case
   qdev: realize BlockParentClass
   block: improve bdrv_replace_node_noperm()
   qapi: add blockdev-replace command

  qapi/block-core.json   |  78 
  include/block/block-parent.h   |  32 +++
  include/sysemu/block-backend.h |   2 +-
  block.c| 158 -
  block/block-backend.c  |   2 +-
  block/block-parent.c   |  66 ++
  block/export/export.c  |  44 +
  softmmu/qdev-monitor.c |  90 +++
  block/meson.build  |   1 +
  9 files changed, 453 insertions(+), 20 deletions(-)
  create mode 100644 include/block/block-parent.h
  create mode 100644 block/block-parent.c




--
Best regards,
Vladimir



Re: [RFC] qemu_cleanup: do vm_shutdown() before bdrv_drain_all_begin()

2021-09-02 Thread Vladimir Sementsov-Ogievskiy

ping

30.07.2021 17:29, Vladimir Sementsov-Ogievskiy wrote:

It doesn't seem good to stop handling I/O while the guest is still running.
For example it leads to the following:

After bdrv_drain_all_begin(), during vm_shutdown() scsi_dma_writev()
calls blk_aio_pwritev(). As we are in a drained section, the request waits
in blk_wait_while_drained().

Next, during bdrv_close_all() bs is removed from blk, and blk drain
finishes. So, the request is resumed, and fails with -ENOMEDIUM.
The corresponding BLOCK_IO_ERROR event is sent and ends up in the libvirt
log. That doesn't seem good.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

Hi all!

In our product (v5.2 based) we faced -ENOMEDIUM BLOCK_IO_ERROR events
during qemu termination (by SIGTERM). I don't have a reproducer for
master. Still, the problem seems possible.

Ideas of how to reproduce it are welcome.

Also, I thought that issuing blk_ requests from SCSI is not possible
during a drained section, and that the blk_wait_while_drained() logic was
introduced for IDE. Which code is responsible for not issuing SCSI
requests during drained sections? Maybe it is racy, or it may be our
downstream problem, I don't know :(

  softmmu/runstate.c | 11 ---
  1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 10d9b7365a..1966d773f3 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -797,21 +797,18 @@ void qemu_cleanup(void)
   */
  blk_exp_close_all();
  
+/* No more vcpu or device emulation activity beyond this point */

+vm_shutdown();
+replay_finish();
+
  /*
   * We must cancel all block jobs while the block layer is drained,
   * or cancelling will be affected by throttling and thus may block
   * for an extended period of time.
- * vm_shutdown() will bdrv_drain_all(), so we may as well include
- * it in the drained section.
   * We do not need to end this section, because we do not want any
   * requests happening from here on anyway.
   */
  bdrv_drain_all_begin();
-
-/* No more vcpu or device emulation activity beyond this point */
-vm_shutdown();
-replay_finish();
-
  job_cancel_sync_all();
  bdrv_close_all();
  




--
Best regards,
Vladimir



Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-02 Thread Leonardo Bras Soares Passos
On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé  wrote:
>
> On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel, thank you for the feedback!
> >
> > Comments inline.
> >
> > On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > send calls. It does so by avoiding copying user data into kernel 
> > > > buffers.
> > > >
> > > > To make it work, three steps are needed:
> > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > 3 - Process the socket's error queue, dealing with any error
> > >
> > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > >
> > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > from a synchronous call to an asynchronous call.
> >
> > You are correct. I tried to adapt io_writev() interface to use MSG_ZEROCOPY 
> > in
> > a somewhat synchronous way, but returning errp (and sometimes closing the
> > channel because of it) was a poor implementation.
> >
> > >
> > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > until an asynchronous completion notification has been received from
> > > the socket error queue. These notifications are not required to
> > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > to the buffer if a re-transmit is needed.
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "Page pinning also changes system call semantics. It temporarily
> > >shares the buffer between process and network stack. Unlike with
> > >copying, the process cannot immediately overwrite the buffer
> > >after system call return without possibly modifying the data in
> > >flight. Kernel integrity is not affected, but a buggy program
> > >can possibly corrupt its own data stream."
> > >
> >
> > By the above piece of documentation, I get that there is no problem in
> > overwriting the buffer, but a corrupt or newer version of the memory may
> > be sent instead of the original one. I am pointing this out because there
> > are workloads like page migration that would not be impacted, given
> > once the buffer is changed, it will dirty the page and it will be re-sent.
>
> The idea of having the buffer overwritten is still seriously worrying
> me. I get the idea that in theory the "worst" that should happen is
> that we unexpectedly transmit newer guest memory contents. There are
> so many edge cases in migration code though that I'm worried there
> might be scenarios in which that is still going to cause problems.

I agree we should keep an eye on that, and maybe get David Gilbert's
opinion on it.

>
> The current migration code has a synchronization point at the end of
> each iteration of the migration loop. At the end of each iteration,
> the source has a guarantee that all pages from the dirty bitmap have
> now been sent. With the ZEROCOPY feature we have entirely removed all
> synchronization points from the source host wrt memory sending. At
> best we know that the pages will have been sent if we get to close()
> without seeing errors, unless we're going to explicitly track the
> completion messages. The TCP re-transmit semantics are especially
> worrying to me, because the re-transmit is liable to send different
> data than was in the original lost TCP packet.
>
> Maybe everything is still safe because TCP is a reliable ordered
> stream, and thus eventually the dst will get the newest guest
> page contents. I'm still very worried that we have code in the
> source that implicitly relies on there being some synchronization
> points that we've effectively removed.
>
> > > AFAICT, the design added in this patch does not provide any way
> > > to honour these requirements around buffer lifetime.
> > >
> > > I can't see how we can introduce MSG_ZEROCOPY in any seamless
> > > way. The buffer lifetime requirements imply need for an API
> > > design that is fundamentally different for asynchronous usage,
> > > with a callback to notify when the write has finished/failed.
> >
> > That is a good point.
> > Proposing a new optional method like io_async_writev() + an error
> > checking mechanism could do the job.
> > io_async_writev() could fall back to io_writev() in cases where it's not
> > implemented.
> >
> > I am not sure about the error checking yet.
> > Options I can see are:
> > 1 - A callback, as you suggested, which IIUC would be provided by
> > code using the QIOChannel, and would only fix the reported errors,
> > leaving the responsibility of checking for errors to the IOChannel code.
>
> A callback is fairly simple, but potentially limits performance. Reading
> the kernel docs it seems they intentionally merge notifications to enable
> a whole cont
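
(Sketch only, summarizing the three steps quoted earlier from the patch
description; send_zerocopy() is an illustrative name, not from the series,
and error handling is elided. As discussed, the buffer must not be reused
until the completion notification arrives.)

#include <sys/socket.h>
#include <sys/uio.h>

/* 2 - pass MSG_ZEROCOPY on each send; the kernel pins the pages
 *     instead of copying them, so the call is effectively async */
static ssize_t send_zerocopy(int fd, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    return sendmsg(fd, &msg, MSG_ZEROCOPY);
}

/* Usage:
 *   int one = 1;
 *   1 - setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
 *   2 - send_zerocopy(fd, page, size) for each page;
 *   3 - later, reap completions from the error queue (recvmsg with
 *       MSG_ERRQUEUE) before reusing or freeing the pages.
 */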

[PATCH v4 5/5] iotests/297: Cover tests/

2021-09-02 Thread Hanna Reitz
297 so far does not check the named tests, which reside in the tests/
directory (i.e. full path tests/qemu-iotests/tests).  Fix it.

Thanks to the previous two commits, all named tests pass its scrutiny,
so we do not have to add anything to SKIP_FILES.

Signed-off-by: Hanna Reitz 
Reviewed-by: Willian Rampazzo 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
---
 tests/qemu-iotests/297 | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/297 b/tests/qemu-iotests/297
index c7d709cf50..ac10bd1e1a 100755
--- a/tests/qemu-iotests/297
+++ b/tests/qemu-iotests/297
@@ -55,8 +55,9 @@ def is_python_file(filename):
 
 
 def run_linters():
-files = [filename for filename in (set(os.listdir('.')) - set(SKIP_FILES))
- if is_python_file(filename)]
+named_tests = [f'tests/{entry}' for entry in os.listdir('tests')]
+check_tests = set(os.listdir('.') + named_tests) - set(SKIP_FILES)
+files = [filename for filename in check_tests if is_python_file(filename)]
 
 iotests.logger.debug('Files to be checked:')
 iotests.logger.debug(', '.join(sorted(files)))
-- 
2.31.1




Re: [PATCH v3 00/11] block/nvme: Rework error reporting

2021-09-02 Thread Stefan Hajnoczi
On Thu, Sep 02, 2021 at 09:00:14AM +0200, Philippe Mathieu-Daudé wrote:
> (Series fully reviewed)
> 
> Hi,
> 
> This series contains various patches sent last year with
> review comments addressed, few more cleanups, and a new
> patch which remove the spurious "VFIO_MAP_DMA failed: No
> space left on device" now poping up since commit 15a730e7a.
> (it is the expected behavior, which is why we retry the
> same call after flushing the DMA mappings).
> 
> Since v2:
> - qemu_vfio_find_[fixed/temp]_iova retun bool (Klaus)
> - Add Klaus's R-b
> 
> Since v1:
> - Addressed Klaus review comments (cleaner Error* handling)
> - Add Klaus's R-b
> 
> Regards,
> 
> Phil.
> 
> Philippe Mathieu-Daudé (11):
>   block/nvme: Use safer trace format string
>   util/vfio-helpers: Let qemu_vfio_verify_mappings() use error_report()
>   util/vfio-helpers: Replace qemu_mutex_lock() calls with
> QEMU_LOCK_GUARD
>   util/vfio-helpers: Remove unreachable code in qemu_vfio_dma_map()
>   block/nvme: Have nvme_create_queue_pair() report errors consistently
>   util/vfio-helpers: Pass Error handle to qemu_vfio_dma_map()
>   util/vfio-helpers: Extract qemu_vfio_water_mark_reached()
>   util/vfio-helpers: Use error_setg in qemu_vfio_find_[fixed/temp]_iova
>   util/vfio-helpers: Simplify qemu_vfio_dma_map() returning directly
>   util/vfio-helpers: Let qemu_vfio_do_mapping() propagate Error
>   block/nvme: Only report VFIO error on failed retry
> 
>  include/qemu/vfio-helpers.h |  2 +-
>  block/nvme.c| 29 +++
>  util/vfio-helpers.c | 99 -
>  block/trace-events  |  2 +-
>  4 files changed, 76 insertions(+), 56 deletions(-)
> 
> -- 
> 2.31.1
> 
> 

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan


signature.asc
Description: PGP signature


[PATCH v4 3/5] migrate-bitmaps-test: Fix pylint warnings

2021-09-02 Thread Hanna Reitz
There are a couple of things pylint takes issue with:
- The "time" import is unused
- The import order (iotests should come last)
- get_bitmap_hash() doesn't use @self and so should be a function
- Semicolons at the end of some lines
- Parentheses after "if"
- Some lines are too long (80 characters instead of 79)
- inject_test_case()'s @name parameter shadows a top-level @name
  variable
- "lambda self: mc(self)" were equivalent to just "mc", but in
  inject_test_case(), it is not equivalent, so add a comment and disable
  the warning locally
- Always put two empty lines after a function
- f'exec: cat > /dev/null' does not need to be an f-string

Fix them.

Signed-off-by: Hanna Reitz 
---
 tests/qemu-iotests/tests/migrate-bitmaps-test | 43 +++
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/tests/qemu-iotests/tests/migrate-bitmaps-test 
b/tests/qemu-iotests/tests/migrate-bitmaps-test
index a5c7bc83e0..dc431c35b3 100755
--- a/tests/qemu-iotests/tests/migrate-bitmaps-test
+++ b/tests/qemu-iotests/tests/migrate-bitmaps-test
@@ -20,11 +20,10 @@
 #
 
 import os
-import iotests
-import time
 import itertools
 import operator
 import re
+import iotests
 from iotests import qemu_img, qemu_img_create, Timeout
 
 
@@ -37,6 +36,12 @@ mig_cmd = 'exec: cat > ' + mig_file
 incoming_cmd = 'exec: cat ' + mig_file
 
 
+def get_bitmap_hash(vm):
+result = vm.qmp('x-debug-block-dirty-bitmap-sha256',
+node='drive0', name='bitmap0')
+return result['return']['sha256']
+
+
 class TestDirtyBitmapMigration(iotests.QMPTestCase):
 def tearDown(self):
 self.vm_a.shutdown()
@@ -62,21 +67,16 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
 params['persistent'] = True
 
 result = vm.qmp('block-dirty-bitmap-add', **params)
-self.assert_qmp(result, 'return', {});
-
-def get_bitmap_hash(self, vm):
-result = vm.qmp('x-debug-block-dirty-bitmap-sha256',
-node='drive0', name='bitmap0')
-return result['return']['sha256']
+self.assert_qmp(result, 'return', {})
 
 def check_bitmap(self, vm, sha256):
 result = vm.qmp('x-debug-block-dirty-bitmap-sha256',
 node='drive0', name='bitmap0')
 if sha256:
-self.assert_qmp(result, 'return/sha256', sha256);
+self.assert_qmp(result, 'return/sha256', sha256)
 else:
 self.assert_qmp(result, 'error/desc',
-"Dirty bitmap 'bitmap0' not found");
+"Dirty bitmap 'bitmap0' not found")
 
 def do_test_migration_resume_source(self, persistent, migrate_bitmaps):
 granularity = 512
@@ -97,7 +97,7 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
 self.add_bitmap(self.vm_a, granularity, persistent)
 for r in regions:
 self.vm_a.hmp_qemu_io('drive0', 'write %d %d' % r)
-sha256 = self.get_bitmap_hash(self.vm_a)
+sha256 = get_bitmap_hash(self.vm_a)
 
 result = self.vm_a.qmp('migrate', uri=mig_cmd)
 while True:
@@ -106,7 +106,7 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
 break
 while True:
 result = self.vm_a.qmp('query-status')
-if (result['return']['status'] == 'postmigrate'):
+if result['return']['status'] == 'postmigrate':
 break
 
 # test that bitmap is still here
@@ -164,7 +164,7 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
 self.add_bitmap(self.vm_a, granularity, persistent)
 for r in regions:
 self.vm_a.hmp_qemu_io('drive0', 'write %d %d' % r)
-sha256 = self.get_bitmap_hash(self.vm_a)
+sha256 = get_bitmap_hash(self.vm_a)
 
 if pre_shutdown:
 self.vm_a.shutdown()
@@ -214,16 +214,22 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
 self.check_bitmap(self.vm_b, sha256 if persistent else False)
 
 
-def inject_test_case(klass, name, method, *args, **kwargs):
+def inject_test_case(klass, suffix, method, *args, **kwargs):
 mc = operator.methodcaller(method, *args, **kwargs)
-setattr(klass, 'test_' + method + name, lambda self: mc(self))
+# We want to add a function attribute to `klass`, so that it is
+# correctly converted to a method on instantiation.  The
+# methodcaller object `mc` is a callable, not a function, so we
+# need the lambda to turn it into a function.
+# pylint: disable=unnecessary-lambda
+setattr(klass, 'test_' + method + suffix, lambda self: mc(self))
+
 
 for cmb in list(itertools.product((True, False), repeat=5)):
 name = ('_' if cmb[0] else '_not_') + 'persistent_'
 name += ('_' if cmb[1] else '_not_') + 'migbitmap_'
 name += '_online' if cmb[2] else '_offline'
 name += '_shared' if cmb[3] else '_nonshared'
-if (cmb[4]):
+if cmb[4]:
 name += '__pre_shutdown'
 
 inject_test_cas

[PATCH v4 1/5] iotests/297: Drop 169 and 199 from the skip list

2021-09-02 Thread Hanna Reitz
169 and 199 have been renamed and moved to tests/ (commit a44be0334be:
"iotests: rename and move 169 and 199 tests"), so we can drop them from
the skip list.

Signed-off-by: Hanna Reitz 
Reviewed-by: Willian Rampazzo 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
---
 tests/qemu-iotests/297 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/297 b/tests/qemu-iotests/297
index 345b617b34..c7d709cf50 100755
--- a/tests/qemu-iotests/297
+++ b/tests/qemu-iotests/297
@@ -29,7 +29,7 @@ import iotests
 SKIP_FILES = (
 '030', '040', '041', '044', '045', '055', '056', '057', '065', '093',
 '096', '118', '124', '132', '136', '139', '147', '148', '149',
-'151', '152', '155', '163', '165', '169', '194', '196', '199', '202',
+'151', '152', '155', '163', '165', '194', '196', '202',
 '203', '205', '206', '207', '208', '210', '211', '212', '213', '216',
 '218', '219', '224', '228', '234', '235', '236', '237', '238',
 '240', '242', '245', '246', '248', '255', '256', '257', '258', '260',
-- 
2.31.1




[PATCH v4 4/5] mirror-top-perms: Fix AbnormalShutdown path

2021-09-02 Thread Hanna Reitz
The AbnormalShutdown exception class is not in qemu.machine, but in
qemu.machine.machine.  (qemu.machine.AbnormalShutdown was enough for
Python to find it in order to run this test, but pylint complains about
it.)

Signed-off-by: Hanna Reitz 
---
 tests/qemu-iotests/tests/mirror-top-perms | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/tests/mirror-top-perms 
b/tests/qemu-iotests/tests/mirror-top-perms
index 451a0666f8..2fc8dd66e0 100755
--- a/tests/qemu-iotests/tests/mirror-top-perms
+++ b/tests/qemu-iotests/tests/mirror-top-perms
@@ -47,7 +47,7 @@ class TestMirrorTopPerms(iotests.QMPTestCase):
 def tearDown(self):
 try:
 self.vm.shutdown()
-except qemu.machine.AbnormalShutdown:
+except qemu.machine.machine.AbnormalShutdown:
 pass
 
 if self.vm_b is not None:
-- 
2.31.1




[PATCH v4 2/5] migrate-bitmaps-postcopy-test: Fix pylint warnings

2021-09-02 Thread Hanna Reitz
pylint complains that discards1_sha256 and all_discards_sha256 are first
set in non-__init__ methods.

These variables are not really class-variables anyway, so let them
instead be returned by start_postcopy(), thus silencing pylint.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Hanna Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 .../tests/migrate-bitmaps-postcopy-test | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test 
b/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test
index 584062b412..00ebb5c251 100755
--- a/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test
+++ b/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test
@@ -132,10 +132,10 @@ class 
TestDirtyBitmapPostcopyMigration(iotests.QMPTestCase):
 
 result = self.vm_a.qmp('x-debug-block-dirty-bitmap-sha256',
node='drive0', name='bitmap0')
-self.discards1_sha256 = result['return']['sha256']
+discards1_sha256 = result['return']['sha256']
 
 # Check, that updating the bitmap by discards works
-assert self.discards1_sha256 != empty_sha256
+assert discards1_sha256 != empty_sha256
 
 # We want to calculate resulting sha256. Do it in bitmap0, so, disable
 # other bitmaps
@@ -148,7 +148,7 @@ class TestDirtyBitmapPostcopyMigration(iotests.QMPTestCase):
 
 result = self.vm_a.qmp('x-debug-block-dirty-bitmap-sha256',
node='drive0', name='bitmap0')
-self.all_discards_sha256 = result['return']['sha256']
+all_discards_sha256 = result['return']['sha256']
 
 # Now, enable some bitmaps, to be updated during migration
 for i in range(2, nb_bitmaps, 2):
@@ -173,10 +173,11 @@ class 
TestDirtyBitmapPostcopyMigration(iotests.QMPTestCase):
 
 event_resume = self.vm_b.event_wait('RESUME')
 self.vm_b_events.append(event_resume)
-return event_resume
+return (event_resume, discards1_sha256, all_discards_sha256)
 
 def test_postcopy_success(self):
-event_resume = self.start_postcopy()
+event_resume, discards1_sha256, all_discards_sha256 = \
+self.start_postcopy()
 
 # enabled bitmaps should be updated
 apply_discards(self.vm_b, discards2)
@@ -217,7 +218,7 @@ class TestDirtyBitmapPostcopyMigration(iotests.QMPTestCase):
 for i in range(0, nb_bitmaps, 5):
 result = self.vm_b.qmp('x-debug-block-dirty-bitmap-sha256',
node='drive0', name='bitmap{}'.format(i))
-sha = self.discards1_sha256 if i % 2 else self.all_discards_sha256
+sha = discards1_sha256 if i % 2 else all_discards_sha256
 self.assert_qmp(result, 'return/sha256', sha)
 
 def test_early_shutdown_destination(self):
-- 
2.31.1




[PATCH v2] block: drop BLK_PERM_GRAPH_MOD

2021-09-02 Thread Vladimir Sementsov-Ogievskiy
First, this permission never protected a node from being changed, as
generic child-replacing functions don't check it.

Second, it's a strange thing: it represents a permission of the parent
node to change its child. But generally, children are replaced by different
mechanisms, like jobs or qmp commands, not by nodes.

The graph-mod permission is hard to understand. All other permissions
describe operations which are done by the parent node on its child: read,
write, resize. Graph modification operations are something completely
different.

The only place where BLK_PERM_GRAPH_MOD is used as "perm" (not shared
perm) is mirror_start_job, for s->target. Still, modern code should use
bdrv_freeze_backing_chain() to protect against graph modification; if we
don't do that somewhere, it may be considered a bug. So, it's a bit
risky to drop GRAPH_MOD, and analyzing the possible loss of protection
is hard. But we should do it one day; let's do it now.

One more bit of information is that locking the corresponding byte in
file-posix doesn't make sense at all.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

v2: fix grammar

 qapi/block-core.json  |  7 ++-
 include/block/block.h |  9 +
 block.c   |  7 +--
 block/commit.c|  1 -
 block/mirror.c| 15 +++
 hw/block/block.c  |  3 +--
 scripts/render_block_graph.py |  1 -
 tests/qemu-iotests/273.out|  4 
 8 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 06674c25c9..6fa2c4ab82 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1825,14 +1825,11 @@
 #
 # @resize: This permission is required to change the size of a block node.
 #
-# @graph-mod: This permission is required to change the node that this
-# BdrvChild points to.
-#
 # Since: 4.0
 ##
 { 'enum': 'BlockPermission',
-  'data': [ 'consistent-read', 'write', 'write-unchanged', 'resize',
-'graph-mod' ] }
+  'data': [ 'consistent-read', 'write', 'write-unchanged', 'resize' ] }
+
 ##
 # @XDbgBlockGraphEdge:
 #
diff --git a/include/block/block.h b/include/block/block.h
index 3477290f9a..bccca61429 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -269,12 +269,13 @@ enum {
 BLK_PERM_RESIZE = 0x08,
 
 /**
- * This permission is required to change the node that this BdrvChild
- * points to.
+ * There was a now-removed bit BLK_PERM_GRAPH_MOD, with value of 0x10. QEMU
+ * 6.1 and earlier may still lock the corresponding byte in 
block/file-posix
+ * locking.  So, implementing some new permission should be very careful to
+ * not interfere with this old unused thing.
  */
-BLK_PERM_GRAPH_MOD  = 0x10,
 
-BLK_PERM_ALL= 0x1f,
+BLK_PERM_ALL= 0x0f,
 
 DEFAULT_PERM_PASSTHROUGH= BLK_PERM_CONSISTENT_READ
  | BLK_PERM_WRITE
diff --git a/block.c b/block.c
index e97ce0b1c8..761d47d6f3 100644
--- a/block.c
+++ b/block.c
@@ -2397,7 +2397,6 @@ char *bdrv_perm_names(uint64_t perm)
 { BLK_PERM_WRITE,   "write" },
 { BLK_PERM_WRITE_UNCHANGED, "write unchanged" },
 { BLK_PERM_RESIZE,  "resize" },
-{ BLK_PERM_GRAPH_MOD,   "change children" },
 { 0, NULL }
 };
 
@@ -2513,8 +2512,7 @@ static void bdrv_default_perms_for_cow(BlockDriverState 
*bs, BdrvChild *c,
 shared = 0;
 }
 
-shared |= BLK_PERM_CONSISTENT_READ | BLK_PERM_GRAPH_MOD |
-  BLK_PERM_WRITE_UNCHANGED;
+shared |= BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED;
 
 if (bs->open_flags & BDRV_O_INACTIVE) {
 shared |= BLK_PERM_WRITE | BLK_PERM_RESIZE;
@@ -2632,7 +2630,6 @@ uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission 
qapi_perm)
 [BLOCK_PERMISSION_WRITE]= BLK_PERM_WRITE,
 [BLOCK_PERMISSION_WRITE_UNCHANGED]  = BLK_PERM_WRITE_UNCHANGED,
 [BLOCK_PERMISSION_RESIZE]   = BLK_PERM_RESIZE,
-[BLOCK_PERMISSION_GRAPH_MOD]= BLK_PERM_GRAPH_MOD,
 };
 
 QEMU_BUILD_BUG_ON(ARRAY_SIZE(permissions) != BLOCK_PERMISSION__MAX);
@@ -5326,8 +5323,6 @@ int bdrv_drop_intermediate(BlockDriverState *top, 
BlockDriverState *base,
 update_inherits_from = bdrv_inherits_from_recursive(base, explicit_top);
 
 /* success - we can delete the intermediate states, and link top->base */
-/* TODO Check graph modification op blockers (BLK_PERM_GRAPH_MOD) once
- * we've figured out how they should work. */
 if (!backing_file_str) {
 bdrv_refresh_filename(base);
 backing_file_str = base->filename;
diff --git a/block/commit.c b/block/commit.c
index 42792b4556..837b07e314 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -370,7 +370,6 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 s->base = blk_new(s->common.job.aio_context,
   ba

[PATCH v4 0/5] iotests/297: Cover tests/

2021-09-02 Thread Hanna Reitz
v1: https://lists.nongnu.org/archive/html/qemu-block/2021-03/msg01471.html
v2: https://lists.nongnu.org/archive/html/qemu-block/2021-05/msg00492.html
v3: https://lists.nongnu.org/archive/html/qemu-block/2021-05/msg00569.html


Hi,

Sorry for the long delay, here is v4 to make our lint checking iotest
297 cover the tests/ subdirectory.


v4:
- Fixed the explanatory comment in patch 3 as suggested by Vladimir
- Added patch 4


git-backport-diff against v3:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/5:[----] [-C] 'iotests/297: Drop 169 and 199 from the skip list'
002/5:[----] [--] 'migrate-bitmaps-postcopy-test: Fix pylint warnings'
003/5:[0006] [FC] 'migrate-bitmaps-test: Fix pylint warnings'
004/5:[down] 'mirror-top-perms: Fix AbnormalShutdown path'
005/5:[----] [--] 'iotests/297: Cover tests/'


Hanna Reitz (5):
  iotests/297: Drop 169 and 199 from the skip list
  migrate-bitmaps-postcopy-test: Fix pylint warnings
  migrate-bitmaps-test: Fix pylint warnings
  mirror-top-perms: Fix AbnormalShutdown path
  iotests/297: Cover tests/

 tests/qemu-iotests/297|  7 +--
 .../tests/migrate-bitmaps-postcopy-test   | 13 +++---
 tests/qemu-iotests/tests/migrate-bitmaps-test | 43 +++
 tests/qemu-iotests/tests/mirror-top-perms |  2 +-
 4 files changed, 37 insertions(+), 28 deletions(-)

-- 
2.31.1




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Leonardo Bras Soares Passos
On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé  wrote:
>
> On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos 
> > > wrote:
> > > > Hello Daniel, thanks for the feedback !
> > > >
> > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé 
> > > >  wrote:
> > > > >
> > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd 
> > > > > > thread.
> > > > > >
> > > > > > Change the send_write() interface of multifd, allowing it to pass 
> > > > > > down
> > > > > > flags for qio_channel_write*().
> > > > > >
> > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping 
> > > > > > the
> > > > > > other data being sent at the default copying approach.
> > > > > >
> > > > > > Signed-off-by: Leonardo Bras 
> > > > > > ---
> > > > > >  migration/multifd-zlib.c | 7 ---
> > > > > >  migration/multifd-zstd.c | 7 ---
> > > > > >  migration/multifd.c  | 9 ++---
> > > > > >  migration/multifd.h  | 3 ++-
> > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > >
> > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > >  }
> > > > > >
> > > > > >  if (used) {
> > > > > > -ret = multifd_send_state->ops->send_write(p, used, 
> > > > > > &local_err);
> > > > > > +ret = multifd_send_state->ops->send_write(p, used, 
> > > > > > MSG_ZEROCOPY,
> > > > > > +  
> > > > > > &local_err);
> > > > >
> > > > > I don't think it is valid to unconditionally enable this feature due 
> > > > > to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >if the socket option was not set, the socket exceeds its optmem
> > > > >limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit a user has on locking memory?
> > >
> > > Yes, by default the limit QEMU sees will be something very small.
> > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of 
> > > > locked
> > > > memory for the user before using it, or adding a capability to qemu 
> > > > before.
> > > >
> > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > zerocopy=on in qemu code. If it fails, we can print an error message 
> > > > and fall
> > > > back to not using zerocopy (following the idea of a new 
> > > > io_async_writev()
> > > > I told you in the previous mail).
> > >
> > > Currently ability to lock memory is something that has to be configured
> > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > to QEMU. Memory locking is generally undesirable because it prevents
> > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > allowing memory locking is a way to kill your entire host.
> >
> > You mean it's gonna consume too much memory, or something else?
>
> Essentially yes.

Well, maybe we can check for available memory before doing that,
but maybe it's too much effort.

>
> > > I don't think we can unconditionally grant ability to lock arbitrary
> > > guest RAM at startup, just to satisfy a possible desire to use zerocopy
> > > migration later. Granting it at runtime feels questionable as you now
> > > need to track and predict how much locked memory you've allowed, and
> > > also have possible problems with revocation.
> >
> > (I am really new to this, so please forgive me if I am asking dumb or
> > overly basic questions)
> >
> > What does revocation mean in this context?
> > You give the process the ability to lock n MB of memory, and then you take it away?
> > Why would that happen? Is locked memory a limited resource?
>
> Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> total.
>
> So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> which exceeds physical RAM. Most of the time this may well be fine
> as the VMs don't concurrently need their full RAM allocation, and
> worst case they'll get pushed to swap as the kernel re-shares
> memory in response to load. So perhaps each VM only needs 20 GB
> resident at any time, but over time one VM can burst up to 32 GB
> and then 12 GB of it get swapped out later when inac

Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-02 Thread Daniel P . Berrangé
On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé  wrote:
> >
> > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> >
> > > > I would suggest checking in close(), but as mentioned
> > > > earlier, I think the design is flawed because the caller
> > > > fundamentally needs to know about completion for every
> > > > single write they make in order to know when the buffer
> > > > can be released / reused.
> > >
> > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > activated with a
> > > parameter flag, or on a different method if callback is preferred):
> > > In the MSG_ZEROCOPY docs, we can see that the example includes using a 
> > > poll()
> > > syscall after each packet sent, and this means the fd gets a signal after 
> > > each
> > > sendmsg() happens, with error or not.
> > >
> > > We could harness this with a poll() and a relatively high timeout:
> > > - We stop sending packets, and then call poll().
> > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > took too long
> > > without sendmsg() happening, meaning all the packets are sent.
> > > - If it returns anything else, we go back to fixing the errors found 
> > > (re-send)
> > >
> > > The problem may be defining the value of this timeout, but it could be
> > > called only
> > > when zerocopy is active.
> >
> > Maybe we need to check completions at the end of each iteration of the
> > migration dirtypage loop ?
> 
> Sorry, I am really new to this, and I still couldn't understand why we would
> need to check at the end of each iteration, instead of doing a full check
> at the end.

The end of each iteration is an implicit synchronization point in the
current migration code.

For example, we might do 2 iterations of migration pre-copy, and then
switch to post-copy mode. If the data from those 2 iterations hasn't
been sent at the point we switch to post-copy, that is a semantic
change from current behaviour. I don't know if that will have a
problematic effect on the migration process, or not. Checking the
async completions at the end of each iteration, though, would ensure
semantics similar to the current ones, reducing the risk of unexpected
problems.


> > > or something like this, if we want it to stick with zerocopy if
> > > setting it off fails.
> > > if (ret >= 0) {
> > > sioc->zerocopy_enabled = enabled;
> > > }
> >
> > Yes, that is a bug fix we need, but actually I was referring
> > to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
> > from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
> > from sendmsg.
> 
> Agree, something like io_writev(,, sioc->zerocopy_enabled ?
> MSG_ZEROCOPY : 0, errp)
> should do, right?
> (or an io_async_writev(), that will fall back to io_writev() if
> zerocopy is disabled)

Something like that - depends what API we end up deciding on

> > > We could implement KTLS as io_async_writev() for Channel_TLS, and change 
> > > this
> > > flag to async_enabled. If KTLS is not available, it would fall back to
> > > using gnuTLS on io_writev, just like it would happen in zerocopy.
> >
> > If gnutls is going to support KTLS in a way we can use, I don't want to
> > have any of that duplicated in QEMU code. I want to be able to delegate
> > as much as possible to gnutls, at least so that QEMU isn't in the loop
> > for any crypto sensitive parts, as that complicates certification
> > of crypto.
> 
> Yeah, that's a very good argument :)
> I think it's probably the case to move from the current callback 
> implementation
> to the implementation in which we give gnuTLS the file descriptor.

That would introduce a coupling between the two QIOChannel impls
though, which is avoided so far, because we didn't want an assumption
that a QIOChannel == a FD.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
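
(A rough sketch of the flush idea discussed in this exchange;
flush_zerocopy() and the timeout heuristic are hypothetical, not from the
series. Per the kernel documentation, error-queue notifications wake
poll() as POLLERR even when no events are requested;
drain_zerocopy_completions() is the drain helper sketched earlier in the
thread.)

#include <poll.h>

/* Sketch: wait for pending MSG_ZEROCOPY completions and drain them;
 * a poll() timeout is (heuristically) taken to mean nothing is still
 * outstanding. Could be called at the end of each migration iteration
 * to keep today's implicit synchronization points. */
static int flush_zerocopy(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = 0 };

    for (;;) {
        int n = poll(&pfd, 1, timeout_ms);

        if (n < 0) {
            return -1;
        }
        if (n == 0) {
            return 0;   /* timed out: assume all sends completed */
        }
        if (pfd.revents & POLLERR) {
            drain_zerocopy_completions(fd);   /* earlier sketch */
        }
    }
}

A more robust variant would count outstanding sends and wait until the
notification ranges cover all of them, rather than trusting a timeout.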




Re: [PATCH v4 3/5] migrate-bitmaps-test: Fix pylint warnings

2021-09-02 Thread Vladimir Sementsov-Ogievskiy

02.09.2021 12:40, Hanna Reitz wrote:

There are a couple of things pylint takes issue with:
- The "time" import is unused
- The import order (iotests should come last)
- get_bitmap_hash() doesn't use @self and so should be a function
- Semicolons at the end of some lines
- Parentheses after "if"
- Some lines are too long (80 characters instead of 79)
- inject_test_case()'s @name parameter shadows a top-level @name
   variable
- "lambda self: mc(self)" were equivalent to just "mc", but in
   inject_test_case(), it is not equivalent, so add a comment and disable
   the warning locally
- Always put two empty lines after a function
- f'exec: cat > /dev/null' does not need to be an f-string

Fix them.

Signed-off-by: Hanna Reitz


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v4 4/5] mirror-top-perms: Fix AbnormalShutdown path

2021-09-02 Thread Vladimir Sementsov-Ogievskiy

02.09.2021 12:40, Hanna Reitz wrote:

The AbnormalShutdown exception class is not in qemu.machine, but in
qemu.machine.machine.  (qemu.machine.AbnormalShutdown was enough for
Python to find it in order to run this test, but pylint complains about
it.)

Signed-off-by: Hanna Reitz 
---
  tests/qemu-iotests/tests/mirror-top-perms | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/tests/mirror-top-perms 
b/tests/qemu-iotests/tests/mirror-top-perms
index 451a0666f8..2fc8dd66e0 100755
--- a/tests/qemu-iotests/tests/mirror-top-perms
+++ b/tests/qemu-iotests/tests/mirror-top-perms
@@ -47,7 +47,7 @@ class TestMirrorTopPerms(iotests.QMPTestCase):
  def tearDown(self):
  try:
  self.vm.shutdown()
-except qemu.machine.AbnormalShutdown:
+except qemu.machine.machine.AbnormalShutdown:
  pass
  
  if self.vm_b is not None:




Hmm, interesting.. Maybe it's bad that the module has the same name as the subpackage?

Anyway, I don't care too much. Interesting how Python finds it o_O.

Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Daniel P . Berrangé
On Thu, Sep 02, 2021 at 06:49:06AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé  wrote:
> >
> > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé  
> > > wrote:
> > > >
> > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos 
> > > > wrote:
> > > > > Hello Daniel, thanks for the feedback !
> > > > >
> > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé 
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd 
> > > > > > > thread.
> > > > > > >
> > > > > > > Change the send_write() interface of multifd, allowing it to pass 
> > > > > > > down
> > > > > > > flags for qio_channel_write*().
> > > > > > >
> > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while 
> > > > > > > keeping the
> > > > > > > other data being sent at the default copying approach.
> > > > > > >
> > > > > > > Signed-off-by: Leonardo Bras 
> > > > > > > ---
> > > > > > >  migration/multifd-zlib.c | 7 ---
> > > > > > >  migration/multifd-zstd.c | 7 ---
> > > > > > >  migration/multifd.c  | 9 ++---
> > > > > > >  migration/multifd.h  | 3 ++-
> > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > >
> > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > > >  }
> > > > > > >
> > > > > > >  if (used) {
> > > > > > > -ret = multifd_send_state->ops->send_write(p, 
> > > > > > > used, &local_err);
> > > > > > > +ret = multifd_send_state->ops->send_write(p, 
> > > > > > > used, MSG_ZEROCOPY,
> > > > > > > +  
> > > > > > > &local_err);
> > > > > >
> > > > > > I don't think it is valid to unconditionally enable this feature 
> > > > > > due to the
> > > > > > resource usage implications
> > > > > >
> > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > >
> > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This 
> > > > > > happens
> > > > > >if the socket option was not set, the socket exceeds its optmem
> > > > > >limit or the user exceeds its ulimit on locked pages."
> > > > >
> > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > >
> > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > implies locked memory (eg PCI assignment)
> > > > >
> > > > > Do you mean the limit a user has on locking memory?
> > > >
> > > > Yes, by default the limit QEMU sees will be something very small.
> > > >
> > > > > If so, that makes sense. I remember I needed to set the upper limit 
> > > > > of locked
> > > > > memory for the user before using it, or adding a capability to qemu 
> > > > > before.
> > > > >
> > > > > Maybe an option would be trying to mlock all guest memory before 
> > > > > setting
> > > > > zerocopy=on in qemu code. If it fails, we can print an error message 
> > > > > and fall
> > > > > back to not using zerocopy (following the idea of a new 
> > > > > io_async_writev()
> > > > > I told you in the previous mail).
> > > >
> > > > Currently ability to lock memory is something that has to be configured
> > > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > > to QEMU. Memory locking is generally undesirable because it prevents
> > > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > > allowing memory locking is a way to kill your entire host.
> > >
> > > You mean it's gonna consume too much memory, or something else?
> >
> > Essentially yes.
> 
> Well, maybe we can check for available memory before doing that,
> but maybe it's too much effort.

From a mgmt app POV, we assume QEMU is untrustworthy, so the mgmt
app needs to enforce policy based on the worst case that the
VM configuration allows for.

Checking current available memory is not viable, because even
if you see 10 GB free, QEMU can't know if that free memory is
there to satisfy other VMs' worst case needs, or its own. QEMU
needs to be explicitly told whether it's OK to use locked memory,
and an enforcement policy applied to it.


> > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> > overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> > total.
> >
> > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> > which exceeds physical RAM. Most of the time this may well be fine
> > as the VMs don't concurrently need their full RAM allocation, and
> > worst case they'll get pushed to swap as the kernel re-shares
> > memory in response to load. So perhaps each VM only needs 20 GB
> > resident at any time, but ov

Re: [PATCH v4 4/5] mirror-top-perms: Fix AbnormalShutdown path

2021-09-02 Thread Philippe Mathieu-Daudé
On 9/2/21 11:58 AM, Vladimir Sementsov-Ogievskiy wrote:
> 02.09.2021 12:40, Hanna Reitz wrote:
>> The AbnormalShutdown exception class is not in qemu.machine, but in
>> qemu.machine.machine.  (qemu.machine.AbnormalShutdown was enough for
>> Python to find it in order to run this test, but pylint complains about
>> it.)
>>
>> Signed-off-by: Hanna Reitz 
>> ---
>>   tests/qemu-iotests/tests/mirror-top-perms | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tests/qemu-iotests/tests/mirror-top-perms
>> b/tests/qemu-iotests/tests/mirror-top-perms
>> index 451a0666f8..2fc8dd66e0 100755
>> --- a/tests/qemu-iotests/tests/mirror-top-perms
>> +++ b/tests/qemu-iotests/tests/mirror-top-perms
>> @@ -47,7 +47,7 @@ class TestMirrorTopPerms(iotests.QMPTestCase):
>>   def tearDown(self):
>>   try:
>>   self.vm.shutdown()
>> -    except qemu.machine.AbnormalShutdown:
>> +    except qemu.machine.machine.AbnormalShutdown:
>>   pass
>>     if self.vm_b is not None:
>>
> 
> Hmm, interesting.. Maybe it's bad that the module has the same name as the subpackage?

Confusing indeed. Could this be improved?




Re: [PATCH 2/2] iotests: Fix use-{list,dict}-literal warnings

2021-09-02 Thread Vladimir Sementsov-Ogievskiy

24.08.2021 18:35, Hanna Reitz wrote:

pylint proposes using `[]` instead of `list()` and `{}` instead of
`dict()`, because it is faster.  That seems simple enough, so heed its
advice.

Signed-off-by: Hanna Reitz


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-02 Thread Leonardo Bras Soares Passos
On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé  wrote:
>
> On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos 
> > > wrote:
> > >
> > > > > I would suggest checking in close(), but as mentioned
> > > > > earlier, I think the design is flawed because the caller
> > > > > fundamentally needs to know about completion for every
> > > > > single write they make in order to know when the buffer
> > > > > can be released / reused.
> > > >
> > > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > > activated with a
> > > > parameter flag, or on a different method if callback is preferred):
> > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a 
> > > > poll()
> > > > syscall after each packet sent, and this means the fd gets a signal 
> > > > after each
> > > > sendmsg() happens, with error or not.
> > > >
> > > > We could harness this with a poll() and a relatively high timeout:
> > > > - We stop sending packets, and then call poll().
> > > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > > took too long
> > > > without sendmsg() happening, meaning all the packets are sent.
> > > > - If it returns anything else, we go back to fixing the errors found 
> > > > (re-send)
> > > >
> > > > The problem may be defining the value of this timeout, but it could be
> > > > called only
> > > > when zerocopy is active.
> > >
> > > Maybe we need to check completions at the end of each iteration of the
> > > migration dirtypage loop ?
> >
> > Sorry, I am really new to this, and I still couldn't understand why we would
> > need to check at the end of each iteration, instead of doing a full check
> > at the end.
>
> The end of each iteration is an implicit synchronization point in the
> current migration code.
>
> For example, we might do 2 iterations of migration pre-copy, and then
> switch to post-copy mode. If the data from those 2 iterations hasn't
> been sent at the point we switch to post-copy, that is a semantic
> change from current behaviour. I don't know if that will have a
> problematic effect on the migration process, or not. Checking the
> async completions at the end of each iteration, though, would ensure
> semantics similar to the current ones, reducing the risk of unexpected
> problems.
>

What if we do the 'flush()' before we start post-copy, instead of after each
iteration? Would that be enough?

>
> > > > or something like this, if we want it to stick with zerocopy if
> > > > setting it off fails.
> > > > if (ret >= 0) {
> > > > sioc->zerocopy_enabled = enabled;
> > > > }
> > >
> > > Yes, that is a bug fix we need, but actually I was referring
> > > to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
> > > from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
> > > from sendmsg.
> >
> > Agree, something like io_writev(,, sioc->zerocopy_enabled ?
> > MSG_ZEROCOPY : 0, errp)
> > should do, right?
> > (or an io_async_writev(), that will fall back to io_writev() if
> > zerocopy is disabled)
>
> Something like that - depends what API we end up deciding on

Great :)

>
> > > > We could implement KTLS as io_async_writev() for Channel_TLS, and 
> > > > change this
> > > > flag to async_enabled. If KTLS is not available, it would fall back to
> > > > using gnuTLS on io_writev, just like it would happen in zerocopy.
> > >
> > > If gnutls is going to support KTLS in a way we can use, I don't want to
> > > have any of that duplicated in QEMU code. I want to be able to delegate
> > > as much as possible to gnutls, at least so that QEMU isn't in the loop
> > > for any crypto sensitive parts, as that complicates certification
> > > of crypto.
> >
> > Yeah, that's a very good argument :)
> > I think it's probably the case to move from the current callback 
> > implementation
> > to the implementation in which we give gnuTLS the file descriptor.
>
> That would introduce a coupling between the two QIOChannel impls
> though, which is avoided so far, because we didn't want an assumption
> that a QIOChannel == a FD.

Interesting, QIOChannelTLS would necessarily need to have a QIOChannelSocket,
is that right?

Maybe we can get a QIOChannelKTLS, which would use gnuTLS but get to
have its own fd, but that possibly would cause a lot of duplicated code.
On the other hand, this way the QIOChannelTLS would still be able to
accept any other channel, if desired.

Anyway, it's a complicated decision :)

> Regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>

Thank you, I am learning a lot today :)

Best regards,
Leonardo




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-02 Thread Daniel P. Berrangé
On Thu, Sep 02, 2021 at 07:19:58AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé  wrote:
> >
> > On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> > > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé  
> > > wrote:
> > > >
> > > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos 
> > > > wrote:
> > > >
> > > > > > I would suggest checkig in close(), but as mentioned
> > > > > > earlier, I think the design is flawed because the caller
> > > > > > fundamentally needs to know about completion for every
> > > > > > single write they make in order to know when the buffer
> > > > > > can be released / reused.
> > > > >
> > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > > > activated with a
> > > > > parameter flag, or on a different method if callback is preferred):
> > > > > In the MSG_ZEROCOPY docs, we can see that the example includes using 
> > > > > a poll()
> > > > > syscall after each packet sent, and this means the fd gets a signal 
> > > > > after each
> > > > > sendmsg() happens, with error or not.
> > > > >
> > > > > We could harness this with a poll() and a relatively high timeout:
> > > > > - We stop sending packets, and then call poll().
> > > > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > > > took too long
> > > > > without sendmsg() happening, meaning all the packets are sent.
> > > > > - If it returns anything else, we go back to fixing the errors found 
> > > > > (re-send)
> > > > >
> > > > > The problem may be defining the value of this timeout, but it could be
> > > > > called only
> > > > > when zerocopy is active.
> > > >
> > > > Maybe we need to check completions at the end of each iteration of the
> > > > migration dirtypage loop ?
> > >
> > > Sorry, I am really new to this, and I still couldn't understand why we
> > > would need to check at the end of each iteration, instead of doing a
> > > full check at the end.
> >
> > The end of each iteration is an implicit synchronization point in the
> > current migration code.
> >
> > For example, we might do 2 iterations of migration pre-copy, and then
> > switch to post-copy mode. If the data from those 2 iterations hasn't
> > been sent at the point we switch to post-copy, that is a semantic
> > change from current behaviour. I don't know if that will have a
> > problematic effect on the migration process, or not. Checking the
> > async completions at the end of each iteration, though, would keep
> > the semantics similar to the current ones, reducing the risk of
> > unexpected problems.
> >
> 
> What if we do the 'flush()' before we start post-copy, instead of after each
> iteration? Would that be enough?

Possibly, yes. This really needs David G's input, since he understands
the code in way more detail than me.


Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-02 Thread Leonardo Bras Soares Passos
On Thu, Sep 2, 2021 at 6:59 AM Daniel P. Berrangé  wrote:
>
> On Thu, Sep 02, 2021 at 06:49:06AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos 
> > > wrote:
> > > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé  
> > > > wrote:
> > > > >
> > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos 
> > > > > wrote:
> > > > > > Hello Daniel, thanks for the feedback !
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > > Call qio_channel_set_zerocopy(true) in the start of every 
> > > > > > > > multifd thread.
> > > > > > > >
> > > > > > > > Change the send_write() interface of multifd, allowing it to 
> > > > > > > > pass down
> > > > > > > > flags for qio_channel_write*().
> > > > > > > >
> > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while 
> > > > > > > > keeping the
> > > > > > > > other data being sent at the default copying approach.
> > > > > > > >
> > > > > > > > Signed-off-by: Leonardo Bras 
> > > > > > > > ---
> > > > > > > >  migration/multifd-zlib.c | 7 ---
> > > > > > > >  migration/multifd-zstd.c | 7 ---
> > > > > > > >  migration/multifd.c  | 9 ++---
> > > > > > > >  migration/multifd.h  | 3 ++-
> > > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > > >
> > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void 
> > > > > > > > *opaque)
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  if (used) {
> > > > > > > > -ret = multifd_send_state->ops->send_write(p, 
> > > > > > > > used, &local_err);
> > > > > > > > +ret = multifd_send_state->ops->send_write(p, 
> > > > > > > > used, MSG_ZEROCOPY,
> > > > > > > > +  
> > > > > > > > &local_err);
> > > > > > >
> > > > > > > I don't think it is valid to unconditionally enable this feature 
> > > > > > > due to the
> > > > > > > resource usage implications
> > > > > > >
> > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > > >
> > > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This 
> > > > > > > happens
> > > > > > >if the socket option was not set, the socket exceeds its optmem
> > > > > > >limit or the user exceeds its ulimit on locked pages."
> > > > > >
> > > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > > >
> > > > > > > The limit on locked pages is something that looks very likely to 
> > > > > > > be
> > > > > > > exceeded unless you happen to be running a QEMU config that 
> > > > > > > already
> > > > > > > implies locked memory (eg PCI assignment)
> > > > > >
> > > > > > Do you mean the limit an user has on locking memory?
> > > > >
> > > > > Yes, by default the limit QEMU sees will be something very small.
> > > > >
> > > > > > If so, that makes sense. I remember I needed to set the upper limit 
> > > > > > of locked
> > > > > > memory for the user before using it, or adding a capability to qemu 
> > > > > > before.
> > > > > >
> > > > > > Maybe an option would be trying to mlock all guest memory before 
> > > > > > setting
> > > > > > zerocopy=on in qemu code. If it fails, we can print an error 
> > > > > > message and fall
> > > > > > back to not using zerocopy (following the idea of a new 
> > > > > > io_async_writev()
> > > > > > I told you in the previous mail).
> > > > >
> > > > > Currently ability to lock memory is something that has to be 
> > > > > configured
> > > > > when QEMU starts, and it requires libvirt to grant suitable 
> > > > > permissions
> > > > > to QEMU. Memory locking is generally undesirable because it prevents
> > > > > memory overcommit. Or rather if you are allowing memory overcommit, 
> > > > > then
> > > > > allowing memory locking is a way to kill your entire host.
> > > >
> > > > You mean it's gonna consume too much memory, or something else?
> > >
> > > Essentially yes.
> >
> > Well, maybe we can check for available memory before doing that,
> > but maybe it's too much effort.
>
> From a mgmt app POV, we assume QEMU is untrustworthy, so the mgmt
> app needs to enforce policy based on the worst case that the
> VM configuration allows for.
>
> Checking current available memory is not viable, because even
> if you see 10 GB free, QEMU can't know if that free memory is
> there to satisfy other VMs' worst-case needs, or its own. QEMU
> needs to be explicitly told whether its OK to use locked memory,
> and an enforcement policy applied to it.


Yeah, it makes sense to let the mgmt app deal with that and enable/disable
MSG_ZEROCOPY on migration whenever it sees fit.


>
>
> > > Consider a VM host with 64 GB of RAM 
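As a concrete illustration of the quoted documentation: a zero-copy send
that exceeds the optmem limit or the locked-pages ulimit fails with
ENOBUFS and can simply be retried as a normal copying send. A minimal
sketch, with an illustrative helper name (the socket is assumed to have
SO_ZEROCOPY enabled via setsockopt() beforehand):

#include <errno.h>
#include <sys/socket.h>

static ssize_t send_maybe_zerocopy(int fd, const void *buf, size_t len)
{
    ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);

    if (ret < 0 && errno == ENOBUFS) {
        /* optmem limit or RLIMIT_MEMLOCK exceeded: plain copying send */
        ret = send(fd, buf, len, 0);
    }
    return ret;
}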

[PATCH v6 0/5] block/nbd: drop connection_co

2021-09-02 Thread Vladimir Sementsov-Ogievskiy
v6:
01,02: add Eric's r-b
03: make new interface clearer
04,05: rebased on updated 03

Vladimir Sementsov-Ogievskiy (5):
  block/nbd: nbd_channel_error() shutdown channel unconditionally
  block/nbd: move nbd_recv_coroutines_wake_all() up
  block/nbd: refactor nbd_recv_coroutines_wake_all()
  block/nbd: drop connection_co
  block/nbd: check that received handle is valid

 block/nbd.c  | 412 +++
 nbd/client.c |   2 -
 2 files changed, 121 insertions(+), 293 deletions(-)

-- 
2.29.2




[PATCH v6 1/5] block/nbd: nbd_channel_error() shutdown channel unconditionally

2021-09-02 Thread Vladimir Sementsov-Ogievskiy
Don't rely on the connection being totally broken in case of -EIO. It
is safer and more correct to just shut down the channel anyway, as we
change the state and are going to reconnect.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 block/nbd.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index f6ff1c4fb4..d88f4b954c 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -129,15 +129,16 @@ static bool nbd_client_connected(BDRVNBDState *s)
 
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
+if (nbd_client_connected(s)) {
+qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+}
+
 if (ret == -EIO) {
 if (nbd_client_connected(s)) {
 s->state = s->reconnect_delay ? NBD_CLIENT_CONNECTING_WAIT :
 NBD_CLIENT_CONNECTING_NOWAIT;
 }
 } else {
-if (nbd_client_connected(s)) {
-qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
-}
 s->state = NBD_CLIENT_QUIT;
 }
 }
-- 
2.29.2




[PATCH v6 2/5] block/nbd: move nbd_recv_coroutines_wake_all() up

2021-09-02 Thread Vladimir Sementsov-Ogievskiy
We are going to use it in nbd_channel_error(), so move it up. Note
that we are also going to refactor and rename
nbd_recv_coroutines_wake_all() in the future anyway, so keeping it where
it is and adding a forward declaration doesn't make much sense.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
---
 block/nbd.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index d88f4b954c..32e3826ba2 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -127,6 +127,20 @@ static bool nbd_client_connected(BDRVNBDState *s)
 return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTED;
 }
 
+static void nbd_recv_coroutines_wake_all(BDRVNBDState *s)
+{
+int i;
+
+for (i = 0; i < MAX_NBD_REQUESTS; i++) {
+NBDClientRequest *req = &s->requests[i];
+
+if (req->coroutine && req->receiving) {
+req->receiving = false;
+aio_co_wake(req->coroutine);
+}
+}
+}
+
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
 if (nbd_client_connected(s)) {
@@ -143,20 +157,6 @@ static void nbd_channel_error(BDRVNBDState *s, int ret)
 }
 }
 
-static void nbd_recv_coroutines_wake_all(BDRVNBDState *s)
-{
-int i;
-
-for (i = 0; i < MAX_NBD_REQUESTS; i++) {
-NBDClientRequest *req = &s->requests[i];
-
-if (req->coroutine && req->receiving) {
-req->receiving = false;
-aio_co_wake(req->coroutine);
-}
-}
-}
-
 static void reconnect_delay_timer_del(BDRVNBDState *s)
 {
 if (s->reconnect_delay_timer) {
-- 
2.29.2




[PATCH v6 4/5] block/nbd: drop connection_co

2021-09-02 Thread Vladimir Sementsov-Ogievskiy
OK, that's a big rewrite of the logic.

Pre-patch we have an always-running coroutine - connection_co. It does
reply receiving and reconnecting, and it leads to a lot of difficult
and unobvious code around drained sections and context switching. We
also abuse the bs->in_flight counter, which is increased for
connection_co and temporarily decreased at points where we want to
allow a drained section to begin. One of these places is in another
file: in nbd_read_eof() in nbd/client.c.

We also cancel reconnect and requests waiting for reconnect on drained
begin, which is not correct. This patch fixes that.

Let's finally drop this always-running coroutine and go another way:
do both reconnecting and receiving in the request coroutines.

The detailed list of changes is below (in the sequence of diff hunks).

1. receiving coroutines are woken directly from nbd_channel_error(),
   when we change s->state

2. nbd_co_establish_connection_cancel(): we don't have drain_begin now,
   and in nbd_teardown_connection() all requests should already be
   finished (and reconnect is done from the request). So
   nbd_co_establish_connection_cancel() is called from
   nbd_cancel_in_flight() (to cancel the request that is doing
   nbd_co_establish_connection()) and from reconnect_delay_timer_cb()
   (previously we didn't need it, as the reconnect delay should only
   cancel active requests, not the reconnection itself. But now the
   reconnection itself is done in a separate thread (we now call
   nbd_client_connection_enable_retry() in nbd_open()), and we need to
   cancel the requests that wait in nbd_co_establish_connection()
   now).

3. We now receive headers in the request coroutine. But we should also
   dispatch replies for other pending requests. So,
   nbd_connection_entry() is turned into nbd_receive_replies(), which
   does reply dispatching until it receives another request's header,
   and returns when it receives the requested header.

4. All the old stuff around drained sections and context switching is
   dropped. In detail:
   - we don't need to move connection_co to a new aio context, as we
     don't have connection_co anymore
   - we don't have a fake "request" of connection_co (extra increasing
     of in_flight), so we don't need to care about it in
     drain_begin/end
   - we don't stop reconnection during drained sections anymore. This
     means that drain_begin may wait for a long time (up to
     reconnect_delay). But that's an improvement and more correct
     behavior, see below [*]

5. In nbd_teardown_connection() we don't have to wait for
   connection_co, as it is dropped. And the cleanup of s->ioc and
   nbd_yank is moved here from the removed connection_co.

6. In nbd_co_do_establish_connection() we now should handle
   NBD_CLIENT_CONNECTING_NOWAIT: if a new request comes when we are in
   NBD_CLIENT_CONNECTING_NOWAIT, it still should call
   nbd_co_establish_connection() (who knows, maybe the connection was
   already established by the thread in the background). But we
   shouldn't wait: if nbd_co_establish_connection() can't return a new
   channel immediately, the request should fail (we are in the
   NBD_CLIENT_CONNECTING_NOWAIT state).

7. nbd_reconnect_attempt() is simplified: it's now easier to wait for
   other requests in the caller, so here we just assert that fact.
   Also, the delay time is now initialized here: we can easily detect
   the first attempt and start a timer.

8. nbd_co_reconnect_loop() is dropped; we don't need it. Reconnect
   retries are fully handled by the thread (nbd/client-connection.c),
   the delay timer is initialized in nbd_reconnect_attempt(), and we
   don't have to bother with s->drained and friends.
   nbd_reconnect_attempt() is now called from nbd_co_send_request().

9. nbd_connection_entry() is dropped: reconnect is now handled by
   nbd_co_send_request(), receiving replies is now handled by
   nbd_receive_replies(): all handled from request coroutines.

10. So, welcome the new nbd_receive_replies(), called from the request
    coroutine, that receives the reply header instead of
    nbd_connection_entry(). Like with sending requests, only one
    coroutine may receive at a time. So we introduce receive_mutex,
    which is locked around nbd_receive_reply(). It also protects some
    related fields. Still, a full audit of thread-safety in the nbd
    driver is a separate task.
    The new function waits for a reply with the specified handle being
    received and works rather simply:

    Under the mutex:
  - if the current handle is 0, do the receive by hand. If another
    handle is received - switch to the other request coroutine,
    release the mutex and yield. Otherwise return success
  - if the current handle == the requested handle, we are done
  - otherwise, release the mutex and yield

11. In nbd_co_send_request() we now do nbd_reconnect_attempt() if
    needed. Also, waiting in the free_sema queue, we now wait for one
    of two conditions:
    - connectED, in_flight < MAX_NBD_REQUESTS (so we can start a new
      one)
    - connectING, in_flight == 0, so we can call
      nbd_reconnect_attempt()
    And this logic is protected by s->send_
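For orientation, a condensed sketch of the receive-dispatch scheme
described in items 10 and 11 - simplified from the description above,
not the literal patch code, with most error handling omitted:

static coroutine_fn int nbd_receive_replies_sketch(BDRVNBDState *s,
                                                   uint64_t handle)
{
    while (true) {
        qemu_co_mutex_lock(&s->receive_mutex);

        if (s->reply.handle == handle) {
            /* Another coroutine already received our header for us. */
            qemu_co_mutex_unlock(&s->receive_mutex);
            return 0;
        }

        if (s->reply.handle == 0) {
            /* Nobody is receiving: read the next header ourselves. */
            if (nbd_receive_reply(s->bs, s->ioc, &s->reply, NULL) < 0) {
                qemu_co_mutex_unlock(&s->receive_mutex);
                return -EIO;
            }
            if (s->reply.handle == handle) {
                qemu_co_mutex_unlock(&s->receive_mutex);
                return 0;
            }
            /* Header for another request: wake its coroutine. */
            nbd_recv_coroutine_wake_one(
                &s->requests[HANDLE_TO_INDEX(s, s->reply.handle)]);
        }

        /* Someone else's header, or someone else is receiving: sleep
         * until a receiver dispatches our header to us. */
        s->requests[HANDLE_TO_INDEX(s, handle)].receiving = true;
        qemu_co_mutex_unlock(&s->receive_mutex);
        qemu_coroutine_yield();
    }
}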

[PATCH v6 3/5] block/nbd: refactor nbd_recv_coroutines_wake_all()

2021-09-02 Thread Vladimir Sementsov-Ogievskiy
Split out nbd_recv_coroutine_wake_one(), as it will be used
separately.
Rename the function and add the possibility to wake only the first
sleeping coroutine found.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 32e3826ba2..52b0733684 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -127,16 +127,24 @@ static bool nbd_client_connected(BDRVNBDState *s)
 return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTED;
 }
 
-static void nbd_recv_coroutines_wake_all(BDRVNBDState *s)
+static bool nbd_recv_coroutine_wake_one(NBDClientRequest *req)
+{
+if (req->receiving) {
+req->receiving = false;
+aio_co_wake(req->coroutine);
+return true;
+}
+
+return false;
+}
+
+static void nbd_recv_coroutines_wake(BDRVNBDState *s, bool all)
 {
 int i;
 
 for (i = 0; i < MAX_NBD_REQUESTS; i++) {
-NBDClientRequest *req = &s->requests[i];
-
-if (req->coroutine && req->receiving) {
-req->receiving = false;
-aio_co_wake(req->coroutine);
+if (nbd_recv_coroutine_wake_one(&s->requests[i]) && !all) {
+return;
 }
 }
 }
@@ -415,7 +423,7 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState 
*s)
 
 while (s->in_flight > 0) {
 qemu_co_mutex_unlock(&s->send_mutex);
-nbd_recv_coroutines_wake_all(s);
+nbd_recv_coroutines_wake(s, true);
 s->wait_in_flight = true;
 qemu_coroutine_yield();
 s->wait_in_flight = false;
@@ -558,7 +566,7 @@ static coroutine_fn void nbd_connection_entry(void *opaque)
 }
 
 qemu_co_queue_restart_all(&s->free_sema);
-nbd_recv_coroutines_wake_all(s);
+nbd_recv_coroutines_wake(s, true);
 bdrv_dec_in_flight(s->bs);
 
 s->connection_co = NULL;
@@ -1035,7 +1043,7 @@ static coroutine_fn int nbd_co_receive_one_chunk(
 if (s->connection_co && !s->wait_in_flight) {
 /*
  * We must check s->wait_in_flight, because we may entered by
- * nbd_recv_coroutines_wake_all(), in this case we should not
+ * nbd_recv_coroutines_wake(), in this case we should not
  * wake connection_co here, it will woken by last request.
  */
 aio_co_wake(s->connection_co);
-- 
2.29.2




[PATCH v6 5/5] block/nbd: check that received handle is valid

2021-09-02 Thread Vladimir Sementsov-Ogievskiy
If we don't have an active request waiting for this handle to be
received, we should report an error.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 170a8c8eeb..306b2de9f2 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -58,6 +58,7 @@ typedef struct {
 Coroutine *coroutine;
 uint64_t offset;/* original offset of the request */
 bool receiving; /* sleeping in the yield in nbd_receive_replies */
+bool reply_possible;/* reply header not yet received */
 } NBDClientRequest;
 
 typedef enum NBDClientState {
@@ -415,16 +416,11 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState 
*s, uint64_t handle)
 return 0;
 }
 ind2 = HANDLE_TO_INDEX(s, s->reply.handle);
-if (ind2 >= MAX_NBD_REQUESTS || !s->requests[ind2].coroutine) {
-/*
- * We only check that ind2 request exists. But don't check is it 
now
- * waiting for the reply header or not. We can't just check
- * s->requests[ind2].receiving: ind2 request may wait in trying to
- * lock receive_mutex. So that's a TODO.
- */
+if (ind2 >= MAX_NBD_REQUESTS || !s->requests[ind2].reply_possible) {
 nbd_channel_error(s, -EINVAL);
 return -EINVAL;
 }
+s->requests[ind2].reply_possible = false;
 nbd_recv_coroutine_wake_one(&s->requests[ind2]);
 }
 }
@@ -467,6 +463,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
 s->requests[i].coroutine = qemu_coroutine_self();
 s->requests[i].offset = request->from;
 s->requests[i].receiving = false;
+s->requests[i].reply_possible = true;
 
 request->handle = INDEX_TO_HANDLE(s, i);
 
-- 
2.29.2




Re: [PULL 00/56] Block patches

2021-09-02 Thread Peter Maydell
On Wed, 1 Sept 2021 at 16:16, Hanna Reitz  wrote:
>
> The following changes since commit ec397e90d21269037280633b6058d1f280e27667:
>
>   Merge remote-tracking branch 
> 'remotes/alistair/tags/pull-riscv-to-apply-20210901-2' into staging 
> (2021-09-01 08:33:02 +0100)
>
> are available in the Git repository at:
>
>   https://github.com/XanClic/qemu.git tags/pull-block-2021-09-01
>
> for you to fetch changes up to ebd979c74e2b8a7275090475df36dde4ab858320:
>
>   block/file-win32: add reopen handlers (2021-09-01 14:38:08 +0200)
>
>
> **Note:** I’ve signed the pull request tag with my new GPG key, which I
> have uploaded here:
>
>   https://xanclic.moe/A1FA40D098019CDF

Hi. Unfortunately my employer's internet setup blocks access to
that site :-(

> (I’ve also tried uploading this key to several keyservers, but it
> appears to me that keyservers are kind of a thing of the past now,
> especially when it comes to uploading keys with signatures on them...)

IME keyserver.ubuntu.com is still functional.

thanks
-- PMM



Re: [PULL 00/56] Block patches

2021-09-02 Thread Hanna Reitz

On 02.09.21 13:07, Peter Maydell wrote:

On Wed, 1 Sept 2021 at 16:16, Hanna Reitz  wrote:

The following changes since commit ec397e90d21269037280633b6058d1f280e27667:

   Merge remote-tracking branch 
'remotes/alistair/tags/pull-riscv-to-apply-20210901-2' into staging (2021-09-01 
08:33:02 +0100)

are available in the Git repository at:

   https://github.com/XanClic/qemu.git tags/pull-block-2021-09-01

for you to fetch changes up to ebd979c74e2b8a7275090475df36dde4ab858320:

   block/file-win32: add reopen handlers (2021-09-01 14:38:08 +0200)


**Note:** I’ve signed the pull request tag with my new GPG key, which I
have uploaded here:

   https://xanclic.moe/A1FA40D098019CDF

Hi. Unfortunately my employer's internet setup blocks access to
that site :-(


That’s too bad. :(

(Perhaps it works by IP? http://159.69.123.47:8080/A1FA40D098019CDF)

But I think it’s small enough that I can just attach it to this mail 
(which I’ve done).



(I’ve also tried uploading this key to several keyservers, but it
appears to me that keyservers are kind of a thing of the past now,
especially when it comes to uploading keys with signatures on them...)

IME keyserver.ubuntu.com is still functional.


Yes, I’ve uploaded it there, too:

https://keyserver.ubuntu.com/pks/lookup?search=hreitz%40redhat.com&fingerprint=on&op=index

And it looks like it has indeed captured my signature.

Hanna
-BEGIN PGP PUBLIC KEY BLOCK-

mQINBGESUhkBEAC52kwlLdL8hxpVL4RvPcevQKczKaPfouKVB2aqkBW1WuF/8aqv
lwGGnlZlh3PeidIC0CMA3qrGrazmqbHHxUosikywDzB+Bret2/RdXuX58NEKClBR
7dGgu1s0Lb7rYZEjxJYlaXxKg7Jxcq2uow21sbOhldOTdS64afM7MMbaHiHe544x
nDJp6XBqzB+4Tm0vX+cPU8LnczmwZmB0fRZrYCBYtsY7xKQglLYz7v6wenP8b4BE
J/Ckctg2A/MhQv1PPoXKJoZs0OASMxy7naO4pT6cCOADzZxUYVlFFGq/TzrV7diE
RNSCyfCmP3tSWD4P3fRcFiFXdJ5v4KXTE9YtBfayx5ioapV6I64iBKSQM/6czgJI
dOWfLpYosLOO/sWHtkBBnLJ/to1NtTOqJIkeS5+xwGtAnDtZ/9uUu8XpfSmIpsyn
4HaNCZ4DZxcdyrslqZsEWLAv8344vm8wEJlpMroySI5t4tc3V7QflClRpOXVDdib
V/gxQ3DpSi6WgqVP4bNMGjyXYEqbeK6AGwGkGOoyvm65gC+rl8ANlRYtXdAXAm1q
Qdy+TEo6Sp7WOYo2Zd0bFD+0jWVEGbU3K59bkULBWeorQtRjEP9uz0Vjxn3MoqRC
1U6WsfL8W4LDPNPrwfM8x+HBhxk4G98RdyJeFZq3j5Q5ELsUjGtLEh+qSwARAQAB
tB9IYW5uYSBSZWl0eiA8aHJlaXR6QHJlZGhhdC5jb20+iQJUBBMBCAA+FiEEy2LX
oO44KeRfAE00ofpA0JgBnN8FAmESUhkCGwMFCRLMAwAFCwkIBwIGFQoJCAsCBBYC
AwECHgECF4AACgkQofpA0JgBnN+ZDQ/9GlMbziLjqWMcpkuILhEI6Je8kv9TVMsr
134fTdXQaqveJebUF6JgurpeZbm9fDpnCkk4AA59nfR7GM/AOUk0J43xhiqgnSdH
R2zLaDwvwT5CiBUAx2d2AAlemLw7gHiLPdqg2tg+YcH/LLDp8gqEQ7VmTXQh0lZ7
Re/0kvi4w/WpKsdGsZ/OCabCb57Jn8JI9+CK72QjmCCVOjGmM8RMRSMU5plHX6J0
awk5ALuazMuIB/QxCX8cdYzyBXhkjQjYtN9xv5Q7SCSshotWJkQekuhSxJozzRmC
cj22j6UiviyWZ2S4BLMkG4/7ny7SyW3kP7nzzBnR2O98O2Xj6tJdwH1UmA9c23E3
iseWQF/YblHVhK1e2fjZ55cA11c5AJx7y4KYE+lD3yZJLRvjVdkMe1DHWGg3usFY
nnhxuEBSyH8FjJ1H3YCOzdBeX2ijhgolEyjLcR6IO57f0BTgKADNjYyXM6VV2ejK
CUHbzABaGSf99ecG61efFe6xPTCMAcpNjSYPS4i3pJ+wfTwgdJ/i8RRHek81hRTW
d8zj/rbbX34qlSb3syuzLSvIz0bIMVEmNUKWsk51Cl+ea7OpVuXQOl0RaWwar/GM
6L9RwTKB7pudDiXVJvEDYM3Rl6AEXpPxZWU5yCp+pdtJKQW6mlJMLhZnga+1Pl65
8m16f56jMQuJATMEEwEIAB0WIQSRvrYKMNs+iFfRGCn0B9sAYdXPQAUCYRJhGwAK
CRD0B9sAYdXPQO4SB/9wMXXNklyWttcrHtdY3WXIXQTbnzsuqt3F2McJdLDeaq81
nb3bKRNHr67G9htmEYaBNAwRIxX3FQrK9INDY2xYixuQboMGA4DIToa87698rgJH
K+OGKplnXqv3QPUcD0HCF4HtL+q+cjaZ/ALbTu5C4D9gohvOQY7baTMm5Z8/zFlt
MwrdbehM3Ki6lrEJQODntgNqSxHbJuhS7t0bwu9dEK1AWG/JRPlV83SANYgj3Pkx
DlPmX6kxy/JrfGhMSITH6xuYFufLHVITLizSqqZg8ct10Yc7/00tPYD8ksZ1TApq
Spd3mEaV1puBMAeK/6VHHgjs7Dpj+jhgDd2TXXOguQINBGESUhkBEAD/U9IK55Ay
DTpLxkVj7Z/AEa9KVBEaJ91mOGhZcUzmR5lpq1XKqvBkqQeb8BZWgusz93h57FpB
tBDOar8zNUzFuLOcbbCqaDRP/wnGnD84nQuvV+nzV0u+o1pj/Kt7Jot2nAjm9kAd
mxyBB8t4B9JQ7kvQGmL5MaLEDDE1pTwMDAyU8I2WExZ4ltEvyBccrDEdiRIbT41M
n/rFBS9bccIzMZ2T05RvyiOMwQw5LQPPQgQWoWnPlmnRYkTvieuD1MNAnbmEcBoY
Slz6hbiaXDje8RKiy+7CyppgSpg7ot2yERoONfCVJrCvh2+1JM/9u1XxSnl6zxAq
KOoROpG8g/RedGdA7Oum0ZIybg1cZJZcHtyxle7q6EbP2p4Br66FVeicrteQ54nP
GOOhfIsIvti3hrcifm8MWCAxhssy88ZwKhXWjdQXh5Kra6q1gfVIA3UduouuJSRT
XcyaHKxIDdDv1NblRWSyM9uhnTWym3w3qwi7XmrydDot34x54sKMSBH1hoINkX2n
D89aqGNMCXr2M7pORJHA6puqhTVwBIH+dGi+06+XIvkXs1Kx+uMB8I3L81cdIyYp
2+XCW2Rjr0qOCpFKDbwK5JpJ3PlioekxyF888DRPsGI8zmUbdEb+Cx76TgqplYAM
L0j/Ein7RCt2rGCsndGLDJ31fgw78RqzAwARAQABiQI8BBgBCAAmFiEEy2LXoO44
KeRfAE00ofpA0JgBnN8FAmESUhkCGwwFCRLMAwAACgkQofpA0JgBnN/qaRAAonwz
w8dSuM0mTSSqoI3qSUjhupAmEiNJKKBwnBAFbaIZIzF4e9QEjlLaYiO3WyhVCw4Y
dx03+dbdj3D9H435Rw4pw8xhhpECawbLmvxDfXBTwiGcTAXmA/V35hkcOre0u37i
2n3/k1Ei27QyZnJ1j6kifDlwS0xvVuHXz2MC+L2lpCX6yVpdpWt6bMyNUchZ2WVv
th0E36SxeL7p5QxySXFNxWw4RxhdrXqR8f4MP8lInd5FubWL5vO75VOdW6W49qeW
vc5qk+eKo7244Iz8tZRB/VlgYjH0oIPfDdbLu1hhWtAMv/zdjvRK3pxk2lR8mxCG
N6bZ5nvfrNw6XPCZqjyDybzi8wr8cmOQB9mU8NFR0itkCqwp+3ILQzXrr9pOj3JW
KwKDKq+2+Be8H1EmoQuo9iHoCB6xqBmNM3t3R/1/IQIjAJEdh95S3ydCavlqnReB
Cw175nIvKp5bYFP+VSowinDhX140A7MHt1SguuzfuhQK4iJJIFwXJudmwWCFT5lY
I0koXRazkziP/5yEnlqHM6UmhrafvzW12sMNzIRBVoA5Ui7R4bgGDBT6W6AahUEG
NY03132p32TUfwcrt+czPFuMe3AZ2XlPLSB5KxIAal86XJiTwsN3knXyzkdAkU6z
VqSqp55kfm42yTjyrA1QBBouzs44B7O7puouzlU=
=iYO9
-END PGP PUBLIC KEY BLOCK-


Re: [PULL 00/56] Block patches

2021-09-02 Thread Peter Maydell
On Wed, 1 Sept 2021 at 16:16, Hanna Reitz  wrote:
>
> The following changes since commit ec397e90d21269037280633b6058d1f280e27667:
>
>   Merge remote-tracking branch 
> 'remotes/alistair/tags/pull-riscv-to-apply-20210901-2' into staging 
> (2021-09-01 08:33:02 +0100)
>
> are available in the Git repository at:
>
>   https://github.com/XanClic/qemu.git tags/pull-block-2021-09-01
>
> for you to fetch changes up to ebd979c74e2b8a7275090475df36dde4ab858320:
>
>   block/file-win32: add reopen handlers (2021-09-01 14:38:08 +0200)
>
>
> **Note:** I’ve signed the pull request tag with my new GPG key, which I
> have uploaded here:
>
>   https://xanclic.moe/A1FA40D098019CDF
>
> Included in that key should be the signature I made with my old key
> (F407DB0061D5CF40), and I hope that’s sufficient to establish a
> reasonable level of trust.
>
> (I’ve also tried uploading this key to several keyservers, but it
> appears to me that keyservers are kind of a thing of the past now,
> especially when it comes to uploading keys with signatures on them...)
>
> 
> Block patches:
> - Make the backup-top filter driver available for user-created block
>   nodes (i.e. via blockdev-add)
> - Allow running iotests with gdb or valgrind being attached to qemu
>   instances
> - Fix the raw format driver's permissions: There is no metadata, so we
>   only need WRITE or RESIZE when the parent needs it
> - Basic reopen implementation for win32 files (file-win32.c) so that
>   qemu-img commit can work
> - uclibc/musl build fix for the FUSE export code
> - Some iotests delinting
> - block-hmp-cmds.c refactoring
>


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/6.2
for any user-visible changes.

-- PMM



Re: [PATCH V2] block/rbd: implement bdrv_co_block_status

2021-09-02 Thread Peter Lieven

On 24.08.21 at 22:39, Ilya Dryomov wrote:

On Mon, Aug 23, 2021 at 11:38 AM Peter Lieven  wrote:

On 22.08.21 at 23:02, Ilya Dryomov wrote:

On Tue, Aug 10, 2021 at 3:41 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff features which depends on the object-map

Hi Peter,

Nit: "has the fast-diff feature which depends on the object-map and
exclusive-lock features"


will reword in V3.



and exclusive-lock. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

   block/rbd.c | 125 
   1 file changed, 125 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..8692e76f40 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -88,6 +88,7 @@ typedef struct BDRVRBDState {
   char *namespace;
   uint64_t image_size;
   uint64_t object_size;
+uint64_t features;
   } BDRVRBDState;

   typedef struct RBDTask {
@@ -983,6 +984,13 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
   s->image_size = info.size;
   s->object_size = info.obj_size;

+r = rbd_get_features(s->image, &s->features);
+if (r < 0) {
+error_setg_errno(errp, -r, "error getting image features from %s",
+ s->image_name);
+goto failed_post_open;
+}

The object-map and fast-diff features can be enabled/disabled while the
image is open so this should probably go to qemu_rbd_co_block_status().


+
   /* If we are using an rbd snapshot, we must be r/o, otherwise
* leave as-is */
   if (s->snap != NULL) {
@@ -1259,6 +1267,122 @@ static ImageInfoSpecific 
*qemu_rbd_get_specific_info(BlockDriverState *bs,
   return spec_info;
   }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;
+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we catched all cases above and allocation state has not

catched -> caught


+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t 
offset,
+ int64_t bytes, int64_t *pnum,
+ 
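For illustration, the block-status entry point might drive
rbd_diff_iterate2() with the callback shown above roughly as follows.
This is only a sketch, not the actual v3 patch; it also moves the
feature check into the per-call path, as Ilya suggests above:

static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
                                                 bool want_zero,
                                                 int64_t offset,
                                                 int64_t bytes,
                                                 int64_t *pnum, int64_t *map,
                                                 BlockDriverState **file)
{
    BDRVRBDState *s = bs->opaque;
    rbd_diff_req req = { .offs = offset, .bytes = 0, .exists = 0 };
    uint64_t features;
    int r, status;

    /* fast-diff can be toggled while the image is open: check per call */
    r = rbd_get_features(s->image, &features);
    if (r < 0 || !(features & RBD_FEATURE_FAST_DIFF)) {
        /* No cheap answer available: report everything as allocated. */
        req.exists = 1;
        req.bytes = bytes;
    } else {
        r = rbd_diff_iterate2(s->image, NULL, offset, bytes,
                              true /* include_parent */,
                              true /* whole_object */,
                              qemu_rbd_co_block_status_cb, &req);
        if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) {
            return r;
        }
        if (req.bytes == 0) {
            /* The callback never fired: the whole range is unallocated. */
            req.bytes = bytes;
        }
    }

    *pnum = req.bytes;
    status = req.exists ? BDRV_BLOCK_DATA : BDRV_BLOCK_ZERO;
    if (file) {
        *file = bs;
    }
    if (map) {
        *map = offset;
    }
    return status | BDRV_BLOCK_OFFSET_VALID;
}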

Re: [PATCH] block/vvfat: Fix ro shared folder

2021-09-02 Thread Guillaume Roche
Hi Philippe,

On Tue, Aug 31, 2021 at 18:06, Philippe Mathieu-Daudé
 wrote:
>
> Hi Guillaume,
>
> On 8/31/21 4:17 PM, Guillaume Roche wrote:
> > QEMU exits in error when passing a vfat shared folder in read-only mode.
> >
> > To fix this issue, this patch removes any potential write permission
> > from cumulative_perms, when a read-only block device is in use.
> >
> > Buglink: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=918950
> >
> > Signed-off-by: Guillaume Roche 
> > ---
> > This is an attempt to fix this behavior, but it feels a bit hacky to me
> > since this patch checks for the vvfat format in a generic function.
>
> What about implementing bdrv_vvfat::bdrv_check_perm()?

Thanks for this feedback. I had a look at your suggestion, but I'm a
bit confused.

As I understand it, bdrv_node_refresh_perm() calls
bdrv_get_cumulative_perm() (which I patched to remove the write
permission) then it checks the permissions.
Afterwards, it calls bdrv_drv_set_perm(), which in turn calls
bs->drv->bdrv_check_perm(). So even if I implement
bdrv_vvfat::bdrv_check_perm(), I will get the permission error before
it is called.

Could you elaborate a bit on what you have in mind please?

Regards,
Guillaume

>
> > However, I'd be glad to have some advice to make it better. Anyway, I
> > ran the block tests to ensure this does not introduce any regression.
> >
> > To add some context: I know that this can be worked around by setting
> > the shared folder in rw mode. But our use case requires using both
> > shared folders and VM snapshots, and QEMU prevents using snapshots
> > with a rw shared folder.
> >
> >  block.c | 6 ++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/block.c b/block.c
> > index e97ce0b1c8..3f59e3843f 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -2383,6 +2383,12 @@ void bdrv_get_cumulative_perm(BlockDriverState *bs, 
> > uint64_t *perm,
> >  cumulative_shared_perms &= c->shared_perm;
> >  }
> >
> > +/* Discard write permission if vvfat block device is read-only */
> > +const char *format = bdrv_get_format_name(bs);
> > +if (format != NULL && strncmp(format, "vvfat", 5) == 0 && 
> > bdrv_is_read_only(bs)) {
> > +cumulative_perms &= ~BLK_PERM_WRITE;
> > +}
> > +
> >  *perm = cumulative_perms;
> >  *shared_perm = cumulative_shared_perms;
> >  }
> >
>
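For concreteness, the callback Philippe refers to would look roughly
like this - an untested sketch, assuming the BlockDriver
.bdrv_check_perm signature from block_int.h of that era, and subject to
the ordering question Guillaume raises above:

static int vvfat_check_perm(BlockDriverState *bs, uint64_t perm,
                            uint64_t shared, Error **errp)
{
    if ((perm & BLK_PERM_WRITE) && bdrv_is_read_only(bs)) {
        error_setg(errp, "Cannot grant write permission on a read-only "
                   "vvfat device");
        return -EPERM;
    }
    return 0;
}

/* wired up in the driver definition: */
static BlockDriver bdrv_vvfat = {
    /* ... existing callbacks ... */
    .bdrv_check_perm = vvfat_check_perm,
};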