Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-08 Thread Klaus Jensen

On Jul  9 08:55, Klaus Jensen wrote:

On Jul  9 08:16, Hannes Reinecke wrote:

On 7/9/21 8:05 AM, Klaus Jensen wrote:

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and forth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to the controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.
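
For illustration, such a setting could look roughly like this (a
sketch only; the nqn, subsys and shared parameters are assumed from
the current qdev model and the exact spelling may differ):

  -device nvme-subsys,id=subsys0,nqn=subsys0 \
  -drive id=nvm0,if=none,file=null-co://,format=raw \
  -device nvme,id=nvme0,serial=foo,subsys=subsys0 \
  -device nvme-ns,drive=nvm0,shared=on

With the controller linked to the subsystem, the controller can then
be removed and re-added from the monitor without losing the namespace:

  (qemu) device_del nvme0
  (qemu) device_add nvme,id=nvme0,serial=foo,subsys=subsys0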

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
re-added, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


So, how do we go about with patch 4?
Without it this whole exercise is a bit pointless, seeing that it 
doesn't fix anything.




Patches 1-3 are fixes we need anyway, so I thought I might as well apply 
them :)



Shall we go with that patch as an interim solution?
Will you replace it with your 'object' patch?
What is the plan?



Yes, if acceptable, I would like to use patch 4 as an interim 
solution. We have a bug we need to fix for 6.1, and I believe this does 
the job.


I considered changing the existing nvme-bus to be on the main system 
bus, but then we break the existing behavior that the namespaces 
attach to the most recently defined controller in the absence of the 
shared parameter or an explicit bus parameter.


Wrt. "the plan", right now, I see two solutions going forward:

1. Introduce new -object's for nvme-nvm-subsystem and nvme-ns
   This is the approach that I am taking right now and it works well. It
   allows many-to-many relationships and separates the lifetimes of
   subsystems, namespaces and controllers like you mentioned.

   Conceptually, I also really like that the subsystem and namespace are
   not "devices". One could argue that the namespace is comparable to a
   SCSI LUN (-device scsi-hd, right?), but where the SCSI LUN actually
   "shows up" in the host, the nvme namespace does not.

   My series handles backwards compatibility by keeping -device "shims"
   around that just wrap the new objects but behave like they used to.
   The plan would be to deprecate these devices.

   The downside to this approach is that it moves the subsystem and
   namespaces out of the "qdev tree (info qtree)" and into the pure QOM
   "/objects" tree. Instead of qtree, we can have QMP and HMP commands
   for introspection.

2. Make the subsystem a "system bus device"
   This way we add an "nvme-nvm-subsystems" bus as a direct child of the
   main system bus, and we can possibly get rid of the explicit -device
   nvme-subsys as well. We change the namespace device to plug into that
   instead. The nvme controller device still needs to plug into the PCI
   bus, so it cannot be a child of the subsystems bus, but can keep
   using a link parameter to hook into the subsystem and attach to any
   namespaces it would like.

   I'm unsure if we can do this without deprecating the existing
   namespace device, just like option 1.

   I have not implemented this, so I need to look more into it. It seems
   like the main thing that this gives us compared to 1) is `info qtree`
   support and we still end up just "wiring" namespace attachment with
   backlinks anyway.

Re: [PATCH] tests/qtest/nvme-test: add persistent memory region test

2021-07-08 Thread Klaus Jensen

On Jun 18 16:04, Gollu Appalanaidu wrote:

This will test the PMR functionality.

Signed-off-by: Gollu Appalanaidu 
---
tests/qtest/nvme-test.c | 78 -
1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/nvme-test.c b/tests/qtest/nvme-test.c
index d32c953a38..6d557be6ca 100644
--- a/tests/qtest/nvme-test.c
+++ b/tests/qtest/nvme-test.c
@@ -13,6 +13,7 @@
#include "libqos/libqtest.h"
#include "libqos/qgraph.h"
#include "libqos/pci.h"
+#include "include/block/nvme.h"

typedef struct QNvme QNvme;

@@ -21,6 +22,9 @@ struct QNvme {
QPCIDevice dev;
};

+static char *t_path;
+#define TEST_IMAGE_SIZE (2 * 1024 * 1024)
+
static void *nvme_get_driver(void *obj, const char *interface)
{
QNvme *nvme = obj;
@@ -66,12 +70,77 @@ static void nvmetest_oob_cmb_test(void *obj, void *data, 
QGuestAllocator *alloc)
g_assert_cmpint(qpci_io_readl(pdev, bar, cmb_bar_size - 1), !=, 0x44332211);
}

+static void nvmetest_pmr_reg_test(void *obj, void *data, QGuestAllocator 
*alloc)
+{
+QNvme *nvme = obj;
+QPCIDevice *pdev = &nvme->dev;
+QPCIBar pmr_bar, nvme_bar;
+uint32_t pmrcap, pmrsts;
+
+qpci_device_enable(pdev);
+pmr_bar = qpci_iomap(pdev, 4, NULL);
+
+/* Without Enabling PMRCTL check bar enablemet */
+qpci_io_writel(pdev, pmr_bar, 0, 0xccbbaa99);
+g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), !=, 0x99);
+g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), !=, 0xaa99);
+
+/* Map NVMe Bar Register to Enable the Mem Region */
+nvme_bar = qpci_iomap(pdev, 0, NULL);
+
+pmrcap = qpci_io_readl(pdev, nvme_bar, 0xe00);
+g_assert_cmpint(NVME_PMRCAP_RDS(pmrcap), ==, 0x1);
+g_assert_cmpint(NVME_PMRCAP_WDS(pmrcap), ==, 0x1);
+g_assert_cmpint(NVME_PMRCAP_BIR(pmrcap), ==, 0x4);
+g_assert_cmpint(NVME_PMRCAP_PMRWBM(pmrcap), ==, 0x2);
+g_assert_cmpint(NVME_PMRCAP_CMSS(pmrcap), ==, 0x1);
+
+/* Enable PMRCTRL */
+qpci_io_writel(pdev, nvme_bar, 0xe04, 0x1);
+
+qpci_io_writel(pdev, pmr_bar, 0, 0x44332211);
+g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), ==, 0x11);
+g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), ==, 0x2211);
+g_assert_cmpint(qpci_io_readl(pdev, pmr_bar, 0), ==, 0x44332211);
+
+pmrsts = qpci_io_readl(pdev, nvme_bar, 0xe08);
+g_assert_cmpint(NVME_PMRSTS_NRDY(pmrsts), ==, 0x0);
+
+/* Disable PMRCTRL */
+qpci_io_writel(pdev, nvme_bar, 0xe04, 0x0);
+
+qpci_io_writel(pdev, pmr_bar, 0, 0x88776655);
+g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), !=, 0x55);
+g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), !=, 0x6655);
+g_assert_cmpint(qpci_io_readl(pdev, pmr_bar, 0), !=, 0x88776655);
+
+pmrsts = qpci_io_readl(pdev, nvme_bar, 0xe08);
+g_assert_cmpint(NVME_PMRSTS_NRDY(pmrsts), ==, 0x1);
+
+qpci_iounmap(pdev, nvme_bar);
+qpci_iounmap(pdev, pmr_bar);
+}
+
static void nvme_register_nodes(void)
{
+int fd, ret;
+t_path = g_strdup("/tmp/qtest.XX");
+
+/* Create a temporary raw image*/
+fd = mkstemp(t_path);
+g_assert(fd >= 0);
+ret = ftruncate(fd, TEST_IMAGE_SIZE);
+g_assert(ret == 0);
+close(fd);
+
+char *pmr_cmd_line = g_strdup_printf("-object memory-backend-file,id=pmr0,"
+ "share=on,mem-path=%s,size=8", 
t_path);
+
QOSGraphEdgeOptions opts = {
.extra_device_opts = "addr=04.0,drive=drv0,serial=foo",
.before_cmd_line = "-drive id=drv0,if=none,file=null-co://,"
-   "file.read-zeroes=on,format=raw",
+   "file.read-zeroes=on,format=raw ",
+   pmr_cmd_line,
};

add_qpci_address(&opts, &(QPCIAddress) { .devfn = QPCI_DEVFN(4, 0) });
@@ -83,6 +152,13 @@ static void nvme_register_nodes(void)
qos_add_test("oob-cmb-access", "nvme", nvmetest_oob_cmb_test, 
&(QOSGraphTestOptions) {
.edge.extra_device_opts = "cmb_size_mb=2"
});
+
+qos_add_test("pmr-test-access", "nvme", nvmetest_pmr_reg_test,
+ &(QOSGraphTestOptions) {
+.edge.extra_device_opts = "pmrdev=pmr0"
+});
+
+unlink(t_path);
}

libqos_init(nvme_register_nodes);
--
2.17.1



Applied to nvme-next. I swapped the memory-backend-file with a 
memory-backend-ram so we don't need to set up an actual file.
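
Roughly, the backend for the test then becomes something like this (a
sketch; the exact id and size used in nvme-next may differ):

  -object memory-backend-ram,id=pmr0,share=on,size=1M \
  -device nvme,addr=04.0,drive=drv0,serial=foo,pmrdev=pmr0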


signature.asc
Description: PGP signature


Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-08 Thread Klaus Jensen

On Jul  9 08:16, Hannes Reinecke wrote:

On 7/9/21 8:05 AM, Klaus Jensen wrote:

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and forth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to the controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
re-added, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


So, how do we go about with patch 4?
Without it this whole exercise is a bit pointless, seeing that it 
doesn't fix anything.




Patches 1-3 are fixes we need anyway, so I thought I might as well apply 
them :)



Shall we go with that patch as an interim solution?
Will you replace it with your 'object' patch?
What is the plan?



Yes, if acceptable, I would like to use patch 4 as an interim solution. 
We have a bug we need to fix for 6.1, and I believe this does the job.


I considered changing the existing nvme-bus to be on the main system 
bus, but then we break the existing behavior that the namespaces attach 
to the most recently defined controller in the absence of the shared 
parameter or an explicit bus parameter.


Wrt. "the plan", right now, I see two solutions going forward:

1. Introduce new -object's for nvme-nvm-subsystem and nvme-ns
   This is the approach that I am taking right now and it works well. It 
   allows many-to-many relationships and separates the lifetimes of 
   subsystems, namespaces and controllers like you mentioned.


   Conceptually, I also really like that the subsystem and namespace are 
   not "devices". One could argue that the namespace is comparable to a 
   SCSI LUN (-device scsi-hd, right?), but where the SCSI LUN actually 
   "shows up" in the host, the nvme namespace does not.


   My series handles backwards compatibility by keeping -device "shims" 
   around that just wrap the new objects but behave like they used to. 
   The plan would be to deprecate these devices.


   The downside to this approach is that it moves the subsystem and 
   namespaces out of the "qdev tree (info qtree)" and into the pure QOM 
   "/objects" tree. Instead of qtree, we can have QMP and HMP commands 
   for introspection.


2. Make the subsystem a "system bus device"
   This way we add an "nvme-nvm-subsystems" bus as a direct child of the 
   main system bus, and we can possibly get rid of the explicit -device 
   nvme-subsys as well. We change the namespace device to plug into that 
   instead. The nvme controller device still needs to plug into the PCI 
   bus, so it cannot be a child of the subsystems bus, but can keep 
   using a link parameter to hook into the subsystem and attach to any 
   namespaces it would like.


   I'm unsure if we can do this without deprecating the existing 
   namespace device, just like option 1.


   I have not implemented this, so I need to look more into it. It seems 
   like the main thing that this gives us compared to 1) is `info qtree` 
   support and we still end up just "wiring" namespace attachment with 
   backlinks anyway.
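
   To make option 1 a bit more concrete, a configuration could
   conceptually end up looking something like this (purely hypothetical
   syntax; the object names and properties below are made up for
   illustration and are not necessarily what the RFC uses):

     -object nvme-nvm-subsystem,id=subsys0,nqn=subsys0
     -object nvme-ns,id=ns1,drive=nvm0,subsys=subsys0
     -device nvme,serial=foo,subsys=subsys0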


I'm not sure what I wou

Re: [PATCH 2/2] qemu-img: Add --skip-broken for 'convert --bitmaps'

2021-07-08 Thread Vladimir Sementsov-Ogievskiy

08.07.2021 04:30, Eric Blake wrote:

The point of 'qemu-img convert --bitmaps' is to be a convenience for
actions that are already possible through a string of smaller
'qemu-img bitmap' sub-commands.  One situation not accounted for
already is that if a source image contains an inconsistent bitmap (for
example, because a qemu process died abruptly before flushing bitmap
state), the user MUST delete those inconsistent bitmaps before
anything else useful can be done with the image.

We don't want to delete inconsistent bitmaps by default: although a
corrupt bitmap is only a loss of optimization rather than a corruption
of user-visible data, it is still nice to require the user to opt in
to the fact that they are aware of the loss of the bitmap.  Still,
requiring the user to check 'qemu-img info' to see whether bitmaps are
consistent, then use 'qemu-img bitmap --remove' to remove offenders,
all before using 'qemu-img convert', is a lot more work than just
adding a knob 'qemu-img convert --bitmaps --skip-broken' which opts in
to skipping the broken bitmaps.
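
As a sketch (image names are illustrative), the manual workflow versus
the new knob:

  # before: find and delete inconsistent bitmaps by hand
  qemu-img info src.qcow2
  qemu-img bitmap --remove src.qcow2 bitmap0
  qemu-img convert --bitmaps -O qcow2 src.qcow2 dst.qcow2

  # after: skip inconsistent bitmaps in one step
  qemu-img convert --bitmaps --skip-broken -O qcow2 src.qcow2 dst.qcow2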

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1946084
Signed-off-by: Eric Blake 
---
  docs/tools/qemu-img.rst   |  8 +++-
  qemu-img.c| 20 +--
  tests/qemu-iotests/tests/qemu-img-bitmaps |  4 
  tests/qemu-iotests/tests/qemu-img-bitmaps.out | 14 +
  4 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index 1d8470eada0e..5cf1c764597b 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -414,7 +414,7 @@ Command description:
4
  Error on reading data

-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [--bitmaps] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l 
SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m NUM_COROUTINES] [-W] 
FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [--bitmaps [--skip-broken]] [-U] [-C] [-c] [-p] [-q] [-n] 
[-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o 
OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m 
NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME


Of course, [--bitmaps [--skip-broken]] looks as though --skip-broken is a 
suboption, but actually it is not. So, shouldn't it be named more explicitly, 
like --skip-broken-bitmaps? That way we can be sure we will not interfere in future 
with some other broken things we want to skip, and we avoid strange but correct 
command lines like

qemu-img convert --skip-broken  --bitmaps  src dst




Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
@@ -456,6 +456,12 @@ Command description:
*NUM_COROUTINES* specifies how many coroutines work in parallel during
the convert process (defaults to 8).

+  Use of ``--bitmaps`` requests that any persistent bitmaps present in
+  the original are also copied to the destination.  If any bitmap is
+  inconsistent in the source, the conversion will fail unless
+  ``--skip-broken`` is also specified to copy only the consistent
+  bitmaps.
+
  .. option:: create [--object OBJECTDEF] [-q] [-f FMT] [-b BACKING_FILE] [-F 
BACKING_FMT] [-u] [-o OPTIONS] FILENAME [SIZE]

Create the new disk image *FILENAME* of size *SIZE* and format
diff --git a/qemu-img.c b/qemu-img.c
index 68a4d298098f..e8b012f39c0c 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -82,6 +82,7 @@ enum {
  OPTION_MERGE = 274,
  OPTION_BITMAPS = 275,
  OPTION_FORCE = 276,
+OPTION_SKIP_BROKEN = 277,
  };

  typedef enum OutputFormat {
@@ -2101,7 +2102,8 @@ static int convert_do_copy(ImgConvertState *s)
  return s->ret;
  }

-static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst)
+static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst,
+bool skip_broken)
  {
  BdrvDirtyBitmap *bm;
  Error *err = NULL;
@@ -2113,6 +2115,10 @@ static int convert_copy_bitmaps(BlockDriverState *src, 
BlockDriverState *dst)
  continue;
  }
  name = bdrv_dirty_bitmap_name(bm);
+if (skip_broken && bdrv_dirty_bitmap_inconsistent(bm)) {
+warn_report("Skipping inconsistent bitmap %s", name);
+continue;
+}
  qmp_block_dirty_bitmap_add(dst->node_name, name,
 true, bdrv_dirty_bitmap_granularity(bm),
 true, true,
@@ -2167,6 +2173,7 @@ static int img_convert(int argc, char **argv)
  bool force_share = false;
  bool explict_min_sparse = false;
  bool bitmaps = false;
+bool skip_broken = false;
  int64_t rate_limit = 0;

   

Re: [PATCH 1/2] iotests: Improve and rename test 291 to qemu-img-bitmap

2021-07-08 Thread Vladimir Sementsov-Ogievskiy

08.07.2021 04:30, Eric Blake wrote:

Enhance the test to demonstrate behavior of qemu-img with a qcow2
image containing an inconsistent bitmap, and rename it now that we
support useful iotest names.

While at it, fix a missing newline in the error message thus exposed.

Signed-off-by: Eric Blake 
---
  block/dirty-bitmap.c  |  2 +-
  .../{291 => tests/qemu-img-bitmaps}   | 13 +++-
  .../{291.out => tests/qemu-img-bitmaps.out}   | 32 ++-
  3 files changed, 44 insertions(+), 3 deletions(-)
  rename tests/qemu-iotests/{291 => tests/qemu-img-bitmaps} (92%)
  rename tests/qemu-iotests/{291.out => tests/qemu-img-bitmaps.out} (82%)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 68d295d6e3ed..0ef46163e3ea 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -193,7 +193,7 @@ int bdrv_dirty_bitmap_check(const BdrvDirtyBitmap *bitmap, 
uint32_t flags,
  error_setg(errp, "Bitmap '%s' is inconsistent and cannot be used",
 bitmap->name);
  error_append_hint(errp, "Try block-dirty-bitmap-remove to delete"
-  " this bitmap from disk");
+  " this bitmap from disk\n");
  return -1;
  }

diff --git a/tests/qemu-iotests/291 b/tests/qemu-iotests/tests/qemu-img-bitmaps
similarity index 92%
rename from tests/qemu-iotests/291
rename to tests/qemu-iotests/tests/qemu-img-bitmaps
index 20efb080a6c0..76cd9e31e850 100755
--- a/tests/qemu-iotests/291
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps
@@ -3,7 +3,7 @@
  #
  # Test qemu-img bitmap handling
  #
-# Copyright (C) 2018-2020 Red Hat, Inc.
+# Copyright (C) 2018-2021 Red Hat, Inc.
  #
  # This program is free software; you can redistribute it and/or modify
  # it under the terms of the GNU General Public License as published by
@@ -32,6 +32,7 @@ _cleanup()
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
+cd ..
  . ./common.rc
  . ./common.filter
  . ./common.nbd
@@ -129,6 +130,16 @@ $QEMU_IMG map --output=json --image-opts \

  nbd_server_stop

+echo
+echo "=== Check handling of inconsistent bitmap ==="
+echo
+
+$QEMU_IO -c abort "$TEST_IMG" 2>/dev/null
+$QEMU_IMG bitmap --add "$TEST_IMG" b4
+$QEMU_IMG bitmap --remove "$TEST_IMG" b1
+_img_info --format-specific | _filter_irrelevant_img_info
+$QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG" "$TEST_IMG.copy"


Worth then removing the remaining inconsistent bitmaps and trying again?
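
Something along these lines, as a sketch (bitmap names taken from the
test output below):

  $QEMU_IMG bitmap --remove "$TEST_IMG" b0
  $QEMU_IMG bitmap --remove "$TEST_IMG" b2
  $QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG" "$TEST_IMG.copy"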

I think you should now remove $TEST_IMG.copy in _cleanup

with the following squashed in:

--- a/tests/qemu-iotests/tests/qemu-img-bitmaps
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps
@@ -27,6 +27,7 @@ status=1 # failure is the default!
 _cleanup()
 {
 _cleanup_test_img
+_rm_test_img "$TEST_IMG.copy"
 nbd_server_stop
 }
 trap "_cl

Tested-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Vladimir Sementsov-Ogievskiy 


+
  # success, all done
  echo '*** done'
  rm -f $seq.full
diff --git a/tests/qemu-iotests/291.out 
b/tests/qemu-iotests/tests/qemu-img-bitmaps.out
similarity index 82%
rename from tests/qemu-iotests/291.out
rename to tests/qemu-iotests/tests/qemu-img-bitmaps.out
index 018d6b103f87..17b34eaed30f 100644
--- a/tests/qemu-iotests/291.out
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps.out
@@ -1,4 +1,4 @@
-QA output created by 291
+QA output created by qemu-img-bitmaps

  === Initial image setup ===

@@ -115,4 +115,34 @@ Format specific information:
  [{ "start": 0, "length": 2097152, "depth": 0, "present": true, "zero": false, "data": 
true, "offset": OFFSET},
  { "start": 2097152, "length": 1048576, "depth": 0, "present": false, "zero": false, 
"data": false},
  { "start": 3145728, "length": 7340032, "depth": 0, "present": true, "zero": false, 
"data": true, "offset": OFFSET}]
+
+=== Check handling of inconsistent bitmap ===
+
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 10 MiB (10485760 bytes)
+cluster_size: 65536
+backing file: TEST_DIR/t.IMGFMT.base
+backing file format: IMGFMT
+Format specific information:
+bitmaps:
+[0]:
+flags:
+[0]: in-use
+[1]: auto
+name: b2
+granularity: 65536
+[1]:
+flags:
+[0]: in-use
+name: b0
+granularity: 65536
+[2]:
+flags:
+[0]: auto
+name: b4
+granularity: 65536
+corrupt: false
+qemu-img: Failed to populate bitmap b0: Bitmap 'b0' is inconsistent and cannot 
be used
+Try block-dirty-bitmap-remove to delete this bitmap from disk
  *** done




--
Best regards,
Vladimir



Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-08 Thread Hannes Reinecke

On 7/9/21 8:05 AM, Klaus Jensen wrote:

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and forth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to the controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
re-added, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


So, how do we go about with patch 4?
Without it this whole exercise is a bit pointless, seeing that it 
doesn't fix anything.


Shall we go with that patch as an interim solution?
Will you replace it with your 'object' patch?
What is the plan?

Cheers,

Hannes
--
Dr. Hannes ReineckeKernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH]: /hw/nvme/ctrl error handling if descriptors are greater than 1024

2021-07-08 Thread Klaus Jensen

On Jul  6 16:13, Padmakar Kalghatgi wrote:

From: padmakar

if the number of descriptors or pages is more than 1024,
dma writes or reads will result in failure. Hence, we check
if the number of descriptors or pages is more than 1024
in the nvme module and return Internal Device error.

Signed-off-by: Padmakar Kalghatgi
---
hw/nvme/ctrl.c   | 14 ++
hw/nvme/trace-events |  1 +
2 files changed, 15 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 40a7efc..082592f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -602,6 +602,20 @@ static uint16_t nvme_map_addr(NvmeCtrl *n, NvmeSg *sg,
hwaddr addr, size_t len)
return NVME_SUCCESS;
}
+/*
+ *  The QEMU has an inherent issue where in if the no.
+ *  of descriptors is more than 1024, it will result in
+ *  failure during the dma write or reads. Hence, we need
+ *  to return the error.
+ */
+
+if (((sg->flags & NVME_SG_DMA) && ((sg->qsg.nsg + 1) > IOV_MAX)) ||
+((sg->iov.niov + 1) > IOV_MAX)) {
+NVME_GUEST_ERR(pci_nvme_ub_sg_desc_toohigh,
+   "number of descriptors is greater than 1024");
+return NVME_INTERNAL_DEV_ERROR;
+}
+
trace_pci_nvme_map_addr(addr, len);
if (nvme_addr_is_cmb(n, addr)) {

diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index ea33d0c..bfe1a3b 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -202,3 +202,4 @@ pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t
new_head) "completion qu
pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write
for nonexistent queue, sqid=%"PRIu32", ignoring"
pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail)
"submission queue doorbell write value beyond queue size, sqid=%"PRIu32",
new_head=%"PRIu16", ignoring"
pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
+pci_nvme_ub_sg_desc_toohigh(void) "the number of sg descriptors is too
high"
--
2.7.0.windows.1





Applied to nvme-next, but made the error message more generic (this also 
applies to PRPs).


signature.asc
Description: PGP signature


Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-08 Thread Klaus Jensen

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and forth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to the controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
re-added, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


signature.asc
Description: PGP signature


Re: [PATCH] vhost-user: Fix backends without multiqueue support

2021-07-08 Thread Raphael Norwitz
On Mon, Jul 05, 2021 at 07:14:29PM +0200, Kevin Wolf wrote:
> dev->max_queues was never initialised for backends that don't support
> VHOST_USER_PROTOCOL_F_MQ, so it would use 0 as the maximum number of
> queues to check against and consequently fail for any such backend.
> 
> Set it to 1 if the backend doesn't have multiqueue support.
> 
> Fixes: c90bd505a3e8210c23d69fecab9ee6f56ec4a161
> Signed-off-by: Kevin Wolf 

Reviewed-by: Raphael Norwitz 

> ---
>  hw/virtio/vhost-user.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 1ac4a2ebec..29ea2b4fce 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -1913,7 +1913,10 @@ static int vhost_user_backend_init(struct vhost_dev 
> *dev, void *opaque,
>  if (err < 0) {
>  return -EPROTO;
>  }
> +} else {
> +dev->max_queues = 1;
>  }
> +
>  if (dev->num_queues && dev->max_queues < dev->num_queues) {
>  error_setg(errp, "The maximum number of queues supported by the "
> "backend is %" PRIu64, dev->max_queues);
> -- 
> 2.31.1
> 


Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-08 Thread Peter Lieven



> Am 08.07.2021 um 14:18 schrieb Kevin Wolf :
> 
> Am 07.07.2021 um 20:13 hat Peter Lieven geschrieben:
>>> Am 06.07.2021 um 17:25 schrieb Kevin Wolf :
>>> Am 06.07.2021 um 16:55 hat Peter Lieven geschrieben:
 I will have a decent look after my vacation.
>>> 
>>> Sounds good, thanks. Enjoy your vacation!
>> 
>> As I had to fire up my laptop to look into another issue anyway, I
>> have sent two patches for updating MAINTAINERS and to fix the int vs.
>> bool mix for task->complete.
> 
> I think you need to reevaluate your definition of vacation. ;-)

Let's talk about this when the kids are grown up. Sometimes sending patches can 
be quite relaxing :-)

> 
> But thanks anyway.
> 
>> As Paolo's fix (5f50be9b5) is relatively new and there are maybe other
>> non obvious problems when removing the BH indirection and we are close
>> to soft freeze I would leave the BH removal change for 6.2.
> 
> Sure, code cleanups aren't urgent.

Isn’t the indirection also a slight performance drop?

Peter






Re: [PATCH 3/2] qemu-img: Improve error for rebase without backing format

2021-07-08 Thread Kevin Wolf
Am 08.07.2021 um 17:52 hat Eric Blake geschrieben:
> When removing support for qemu-img being able to create backing
> chains without embedded backing formats, we caused a poor error
> message as caught by iotest 114.  Improve the situation to inform the
> user what went wrong.
> 
> Suggested-by: Kevin Wolf 
> Signed-off-by: Eric Blake 

Thanks, applied to the block branch.

Kevin




[PATCH 3/2] qemu-img: Improve error for rebase without backing format

2021-07-08 Thread Eric Blake
When removing support for qemu-img being able to create backing
chains without embedded backing formats, we caused a poor error
message as caught by iotest 114.  Improve the situation to inform the
user what went wrong.
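
For example (file names are illustrative), the failing invocation and
its fix would roughly look like:

  $ qemu-img rebase -b base.qcow2 top.qcow2
  qemu-img: Could not change the backing file to 'base.qcow2': backing format must be specified
  $ qemu-img rebase -F qcow2 -b base.qcow2 top.qcow2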

Suggested-by: Kevin Wolf 
Signed-off-by: Eric Blake 
---
 qemu-img.c | 3 +++
 tests/qemu-iotests/114.out | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index 7c0e73882dd4..b017734c255a 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3786,6 +3786,9 @@ static int img_rebase(int argc, char **argv)
 if (ret == -ENOSPC) {
 error_report("Could not change the backing file to '%s': No "
  "space left in the file header", out_baseimg);
+} else if (ret == -EINVAL && out_baseimg && !out_basefmt) {
+error_report("Could not change the backing file to '%s': backing "
+ "format must be specified", out_baseimg);
 } else if (ret < 0) {
 error_report("Could not change the backing file to '%s': %s",
 out_baseimg, strerror(-ret));
diff --git a/tests/qemu-iotests/114.out b/tests/qemu-iotests/114.out
index 172454401257..016e9ce3ecfb 100644
--- a/tests/qemu-iotests/114.out
+++ b/tests/qemu-iotests/114.out
@@ -14,7 +14,7 @@ qemu-io: can't open device TEST_DIR/t.qcow2: Could not open 
backing file: Unknow
 no file open, try 'help open'
 read 4096/4096 bytes at offset 0
 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-qemu-img: Could not change the backing file to 
'/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': Invalid 
argument
+qemu-img: Could not change the backing file to 
'/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': backing 
format must be specified
 read 4096/4096 bytes at offset 0
 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 *** done
-- 
2.31.1




Re: [PATCH v6 6/6] block: Make blockdev-reopen stable API

2021-07-08 Thread Eric Blake
On Thu, Jul 08, 2021 at 01:47:09PM +0200, Kevin Wolf wrote:
> From: Alberto Garcia 
> 
> This patch drops the 'x-' prefix from x-blockdev-reopen.
> 
> Signed-off-by: Alberto Garcia 
> Signed-off-by: Kevin Wolf 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> ---

> +++ b/qapi/block-core.json
> @@ -4219,7 +4219,7 @@
>  { 'command': 'blockdev-add', 'data': 'BlockdevOptions', 'boxed': true }
>  
>  ##
> -# @x-blockdev-reopen:
> +# @blockdev-reopen:
>  #
>  # Reopens one or more block devices using the given set of options.
>  # Any option not specified will be reset to its default value regardless
> @@ -4257,9 +4257,9 @@
>  # image does not have a default backing file name as part of its
>  # metadata.
>  #
> -# Since: 4.0
> +# Since: 6.0

6.1 now

And yay! We are finally getting here!

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 2/2] qemu-img: Require -F with -b backing image

2021-07-08 Thread Eric Blake
On Thu, Jul 08, 2021 at 03:00:36PM +0200, Kevin Wolf wrote:
> Am 03.05.2021 um 23:36 hat Eric Blake geschrieben:
> > @@ -17,7 +14,7 @@ qemu-io: can't open device TEST_DIR/t.qcow2: Could not 
> > open backing file: Unknow
> >  no file open, try 'help open'
> >  read 4096/4096 bytes at offset 0
> >  4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > -qemu-img: warning: Deprecated use of backing file without explicit backing 
> > format, use of this image requires potentially unsafe format probing
> > +qemu-img: Could not change the backing file to 
> > '/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': Invalid 
> > argument
> >  read 4096/4096 bytes at offset 0
> >  4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> 
> This is not exactly an improvement for the error message. Maybe worth a
> follow-up patch?

Sure, will do.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v6 1/6] qcow2: Fix dangling pointer after reopen for 'file'

2021-07-08 Thread Vladimir Sementsov-Ogievskiy

08.07.2021 14:47, Kevin Wolf wrote:

Without an external data file, s->data_file is a second pointer with the
same value as bs->file. When changing bs->file to a different BdrvChild
and freeing the old BdrvChild, s->data_file must also be updated,
otherwise it points to freed memory and causes crashes.

This problem was caught by iotests case 245.

Fixes: df2b7086f169239ebad5d150efa29c9bb6d4f820
Signed-off-by: Kevin Wolf


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH v6 1/6] qcow2: Fix dangling pointer after reopen for 'file'

2021-07-08 Thread Eric Blake
On Thu, Jul 08, 2021 at 01:47:04PM +0200, Kevin Wolf wrote:
> Without an external data file, s->data_file is a second pointer with the
> same value as bs->file. When changing bs->file to a different BdrvChild
> and freeing the old BdrvChild, s->data_file must also be updated,
> otherwise it points to freed memory and causes crashes.
> 
> This problem was caught by iotests case 245.
> 
> Fixes: df2b7086f169239ebad5d150efa29c9bb6d4f820
> Signed-off-by: Kevin Wolf 
> ---
>  block/qcow2.c | 29 +
>  1 file changed, 29 insertions(+)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




[PULL 5/5] block/io: Merge discard request alignments

2021-07-08 Thread Stefan Hajnoczi
From: Akihiko Odaki 

Signed-off-by: Akihiko Odaki 
Message-id: 20210705130458.97642-3-akihiko.od...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 block/io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/io.c b/block/io.c
index cf177a9d2d..e0a689c584 100644
--- a/block/io.c
+++ b/block/io.c
@@ -125,6 +125,8 @@ void bdrv_parent_drained_begin_single(BdrvChild *c, bool 
poll)
 
 static void bdrv_merge_limits(BlockLimits *dst, const BlockLimits *src)
 {
+dst->pdiscard_alignment = MAX(dst->pdiscard_alignment,
+  src->pdiscard_alignment);
 dst->opt_transfer = MAX(dst->opt_transfer, src->opt_transfer);
 dst->max_transfer = MIN_NON_ZERO(dst->max_transfer, src->max_transfer);
 dst->max_hw_transfer = MIN_NON_ZERO(dst->max_hw_transfer,
-- 
2.31.1



[PULL 3/5] block/file-posix: Optimize for macOS

2021-07-08 Thread Stefan Hajnoczi
From: Akihiko Odaki 

This commit introduces a "punch hole" operation and optimizes transfer
block size for macOS.

Thanks to Konstantin Nazarov for detailed analysis of a flaw in an
old version of this change:
https://gist.github.com/akihikodaki/87df4149e7ca87f18dc56807ec5a1bc5#gistcomment-3654667

Signed-off-by: Akihiko Odaki 
Message-id: 20210705130458.97642-1-akihiko.od...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 block/file-posix.c | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index a26eab0ac3..cb9bffe047 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -46,6 +46,7 @@
 #if defined(HAVE_HOST_BLOCK_DEVICE)
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1254,6 +1255,15 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 return;
 }
 
+#if defined(__APPLE__) && (__MACH__)
+struct statfs buf;
+
+if (!fstatfs(s->fd, &buf)) {
+bs->bl.opt_transfer = buf.f_iosize;
+bs->bl.pdiscard_alignment = buf.f_bsize;
+}
+#endif
+
 if (bs->sg || S_ISBLK(st.st_mode)) {
 int ret = hdev_get_max_hw_transfer(s->fd, &st);
 
@@ -1591,6 +1601,7 @@ out:
 }
 }
 
+#if defined(CONFIG_FALLOCATE) || defined(BLKZEROOUT) || defined(BLKDISCARD)
 static int translate_err(int err)
 {
 if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
@@ -1599,6 +1610,7 @@ static int translate_err(int err)
 }
 return err;
 }
+#endif
 
 #ifdef CONFIG_FALLOCATE
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
@@ -1811,16 +1823,27 @@ static int handle_aiocb_discard(void *opaque)
 }
 } while (errno == EINTR);
 
-ret = -errno;
+ret = translate_err(-errno);
 #endif
 } else {
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
 ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
aiocb->aio_offset, aiocb->aio_nbytes);
+ret = translate_err(-errno);
+#elif defined(__APPLE__) && (__MACH__)
+fpunchhole_t fpunchhole;
+fpunchhole.fp_flags = 0;
+fpunchhole.reserved = 0;
+fpunchhole.fp_offset = aiocb->aio_offset;
+fpunchhole.fp_length = aiocb->aio_nbytes;
+if (fcntl(s->fd, F_PUNCHHOLE, &fpunchhole) == -1) {
+ret = errno == ENODEV ? -ENOTSUP : -errno;
+} else {
+ret = 0;
+}
 #endif
 }
 
-ret = translate_err(ret);
 if (ret == -ENOTSUP) {
 s->has_discard = false;
 }
-- 
2.31.1



[PULL 4/5] block: Add backend_defaults property

2021-07-08 Thread Stefan Hajnoczi
From: Akihiko Odaki 

The backend_defaults property allows users to control whether default block
properties should be decided with backend information.

If it is off, any backend information will be discarded, which is
suitable if you plan to perform live migration to a different disk backend.

If it is on, a block device may utilize backend information more
aggressively.

By default, it is auto, which uses backend information for block
sizes and ignores the others; this is consistent with the older
versions.
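
As a usage sketch (drive and device names are illustrative):

  -drive id=drv0,if=none,file=disk.img,format=raw \
  -device virtio-blk-pci,drive=drv0,backend_defaults=off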

Signed-off-by: Akihiko Odaki 
Message-id: 20210705130458.97642-2-akihiko.od...@gmail.com
Signed-off-by: Stefan Hajnoczi 
---
 include/hw/block/block.h   |  3 +++
 hw/block/block.c   | 42 ++
 tests/qemu-iotests/172.out | 38 ++
 3 files changed, 79 insertions(+), 4 deletions(-)

diff --git a/include/hw/block/block.h b/include/hw/block/block.h
index c172cbe65f..5902c0440a 100644
--- a/include/hw/block/block.h
+++ b/include/hw/block/block.h
@@ -19,6 +19,7 @@
 
 typedef struct BlockConf {
 BlockBackend *blk;
+OnOffAuto backend_defaults;
 uint32_t physical_block_size;
 uint32_t logical_block_size;
 uint32_t min_io_size;
@@ -48,6 +49,8 @@ static inline unsigned int get_physical_block_exp(BlockConf 
*conf)
 }
 
 #define DEFINE_BLOCK_PROPERTIES_BASE(_state, _conf) \
+DEFINE_PROP_ON_OFF_AUTO("backend_defaults", _state, \
+_conf.backend_defaults, ON_OFF_AUTO_AUTO),  \
 DEFINE_PROP_BLOCKSIZE("logical_block_size", _state, \
   _conf.logical_block_size),\
 DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,\
diff --git a/hw/block/block.c b/hw/block/block.c
index 1e34573da7..d47ebf005a 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -65,24 +65,58 @@ bool blkconf_blocksizes(BlockConf *conf, Error **errp)
 {
 BlockBackend *blk = conf->blk;
 BlockSizes blocksizes;
-int backend_ret;
+BlockDriverState *bs;
+bool use_blocksizes;
+bool use_bs;
+
+switch (conf->backend_defaults) {
+case ON_OFF_AUTO_AUTO:
+use_blocksizes = !blk_probe_blocksizes(blk, &blocksizes);
+use_bs = false;
+break;
+
+case ON_OFF_AUTO_ON:
+use_blocksizes = !blk_probe_blocksizes(blk, &blocksizes);
+bs = blk_bs(blk);
+use_bs = bs;
+break;
+
+case ON_OFF_AUTO_OFF:
+use_blocksizes = false;
+use_bs = false;
+break;
+
+default:
+abort();
+}
 
-backend_ret = blk_probe_blocksizes(blk, &blocksizes);
 /* fill in detected values if they are not defined via qemu command line */
 if (!conf->physical_block_size) {
-if (!backend_ret) {
+if (use_blocksizes) {
conf->physical_block_size = blocksizes.phys;
 } else {
 conf->physical_block_size = BDRV_SECTOR_SIZE;
 }
 }
 if (!conf->logical_block_size) {
-if (!backend_ret) {
+if (use_blocksizes) {
 conf->logical_block_size = blocksizes.log;
 } else {
 conf->logical_block_size = BDRV_SECTOR_SIZE;
 }
 }
+if (use_bs) {
+if (!conf->opt_io_size) {
+conf->opt_io_size = bs->bl.opt_transfer;
+}
+if (conf->discard_granularity == -1) {
+if (bs->bl.pdiscard_alignment) {
+conf->discard_granularity = bs->bl.pdiscard_alignment;
+} else if (bs->bl.request_alignment != 1) {
+conf->discard_granularity = bs->bl.request_alignment;
+}
+}
+}
 
 if (conf->logical_block_size > conf->physical_block_size) {
 error_setg(errp,
diff --git a/tests/qemu-iotests/172.out b/tests/qemu-iotests/172.out
index d53f61d0de..4cf4d536b4 100644
--- a/tests/qemu-iotests/172.out
+++ b/tests/qemu-iotests/172.out
@@ -21,6 +21,7 @@ Testing:
   dev: floppy, id ""
 unit = 0 (0x0)
 drive = "floppy0"
+backend_defaults = "auto"
 logical_block_size = 512 (512 B)
 physical_block_size = 512 (512 B)
 min_io_size = 0 (0 B)
@@ -48,6 +49,7 @@ Testing: -fda TEST_DIR/t.qcow2
   dev: floppy, id ""
 unit = 0 (0x0)
 drive = "floppy0"
+backend_defaults = "auto"
 logical_block_size = 512 (512 B)
 physical_block_size = 512 (512 B)
 min_io_size = 0 (0 B)
@@ -85,6 +87,7 @@ Testing: -fdb TEST_DIR/t.qcow2
   dev: floppy, id ""
 unit = 1 (0x1)
 drive = "floppy1"
+backend_defaults = "auto"
 logical_block_size = 512 (512 B)
 physical_block_size = 512 (512 B)
 min_io_size = 0 (0 B)
@@ -96,6 +99,7 @@ Testing: -fdb TEST_DIR/t.qcow2

[PULL 1/5] util/async: add a human-readable name to BHs for debugging

2021-07-08 Thread Stefan Hajnoczi
It can be difficult to debug issues with BHs in production environments.
Although BHs can usually be identified by looking up their ->cb()
function pointer, this requires debug information for the program. It is
also not possible to print human-readable diagnostics about BHs because
they have no identifier.

This patch adds a name to each BH. The name is not unique per instance
but differentiates between cb() functions, which is usually enough. It's
done by changing aio_bh_new() and friends to macros that stringify cb.

The next patch will use the name field when reporting leaked BHs.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Philippe Mathieu-Daudé 
Message-Id: <20210414200247.917496-2-stefa...@redhat.com>
---
 include/block/aio.h| 31 ---
 include/qemu/main-loop.h   |  4 +++-
 tests/unit/ptimer-test-stubs.c |  2 +-
 util/async.c   |  9 +++--
 util/main-loop.c   |  4 ++--
 5 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 10fcae1515..807edce9b5 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -291,20 +291,45 @@ void aio_context_acquire(AioContext *ctx);
 /* Relinquish ownership of the AioContext. */
 void aio_context_release(AioContext *ctx);
 
+/**
+ * aio_bh_schedule_oneshot_full: Allocate a new bottom half structure that will
+ * run only once and as soon as possible.
+ *
+ * @name: A human-readable identifier for debugging purposes.
+ */
+void aio_bh_schedule_oneshot_full(AioContext *ctx, QEMUBHFunc *cb, void 
*opaque,
+  const char *name);
+
 /**
  * aio_bh_schedule_oneshot: Allocate a new bottom half structure that will run
  * only once and as soon as possible.
+ *
+ * A convenience wrapper for aio_bh_schedule_oneshot_full() that uses cb as the
+ * name string.
  */
-void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque);
+#define aio_bh_schedule_oneshot(ctx, cb, opaque) \
+aio_bh_schedule_oneshot_full((ctx), (cb), (opaque), (stringify(cb)))
 
 /**
- * aio_bh_new: Allocate a new bottom half structure.
+ * aio_bh_new_full: Allocate a new bottom half structure.
  *
  * Bottom halves are lightweight callbacks whose invocation is guaranteed
  * to be wait-free, thread-safe and signal-safe.  The #QEMUBH structure
  * is opaque and must be allocated prior to its use.
+ *
+ * @name: A human-readable identifier for debugging purposes.
  */
-QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque);
+QEMUBH *aio_bh_new_full(AioContext *ctx, QEMUBHFunc *cb, void *opaque,
+const char *name);
+
+/**
+ * aio_bh_new: Allocate a new bottom half structure
+ *
+ * A convenience wrapper for aio_bh_new_full() that uses the cb as the name
+ * string.
+ */
+#define aio_bh_new(ctx, cb, opaque) \
+aio_bh_new_full((ctx), (cb), (opaque), (stringify(cb)))
 
 /**
  * aio_notify: Force processing of pending events.
diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
index 98aef5647c..8dbc6fcb89 100644
--- a/include/qemu/main-loop.h
+++ b/include/qemu/main-loop.h
@@ -294,7 +294,9 @@ void qemu_cond_timedwait_iothread(QemuCond *cond, int ms);
 
 void qemu_fd_register(int fd);
 
-QEMUBH *qemu_bh_new(QEMUBHFunc *cb, void *opaque);
+#define qemu_bh_new(cb, opaque) \
+qemu_bh_new_full((cb), (opaque), (stringify(cb)))
+QEMUBH *qemu_bh_new_full(QEMUBHFunc *cb, void *opaque, const char *name);
 void qemu_bh_schedule_idle(QEMUBH *bh);
 
 enum {
diff --git a/tests/unit/ptimer-test-stubs.c b/tests/unit/ptimer-test-stubs.c
index 7f801a4d09..2a3ef58799 100644
--- a/tests/unit/ptimer-test-stubs.c
+++ b/tests/unit/ptimer-test-stubs.c
@@ -108,7 +108,7 @@ int64_t qemu_clock_deadline_ns_all(QEMUClockType type, int 
attr_mask)
 return deadline;
 }
 
-QEMUBH *qemu_bh_new(QEMUBHFunc *cb, void *opaque)
+QEMUBH *qemu_bh_new_full(QEMUBHFunc *cb, void *opaque, const char *name)
 {
 QEMUBH *bh = g_new(QEMUBH, 1);
 
diff --git a/util/async.c b/util/async.c
index 5d9b7cc1eb..9a668996b8 100644
--- a/util/async.c
+++ b/util/async.c
@@ -57,6 +57,7 @@ enum {
 
 struct QEMUBH {
 AioContext *ctx;
+const char *name;
 QEMUBHFunc *cb;
 void *opaque;
 QSLIST_ENTRY(QEMUBH) next;
@@ -107,7 +108,8 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)
 return bh;
 }
 
-void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
+void aio_bh_schedule_oneshot_full(AioContext *ctx, QEMUBHFunc *cb,
+  void *opaque, const char *name)
 {
 QEMUBH *bh;
 bh = g_new(QEMUBH, 1);
@@ -115,11 +117,13 @@ void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc 
*cb, void *opaque)
 .ctx = ctx,
 .cb = cb,
 .opaque = opaque,
+.name = name,
 };
 aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
 }
 
-QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
+QEMUBH *aio_bh_new_full(AioCont

[PULL 2/5] util/async: print leaked BH name when AioContext finalizes

2021-07-08 Thread Stefan Hajnoczi
BHs must be deleted before the AioContext is finalized. If not, it's a
bug and probably indicates that some part of the program still expects
the BH to run in the future. That can lead to memory leaks, inconsistent
state, or just hangs.

Unfortunately the assert(flags & BH_DELETED) call in aio_ctx_finalize()
is difficult to debug because the assertion failure contains no
information about the BH!

Use the QEMUBH name field added in the previous patch to show a useful
error when a leaked BH is detected.

Suggested-by: Eric Ernst 
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20210414200247.917496-3-stefa...@redhat.com>
---
 util/async.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/util/async.c b/util/async.c
index 9a668996b8..9a41591319 100644
--- a/util/async.c
+++ b/util/async.c
@@ -344,8 +344,20 @@ aio_ctx_finalize(GSource *source)
 assert(QSIMPLEQ_EMPTY(&ctx->bh_slice_list));
 
 while ((bh = aio_bh_dequeue(&ctx->bh_list, &flags))) {
-/* qemu_bh_delete() must have been called on BHs in this AioContext */
-assert(flags & BH_DELETED);
+/*
+ * qemu_bh_delete() must have been called on BHs in this AioContext. In
+ * many cases memory leaks, hangs, or inconsistent state occur when a
+ * BH is leaked because something still expects it to run.
+ *
+ * If you hit this, fix the lifecycle of the BH so that
+ * qemu_bh_delete() and any associated cleanup is called before the
+ * AioContext is finalized.
+ */
+if (unlikely(!(flags & BH_DELETED))) {
+fprintf(stderr, "%s: BH '%s' leaked, aborting...\n",
+__func__, bh->name);
+abort();
+}
 
 g_free(bh);
 }
-- 
2.31.1



[PULL 0/5] Block patches

2021-07-08 Thread Stefan Hajnoczi
The following changes since commit 711c0418c8c1ce3a24346f058b001c4c5a2f0f81:

  Merge remote-tracking branch 'remotes/philmd/tags/mips-20210702' into staging 
(2021-07-04 14:04:12 +0100)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 9f460c64e13897117f35ffb61f6f5e0102cabc70:

  block/io: Merge discard request alignments (2021-07-06 14:28:55 +0100)


Pull request



Akihiko Odaki (3):
  block/file-posix: Optimize for macOS
  block: Add backend_defaults property
  block/io: Merge discard request alignments

Stefan Hajnoczi (2):
  util/async: add a human-readable name to BHs for debugging
  util/async: print leaked BH name when AioContext finalizes

 include/block/aio.h| 31 ++---
 include/hw/block/block.h   |  3 +++
 include/qemu/main-loop.h   |  4 +++-
 block/file-posix.c | 27 --
 block/io.c |  2 ++
 hw/block/block.c   | 42 ++
 tests/unit/ptimer-test-stubs.c |  2 +-
 util/async.c   | 25 
 util/main-loop.c   |  4 ++--
 tests/qemu-iotests/172.out | 38 ++
 10 files changed, 161 insertions(+), 17 deletions(-)

-- 
2.31.1



Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex

2021-07-08 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 06:58:07PM +0200, Emanuele Giuseppe Esposito wrote:
> This is a continuation of the work to reduce (and possibly get rid of) the 
> usage of the AioContext lock, by introducing smaller-granularity locks to 
> preserve thread safety.
> 
> This series aims to:
> 1) remove the aiocontext lock and substitute it with the already existing
>global job_mutex
> 2) fix what it looks like to be an oversight when moving the blockjob.c logic
>into the more generic job.c: job_mutex was introduced especially to
>protect job->busy flag, but it seems that it was not used in successive
>patches, because there are multiple code sections that directly
>access the field without any locking.
> 3) use job_mutex instead of the aiocontext_lock
> 4) extend the reach of the job_mutex to protect all shared fields
>that the job structure has.

Can you explain the big picture:

1. What are the rules for JobDrivers? Imagine you are implementing a new
   JobDriver. What do you need to know in order to write correct code?

2. What are the rules for monitor? The main pattern is looking up a job,
   invoking a job API on it, and then calling job_unlock().

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 0/2] Remove deprecated qemu-img backing file without format

2021-07-08 Thread Kevin Wolf
Am 07.07.2021 um 23:17 hat Eric Blake geschrieben:
> On Mon, May 03, 2021 at 04:35:58PM -0500, Eric Blake wrote:
> > We've gone enough release cycles without noticeable pushback on our
> > intentions, so time to make it harder to create images that can form a
> > security hole due to a need for format probing rather than an explicit
> > format.
> > 
> > Eric Blake (2):
> >   qcow2: Prohibit backing file changes in 'qemu-img amend'
> >   qemu-img: Require -F with -b backing image
> 
> Ping.

Thanks, applied to the block branch.

For some reason, the CCs were missing on the cover letter. Please make
sure to CC me (and qemu-block) for the whole series if you want it to go
through my tree.

Kevin




Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex

2021-07-08 Thread Stefan Hajnoczi
On Thu, Jul 08, 2021 at 01:32:12PM +0200, Paolo Bonzini wrote:
> On 08/07/21 12:36, Stefan Hajnoczi wrote:
> > > What is very clear from this patch is that it
> > > is strictly related to the bdrv_* and lower level calls, because
> > > they also internally check or even use the aiocontext lock.
> > > Therefore, in order to make it work, I temporarily added some
> > > aiocontext_acquire/release pair around the function that
> > > still assert for them or assume they are held and temporarily
> > > unlock (unlock() - lock()).
> > 
> > Sounds like the issue is that this patch series assumes AioContext locks
> > are no longer required for calling the blk_*()/bdrv_*() APIs? That is
> > not the case yet, so you had to then add those aio_context_lock() calls
> > back in elsewhere. This approach introduces unnecessary risk. I think we
> > should wait until blk_*()/bdrv_*() no longer requires the caller to hold
> > the AioContext lock before applying this series.
> 
> In general I'm in favor of pushing the lock further down into smaller and
> smaller critical sections; it's a good approach to make further audits
> easier until it's "obvious" that the lock is unnecessary.  I haven't yet
> reviewed Emanuele's patches to see if this is what he's doing where he's
> adding the acquire/release calls, but that's my understanding of both his
> cover letter and your reply.

The problem is the unnecessary risk. We know what the goal is for
blk_*()/bdrv_*() but it's not quite there yet. Does making changes in
block jobs help solve the final issues with blk_*()/bdrv_*()?

If yes, then it's a risk worth taking. If no, then spending time
developing interim code, reviewing those patches, and risking breakage
doesn't seem worth it. I'd rather wait for blk_*()/bdrv_*() to be fully
complete and then see patches that delete aio_context_acquire() in most
places or add locks in the remaining places where the caller was relying
on the AioContext lock.

Stefan




Re: [PATCH 2/2] qemu-img: Require -F with -b backing image

2021-07-08 Thread Kevin Wolf
Am 03.05.2021 um 23:36 hat Eric Blake geschrieben:
> @@ -17,7 +14,7 @@ qemu-io: can't open device TEST_DIR/t.qcow2: Could not open 
> backing file: Unknow
>  no file open, try 'help open'
>  read 4096/4096 bytes at offset 0
>  4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -qemu-img: warning: Deprecated use of backing file without explicit backing 
> format, use of this image requires potentially unsafe format probing
> +qemu-img: Could not change the backing file to 
> '/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': Invalid 
> argument
>  read 4096/4096 bytes at offset 0
>  4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)

This is not exactly an improvement for the error message. Maybe worth a
follow-up patch?

Kevin




Re: [RFC PATCH 5/6] job: use global job_mutex to protect struct Job

2021-07-08 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 06:58:12PM +0200, Emanuele Giuseppe Esposito wrote:
> This lock is going to replace most of the AioContext locks
> in the job and blockjob, so that a Job can run in an arbitrary
> AioContext.
> 
> Signed-off-by: Emanuele Giuseppe Esposito 
> ---
>  include/block/blockjob_int.h |   1 +
>  include/qemu/job.h   |   2 +
>  block/backup.c   |   4 +
>  block/mirror.c   |  11 +-
>  blockdev.c   |  62 
>  blockjob.c   |  67 +++--
>  job-qmp.c|  55 +++
>  job.c| 284 +++
>  qemu-img.c   |  15 +-
>  9 files changed, 350 insertions(+), 151 deletions(-)
> 
> diff --git a/include/block/blockjob_int.h b/include/block/blockjob_int.h
> index 6633d83da2..8b91126506 100644
> --- a/include/block/blockjob_int.h
> +++ b/include/block/blockjob_int.h
> @@ -53,6 +53,7 @@ struct BlockJobDriver {
>   */
>  void (*attached_aio_context)(BlockJob *job, AioContext *new_context);
>  
> +/* Called with job mutex *not* held. */
>  void (*set_speed)(BlockJob *job, int64_t speed);
>  };
>  
> diff --git a/include/qemu/job.h b/include/qemu/job.h
> index 4421d08d93..359f4e6b3a 100644
> --- a/include/qemu/job.h
> +++ b/include/qemu/job.h
> @@ -49,6 +49,8 @@ typedef struct Job {
>  /**
>   * The type of this job.
>   * Set it in job_create and just read.
> + * All calls into the driver's functions must be made without job_mutex
> + * held, to avoid deadlocks.
>   */
>  const JobDriver *driver;
>  
> diff --git a/block/backup.c b/block/backup.c
> index bd3614ce70..80ce956299 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -315,6 +315,10 @@ static void coroutine_fn backup_pause(Job *job)
>  }
>  }
>  
> +/*
> + * Called with job mutex *not* held (we don't want to call block_copy_kick
> + * with the lock held!)
> + */
>  static void coroutine_fn backup_set_speed(BlockJob *job, int64_t speed)
>  {
>  BackupBlockJob *s = container_of(job, BackupBlockJob, common);
> diff --git a/block/mirror.c b/block/mirror.c
> index 49fffa..deefaa6a39 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -1150,9 +1150,11 @@ static void mirror_complete(Job *job, Error **errp)
>  s->should_complete = true;
>  
>  /* If the job is paused, it will be re-entered when it is resumed */
> +job_lock();
>  if (!job_is_paused(job)) {
> -job_enter(job);
> +job_enter_locked(job);
>  }
> +job_unlock();
>  }
>  
>  static void coroutine_fn mirror_pause(Job *job)
> @@ -1171,10 +1173,13 @@ static bool mirror_drained_poll(BlockJob *job)
>   * from one of our own drain sections, to avoid a deadlock waiting for
>   * ourselves.
>   */
> -if (!job_is_paused(&s->common.job) && !job_is_cancelled(&s->common.job) 
> &&
> -!s->in_drain) {
> +job_lock();
> +if (!job_is_paused(&s->common.job) &&
> +!job_is_cancelled_locked(&s->common.job) && !s->in_drain) {
> +job_unlock();
>  return true;
>  }
> +job_unlock();
>  
>  return !!s->in_flight;
>  }
> diff --git a/blockdev.c b/blockdev.c
> index 8e2c15370e..9255aea6a2 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -150,9 +150,11 @@ void blockdev_mark_auto_del(BlockBackend *blk)
>  AioContext *aio_context = job_get_aiocontext(&job->job);
>  aio_context_acquire(aio_context);
>  
> +job_lock();
>  job_cancel(&job->job, false);
>  
>  aio_context_release(aio_context);
> +job_unlock();

This looks strange. The way it's written suggests there is a reason why
job_unlock() has to be called after aio_context_release(). Can
job_unlock() be called immediately after job_cancel()?
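
(Illustration: the reordering implied by the question above, assuming
job_cancel() has no dependency on the AioContext lock being released first.)

    aio_context_acquire(aio_context);

    job_lock();
    job_cancel(&job->job, false);
    job_unlock();

    aio_context_release(aio_context);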

>  }
>  }
>  
> @@ -3309,48 +3311,44 @@ out:
>  aio_context_release(aio_context);
>  }
>  
> -/* Get a block job using its ID and acquire its AioContext */
> -static BlockJob *find_block_job(const char *id, AioContext **aio_context,
> -Error **errp)
> +/* Get a block job using its ID and acquire its job_lock */

"its" suggests job_lock is per-Job. I suggest saying something like
"Returns with job_lock held on success" instead.




Re: [PATCH] blockdev: fix drive-backup transaction endless drained section

2021-07-08 Thread Kevin Wolf
Am 07.07.2021 um 15:35 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Forgotten thing :(
> 
> Kevin, could you please queue it in your block branch? For me not to
> bother Peter with one-patch pull request.

No problem, I've queued it now.

Kevin




Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-08 Thread Kevin Wolf
Am 07.07.2021 um 20:13 hat Peter Lieven geschrieben:
> Am 06.07.2021 um 17:25 schrieb Kevin Wolf :
> > Am 06.07.2021 um 16:55 hat Peter Lieven geschrieben:
> >> I will have a decent look after my vacation.
> > 
> > Sounds good, thanks. Enjoy your vacation!
> 
> As I had to fire up my laptop to look into another issue anyway, I
> have sent two patches: one to update MAINTAINERS and one to fix the
> int vs. bool mix for task->complete.

I think you need to reevaluate your definition of vacation. ;-)

But thanks anyway.

> As Paolo's fix (5f50be9b5) is relatively new, there may be other
> non-obvious problems with removing the BH indirection, and we are close
> to soft freeze, so I would leave the BH removal change for 6.2.

Sure, code cleanups aren't urgent.

Kevin




Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex

2021-07-08 Thread Kevin Wolf
Am 08.07.2021 um 13:32 hat Paolo Bonzini geschrieben:
> On 08/07/21 12:36, Stefan Hajnoczi wrote:
> > > What is very clear from this patch is that it
> > > is strictly related to the bdrv_* and lower-level calls, because
> > > they also internally check or even use the AioContext lock.
> > > Therefore, in order to make it work, I temporarily added some
> > > aio_context_acquire/release pairs around the functions that
> > > still assert for it or assume it is held, and temporarily
> > > unlock (unlock() - lock()).
> > 
> > Sounds like the issue is that this patch series assumes AioContext locks
> > are no longer required for calling the blk_*()/bdrv_*() APIs? That is
> > not the case yet, so you had to then add those aio_context_lock() calls
> > back in elsewhere. This approach introduces unnecessary risk. I think we
> > should wait until blk_*()/bdrv_*() no longer requires the caller to hold
> > the AioContext lock before applying this series.
> 
> In general I'm in favor of pushing the lock further down into smaller and
> smaller critical sections; it's a good approach to make further audits
> easier until it's "obvious" that the lock is unnecessary.  I haven't yet
> reviewed Emanuele's patches to see if this is what he's doing where he's
> adding the acquire/release calls, but that's my understanding of both his
> cover letter and your reply.
> 
> The I/O blk_*()/bdrv_*() *should* not require the caller to hold the
> AioContext lock; all drivers use their own CoMutex or QemuMutex when needed,
> and generic code should also be ready (caveat emptor).  Others, such as
> reopen, are a mess that requires a separate audit.  Restricting
> acquire/release to be only around those seems like a good starting point.

Reopen isn't just a mess, but in fact buggy. After the following patch
goes in, the rule is simple: Don't hold any AioContext locks when
calling bdrv_reopen_multiple().

'block: Acquire AioContexts during bdrv_reopen_multiple()'
https://lists.gnu.org/archive/html/qemu-block/2021-07/msg00238.html

It still takes AioContext locks when it calls into other functions that
currently expect it, but that should be the same as usual then.

And once callers don't even hold the lock in the first place, we'll also
get rid of the ugly temporary lock release across reopen.
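
(Illustration: the caller-side rule sketched out, mirroring the bdrv_reopen()
wrapper added by the patch further down this digest; error handling is
omitted and the surrounding function is assumed to run in the main loop.)

    AioContext *ctx = bdrv_get_aio_context(bs);
    BlockReopenQueue *queue;
    int ret;

    bdrv_subtree_drained_begin(bs);
    if (ctx != qemu_get_aio_context()) {
        aio_context_release(ctx);            /* drop our lock before reopening */
    }

    queue = bdrv_reopen_queue(NULL, bs, opts, keep_old_opts);
    ret = bdrv_reopen_multiple(queue, errp); /* locks each node's context itself */

    if (ctx != qemu_get_aio_context()) {
        aio_context_acquire(ctx);            /* restore the caller's locking state */
    }
    bdrv_subtree_drained_end(bs);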

Kevin




Re: [PATCH] MAINTAINERS: update block/rbd.c maintainer

2021-07-08 Thread Kevin Wolf
Am 08.07.2021 um 12:44 hat Ilya Dryomov geschrieben:
> On Wed, Jul 7, 2021 at 8:05 PM Peter Lieven  wrote:
> >
> > adding myself as a designated reviewer.
> >
> > Signed-off-by: Peter Lieven 
> > ---
> >  MAINTAINERS | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 516db737d1..cfda57e825 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -3058,6 +3058,7 @@ F: block/vmdk.c
> >
> >  RBD
> >  M: Ilya Dryomov 
> > +R: Peter Lieven 
> >  L: qemu-block@nongnu.org
> >  S: Supported
> >  F: block/rbd.c
> 
> Nit: I would change the title to "MAINTAINERS: add block/rbd.c reviewer"
> or something like that.

Yes, this is better. I've changed it in my queue.

Kevin




[PATCH v6 5/6] iotests: Test reopening multiple devices at the same time

2021-07-08 Thread Kevin Wolf
From: Alberto Garcia 

This test swaps the images used by two active block devices.

This is now possible thanks to the new ability to run
x-blockdev-reopen on multiple devices at the same time.

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/245 | 47 ++
 tests/qemu-iotests/245.out |  4 ++--
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245
index ca08955207..8bc8101e6d 100755
--- a/tests/qemu-iotests/245
+++ b/tests/qemu-iotests/245
@@ -649,6 +649,53 @@ class TestBlockdevReopen(iotests.QMPTestCase):
  '-c', 'read -P 0x40 0x40008 1',
  '-c', 'read -P 0x80 0x40010 1', 
hd_path[0])
 
+# Swap the disk images of two active block devices
+def test_swap_files(self):
+# Add hd0 and hd2 (none of them with backing files)
+opts0 = hd_opts(0)
+result = self.vm.qmp('blockdev-add', conv_keys = False, **opts0)
+self.assert_qmp(result, 'return', {})
+
+opts2 = hd_opts(2)
+result = self.vm.qmp('blockdev-add', conv_keys = False, **opts2)
+self.assert_qmp(result, 'return', {})
+
+# Write different data to both block devices
+self.run_qemu_io("hd0", "write -P 0xa0 0 1k")
+self.run_qemu_io("hd2", "write -P 0xa2 0 1k")
+
+# Check that the data reads correctly
+self.run_qemu_io("hd0", "read  -P 0xa0 0 1k")
+self.run_qemu_io("hd2", "read  -P 0xa2 0 1k")
+
+# It's not possible to make a block device use an image that
+# is already being used by the other device.
+self.reopen(opts0, {'file': 'hd2-file'},
+"Permission conflict on node 'hd2-file': permissions "
+"'write, resize' are both required by node 'hd2' (uses "
+"node 'hd2-file' as 'file' child) and unshared by node "
+"'hd0' (uses node 'hd2-file' as 'file' child).")
+self.reopen(opts2, {'file': 'hd0-file'},
+"Permission conflict on node 'hd0-file': permissions "
+"'write, resize' are both required by node 'hd0' (uses "
+"node 'hd0-file' as 'file' child) and unshared by node "
+"'hd2' (uses node 'hd0-file' as 'file' child).")
+
+# But we can swap the images if we reopen both devices at the
+# same time
+opts0['file'] = 'hd2-file'
+opts2['file'] = 'hd0-file'
+self.reopenMultiple([opts0, opts2])
+self.run_qemu_io("hd0", "read  -P 0xa2 0 1k")
+self.run_qemu_io("hd2", "read  -P 0xa0 0 1k")
+
+# And we can of course come back to the original state
+opts0['file'] = 'hd0-file'
+opts2['file'] = 'hd2-file'
+self.reopenMultiple([opts0, opts2])
+self.run_qemu_io("hd0", "read  -P 0xa0 0 1k")
+self.run_qemu_io("hd2", "read  -P 0xa2 0 1k")
+
 # Misc reopen tests with different block drivers
 @iotests.skip_if_unsupported(['quorum', 'throttle'])
 def test_misc_drivers(self):
diff --git a/tests/qemu-iotests/245.out b/tests/qemu-iotests/245.out
index daf1e51922..4eced19294 100644
--- a/tests/qemu-iotests/245.out
+++ b/tests/qemu-iotests/245.out
@@ -17,8 +17,8 @@ read 1/1 bytes at offset 262152
 read 1/1 bytes at offset 262160
 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
-..
+...
 --
-Ran 24 tests
+Ran 25 tests
 
 OK
-- 
2.31.1




[PATCH v6 6/6] block: Make blockdev-reopen stable API

2021-07-08 Thread Kevin Wolf
From: Alberto Garcia 

This patch drops the 'x-' prefix from x-blockdev-reopen.

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block-core.json|  6 +++---
 blockdev.c  |  2 +-
 tests/qemu-iotests/155  |  2 +-
 tests/qemu-iotests/165  |  2 +-
 tests/qemu-iotests/245  | 10 +-
 tests/qemu-iotests/248  |  2 +-
 tests/qemu-iotests/248.out  |  2 +-
 tests/qemu-iotests/296  |  2 +-
 tests/qemu-iotests/298  |  2 +-
 tests/qemu-iotests/tests/remove-bitmap-from-backing |  4 ++--
 10 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 052520331e..2eb399f0d4 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4219,7 +4219,7 @@
 { 'command': 'blockdev-add', 'data': 'BlockdevOptions', 'boxed': true }
 
 ##
-# @x-blockdev-reopen:
+# @blockdev-reopen:
 #
 # Reopens one or more block devices using the given set of options.
 # Any option not specified will be reset to its default value regardless
@@ -4257,9 +4257,9 @@
 # image does not have a default backing file name as part of its
 # metadata.
 #
-# Since: 4.0
+# Since: 6.0
 ##
-{ 'command': 'x-blockdev-reopen',
+{ 'command': 'blockdev-reopen',
   'data': { 'options': ['BlockdevOptions'] } }
 
 ##
diff --git a/blockdev.c b/blockdev.c
index fddcccbdbd..6466a106f4 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3560,7 +3560,7 @@ fail:
 visit_free(v);
 }
 
-void qmp_x_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
+void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
 {
 BlockReopenQueue *queue = NULL;
 GSList *drained = NULL;
diff --git a/tests/qemu-iotests/155 b/tests/qemu-iotests/155
index 2947bfb81a..eadda52615 100755
--- a/tests/qemu-iotests/155
+++ b/tests/qemu-iotests/155
@@ -261,7 +261,7 @@ class TestBlockdevMirrorReopen(MirrorBaseClass):
 result = self.vm.qmp('blockdev-add', node_name="backing",
  driver="null-co")
 self.assert_qmp(result, 'return', {})
-result = self.vm.qmp('x-blockdev-reopen', options=[{
+result = self.vm.qmp('blockdev-reopen', options=[{
  'node-name': "target",
  'driver': iotests.imgfmt,
  'file': "target-file",
diff --git a/tests/qemu-iotests/165 b/tests/qemu-iotests/165
index ef4cf14516..ce499946b8 100755
--- a/tests/qemu-iotests/165
+++ b/tests/qemu-iotests/165
@@ -137,7 +137,7 @@ class TestPersistentDirtyBitmap(iotests.QMPTestCase):
 assert sha256_1 == self.getSha256()
 
 # Reopen to RW
-result = self.vm.qmp('x-blockdev-reopen', options=[{
+result = self.vm.qmp('blockdev-reopen', options=[{
 'node-name': 'node0',
 'driver': iotests.imgfmt,
 'file': {
diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245
index 8bc8101e6d..bf8261eec0 100755
--- a/tests/qemu-iotests/245
+++ b/tests/qemu-iotests/245
@@ -1,7 +1,7 @@
 #!/usr/bin/env python3
 # group: rw
 #
-# Test cases for the QMP 'x-blockdev-reopen' command
+# Test cases for the QMP 'blockdev-reopen' command
 #
 # Copyright (C) 2018-2019 Igalia, S.L.
 # Author: Alberto Garcia 
@@ -85,16 +85,16 @@ class TestBlockdevReopen(iotests.QMPTestCase):
  "Expected output of %d qemu-io commands, found %d" %
  (found, self.total_io_cmds))
 
-# Run x-blockdev-reopen on a list of block devices
+# Run blockdev-reopen on a list of block devices
 def reopenMultiple(self, opts, errmsg = None):
-result = self.vm.qmp('x-blockdev-reopen', conv_keys=False, 
options=opts)
+result = self.vm.qmp('blockdev-reopen', conv_keys=False, options=opts)
 if errmsg:
 self.assert_qmp(result, 'error/class', 'GenericError')
 self.assert_qmp(result, 'error/desc', errmsg)
 else:
 self.assert_qmp(result, 'return', {})
 
-# Run x-blockdev-reopen on a single block device (specified by
+# Run blockdev-reopen on a single block device (specified by
 # 'opts') but applying 'newopts' on top of it. The original 'opts'
 # dict is unmodified
 def reopen(self, opts, newopts = {}, errmsg = None):
@@ -161,7 +161,7 @@ class TestBlockdevReopen(iotests.QMPTestCase):
 self.reopen(opts, {'file.locking': 'off'}, "Cannot change the option 
'locking'")
 self.reopen(opts, {'file.filename': None}, "Invalid parameter type for 
'options[0].file.filename', expected: string")
 
-# node-name is optional in BlockdevOptions, but x-blockdev-reopen 
needs it
+# node-name is optional i

[PATCH v6 1/6] qcow2: Fix dangling pointer after reopen for 'file'

2021-07-08 Thread Kevin Wolf
Without an external data file, s->data_file is a second pointer with the
same value as bs->file. When changing bs->file to a different BdrvChild
and freeing the old BdrvChild, s->data_file must also be updated,
otherwise it points to freed memory and causes crashes.

This problem was caught by iotests case 245.

Fixes: df2b7086f169239ebad5d150efa29c9bb6d4f820
Signed-off-by: Kevin Wolf 
---
 block/qcow2.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index ee4530cdbd..9126127633 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1926,6 +1926,7 @@ static void qcow2_refresh_limits(BlockDriverState *bs, 
Error **errp)
 static int qcow2_reopen_prepare(BDRVReopenState *state,
 BlockReopenQueue *queue, Error **errp)
 {
+BDRVQcow2State *s = state->bs->opaque;
 Qcow2ReopenState *r;
 int ret;
 
@@ -1956,6 +1957,16 @@ static int qcow2_reopen_prepare(BDRVReopenState *state,
 }
 }
 
+/*
+ * Without an external data file, s->data_file points to the same BdrvChild
+ * as bs->file. It needs to be resynced after reopen because bs->file may
+ * be changed. We can't use it in the meantime.
+ */
+if (!has_data_file(state->bs)) {
+assert(s->data_file == state->bs->file);
+s->data_file = NULL;
+}
+
 return 0;
 
 fail:
@@ -1966,7 +1977,16 @@ fail:
 
 static void qcow2_reopen_commit(BDRVReopenState *state)
 {
+BDRVQcow2State *s = state->bs->opaque;
+
 qcow2_update_options_commit(state->bs, state->opaque);
+if (!s->data_file) {
+/*
+ * If we don't have an external data file, s->data_file was cleared by
+ * qcow2_reopen_prepare() and needs to be updated.
+ */
+s->data_file = state->bs->file;
+}
 g_free(state->opaque);
 }
 
@@ -1990,6 +2010,15 @@ static void qcow2_reopen_commit_post(BDRVReopenState 
*state)
 
 static void qcow2_reopen_abort(BDRVReopenState *state)
 {
+BDRVQcow2State *s = state->bs->opaque;
+
+if (!s->data_file) {
+/*
+ * If we don't have an external data file, s->data_file was cleared by
+ * qcow2_reopen_prepare() and needs to be restored.
+ */
+s->data_file = state->bs->file;
+}
 qcow2_update_options_abort(state->bs, state->opaque);
 g_free(state->opaque);
 }
-- 
2.31.1




[PATCH v6 3/6] block: Acquire AioContexts during bdrv_reopen_multiple()

2021-07-08 Thread Kevin Wolf
As the BlockReopenQueue can contain nodes in multiple AioContexts, only
one of which may be locked when AIO_WAIT_WHILE() can be called, we can't
let the caller lock the right contexts. Instead, individually lock the
AioContext of a single node when iterating the queue.

Reintroduce bdrv_reopen() as a wrapper for reopening a single node that
drains the node and temporarily drops the AioContext lock for
bdrv_reopen_multiple().

Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block.h |  2 ++
 block.c   | 49 ---
 block/replication.c   |  7 +++
 blockdev.c|  5 +
 qemu-io-cmds.c|  7 +--
 5 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 6d42992985..3477290f9a 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -388,6 +388,8 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue 
*bs_queue,
 bool keep_old_opts);
 void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue);
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp);
+int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
+Error **errp);
 int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
   Error **errp);
 int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
diff --git a/block.c b/block.c
index ee9b46e95e..06fb69df5c 100644
--- a/block.c
+++ b/block.c
@@ -4124,19 +4124,26 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
  *
  * All affected nodes must be drained between bdrv_reopen_queue() and
  * bdrv_reopen_multiple().
+ *
+ * To be called from the main thread, with all other AioContexts unlocked.
  */
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp)
 {
 int ret = -1;
 BlockReopenQueueEntry *bs_entry, *next;
+AioContext *ctx;
 Transaction *tran = tran_new();
 g_autoptr(GHashTable) found = NULL;
 g_autoptr(GSList) refresh_list = NULL;
 
+assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 assert(bs_queue != NULL);
 
 QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 ret = bdrv_flush(bs_entry->state.bs);
+aio_context_release(ctx);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Error flushing drive");
 goto abort;
@@ -4145,7 +4152,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
 
 QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
 assert(bs_entry->state.bs->quiesce_counter > 0);
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 ret = bdrv_reopen_prepare(&bs_entry->state, bs_queue, tran, errp);
+aio_context_release(ctx);
 if (ret < 0) {
 goto abort;
 }
@@ -4188,7 +4198,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
  * to first element.
  */
 QTAILQ_FOREACH_REVERSE(bs_entry, bs_queue, entry) {
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 bdrv_reopen_commit(&bs_entry->state);
+aio_context_release(ctx);
 }
 
 tran_commit(tran);
@@ -4197,7 +4210,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
 BlockDriverState *bs = bs_entry->state.bs;
 
 if (bs->drv->bdrv_reopen_commit_post) {
+ctx = bdrv_get_aio_context(bs);
+aio_context_acquire(ctx);
 bs->drv->bdrv_reopen_commit_post(&bs_entry->state);
+aio_context_release(ctx);
 }
 }
 
@@ -4208,7 +4224,10 @@ abort:
 tran_abort(tran);
 QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
 if (bs_entry->prepared) {
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 bdrv_reopen_abort(&bs_entry->state);
+aio_context_release(ctx);
 }
 }
 
@@ -4218,23 +4237,39 @@ cleanup:
 return ret;
 }
 
-int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
-  Error **errp)
+int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
+Error **errp)
 {
-int ret;
+AioContext *ctx = bdrv_get_aio_context(bs);
 BlockReopenQueue *queue;
-QDict *opts = qdict_new();
-
-qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
+int ret;
 
 bdrv_subtree_drained_begin(bs);
-queue = bdrv_reopen_queue(NULL, bs, opts, true);
+if (ctx != qemu_get_aio_context()) {
+aio_context_release(ctx);
+}
+
+queue = bdrv_reopen_queue(NULL, bs, opts, keep_old_opts);
 ret = bdrv_reopen_multiple(queue, errp);
+
+if (ctx != qemu_get_aio_context()) {
+aio_context_acquire(c

[PATCH v6 4/6] block: Support multiple reopening with x-blockdev-reopen

2021-07-08 Thread Kevin Wolf
From: Alberto Garcia 

[ kwolf: Fixed AioContext locking ]

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block-core.json  | 18 +++--
 blockdev.c| 81 ++-
 tests/qemu-iotests/155|  9 ++-
 tests/qemu-iotests/165|  4 +-
 tests/qemu-iotests/245| 27 ---
 tests/qemu-iotests/248|  2 +-
 tests/qemu-iotests/248.out|  2 +-
 tests/qemu-iotests/296|  9 ++-
 tests/qemu-iotests/298|  4 +-
 .../tests/remove-bitmap-from-backing  | 18 +++--
 10 files changed, 99 insertions(+), 75 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 4a46552816..052520331e 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4221,13 +4221,15 @@
 ##
 # @x-blockdev-reopen:
 #
-# Reopens a block device using the given set of options. Any option
-# not specified will be reset to its default value regardless of its
-# previous status. If an option cannot be changed or a particular
+# Reopens one or more block devices using the given set of options.
+# Any option not specified will be reset to its default value regardless
+# of its previous status. If an option cannot be changed or a particular
 # driver does not support reopening then the command will return an
-# error.
+# error. All devices in the list are reopened in one transaction, so
+# if one of them fails then the whole transaction is cancelled.
 #
-# The top-level @node-name option (from BlockdevOptions) must be
+# The command receives a list of block devices to reopen. For each one
+# of them, the top-level @node-name option (from BlockdevOptions) must be
 # specified and is used to select the block device to be reopened.
 # Other @node-name options must be either omitted or set to the
 # current name of the appropriate node. This command won't change any
@@ -4247,8 +4249,8 @@
 #
 #  4) NULL: the current child (if any) is detached.
 #
-# Options (1) and (2) are supported in all cases, but at the moment
-# only @backing allows replacing or detaching an existing child.
+# Options (1) and (2) are supported in all cases. Option (3) is
+# supported for @file and @backing, and option (4) for @backing only.
 #
 # Unlike with blockdev-add, the @backing option must always be present
 # unless the node being reopened does not have a backing file and its
@@ -4258,7 +4260,7 @@
 # Since: 4.0
 ##
 { 'command': 'x-blockdev-reopen',
-  'data': 'BlockdevOptions', 'boxed': true }
+  'data': { 'options': ['BlockdevOptions'] } }
 
 ##
 # @blockdev-del:
diff --git a/blockdev.c b/blockdev.c
index f657d090d3..fddcccbdbd 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3560,51 +3560,60 @@ fail:
 visit_free(v);
 }
 
-void qmp_x_blockdev_reopen(BlockdevOptions *options, Error **errp)
-{
-BlockDriverState *bs;
-AioContext *ctx;
-QObject *obj;
-Visitor *v = qobject_output_visitor_new(&obj);
-BlockReopenQueue *queue;
-QDict *qdict;
+void qmp_x_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
+{
+BlockReopenQueue *queue = NULL;
+GSList *drained = NULL;
+
+/* Add each one of the BDS that we want to reopen to the queue */
+for (; reopen_list != NULL; reopen_list = reopen_list->next) {
+BlockdevOptions *options = reopen_list->value;
+BlockDriverState *bs;
+AioContext *ctx;
+QObject *obj;
+Visitor *v;
+QDict *qdict;
+
+/* Check for the selected node name */
+if (!options->has_node_name) {
+error_setg(errp, "node-name not specified");
+goto fail;
+}
 
-/* Check for the selected node name */
-if (!options->has_node_name) {
-error_setg(errp, "node-name not specified");
-goto fail;
-}
+bs = bdrv_find_node(options->node_name);
+if (!bs) {
+error_setg(errp, "Failed to find node with node-name='%s'",
+   options->node_name);
+goto fail;
+}
 
-bs = bdrv_find_node(options->node_name);
-if (!bs) {
-error_setg(errp, "Failed to find node with node-name='%s'",
-   options->node_name);
-goto fail;
-}
+/* Put all options in a QDict and flatten it */
+v = qobject_output_visitor_new(&obj);
+visit_type_BlockdevOptions(v, NULL, &options, &error_abort);
+visit_complete(v, &obj);
+visit_free(v);
 
-/* Put all options in a QDict and flatten it */
-visit_type_BlockdevOptions(v, NULL, &options, &error_abort);
-visit_complete(v, &obj);
-qdict = qobject_to(QDict, obj);
+qdict = qobject_to(QDict, obj);
 
-qdict_flatten(qdict);
+qdict_flatten(qdict);
 
-/* Perform the reopen operation */
-ctx = bdrv_get_aio_context(bs);
-aio_cont

[PATCH v6 2/6] block: Add bdrv_reopen_queue_free()

2021-07-08 Thread Kevin Wolf
From: Alberto Garcia 

Move the code to free a BlockReopenQueue to a separate function.
It will be used in a subsequent patch.

[ kwolf: Also free explicit_options and options, and explicitly
  qobject_ref() the value when it continues to be used. This makes
  future memory leaks less likely. ]

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block.h |  1 +
 block.c   | 22 --
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 7ec77ecb1a..6d42992985 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -386,6 +386,7 @@ BlockDriverState *bdrv_new_open_driver(BlockDriver *drv, 
const char *node_name,
 BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
 BlockDriverState *bs, QDict *options,
 bool keep_old_opts);
+void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue);
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp);
 int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
   Error **errp);
diff --git a/block.c b/block.c
index acd35cb0cb..ee9b46e95e 100644
--- a/block.c
+++ b/block.c
@@ -4095,6 +4095,19 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue 
*bs_queue,
NULL, 0, keep_old_opts);
 }
 
+void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
+{
+if (bs_queue) {
+BlockReopenQueueEntry *bs_entry, *next;
+QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
+qobject_unref(bs_entry->state.explicit_options);
+qobject_unref(bs_entry->state.options);
+g_free(bs_entry);
+}
+g_free(bs_queue);
+}
+}
+
 /*
  * Reopen multiple BlockDriverStates atomically & transactionally.
  *
@@ -4197,15 +4210,10 @@ abort:
 if (bs_entry->prepared) {
 bdrv_reopen_abort(&bs_entry->state);
 }
-qobject_unref(bs_entry->state.explicit_options);
-qobject_unref(bs_entry->state.options);
 }
 
 cleanup:
-QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
-g_free(bs_entry);
-}
-g_free(bs_queue);
+bdrv_reopen_queue_free(bs_queue);
 
 return ret;
 }
@@ -4573,6 +4581,8 @@ static void bdrv_reopen_commit(BDRVReopenState 
*reopen_state)
 /* set BDS specific flags now */
 qobject_unref(bs->explicit_options);
 qobject_unref(bs->options);
+qobject_ref(reopen_state->explicit_options);
+qobject_ref(reopen_state->options);
 
 bs->explicit_options   = reopen_state->explicit_options;
 bs->options= reopen_state->options;
-- 
2.31.1




[PATCH v6 0/6] Make blockdev-reopen stable

2021-07-08 Thread Kevin Wolf
This series picks up the remaining patches from Berto's series "[PATCH
v4 0/6] Allow changing bs->file on reopen", which are not merged into
master yet.

Apart from renaming 'x-blockdev-reopen' to 'blockdev-reopen', the
remaining functional change in this series is taking a list of nodes to
reopen as an argument, so that multiple changes to the graph that would
be invalid separately (e.g. due to permission checks on the intermediate
state) can be made atomically.

It also contains a qcow2 fix for a bug introduced by the part of the
series that was already picked up in Vladimir's "[PATCH v6 0/9] Allow
changing bs->file on reopen".

v6:
- Changed qcow2 fix to set s->data_file = NULL between .prepare and
  .commit instead of using a separate bool [Vladimir]
- Coding style fixes [Vladimir]

Alberto Garcia (4):
  block: Add bdrv_reopen_queue_free()
  block: Support multiple reopening with x-blockdev-reopen
  iotests: Test reopening multiple devices at the same time
  block: Make blockdev-reopen stable API

Kevin Wolf (2):
  qcow2: Fix dangling pointer after reopen for 'file'
  block: Acquire AioContexts during bdrv_reopen_multiple()

 qapi/block-core.json  | 24 +++---
 include/block/block.h |  3 +
 block.c   | 71 +
 block/qcow2.c | 29 +++
 block/replication.c   |  7 ++
 blockdev.c| 76 ++
 qemu-io-cmds.c|  7 +-
 tests/qemu-iotests/155|  9 ++-
 tests/qemu-iotests/165|  4 +-
 tests/qemu-iotests/245| 78 +++
 tests/qemu-iotests/245.out|  4 +-
 tests/qemu-iotests/248|  4 +-
 tests/qemu-iotests/248.out|  2 +-
 tests/qemu-iotests/296| 11 ++-
 tests/qemu-iotests/298|  4 +-
 .../tests/remove-bitmap-from-backing  | 22 +++---
 16 files changed, 255 insertions(+), 100 deletions(-)

-- 
2.31.1




Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex

2021-07-08 Thread Paolo Bonzini

On 08/07/21 12:36, Stefan Hajnoczi wrote:

What is very clear from this patch is that it
is strictly related to the bdrv_* and lower-level calls, because
they also internally check or even use the AioContext lock.
Therefore, in order to make it work, I temporarily added some
aio_context_acquire/release pairs around the functions that
still assert for it or assume it is held, and temporarily
unlock (unlock() - lock()).


Sounds like the issue is that this patch series assumes AioContext locks
are no longer required for calling the blk_*()/bdrv_*() APIs? That is
not the case yet, so you had to then add those aio_context_lock() calls
back in elsewhere. This approach introduces unnecessary risk. I think we
should wait until blk_*()/bdrv_*() no longer requires the caller to hold
the AioContext lock before applying this series.


In general I'm in favor of pushing the lock further down into smaller 
and smaller critical sections; it's a good approach to make further 
audits easier until it's "obvious" that the lock is unnecessary.  I 
haven't yet reviewed Emanuele's patches to see if this is what he's 
doing where he's adding the acquire/release calls, but that's my 
understanding of both his cover letter and your reply.


The I/O blk_*()/bdrv_*() *should* not require the caller to hold the 
AioContext lock; all drivers use their own CoMutex or QemuMutex when 
needed, and generic code should also be ready (caveat emptor).  Others, 
such as reopen, are a mess that requires a separate audit.  Restricting 
acquire/release to be only around those seems like a good starting point.


Paolo




Re: [RFC PATCH 4/6] job.h: categorize job fields

2021-07-08 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 06:58:11PM +0200, Emanuele Giuseppe Esposito wrote:
> -/** AioContext to run the job coroutine in */
> +/**
> + * AioContext to run the job coroutine in.
> + * Atomic.
> + */
>  AioContext *aio_context;

This isn't accessed using atomic operations, so I'm not sure why it's
documented as atomic?
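
(Illustration: what accessing the field atomically would look like with the
qatomic helpers, if that is really the intent; the accessor names below are
made up for the example.)

    /* Hypothetical accessors making the "Atomic" comment match the code. */
    static void job_set_aio_context_atomic(Job *job, AioContext *ctx)
    {
        qatomic_set(&job->aio_context, ctx);
    }

    static AioContext *job_get_aio_context_atomic(Job *job)
    {
        return qatomic_read(&job->aio_context);
    }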




Re: [RFC PATCH 3/6] job: minor changes to simplify locking

2021-07-08 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 06:58:10PM +0200, Emanuele Giuseppe Esposito wrote:
> @@ -406,15 +410,18 @@ void *job_create(const char *job_id, const JobDriver 
> *driver, JobTxn *txn,
>  error_setg(errp, "Invalid job ID '%s'", job_id);
>  return NULL;
>  }
> -if (job_get(job_id)) {
> -error_setg(errp, "Job ID '%s' already in use", job_id);
> -return NULL;
> -}
>  } else if (!(flags & JOB_INTERNAL)) {
>  error_setg(errp, "An explicit job ID is required");
>  return NULL;
>  }
>  
> +job_lock();
> +if (job_get(job_id)) {
> +error_setg(errp, "Job ID '%s' already in use", job_id);
> +job_unlock();
> +return NULL;
> +}
> +

Where is the matching job_unlock() in the success case? Please consider
lock guard macros like QEMU_LOCK_GUARD()/WITH_QEMU_LOCK_GUARD() to
prevent common errors.
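
(Illustration: the hunk above reworked with a lock guard, assuming job_mutex
remains a plain QemuMutex local to job.c; with WITH_QEMU_LOCK_GUARD() the
mutex is released on every exit path, including the early return.)

    } else if (!(flags & JOB_INTERNAL)) {
        error_setg(errp, "An explicit job ID is required");
        return NULL;
    }

    WITH_QEMU_LOCK_GUARD(&job_mutex) {
        if (job_get(job_id)) {
            error_setg(errp, "Job ID '%s' already in use", job_id);
            return NULL;    /* the guard drops job_mutex here as well */
        }
        /* ... whatever else needs to run with the lock held ... */
    }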




Re: [RFC PATCH 2/6] job: _locked functions and public job_lock/unlock for next patch

2021-07-08 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 06:58:09PM +0200, Emanuele Giuseppe Esposito wrote:
> diff --git a/job.c b/job.c
> index 872bbebb01..96fb8e9730 100644
> --- a/job.c
> +++ b/job.c
> @@ -32,6 +32,10 @@
>  #include "trace/trace-root.h"
>  #include "qapi/qapi-events-job.h"
>  
> +/* job_mutex protects the jobs list, but also the job operations. */
> +static QemuMutex job_mutex;

It's unclear what protecting "job operations" means. I would prefer a
fine-grained per-job lock that protects the job's fields instead of a
global lock with an unclear scope.
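
(Illustration: roughly what the suggested per-job lock would look like; the
field is hypothetical, and the global jobs list would still need its own lock.)

    struct Job {
        /* Protects this job's mutable fields (busy, paused, status, ...). */
        QemuMutex lock;

        /* ... existing fields ... */
    };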

> +
> +/* Protected by job_mutex */
>  static QLIST_HEAD(, Job) jobs = QLIST_HEAD_INITIALIZER(jobs);
>  
>  /* Job State Transition Table */
> @@ -64,27 +68,22 @@ bool JobVerbTable[JOB_VERB__MAX][JOB_STATUS__MAX] = {
>  /* Transactional group of jobs */
>  struct JobTxn {
>  
> -/* Is this txn being cancelled? */
> +/* Is this txn being cancelled? Atomic.*/
>  bool aborting;

The comment says atomic but this field is not accessed using atomic
operations (at least at this point in the patch series)?

>  
> -/* List of jobs */
> +/* List of jobs. Protected by job_mutex. */
>  QLIST_HEAD(, Job) jobs;
>  
> -/* Reference count */
> +/* Reference count. Atomic. */
>  int refcnt;

Same.




Re: [PATCH] MAINTAINERS: update block/rbd.c maintainer

2021-07-08 Thread Ilya Dryomov
On Wed, Jul 7, 2021 at 8:05 PM Peter Lieven  wrote:
>
> adding myself as a designated reviewer.
>
> Signed-off-by: Peter Lieven 
> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 516db737d1..cfda57e825 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3058,6 +3058,7 @@ F: block/vmdk.c
>
>  RBD
>  M: Ilya Dryomov 
> +R: Peter Lieven 
>  L: qemu-block@nongnu.org
>  S: Supported
>  F: block/rbd.c
> --
> 2.17.1
>
>

Nit: I would change the title to "MAINTAINERS: add block/rbd.c reviewer"
or something like that.

Acked-by: Ilya Dryomov 

Thanks again for volunteering!

Ilya



Re: [PATCH] block/rbd: fix type of task->complete

2021-07-08 Thread Ilya Dryomov
On Wed, Jul 7, 2021 at 8:05 PM Peter Lieven  wrote:
>
> task->complete is a bool not an integer.
>
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 01a7b94d62..dcf82b15b8 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1066,7 +1066,7 @@ static int qemu_rbd_resize(BlockDriverState *bs, 
> uint64_t size)
>  static void qemu_rbd_finish_bh(void *opaque)
>  {
>  RBDTask *task = opaque;
> -task->complete = 1;
> +task->complete = true;
>  aio_co_wake(task->co);
>  }
>
> --
> 2.17.1
>
>

Reviewed-by: Ilya Dryomov 

Thanks,

Ilya



Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex

2021-07-08 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 06:58:07PM +0200, Emanuele Giuseppe Esposito wrote:
> This is a continuation of the work to reduce (and possibly get rid of) the
> usage of the AioContext lock, by introducing smaller-granularity locks to
> maintain thread safety.
> 
> This series aims to:
> 1) remove the AioContext lock and substitute it with the already existing
>    global job_mutex
> 2) fix what looks like an oversight when moving the blockjob.c logic
>    into the more generic job.c: job_mutex was introduced specifically to
>    protect the job->busy flag, but it seems it was not used in subsequent
>    patches, because there are multiple code sections that directly
>    access the field without any locking.
> 3) use job_mutex instead of the AioContext lock
> 4) extend the reach of the job_mutex to protect all shared fields
>    of the job structure.
> 
> The reason why we propose to use the existing job_mutex, and not make one
> for each job, is to keep things as simple as possible for now, and because
> the jobs are not in the execution critical path, so we can afford
> some delays.
> Having a lock per job would increase overall complexity and
> increase the chances of deadlocks (one good example could be the job
> transactions, where multiple jobs are grouped together).
> Anyway, the per-job mutex can always be added in the future.
> 
> Patches 1-4 are in preparation for patch 5. They try to simplify and clarify
> the job_mutex usage. Patch 5 tries to add proper synchronization to the job
> structure, replacing the AioContext lock where necessary.
> Patch 6 removes AioContext locks that are now unnecessary.
> 
> 
> RFC: I am not sure the way I laid out the locks is ideal.
> But their usage should not cause deadlocks. I also made sure
> the series passes all qemu-iotests.
> 
> What is very clear from this patch is that it
> is strictly related to the bdrv_* and lower-level calls, because
> they also internally check or even use the AioContext lock.
> Therefore, in order to make it work, I temporarily added some
> aio_context_acquire/release pairs around the functions that
> still assert for it or assume it is held, and temporarily
> unlock (unlock() - lock()).

Sounds like the issue is that this patch series assumes AioContext locks
are no longer required for calling the blk_*()/bdrv_*() APIs? That is
not the case yet, so you had to then add those aio_context_lock() calls
back in elsewhere. This approach introduces unnecessary risk. I think we
should wait until blk_*()/bdrv_*() no longer requires the caller to hold
the AioContext lock before applying this series.

Stefan




Re: [PATCH] block/rbd: fix type of task->complete

2021-07-08 Thread Kevin Wolf
Am 07.07.2021 um 23:51 hat Connor Kuehl geschrieben:
> On 7/7/21 11:04 AM, Peter Lieven wrote:
> > task->complete is a bool not an integer.
> > 
> > Signed-off-by: Peter Lieven 
> > ---
> >  block/rbd.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/block/rbd.c b/block/rbd.c
> > index 01a7b94d62..dcf82b15b8 100644
> > --- a/block/rbd.c
> > +++ b/block/rbd.c
> > @@ -1066,7 +1066,7 @@ static int qemu_rbd_resize(BlockDriverState *bs, 
> > uint64_t size)
> >  static void qemu_rbd_finish_bh(void *opaque)
> >  {
> >  RBDTask *task = opaque;
> > -task->complete = 1;
> > +task->complete = true;
> >  aio_co_wake(task->co);
> >  }
> >  
> > 
> 
> Hi Peter,
> 
> What tree/QEMU git sha does this apply to? I am having trouble
> finding definitions for RBDTask and qemu_rbd_finish_bh; and the
> patch won't apply to my few-minutes-old clone of upstream.

It is on top of:

[PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support
Message-Id: <20210702172356.11574-1-idryo...@gmail.com>

Also, thanks, applied to the block branch.

Kevin




Re: [PATCH v3 3/2] qemu-img: Reword 'qemu-img map --output=json' docs

2021-07-08 Thread Vladimir Sementsov-Ogievskiy

07.07.2021 21:41, Eric Blake wrote:

Reword the paragraphs to list the JSON key first, rather than in the
middle of prose.

Suggested-by: Vladimir Sementsov-Ogievskiy
Signed-off-by: Eric Blake


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir