Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging
On Jul 9 08:55, Klaus Jensen wrote: On Jul 9 08:16, Hannes Reinecke wrote: On 7/9/21 8:05 AM, Klaus Jensen wrote: On Jul 7 17:49, Klaus Jensen wrote: From: Klaus Jensen Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We discussed a bit back and forth and I mentioned that the core issue was an artifact of the parent/child relationship stemming from the qdev setup we have with namespaces attaching to controllers through a qdev bus. The gist of this series is the fourth patch "hw/nvme: fix controller hot unplugging" which basically causes namespaces to be reassigned to a bus owned by the subsystem if the parent controller is linked to one. This fixes `device_del/add nvme` in such settings. Note that in the case that there is no subsystem involved, nvme devices can be removed from the system with `device_del`, but this *will* cause the namespaces to be removed as well since there is no place (i.e. no subsystem) for them to "linger". And since this series does not add support for hotplugging nvme-ns devices, while an nvme device can be re-added, no namespaces can. Support for hotplugging nvme-ns devices is present in [1], but I'd rather not add that since I think '-device nvme-ns' is already a bad design choice. Now, I do realize that it is not "pretty" to explicitly change the parent bus, so I do have an RFC patch in queue that replaces the subsystem and namespace devices with objects, but keeps -device shims available for backwards compatibility. This approach will solve the problems properly and should be a better model. However, I don't believe it will make it for 6.1 and I'd really like to at least fix the unplugging for 6.1, and this gets the job done. 
[1]: 20210511073511.32511-1-h...@suse.de v2: - added R-b's by Hannes for patches 1 through 3 - simplified "hw/nvme: fix controller hot unplugging" Klaus Jensen (4): hw/nvme: remove NvmeCtrl parameter from ns setup/check functions hw/nvme: mark nvme-subsys non-hotpluggable hw/nvme: unregister controller with subsystem at exit hw/nvme: fix controller hot unplugging hw/nvme/nvme.h | 18 +--- hw/nvme/ctrl.c | 14 ++-- hw/nvme/ns.c | 55 +++- hw/nvme/subsys.c | 9 4 files changed, 63 insertions(+), 33 deletions(-) -- 2.32.0 Applied patches 1 through 3 to nvme-next. So, how do we go about with patch 4? Without it this whole exercise is a bit pointless, seeing that it doesn't fix anything. Patches 1-3 are fixes we need anyway, so I thought I might as well apply them :) Shall we go with that patch as an interim solution? Will you replace it with your 'object' patch? What is the plan? Yes, if acceptable, I would like to use patch 4 as an interim solution. We have a bug we need to fix for 6.1, and I believe this does the job. I considered changing the existing nvme-bus to be on the main system bus, but then we break the existing behavior that the namespaces attach to the most recently defined controller in the absence of the shared parameter or an explicit bus parameter. Wrt. "the plan", right now, I see two solutions going forward: 1. Introduce new -object's for nvme-nvm-subsystem and nvme-ns This is the approach that I am taking right now and it works well. It allows many-to-many relationships and separates the life times of subsystems, namespaces and controllers like you mentioned. Conceptually, I also really like that the subsystem and namespace are not "devices". One could argue that the namespace is comparable to a SCSI LUN (-device scsi-hd, right?), but where the SCSI LUN actually "shows up" in the host, the nvme namespace does not. 
My series handles backwards compatibility by keeping -device "shims" around that just wrap the new objects but behave like they used to. The plan would be to deprecate these devices. The downside to this approach is that it moves the subsystem and namespaces out of the "qdev tree (info qtree)" and into the pure QOM "/objects" tree. Instead of qtree, we can have QMP and HMP commands for introspection. 2. Make the subsystem a "system bus device" This way we add an "nvme-nvm-subsystems" bus as a direct child of the main system bus, and we can possibly get rid of the explicit -device nvme-subsys as well. We change the namespace device to plug into that instead. The nvme controller device still needs to plug into the PCI bus, so it cannot be a child of the subsystems bus, but can keep using a link parameter to hook into the subsystem and attach to any namespaces it would like. I'm unsure if we can do this without deprecating the existing namespace device, just like option 1. I have not implemented this, so I need to look more into it. It seems like the main thing that this gives us compared to 1) is `info qtree` support and we still end up just "wiring" namespace attachment with backlinks anyway.
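The hot-unplug behavior discussed in this thread — namespaces moving to a subsystem-owned bus when a linked controller is removed, or disappearing when there is no subsystem — can be sketched as a tiny model. This is illustrative Python only, not QEMU code; the class and function names are invented for the example:

```python
class Subsystem:
    """Stand-in for nvme-subsys: owns a bus where namespaces can linger."""
    def __init__(self):
        self.bus = []

class Controller:
    """Stand-in for the nvme device: namespaces attach via its qdev bus,
    and it may be linked to a subsystem."""
    def __init__(self, subsystem=None):
        self.subsystem = subsystem
        self.bus = []

def attach(ctrl, ns):
    ctrl.bus.append(ns)

def device_del(ctrl):
    """Model of unplugging a controller: if it is linked to a subsystem,
    its namespaces are reassigned to the subsystem's bus; otherwise there
    is no place for them to linger and they are removed with it."""
    if ctrl.subsystem is not None:
        ctrl.subsystem.bus.extend(ctrl.bus)
    ctrl.bus = []

subsys = Subsystem()
ctrl = Controller(subsystem=subsys)
attach(ctrl, "nsid-1")
device_del(ctrl)
print(subsys.bus)  # ['nsid-1'] -- the namespace survives the unplug
```

A controller created without a subsystem link loses its namespaces on `device_del`, matching the caveat in the cover letter.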
Re: [PATCH] tests/qtest/nvme-test: add persistent memory region test
On Jun 18 16:04, Gollu Appalanaidu wrote:

This will test the PMR functionality.

Signed-off-by: Gollu Appalanaidu
---
 tests/qtest/nvme-test.c | 78 -
 1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/nvme-test.c b/tests/qtest/nvme-test.c
index d32c953a38..6d557be6ca 100644
--- a/tests/qtest/nvme-test.c
+++ b/tests/qtest/nvme-test.c
@@ -13,6 +13,7 @@
 #include "libqos/libqtest.h"
 #include "libqos/qgraph.h"
 #include "libqos/pci.h"
+#include "include/block/nvme.h"

 typedef struct QNvme QNvme;

@@ -21,6 +22,9 @@ struct QNvme {
     QPCIDevice dev;
 };

+static char *t_path;
+#define TEST_IMAGE_SIZE (2 * 1024 * 1024)
+
 static void *nvme_get_driver(void *obj, const char *interface)
 {
     QNvme *nvme = obj;
@@ -66,12 +70,77 @@ static void nvmetest_oob_cmb_test(void *obj, void *data, QGuestAllocator *alloc)
     g_assert_cmpint(qpci_io_readl(pdev, bar, cmb_bar_size - 1), !=, 0x44332211);
 }

+static void nvmetest_pmr_reg_test(void *obj, void *data, QGuestAllocator *alloc)
+{
+    QNvme *nvme = obj;
+    QPCIDevice *pdev = &nvme->dev;
+    QPCIBar pmr_bar, nvme_bar;
+    uint32_t pmrcap, pmrsts;
+
+    qpci_device_enable(pdev);
+    pmr_bar = qpci_iomap(pdev, 4, NULL);
+
+    /* Without enabling PMRCTL, check BAR enablement */
+    qpci_io_writel(pdev, pmr_bar, 0, 0xccbbaa99);
+    g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), !=, 0x99);
+    g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), !=, 0xaa99);
+
+    /* Map the NVMe BAR registers to enable the memory region */
+    nvme_bar = qpci_iomap(pdev, 0, NULL);
+
+    pmrcap = qpci_io_readl(pdev, nvme_bar, 0xe00);
+    g_assert_cmpint(NVME_PMRCAP_RDS(pmrcap), ==, 0x1);
+    g_assert_cmpint(NVME_PMRCAP_WDS(pmrcap), ==, 0x1);
+    g_assert_cmpint(NVME_PMRCAP_BIR(pmrcap), ==, 0x4);
+    g_assert_cmpint(NVME_PMRCAP_PMRWBM(pmrcap), ==, 0x2);
+    g_assert_cmpint(NVME_PMRCAP_CMSS(pmrcap), ==, 0x1);
+
+    /* Enable PMRCTL */
+    qpci_io_writel(pdev, nvme_bar, 0xe04, 0x1);
+
+    qpci_io_writel(pdev, pmr_bar, 0, 0x44332211);
+    g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), ==, 0x11);
+    g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), ==, 0x2211);
+    g_assert_cmpint(qpci_io_readl(pdev, pmr_bar, 0), ==, 0x44332211);
+
+    pmrsts = qpci_io_readl(pdev, nvme_bar, 0xe08);
+    g_assert_cmpint(NVME_PMRSTS_NRDY(pmrsts), ==, 0x0);
+
+    /* Disable PMRCTL */
+    qpci_io_writel(pdev, nvme_bar, 0xe04, 0x0);
+
+    qpci_io_writel(pdev, pmr_bar, 0, 0x88776655);
+    g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), !=, 0x55);
+    g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), !=, 0x6655);
+    g_assert_cmpint(qpci_io_readl(pdev, pmr_bar, 0), !=, 0x88776655);
+
+    pmrsts = qpci_io_readl(pdev, nvme_bar, 0xe08);
+    g_assert_cmpint(NVME_PMRSTS_NRDY(pmrsts), ==, 0x1);
+
+    qpci_iounmap(pdev, nvme_bar);
+    qpci_iounmap(pdev, pmr_bar);
+}
+
 static void nvme_register_nodes(void)
 {
+    int fd, ret;
+    t_path = g_strdup("/tmp/qtest.XX");
+
+    /* Create a temporary raw image */
+    fd = mkstemp(t_path);
+    g_assert(fd >= 0);
+    ret = ftruncate(fd, TEST_IMAGE_SIZE);
+    g_assert(ret == 0);
+    close(fd);
+
+    char *pmr_cmd_line = g_strdup_printf("-object memory-backend-file,id=pmr0,"
+                                         "share=on,mem-path=%s,size=8", t_path);
+
     QOSGraphEdgeOptions opts = {
         .extra_device_opts = "addr=04.0,drive=drv0,serial=foo",
         .before_cmd_line = "-drive id=drv0,if=none,file=null-co://,"
-                           "file.read-zeroes=on,format=raw",
+                           "file.read-zeroes=on,format=raw ",
+                           pmr_cmd_line,
     };

     add_qpci_address(&opts, &(QPCIAddress) { .devfn = QPCI_DEVFN(4, 0) });
@@ -83,6 +152,13 @@ static void nvme_register_nodes(void)
     qos_add_test("oob-cmb-access", "nvme", nvmetest_oob_cmb_test,
                  &(QOSGraphTestOptions) { .edge.extra_device_opts = "cmb_size_mb=2" });
+
+    qos_add_test("pmr-test-access", "nvme", nvmetest_pmr_reg_test,
+                 &(QOSGraphTestOptions) {
+        .edge.extra_device_opts = "pmrdev=pmr0"
+    });
+
+    unlink(t_path);
 }

 libqos_init(nvme_register_nodes);
-- 
2.17.1

Applied to nvme-next. I swapped the memory-backend-file with a memory-backend-ram so we don't need to set up an actual file.
Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging
On Jul 9 08:16, Hannes Reinecke wrote: On 7/9/21 8:05 AM, Klaus Jensen wrote: On Jul 7 17:49, Klaus Jensen wrote: From: Klaus Jensen Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We discussed a bit back and forth and I mentioned that the core issue was an artifact of the parent/child relationship stemming from the qdev setup we have with namespaces attaching to controllers through a qdev bus. The gist of this series is the fourth patch "hw/nvme: fix controller hot unplugging" which basically causes namespaces to be reassigned to a bus owned by the subsystem if the parent controller is linked to one. This fixes `device_del/add nvme` in such settings. Note that in the case that there is no subsystem involved, nvme devices can be removed from the system with `device_del`, but this *will* cause the namespaces to be removed as well since there is no place (i.e. no subsystem) for them to "linger". And since this series does not add support for hotplugging nvme-ns devices, while an nvme device can be re-added, no namespaces can. Support for hotplugging nvme-ns devices is present in [1], but I'd rather not add that since I think '-device nvme-ns' is already a bad design choice. Now, I do realize that it is not "pretty" to explicitly change the parent bus, so I do have an RFC patch in queue that replaces the subsystem and namespace devices with objects, but keeps -device shims available for backwards compatibility. This approach will solve the problems properly and should be a better model. However, I don't believe it will make it for 6.1 and I'd really like to at least fix the unplugging for 6.1, and this gets the job done. 
[1]: 20210511073511.32511-1-h...@suse.de v2: - added R-b's by Hannes for patches 1 through 3 - simplified "hw/nvme: fix controller hot unplugging" Klaus Jensen (4): hw/nvme: remove NvmeCtrl parameter from ns setup/check functions hw/nvme: mark nvme-subsys non-hotpluggable hw/nvme: unregister controller with subsystem at exit hw/nvme: fix controller hot unplugging hw/nvme/nvme.h | 18 +--- hw/nvme/ctrl.c | 14 ++-- hw/nvme/ns.c | 55 +++- hw/nvme/subsys.c | 9 4 files changed, 63 insertions(+), 33 deletions(-) -- 2.32.0 Applied patches 1 through 3 to nvme-next. So, how do we go about with patch 4? Without it this whole exercise is a bit pointless, seeing that it doesn't fix anything. Patches 1-3 are fixes we need anyway, so I thought I might as well apply them :) Shall we go with that patch as an interim solution? Will you replace it with your 'object' patch? What is the plan? Yes, if acceptable, I would like to use patch 4 as an interim solution. We have a bug we need to fix for 6.1, and I believe this does the job. I considered changing the existing nvme-bus to be on the main system bus, but then we break the existing behavior that the namespaces attach to the most recently defined controller in the absence of the shared parameter or an explicit bus parameter. Wrt. "the plan", right now, I see two solutions going forward: 1. Introduce new -object's for nvme-nvm-subsystem and nvme-ns This is the approach that I am taking right now and it works well. It allows many-to-many relationships and separates the life times of subsystems, namespaces and controllers like you mentioned. Conceptually, I also really like that the subsystem and namespace are not "devices". One could argue that the namespace is comparable to a SCSI LUN (-device scsi-hd, right?), but where the SCSI LUN actually "shows up" in the host, the nvme namespace does not. My series handles backwards compatibility by keeping -device "shims" around that just wraps the new objects but behaves like it used to. 
The plan would be to deprecate these devices. The downside to this approach is that it moves the subsystem and namespaces out of the "qdev tree (info qtree)" and into the pure QOM "/objects" tree. Instead of qtree, we can have QMP and HMP commands for introspection. 2. Make the subsystem a "system bus device" This way we add an "nvme-nvm-subsystems" bus as a direct child of the main system bus, and we can possibly get rid of the explicit -device nvme-subsys as well. We change the namespace device to plug into that instead. The nvme controller device still needs to plug into the PCI bus, so it cannot be a child of the subsystems bus, but can keep using a link parameter to hook into the subsystem and attach to any namespaces it would like. I'm unsure if we can do this without deprecating the existing namespace device, just like option 1. I have not implemented this, so I need to look more into it. It seems like the main thing that this gives us compared to 1) is `info qtree` support and we still end up just "wiring" namespace attachment with backlinks anyway. I'm not sure what I wou
Re: [PATCH 2/2] qemu-img: Add --skip-broken for 'convert --bitmaps'
08.07.2021 04:30, Eric Blake wrote: The point of 'qemu-img convert --bitmaps' is to be a convenience for actions that are already possible through a string of smaller 'qemu-img bitmap' sub-commands. One situation not accounted for already is that if a source image contains an inconsistent bitmap (for example, because a qemu process died abruptly before flushing bitmap state), the user MUST delete those inconsistent bitmaps before anything else useful can be done with the image. We don't want to delete inconsistent bitmaps by default: although a corrupt bitmap is only a loss of optimization rather than a corruption of user-visible data, it is still nice to require the user to opt in to the fact that they are aware of the loss of the bitmap. Still, requiring the user to check 'qemu-img info' to see whether bitmaps are consistent, then use 'qemu-img bitmap --remove' to remove offenders, all before using 'qemu-img convert', is a lot more work than just adding a knob 'qemu-img convert --bitmaps --skip-broken' which opts in to skipping the broken bitmaps. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1946084 Signed-off-by: Eric Blake --- docs/tools/qemu-img.rst | 8 +++- qemu-img.c| 20 +-- tests/qemu-iotests/tests/qemu-img-bitmaps | 4 tests/qemu-iotests/tests/qemu-img-bitmaps.out | 14 + 4 files changed, 43 insertions(+), 3 deletions(-) diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst index 1d8470eada0e..5cf1c764597b 100644 --- a/docs/tools/qemu-img.rst +++ b/docs/tools/qemu-img.rst @@ -414,7 +414,7 @@ Command description: 4 Error on reading data -.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [--target-is-zero] [--bitmaps] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME +.. 
option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [--target-is-zero] [--bitmaps [--skip-broken]] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME Of course, [--bitmaps [--skip-broken]] looks like --skip-broken is a suboption.. But actually it's not so. So, shouldn't it be named more explicit, like --skip-broken-bitmaps ? To be sure that we will not interfere in future with some other broken things we want to skip? And to avoid strange but correct command lines like qemu-img convert --skip-broken --bitmaps src dst Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM* to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can @@ -456,6 +456,12 @@ Command description: *NUM_COROUTINES* specifies how many coroutines work in parallel during the convert process (defaults to 8). + Use of ``--bitmaps`` requests that any persistent bitmaps present in + the original are also copied to the destination. If any bitmap is + inconsistent in the source, the conversion will fail unless + ``--skip-broken`` is also specified to copy only the consistent + bitmaps. + .. 
option:: create [--object OBJECTDEF] [-q] [-f FMT] [-b BACKING_FILE] [-F BACKING_FMT] [-u] [-o OPTIONS] FILENAME [SIZE] Create the new disk image *FILENAME* of size *SIZE* and format diff --git a/qemu-img.c b/qemu-img.c index 68a4d298098f..e8b012f39c0c 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -82,6 +82,7 @@ enum { OPTION_MERGE = 274, OPTION_BITMAPS = 275, OPTION_FORCE = 276, +OPTION_SKIP_BROKEN = 277, }; typedef enum OutputFormat { @@ -2101,7 +2102,8 @@ static int convert_do_copy(ImgConvertState *s) return s->ret; } -static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst) +static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst, +bool skip_broken) { BdrvDirtyBitmap *bm; Error *err = NULL; @@ -2113,6 +2115,10 @@ static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst) continue; } name = bdrv_dirty_bitmap_name(bm); +if (skip_broken && bdrv_dirty_bitmap_inconsistent(bm)) { +warn_report("Skipping inconsistent bitmap %s", name); +continue; +} qmp_block_dirty_bitmap_add(dst->node_name, name, true, bdrv_dirty_bitmap_granularity(bm), true, true, @@ -2167,6 +2173,7 @@ static int img_convert(int argc, char **argv) bool force_share = false; bool explict_min_sparse = false; bool bitmaps = false; +bool skip_broken = false; int64_t rate_limit = 0;
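The skip logic added to convert_copy_bitmaps() above — copy each persistent bitmap, but warn and skip an inconsistent one only when the new flag is given, failing otherwise — can be modelled compactly. This is an illustrative Python sketch, not the qemu-img implementation; the tuple-based bitmap representation is invented for the example:

```python
def copy_bitmaps(bitmaps, skip_broken=False):
    """Return the names of bitmaps that would be copied to the destination.

    Mirrors the patch's behavior: non-persistent bitmaps are ignored,
    and an inconsistent bitmap either aborts the conversion or, with
    skip_broken, is skipped with a warning.
    """
    copied = []
    for name, persistent, inconsistent in bitmaps:
        if not persistent:
            continue  # only persistent bitmaps are carried over
        if inconsistent:
            if skip_broken:
                print(f"warning: Skipping inconsistent bitmap {name}")
                continue
            raise ValueError(f"Bitmap '{name}' is inconsistent and cannot be used")
        copied.append(name)
    return copied

# b0 is left in-use (inconsistent) by an aborted qemu process
bitmaps = [("b0", True, True), ("b2", True, False), ("b4", True, False)]
print(copy_bitmaps(bitmaps, skip_broken=True))  # ['b2', 'b4']
```

Without `skip_broken`, the same input raises, matching the conversion failure described in the documentation hunk.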
Re: [PATCH 1/2] iotests: Improve and rename test 291 to qemu-img-bitmap
08.07.2021 04:30, Eric Blake wrote: Enhance the test to demonstrate behavior of qemu-img with a qcow2 image containing an inconsistent bitmap, and rename it now that we support useful iotest names. While at it, fix a missing newline in the error message thus exposed. Signed-off-by: Eric Blake --- block/dirty-bitmap.c | 2 +- .../{291 => tests/qemu-img-bitmaps} | 13 +++- .../{291.out => tests/qemu-img-bitmaps.out} | 32 ++- 3 files changed, 44 insertions(+), 3 deletions(-) rename tests/qemu-iotests/{291 => tests/qemu-img-bitmaps} (92%) rename tests/qemu-iotests/{291.out => tests/qemu-img-bitmaps.out} (82%) diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 68d295d6e3ed..0ef46163e3ea 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -193,7 +193,7 @@ int bdrv_dirty_bitmap_check(const BdrvDirtyBitmap *bitmap, uint32_t flags, error_setg(errp, "Bitmap '%s' is inconsistent and cannot be used", bitmap->name); error_append_hint(errp, "Try block-dirty-bitmap-remove to delete" - " this bitmap from disk"); + " this bitmap from disk\n"); return -1; } diff --git a/tests/qemu-iotests/291 b/tests/qemu-iotests/tests/qemu-img-bitmaps similarity index 92% rename from tests/qemu-iotests/291 rename to tests/qemu-iotests/tests/qemu-img-bitmaps index 20efb080a6c0..76cd9e31e850 100755 --- a/tests/qemu-iotests/291 +++ b/tests/qemu-iotests/tests/qemu-img-bitmaps @@ -3,7 +3,7 @@ # # Test qemu-img bitmap handling # -# Copyright (C) 2018-2020 Red Hat, Inc. +# Copyright (C) 2018-2021 Red Hat, Inc. # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by @@ -32,6 +32,7 @@ _cleanup() trap "_cleanup; exit \$status" 0 1 2 3 15 # get standard environment, filters and checks +cd .. . ./common.rc . ./common.filter . 
./common.nbd @@ -129,6 +130,16 @@ $QEMU_IMG map --output=json --image-opts \ nbd_server_stop +echo +echo "=== Check handling of inconsistent bitmap ===" +echo + +$QEMU_IO -c abort "$TEST_IMG" 2>/dev/null +$QEMU_IMG bitmap --add "$TEST_IMG" b4 +$QEMU_IMG bitmap --remove "$TEST_IMG" b1 +_img_info --format-specific | _filter_irrelevant_img_info +$QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG" "$TEST_IMG.copy" Worth then removing remaining inconsistent bitmaps and try again? I think you should now remove $TEST_IMG.copy in _cleanup with squashed in --- a/tests/qemu-iotests/tests/qemu-img-bitmaps +++ b/tests/qemu-iotests/tests/qemu-img-bitmaps @@ -27,6 +27,7 @@ status=1 # failure is the default! _cleanup() { _cleanup_test_img +_rm_test_img "$TEST_IMG.copy" nbd_server_stop } trap "_cl Tested-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Vladimir Sementsov-Ogievskiy + # success, all done echo '*** done' rm -f $seq.full diff --git a/tests/qemu-iotests/291.out b/tests/qemu-iotests/tests/qemu-img-bitmaps.out similarity index 82% rename from tests/qemu-iotests/291.out rename to tests/qemu-iotests/tests/qemu-img-bitmaps.out index 018d6b103f87..17b34eaed30f 100644 --- a/tests/qemu-iotests/291.out +++ b/tests/qemu-iotests/tests/qemu-img-bitmaps.out @@ -1,4 +1,4 @@ -QA output created by 291 +QA output created by qemu-img-bitmaps === Initial image setup === @@ -115,4 +115,34 @@ Format specific information: [{ "start": 0, "length": 2097152, "depth": 0, "present": true, "zero": false, "data": true, "offset": OFFSET}, { "start": 2097152, "length": 1048576, "depth": 0, "present": false, "zero": false, "data": false}, { "start": 3145728, "length": 7340032, "depth": 0, "present": true, "zero": false, "data": true, "offset": OFFSET}] + +=== Check handling of inconsistent bitmap === + +image: TEST_DIR/t.IMGFMT +file format: IMGFMT +virtual size: 10 MiB (10485760 bytes) +cluster_size: 65536 +backing file: TEST_DIR/t.IMGFMT.base +backing file format: IMGFMT +Format specific information: 
+bitmaps: +[0]: +flags: +[0]: in-use +[1]: auto +name: b2 +granularity: 65536 +[1]: +flags: +[0]: in-use +name: b0 +granularity: 65536 +[2]: +flags: +[0]: auto +name: b4 +granularity: 65536 +corrupt: false +qemu-img: Failed to populate bitmap b0: Bitmap 'b0' is inconsistent and cannot be used +Try block-dirty-bitmap-remove to delete this bitmap from disk *** done -- Best regards, Vladimir
Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging
On 7/9/21 8:05 AM, Klaus Jensen wrote: On Jul 7 17:49, Klaus Jensen wrote: From: Klaus Jensen Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We discussed a bit back and forth and I mentioned that the core issue was an artifact of the parent/child relationship stemming from the qdev setup we have with namespaces attaching to controllers through a qdev bus. The gist of this series is the fourth patch "hw/nvme: fix controller hot unplugging" which basically causes namespaces to be reassigned to a bus owned by the subsystem if the parent controller is linked to one. This fixes `device_del/add nvme` in such settings. Note that in the case that there is no subsystem involved, nvme devices can be removed from the system with `device_del`, but this *will* cause the namespaces to be removed as well since there is no place (i.e. no subsystem) for them to "linger". And since this series does not add support for hotplugging nvme-ns devices, while an nvme device can be re-added, no namespaces can. Support for hotplugging nvme-ns devices is present in [1], but I'd rather not add that since I think '-device nvme-ns' is already a bad design choice. Now, I do realize that it is not "pretty" to explicitly change the parent bus, so I do have an RFC patch in queue that replaces the subsystem and namespace devices with objects, but keeps -device shims available for backwards compatibility. This approach will solve the problems properly and should be a better model. However, I don't believe it will make it for 6.1 and I'd really like to at least fix the unplugging for 6.1, and this gets the job done. 
[1]: 20210511073511.32511-1-h...@suse.de v2: - added R-b's by Hannes for patches 1 through 3 - simplified "hw/nvme: fix controller hot unplugging" Klaus Jensen (4): hw/nvme: remove NvmeCtrl parameter from ns setup/check functions hw/nvme: mark nvme-subsys non-hotpluggable hw/nvme: unregister controller with subsystem at exit hw/nvme: fix controller hot unplugging hw/nvme/nvme.h | 18 +--- hw/nvme/ctrl.c | 14 ++-- hw/nvme/ns.c | 55 +++- hw/nvme/subsys.c | 9 4 files changed, 63 insertions(+), 33 deletions(-) -- 2.32.0 Applied patches 1 through 3 to nvme-next. So, how do we go about with patch 4? Without it this whole exercise is a bit pointless, seeing that it doesn't fix anything. Shall we go with that patch as an interim solution? Will you replace it with your 'object' patch? What is the plan? Cheers, Hannes -- Dr. Hannes Reinecke    Kernel Storage Architect h...@suse.de    +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Re: [PATCH]: /hw/nvme/ctrl error handling if descriptors are greater than 1024
On Jul 6 16:13, Padmakar Kalghatgi wrote:

From: padmakar

If the number of descriptors or pages is more than 1024, DMA writes or reads will result in failure. Hence, we check if the number of descriptors or pages is more than 1024 in the nvme module and return an Internal Device Error.

Signed-off-by: Padmakar Kalghatgi
---
 hw/nvme/ctrl.c       | 14 ++++++++++++++
 hw/nvme/trace-events |  1 +
 2 files changed, 15 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 40a7efc..082592f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -602,6 +602,20 @@ static uint16_t nvme_map_addr(NvmeCtrl *n, NvmeSg *sg, hwaddr addr, size_t len)
         return NVME_SUCCESS;
     }

+    /*
+     * QEMU has an inherent issue wherein, if the number of descriptors
+     * is more than 1024, the DMA write or read will fail. Hence, we
+     * need to return the error.
+     */
+    if (((sg->flags & NVME_SG_DMA) && ((sg->qsg.nsg + 1) > IOV_MAX)) ||
+        ((sg->iov.niov + 1) > IOV_MAX)) {
+        NVME_GUEST_ERR(pci_nvme_ub_sg_desc_toohigh,
+                       "number of descriptors is greater than 1024");
+        return NVME_INTERNAL_DEV_ERROR;
+    }
+
     trace_pci_nvme_map_addr(addr, len);

     if (nvme_addr_is_cmb(n, addr)) {
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index ea33d0c..bfe1a3b 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -202,3 +202,4 @@ pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion qu
 pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
 pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
 pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
+pci_nvme_ub_sg_desc_toohigh(void) "the number of sg descriptors is too high"
-- 
2.7.0.windows.1

Applied to nvme-next, but made the error message more generic (this also applies to PRPs).
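The bounds check added by this patch amounts to refusing to grow a scatter/gather list past IOV_MAX (1024) entries before appending a descriptor. The following is an illustrative Python model of that arithmetic only, not QEMU's C code; the function and return names merely echo the patch:

```python
IOV_MAX = 1024

def map_addr(sg, addr, length):
    """Append one descriptor to the list, failing the same way the patch
    does when the list would exceed IOV_MAX entries."""
    if len(sg) + 1 > IOV_MAX:
        return "NVME_INTERNAL_DEV_ERROR"
    sg.append((addr, length))
    return "NVME_SUCCESS"

sg = []
for i in range(IOV_MAX):
    assert map_addr(sg, i * 4096, 4096) == "NVME_SUCCESS"

# the 1025th descriptor is rejected instead of failing later during DMA
print(map_addr(sg, 0, 4096))  # NVME_INTERNAL_DEV_ERROR
```

Note the `+ 1` mirrors the patch's `(nsg + 1) > IOV_MAX` test: the check runs before the descriptor is added, so exactly IOV_MAX entries are still allowed.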
Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging
On Jul 7 17:49, Klaus Jensen wrote: From: Klaus Jensen Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We discussed a bit back and forth and I mentioned that the core issue was an artifact of the parent/child relationship stemming from the qdev setup we have with namespaces attaching to controllers through a qdev bus. The gist of this series is the fourth patch "hw/nvme: fix controller hot unplugging" which basically causes namespaces to be reassigned to a bus owned by the subsystem if the parent controller is linked to one. This fixes `device_del/add nvme` in such settings. Note that in the case that there is no subsystem involved, nvme devices can be removed from the system with `device_del`, but this *will* cause the namespaces to be removed as well since there is no place (i.e. no subsystem) for them to "linger". And since this series does not add support for hotplugging nvme-ns devices, while an nvme device can be re-added, no namespaces can. Support for hotplugging nvme-ns devices is present in [1], but I'd rather not add that since I think '-device nvme-ns' is already a bad design choice. Now, I do realize that it is not "pretty" to explicitly change the parent bus, so I do have an RFC patch in queue that replaces the subsystem and namespace devices with objects, but keeps -device shims available for backwards compatibility. This approach will solve the problems properly and should be a better model. However, I don't believe it will make it for 6.1 and I'd really like to at least fix the unplugging for 6.1, and this gets the job done. 
[1]: 20210511073511.32511-1-h...@suse.de v2: - added R-b's by Hannes for patches 1 through 3 - simplified "hw/nvme: fix controller hot unplugging" Klaus Jensen (4): hw/nvme: remove NvmeCtrl parameter from ns setup/check functions hw/nvme: mark nvme-subsys non-hotpluggable hw/nvme: unregister controller with subsystem at exit hw/nvme: fix controller hot unplugging hw/nvme/nvme.h | 18 +--- hw/nvme/ctrl.c | 14 ++-- hw/nvme/ns.c | 55 +++- hw/nvme/subsys.c | 9 4 files changed, 63 insertions(+), 33 deletions(-) -- 2.32.0 Applied patches 1 through 3 to nvme-next.
Re: [PATCH] vhost-user: Fix backends without multiqueue support
On Mon, Jul 05, 2021 at 07:14:29PM +0200, Kevin Wolf wrote:
> dev->max_queues was never initialised for backends that don't support
> VHOST_USER_PROTOCOL_F_MQ, so it would use 0 as the maximum number of
> queues to check against and consequently fail for any such backend.
>
> Set it to 1 if the backend doesn't have multiqueue support.
>
> Fixes: c90bd505a3e8210c23d69fecab9ee6f56ec4a161
> Signed-off-by: Kevin Wolf

Reviewed-by: Raphael Norwitz

> ---
>  hw/virtio/vhost-user.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 1ac4a2ebec..29ea2b4fce 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -1913,7 +1913,10 @@ static int vhost_user_backend_init(struct vhost_dev *dev, void *opaque,
>          if (err < 0) {
>              return -EPROTO;
>          }
> +    } else {
> +        dev->max_queues = 1;
>      }
> +
>      if (dev->num_queues && dev->max_queues < dev->num_queues) {
>          error_setg(errp, "The maximum number of queues supported by the "
>                     "backend is %" PRIu64, dev->max_queues);
> -- 
> 2.31.1
>
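The fix boils down to giving max_queues a sane default when the backend does not negotiate VHOST_USER_PROTOCOL_F_MQ, so the later queue-count check compares against 1 instead of 0. Here is an illustrative Python model of that negotiation logic (not QEMU code; the function and parameter names are invented):

```python
def backend_init(has_mq_feature, backend_max_queues, num_queues):
    """Mirror the fixed logic: query the backend's limit only when it
    supports multiqueue; otherwise assume a single queue (previously
    max_queues was left at 0, failing every such backend)."""
    if has_mq_feature:
        max_queues = backend_max_queues
    else:
        max_queues = 1
    if num_queues and max_queues < num_queues:
        raise ValueError("The maximum number of queues supported by the "
                         f"backend is {max_queues}")
    return max_queues

# A single-queue backend without F_MQ now initializes successfully
print(backend_init(False, 0, 1))  # 1
```

Asking such a backend for more than one queue still fails, which is the intended behavior: only the spurious failure at num_queues == 1 is removed.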
Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support
> Am 08.07.2021 um 14:18 schrieb Kevin Wolf :
>
> Am 07.07.2021 um 20:13 hat Peter Lieven geschrieben:
>>> Am 06.07.2021 um 17:25 schrieb Kevin Wolf :
>>> Am 06.07.2021 um 16:55 hat Peter Lieven geschrieben:
>>>> I will have a decent look after my vacation.
>>>
>>> Sounds good, thanks. Enjoy your vacation!
>>
>> As I had to fire up my laptop to look into another issue anyway, I
>> have sent two patches for updating MAINTAINERS and to fix the int vs.
>> bool mix for task->complete.
>
> I think you need to reevaluate your definition of vacation. ;-)

Let's talk about this when the kids are grown up. Sometimes sending patches can be quite relaxing :-)

>
> But thanks anyway.
>
>> As Paolo's fix (5f50be9b5) is relatively new and there are maybe other
>> non-obvious problems when removing the BH indirection and we are close
>> to soft freeze, I would leave the BH removal change for 6.2.
>
> Sure, code cleanups aren't urgent.

Isn’t the indirection also a slight performance drop?

Peter
Re: [PATCH 3/2] qemu-img: Improve error for rebase without backing format
Am 08.07.2021 um 17:52 hat Eric Blake geschrieben:
> When removing support for qemu-img being able to create backing
> chains without embedded backing formats, we caused a poor error
> message as caught by iotest 114. Improve the situation to inform the
> user what went wrong.
>
> Suggested-by: Kevin Wolf
> Signed-off-by: Eric Blake

Thanks, applied to the block branch.

Kevin
[PATCH 3/2] qemu-img: Improve error for rebase without backing format
When removing support for qemu-img being able to create backing chains without embedded backing formats, we caused a poor error message as caught by iotest 114. Improve the situation to inform the user what went wrong. Suggested-by: Kevin Wolf Signed-off-by: Eric Blake --- qemu-img.c | 3 +++ tests/qemu-iotests/114.out | 2 +- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/qemu-img.c b/qemu-img.c index 7c0e73882dd4..b017734c255a 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -3786,6 +3786,9 @@ static int img_rebase(int argc, char **argv) if (ret == -ENOSPC) { error_report("Could not change the backing file to '%s': No " "space left in the file header", out_baseimg); +} else if (ret == -EINVAL && out_baseimg && !out_basefmt) { +error_report("Could not change the backing file to '%s': backing " + "format must be specified", out_baseimg); } else if (ret < 0) { error_report("Could not change the backing file to '%s': %s", out_baseimg, strerror(-ret)); diff --git a/tests/qemu-iotests/114.out b/tests/qemu-iotests/114.out index 172454401257..016e9ce3ecfb 100644 --- a/tests/qemu-iotests/114.out +++ b/tests/qemu-iotests/114.out @@ -14,7 +14,7 @@ qemu-io: can't open device TEST_DIR/t.qcow2: Could not open backing file: Unknow no file open, try 'help open' read 4096/4096 bytes at offset 0 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) -qemu-img: Could not change the backing file to '/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': Invalid argument +qemu-img: Could not change the backing file to '/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': backing format must be specified read 4096/4096 bytes at offset 0 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) *** done -- 2.31.1
Re: [PATCH v6 6/6] block: Make blockdev-reopen stable API
On Thu, Jul 08, 2021 at 01:47:09PM +0200, Kevin Wolf wrote: > From: Alberto Garcia > > This patch drops the 'x-' prefix from x-blockdev-reopen. > > Signed-off-by: Alberto Garcia > Signed-off-by: Kevin Wolf > Reviewed-by: Vladimir Sementsov-Ogievskiy > --- > +++ b/qapi/block-core.json > @@ -4219,7 +4219,7 @@ > { 'command': 'blockdev-add', 'data': 'BlockdevOptions', 'boxed': true } > > ## > -# @x-blockdev-reopen: > +# @blockdev-reopen: > # > # Reopens one or more block devices using the given set of options. > # Any option not specified will be reset to its default value regardless > @@ -4257,9 +4257,9 @@ > # image does not have a default backing file name as part of its > # metadata. > # > -# Since: 4.0 > +# Since: 6.0 6.1 now And yay! We are finally getting here! -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH 2/2] qemu-img: Require -F with -b backing image
On Thu, Jul 08, 2021 at 03:00:36PM +0200, Kevin Wolf wrote: > Am 03.05.2021 um 23:36 hat Eric Blake geschrieben: > > @@ -17,7 +14,7 @@ qemu-io: can't open device TEST_DIR/t.qcow2: Could not > > open backing file: Unknow > > no file open, try 'help open' > > read 4096/4096 bytes at offset 0 > > 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > > -qemu-img: warning: Deprecated use of backing file without explicit backing > > format, use of this image requires potentially unsafe format probing > > +qemu-img: Could not change the backing file to > > '/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': Invalid > > argument > > read 4096/4096 bytes at offset 0 > > 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > > This is not exactly an improvement for the error message. Maybe worth a > follow-up patch? Sure, will do. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v6 1/6] qcow2: Fix dangling pointer after reopen for 'file'
08.07.2021 14:47, Kevin Wolf wrote: Without an external data file, s->data_file is a second pointer with the same value as bs->file. When changing bs->file to a different BdrvChild and freeing the old BdrvChild, s->data_file must also be updated, otherwise it points to freed memory and causes crashes. This problem was caught by iotests case 245. Fixes: df2b7086f169239ebad5d150efa29c9bb6d4f820 Signed-off-by: Kevin Wolf Reviewed-by: Vladimir Sementsov-Ogievskiy -- Best regards, Vladimir
Re: [PATCH v6 1/6] qcow2: Fix dangling pointer after reopen for 'file'
On Thu, Jul 08, 2021 at 01:47:04PM +0200, Kevin Wolf wrote: > Without an external data file, s->data_file is a second pointer with the > same value as bs->file. When changing bs->file to a different BdrvChild > and freeing the old BdrvChild, s->data_file must also be updated, > otherwise it points to freed memory and causes crashes. > > This problem was caught by iotests case 245. > > Fixes: df2b7086f169239ebad5d150efa29c9bb6d4f820 > Signed-off-by: Kevin Wolf > --- > block/qcow2.c | 29 + > 1 file changed, 29 insertions(+) Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
[PULL 5/5] block/io: Merge discard request alignments
From: Akihiko Odaki Signed-off-by: Akihiko Odaki Message-id: 20210705130458.97642-3-akihiko.od...@gmail.com Signed-off-by: Stefan Hajnoczi --- block/io.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/block/io.c b/block/io.c index cf177a9d2d..e0a689c584 100644 --- a/block/io.c +++ b/block/io.c @@ -125,6 +125,8 @@ void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll) static void bdrv_merge_limits(BlockLimits *dst, const BlockLimits *src) { +dst->pdiscard_alignment = MAX(dst->pdiscard_alignment, + src->pdiscard_alignment); dst->opt_transfer = MAX(dst->opt_transfer, src->opt_transfer); dst->max_transfer = MIN_NON_ZERO(dst->max_transfer, src->max_transfer); dst->max_hw_transfer = MIN_NON_ZERO(dst->max_hw_transfer, -- 2.31.1
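The asymmetry in bdrv_merge_limits() is worth noting: alignments are constraints (the merged value must satisfy both inputs, so take the maximum), while transfer caps are capabilities where zero means "no limit" (hence QEMU's MIN_NON_ZERO). A toy version of the merge, using a made-up two-field subset of BlockLimits:

```c
#include <assert.h>
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
/* A zero limit means "unlimited/unknown", so it must not win over a
 * real limit; this mirrors QEMU's MIN_NON_ZERO. */
#define MIN_NON_ZERO(a, b) \
    ((a) == 0 ? (b) : ((b) == 0 ? (a) : ((a) < (b) ? (a) : (b))))

/* Simplified stand-in for BlockLimits with just two fields. */
typedef struct {
    uint32_t pdiscard_alignment; /* constraint: take the strictest (largest) */
    uint32_t max_transfer;       /* capability: take the smallest non-zero */
} MockLimits;

static void mock_merge_limits(MockLimits *dst, const MockLimits *src)
{
    dst->pdiscard_alignment = MAX(dst->pdiscard_alignment,
                                  src->pdiscard_alignment);
    dst->max_transfer = MIN_NON_ZERO(dst->max_transfer, src->max_transfer);
}
```

Merging a 512-byte alignment with a 4096-byte one yields 4096 (both parents stay satisfied), while a zero max_transfer is overridden by any real cap.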
[PULL 3/5] block/file-posix: Optimize for macOS
From: Akihiko Odaki This commit introduces "punch hole" operation and optimizes transfer block size for macOS. Thanks to Konstantin Nazarov for detailed analysis of a flaw in an old version of this change: https://gist.github.com/akihikodaki/87df4149e7ca87f18dc56807ec5a1bc5#gistcomment-3654667 Signed-off-by: Akihiko Odaki Message-id: 20210705130458.97642-1-akihiko.od...@gmail.com Signed-off-by: Stefan Hajnoczi --- block/file-posix.c | 27 +-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index a26eab0ac3..cb9bffe047 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -46,6 +46,7 @@ #if defined(HAVE_HOST_BLOCK_DEVICE) #include #include +#include #include #include #include @@ -1254,6 +1255,15 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) return; } +#if defined(__APPLE__) && (__MACH__) +struct statfs buf; + +if (!fstatfs(s->fd, &buf)) { +bs->bl.opt_transfer = buf.f_iosize; +bs->bl.pdiscard_alignment = buf.f_bsize; +} +#endif + if (bs->sg || S_ISBLK(st.st_mode)) { int ret = hdev_get_max_hw_transfer(s->fd, &st); @@ -1591,6 +1601,7 @@ out: } } +#if defined(CONFIG_FALLOCATE) || defined(BLKZEROOUT) || defined(BLKDISCARD) static int translate_err(int err) { if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP || @@ -1599,6 +1610,7 @@ static int translate_err(int err) } return err; } +#endif #ifdef CONFIG_FALLOCATE static int do_fallocate(int fd, int mode, off_t offset, off_t len) @@ -1811,16 +1823,27 @@ static int handle_aiocb_discard(void *opaque) } } while (errno == EINTR); -ret = -errno; +ret = translate_err(-errno); #endif } else { #ifdef CONFIG_FALLOCATE_PUNCH_HOLE ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, aiocb->aio_offset, aiocb->aio_nbytes); +ret = translate_err(-errno); +#elif defined(__APPLE__) && (__MACH__) +fpunchhole_t fpunchhole; +fpunchhole.fp_flags = 0; +fpunchhole.reserved = 0; +fpunchhole.fp_offset = aiocb->aio_offset; 
+fpunchhole.fp_length = aiocb->aio_nbytes; +if (fcntl(s->fd, F_PUNCHHOLE, &fpunchhole) == -1) { +ret = errno == ENODEV ? -ENOTSUP : -errno; +} else { +ret = 0; +} #endif } -ret = translate_err(ret); if (ret == -ENOTSUP) { s->has_discard = false; } -- 2.31.1
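The fstatfs() probe the patch adds can be exercised on its own. This is a hedged, portable sketch: on Linux, struct statfs has no f_iosize member, so only f_bsize (what the patch feeds into pdiscard_alignment) is returned; on macOS, f_iosize would additionally give the optimal transfer size.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>
#if defined(__APPLE__) && defined(__MACH__)
#include <sys/param.h>
#include <sys/mount.h>
#else
#include <sys/vfs.h> /* Linux definition of statfs/fstatfs */
#endif

/* Returns the filesystem's block size for fd, or -1 on error.
 * Analogous to the patch's raw_refresh_limits() probe, but without
 * any QEMU types. */
static long fs_block_size(int fd)
{
    struct statfs buf;

    if (fstatfs(fd, &buf) != 0) {
        return -1;
    }
    return (long)buf.f_bsize;
}
```

For example, `fs_block_size(open(".", O_RDONLY))` on a typical ext4 or APFS filesystem returns a small power of two such as 4096 (the exact value depends on the filesystem).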
[PULL 4/5] block: Add backend_defaults property
From: Akihiko Odaki backend_defaults property allow users to control if default block properties should be decided with backend information. If it is off, any backend information will be discarded, which is suitable if you plan to perform live migration to a different disk backend. If it is on, a block device may utilize backend information more aggressively. By default, it is auto, which uses backend information for block sizes and ignores the others, which is consistent with the older versions. Signed-off-by: Akihiko Odaki Message-id: 20210705130458.97642-2-akihiko.od...@gmail.com Signed-off-by: Stefan Hajnoczi --- include/hw/block/block.h | 3 +++ hw/block/block.c | 42 ++ tests/qemu-iotests/172.out | 38 ++ 3 files changed, 79 insertions(+), 4 deletions(-) diff --git a/include/hw/block/block.h b/include/hw/block/block.h index c172cbe65f..5902c0440a 100644 --- a/include/hw/block/block.h +++ b/include/hw/block/block.h @@ -19,6 +19,7 @@ typedef struct BlockConf { BlockBackend *blk; +OnOffAuto backend_defaults; uint32_t physical_block_size; uint32_t logical_block_size; uint32_t min_io_size; @@ -48,6 +49,8 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf) } #define DEFINE_BLOCK_PROPERTIES_BASE(_state, _conf) \ +DEFINE_PROP_ON_OFF_AUTO("backend_defaults", _state, \ +_conf.backend_defaults, ON_OFF_AUTO_AUTO), \ DEFINE_PROP_BLOCKSIZE("logical_block_size", _state, \ _conf.logical_block_size),\ DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,\ diff --git a/hw/block/block.c b/hw/block/block.c index 1e34573da7..d47ebf005a 100644 --- a/hw/block/block.c +++ b/hw/block/block.c @@ -65,24 +65,58 @@ bool blkconf_blocksizes(BlockConf *conf, Error **errp) { BlockBackend *blk = conf->blk; BlockSizes blocksizes; -int backend_ret; +BlockDriverState *bs; +bool use_blocksizes; +bool use_bs; + +switch (conf->backend_defaults) { +case ON_OFF_AUTO_AUTO: +use_blocksizes = !blk_probe_blocksizes(blk, &blocksizes); +use_bs = false; +break; + +case ON_OFF_AUTO_ON: 
+use_blocksizes = !blk_probe_blocksizes(blk, &blocksizes); +bs = blk_bs(blk); +use_bs = bs; +break; + +case ON_OFF_AUTO_OFF: +use_blocksizes = false; +use_bs = false; +break; + +default: +abort(); +} -backend_ret = blk_probe_blocksizes(blk, &blocksizes); /* fill in detected values if they are not defined via qemu command line */ if (!conf->physical_block_size) { -if (!backend_ret) { +if (use_blocksizes) { conf->physical_block_size = blocksizes.phys; } else { conf->physical_block_size = BDRV_SECTOR_SIZE; } } if (!conf->logical_block_size) { -if (!backend_ret) { +if (use_blocksizes) { conf->logical_block_size = blocksizes.log; } else { conf->logical_block_size = BDRV_SECTOR_SIZE; } } +if (use_bs) { +if (!conf->opt_io_size) { +conf->opt_io_size = bs->bl.opt_transfer; +} +if (conf->discard_granularity == -1) { +if (bs->bl.pdiscard_alignment) { +conf->discard_granularity = bs->bl.pdiscard_alignment; +} else if (bs->bl.request_alignment != 1) { +conf->discard_granularity = bs->bl.request_alignment; +} +} +} if (conf->logical_block_size > conf->physical_block_size) { error_setg(errp, diff --git a/tests/qemu-iotests/172.out b/tests/qemu-iotests/172.out index d53f61d0de..4cf4d536b4 100644 --- a/tests/qemu-iotests/172.out +++ b/tests/qemu-iotests/172.out @@ -21,6 +21,7 @@ Testing: dev: floppy, id "" unit = 0 (0x0) drive = "floppy0" +backend_defaults = "auto" logical_block_size = 512 (512 B) physical_block_size = 512 (512 B) min_io_size = 0 (0 B) @@ -48,6 +49,7 @@ Testing: -fda TEST_DIR/t.qcow2 dev: floppy, id "" unit = 0 (0x0) drive = "floppy0" +backend_defaults = "auto" logical_block_size = 512 (512 B) physical_block_size = 512 (512 B) min_io_size = 0 (0 B) @@ -85,6 +87,7 @@ Testing: -fdb TEST_DIR/t.qcow2 dev: floppy, id "" unit = 1 (0x1) drive = "floppy1" +backend_defaults = "auto" logical_block_size = 512 (512 B) physical_block_size = 512 (512 B) min_io_size = 0 (0 B) @@ -96,6 +99,7 @@ Testing: -fdb TEST_DIR/t.qcow2
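The three backend_defaults modes boil down to two booleans, as the switch in blkconf_blocksizes() shows. A simplified mock of that decision (note the simplification: the real code also requires blk_probe_blocksizes() to succeed and a root BlockDriverState to exist before the flags take effect):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MOCK_AUTO, MOCK_ON, MOCK_OFF } MockOnOffAuto;

/* Which backend information a device may consume per mode. */
static void mock_backend_defaults(MockOnOffAuto mode,
                                  bool *use_blocksizes, bool *use_bs)
{
    switch (mode) {
    case MOCK_AUTO: /* older behavior: probed block sizes only */
        *use_blocksizes = true;
        *use_bs = false;
        break;
    case MOCK_ON:   /* also opt_io_size and discard_granularity */
        *use_blocksizes = true;
        *use_bs = true;
        break;
    case MOCK_OFF:  /* ignore the backend entirely (migration-safe) */
        *use_blocksizes = false;
        *use_bs = false;
        break;
    }
}
```

"off" is the mode to pick when the disk backend may change across live migration, since no backend-derived geometry leaks into guest-visible defaults.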
[PULL 1/5] util/async: add a human-readable name to BHs for debugging
It can be difficult to debug issues with BHs in production environments. Although BHs can usually be identified by looking up their ->cb() function pointer, this requires debug information for the program. It is also not possible to print human-readable diagnostics about BHs because they have no identifier. This patch adds a name to each BH. The name is not unique per instance but differentiates between cb() functions, which is usually enough. It's done by changing aio_bh_new() and friends to macros that stringify cb. The next patch will use the name field when reporting leaked BHs. Signed-off-by: Stefan Hajnoczi Reviewed-by: Philippe Mathieu-Daudé Message-Id: <20210414200247.917496-2-stefa...@redhat.com> --- include/block/aio.h| 31 --- include/qemu/main-loop.h | 4 +++- tests/unit/ptimer-test-stubs.c | 2 +- util/async.c | 9 +++-- util/main-loop.c | 4 ++-- 5 files changed, 41 insertions(+), 9 deletions(-) diff --git a/include/block/aio.h b/include/block/aio.h index 10fcae1515..807edce9b5 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -291,20 +291,45 @@ void aio_context_acquire(AioContext *ctx); /* Relinquish ownership of the AioContext. */ void aio_context_release(AioContext *ctx); +/** + * aio_bh_schedule_oneshot_full: Allocate a new bottom half structure that will + * run only once and as soon as possible. + * + * @name: A human-readable identifier for debugging purposes. + */ +void aio_bh_schedule_oneshot_full(AioContext *ctx, QEMUBHFunc *cb, void *opaque, + const char *name); + /** * aio_bh_schedule_oneshot: Allocate a new bottom half structure that will run * only once and as soon as possible. + * + * A convenience wrapper for aio_bh_schedule_oneshot_full() that uses cb as the + * name string. 
*/ -void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque); +#define aio_bh_schedule_oneshot(ctx, cb, opaque) \ +aio_bh_schedule_oneshot_full((ctx), (cb), (opaque), (stringify(cb))) /** - * aio_bh_new: Allocate a new bottom half structure. + * aio_bh_new_full: Allocate a new bottom half structure. * * Bottom halves are lightweight callbacks whose invocation is guaranteed * to be wait-free, thread-safe and signal-safe. The #QEMUBH structure * is opaque and must be allocated prior to its use. + * + * @name: A human-readable identifier for debugging purposes. */ -QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque); +QEMUBH *aio_bh_new_full(AioContext *ctx, QEMUBHFunc *cb, void *opaque, +const char *name); + +/** + * aio_bh_new: Allocate a new bottom half structure + * + * A convenience wrapper for aio_bh_new_full() that uses the cb as the name + * string. + */ +#define aio_bh_new(ctx, cb, opaque) \ +aio_bh_new_full((ctx), (cb), (opaque), (stringify(cb))) /** * aio_notify: Force processing of pending events. 
diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h index 98aef5647c..8dbc6fcb89 100644 --- a/include/qemu/main-loop.h +++ b/include/qemu/main-loop.h @@ -294,7 +294,9 @@ void qemu_cond_timedwait_iothread(QemuCond *cond, int ms); void qemu_fd_register(int fd); -QEMUBH *qemu_bh_new(QEMUBHFunc *cb, void *opaque); +#define qemu_bh_new(cb, opaque) \ +qemu_bh_new_full((cb), (opaque), (stringify(cb))) +QEMUBH *qemu_bh_new_full(QEMUBHFunc *cb, void *opaque, const char *name); void qemu_bh_schedule_idle(QEMUBH *bh); enum { diff --git a/tests/unit/ptimer-test-stubs.c b/tests/unit/ptimer-test-stubs.c index 7f801a4d09..2a3ef58799 100644 --- a/tests/unit/ptimer-test-stubs.c +++ b/tests/unit/ptimer-test-stubs.c @@ -108,7 +108,7 @@ int64_t qemu_clock_deadline_ns_all(QEMUClockType type, int attr_mask) return deadline; } -QEMUBH *qemu_bh_new(QEMUBHFunc *cb, void *opaque) +QEMUBH *qemu_bh_new_full(QEMUBHFunc *cb, void *opaque, const char *name) { QEMUBH *bh = g_new(QEMUBH, 1); diff --git a/util/async.c b/util/async.c index 5d9b7cc1eb..9a668996b8 100644 --- a/util/async.c +++ b/util/async.c @@ -57,6 +57,7 @@ enum { struct QEMUBH { AioContext *ctx; +const char *name; QEMUBHFunc *cb; void *opaque; QSLIST_ENTRY(QEMUBH) next; @@ -107,7 +108,8 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags) return bh; } -void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque) +void aio_bh_schedule_oneshot_full(AioContext *ctx, QEMUBHFunc *cb, + void *opaque, const char *name) { QEMUBH *bh; bh = g_new(QEMUBH, 1); @@ -115,11 +117,13 @@ void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque) .ctx = ctx, .cb = cb, .opaque = opaque, +.name = name, }; aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT); } -QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque) +QEMUBH *aio_bh_new_full(AioCont
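The stringify trick the patch relies on is plain C preprocessor machinery and can be tried in isolation; the types below are toy stand-ins for QEMUBH, not QEMU code:

```c
#include <assert.h>
#include <string.h>

/* Two-level expansion so that macro arguments are expanded before being
 * turned into a string literal, like QEMU's stringify(). */
#define xstringify(x) #x
#define stringify(x) xstringify(x)

typedef void MockBHFunc(void *opaque);

typedef struct {
    MockBHFunc *cb;
    void *opaque;
    const char *name; /* human-readable identifier for debugging */
} MockBH;

static MockBH mock_bh_new_full(MockBHFunc *cb, void *opaque, const char *name)
{
    MockBH bh = { cb, opaque, name };
    return bh;
}

/* The convenience macro records the callback's source-level name
 * automatically, so every creation site gets a label for free. */
#define mock_bh_new(cb, opaque) \
    mock_bh_new_full((cb), (opaque), stringify(cb))

static void my_callback(void *opaque)
{
    (void)opaque;
}
```

`mock_bh_new(my_callback, NULL)` yields a handle whose name field is the string "my_callback", which is exactly what makes the leak report in the next patch readable without debug symbols.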
[PULL 2/5] util/async: print leaked BH name when AioContext finalizes
BHs must be deleted before the AioContext is finalized. If not, it's a bug and probably indicates that some part of the program still expects the BH to run in the future. That can lead to memory leaks, inconsistent state, or just hangs. Unfortunately the assert(flags & BH_DELETED) call in aio_ctx_finalize() is difficult to debug because the assertion failure contains no information about the BH! Use the QEMUBH name field added in the previous patch to show a useful error when a leaked BH is detected. Suggested-by: Eric Ernst Signed-off-by: Stefan Hajnoczi Message-Id: <20210414200247.917496-3-stefa...@redhat.com> --- util/async.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/util/async.c b/util/async.c index 9a668996b8..9a41591319 100644 --- a/util/async.c +++ b/util/async.c @@ -344,8 +344,20 @@ aio_ctx_finalize(GSource *source) assert(QSIMPLEQ_EMPTY(&ctx->bh_slice_list)); while ((bh = aio_bh_dequeue(&ctx->bh_list, &flags))) { -/* qemu_bh_delete() must have been called on BHs in this AioContext */ -assert(flags & BH_DELETED); +/* + * qemu_bh_delete() must have been called on BHs in this AioContext. In + * many cases memory leaks, hangs, or inconsistent state occur when a + * BH is leaked because something still expects it to run. + * + * If you hit this, fix the lifecycle of the BH so that + * qemu_bh_delete() and any associated cleanup is called before the + * AioContext is finalized. + */ +if (unlikely(!(flags & BH_DELETED))) { +fprintf(stderr, "%s: BH '%s' leaked, aborting...\n", +__func__, bh->name); +abort(); +} g_free(bh); } -- 2.31.1
[PULL 0/5] Block patches
The following changes since commit 711c0418c8c1ce3a24346f058b001c4c5a2f0f81: Merge remote-tracking branch 'remotes/philmd/tags/mips-20210702' into staging (2021-07-04 14:04:12 +0100) are available in the Git repository at: https://gitlab.com/stefanha/qemu.git tags/block-pull-request for you to fetch changes up to 9f460c64e13897117f35ffb61f6f5e0102cabc70: block/io: Merge discard request alignments (2021-07-06 14:28:55 +0100) Pull request Akihiko Odaki (3): block/file-posix: Optimize for macOS block: Add backend_defaults property block/io: Merge discard request alignments Stefan Hajnoczi (2): util/async: add a human-readable name to BHs for debugging util/async: print leaked BH name when AioContext finalizes include/block/aio.h| 31 ++--- include/hw/block/block.h | 3 +++ include/qemu/main-loop.h | 4 +++- block/file-posix.c | 27 -- block/io.c | 2 ++ hw/block/block.c | 42 ++ tests/unit/ptimer-test-stubs.c | 2 +- util/async.c | 25 util/main-loop.c | 4 ++-- tests/qemu-iotests/172.out | 38 ++ 10 files changed, 161 insertions(+), 17 deletions(-) -- 2.31.1
Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex
On Wed, Jul 07, 2021 at 06:58:07PM +0200, Emanuele Giuseppe Esposito wrote: > This is a continuation of the work to reduce (and possibly get rid of) the > usage of AioContext lock, by introducing smaller granularity locks to keep > the thread safety. > > This series aims to: > 1) remove the aiocontext lock and substitute it with the already existing >global job_mutex > 2) fix what looks like an oversight when moving the blockjob.c logic >into the more generic job.c: job_mutex was introduced especially to >protect job->busy flag, but it seems that it was not used in successive >patches, because there are multiple code sections that directly >access the field without any locking. > 3) use job_mutex instead of the aiocontext_lock > 4) extend the reach of the job_mutex to protect all shared fields >that the job structure has. Can you explain the big picture: 1. What are the rules for JobDrivers? Imagine you are implementing a new JobDriver. What do you need to know in order to write correct code? 2. What are the rules for monitor? The main pattern is looking up a job, invoking a job API on it, and then calling job_unlock(). Stefan
Re: [PATCH 0/2] Remove deprecated qemu-img backing file without format
Am 07.07.2021 um 23:17 hat Eric Blake geschrieben: > On Mon, May 03, 2021 at 04:35:58PM -0500, Eric Blake wrote: > > We've gone enough release cycles without noticeable pushback on our > > intentions, so time to make it harder to create images that can form a > > security hole due to a need for format probing rather than an explicit > > format. > > > > Eric Blake (2): > > qcow2: Prohibit backing file changes in 'qemu-img amend' > > qemu-img: Require -F with -b backing image > > Ping. Thanks, applied to the block branch. For some reason, the CCs were missing on the cover letter. Please make sure to CC me (and qemu-block) for the whole series if you want it to go through my tree. Kevin
Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex
On Thu, Jul 08, 2021 at 01:32:12PM +0200, Paolo Bonzini wrote: > On 08/07/21 12:36, Stefan Hajnoczi wrote: > > > What is very clear from this patch is that it > > > is strictly related to the bdrv_* and lower level calls, because > > > they also internally check or even use the aiocontext lock. > > > Therefore, in order to make it work, I temporarily added some > > > aiocontext_acquire/release pair around the function that > > > still assert for them or assume they are held and temporarily > > > unlock (unlock() - lock()). > > > > Sounds like the issue is that this patch series assumes AioContext locks > > are no longer required for calling the blk_*()/bdrv_*() APIs? That is > > not the case yet, so you had to then add those aio_context_lock() calls > > back in elsewhere. This approach introduces unnecessary risk. I think we > > should wait until blk_*()/bdrv_*() no longer requires the caller to hold > > the AioContext lock before applying this series. > > In general I'm in favor of pushing the lock further down into smaller and > smaller critical sections; it's a good approach to make further audits > easier until it's "obvious" that the lock is unnecessary. I haven't yet > reviewed Emanuele's patches to see if this is what he's doing where he's > adding the acquire/release calls, but that's my understanding of both his > cover letter and your reply. The problem is the unnecessary risk. We know what the goal is for blk_*()/bdrv_*() but it's not quite there yet. Does making changes in block jobs help solve the final issues with blk_*()/bdrv_*()? If yes, then it's a risk worth taking. If no, then spending time developing interim code, reviewing those patches, and risking breakage doesn't seem worth it. I'd rather wait for blk_*()/bdrv_*() to be fully complete and then see patches that delete aio_context_acquire() in most places or add locks in the remaining places where the caller was relying on the AioContext lock. Stefan
Re: [PATCH 2/2] qemu-img: Require -F with -b backing image
Am 03.05.2021 um 23:36 hat Eric Blake geschrieben: > @@ -17,7 +14,7 @@ qemu-io: can't open device TEST_DIR/t.qcow2: Could not open > backing file: Unknow > no file open, try 'help open' > read 4096/4096 bytes at offset 0 > 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > -qemu-img: warning: Deprecated use of backing file without explicit backing > format, use of this image requires potentially unsafe format probing > +qemu-img: Could not change the backing file to > '/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': Invalid > argument > read 4096/4096 bytes at offset 0 > 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) This is not exactly an improvement for the error message. Maybe worth a follow-up patch? Kevin
Re: [RFC PATCH 5/6] job: use global job_mutex to protect struct Job
On Wed, Jul 07, 2021 at 06:58:12PM +0200, Emanuele Giuseppe Esposito wrote: > This lock is going to replace most of the AioContext locks > in the job and blockjob, so that a Job can run in an arbitrary > AioContext. > > Signed-off-by: Emanuele Giuseppe Esposito > --- > include/block/blockjob_int.h | 1 + > include/qemu/job.h | 2 + > block/backup.c | 4 + > block/mirror.c | 11 +- > blockdev.c | 62 > blockjob.c | 67 +++-- > job-qmp.c| 55 +++ > job.c| 284 +++ > qemu-img.c | 15 +- > 9 files changed, 350 insertions(+), 151 deletions(-) > > diff --git a/include/block/blockjob_int.h b/include/block/blockjob_int.h > index 6633d83da2..8b91126506 100644 > --- a/include/block/blockjob_int.h > +++ b/include/block/blockjob_int.h > @@ -53,6 +53,7 @@ struct BlockJobDriver { > */ > void (*attached_aio_context)(BlockJob *job, AioContext *new_context); > > +/* Called with job mutex *not* held. */ > void (*set_speed)(BlockJob *job, int64_t speed); > }; > > diff --git a/include/qemu/job.h b/include/qemu/job.h > index 4421d08d93..359f4e6b3a 100644 > --- a/include/qemu/job.h > +++ b/include/qemu/job.h > @@ -49,6 +49,8 @@ typedef struct Job { > /** > * The type of this job. > * Set it in job_create and just read. > + * All calls to the driver function must be not locked by job_mutex, > + * to avoid deadlocks. > */ > const JobDriver *driver; > > diff --git a/block/backup.c b/block/backup.c > index bd3614ce70..80ce956299 100644 > --- a/block/backup.c > +++ b/block/backup.c > @@ -315,6 +315,10 @@ static void coroutine_fn backup_pause(Job *job) > } > } > > +/* > + * Called with job mutex *not* held (we don't want to call block_copy_kick > + * with the lock held!) 
> + */ > static void coroutine_fn backup_set_speed(BlockJob *job, int64_t speed) > { > BackupBlockJob *s = container_of(job, BackupBlockJob, common); > diff --git a/block/mirror.c b/block/mirror.c > index 49fffa..deefaa6a39 100644 > --- a/block/mirror.c > +++ b/block/mirror.c > @@ -1150,9 +1150,11 @@ static void mirror_complete(Job *job, Error **errp) > s->should_complete = true; > > /* If the job is paused, it will be re-entered when it is resumed */ > +job_lock(); > if (!job_is_paused(job)) { > -job_enter(job); > +job_enter_locked(job); > } > +job_unlock(); > } > > static void coroutine_fn mirror_pause(Job *job) > @@ -1171,10 +1173,13 @@ static bool mirror_drained_poll(BlockJob *job) > * from one of our own drain sections, to avoid a deadlock waiting for > * ourselves. > */ > -if (!job_is_paused(&s->common.job) && !job_is_cancelled(&s->common.job) > && > -!s->in_drain) { > +job_lock(); > +if (!job_is_paused(&s->common.job) && > +!job_is_cancelled_locked(&s->common.job) && !s->in_drain) { > +job_unlock(); > return true; > } > +job_unlock(); > > return !!s->in_flight; > } > diff --git a/blockdev.c b/blockdev.c > index 8e2c15370e..9255aea6a2 100644 > --- a/blockdev.c > +++ b/blockdev.c > @@ -150,9 +150,11 @@ void blockdev_mark_auto_del(BlockBackend *blk) > AioContext *aio_context = job_get_aiocontext(&job->job); > aio_context_acquire(aio_context); > > +job_lock(); > job_cancel(&job->job, false); > > aio_context_release(aio_context); > +job_unlock(); This looks strange. The way it's written suggests there is a reason why job_unlock() has to be called after aio_context_release(). Can job_unlock() be called immediately after job_cancel()? 
> } > } > > @@ -3309,48 +3311,44 @@ out: > aio_context_release(aio_context); > } > > -/* Get a block job using its ID and acquire its AioContext */ > -static BlockJob *find_block_job(const char *id, AioContext **aio_context, > -Error **errp) > +/* Get a block job using its ID and acquire its job_lock */ "its" suggests job_lock is per-Job. I suggest saying something like "Returns with job_lock held on success" instead.
Re: [PATCH] blockdev: fix drive-backup transaction endless drained section
Am 07.07.2021 um 15:35 hat Vladimir Sementsov-Ogievskiy geschrieben: > Forgotten thing :( > > Kevin, could you please queue it in your block branch? For me not to > bother Peter with one-patch pull request. No problem, I've queued it now. Kevin
Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support
Am 07.07.2021 um 20:13 hat Peter Lieven geschrieben: > Am 06.07.2021 um 17:25 schrieb Kevin Wolf : > > Am 06.07.2021 um 16:55 hat Peter Lieven geschrieben: > >> I will have a decent look after my vacation. > > > > Sounds good, thanks. Enjoy your vacation! > > As I had to fire up my laptop to look into another issue anyway, I > have sent two patches for updating MAINTAINERS and to fix the int vs. > bool mix for task->complete. I think you need to reevaluate your definition of vacation. ;-) But thanks anyway. > As Paolo's fix (5f50be9b5) is relatively new and there are maybe other > non-obvious problems when removing the BH indirection and we are close > to soft freeze I would leave the BH removal change for 6.2. Sure, code cleanups aren't urgent. Kevin
Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex
Am 08.07.2021 um 13:32 hat Paolo Bonzini geschrieben: > On 08/07/21 12:36, Stefan Hajnoczi wrote: > > > What is very clear from this patch is that it > > > is strictly related to the brdv_* and lower level calls, because > > > they also internally check or even use the aiocontext lock. > > > Therefore, in order to make it work, I temporarly added some > > > aiocontext_acquire/release pair around the function that > > > still assert for them or assume they are hold and temporarly > > > unlock (unlock() - lock()). > > > > Sounds like the issue is that this patch series assumes AioContext locks > > are no longer required for calling the blk_*()/bdrv_*() APIs? That is > > not the case yet, so you had to then add those aio_context_lock() calls > > back in elsewhere. This approach introduces unnecessary risk. I think we > > should wait until blk_*()/bdrv_*() no longer requires the caller to hold > > the AioContext lock before applying this series. > > In general I'm in favor of pushing the lock further down into smaller and > smaller critical sections; it's a good approach to make further audits > easier until it's "obvious" that the lock is unnecessary. I haven't yet > reviewed Emanuele's patches to see if this is what he's doing where he's > adding the acquire/release calls, but that's my understanding of both his > cover letter and your reply. > > The I/O blk_*()/bdrv_*() *should* not require the caller to hold the > AioContext lock; all drivers use their own CoMutex or QemuMutex when needed, > and generic code should also be ready (caveat emptor). Others, such as > reopen, are a mess that requires a separate audit. Restricting > acquire/release to be only around those seems like a good starting point. Reopen isn't just a mess, but in fact buggy. After the following patch goes in, the rule is simple: Don't hold any AioContext locks when calling bdrv_reopen_multiple(). 
'block: Acquire AioContexts during bdrv_reopen_multiple()' https://lists.gnu.org/archive/html/qemu-block/2021-07/msg00238.html It still takes AioContext locks when it calls into other functions that currently expect it, but that should be the same as usual then. And once callers don't even hold the lock in the first place, we'll also get rid of the ugly temporary lock release across reopen. Kevin
Re: [PATCH] MAINTAINERS: update block/rbd.c maintainer
Am 08.07.2021 um 12:44 hat Ilya Dryomov geschrieben: > On Wed, Jul 7, 2021 at 8:05 PM Peter Lieven wrote: > > > > adding myself as a designated reviewer. > > > > Signed-off-by: Peter Lieven > > --- > > MAINTAINERS | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > index 516db737d1..cfda57e825 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -3058,6 +3058,7 @@ F: block/vmdk.c > > > > RBD > > M: Ilya Dryomov > > +R: Peter Lieven > > L: qemu-block@nongnu.org > > S: Supported > > F: block/rbd.c > > Nit: I would change the title to "MAINTAINERS: add block/rbd.c reviewer" > or something like that. Yes, this is better. I've changed it in my queue. Kevin
[PATCH v6 5/6] iotests: Test reopening multiple devices at the same time
From: Alberto Garcia This test swaps the images used by two active block devices. This is now possible thanks to the new ability to run x-blockdev-reopen on multiple devices at the same time. Signed-off-by: Alberto Garcia Signed-off-by: Kevin Wolf Reviewed-by: Vladimir Sementsov-Ogievskiy --- tests/qemu-iotests/245 | 47 ++ tests/qemu-iotests/245.out | 4 ++-- 2 files changed, 49 insertions(+), 2 deletions(-) diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245 index ca08955207..8bc8101e6d 100755 --- a/tests/qemu-iotests/245 +++ b/tests/qemu-iotests/245 @@ -649,6 +649,53 @@ class TestBlockdevReopen(iotests.QMPTestCase): '-c', 'read -P 0x40 0x40008 1', '-c', 'read -P 0x80 0x40010 1', hd_path[0]) +# Swap the disk images of two active block devices +def test_swap_files(self): +# Add hd0 and hd2 (none of them with backing files) +opts0 = hd_opts(0) +result = self.vm.qmp('blockdev-add', conv_keys = False, **opts0) +self.assert_qmp(result, 'return', {}) + +opts2 = hd_opts(2) +result = self.vm.qmp('blockdev-add', conv_keys = False, **opts2) +self.assert_qmp(result, 'return', {}) + +# Write different data to both block devices +self.run_qemu_io("hd0", "write -P 0xa0 0 1k") +self.run_qemu_io("hd2", "write -P 0xa2 0 1k") + +# Check that the data reads correctly +self.run_qemu_io("hd0", "read -P 0xa0 0 1k") +self.run_qemu_io("hd2", "read -P 0xa2 0 1k") + +# It's not possible to make a block device use an image that +# is already being used by the other device. 
+self.reopen(opts0, {'file': 'hd2-file'}, +"Permission conflict on node 'hd2-file': permissions " +"'write, resize' are both required by node 'hd2' (uses " +"node 'hd2-file' as 'file' child) and unshared by node " +"'hd0' (uses node 'hd2-file' as 'file' child).") +self.reopen(opts2, {'file': 'hd0-file'}, +"Permission conflict on node 'hd0-file': permissions " +"'write, resize' are both required by node 'hd0' (uses " +"node 'hd0-file' as 'file' child) and unshared by node " +"'hd2' (uses node 'hd0-file' as 'file' child).") + +# But we can swap the images if we reopen both devices at the +# same time +opts0['file'] = 'hd2-file' +opts2['file'] = 'hd0-file' +self.reopenMultiple([opts0, opts2]) +self.run_qemu_io("hd0", "read -P 0xa2 0 1k") +self.run_qemu_io("hd2", "read -P 0xa0 0 1k") + +# And we can of course come back to the original state +opts0['file'] = 'hd0-file' +opts2['file'] = 'hd2-file' +self.reopenMultiple([opts0, opts2]) +self.run_qemu_io("hd0", "read -P 0xa0 0 1k") +self.run_qemu_io("hd2", "read -P 0xa2 0 1k") + # Misc reopen tests with different block drivers @iotests.skip_if_unsupported(['quorum', 'throttle']) def test_misc_drivers(self): diff --git a/tests/qemu-iotests/245.out b/tests/qemu-iotests/245.out index daf1e51922..4eced19294 100644 --- a/tests/qemu-iotests/245.out +++ b/tests/qemu-iotests/245.out @@ -17,8 +17,8 @@ read 1/1 bytes at offset 262152 read 1/1 bytes at offset 262160 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) -.. +... -- -Ran 24 tests +Ran 25 tests OK -- 2.31.1
[PATCH v6 6/6] block: Make blockdev-reopen stable API
From: Alberto Garcia This patch drops the 'x-' prefix from x-blockdev-reopen. Signed-off-by: Alberto Garcia Signed-off-by: Kevin Wolf Reviewed-by: Vladimir Sementsov-Ogievskiy --- qapi/block-core.json| 6 +++--- blockdev.c | 2 +- tests/qemu-iotests/155 | 2 +- tests/qemu-iotests/165 | 2 +- tests/qemu-iotests/245 | 10 +- tests/qemu-iotests/248 | 2 +- tests/qemu-iotests/248.out | 2 +- tests/qemu-iotests/296 | 2 +- tests/qemu-iotests/298 | 2 +- tests/qemu-iotests/tests/remove-bitmap-from-backing | 4 ++-- 10 files changed, 17 insertions(+), 17 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 052520331e..2eb399f0d4 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -4219,7 +4219,7 @@ { 'command': 'blockdev-add', 'data': 'BlockdevOptions', 'boxed': true } ## -# @x-blockdev-reopen: +# @blockdev-reopen: # # Reopens one or more block devices using the given set of options. # Any option not specified will be reset to its default value regardless @@ -4257,9 +4257,9 @@ # image does not have a default backing file name as part of its # metadata. 
# -# Since: 4.0 +# Since: 6.0 ## -{ 'command': 'x-blockdev-reopen', +{ 'command': 'blockdev-reopen', 'data': { 'options': ['BlockdevOptions'] } } ## diff --git a/blockdev.c b/blockdev.c index fddcccbdbd..6466a106f4 100644 --- a/blockdev.c +++ b/blockdev.c @@ -3560,7 +3560,7 @@ fail: visit_free(v); } -void qmp_x_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp) +void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp) { BlockReopenQueue *queue = NULL; GSList *drained = NULL; diff --git a/tests/qemu-iotests/155 b/tests/qemu-iotests/155 index 2947bfb81a..eadda52615 100755 --- a/tests/qemu-iotests/155 +++ b/tests/qemu-iotests/155 @@ -261,7 +261,7 @@ class TestBlockdevMirrorReopen(MirrorBaseClass): result = self.vm.qmp('blockdev-add', node_name="backing", driver="null-co") self.assert_qmp(result, 'return', {}) -result = self.vm.qmp('x-blockdev-reopen', options=[{ +result = self.vm.qmp('blockdev-reopen', options=[{ 'node-name': "target", 'driver': iotests.imgfmt, 'file': "target-file", diff --git a/tests/qemu-iotests/165 b/tests/qemu-iotests/165 index ef4cf14516..ce499946b8 100755 --- a/tests/qemu-iotests/165 +++ b/tests/qemu-iotests/165 @@ -137,7 +137,7 @@ class TestPersistentDirtyBitmap(iotests.QMPTestCase): assert sha256_1 == self.getSha256() # Reopen to RW -result = self.vm.qmp('x-blockdev-reopen', options=[{ +result = self.vm.qmp('blockdev-reopen', options=[{ 'node-name': 'node0', 'driver': iotests.imgfmt, 'file': { diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245 index 8bc8101e6d..bf8261eec0 100755 --- a/tests/qemu-iotests/245 +++ b/tests/qemu-iotests/245 @@ -1,7 +1,7 @@ #!/usr/bin/env python3 # group: rw # -# Test cases for the QMP 'x-blockdev-reopen' command +# Test cases for the QMP 'blockdev-reopen' command # # Copyright (C) 2018-2019 Igalia, S.L. 
# Author: Alberto Garcia @@ -85,16 +85,16 @@ class TestBlockdevReopen(iotests.QMPTestCase): "Expected output of %d qemu-io commands, found %d" % (found, self.total_io_cmds)) -# Run x-blockdev-reopen on a list of block devices +# Run blockdev-reopen on a list of block devices def reopenMultiple(self, opts, errmsg = None): -result = self.vm.qmp('x-blockdev-reopen', conv_keys=False, options=opts) +result = self.vm.qmp('blockdev-reopen', conv_keys=False, options=opts) if errmsg: self.assert_qmp(result, 'error/class', 'GenericError') self.assert_qmp(result, 'error/desc', errmsg) else: self.assert_qmp(result, 'return', {}) -# Run x-blockdev-reopen on a single block device (specified by +# Run blockdev-reopen on a single block device (specified by # 'opts') but applying 'newopts' on top of it. The original 'opts' # dict is unmodified def reopen(self, opts, newopts = {}, errmsg = None): @@ -161,7 +161,7 @@ class TestBlockdevReopen(iotests.QMPTestCase): self.reopen(opts, {'file.locking': 'off'}, "Cannot change the option 'locking'") self.reopen(opts, {'file.filename': None}, "Invalid parameter type for 'options[0].file.filename', expected: string") -# node-name is optional in BlockdevOptions, but x-blockdev-reopen needs it +# node-name is optional i
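With the list-based interface and the stable name, the image swap exercised by iotest 245 could be issued as a single QMP request. The node names and driver below are hypothetical and depend on how the devices were added; the point is that both reopens form one transaction, so the permission conflict that blocks each reopen individually never arises:

```
{ "execute": "blockdev-reopen",
  "arguments": {
    "options": [
      { "node-name": "hd0", "driver": "qcow2", "file": "hd2-file" },
      { "node-name": "hd2", "driver": "qcow2", "file": "hd0-file" }
    ]
  }
}
```

If either element of the list fails validation, the whole command returns an error and neither device is changed.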
[PATCH v6 1/6] qcow2: Fix dangling pointer after reopen for 'file'
Without an external data file, s->data_file is a second pointer with
the same value as bs->file. When changing bs->file to a different
BdrvChild and freeing the old BdrvChild, s->data_file must also be
updated, otherwise it points to freed memory and causes crashes.

This problem was caught by iotests case 245.

Fixes: df2b7086f169239ebad5d150efa29c9bb6d4f820
Signed-off-by: Kevin Wolf
---
 block/qcow2.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index ee4530cdbd..9126127633 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1926,6 +1926,7 @@ static void qcow2_refresh_limits(BlockDriverState *bs, Error **errp)
 static int qcow2_reopen_prepare(BDRVReopenState *state,
                                 BlockReopenQueue *queue, Error **errp)
 {
+    BDRVQcow2State *s = state->bs->opaque;
     Qcow2ReopenState *r;
     int ret;
 
@@ -1956,6 +1957,16 @@ static int qcow2_reopen_prepare(BDRVReopenState *state,
         }
     }
 
+    /*
+     * Without an external data file, s->data_file points to the same BdrvChild
+     * as bs->file. It needs to be resynced after reopen because bs->file may
+     * be changed. We can't use it in the meantime.
+     */
+    if (!has_data_file(state->bs)) {
+        assert(s->data_file == state->bs->file);
+        s->data_file = NULL;
+    }
+
     return 0;
 
 fail:
@@ -1966,7 +1977,16 @@ fail:
 
 static void qcow2_reopen_commit(BDRVReopenState *state)
 {
+    BDRVQcow2State *s = state->bs->opaque;
+
     qcow2_update_options_commit(state->bs, state->opaque);
+    if (!s->data_file) {
+        /*
+         * If we don't have an external data file, s->data_file was cleared by
+         * qcow2_reopen_prepare() and needs to be updated.
+         */
+        s->data_file = state->bs->file;
+    }
     g_free(state->opaque);
 }
 
@@ -1990,6 +2010,15 @@ static void qcow2_reopen_commit_post(BDRVReopenState *state)
 
 static void qcow2_reopen_abort(BDRVReopenState *state)
 {
+    BDRVQcow2State *s = state->bs->opaque;
+
+    if (!s->data_file) {
+        /*
+         * If we don't have an external data file, s->data_file was cleared by
+         * qcow2_reopen_prepare() and needs to be restored.
+         */
+        s->data_file = state->bs->file;
+    }
     qcow2_update_options_abort(state->bs, state->opaque);
     g_free(state->opaque);
 }
-- 
2.31.1
[PATCH v6 3/6] block: Acquire AioContexts during bdrv_reopen_multiple()
As the BlockReopenQueue can contain nodes in multiple AioContexts, only one of which may be locked when AIO_WAIT_WHILE() can be called, we can't let the caller lock the right contexts. Instead, individually lock the AioContext of a single node when iterating the queue. Reintroduce bdrv_reopen() as a wrapper for reopening a single node that drains the node and temporarily drops the AioContext lock for bdrv_reopen_multiple(). Signed-off-by: Kevin Wolf Reviewed-by: Vladimir Sementsov-Ogievskiy --- include/block/block.h | 2 ++ block.c | 49 --- block/replication.c | 7 +++ blockdev.c| 5 + qemu-io-cmds.c| 7 +-- 5 files changed, 57 insertions(+), 13 deletions(-) diff --git a/include/block/block.h b/include/block/block.h index 6d42992985..3477290f9a 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -388,6 +388,8 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue, bool keep_old_opts); void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue); int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp); +int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts, +Error **errp); int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only, Error **errp); int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset, diff --git a/block.c b/block.c index ee9b46e95e..06fb69df5c 100644 --- a/block.c +++ b/block.c @@ -4124,19 +4124,26 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue) * * All affected nodes must be drained between bdrv_reopen_queue() and * bdrv_reopen_multiple(). + * + * To be called from the main thread, with all other AioContexts unlocked. 
*/ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp) { int ret = -1; BlockReopenQueueEntry *bs_entry, *next; +AioContext *ctx; Transaction *tran = tran_new(); g_autoptr(GHashTable) found = NULL; g_autoptr(GSList) refresh_list = NULL; +assert(qemu_get_current_aio_context() == qemu_get_aio_context()); assert(bs_queue != NULL); QTAILQ_FOREACH(bs_entry, bs_queue, entry) { +ctx = bdrv_get_aio_context(bs_entry->state.bs); +aio_context_acquire(ctx); ret = bdrv_flush(bs_entry->state.bs); +aio_context_release(ctx); if (ret < 0) { error_setg_errno(errp, -ret, "Error flushing drive"); goto abort; @@ -4145,7 +4152,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp) QTAILQ_FOREACH(bs_entry, bs_queue, entry) { assert(bs_entry->state.bs->quiesce_counter > 0); +ctx = bdrv_get_aio_context(bs_entry->state.bs); +aio_context_acquire(ctx); ret = bdrv_reopen_prepare(&bs_entry->state, bs_queue, tran, errp); +aio_context_release(ctx); if (ret < 0) { goto abort; } @@ -4188,7 +4198,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp) * to first element. 
*/ QTAILQ_FOREACH_REVERSE(bs_entry, bs_queue, entry) { +ctx = bdrv_get_aio_context(bs_entry->state.bs); +aio_context_acquire(ctx); bdrv_reopen_commit(&bs_entry->state); +aio_context_release(ctx); } tran_commit(tran); @@ -4197,7 +4210,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp) BlockDriverState *bs = bs_entry->state.bs; if (bs->drv->bdrv_reopen_commit_post) { +ctx = bdrv_get_aio_context(bs); +aio_context_acquire(ctx); bs->drv->bdrv_reopen_commit_post(&bs_entry->state); +aio_context_release(ctx); } } @@ -4208,7 +4224,10 @@ abort: tran_abort(tran); QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) { if (bs_entry->prepared) { +ctx = bdrv_get_aio_context(bs_entry->state.bs); +aio_context_acquire(ctx); bdrv_reopen_abort(&bs_entry->state); +aio_context_release(ctx); } } @@ -4218,23 +4237,39 @@ cleanup: return ret; } -int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only, - Error **errp) +int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts, +Error **errp) { -int ret; +AioContext *ctx = bdrv_get_aio_context(bs); BlockReopenQueue *queue; -QDict *opts = qdict_new(); - -qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only); +int ret; bdrv_subtree_drained_begin(bs); -queue = bdrv_reopen_queue(NULL, bs, opts, true); +if (ctx != qemu_get_aio_context()) { +aio_context_release(ctx); +} + +queue = bdrv_reopen_queue(NULL, bs, opts, keep_old_opts); ret = bdrv_reopen_multiple(queue, errp); + +if (ctx != qemu_get_aio_context()) { +aio_context_acquire(c
[PATCH v6 4/6] block: Support multiple reopening with x-blockdev-reopen
From: Alberto Garcia [ kwolf: Fixed AioContext locking ] Signed-off-by: Alberto Garcia Signed-off-by: Kevin Wolf Reviewed-by: Vladimir Sementsov-Ogievskiy --- qapi/block-core.json | 18 +++-- blockdev.c| 81 ++- tests/qemu-iotests/155| 9 ++- tests/qemu-iotests/165| 4 +- tests/qemu-iotests/245| 27 --- tests/qemu-iotests/248| 2 +- tests/qemu-iotests/248.out| 2 +- tests/qemu-iotests/296| 9 ++- tests/qemu-iotests/298| 4 +- .../tests/remove-bitmap-from-backing | 18 +++-- 10 files changed, 99 insertions(+), 75 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 4a46552816..052520331e 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -4221,13 +4221,15 @@ ## # @x-blockdev-reopen: # -# Reopens a block device using the given set of options. Any option -# not specified will be reset to its default value regardless of its -# previous status. If an option cannot be changed or a particular +# Reopens one or more block devices using the given set of options. +# Any option not specified will be reset to its default value regardless +# of its previous status. If an option cannot be changed or a particular # driver does not support reopening then the command will return an -# error. +# error. All devices in the list are reopened in one transaction, so +# if one of them fails then the whole transaction is cancelled. # -# The top-level @node-name option (from BlockdevOptions) must be +# The command receives a list of block devices to reopen. For each one +# of them, the top-level @node-name option (from BlockdevOptions) must be # specified and is used to select the block device to be reopened. # Other @node-name options must be either omitted or set to the # current name of the appropriate node. This command won't change any @@ -4247,8 +4249,8 @@ # # 4) NULL: the current child (if any) is detached. # -# Options (1) and (2) are supported in all cases, but at the moment -# only @backing allows replacing or detaching an existing child. 
+# Options (1) and (2) are supported in all cases. Option (3) is +# supported for @file and @backing, and option (4) for @backing only. # # Unlike with blockdev-add, the @backing option must always be present # unless the node being reopened does not have a backing file and its @@ -4258,7 +4260,7 @@ # Since: 4.0 ## { 'command': 'x-blockdev-reopen', - 'data': 'BlockdevOptions', 'boxed': true } + 'data': { 'options': ['BlockdevOptions'] } } ## # @blockdev-del: diff --git a/blockdev.c b/blockdev.c index f657d090d3..fddcccbdbd 100644 --- a/blockdev.c +++ b/blockdev.c @@ -3560,51 +3560,60 @@ fail: visit_free(v); } -void qmp_x_blockdev_reopen(BlockdevOptions *options, Error **errp) -{ -BlockDriverState *bs; -AioContext *ctx; -QObject *obj; -Visitor *v = qobject_output_visitor_new(&obj); -BlockReopenQueue *queue; -QDict *qdict; +void qmp_x_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp) +{ +BlockReopenQueue *queue = NULL; +GSList *drained = NULL; + +/* Add each one of the BDS that we want to reopen to the queue */ +for (; reopen_list != NULL; reopen_list = reopen_list->next) { +BlockdevOptions *options = reopen_list->value; +BlockDriverState *bs; +AioContext *ctx; +QObject *obj; +Visitor *v; +QDict *qdict; + +/* Check for the selected node name */ +if (!options->has_node_name) { +error_setg(errp, "node-name not specified"); +goto fail; +} -/* Check for the selected node name */ -if (!options->has_node_name) { -error_setg(errp, "node-name not specified"); -goto fail; -} +bs = bdrv_find_node(options->node_name); +if (!bs) { +error_setg(errp, "Failed to find node with node-name='%s'", + options->node_name); +goto fail; +} -bs = bdrv_find_node(options->node_name); -if (!bs) { -error_setg(errp, "Failed to find node with node-name='%s'", - options->node_name); -goto fail; -} +/* Put all options in a QDict and flatten it */ +v = qobject_output_visitor_new(&obj); +visit_type_BlockdevOptions(v, NULL, &options, &error_abort); +visit_complete(v, &obj); 
+visit_free(v); -/* Put all options in a QDict and flatten it */ -visit_type_BlockdevOptions(v, NULL, &options, &error_abort); -visit_complete(v, &obj); -qdict = qobject_to(QDict, obj); +qdict = qobject_to(QDict, obj); -qdict_flatten(qdict); +qdict_flatten(qdict); -/* Perform the reopen operation */ -ctx = bdrv_get_aio_context(bs); -aio_cont
[PATCH v6 2/6] block: Add bdrv_reopen_queue_free()
From: Alberto Garcia

Move the code to free a BlockReopenQueue to a separate function. It
will be used in a subsequent patch.

[ kwolf: Also free explicit_options and options, and explicitly
  qobject_ref() the value when it continues to be used. This makes
  future memory leaks less likely. ]

Signed-off-by: Alberto Garcia
Signed-off-by: Kevin Wolf
Reviewed-by: Vladimir Sementsov-Ogievskiy
---
 include/block/block.h |  1 +
 block.c               | 22 --
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 7ec77ecb1a..6d42992985 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -386,6 +386,7 @@ BlockDriverState *bdrv_new_open_driver(BlockDriver *drv, const char *node_name,
 BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
                                     BlockDriverState *bs, QDict *options,
                                     bool keep_old_opts);
+void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue);
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp);
 int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
                               Error **errp);
diff --git a/block.c b/block.c
index acd35cb0cb..ee9b46e95e 100644
--- a/block.c
+++ b/block.c
@@ -4095,6 +4095,19 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
                                    NULL, 0, keep_old_opts);
 }
 
+void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
+{
+    if (bs_queue) {
+        BlockReopenQueueEntry *bs_entry, *next;
+        QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
+            qobject_unref(bs_entry->state.explicit_options);
+            qobject_unref(bs_entry->state.options);
+            g_free(bs_entry);
+        }
+        g_free(bs_queue);
+    }
+}
+
 /*
  * Reopen multiple BlockDriverStates atomically & transactionally.
  *
@@ -4197,15 +4210,10 @@ abort:
         if (bs_entry->prepared) {
             bdrv_reopen_abort(&bs_entry->state);
         }
-        qobject_unref(bs_entry->state.explicit_options);
-        qobject_unref(bs_entry->state.options);
     }
 
 cleanup:
-    QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
-        g_free(bs_entry);
-    }
-    g_free(bs_queue);
+    bdrv_reopen_queue_free(bs_queue);
 
     return ret;
 }
@@ -4573,6 +4581,8 @@ static void bdrv_reopen_commit(BDRVReopenState *reopen_state)
     /* set BDS specific flags now */
     qobject_unref(bs->explicit_options);
     qobject_unref(bs->options);
+    qobject_ref(reopen_state->explicit_options);
+    qobject_ref(reopen_state->options);
 
     bs->explicit_options = reopen_state->explicit_options;
     bs->options          = reopen_state->options;
-- 
2.31.1
[PATCH v6 0/6] Make blockdev-reopen stable
This series picks up the remaining patches from Berto's series "[PATCH v4 0/6] Allow changing bs->file on reopen", which are not merged into master yet. Apart from renaming 'x-blockdev-reopen' into 'blockdev-reopen', the remaining functional change in this series is taking a list of nodes to reopen as an argument so that multiple changes to the graph can be made atomically that would be invalid separately (e.g. due to permission checks on the intermediate state). It also contains a qcow2 fix for a bug introduced by the part of the series that was already picked up in Vladimir's "[PATCH v6 0/9] Allow changing bs->file on reopen". v6: - Changed qcow2 fix to set s->data_file = NULL between .prepare and .commit instead of using a separate bool [Vladimir] - Coding style fixes [Vladimir] Alberto Garcia (4): block: Add bdrv_reopen_queue_free() block: Support multiple reopening with x-blockdev-reopen iotests: Test reopening multiple devices at the same time block: Make blockdev-reopen stable API Kevin Wolf (2): qcow2: Fix dangling pointer after reopen for 'file' block: Acquire AioContexts during bdrv_reopen_multiple() qapi/block-core.json | 24 +++--- include/block/block.h | 3 + block.c | 71 + block/qcow2.c | 29 +++ block/replication.c | 7 ++ blockdev.c| 76 ++ qemu-io-cmds.c| 7 +- tests/qemu-iotests/155| 9 ++- tests/qemu-iotests/165| 4 +- tests/qemu-iotests/245| 78 +++ tests/qemu-iotests/245.out| 4 +- tests/qemu-iotests/248| 4 +- tests/qemu-iotests/248.out| 2 +- tests/qemu-iotests/296| 11 ++- tests/qemu-iotests/298| 4 +- .../tests/remove-bitmap-from-backing | 22 +++--- 16 files changed, 255 insertions(+), 100 deletions(-) -- 2.31.1
Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex
On 08/07/21 12:36, Stefan Hajnoczi wrote:
> > What is very clear from this patch is that it
> > is strictly related to the bdrv_* and lower level calls, because
> > they also internally check or even use the aiocontext lock.
> > Therefore, in order to make it work, I temporarily added some
> > aiocontext_acquire/release pair around the function that
> > still assert for them or assume they are held and temporarily
> > unlock (unlock() - lock()).
>
> Sounds like the issue is that this patch series assumes AioContext locks
> are no longer required for calling the blk_*()/bdrv_*() APIs? That is
> not the case yet, so you had to then add those aio_context_lock() calls
> back in elsewhere. This approach introduces unnecessary risk. I think we
> should wait until blk_*()/bdrv_*() no longer requires the caller to hold
> the AioContext lock before applying this series.

In general I'm in favor of pushing the lock further down into smaller and
smaller critical sections; it's a good approach to make further audits
easier until it's "obvious" that the lock is unnecessary. I haven't yet
reviewed Emanuele's patches to see if this is what he's doing where he's
adding the acquire/release calls, but that's my understanding of both his
cover letter and your reply.

The I/O blk_*()/bdrv_*() *should* not require the caller to hold the
AioContext lock; all drivers use their own CoMutex or QemuMutex when
needed, and generic code should also be ready (caveat emptor). Others,
such as reopen, are a mess that requires a separate audit. Restricting
acquire/release to be only around those seems like a good starting point.

Paolo
Re: [RFC PATCH 4/6] job.h: categorize job fields
On Wed, Jul 07, 2021 at 06:58:11PM +0200, Emanuele Giuseppe Esposito wrote:
> -/** AioContext to run the job coroutine in */
> +/**
> + * AioContext to run the job coroutine in.
> + * Atomic.
> + */
>  AioContext *aio_context;

This isn't accessed using atomic operations, so I'm not sure why it's
documented as atomic?
Re: [RFC PATCH 3/6] job: minor changes to simplify locking
On Wed, Jul 07, 2021 at 06:58:10PM +0200, Emanuele Giuseppe Esposito wrote:
> @@ -406,15 +410,18 @@ void *job_create(const char *job_id, const JobDriver *driver, JobTxn *txn,
>              error_setg(errp, "Invalid job ID '%s'", job_id);
>              return NULL;
>          }
> -        if (job_get(job_id)) {
> -            error_setg(errp, "Job ID '%s' already in use", job_id);
> -            return NULL;
> -        }
>      } else if (!(flags & JOB_INTERNAL)) {
>          error_setg(errp, "An explicit job ID is required");
>          return NULL;
>      }
>
> +    job_lock();
> +    if (job_get(job_id)) {
> +        error_setg(errp, "Job ID '%s' already in use", job_id);
> +        job_unlock();
> +        return NULL;
> +    }
> +

Where is the matching job_unlock() in the success case?

Please consider lock guard macros like QEMU_LOCK_GUARD()/WITH_QEMU_LOCK_GUARD()
to prevent common errors.
Re: [RFC PATCH 2/6] job: _locked functions and public job_lock/unlock for next patch
On Wed, Jul 07, 2021 at 06:58:09PM +0200, Emanuele Giuseppe Esposito wrote:
> diff --git a/job.c b/job.c
> index 872bbebb01..96fb8e9730 100644
> --- a/job.c
> +++ b/job.c
> @@ -32,6 +32,10 @@
>  #include "trace/trace-root.h"
>  #include "qapi/qapi-events-job.h"
>
> +/* job_mutex protects the jobs list, but also the job operations. */
> +static QemuMutex job_mutex;

It's unclear what protecting "job operations" means. I would prefer a
fine-grained per-job lock that protects the job's fields instead of a
global lock with an unclear scope.

> +
> +/* Protected by job_mutex */
>  static QLIST_HEAD(, Job) jobs = QLIST_HEAD_INITIALIZER(jobs);
>
>  /* Job State Transition Table */
> @@ -64,27 +68,22 @@ bool JobVerbTable[JOB_VERB__MAX][JOB_STATUS__MAX] = {
>  /* Transactional group of jobs */
>  struct JobTxn {
>
> -    /* Is this txn being cancelled? */
> +    /* Is this txn being cancelled? Atomic. */
>      bool aborting;

The comment says atomic but this field is not accessed using atomic
operations (at least at this point in the patch series)?

>
> -    /* List of jobs */
> +    /* List of jobs. Protected by job_mutex. */
>      QLIST_HEAD(, Job) jobs;
>
> -    /* Reference count */
> +    /* Reference count. Atomic. */
>      int refcnt;

Same.
Re: [PATCH] MAINTAINERS: update block/rbd.c maintainer
On Wed, Jul 7, 2021 at 8:05 PM Peter Lieven wrote: > > adding myself as a designated reviewer. > > Signed-off-by: Peter Lieven > --- > MAINTAINERS | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 516db737d1..cfda57e825 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -3058,6 +3058,7 @@ F: block/vmdk.c > > RBD > M: Ilya Dryomov > +R: Peter Lieven > L: qemu-block@nongnu.org > S: Supported > F: block/rbd.c > -- > 2.17.1 > > Nit: I would change the title to "MAINTAINERS: add block/rbd.c reviewer" or something like that. Acked-by: Ilya Dryomov Thanks again for volunteering! Ilya
Re: [PATCH] block/rbd: fix type of task->complete
On Wed, Jul 7, 2021 at 8:05 PM Peter Lieven wrote: > > task->complete is a bool not an integer. > > Signed-off-by: Peter Lieven > --- > block/rbd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/block/rbd.c b/block/rbd.c > index 01a7b94d62..dcf82b15b8 100644 > --- a/block/rbd.c > +++ b/block/rbd.c > @@ -1066,7 +1066,7 @@ static int qemu_rbd_resize(BlockDriverState *bs, > uint64_t size) > static void qemu_rbd_finish_bh(void *opaque) > { > RBDTask *task = opaque; > -task->complete = 1; > +task->complete = true; > aio_co_wake(task->co); > } > > -- > 2.17.1 > > Reviewed-by: Ilya Dryomov Thanks, Ilya
Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex
On Wed, Jul 07, 2021 at 06:58:07PM +0200, Emanuele Giuseppe Esposito wrote:
> This is a continuation on the work to reduce (and possibly get rid of) the
> usage of AioContext lock, by introducing smaller granularity locks to keep
> the thread safety.
>
> This series aims to:
> 1) remove the aiocontext lock and substitute it with the already existing
>    global job_mutex
> 2) fix what it looks like to be an oversight when moving the blockjob.c logic
>    into the more generic job.c: job_mutex was introduced especially to
>    protect job->busy flag, but it seems that it was not used in successive
>    patches, because there are multiple code sections that directly
>    access the field without any locking.
> 3) use job_mutex instead of the aiocontext_lock
> 4) extend the reach of the job_mutex to protect all shared fields
>    that the job structure has.
>
> The reason why we propose to use the existing job_mutex and not make one for
> each job is to keep things as simple as possible for now, and because
> the jobs are not in the execution critical path, so we can afford
> some delays.
> Having a lock per job would increase overall complexity and
> increase the chances of deadlocks (one good example could be the job
> transactions, where multiple jobs are grouped together).
> Anyways, the per-job mutex can always be added in the future.
>
> Patch 1-4 are in preparation for patch 5. They try to simplify and clarify
> the job_mutex usage. Patch 5 tries to add proper synchronization to the job
> structure, replacing the AioContext lock when necessary.
> Patch 6 just removes unnecessary AioContext locks that are now unneeded.
>
>
> RFC: I am not sure the way I laid out the locks is ideal.
> But their usage should not make deadlocks. I also made sure
> the series passes all qemu_iotests.
>
> What is very clear from this patch is that it
> is strictly related to the bdrv_* and lower level calls, because
> they also internally check or even use the aiocontext lock.
> Therefore, in order to make it work, I temporarily added some
> aiocontext_acquire/release pair around the function that
> still assert for them or assume they are held and temporarily
> unlock (unlock() - lock()).

Sounds like the issue is that this patch series assumes AioContext locks
are no longer required for calling the blk_*()/bdrv_*() APIs? That is
not the case yet, so you had to then add those aio_context_lock() calls
back in elsewhere. This approach introduces unnecessary risk. I think we
should wait until blk_*()/bdrv_*() no longer requires the caller to hold
the AioContext lock before applying this series.

Stefan
Re: [PATCH] block/rbd: fix type of task->complete
Am 07.07.2021 um 23:51 hat Connor Kuehl geschrieben: > On 7/7/21 11:04 AM, Peter Lieven wrote: > > task->complete is a bool not an integer. > > > > Signed-off-by: Peter Lieven > > --- > > block/rbd.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/block/rbd.c b/block/rbd.c > > index 01a7b94d62..dcf82b15b8 100644 > > --- a/block/rbd.c > > +++ b/block/rbd.c > > @@ -1066,7 +1066,7 @@ static int qemu_rbd_resize(BlockDriverState *bs, > > uint64_t size) > > static void qemu_rbd_finish_bh(void *opaque) > > { > > RBDTask *task = opaque; > > -task->complete = 1; > > +task->complete = true; > > aio_co_wake(task->co); > > } > > > > > > Hi Peter, > > What tree/QEMU git sha does this apply to? I am having trouble > finding definitions for RBDTask and qemu_rbd_finish_bh; and the > patch won't apply to my few-minutes-old clone of upstream. It is on top of: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support Message-Id: <20210702172356.11574-1-idryo...@gmail.com> Also, thanks, applied to the block branch. Kevin
Re: [PATCH v3 3/2] qemu-img: Reword 'qemu-img map --output=json' docs
07.07.2021 21:41, Eric Blake wrote: Reword the paragraphs to list the JSON key first, rather than in the middle of prose. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Eric Blake Reviewed-by: Vladimir Sementsov-Ogievskiy -- Best regards, Vladimir