Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-07 Thread Jason Wang
On Wed, Sep 8, 2021 at 11:24 AM Peter Xu  wrote:
>
> On Wed, Sep 08, 2021 at 10:59:57AM +0800, Jason Wang wrote:
> > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu  wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >if the socket option was not set, the socket exceeds its optmem
> > > > >limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > memory for the user before using it, or adding a capability to qemu before.
> > >
> > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> > >
> > > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > > (FOLL_MLOCK) or vfio (FOLL_PIN).
> > >
> > > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> >
> > It happens probably here:
> >
> > E.g
> >
> > __ip_append_data()
> > msg_zerocopy_realloc()
> > mm_account_pinned_pages()
>
> Right. :)
>
> But my previous question is more about the reason behind it - I thought that's
> a common GUP and it shouldn't rely on locked_vm because it should still be
> temporary GUP rather than a long time GUP, IMHO that's how we use locked_vm (we
> don't account for temp GUPs but only longterm ones). IOW, I'm wondering whether
> all the rest of iov_iter_get_pages() callers should check locked_vm too, and
> AFAIU they're not doing that right now..

You are probably right, but I'm not sure if it's too late to fix that.
(E.g it will break existing userspace)

Thanks

>
> Thanks,
>
> --
> Peter Xu
>




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-07 Thread Peter Xu
On Wed, Sep 08, 2021 at 10:59:57AM +0800, Jason Wang wrote:
> On Wed, Sep 8, 2021 at 2:32 AM Peter Xu  wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >if the socket option was not set, the socket exceeds its optmem
> > > >limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit an user has on locking memory?
> > >
> > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > memory for the user before using it, or adding a capability to qemu before.
> >
> > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> >
> > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > (FOLL_MLOCK) or vfio (FOLL_PIN).
> >
> > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> 
> It happens probably here:
> 
> E.g
> 
> __ip_append_data()
> msg_zerocopy_realloc()
> mm_account_pinned_pages()

Right. :)

But my previous question is more about the reason behind it - I thought that's
a common GUP and it shouldn't rely on locked_vm because it should still be
temporary GUP rather than a long time GUP, IMHO that's how we use locked_vm (we
don't account for temp GUPs but only longterm ones). IOW, I'm wondering whether
all the rest of iov_iter_get_pages() callers should check locked_vm too, and
AFAIU they're not doing that right now..

Thanks,

-- 
Peter Xu




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-07 Thread Jason Wang
On Wed, Sep 8, 2021 at 2:32 AM Peter Xu  wrote:
>
> On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > I don't think it is valid to unconditionally enable this feature due to the
> > > resource usage implications
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > >if the socket option was not set, the socket exceeds its optmem
> > >limit or the user exceeds its ulimit on locked pages."
> >
> > You are correct, I unfortunately missed this part in the docs :(
> >
> > > The limit on locked pages is something that looks very likely to be
> > > exceeded unless you happen to be running a QEMU config that already
> > > implies locked memory (eg PCI assignment)
> >
> > Do you mean the limit an user has on locking memory?
> >
> > If so, that makes sense. I remember I needed to set the upper limit of locked
> > memory for the user before using it, or adding a capability to qemu before.
>
> So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
>
> The thing is IIUC that's accounting for pinned pages only with either mlock()
> (FOLL_MLOCK) or vfio (FOLL_PIN).
>
> I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> __zerocopy_sg_from_iter() -> iov_iter_get_pages().

It happens probably here:

E.g

__ip_append_data()
msg_zerocopy_realloc()
mm_account_pinned_pages()

Thanks

>
> Say, I think there're plenty of iov_iter_get_pages() callers and normal GUPs, I
> think most of them do no need such accountings.
>
> --
> Peter Xu
>




[PATCH 4/4] tests/qtest: add qtests for npcm7xx sdhci

2021-09-07 Thread Hao Wu
From: Shengtan Mao 

Signed-off-by: Shengtan Mao 
Reviewed-by: Hao Wu 
Reviewed-by: Chris Rauer 
Reviewed-by: Tyrone Ting 
---
 tests/qtest/meson.build  |   1 +
 tests/qtest/npcm7xx_sdhci-test.c | 201 +++
 2 files changed, 202 insertions(+)
 create mode 100644 tests/qtest/npcm7xx_sdhci-test.c

diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index 757bb8499a..ef9c904779 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -157,6 +157,7 @@ qtests_npcm7xx = \
'npcm7xx_gpio-test',
'npcm7xx_pwm-test',
'npcm7xx_rng-test',
+   'npcm7xx_sdhci-test',
'npcm7xx_smbus-test',
'npcm7xx_timer-test',
'npcm7xx_watchdog_timer-test'] + \
diff --git a/tests/qtest/npcm7xx_sdhci-test.c b/tests/qtest/npcm7xx_sdhci-test.c
new file mode 100644
index 00..5c4e78fda4
--- /dev/null
+++ b/tests/qtest/npcm7xx_sdhci-test.c
@@ -0,0 +1,201 @@
+/*
+ * QTests for NPCM7xx SD-3.0 / MMC-4.51 Host Controller
+ *
+ * Copyright (c) 2021 Google LLC
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ * for more details.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/sd/npcm7xx_sdhci.h"
+
+#include "libqos/libqtest.h"
+#include "libqtest-single.h"
+#include "libqos/sdhci-cmd.h"
+
+#define NPCM7XX_MMC_BA 0xF0842000
+#define NPCM7XX_BLK_SIZE 512
+#define NPCM7XX_TEST_IMAGE_SIZE (1 << 30)
+
+char *sd_path;
+
+static QTestState *setup_sd_card(void)
+{
+QTestState *qts = qtest_initf(
+"-machine quanta-gbs-bmc "
+"-device sd-card,drive=drive0 "
+"-drive id=drive0,if=none,file=%s,format=raw,auto-read-only=off",
+sd_path);
+
+qtest_writew(qts, NPCM7XX_MMC_BA + SDHC_SWRST, SDHC_RESET_ALL);
+qtest_writew(qts, NPCM7XX_MMC_BA + SDHC_CLKCON,
+ SDHC_CLOCK_SDCLK_EN | SDHC_CLOCK_INT_STABLE |
+ SDHC_CLOCK_INT_EN);
+sdhci_cmd_regs(qts, NPCM7XX_MMC_BA, 0, 0, 0, 0, SDHC_APP_CMD);
+sdhci_cmd_regs(qts, NPCM7XX_MMC_BA, 0, 0, 0x4120, 0, (41 << 8));
+sdhci_cmd_regs(qts, NPCM7XX_MMC_BA, 0, 0, 0, 0, SDHC_ALL_SEND_CID);
+sdhci_cmd_regs(qts, NPCM7XX_MMC_BA, 0, 0, 0, 0, SDHC_SEND_RELATIVE_ADDR);
+sdhci_cmd_regs(qts, NPCM7XX_MMC_BA, 0, 0, 0x4567, 0,
+   SDHC_SELECT_DESELECT_CARD);
+
+return qts;
+}
+
+static void write_sdread(QTestState *qts, const char *msg)
+{
+size_t len = strlen(msg);
+char *rmsg = g_malloc(len);
+
+/* write message to sd */
+int fd = open(sd_path, O_WRONLY);
+int ret = write(fd, msg, len);
+close(fd);
+g_assert(ret == len);
+
+/* read message using sdhci */
+ret = sdhci_read_cmd(qts, NPCM7XX_MMC_BA, rmsg, len);
+g_assert(ret == len);
+g_assert(!strcmp(rmsg, msg));
+
+free(rmsg);
+}
+
+/* Check MMC can read values from sd */
+static void test_read_sd(void)
+{
+QTestState *qts = setup_sd_card();
+
+write_sdread(qts, "hello world");
+write_sdread(qts, "goodbye");
+
+qtest_quit(qts);
+}
+
+static void sdwrite_read(QTestState *qts, const char *msg)
+{
+size_t len = strlen(msg);
+char *rmsg = g_malloc(len);
+
+/* write message using sdhci */
+sdhci_write_cmd(qts, NPCM7XX_MMC_BA, msg, len, NPCM7XX_BLK_SIZE);
+
+/* read message from sd */
+int fd = open(sd_path, O_RDONLY);
+int ret = read(fd, rmsg, len);
+close(fd);
+g_assert(ret == len);
+
+g_assert(!strcmp(rmsg, msg));
+
+free(rmsg);
+}
+
+/* Check MMC can write values to sd */
+static void test_write_sd(void)
+{
+QTestState *qts = setup_sd_card();
+
+sdwrite_read(qts, "hello world");
+sdwrite_read(qts, "goodbye");
+
+qtest_quit(qts);
+}
+
+/* Check SDHCI has correct default values. */
+static void test_reset(void)
+{
+QTestState *qts = qtest_init("-machine quanta-gbs-bmc");
+
+uint64_t addr = NPCM7XX_MMC_BA;
+uint64_t end_addr = addr + NPCM7XX_REG_SIZE;
+uint16_t prstvals_resets[] = {NPCM7XX_PRSTVALS_0_RESET,
+  NPCM7XX_PRSTVALS_1_RESET,
+  0,
+  NPCM7XX_PRSTVALS_3_RESET,
+  0,
+  0};
+int i;
+uint32_t mask;
+while (addr < end_addr) {
+switch (addr - NPCM7XX_MMC_BA) {
+case SDHC_PRNSTS:
+/* ignores bits 20 to 24: they are changed when reading registers */
+mask = 0x1f0;
+g_assert_cmphex(qtest_readl(qts, addr) | mask, ==,
+NPCM7XX_PRSNTS_RESET | mask);
+addr 

[PATCH 2/4] hw/sd: add nuvoton MMC

2021-09-07 Thread Hao Wu
From: Shengtan Mao 

Signed-off-by: Shengtan Mao 
Reviewed-by: Hao Wu 
Reviewed-by: Chris Rauer 
Reviewed-by: Tyrone Ting 
---
 hw/arm/npcm7xx.c  |  12 +++-
 hw/sd/meson.build |   1 +
 hw/sd/npcm7xx_sdhci.c | 131 ++
 include/hw/arm/npcm7xx.h  |   2 +
 include/hw/sd/npcm7xx_sdhci.h |  65 +
 5 files changed, 210 insertions(+), 1 deletion(-)
 create mode 100644 hw/sd/npcm7xx_sdhci.c
 create mode 100644 include/hw/sd/npcm7xx_sdhci.h

diff --git a/hw/arm/npcm7xx.c b/hw/arm/npcm7xx.c
index 2ab0080e0b..878c2208e0 100644
--- a/hw/arm/npcm7xx.c
+++ b/hw/arm/npcm7xx.c
@@ -63,6 +63,8 @@
 #define NPCM7XX_ROM_BA  (0x)
 #define NPCM7XX_ROM_SZ  (64 * KiB)
 
+/* SDHCI Modules */
+#define NPCM7XX_MMC_BA  (0xf0842000)
 
 /* Clock configuration values to be fixed up when bypassing bootloader */
 
@@ -83,6 +85,7 @@ enum NPCM7xxInterrupt {
 NPCM7XX_UART3_IRQ,
 NPCM7XX_EMC1RX_IRQ  = 15,
 NPCM7XX_EMC1TX_IRQ,
+NPCM7XX_MMC_IRQ = 26,
 NPCM7XX_TIMER0_IRQ  = 32,   /* Timer Module 0 */
 NPCM7XX_TIMER1_IRQ,
 NPCM7XX_TIMER2_IRQ,
@@ -443,6 +446,8 @@ static void npcm7xx_init(Object *obj)
 for (i = 0; i < ARRAY_SIZE(s->emc); i++) {
object_initialize_child(obj, "emc[*]", &s->emc[i], TYPE_NPCM7XX_EMC);
 }
+
+object_initialize_child(obj, "mmc", &s->mmc, TYPE_NPCM7XX_SDHCI);
 }
 
 static void npcm7xx_realize(DeviceState *dev, Error **errp)
@@ -707,6 +712,12 @@ static void npcm7xx_realize(DeviceState *dev, Error **errp)
&error_abort);
memory_region_add_subregion(get_system_memory(), NPCM7XX_ROM_BA, &s->irom);
 
+/* SDHCI */
+sysbus_realize(SYS_BUS_DEVICE(&s->mmc), &error_abort);
+sysbus_mmio_map(SYS_BUS_DEVICE(&s->mmc), 0, NPCM7XX_MMC_BA);
+sysbus_connect_irq(SYS_BUS_DEVICE(&s->mmc), 0,
+npcm7xx_irq(s, NPCM7XX_MMC_IRQ));
+
 create_unimplemented_device("npcm7xx.shm",  0xc0001000,   4 * KiB);
 create_unimplemented_device("npcm7xx.vdmx", 0xe080,   4 * KiB);
 create_unimplemented_device("npcm7xx.pcierc",   0xe100,  64 * KiB);
@@ -736,7 +747,6 @@ static void npcm7xx_realize(DeviceState *dev, Error **errp)
 create_unimplemented_device("npcm7xx.usbd[8]",  0xf0838000,   4 * KiB);
 create_unimplemented_device("npcm7xx.usbd[9]",  0xf0839000,   4 * KiB);
 create_unimplemented_device("npcm7xx.sd",   0xf084,   8 * KiB);
-create_unimplemented_device("npcm7xx.mmc",  0xf0842000,   8 * KiB);
 create_unimplemented_device("npcm7xx.pcimbx",   0xf0848000, 512 * KiB);
 create_unimplemented_device("npcm7xx.aes",  0xf0858000,   4 * KiB);
 create_unimplemented_device("npcm7xx.des",  0xf0859000,   4 * KiB);
diff --git a/hw/sd/meson.build b/hw/sd/meson.build
index f1ce357a3b..807ca07b7c 100644
--- a/hw/sd/meson.build
+++ b/hw/sd/meson.build
@@ -9,4 +9,5 @@ softmmu_ss.add(when: 'CONFIG_PXA2XX', if_true: files('pxa2xx_mmci.c'))
 softmmu_ss.add(when: 'CONFIG_RASPI', if_true: files('bcm2835_sdhost.c'))
 softmmu_ss.add(when: 'CONFIG_ASPEED_SOC', if_true: files('aspeed_sdhci.c'))
softmmu_ss.add(when: 'CONFIG_ALLWINNER_H3', if_true: files('allwinner-sdhost.c'))
+softmmu_ss.add(when: 'CONFIG_NPCM7XX', if_true: files('npcm7xx_sdhci.c'))
 softmmu_ss.add(when: 'CONFIG_CADENCE_SDHCI', if_true: files('cadence_sdhci.c'))
diff --git a/hw/sd/npcm7xx_sdhci.c b/hw/sd/npcm7xx_sdhci.c
new file mode 100644
index 00..85cccdc485
--- /dev/null
+++ b/hw/sd/npcm7xx_sdhci.c
@@ -0,0 +1,131 @@
+/*
+ * NPCM7xx SD-3.0 / eMMC-4.51 Host Controller
+ *
+ * Copyright (c) 2021 Google LLC
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ * for more details.
+ */
+
+#include "qemu/osdep.h"
+
+#include "hw/sd/npcm7xx_sdhci.h"
+#include "sdhci-internal.h"
+
+static uint64_t npcm7xx_sdhci_read(void *opaque, hwaddr addr, unsigned int size)
+{
+NPCM7xxSDHCIState *s = opaque;
+uint64_t val = 0;
+
+switch (addr) {
+case NPCM7XX_PRSTVALS_0:
+case NPCM7XX_PRSTVALS_1:
+case NPCM7XX_PRSTVALS_2:
+case NPCM7XX_PRSTVALS_3:
+case NPCM7XX_PRSTVALS_4:
+case NPCM7XX_PRSTVALS_5:
+val = (uint64_t)s->regs.prstvals[(addr - NPCM7XX_PRSTVALS_0) / 2];
+break;
+case NPCM7XX_BOOTTOCTRL:
+val = (uint64_t)s->regs.boottoctrl;
+break;
+default:
+val = (uint64_t)s->sdhci.io_ops->read(&s->sdhci, addr, size);
+}
+
+return val;
+}
+
+static void npcm7xx_sdhci_write(void 

[PATCH 3/4] hw/arm: Attach MMC to quanta-gbs-bmc

2021-09-07 Thread Hao Wu
From: Shengtan Mao 

Signed-off-by: Shengtan Mao 
Reviewed-by: Hao Wu 
Reviewed-by: Tyrone Ting 
---
 hw/arm/npcm7xx_boards.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/hw/arm/npcm7xx_boards.c b/hw/arm/npcm7xx_boards.c
index e5a3243995..7205483280 100644
--- a/hw/arm/npcm7xx_boards.c
+++ b/hw/arm/npcm7xx_boards.c
@@ -27,6 +27,10 @@
 #include "qemu-common.h"
 #include "qemu/datadir.h"
 #include "qemu/units.h"
+#include "sysemu/blockdev.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/block-backend.h"
+
 
 #define NPCM750_EVB_POWER_ON_STRAPS 0x1ff7
 #define QUANTA_GSJ_POWER_ON_STRAPS 0x1fff
@@ -80,6 +84,22 @@ static void npcm7xx_connect_dram(NPCM7xxState *soc, MemoryRegion *dram)
&error_abort);
 }
 
+static void sdhci_attach_drive(SDHCIState *sdhci)
+{
+DriveInfo *di = drive_get_next(IF_SD);
+BlockBackend *blk = di ? blk_by_legacy_dinfo(di) : NULL;
+
+BusState *bus = qdev_get_child_bus(DEVICE(sdhci), "sd-bus");
+if (bus == NULL) {
+error_report("No SD bus found in SOC object");
+exit(1);
+}
+
+DeviceState *carddev = qdev_new(TYPE_SD_CARD);
+qdev_prop_set_drive_err(carddev, "drive", blk, &error_fatal);
+qdev_realize_and_unref(carddev, bus, &error_fatal);
+}
+
 static NPCM7xxState *npcm7xx_create_soc(MachineState *machine,
 uint32_t hw_straps)
 {
@@ -354,6 +374,7 @@ static void quanta_gbs_init(MachineState *machine)
   drive_get(IF_MTD, 0, 0));
 
 quanta_gbs_i2c_init(soc);
+sdhci_attach_drive(&soc->mmc.sdhci);
 npcm7xx_load_kernel(machine, soc);
 }
 
-- 
2.33.0.153.gba50c8fa24-goog




[PATCH 1/4] tests/qtest/libqos: add SDHCI commands

2021-09-07 Thread Hao Wu
From: Shengtan Mao 

Signed-off-by: Shengtan Mao 
Reviewed-by: Hao Wu 
Reviewed-by: Chris Rauer 
Reviewed-by: Tyrone Ting 
---
 tests/qtest/libqos/meson.build |   1 +
 tests/qtest/libqos/sdhci-cmd.c | 116 +
 tests/qtest/libqos/sdhci-cmd.h |  70 
 3 files changed, 187 insertions(+)
 create mode 100644 tests/qtest/libqos/sdhci-cmd.c
 create mode 100644 tests/qtest/libqos/sdhci-cmd.h

diff --git a/tests/qtest/libqos/meson.build b/tests/qtest/libqos/meson.build
index 1f5c8f1053..4af1f04787 100644
--- a/tests/qtest/libqos/meson.build
+++ b/tests/qtest/libqos/meson.build
@@ -5,6 +5,7 @@ libqos_srcs = files('../libqtest.c',
 'fw_cfg.c',
 'malloc.c',
 'libqos.c',
+'sdhci-cmd.c',
 
 # spapr
 'malloc-spapr.c',
diff --git a/tests/qtest/libqos/sdhci-cmd.c b/tests/qtest/libqos/sdhci-cmd.c
new file mode 100644
index 00..2d9e518341
--- /dev/null
+++ b/tests/qtest/libqos/sdhci-cmd.c
@@ -0,0 +1,116 @@
+/*
+ * MMC Host Controller Commands
+ *
+ * Copyright (c) 2021 Google LLC
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ * for more details.
+ */
+
+#include "qemu/osdep.h"
+#include "sdhci-cmd.h"
+#include "libqtest.h"
+
+static ssize_t read_fifo(QTestState *qts, uint64_t reg, char *msg, size_t count)
+{
+uint32_t mask = 0xff;
+size_t index = 0;
+uint32_t msg_frag;
+int size;
+while (index < count) {
+size = count - index;
+if (size > 4) {
+size = 4;
+}
+msg_frag = qtest_readl(qts, reg);
+while (size > 0) {
+msg[index] = msg_frag & mask;
+if (msg[index++] == 0) {
+return index;
+}
+msg_frag >>= 8;
+--size;
+}
+}
+return index;
+}
+
+static void write_fifo(QTestState *qts, uint64_t reg, const char *msg,
+   size_t count)
+{
+size_t index = 0;
+uint32_t msg_frag;
+int size;
+int frag_i;
+while (index < count) {
+size = count - index;
+if (size > 4) {
+size = 4;
+}
+msg_frag = 0;
+frag_i = 0;
+while (frag_i < size) {
+msg_frag |= ((uint32_t)msg[index++]) << (frag_i * 8);
+++frag_i;
+}
+qtest_writel(qts, reg, msg_frag);
+}
+}
+
+static void fill_block(QTestState *qts, uint64_t reg, int count)
+{
+while (--count >= 0) {
+qtest_writel(qts, reg, 0);
+}
+}
+
+void sdhci_cmd_regs(QTestState *qts, uint64_t base_addr, uint16_t blksize,
+uint16_t blkcnt, uint32_t argument, uint16_t trnmod,
+uint16_t cmdreg)
+{
+qtest_writew(qts, base_addr + SDHC_BLKSIZE, blksize);
+qtest_writew(qts, base_addr + SDHC_BLKCNT, blkcnt);
+qtest_writel(qts, base_addr + SDHC_ARGUMENT, argument);
+qtest_writew(qts, base_addr + SDHC_TRNMOD, trnmod);
+qtest_writew(qts, base_addr + SDHC_CMDREG, cmdreg);
+}
+
+ssize_t sdhci_read_cmd(QTestState *qts, uint64_t base_addr, char *msg,
+   size_t count)
+{
+sdhci_cmd_regs(qts, base_addr, count, 1, 0,
+   SDHC_TRNS_MULTI | SDHC_TRNS_READ | SDHC_TRNS_BLK_CNT_EN,
+   SDHC_READ_MULTIPLE_BLOCK | SDHC_CMD_DATA_PRESENT);
+
+/* read sd fifo_buffer */
+ssize_t bytes_read = read_fifo(qts, base_addr + SDHC_BDATA, msg, count);
+
+sdhci_cmd_regs(qts, base_addr, 0, 0, 0,
+   SDHC_TRNS_MULTI | SDHC_TRNS_READ | SDHC_TRNS_BLK_CNT_EN,
+   SDHC_STOP_TRANSMISSION);
+
+return bytes_read;
+}
+
+void sdhci_write_cmd(QTestState *qts, uint64_t base_addr, const char *msg,
+ size_t count, size_t blksize)
+{
+sdhci_cmd_regs(qts, base_addr, blksize, 1, 0,
+   SDHC_TRNS_MULTI | SDHC_TRNS_WRITE | SDHC_TRNS_BLK_CNT_EN,
+   SDHC_WRITE_MULTIPLE_BLOCK | SDHC_CMD_DATA_PRESENT);
+
+/* write to sd fifo_buffer */
+write_fifo(qts, base_addr + SDHC_BDATA, msg, count);
+fill_block(qts, base_addr + SDHC_BDATA, (blksize - count) / 4);
+
+sdhci_cmd_regs(qts, base_addr, 0, 0, 0,
+   SDHC_TRNS_MULTI | SDHC_TRNS_WRITE | SDHC_TRNS_BLK_CNT_EN,
+   SDHC_STOP_TRANSMISSION);
+}
diff --git a/tests/qtest/libqos/sdhci-cmd.h b/tests/qtest/libqos/sdhci-cmd.h
new file mode 100644
index 00..64763c5a2a
--- /dev/null
+++ b/tests/qtest/libqos/sdhci-cmd.h
@@ -0,0 +1,70 @@
+/*
+ * MMC Host Controller Commands
+ *
+ * Copyright (c) 2021 Google LLC
+ *

[PATCH 0/4] hw/arm: Add MMC device for NPCM7XX boards

2021-09-07 Thread Hao Wu
This patch set implements the Nuvoton MMC device
for NPCM7XX boards.

The MMC device is compatible with the SDHCI interface
in QEMU. It allows the user to attach an SD card image
to it.

Shengtan Mao (4):
  tests/qtest/libqos: add SDHCI commands
  hw/sd: add nuvoton MMC
  hw/arm: Attach MMC to quanta-gbs-bmc
  tests/qtest: add qtests for npcm7xx sdhci

 hw/arm/npcm7xx.c |  12 +-
 hw/arm/npcm7xx_boards.c  |  21 
 hw/sd/meson.build|   1 +
 hw/sd/npcm7xx_sdhci.c| 131 
 include/hw/arm/npcm7xx.h |   2 +
 include/hw/sd/npcm7xx_sdhci.h|  65 ++
 tests/qtest/libqos/meson.build   |   1 +
 tests/qtest/libqos/sdhci-cmd.c   | 116 ++
 tests/qtest/libqos/sdhci-cmd.h   |  70 +++
 tests/qtest/meson.build  |   1 +
 tests/qtest/npcm7xx_sdhci-test.c | 201 +++
 11 files changed, 620 insertions(+), 1 deletion(-)
 create mode 100644 hw/sd/npcm7xx_sdhci.c
 create mode 100644 include/hw/sd/npcm7xx_sdhci.h
 create mode 100644 tests/qtest/libqos/sdhci-cmd.c
 create mode 100644 tests/qtest/libqos/sdhci-cmd.h
 create mode 100644 tests/qtest/npcm7xx_sdhci-test.c

-- 
2.33.0.153.gba50c8fa24-goog




Re: [PATCH v3 9/9] iotests: add nbd-reconnect-on-open test

2021-09-07 Thread Eric Blake
On Mon, Sep 06, 2021 at 10:06:54PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  .../qemu-iotests/tests/nbd-reconnect-on-open  | 71 +++
>  .../tests/nbd-reconnect-on-open.out   | 11 +++
>  2 files changed, 82 insertions(+)
>  create mode 100755 tests/qemu-iotests/tests/nbd-reconnect-on-open
>  create mode 100644 tests/qemu-iotests/tests/nbd-reconnect-on-open.out

I'm less confident in my review of the python code, but...

> 
> diff --git a/tests/qemu-iotests/tests/nbd-reconnect-on-open b/tests/qemu-iotests/tests/nbd-reconnect-on-open
> new file mode 100755
> index 00..7ee9bce947
> --- /dev/null
> +++ b/tests/qemu-iotests/tests/nbd-reconnect-on-open
> @@ -0,0 +1,71 @@

> +
> +def create_args(open_timeout):
> +return ['--image-opts', '-c', 'read 0 1M',
> +f'driver=nbd,open-timeout={open_timeout},'
> +f'server.type=unix,server.path={nbd_sock}']
> +
> +
> +def check_fail_to_connect(open_timeout):
> +log(f'Check fail to connect with {open_timeout} seconds of timeout')
> +
> +start_t = time.time()
> +qemu_io_log(*create_args(open_timeout))
> +delta_t = time.time() - start_t
> +
> +max_delta = open_timeout + 0.2

Is this fractional delay going to bite us on heavily-loaded CI machines?

> +if open_timeout <= delta_t <= max_delta:
> +log(f'qemu_io finished in {open_timeout}..{max_delta} seconds, OK')
> +else:
> +note = 'too early' if delta_t < open_timeout else 'too long'
> +log(f'qemu_io finished in {delta_t:.1f} seconds, {note}')
> +
> +
> +qemu_img_create('-f', iotests.imgfmt, disk, '1M')
> +
> +# Start NBD client when NBD server is not yet running. It should not fail, but
> +# wait for 5 seconds for the server to be available.
> +client = qemu_io_popen(*create_args(5))
> +
> +time.sleep(1)
> +qemu_nbd('-k', nbd_sock, '-f', iotests.imgfmt, disk)
> +
> +# client should succeed
> +log(client.communicate()[0], filters=[iotests.filter_qemu_io])
> +
> +# Server was started without --persistent flag, so it should be off now. Let's
> +# check it and it the same time check that with open-timeout=0 client fails

check it and at

> +# immediately.
> +check_fail_to_connect(0)
> +
> +# Check that we will fail after non-zero timeout if server is still unavailable
> +check_fail_to_connect(1)
> diff --git a/tests/qemu-iotests/tests/nbd-reconnect-on-open.out b/tests/qemu-iotests/tests/nbd-reconnect-on-open.out
> new file mode 100644
> index 00..a35ae30ea4
> --- /dev/null
> +++ b/tests/qemu-iotests/tests/nbd-reconnect-on-open.out
> @@ -0,0 +1,11 @@
> +read 1048576/1048576 bytes at offset 0
> +1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +
> +Check fail to connect with 0 seconds of timeout
> +qemu-io: can't open: Failed to connect to 'TEST_DIR/PID-nbd-sock': No such file or directory
> +
> +qemu_io finished in 0..0.2 seconds, OK
> +Check fail to connect with 1 seconds of timeout
> +qemu-io: can't open: Failed to connect to 'TEST_DIR/PID-nbd-sock': No such file or directory
> +
> +qemu_io finished in 1..1.2 seconds, OK

Overall, the test looks like a nice demonstration of the feature: you
are showing that the client can now start before the server, and that
the retry for N seconds is handled gracefully both by the creation of
the server and by the expiration of the retry timeout.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v3 5/9] nbd/client-connection: improve error message of cancelled attempt

2021-09-07 Thread Eric Blake
On Mon, Sep 06, 2021 at 10:06:50PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  nbd/client-connection.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)

Reviewed-by: Eric Blake 

> 
> diff --git a/nbd/client-connection.c b/nbd/client-connection.c
> index 722998c985..2bda42641d 100644
> --- a/nbd/client-connection.c
> +++ b/nbd/client-connection.c
> @@ -351,8 +351,15 @@ nbd_co_establish_connection(NBDClientConnection *conn, NBDExportInfo *info,
>  if (conn->err) {
>  error_propagate(errp, error_copy(conn->err));
>  } else {
> -error_setg(errp,
> -   "Connection attempt cancelled by other operation");
> +/*
> + * The only possible case here is cancelling by open_timer
> + * during nbd_open(). So, the error message is for that case.
> + * If we have more use cases, we can refactor
> + * nbd_co_establish_connection_cancel() to take an additional
> + * parameter cancel_reason, that would be passed than to the
> + * caller of cancelled nbd_co_establish_connection().
> + */
> +error_setg(errp, "Connection attempt cancelled by timeout");
>  }
>  
>  return NULL;
> -- 
> 2.29.2
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v3 4/9] nbd/client-connection: nbd_co_establish_connection(): return real error

2021-09-07 Thread Eric Blake
On Mon, Sep 06, 2021 at 10:06:49PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> The only user of errp is call to nbd_do_establish_connection() in
> nbd_open(). The only way to cancel this call is through open_timer

The only caller of nbd_do_establish_connection() that uses errp is
nbd_open().

> timeout. And for this case, user will be more interested in description
> of last failed connect rather than in
> "Connection attempt cancelled by other operation".
> 
> So, let's change behavior on cancel to return previous failure error if
> available.
> 
> Do the same for non-blocking failure case. In this case we still don't
> have a caller that is interested in errp. But let's be consistent.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  nbd/client-connection.c | 50 -
>  1 file changed, 34 insertions(+), 16 deletions(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v3 3/9] nbd: allow reconnect on open, with corresponding new options

2021-09-07 Thread Eric Blake
On Mon, Sep 06, 2021 at 10:06:48PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> It is useful when start of vm and start of nbd server are not
> simple to sync.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  qapi/block-core.json |  9 -
>  block/nbd.c  | 45 +++-
>  2 files changed, 52 insertions(+), 2 deletions(-)
>

Reviewed-by: Eric Blake 

This is when qemu is acting as NBD client (and does not affect qemu
acting as NBD server).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
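[Editor's aside: a sketch of what the new option looks like from the client side, using the image-opts syntax from the iotest later in this series — the socket path here is a placeholder:

```
qemu-io --image-opts -c 'read 0 1M' \
    'driver=nbd,open-timeout=5,server.type=unix,server.path=/tmp/nbd.sock'
```

With open-timeout=5 the open keeps retrying for up to five seconds before failing; open-timeout=0 keeps the old fail-immediately behaviour.]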




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-07 Thread Peter Xu
On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > I don't think it is valid to unconditionally enable this feature due to the
> > resource usage implications
> >
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >
> >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> >if the socket option was not set, the socket exceeds its optmem
> >limit or the user exceeds its ulimit on locked pages."
> 
> You are correct, I unfortunately missed this part in the docs :(
> 
> > The limit on locked pages is something that looks very likely to be
> > exceeded unless you happen to be running a QEMU config that already
> > implies locked memory (eg PCI assignment)
> 
> Do you mean the limit an user has on locking memory?
> 
> If so, that makes sense. I remember I needed to set the upper limit of locked
> memory for the user before using it, or adding a capability to qemu before.

So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.

The thing is IIUC that's accounting for pinned pages only with either mlock()
(FOLL_MLOCK) or vfio (FOLL_PIN).

I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
__zerocopy_sg_from_iter() -> iov_iter_get_pages().

Say, I think there're plenty of iov_iter_get_pages() callers and normal GUPs, I
think most of them do no need such accountings.

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-07 Thread Peter Xu
On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > What if we do the 'flush()' before we start post-copy, instead of after each
> > > iteration? would that be enough?
> > 
> > Possibly, yes. This really need David G's input since he understands
> > the code in way more detail than me.
> 
> Hmm I'm not entirely sure why we have the sync after each iteration;

We don't have that yet, do we?

> the case I can think of is if we're doing async sending, we could have
> two versions of the same page in flight (one from each iteration) -
> you'd want those to get there in the right order.

From the MSG_ZEROCOPY document:

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.

So no matter whether we have a flush() per iteration before, it seems we may
want one when zero copy enabled?

Thanks,

-- 
Peter Xu




Re: [PATCH v3 1/9] nbd/client-connection: nbd_co_establish_connection(): fix non set errp

2021-09-07 Thread Eric Blake
On Mon, Sep 06, 2021 at 10:06:46PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> When we don't have a connection and blocking is false, we return NULL
> but don't set errp. That's wrong.

Oops...

> 
> We have two paths for calling nbd_co_establish_connection():
> 
> 1. nbd_open() -> nbd_do_establish_connection() -> ...
>   but that will never set blocking=false
> 
> 2. nbd_reconnect_attempt() -> nbd_co_do_establish_connection() -> ...
>   but that uses errp=NULL
> 
> So, we are safe with our wrong errp policy in
> nbd_co_establish_connection(). Still let's fix it.

...phew!  Thus, it's not critical to backport.

Reviewed-by: Eric Blake 

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  nbd/client-connection.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/nbd/client-connection.c b/nbd/client-connection.c
> index 7123b1e189..695f855754 100644
> --- a/nbd/client-connection.c
> +++ b/nbd/client-connection.c
> @@ -318,6 +318,7 @@ nbd_co_establish_connection(NBDClientConnection *conn, NBDExportInfo *info,
>  }
>  
>  if (!blocking) {
> +error_setg(errp, "No connection at the moment");
>  return NULL;
>  }
>  
> -- 
> 2.29.2
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




[PATCH] nbd/server: Allow LIST_META_CONTEXT without STRUCTURED_REPLY

2021-09-07 Thread Eric Blake
The NBD protocol just relaxed the requirements on
NBD_OPT_LIST_META_CONTEXT:

https://github.com/NetworkBlockDevice/nbd/commit/13a4e33a87

Since listing is not stateful (unlike SET_META_CONTEXT), we don't care
if a client asks for meta contexts without first requesting structured
replies.  Well-behaved clients will still ask for structured reply
first (if for no other reason than for back-compat to older servers),
but that's no reason to avoid this change.

Signed-off-by: Eric Blake 
---
 nbd/server.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/nbd/server.c b/nbd/server.c
index 1646796a4798..bd635cf8ffd1 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -980,7 +980,7 @@ static int nbd_negotiate_meta_queries(NBDClient *client,
 size_t i;
 size_t count = 0;

-if (!client->structured_reply) {
+if (client->opt == NBD_OPT_SET_META_CONTEXT && !client->structured_reply) {
 return nbd_opt_invalid(client, errp,
"request option '%s' when structured reply "
"is not negotiated",
-- 
2.31.1




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-07 Thread Peter Xu
On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> I also suggested something like that, but I thought it could be good if
> we could fall back to io_writev() if we didn't have the zerocopy feature
> (or the async feature).
> 
> What do you think?

That fallback looks safe and OK, I'm just not sure whether it'll be of
great help.  E.g. if we provide a QIO API that allows both sync write and
zero-copy write (and we do the fallback when necessary), the buffer
implication applies to this API too, so to me it's easier to just detect
the zero-copy capability and pick one of the two alternatives.  Thanks,

-- 
Peter Xu




Re: [PULL 00/11] Block patches

2021-09-07 Thread Peter Maydell
On Tue, 7 Sept 2021 at 09:14, Stefan Hajnoczi  wrote:
>
> The following changes since commit 88afdc92b644120e0182c8567e1b1d236e471b12:
>
>   Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging (2021-09-05 15:48:42 +0100)
>
> are available in the Git repository at:
>
>   https://gitlab.com/stefanha/qemu.git tags/block-pull-request
>
> for you to fetch changes up to 9bd2788f49c331b02372cc257b11e4c984d39708:
>
>   block/nvme: Only report VFIO error on failed retry (2021-09-07 09:08:24 +0100)
>
> 
> Pull request
>
> Userspace NVMe driver patches.


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/6.2
for any user-visible changes.

-- PMM



Re: [PATCH 0/4] hw/block/pflash_cfi01: Remove pflash_cfi01_get_memory()

2021-09-07 Thread Peter Maydell
On Tue, 7 Sept 2021 at 15:45, Philippe Mathieu-Daudé  wrote:
> The problems I see:
>
> - pflash_cfi01_get_memory() doesn't really document what it returns:
>   simply an internal MemoryRegion* of the pflash device. Nor do we
>   document that this is a ROMD device providing a RAM buffer
>   initialized by qemu_ram_alloc().
>
> - to update the flash content, we get the internal buffer via
>   memory_region_get_ram_ptr(). If the pflash implementation is
>   changed (i.e. reworked to expose an MR container) we break
>   everything.
>
> - memory_region_get_ram_ptr() doesn't do any check on the MR type,
>   it simply calls qemu_map_ram_ptr(mr->ram_block, offset).

Using memory_region_get_ram_ptr() is tricky to get right, too --
if you're writing directly to the underlying ram while the system is
running you need to use memory_region_flush_rom_device() to make
sure it's marked dirty. I think the current users of this on the
pflash devices get away with it because they do it during initial
machine init.

-- PMM



Re: [PATCH 0/4] hw/block/pflash_cfi01: Remove pflash_cfi01_get_memory()

2021-09-07 Thread Philippe Mathieu-Daudé
On 3/15/21 1:08 PM, Peter Maydell wrote:
> On Mon, 15 Mar 2021 at 11:34, Paolo Bonzini  wrote:
>> On 07/03/21 23:26, Philippe Mathieu-Daudé wrote:
>>> TYPE_PFLASH_CFI01 is a TYPE_SYS_BUS_DEVICE which registers its romd
>>> MemoryRegion with sysbus_init_mmio(), so we can use the generic
>>> sysbus_mmio_get_region() to get the region, no need for a specific
>>> pflash_cfi01_get_memory() helper.
>>>
>>> First replace the few pflash_cfi01_get_memory() uses by
>>> sysbus_mmio_get_region(), then remove the now unused helper.
>>
>> Why is this an improvement?  You're replacing nice and readable code
>> with an implementation-dependent function whose second argument is a
>> magic number.  The right patch would _add_ more of these helpers, not
>> remove them.
> 
> I agree that sysbus_mmio_get_region()'s use of arbitrary
> integers is unfortunate (we should look at improving that
> to use usefully named regions I guess), but I don't see
> why pflash_cfi01 should expose its MemoryRegion to users
> in a different way to every other sysbus device.

It is used that way (x86/pc):

if (i == 0) {
flash_mem = pflash_cfi01_get_memory(system_flash);
pc_isa_bios_init(rom_memory, flash_mem, size);

/* Encrypt the pflash boot ROM */
if (sev_enabled()) {
flash_ptr = memory_region_get_ram_ptr(flash_mem);
flash_size = memory_region_size(flash_mem);
/*
 * OVMF places a GUIDed structures in the flash, so
 * search for them
 */
pc_system_parse_ovmf_flash(flash_ptr, flash_size);

ret = sev_es_save_reset_vector(flash_ptr, flash_size);

The problems I see:

- pflash_cfi01_get_memory() doesn't really document what it returns:
  simply an internal MemoryRegion* of the pflash device. Nor do we
  document that this is a ROMD device providing a RAM buffer
  initialized by qemu_ram_alloc().

- to update the flash content, we get the internal buffer via
  memory_region_get_ram_ptr(). If the pflash implementation is
  changed (i.e. reworked to expose an MR container) we break
  everything.

- memory_region_get_ram_ptr() doesn't do any check on the MR type,
  it simply calls qemu_map_ram_ptr(mr->ram_block, offset).

I agree with Peter that pflash_cfi01_get_memory() is nothing special.

Now, if we want a safer way to access the pflash internal buffer, I'd
prefer an explicit function such as:

  /**
   * pflash_cfi01_get_ram_ptr_size: Return information on the RAMBlock
   * associated with the device, if any.
   *
   * @pfl: the flash device being queried.
   * @ptr: optional pointer to hold the RAM address associated with the
   *       RAMBlock
   * @size: optional pointer to hold the length of the RAMBlock
   *
   * Return: %true on success, %false on failure.
   */
  bool pflash_cfi01_get_ram_ptr_size(PFlashCFI01 *pfl,
                                     void **ptr, uint64_t *size);

Thoughts?



Re: [PATCH] qemu-img: Allow target be aligned to sector size

2021-09-07 Thread Vladimir Sementsov-Ogievskiy

07.09.2021 17:00, Hanna Reitz wrote:

On 07.09.21 15:44, Vladimir Sementsov-Ogievskiy wrote:

07.09.2021 15:48, Hanna Reitz wrote:

On 07.09.21 13:29, Vladimir Sementsov-Ogievskiy wrote:

19.08.2021 13:12, Hanna Reitz wrote:

We cannot write to images opened with O_DIRECT unless we allow them to
be resized so they are aligned to the sector size: Since 9c60a5d1978,
bdrv_node_refresh_perm() ensures that for nodes whose length is not
aligned to the request alignment and where someone has taken a WRITE
permission, the RESIZE permission is taken, too).

Let qemu-img convert pass the BDRV_O_RESIZE flag (which causes
blk_new_open() to take the RESIZE permission) when using cache=none for
the target, so that when writing to it, it can be aligned to the target
sector size.

Without this patch, an error is returned:

$ qemu-img convert -f raw -O raw -t none foo.img /mnt/tmp/foo.img
qemu-img: Could not open '/mnt/tmp/foo.img': Cannot get 'write'
permission without 'resize': Image size is not a multiple of request
alignment

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1994266
Signed-off-by: Hanna Reitz 
---
As I have written in the BZ linked above, I am not sure what behavior we
want.  It can be argued that the current behavior is perfectly OK
because we want the target to have the same size as the source, so if
this cannot be done, we should just print the above error (which I think
explains the problem well enough that users can figure out they need to
resize the source image).

OTOH, it is difficult for me to imagine a case where the user would
prefer the above error to just having qemu-img align the target image's
length.
---
  qemu-img.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index 908fd0cce5..d4b29bf73e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2628,6 +2628,14 @@ static int img_convert(int argc, char **argv)
         goto out;
     }
 
+    if (flags & BDRV_O_NOCACHE) {
+        /*
+         * If we open the target with O_DIRECT, it may be necessary to
+         * extend its size to align to the physical sector size.
+         */
+        flags |= BDRV_O_RESIZE;
+    }
+
     if (skip_create) {
         s.target = img_open(tgt_image_opts, out_filename, out_fmt,
                             flags, writethrough, s.quiet, false);



Hmm. Strange. I am experimenting now with master branch without that patch and 
can't reproduce:

[root@kvm master]# dd if=/dev/zero of=a bs=1M count=2
2+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.00153639 s, 1.4 GB/s
[root@kvm master]# dd if=/dev/zero of=b bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.000939314 s, 1.1 GB/s
[root@kvm master]# ls -lthr a b
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 a
-rw-r--r--. 1 root root 1.0M Sep  7 14:28 b
[root@kvm master]# ./build/qemu-img convert -f raw -O raw -t none a b
[root@kvm master]# ls -lthr a b
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 a
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 b
[root@kvm master]# dd if=/dev/zero of=b bs=1 count=2097147
2097147+0 records in
2097147+0 records out
2097147 bytes (2.1 MB, 2.0 MiB) copied, 2.76548 s, 758 kB/s
[root@kvm master]# ls -ltr a b
-rw-r--r--. 1 root root 2097152 Sep  7 14:28 a
-rw-r--r--. 1 root root 2097147 Sep  7 14:29 b
[root@kvm master]# ./build/qemu-img convert -f raw -O raw -t none a b
[root@kvm master]# ls -ltr a b
-rw-r--r--. 1 root root 2097152 Sep  7 14:28 a
-rw-r--r--. 1 root root 2097152 Sep  7 14:29 b

It just works, and does resize the target without any additional BDRV_O_RESIZE...


bdrv_getlength() invokes bdrv_nb_sectors(), so we always align sizes to 512 
anyway.  On volumes with a logical sector size of 512, this error cannot be 
reproduced.

You can use loop devices to get volumes with other sector sizes, like so:

$ cd /tmp
$ truncate fs.img -s 128M
$ sudo losetup -f --show --sector-size 4096 fs.img
/dev/loop0
$ sudo mkfs.ext4 /dev/loop0
mke2fs 1.46.4 (18-Aug-2021)
Discarding device blocks: done
Creating filesystem with 32768 4k blocks and 32768 inodes

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

$ sudo mount /dev/loop0 /mnt/tmp
$ qemu-img create -f raw source.img $((2 * 1048576 + 512))
Formatting 'source.img', fmt=raw size=2097664
$ sudo qemu-img convert -f raw -O raw -t none source.img /mnt/tmp/target.img
qemu-img: Could not open '/mnt/tmp/target.img': Cannot get 'write' permission 
without 'resize': Image size is not a multiple of request alignment




Does it mean that, when the logical sector size is 512, qemu-img does resize the
target without any 'resize' permission?


Well, it creates the target with the size of the source, and the size of the
source will always be reported to be aligned to 512.  If we didn’t have
bdrv_nb_sectors() but used a byte granularity for a BDS’s length internally,
then you could see this error with 512-byte-sectored volumes, too.


Re: [PATCH] qemu-img: Allow target be aligned to sector size

2021-09-07 Thread Hanna Reitz

On 07.09.21 15:44, Vladimir Sementsov-Ogievskiy wrote:

07.09.2021 15:48, Hanna Reitz wrote:

On 07.09.21 13:29, Vladimir Sementsov-Ogievskiy wrote:

19.08.2021 13:12, Hanna Reitz wrote:

We cannot write to images opened with O_DIRECT unless we allow them to
be resized so they are aligned to the sector size: Since 9c60a5d1978,
bdrv_node_refresh_perm() ensures that for nodes whose length is not
aligned to the request alignment and where someone has taken a WRITE
permission, the RESIZE permission is taken, too).

Let qemu-img convert pass the BDRV_O_RESIZE flag (which causes
blk_new_open() to take the RESIZE permission) when using cache=none for
the target, so that when writing to it, it can be aligned to the target
sector size.

Without this patch, an error is returned:

$ qemu-img convert -f raw -O raw -t none foo.img /mnt/tmp/foo.img
qemu-img: Could not open '/mnt/tmp/foo.img': Cannot get 'write'
permission without 'resize': Image size is not a multiple of request
alignment

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1994266
Signed-off-by: Hanna Reitz 
---
As I have written in the BZ linked above, I am not sure what behavior we
want.  It can be argued that the current behavior is perfectly OK
because we want the target to have the same size as the source, so if
this cannot be done, we should just print the above error (which I think
explains the problem well enough that users can figure out they need to
resize the source image).

OTOH, it is difficult for me to imagine a case where the user would
prefer the above error to just having qemu-img align the target image's
length.
---
  qemu-img.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index 908fd0cce5..d4b29bf73e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2628,6 +2628,14 @@ static int img_convert(int argc, char **argv)
         goto out;
     }
 
+    if (flags & BDRV_O_NOCACHE) {
+        /*
+         * If we open the target with O_DIRECT, it may be necessary to
+         * extend its size to align to the physical sector size.
+         */
+        flags |= BDRV_O_RESIZE;
+    }
+
     if (skip_create) {
         s.target = img_open(tgt_image_opts, out_filename, out_fmt,
                             flags, writethrough, s.quiet, false);



Hmm. Strange. I am experimenting now with master branch without that 
patch and can't reproduce:


[root@kvm master]# dd if=/dev/zero of=a bs=1M count=2
2+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.00153639 s, 1.4 GB/s
[root@kvm master]# dd if=/dev/zero of=b bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.000939314 s, 1.1 GB/s
[root@kvm master]# ls -lthr a b
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 a
-rw-r--r--. 1 root root 1.0M Sep  7 14:28 b
[root@kvm master]# ./build/qemu-img convert -f raw -O raw -t none a b
[root@kvm master]# ls -lthr a b
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 a
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 b
[root@kvm master]# dd if=/dev/zero of=b bs=1 count=2097147
2097147+0 records in
2097147+0 records out
2097147 bytes (2.1 MB, 2.0 MiB) copied, 2.76548 s, 758 kB/s
[root@kvm master]# ls -ltr a b
-rw-r--r--. 1 root root 2097152 Sep  7 14:28 a
-rw-r--r--. 1 root root 2097147 Sep  7 14:29 b
[root@kvm master]# ./build/qemu-img convert -f raw -O raw -t none a b
[root@kvm master]# ls -ltr a b
-rw-r--r--. 1 root root 2097152 Sep  7 14:28 a
-rw-r--r--. 1 root root 2097152 Sep  7 14:29 b

It just works, and does resize the target without any additional
BDRV_O_RESIZE...


bdrv_getlength() invokes bdrv_nb_sectors(), so we always align sizes 
to 512 anyway.  On volumes with a logical sector size of 512, this 
error cannot be reproduced.


You can use loop devices to get volumes with other sector sizes, like 
so:


$ cd /tmp
$ truncate fs.img -s 128M
$ sudo losetup -f --show --sector-size 4096 fs.img
/dev/loop0
$ sudo mkfs.ext4 /dev/loop0
mke2fs 1.46.4 (18-Aug-2021)
Discarding device blocks: done
Creating filesystem with 32768 4k blocks and 32768 inodes

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

$ sudo mount /dev/loop0 /mnt/tmp
$ qemu-img create -f raw source.img $((2 * 1048576 + 512))
Formatting 'source.img', fmt=raw size=2097664
$ sudo qemu-img convert -f raw -O raw -t none source.img /mnt/tmp/target.img
qemu-img: Could not open '/mnt/tmp/target.img': Cannot get 'write' 
permission without 'resize': Image size is not a multiple of request 
alignment





Does it mean that, when the logical sector size is 512, qemu-img does
resize the target without any 'resize' permission?


Well, it creates the target with the size of the source, and the size of 
the source will always be reported to be aligned to 512.  If we didn’t 
have bdrv_nb_sectors() but used a byte granularity for a BDS’s length 
internally, then you could see this error with 512-byte-sectored 
volumes, too.


Re: [PATCH] qemu-img: Allow target be aligned to sector size

2021-09-07 Thread Vladimir Sementsov-Ogievskiy

07.09.2021 15:48, Hanna Reitz wrote:

On 07.09.21 13:29, Vladimir Sementsov-Ogievskiy wrote:

19.08.2021 13:12, Hanna Reitz wrote:

We cannot write to images opened with O_DIRECT unless we allow them to
be resized so they are aligned to the sector size: Since 9c60a5d1978,
bdrv_node_refresh_perm() ensures that for nodes whose length is not
aligned to the request alignment and where someone has taken a WRITE
permission, the RESIZE permission is taken, too).

Let qemu-img convert pass the BDRV_O_RESIZE flag (which causes
blk_new_open() to take the RESIZE permission) when using cache=none for
the target, so that when writing to it, it can be aligned to the target
sector size.

Without this patch, an error is returned:

$ qemu-img convert -f raw -O raw -t none foo.img /mnt/tmp/foo.img
qemu-img: Could not open '/mnt/tmp/foo.img': Cannot get 'write'
permission without 'resize': Image size is not a multiple of request
alignment

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1994266
Signed-off-by: Hanna Reitz 
---
As I have written in the BZ linked above, I am not sure what behavior we
want.  It can be argued that the current behavior is perfectly OK
because we want the target to have the same size as the source, so if
this cannot be done, we should just print the above error (which I think
explains the problem well enough that users can figure out they need to
resize the source image).

OTOH, it is difficult for me to imagine a case where the user would
prefer the above error to just having qemu-img align the target image's
length.
---
  qemu-img.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index 908fd0cce5..d4b29bf73e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2628,6 +2628,14 @@ static int img_convert(int argc, char **argv)
         goto out;
     }
 
+    if (flags & BDRV_O_NOCACHE) {
+        /*
+         * If we open the target with O_DIRECT, it may be necessary to
+         * extend its size to align to the physical sector size.
+         */
+        flags |= BDRV_O_RESIZE;
+    }
+
     if (skip_create) {
         s.target = img_open(tgt_image_opts, out_filename, out_fmt,
                             flags, writethrough, s.quiet, false);



Hmm. Strange. I am experimenting now with master branch without that patch and 
can't reproduce:

[root@kvm master]# dd if=/dev/zero of=a bs=1M count=2
2+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.00153639 s, 1.4 GB/s
[root@kvm master]# dd if=/dev/zero of=b bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.000939314 s, 1.1 GB/s
[root@kvm master]# ls -lthr a b
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 a
-rw-r--r--. 1 root root 1.0M Sep  7 14:28 b
[root@kvm master]# ./build/qemu-img convert -f raw -O raw -t none a b
[root@kvm master]# ls -lthr a b
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 a
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 b
[root@kvm master]# dd if=/dev/zero of=b bs=1 count=2097147
2097147+0 records in
2097147+0 records out
2097147 bytes (2.1 MB, 2.0 MiB) copied, 2.76548 s, 758 kB/s
[root@kvm master]# ls -ltr a b
-rw-r--r--. 1 root root 2097152 Sep  7 14:28 a
-rw-r--r--. 1 root root 2097147 Sep  7 14:29 b
[root@kvm master]# ./build/qemu-img convert -f raw -O raw -t none a b
[root@kvm master]# ls -ltr a b
-rw-r--r--. 1 root root 2097152 Sep  7 14:28 a
-rw-r--r--. 1 root root 2097152 Sep  7 14:29 b

It just works, and does resize the target without any additional BDRV_O_RESIZE...


bdrv_getlength() invokes bdrv_nb_sectors(), so we always align sizes to 512 
anyway.  On volumes with a logical sector size of 512, this error cannot be 
reproduced.

You can use loop devices to get volumes with other sector sizes, like so:

$ cd /tmp
$ truncate fs.img -s 128M
$ sudo losetup -f --show --sector-size 4096 fs.img
/dev/loop0
$ sudo mkfs.ext4 /dev/loop0
mke2fs 1.46.4 (18-Aug-2021)
Discarding device blocks: done
Creating filesystem with 32768 4k blocks and 32768 inodes

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

$ sudo mount /dev/loop0 /mnt/tmp
$ qemu-img create -f raw source.img $((2 * 1048576 + 512))
Formatting 'source.img', fmt=raw size=2097664
$ sudo qemu-img convert -f raw -O raw -t none source.img /mnt/tmp/target.img
qemu-img: Could not open '/mnt/tmp/target.img': Cannot get 'write' permission 
without 'resize': Image size is not a multiple of request alignment




Does it mean that, when the logical sector size is 512, qemu-img does resize the
target without any 'resize' permission?

It looks very strange: in both cases we need to resize the target. With
512-byte sectors it just works (note that it resizes 1M to 2M, which is a lot
more than 512 bytes), while with 4096-byte sectors it needs the additional
BDRV_O_RESIZE.

--
Best regards,
Vladimir



[PATCH v4 10/12] mirror: Stop active mirroring after force-cancel

2021-09-07 Thread Hanna Reitz
Once the mirror job is force-cancelled (job_is_cancelled() is true), we
should not generate new I/O requests.  This applies to active mirroring,
too, so stop it once the job is cancelled.

(We must still forward all I/O requests to the source, of course, but
those are not really I/O requests generated by the job, so this is
fine.)

Signed-off-by: Hanna Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/mirror.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/mirror.c b/block/mirror.c
index bf1d50ff1c..af89c1716a 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1418,6 +1418,7 @@ static int coroutine_fn bdrv_mirror_top_do_write(BlockDriverState *bs,
     bool copy_to_target;
 
     copy_to_target = s->job->ret >= 0 &&
+                     !job_is_cancelled(&s->job->common.job) &&
                      s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
 
     if (copy_to_target) {
@@ -1466,6 +1467,7 @@ static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
     bool copy_to_target;
 
     copy_to_target = s->job->ret >= 0 &&
+                     !job_is_cancelled(&s->job->common.job) &&
                      s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
 
     if (copy_to_target) {
-- 
2.31.1




[PATCH v4 00/12] mirror: Handle errors after READY cancel

2021-09-07 Thread Hanna Reitz
Hi,

v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00705.html

v2 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00747.html

v3 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2021-08/msg00127.html


Changes in v4:
- Patch 1: Swap the order of aio_context_acquire() and job_unref() to
  save ourselves from using a local variable here (i.e. do it the same
  way as job_txn_apply())

- Patch 5:
  - Do not add a @force parameter to job_cancel_sync_all(): All callers
want to force-cancel all jobs when they use this function, because
what else would you want to do when you want to “cancel all jobs”.
So we don’t need a @force parameter here, and can unconditionally
invoke job_cancel_sync() with force=true.

  - Let the replication block driver force-cancel its backup job
(because it doesn’t make a difference, but it’s cleaner to
force-cancel jobs that don’t support any other cancellation
method).


git-backport-diff against v3:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/12:[0003] [FC] 'job: Context changes in job_completed_txn_abort()'
002/12:[----] [--] 'mirror: Keep s->synced on error'
003/12:[----] [--] 'mirror: Drop s->synced'
004/12:[----] [--] 'job: Force-cancel jobs in a failed transaction'
005/12:[0022] [FC] 'job: @force parameter for job_cancel_sync()'
006/12:[----] [--] 'jobs: Give Job.force_cancel more meaning'
007/12:[----] [--] 'job: Add job_cancel_requested()'
008/12:[----] [--] 'mirror: Use job_is_cancelled()'
009/12:[----] [--] 'mirror: Check job_is_cancelled() earlier'
010/12:[----] [--] 'mirror: Stop active mirroring after force-cancel'
011/12:[----] [--] 'mirror: Do not clear .cancelled'
012/12:[----] [--] 'iotests: Add mirror-ready-cancel-error test'


Hanna Reitz (12):
  job: Context changes in job_completed_txn_abort()
  mirror: Keep s->synced on error
  mirror: Drop s->synced
  job: Force-cancel jobs in a failed transaction
  job: @force parameter for job_cancel_sync()
  jobs: Give Job.force_cancel more meaning
  job: Add job_cancel_requested()
  mirror: Use job_is_cancelled()
  mirror: Check job_is_cancelled() earlier
  mirror: Stop active mirroring after force-cancel
  mirror: Do not clear .cancelled
  iotests: Add mirror-ready-cancel-error test

 include/qemu/job.h                            |  29 +++-
 block/backup.c                                |   3 +-
 block/mirror.c                                |  56 ---
 block/replication.c                           |   4 +-
 blockdev.c                                    |   4 +-
 job.c                                         |  64 ++--
 tests/unit/test-blockjob.c                    |   2 +-
 tests/qemu-iotests/109.out                    |  60 +++-
 .../tests/mirror-ready-cancel-error           | 143 ++
 .../tests/mirror-ready-cancel-error.out       |   5 +
 tests/qemu-iotests/tests/qsd-jobs.out         |   2 +-
 11 files changed, 286 insertions(+), 86 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

-- 
2.31.1




Re: [PATCH] qemu-img: Allow target be aligned to sector size

2021-09-07 Thread Hanna Reitz

On 07.09.21 13:29, Vladimir Sementsov-Ogievskiy wrote:

19.08.2021 13:12, Hanna Reitz wrote:

We cannot write to images opened with O_DIRECT unless we allow them to
be resized so they are aligned to the sector size: Since 9c60a5d1978,
bdrv_node_refresh_perm() ensures that for nodes whose length is not
aligned to the request alignment and where someone has taken a WRITE
permission, the RESIZE permission is taken, too).

Let qemu-img convert pass the BDRV_O_RESIZE flag (which causes
blk_new_open() to take the RESIZE permission) when using cache=none for
the target, so that when writing to it, it can be aligned to the target
sector size.

Without this patch, an error is returned:

$ qemu-img convert -f raw -O raw -t none foo.img /mnt/tmp/foo.img
qemu-img: Could not open '/mnt/tmp/foo.img': Cannot get 'write'
permission without 'resize': Image size is not a multiple of request
alignment

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1994266
Signed-off-by: Hanna Reitz 
---
As I have written in the BZ linked above, I am not sure what behavior we
want.  It can be argued that the current behavior is perfectly OK
because we want the target to have the same size as the source, so if
this cannot be done, we should just print the above error (which I think
explains the problem well enough that users can figure out they need to
resize the source image).

OTOH, it is difficult for me to imagine a case where the user would
prefer the above error to just having qemu-img align the target image's
length.
---
  qemu-img.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index 908fd0cce5..d4b29bf73e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2628,6 +2628,14 @@ static int img_convert(int argc, char **argv)
         goto out;
     }
 
+    if (flags & BDRV_O_NOCACHE) {
+        /*
+         * If we open the target with O_DIRECT, it may be necessary to
+         * extend its size to align to the physical sector size.
+         */
+        flags |= BDRV_O_RESIZE;
+    }
+
     if (skip_create) {
         s.target = img_open(tgt_image_opts, out_filename, out_fmt,
                             flags, writethrough, s.quiet, false);



Hmm. Strange. I am experimenting now with master branch without that 
patch and can't reproduce:


[root@kvm master]# dd if=/dev/zero of=a bs=1M count=2
2+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.00153639 s, 1.4 GB/s
[root@kvm master]# dd if=/dev/zero of=b bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.000939314 s, 1.1 GB/s
[root@kvm master]# ls -lthr a b
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 a
-rw-r--r--. 1 root root 1.0M Sep  7 14:28 b
[root@kvm master]# ./build/qemu-img convert -f raw -O raw -t none a b
[root@kvm master]# ls -lthr a b
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 a
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 b
[root@kvm master]# dd if=/dev/zero of=b bs=1 count=2097147
2097147+0 records in
2097147+0 records out
2097147 bytes (2.1 MB, 2.0 MiB) copied, 2.76548 s, 758 kB/s
[root@kvm master]# ls -ltr a b
-rw-r--r--. 1 root root 2097152 Sep  7 14:28 a
-rw-r--r--. 1 root root 2097147 Sep  7 14:29 b
[root@kvm master]# ./build/qemu-img convert -f raw -O raw -t none a b
[root@kvm master]# ls -ltr a b
-rw-r--r--. 1 root root 2097152 Sep  7 14:28 a
-rw-r--r--. 1 root root 2097152 Sep  7 14:29 b

It just works, and does resize the target without any additional
BDRV_O_RESIZE...


bdrv_getlength() invokes bdrv_nb_sectors(), so we always align sizes to 
512 anyway.  On volumes with a logical sector size of 512, this error 
cannot be reproduced.


You can use loop devices to get volumes with other sector sizes, like so:

$ cd /tmp
$ truncate fs.img -s 128M
$ sudo losetup -f --show --sector-size 4096 fs.img
/dev/loop0
$ sudo mkfs.ext4 /dev/loop0
mke2fs 1.46.4 (18-Aug-2021)
Discarding device blocks: done
Creating filesystem with 32768 4k blocks and 32768 inodes

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

$ sudo mount /dev/loop0 /mnt/tmp
$ qemu-img create -f raw source.img $((2 * 1048576 + 512))
Formatting 'source.img', fmt=raw size=2097664
$ sudo qemu-img convert -f raw -O raw -t none source.img /mnt/tmp/target.img
qemu-img: Could not open '/mnt/tmp/target.img': Cannot get 'write' 
permission without 'resize': Image size is not a multiple of request 
alignment





[PATCH v4 11/12] mirror: Do not clear .cancelled

2021-09-07 Thread Hanna Reitz
Clearing .cancelled before leaving the main loop when the job has been
soft-cancelled is no longer necessary since job_is_cancelled() only
returns true for jobs that have been force-cancelled.

Therefore, this only makes a difference in places that call
job_cancel_requested().  In block/mirror.c, this is done only before
.cancelled was cleared.

In job.c, there are two callers:
- job_completed_txn_abort() asserts that .cancelled is true, so keeping
  it true will not affect this place.

- job_complete() refuses to let a job complete that has .cancelled set.
  It is correct to refuse to let the user invoke job-complete on mirror
  jobs that have already been soft-cancelled.

With this change, there are no places that reset .cancelled to false and
so we can be sure that .force_cancel can only be true if .cancelled is
true as well.  Assert this in job_is_cancelled().

Signed-off-by: Hanna Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/mirror.c | 2 --
 job.c  | 4 +++-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index af89c1716a..f94aa52fae 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -939,7 +939,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         while (!job_cancel_requested(&s->common.job) && !s->should_complete) {
             job_yield(&s->common.job);
         }
-        s->common.job.cancelled = false;
         goto immediate_exit;
     }
 
@@ -1078,7 +1077,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
              * completion.
              */
             assert(QLIST_EMPTY(&s->tracked_requests));
-            s->common.job.cancelled = false;
             need_drain = false;
             break;
         }
diff --git a/job.c b/job.c
index be878ca5fc..85c0216734 100644
--- a/job.c
+++ b/job.c
@@ -217,7 +217,9 @@ const char *job_type_str(const Job *job)
 
 bool job_is_cancelled(Job *job)
 {
-    return job->cancelled && job->force_cancel;
+    /* force_cancel may be true only if cancelled is true, too */
+    assert(job->cancelled || !job->force_cancel);
+    return job->force_cancel;
 }
 
 bool job_cancel_requested(Job *job)
-- 
2.31.1
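The invariant and the two predicates this patch ends up with can be sketched in a few lines of Python. This is a toy model for illustration only, not QEMU code; the `Job` class and its fields merely mimic the C struct:

```python
# Toy model of the invariant from this patch (illustration, not QEMU code):
# .force_cancel may be true only if .cancelled is true, too.

class Job:
    def __init__(self, cancelled=False, force_cancel=False):
        self.cancelled = cancelled        # some cancel was requested
        self.force_cancel = force_cancel  # job terminates without further I/O

def job_is_cancelled(job):
    # force_cancel may be true only if cancelled is true, too
    assert job.cancelled or not job.force_cancel
    return job.force_cancel

def job_cancel_requested(job):
    return job.cancelled

# A mirror job soft-cancelled while READY: a cancel was requested,
# but the job does not count as cancelled.
soft = Job(cancelled=True, force_cancel=False)
# A force-cancelled job: both predicates are true.
hard = Job(cancelled=True, force_cancel=True)

print(job_is_cancelled(soft), job_cancel_requested(soft))  # False True
print(job_is_cancelled(hard), job_cancel_requested(hard))  # True True
```

Since nothing resets `.cancelled` to false any more, the assertion can never fire for states the job layer actually produces.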




[PATCH v4 12/12] iotests: Add mirror-ready-cancel-error test

2021-09-07 Thread Hanna Reitz
Test what happens when there is an I/O error after a mirror job in the
READY phase has been cancelled.

Signed-off-by: Hanna Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Tested-by: Vladimir Sementsov-Ogievskiy 
---
 .../tests/mirror-ready-cancel-error   | 143 ++
 .../tests/mirror-ready-cancel-error.out   |   5 +
 2 files changed, 148 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

diff --git a/tests/qemu-iotests/tests/mirror-ready-cancel-error b/tests/qemu-iotests/tests/mirror-ready-cancel-error
new file mode 100755
index 00..f2dc1f
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-ready-cancel-error
@@ -0,0 +1,143 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when errors occur to a mirror job after it has
+# been cancelled in the READY phase
+#
+# Copyright (C) 2021 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+
+
+image_size = 1 * 1024 * 1024
+source = os.path.join(iotests.test_dir, 'source.img')
+target = os.path.join(iotests.test_dir, 'target.img')
+
+
+class TestMirrorReadyCancelError(iotests.QMPTestCase):
+    def setUp(self) -> None:
+        assert iotests.qemu_img_create('-f', iotests.imgfmt, source,
+                                       str(image_size)) == 0
+        assert iotests.qemu_img_create('-f', iotests.imgfmt, target,
+                                       str(image_size)) == 0
+
+        self.vm = iotests.VM()
+        self.vm.launch()
+
+    def tearDown(self) -> None:
+        self.vm.shutdown()
+        os.remove(source)
+        os.remove(target)
+
+    def add_blockdevs(self, once: bool) -> None:
+        res = self.vm.qmp('blockdev-add',
+                          **{'node-name': 'source',
+                             'driver': iotests.imgfmt,
+                             'file': {
+                                 'driver': 'file',
+                                 'filename': source
+                             }})
+        self.assert_qmp(res, 'return', {})
+
+        # blkdebug notes:
+        # Enter state 2 on the first flush, which happens before the
+        # job enters the READY state.  The second flush will happen
+        # when the job is about to complete, and we want that one to
+        # fail.
+        res = self.vm.qmp('blockdev-add',
+                          **{'node-name': 'target',
+                             'driver': iotests.imgfmt,
+                             'file': {
+                                 'driver': 'blkdebug',
+                                 'image': {
+                                     'driver': 'file',
+                                     'filename': target
+                                 },
+                                 'set-state': [{
+                                     'event': 'flush_to_disk',
+                                     'state': 1,
+                                     'new_state': 2
+                                 }],
+                                 'inject-error': [{
+                                     'event': 'flush_to_disk',
+                                     'once': once,
+                                     'immediately': True,
+                                     'state': 2
+                                 }]}})
+        self.assert_qmp(res, 'return', {})
+
+    def start_mirror(self) -> None:
+        res = self.vm.qmp('blockdev-mirror',
+                          job_id='mirror',
+                          device='source',
+                          target='target',
+                          filter_node_name='mirror-top',
+                          sync='full',
+                          on_target_error='stop')
+        self.assert_qmp(res, 'return', {})
+
+    def cancel_mirror_with_error(self) -> None:
+        self.vm.event_wait('BLOCK_JOB_READY')
+
+        # Write something so the job will not leave immediately, but
+        # flush first (which will fail, thanks to blkdebug)
+        res = self.vm.qmp('human-monitor-command',
+                          command_line='qemu-io mirror-top "write 0 64k"')
+        self.assert_qmp(res, 'return', '')
+
+        # Drain status change events
+        while self.vm.event_wait('JOB_STATUS_CHANGE', timeout=0.0) is

[PATCH v4 09/12] mirror: Check job_is_cancelled() earlier

2021-09-07 Thread Hanna Reitz
We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

Jobs can be cancelled while they yield, and once they are
(force-cancelled), they should not generate new I/O requests.
Therefore, we should put the check after the last yield before
mirror_iteration() is invoked.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/mirror.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 024fa2dcea..bf1d50ff1c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1000,6 +1000,11 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 
         job_pause_point(&s->common.job);
 
+        if (job_is_cancelled(&s->common.job)) {
+            ret = 0;
+            goto immediate_exit;
+        }
+
         cnt = bdrv_get_dirty_count(s->dirty_bitmap);
         /* cnt is the number of dirty bytes remaining and s->bytes_in_flight is
          * the number of bytes currently being processed; together those are
@@ -1078,8 +1083,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
             break;
         }
 
-        ret = 0;
-
         if (job_is_ready(&s->common.job) && !should_complete) {
             delay_ns = (s->in_flight == 0 &&
                         cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
@@ -1087,9 +1090,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
                                   delay_ns);
         job_sleep_ns(&s->common.job, delay_ns);
-        if (job_is_cancelled(&s->common.job)) {
-            break;
-        }
         s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     }
 
-- 
2.31.1
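The placement issue fixed above can be illustrated with a toy loop (hypothetical Python, not the mirror code): if the only cancel check sits after a `continue`, a path that keeps hitting that `continue` never observes the cancel request.

```python
def mirror_loop(flush_results, cancelled_at, check_early):
    """Toy model of mirror_run()'s main loop.  flush_results[i] says whether
    iteration i's mirror_flush() succeeds; the job is force-cancelled just
    before iteration `cancelled_at`.  Returns the iteration at which the
    cancellation is observed, or None if it never is."""
    for i, flush_ok in enumerate(flush_results):
        cancelled = i >= cancelled_at
        if check_early and cancelled:   # this patch: check before any `continue`
            return i
        if not flush_ok:
            continue                    # old code could spin here forever
        if cancelled:                   # old, late check
            return i
    return None

# Every flush fails; the job is cancelled before iteration 2:
print(mirror_loop([False] * 5, 2, check_early=True))   # 2
print(mirror_loop([False] * 5, 2, check_early=False))  # None
```

With the early check, force-cancellation terminates the loop even while `mirror_flush()` keeps failing, which is the behaviour the patch restores.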




[PATCH v4 08/12] mirror: Use job_is_cancelled()

2021-09-07 Thread Hanna Reitz
mirror_drained_poll() returns true whenever the job is cancelled,
because "we [can] be sure that it won't issue more requests".  However,
this is only true for force-cancelled jobs, so use job_is_cancelled().

Signed-off-by: Hanna Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/mirror.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 72e02fa34e..024fa2dcea 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1177,7 +1177,7 @@ static bool mirror_drained_poll(BlockJob *job)
  * from one of our own drain sections, to avoid a deadlock waiting for
  * ourselves.
  */
-    if (!s->common.job.paused && !s->common.job.cancelled && !s->in_drain) {
+    if (!s->common.job.paused && !job_is_cancelled(&job->job) && !s->in_drain) {
         return true;
     }
 
-- 
2.31.1




[PATCH v4 05/12] job: @force parameter for job_cancel_sync()

2021-09-07 Thread Hanna Reitz
Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception,
specifically the active commit job it runs.

As for job_cancel_sync_all(), all callers want it to force-cancel all
jobs, because that is the point of it: To cancel all remaining jobs as
quickly as possible (generally on process termination).  So make it
invoke job_cancel_sync() with force=true.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz 
---
 include/qemu/job.h| 10 ++---
 block/replication.c   |  4 +-
 blockdev.c|  4 +-
 job.c | 18 ++--
 tests/unit/test-blockjob.c|  2 +-
 tests/qemu-iotests/109.out| 60 +++
 tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
 7 files changed, 50 insertions(+), 50 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 41162ed494..2eddf3b536 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -506,18 +506,18 @@ void job_user_cancel(Job *job, bool force, Error **errp);
 
 /**
  * Synchronously cancel the @job.  The completion callback is called
- * before the function returns.  The job may actually complete
- * instead of canceling itself; the circumstances under which this
- * happens depend on the kind of job that is active.
+ * before the function returns.  If @force is false, the job may
+ * actually complete instead of canceling itself; the circumstances
+ * under which this happens depend on the kind of job that is active.
  *
  * Returns the return value from the job if the job actually completed
  * during the call, or -ECANCELED if it was canceled.
  *
  * Callers must hold the AioContext lock of job->aio_context.
  */
-int job_cancel_sync(Job *job);
+int job_cancel_sync(Job *job, bool force);
 
-/** Synchronously cancels all jobs using job_cancel_sync(). */
+/** Synchronously force-cancels all jobs using job_cancel_sync(). */
 void job_cancel_sync_all(void);
 
 /**
diff --git a/block/replication.c b/block/replication.c
index 32444b9a8f..55c8f894aa 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -149,7 +149,7 @@ static void replication_close(BlockDriverState *bs)
     if (s->stage == BLOCK_REPLICATION_FAILOVER) {
         commit_job = &s->commit_job->job;
         assert(commit_job->aio_context == qemu_get_current_aio_context());
-        job_cancel_sync(commit_job);
+        job_cancel_sync(commit_job, false);
     }
 
     if (s->mode == REPLICATION_MODE_SECONDARY) {
@@ -726,7 +726,7 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
          * disk, secondary disk in backup_job_completed().
          */
         if (s->backup_job) {
-            job_cancel_sync(&s->backup_job->job);
+            job_cancel_sync(&s->backup_job->job, true);
         }
 
         if (!failover) {
diff --git a/blockdev.c b/blockdev.c
index e79c5f3b5e..b35072644e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1847,7 +1847,7 @@ static void drive_backup_abort(BlkActionState *common)
         aio_context = bdrv_get_aio_context(state->bs);
         aio_context_acquire(aio_context);
 
-        job_cancel_sync(&state->job->job);
+        job_cancel_sync(&state->job->job, true);
 
         aio_context_release(aio_context);
     }
@@ -1948,7 +1948,7 @@ static void blockdev_backup_abort(BlkActionState *common)
         aio_context = bdrv_get_aio_context(state->bs);
         aio_context_acquire(aio_context);
 
-        job_cancel_sync(&state->job->job);
+        job_cancel_sync(&state->job->job, true);
 
         aio_context_release(aio_context);
     }
diff --git a/job.c b/job.c
index e74d81928d..dfac35d553 100644
--- a/job.c
+++ b/job.c
@@ -981,9 +981,21 @@ static void job_cancel_err(Job *job, Error **errp)
     job_cancel(job, false);
 }
 
-int job_cancel_sync(Job *job)
+/**
+ * Same as job_cancel_err(), but force-cancel.
+ */
+static void job_force_cancel_err(Job *job, Error **errp)
 {
-    return job_finish_sync(job, &job_cancel_err, NULL);
+    job_cancel(job, true);
+}
+
+int job_cancel_sync(Job *job, bool force)
+{
+    if (force) {
+        return job_finish_sync(job, &job_force_cancel_err, NULL);
+    } else {
+        return job_finish_sync(job, &job_cancel_err, NULL);
+    }
 }
 
 void job_cancel_sync_all(void)
@@ -994,7 +1006,7 @@ void job_cancel_sync_all(void)
 while ((job = 

[PATCH v4 06/12] jobs: Give Job.force_cancel more meaning

2021-09-07 Thread Hanna Reitz
We largely have two cancel modes for jobs:

First, there is actual cancelling.  The job is terminated as soon as
possible, without trying to reach a consistent result.

Second, we have mirror in the READY state.  Technically, the job is not
really cancelled, but it just is a different completion mode.  The job
can still run for an indefinite amount of time while it tries to reach a
consistent result.

We want to be able to clearly distinguish which cancel mode a job is in
(when it has been cancelled).  We can use Job.force_cancel for this, but
right now it only reflects cancel requests from the user with
force=true, but clearly, jobs that do not even distinguish between
force=false and force=true are effectively always force-cancelled.

So this patch has Job.force_cancel signify whether the job will
terminate as soon as possible (force_cancel=true) or whether it will
effectively remain running despite being "cancelled"
(force_cancel=false).

To this end, we let jobs that provide JobDriver.cancel() tell the
generic job code whether they will terminate as soon as possible or not,
and for jobs that do not provide that method we assume they will.

Signed-off-by: Hanna Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
---
 include/qemu/job.h | 11 ++-
 block/backup.c |  3 ++-
 block/mirror.c | 24 ++--
 job.c  |  6 +-
 4 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 2eddf3b536..90f6abbd6a 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -253,8 +253,17 @@ struct JobDriver {
 
     /**
      * If the callback is not NULL, it will be invoked in job_cancel_async
+     *
+     * This function must return true if the job will be cancelled
+     * immediately without any further I/O (mandatory if @force is
+     * true), and false otherwise.  This lets the generic job layer
+     * know whether a job has been truly (force-)cancelled, or whether
+     * it is just in a special completion mode (like mirror after
+     * READY).
+     * (If the callback is NULL, the job is assumed to terminate
+     * without I/O.)
      */
-    void (*cancel)(Job *job, bool force);
+    bool (*cancel)(Job *job, bool force);
 
     /** Called when the job is freed */
diff --git a/block/backup.c b/block/backup.c
index 687d2882bc..e8a13f9178 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -327,11 +327,12 @@ static void coroutine_fn backup_set_speed(BlockJob *job, int64_t speed)
     }
 }
 
-static void backup_cancel(Job *job, bool force)
+static bool backup_cancel(Job *job, bool force)
 {
     BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
 
     bdrv_cancel_in_flight(s->target_bs);
+    return true;
 }
 
 static const BlockJobDriver backup_job_driver = {
diff --git a/block/mirror.c b/block/mirror.c
index fcb7b65f93..e93631a9f6 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1087,9 +1087,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
                                   delay_ns);
         job_sleep_ns(&s->common.job, delay_ns);
-        if (job_is_cancelled(&s->common.job) &&
-            (!job_is_ready(&s->common.job) || s->common.job.force_cancel))
-        {
+        if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
             break;
         }
         s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1102,7 +1100,7 @@ immediate_exit:
          * the target is a copy of the source.
          */
         assert(ret < 0 ||
-               ((s->common.job.force_cancel || !job_is_ready(&s->common.job)) &&
+               (s->common.job.force_cancel &&
                 job_is_cancelled(&s->common.job)));
         assert(need_drain);
         mirror_wait_for_all_io(s);
@@ -1188,14 +1186,27 @@ static bool mirror_drained_poll(BlockJob *job)
     return !!s->in_flight;
 }
 
-static void mirror_cancel(Job *job, bool force)
+static bool mirror_cancel(Job *job, bool force)
 {
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
     BlockDriverState *target = blk_bs(s->target);
 
-    if (force || !job_is_ready(job)) {
+    /*
+     * Before the job is READY, we treat any cancellation like a
+     * force-cancellation.
+     */
+    force = force || !job_is_ready(job);
+
+    if (force) {
         bdrv_cancel_in_flight(target);
     }
+    return force;
+}
+
+static bool commit_active_cancel(Job *job, bool force)
+{
+    /* Same as above in mirror_cancel() */
+    return force || !job_is_ready(job);
 }
 
 static const BlockJobDriver mirror_job_driver = {
@@ -1225,6 +1236,7 @@ static const BlockJobDriver commit_active_job_driver = {
 .abort  = mirror_abort,
 .pause  = mirror_pause,
 .complete   = mirror_complete,
+.cancel = 

[PATCH v4 02/12] mirror: Keep s->synced on error

2021-09-07 Thread Hanna Reitz
An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_sync is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Hanna Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
---
 block/mirror.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 98fc66eabf..d73b704473 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -121,7 +121,6 @@ typedef enum MirrorMethod {
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
                                             int error)
 {
-    s->synced = false;
     s->actively_synced = false;
     if (read) {
         return block_job_error_action(&s->common, s->on_source_error,
-- 
2.31.1




[PATCH v4 07/12] job: Add job_cancel_requested()

2021-09-07 Thread Hanna Reitz
Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_requested() as the general variant, which returns true for
any jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Finally, here is a justification for how different job_is_cancelled()
invocations are treated by this patch:

- block/mirror.c (mirror_run()):
  - The first invocation is a while loop that should loop until the job
has been cancelled or scheduled for completion.  What kind of cancel
does not matter, only the fact that the job is supposed to end.

  - The second invocation wants to know whether the job has been
soft-cancelled.  Calling job_cancel_requested() is a bit too broad,
but if the job were force-cancelled, we should leave the main loop
as soon as possible anyway, so this should not matter here.

  - The last two invocations already check force_cancel, so they should
continue to use job_is_cancelled().

- block/backup.c, block/commit.c, block/stream.c, anything in tests/:
  These jobs know only force-cancel, so there is no difference between
  job_is_cancelled() and job_cancel_requested().  We can continue using
  job_is_cancelled().

- job.c:
  - job_pause_point(), job_yield(), job_sleep_ns(): Only force-cancelled
jobs should be prevented from being paused.  Continue using 
job_is_cancelled().

  - job_update_rc(), job_finalize_single(), job_finish_sync(): These
functions are all called after the job has left its main loop.  The
mirror job (the only job that can be soft-cancelled) will clear
.cancelled before leaving the main loop if it has been
soft-cancelled.  Therefore, these functions will observe .cancelled
to be true only if the job has been force-cancelled.  We can
continue to use job_is_cancelled().
(Furthermore, conceptually, a soft-cancelled mirror job should not
report to have been cancelled.  It should report completion (see
also the block-job-cancel QAPI documentation).  Therefore, it makes
sense for these functions not to distinguish between a
soft-cancelled mirror job and a job that has completed as normal.)

  - job_completed_txn_abort(): All jobs other than @job have been
force-cancelled.  job_is_cancelled() must be true for them.
Regarding @job itself: job_completed_txn_abort() is mostly called
when the job's return value is not 0.  A soft-cancelled mirror has a
return value of 0, and so will not end up here then.
However, job_cancel() invokes job_completed_txn_abort() if the job
has been deferred to the main loop, which is mostly the case for
completed jobs (which skip the assertion), but not for sure.
To be safe, use job_cancel_requested() in this assertion.

  - job_complete(): This is the function eventually invoked by the user
(through qmp_block_job_complete() or qmp_job_complete(), or
job_complete_sync(), which comes from qemu-img).  The intention here
is to prevent a user from invoking job-complete after the job has
been cancelled.  This should also apply to soft cancelling: After a
mirror job has been soft-cancelled, the user should not be able to
decide otherwise and have it complete as normal (i.e. pivoting to
the target).

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/qemu/job.h |  8 +++-
 block/mirror.c | 10 --
 job.c  |  9 +++--
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 90f6abbd6a..6e67b6977f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
 /** Returns true if the job should not be visible to the management layer. */
 bool job_is_internal(Job *job);
 
-/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
 bool job_is_cancelled(Job *job);
 
+/**
+ * Returns whether the job is scheduled for cancellation 

[PATCH v4 03/12] mirror: Drop s->synced

2021-09-07 Thread Hanna Reitz
As of HEAD^, there is no meaning to s->synced other than whether the job
is READY or not.  job_is_ready() gives us that information, too.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Hanna Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
---
 block/mirror.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index d73b704473..fcb7b65f93 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -56,7 +56,6 @@ typedef struct MirrorBlockJob {
 bool zero_target;
 MirrorCopyMode copy_mode;
 BlockdevOnError on_source_error, on_target_error;
-bool synced;
 /* Set when the target is synced (dirty bitmap is clean, nothing
  * in flight) and the job is running in active mode */
 bool actively_synced;
@@ -936,7 +935,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
     if (s->bdev_length == 0) {
         /* Transition to the READY state and wait for complete. */
         job_transition_to_ready(&s->common.job);
-        s->synced = true;
         s->actively_synced = true;
         while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
             job_yield(&s->common.job);
@@ -1028,7 +1026,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         should_complete = false;
         if (s->in_flight == 0 && cnt == 0) {
             trace_mirror_before_flush(s);
-            if (!s->synced) {
+            if (!job_is_ready(&s->common.job)) {
                 if (mirror_flush(s) < 0) {
                     /* Go check s->ret.  */
                     continue;
@@ -1039,7 +1037,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
                  * the target in a consistent state.
                  */
                 job_transition_to_ready(&s->common.job);
-                s->synced = true;
                 if (s->copy_mode != MIRROR_COPY_MODE_BACKGROUND) {
                     s->actively_synced = true;
                 }
@@ -1083,14 +1080,15 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 
         ret = 0;
 
-        if (s->synced && !should_complete) {
+        if (job_is_ready(&s->common.job) && !should_complete) {
             delay_ns = (s->in_flight == 0 &&
                         cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
         }
-        trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
+        trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
+                                  delay_ns);
         job_sleep_ns(&s->common.job, delay_ns);
         if (job_is_cancelled(&s->common.job) &&
-            (!s->synced || s->common.job.force_cancel))
+            (!job_is_ready(&s->common.job) || s->common.job.force_cancel))
         {
             break;
         }
@@ -1103,8 +1101,9 @@ immediate_exit:
          * or it was cancelled prematurely so that we do not guarantee that
          * the target is a copy of the source.
          */
-        assert(ret < 0 || ((s->common.job.force_cancel || !s->synced) &&
-                           job_is_cancelled(&s->common.job)));
+        assert(ret < 0 ||
+               ((s->common.job.force_cancel || !job_is_ready(&s->common.job)) &&
+                job_is_cancelled(&s->common.job)));
         assert(need_drain);
         mirror_wait_for_all_io(s);
     }
@@ -1127,7 +1126,7 @@ static void mirror_complete(Job *job, Error **errp)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 
-if (!s->synced) {
+if (!job_is_ready(job)) {
 error_setg(errp, "The active block job '%s' cannot be completed",
job->id);
 return;
-- 
2.31.1




[PATCH v4 04/12] job: Force-cancel jobs in a failed transaction

2021-09-07 Thread Hanna Reitz
When a transaction is aborted, no result matters, and so all jobs within
should be force-cancelled.

Signed-off-by: Hanna Reitz 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 job.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/job.c b/job.c
index 810e6a2065..e74d81928d 100644
--- a/job.c
+++ b/job.c
@@ -766,7 +766,12 @@ static void job_completed_txn_abort(Job *job)
 if (other_job != job) {
 ctx = other_job->aio_context;
 aio_context_acquire(ctx);
-            job_cancel_async(other_job, false);
+            /*
+             * This is a transaction: If one job failed, no result will matter.
+             * Therefore, pass force=true to terminate all other jobs as quickly
+             * as possible.
+             */
+            job_cancel_async(other_job, true);
 aio_context_release(ctx);
 }
 }
-- 
2.31.1




[PATCH v4 01/12] job: Context changes in job_completed_txn_abort()

2021-09-07 Thread Hanna Reitz
Finalizing the job may cause its AioContext to change.  This is noted by
job_exit(), which points at job_txn_apply() to take this fact into
account.

However, job_completed() does not necessarily invoke job_txn_apply()
(through job_completed_txn_success()), but potentially also
job_completed_txn_abort().  The latter stores the context in a local
variable, and so always acquires the same context at its end that it has
released in the beginning -- which may be a different context from the
one that job_exit() releases at its end.  If it is different, qemu
aborts ("qemu_mutex_unlock_impl: Operation not permitted").

Drop the local @outer_ctx variable from job_completed_txn_abort(), and
instead re-acquire the actual job's context at the end of the function,
so job_exit() will release the same.

Signed-off-by: Hanna Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 job.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/job.c b/job.c
index e7a5d28854..810e6a2065 100644
--- a/job.c
+++ b/job.c
@@ -737,7 +737,6 @@ static void job_cancel_async(Job *job, bool force)
 
 static void job_completed_txn_abort(Job *job)
 {
-AioContext *outer_ctx = job->aio_context;
 AioContext *ctx;
 JobTxn *txn = job->txn;
 Job *other_job;
@@ -751,10 +750,14 @@ static void job_completed_txn_abort(Job *job)
 txn->aborting = true;
 job_txn_ref(txn);
 
-/* We can only hold the single job's AioContext lock while calling
+/*
+ * We can only hold the single job's AioContext lock while calling
  * job_finalize_single() because the finalization callbacks can involve
- * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. */
-aio_context_release(outer_ctx);
+ * calls of AIO_WAIT_WHILE(), which could deadlock otherwise.
+ * Note that the job's AioContext may change when it is finalized.
+ */
+job_ref(job);
+aio_context_release(job->aio_context);
 
 /* Other jobs are effectively cancelled by us, set the status for
  * them; this job, however, may or may not be cancelled, depending
@@ -769,6 +772,10 @@ static void job_completed_txn_abort(Job *job)
 }
     while (!QLIST_EMPTY(&txn->jobs)) {
         other_job = QLIST_FIRST(&txn->jobs);
+/*
+ * The job's AioContext may change, so store it in @ctx so we
+ * release the same context that we have acquired before.
+ */
 ctx = other_job->aio_context;
 aio_context_acquire(ctx);
 if (!job_is_completed(other_job)) {
@@ -779,7 +786,12 @@ static void job_completed_txn_abort(Job *job)
 aio_context_release(ctx);
 }
 
-aio_context_acquire(outer_ctx);
+/*
+ * Use job_ref()/job_unref() so we can read the AioContext here
+ * even if the job went away during job_finalize_single().
+ */
+aio_context_acquire(job->aio_context);
+job_unref(job);
 
 job_txn_unref(txn);
 }
-- 
2.31.1
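The locking hazard described above can be modelled in a few lines of Python (hypothetical; the names mirror the C but this is not QEMU code): if the job's AioContext changes while its lock is dropped, re-acquiring a stale saved context leaves the per-context lock counts unbalanced.

```python
class Ctx:
    """Toy AioContext: just counts acquire/release nesting."""
    def __init__(self):
        self.held = 0
    def acquire(self):
        self.held += 1
    def release(self):
        # models "qemu_mutex_unlock_impl: Operation not permitted"
        assert self.held > 0, "unlock of a lock we do not hold"
        self.held -= 1

class Job:
    def __init__(self, ctx):
        self.aio_context = ctx

a, b = Ctx(), Ctx()
job = Job(a)
job.aio_context.acquire()        # caller (job_exit) holds the job's context

# job_completed_txn_abort(), fixed version: no saved outer_ctx
job.aio_context.release()        # drop the lock around finalization
job.aio_context = b              # finalizing moved the job to another context
job.aio_context.acquire()        # re-acquire the job's *current* context

# job_exit() finally releases job->aio_context -- the same one we hold:
job.aio_context.release()
print(a.held, b.held)            # 0 0 -- balanced
```

With the buggy pattern, the last step would release the stale context `a` instead, leaving `b` locked and `a` under-locked, which is exactly the unbalanced unlock the patch removes.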




Re: [PATCH] qemu-img: Allow target be aligned to sector size

2021-09-07 Thread Vladimir Sementsov-Ogievskiy

19.08.2021 13:12, Hanna Reitz wrote:

We cannot write to images opened with O_DIRECT unless we allow them to
be resized so they are aligned to the sector size: Since 9c60a5d1978,
bdrv_node_refresh_perm() ensures that for nodes whose length is not
aligned to the request alignment and where someone has taken a WRITE
permission, the RESIZE permission is taken, too.

Let qemu-img convert pass the BDRV_O_RESIZE flag (which causes
blk_new_open() to take the RESIZE permission) when using cache=none for
the target, so that when writing to it, it can be aligned to the target
sector size.

Without this patch, an error is returned:

$ qemu-img convert -f raw -O raw -t none foo.img /mnt/tmp/foo.img
qemu-img: Could not open '/mnt/tmp/foo.img': Cannot get 'write'
permission without 'resize': Image size is not a multiple of request
alignment

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1994266
Signed-off-by: Hanna Reitz 
---
As I have written in the BZ linked above, I am not sure what behavior we
want.  It can be argued that the current behavior is perfectly OK
because we want the target to have the same size as the source, so if
this cannot be done, we should just print the above error (which I think
explains the problem well enough that users can figure out they need to
resize the source image).

OTOH, it is difficult for me to imagine a case where the user would
prefer the above error to just having qemu-img align the target image's
length.
---
  qemu-img.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index 908fd0cce5..d4b29bf73e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2628,6 +2628,14 @@ static int img_convert(int argc, char **argv)
  goto out;
  }
  
+if (flags & BDRV_O_NOCACHE) {
+/*
+ * If we open the target with O_DIRECT, it may be necessary to
+ * extend its size to align to the physical sector size.
+ */
+flags |= BDRV_O_RESIZE;
+}
+
  if (skip_create) {
  s.target = img_open(tgt_image_opts, out_filename, out_fmt,
  flags, writethrough, s.quiet, false);



Hmm. Strange. I am experimenting now with the master branch without that patch
and can't reproduce:

[root@kvm master]# dd if=/dev/zero of=a bs=1M count=2
2+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.00153639 s, 1.4 GB/s
[root@kvm master]# dd if=/dev/zero of=b bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.000939314 s, 1.1 GB/s
[root@kvm master]# ls -lthr a b
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 a
-rw-r--r--. 1 root root 1.0M Sep  7 14:28 b
[root@kvm master]# ./build/qemu-img convert -f raw -O raw -t none a b
[root@kvm master]# ls -lthr a b
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 a
-rw-r--r--. 1 root root 2.0M Sep  7 14:28 b
[root@kvm master]# dd if=/dev/zero of=b bs=1 count=2097147
2097147+0 records in
2097147+0 records out
2097147 bytes (2.1 MB, 2.0 MiB) copied, 2.76548 s, 758 kB/s
[root@kvm master]# ls -ltr a b
-rw-r--r--. 1 root root 2097152 Sep  7 14:28 a
-rw-r--r--. 1 root root 2097147 Sep  7 14:29 b
[root@kvm master]# ./build/qemu-img convert -f raw -O raw -t none a b
[root@kvm master]# ls -ltr a b
-rw-r--r--. 1 root root 2097152 Sep  7 14:28 a
-rw-r--r--. 1 root root 2097152 Sep  7 14:29 b

It just works, and resizes the target without any additional BDRV_O_RESIZE...

--
Best regards,
Vladimir



Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-07 Thread Dr. David Alan Gilbert
* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos 
> > > wrote:
> > > > Hello Daniel, thanks for the feedback !
> > > >
> > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé 
> > > >  wrote:
> > > > >
> > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd 
> > > > > > thread.
> > > > > >
> > > > > > Change the send_write() interface of multifd, allowing it to pass 
> > > > > > down
> > > > > > flags for qio_channel_write*().
> > > > > >
> > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping 
> > > > > > the
> > > > > > other data being sent at the default copying approach.
> > > > > >
> > > > > > Signed-off-by: Leonardo Bras 
> > > > > > ---
> > > > > >  migration/multifd-zlib.c | 7 ---
> > > > > >  migration/multifd-zstd.c | 7 ---
> > > > > >  migration/multifd.c  | 9 ++---
> > > > > >  migration/multifd.h  | 3 ++-
> > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > >
> > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > >  }
> > > > > >
> > > > > >  if (used) {
> > > > > > -ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > +ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > +  &local_err);
> > > > >
> > > > > I don't think it is valid to unconditionally enable this feature due 
> > > > > to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >if the socket option was not set, the socket exceeds its optmem
> > > > >limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > >
> > > Yes, by default limit QEMU sees will be something very small.
> > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of 
> > > > locked
> > > > memory for the user before using it, or adding a capability to qemu 
> > > > before.
> > > >
> > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > zerocopy=on in qemu code. If it fails, we can print an error message 
> > > > and fall
> > > > back to not using zerocopy (following the idea of a new 
> > > > io_async_writev()
> > > > I told you in the previous mail).
> > >
> > > Currently ability to lock memory is something that has to be configured
> > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > to QEMU. Memory locking is generally undesirable because it prevents
> > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > allowing memory locking is a way to kill your entire host.
> > 
> > You mean it's gonna consume too much memory, or something else?
> 
> Essentially yes. 
> 
> > > I don't think we can unconditionally grant ability to lock arbitrary
> > > guest RAM at startup, just to satisfy a possible desire to use zerocopy
> > > migration later. Granting it at runtime feels questionable as you now
> > > need to track and predict how much locked memory you've allowed, and
> > > also have possible problems with revocation.
> > 
> > (I am really new to this, so please forgive me if I am asking dumb or
> > overly basic questions)
> > 
> > What does revocation mean in this context?
> > You give the process the ability to lock n MB of memory, and then you take it?
> > Why would that happen? Is locked memory a limited resource?
> 
> Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> total.
> 
> So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> which exceeds physical RAM. Most of the time this may well be fine
> as the VMs don't concurrently need their full RAM allocation, and
> worst case they'll get pushed to swap as the kernel re-shares
> memory in response to load. So perhaps each VM only needs 20 GB
> resident at any time, but over time one VM can burst up to 32 GB
> and then 12 GB of it get swapped out later when inactive.
> 
> But now consider if we allowed 2 of the VMs to lock memory for
> purposes of migration. Those 2 VMs 

Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-07 Thread Dr. David Alan Gilbert
* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote:
> > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd 
> > > > > > thread.
> > > > > > 
> > > > > > Change the send_write() interface of multifd, allowing it to pass 
> > > > > > down
> > > > > > flags for qio_channel_write*().
> > > > > > 
> > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping 
> > > > > > the
> > > > > > other data being sent at the default copying approach.
> > > > > > 
> > > > > > Signed-off-by: Leonardo Bras 
> > > > > > ---
> > > > > >  migration/multifd-zlib.c | 7 ---
> > > > > >  migration/multifd-zstd.c | 7 ---
> > > > > >  migration/multifd.c  | 9 ++---
> > > > > >  migration/multifd.h  | 3 ++-
> > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > 
> > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > >  }
> > > > > >  
> > > > > >  if (used) {
> > > > > > -ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > +ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > +  &local_err);
> > > > > 
> > > > > I don't think it is valid to unconditionally enable this feature due 
> > > > > to the
> > > > > resource usage implications
> > > > > 
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > 
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> > > > >if the socket option was not set, the socket exceeds its optmem 
> > > > >limit or the user exceeds its ulimit on locked pages."
> > > > > 
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > > 
> > > > Yes it would be great to be a migration capability in parallel to 
> > > > multifd. At
> > > > initial phase if it's easy to be implemented on multi-fd only, we can 
> > > > add a
> > > > dependency between the caps.  In the future we can remove that 
> > > > dependency when
> > > > the code is ready to go without multifd.  Thanks,
> > > 
> > > Also, I'm wondering how zerocopy support interacts with kernel support
> > > for kTLS and multipath-TCP, both of which we want to be able to use
> > > with migration.
> > 
> > Copying Jason Wang for net implications between these features on kernel 
> > side
> > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
> > 
> > From the safe side we may want to only enable one of them until we prove
> > they'll work together I guess..
> 
> MPTCP is good when we're network limited for migration
> 
> KTLS will be good when we're CPU limited on AES for migration,
> which is essentially always when TLS is used.
> 
> ZEROCOPY will be good when we're CPU limited for data copy
> on migration, or to reduce the impact on other concurrent
> VMs on the same CPUs.
> 
> Ultimately we would benefit from all of them at the same
> time, if it were technically possible to do.

I think last time I spoke to Paolo Abeni there were some interactions
between them; I can't remember what though (I think mptcp and ktls
didn't play at the time).

Dave

> > Not an immediate concern as I don't really think any of them is really
> > explicitly supported in qemu.
> 
> QEMU has mptcp support already:
> 
>   commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
>   Author: Dr. David Alan Gilbert 
>   Date:   Wed Apr 21 12:28:34 2021 +0100
> 
> sockets: Support multipath TCP
> 
> Multipath TCP allows combining multiple interfaces/routes into a single
> socket, with very little work for the user/admin.
> 
> It's enabled by 'mptcp' on most socket addresses:
> 
>./qemu-system-x86_64 -nographic -incoming tcp:0:,mptcp
> 
> > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> > gnutls won't have a way to maintain the tls buffers used by zerocopy.  So at
> > least we need some knob to detect whether kTLS is enabled in gnutls.
> 
> It isn't possible for gnutls to transparently enable KTLS, because
> GNUTLS doesn't get to see the actual socket directly - it'll need
> some work in QEMU to enable it.  We know MPTCP and KTLS are currently
> mutually exclusive as they both use the same kernel network hooks
> framework.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com 

Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-07 Thread Dr. David Alan Gilbert
* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Thu, Sep 02, 2021 at 07:19:58AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos 
> > > wrote:
> > > > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé  
> > > > wrote:
> > > > >
> > > > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos 
> > > > > wrote:
> > > > >
> > > > > > > I would suggest checkig in close(), but as mentioned
> > > > > > > earlier, I think the design is flawed because the caller
> > > > > > > fundamentally needs to know about completion for every
> > > > > > > single write they make in order to know when the buffer
> > > > > > > can be released / reused.
> > > > > >
> > > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > > > > activated with a
> > > > > > parameter flag, or on a different method if callback is preferred):
> > > > > > In the MSG_ZEROCOPY docs, we can see that the example includes 
> > > > > > using a poll()
> > > > > > syscall after each packet sent, and this means the fd gets a signal 
> > > > > > after each
> > > > > > sendmsg() happens, with error or not.
> > > > > >
> > > > > > We could harness this with a poll() and a relatively high timeout:
> > > > > > - We stop sending packets, and then call poll().
> > > > > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > > > > took too long
> > > > > > without sendmsg() happening, meaning all the packets are sent.
> > > > > > - If it returns anything else, we go back to fixing the errors 
> > > > > > found (re-send)
> > > > > >
> > > > > > The problem may be defining the value of this timeout, but it could 
> > > > > > be
> > > > > > called only
> > > > > > when zerocopy is active.
> > > > >
> > > > > Maybe we need to check completions at the end of each iteration of the
> > > > > migration dirtypage loop ?
> > > >
> > > > Sorry, I am really new to this, and I still couldn't understand why 
> > > > would we
> > > > need to check at the end of each iteration, instead of doing a full 
> > > > check at the
> > > > end.
> > >
> > > The end of each iteration is an implicit synchronization point in the
> > > current migration code.
> > >
> > > For example, we might do 2 iterations of migration pre-copy, and then
> > > switch to post-copy mode. If the data from those 2 iterations hasn't
> > > been sent at the point we switch to post-copy, that is a semantic
> > > change from current behaviour. I don't know if that will have an
> > > problematic effect on the migration process, or not. Checking the
> > > async completions at the end of each iteration though, would ensure
> > > the semantics similar to current semantics, reducing risk of unexpected
> > > problems.
> > >
> > 
> > What if we do the 'flush()' before we start post-copy, instead of after each
> > iteration? would that be enough?
> 
> Possibly, yes. This really need David G's input since he understands
> the code in way more detail than me.

Hmm I'm not entirely sure why we have the sync after each iteration;
the case I can think of is if we're doing async sending, we could have
two versions of the same page in flight (one from each iteration) -
you'd want those to get there in the right order.

Dave

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH v4 4/5] mirror-top-perms: Fix AbnormalShutdown path

2021-09-07 Thread Philippe Mathieu-Daudé
On 9/7/21 11:57 AM, Hanna Reitz wrote:
> On 02.09.21 12:15, Philippe Mathieu-Daudé wrote:
>> On 9/2/21 11:58 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> 02.09.2021 12:40, Hanna Reitz wrote:
 The AbnormalShutdown exception class is not in qemu.machine, but in
 qemu.machine.machine.  (qemu.machine.AbnormalShutdown was enough for
 Python to find it in order to run this test, but pylint complains about
 it.)

 Signed-off-by: Hanna Reitz 
 ---
    tests/qemu-iotests/tests/mirror-top-perms | 2 +-
    1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/tests/qemu-iotests/tests/mirror-top-perms
 b/tests/qemu-iotests/tests/mirror-top-perms
 index 451a0666f8..2fc8dd66e0 100755
 --- a/tests/qemu-iotests/tests/mirror-top-perms
 +++ b/tests/qemu-iotests/tests/mirror-top-perms
 @@ -47,7 +47,7 @@ class TestMirrorTopPerms(iotests.QMPTestCase):
    def tearDown(self):
    try:
    self.vm.shutdown()
 -    except qemu.machine.AbnormalShutdown:
 +    except qemu.machine.machine.AbnormalShutdown:
    pass
      if self.vm_b is not None:

>>> Hmm, interesting... Maybe it is bad that the module has the same name as
>>> the subpackage?
>> Confusing indeed. Could this be improved?
> 
> I think if we want to improve something, it would be that we make the
> exception public in the qemu.machine namespace, like so:
> 
> diff --git a/python/qemu/machine/__init__.py
> b/python/qemu/machine/__init__.py
> index 9ccd58ef14..48bbb0530b 100644
> --- a/python/qemu/machine/__init__.py
> +++ b/python/qemu/machine/__init__.py
> @@ -25,7 +25,7 @@
>  # pylint: disable=import-error
>  # see: https://github.com/PyCQA/pylint/issues/3624
>  # see: https://github.com/PyCQA/pylint/issues/3651
> -from .machine import QEMUMachine
> +from .machine import AbnormalShutdown, QEMUMachine
>  from .qtest import QEMUQtestMachine, QEMUQtestProtocol
> 
> But, well.  I don’t mind a qemu.machine.machine too much, personally.

I'm not worried about you, but about a newcomer trying to use this
interface ;) Note I'm not asking you to clean that up either; I
was just wondering why we have machine.machine.




Re: [PATCH] qemu-img: Allow target be aligned to sector size

2021-09-07 Thread Hanna Reitz

Ping – any thoughts on this?

Hanna

On 19.08.21 12:12, Hanna Reitz wrote:

We cannot write to images opened with O_DIRECT unless we allow them to
be resized so they are aligned to the sector size: Since 9c60a5d1978,
bdrv_node_refresh_perm() ensures that for nodes whose length is not
aligned to the request alignment and where someone has taken a WRITE
permission, the RESIZE permission is taken, too.

Let qemu-img convert pass the BDRV_O_RESIZE flag (which causes
blk_new_open() to take the RESIZE permission) when using cache=none for
the target, so that when writing to it, it can be aligned to the target
sector size.

Without this patch, an error is returned:

$ qemu-img convert -f raw -O raw -t none foo.img /mnt/tmp/foo.img
qemu-img: Could not open '/mnt/tmp/foo.img': Cannot get 'write'
permission without 'resize': Image size is not a multiple of request
alignment

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1994266
Signed-off-by: Hanna Reitz 
---
As I have written in the BZ linked above, I am not sure what behavior we
want.  It can be argued that the current behavior is perfectly OK
because we want the target to have the same size as the source, so if
this cannot be done, we should just print the above error (which I think
explains the problem well enough that users can figure out they need to
resize the source image).

OTOH, it is difficult for me to imagine a case where the user would
prefer the above error to just having qemu-img align the target image's
length.
---
  qemu-img.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index 908fd0cce5..d4b29bf73e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2628,6 +2628,14 @@ static int img_convert(int argc, char **argv)
  goto out;
  }
  
+if (flags & BDRV_O_NOCACHE) {
+/*
+ * If we open the target with O_DIRECT, it may be necessary to
+ * extend its size to align to the physical sector size.
+ */
+flags |= BDRV_O_RESIZE;
+}
+
  if (skip_create) {
  s.target = img_open(tgt_image_opts, out_filename, out_fmt,
  flags, writethrough, s.quiet, false);





Re: [PATCH v4 4/5] mirror-top-perms: Fix AbnormalShutdown path

2021-09-07 Thread Hanna Reitz

On 02.09.21 12:15, Philippe Mathieu-Daudé wrote:

On 9/2/21 11:58 AM, Vladimir Sementsov-Ogievskiy wrote:

02.09.2021 12:40, Hanna Reitz wrote:

The AbnormalShutdown exception class is not in qemu.machine, but in
qemu.machine.machine.  (qemu.machine.AbnormalShutdown was enough for
Python to find it in order to run this test, but pylint complains about
it.)

Signed-off-by: Hanna Reitz 
---
   tests/qemu-iotests/tests/mirror-top-perms | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/tests/mirror-top-perms
b/tests/qemu-iotests/tests/mirror-top-perms
index 451a0666f8..2fc8dd66e0 100755
--- a/tests/qemu-iotests/tests/mirror-top-perms
+++ b/tests/qemu-iotests/tests/mirror-top-perms
@@ -47,7 +47,7 @@ class TestMirrorTopPerms(iotests.QMPTestCase):
   def tearDown(self):
   try:
   self.vm.shutdown()
-    except qemu.machine.AbnormalShutdown:
+    except qemu.machine.machine.AbnormalShutdown:
   pass
     if self.vm_b is not None:


Hmm, interesting... Maybe it is bad that the module has the same name as the subpackage?

Confusing indeed. Could this be improved?


I think if we want to improve something, it would be that we make the 
exception public in the qemu.machine namespace, like so:


diff --git a/python/qemu/machine/__init__.py 
b/python/qemu/machine/__init__.py

index 9ccd58ef14..48bbb0530b 100644
--- a/python/qemu/machine/__init__.py
+++ b/python/qemu/machine/__init__.py
@@ -25,7 +25,7 @@
 # pylint: disable=import-error
 # see: https://github.com/PyCQA/pylint/issues/3624
 # see: https://github.com/PyCQA/pylint/issues/3651
-from .machine import QEMUMachine
+from .machine import AbnormalShutdown, QEMUMachine
 from .qtest import QEMUQtestMachine, QEMUQtestProtocol

But, well.  I don’t mind a qemu.machine.machine too much, personally.

Hanna




Re: [PATCH v4 0/5] iotests/297: Cover tests/

2021-09-07 Thread Hanna Reitz

On 02.09.21 11:40, Hanna Reitz wrote:

v1: https://lists.nongnu.org/archive/html/qemu-block/2021-03/msg01471.html
v2: https://lists.nongnu.org/archive/html/qemu-block/2021-05/msg00492.html
v3: https://lists.nongnu.org/archive/html/qemu-block/2021-05/msg00569.html


Hi,

Sorry for the long delay, here is v4 to make our lint checking iotest
297 cover the tests/ subdirectory.


Thanks for the review, applied to my block branch:

https://github.com/XanClic/qemu/commits/block

Hanna




Re: [PATCH 0/2] iotests: Fix pylint warnings

2021-09-07 Thread Hanna Reitz

On 24.08.21 17:35, Hanna Reitz wrote:

Hi,

I’ve updated my pylint to 2.10.2 and was greeted with some new warnings.
Some are fixed by John’s “Run iotest linters during Python CI” series
(https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00611.html),
but some are not (namely unspecified-encoding, use-list-literal, and
use-dict-literal).  This series cleans up that rest.


Thanks for the reviews, applied to my block branch:

https://github.com/XanClic/qemu/commits/block

Hanna




Re: [PATCH v3 0/6] block-status cache for data regions

2021-09-07 Thread Hanna Reitz

On 12.08.21 10:41, Hanna Reitz wrote:

Hi,

See the cover letter from v1 for the general idea:
https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg00843.html

Cover letter from v2, introducing RCU locking:
https://lists.nongnu.org/archive/html/qemu-block/2021-06/msg01060.html


v3:
- Patch 2:
   - Add rcu_head object to BdrvBlockStatusCache, so we can use
 g_free_rcu() to free it instead of synchronize_rcu()+g_free()
   - Use qatomic_rcu_read() every time we read bs->block_status_cache
 (except in bdrv_close(), where no concurrency is possible)
   - Use RCU_READ_LOCK_GUARD() instead of WITH_RCU_READ_LOCK_GUARD() in
 functions where we lock the whole scope anyway
   - Same for QEMU_LOCK_GUARD() instead of WITH_QEMU_LOCK_GUARD() in
 bdrv_bsc_fill()
   - Drop from_cache variable in bdrv_co_block_status()
 (was an artifact from v1, which had a different control flow and
 needed this variable)
   - Assert that local_map returned from a protocol driver’s
 bdrv_co_block_status() implementation is equal to the offset we
 passed to it (see comment there for why we should do this)

- Patch 3:
   - Add note why block drivers should return larger *pnum values in
 addition to just saying that it’s allowed


Thanks for the reviews, I’ve added the `local_file == bs` assertion in 
patch 2 as suggested by Vladimir (and updated the comment to match) and 
applied the series to my block branch:


https://github.com/XanClic/qemu/commits/block

Hanna




[PULL 06/11] util/vfio-helpers: Pass Error handle to qemu_vfio_dma_map()

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Currently qemu_vfio_dma_map() displays errors on stderr.
When using the management interface, this information is simply
lost. Pass qemu_vfio_dma_map() an Error** handle so it can
propagate the error to callers.

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-7-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 include/qemu/vfio-helpers.h |  2 +-
 block/nvme.c| 22 +++---
 util/vfio-helpers.c | 10 ++
 3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
index 4491c8e1a6..bde9495b25 100644
--- a/include/qemu/vfio-helpers.h
+++ b/include/qemu/vfio-helpers.h
@@ -18,7 +18,7 @@ typedef struct QEMUVFIOState QEMUVFIOState;
 QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp);
 void qemu_vfio_close(QEMUVFIOState *s);
 int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
-  bool temporary, uint64_t *iova_list);
+  bool temporary, uint64_t *iova_list, Error **errp);
 int qemu_vfio_dma_reset_temporary(QEMUVFIOState *s);
 void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host);
 void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index,
diff --git a/block/nvme.c b/block/nvme.c
index 0786c501e4..80546b0bab 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -176,12 +176,11 @@ static bool nvme_init_queue(BDRVNVMeState *s, NVMeQueue 
*q,
 return false;
 }
 memset(q->queue, 0, bytes);
-r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, false, &q->iova);
+r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, false, &q->iova, errp);
 if (r) {
-error_setg(errp, "Cannot map queue");
-return false;
+error_prepend(errp, "Cannot map queue: ");
 }
-return true;
+return r == 0;
 }
 
 static void nvme_free_queue_pair(NVMeQueuePair *q)
@@ -239,9 +238,9 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState 
*s,
 qemu_co_queue_init(&q->free_req_queue);
 q->completion_bh = aio_bh_new(aio_context, nvme_process_completion_bh, q);
 r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages, bytes,
-  false, &prp_list_iova);
+  false, &prp_list_iova, errp);
 if (r) {
-error_setg_errno(errp, -r, "Cannot map buffer for DMA");
+error_prepend(errp, "Cannot map buffer for DMA: ");
 goto fail;
 }
 q->free_req_head = -1;
@@ -534,9 +533,9 @@ static bool nvme_identify(BlockDriverState *bs, int 
namespace, Error **errp)
 error_setg(errp, "Cannot allocate buffer for identify response");
 goto out;
 }
-r = qemu_vfio_dma_map(s->vfio, id, id_size, true, &iova);
+r = qemu_vfio_dma_map(s->vfio, id, id_size, true, &iova, errp);
 if (r) {
-error_setg(errp, "Cannot map buffer for DMA");
+error_prepend(errp, "Cannot map buffer for DMA: ");
 goto out;
 }
 
@@ -1032,7 +1031,7 @@ static coroutine_fn int 
nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd,
 try_map:
 r = qemu_vfio_dma_map(s->vfio,
   qiov->iov[i].iov_base,
-  len, true, &iova);
+  len, true, &iova, NULL);
 if (r == -ENOSPC) {
 /*
  * In addition to the -ENOMEM error, the VFIO_IOMMU_MAP_DMA
@@ -1524,14 +1523,15 @@ static void nvme_aio_unplug(BlockDriverState *bs)
 static void nvme_register_buf(BlockDriverState *bs, void *host, size_t size)
 {
 int ret;
+Error *local_err = NULL;
 BDRVNVMeState *s = bs->opaque;
 
-ret = qemu_vfio_dma_map(s->vfio, host, size, false, NULL);
+ret = qemu_vfio_dma_map(s->vfio, host, size, false, NULL, &local_err);
 if (ret) {
 /* FIXME: we may run out of IOVA addresses after repeated
  * bdrv_register_buf/bdrv_unregister_buf, because nvme_vfio_dma_unmap
  * doesn't reclaim addresses for fixed mappings. */
-error_report("nvme_register_buf failed: %s", strerror(-ret));
+error_reportf_err(local_err, "nvme_register_buf failed: ");
 }
 }
 
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index e7909222cf..77cdec845d 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -463,13 +463,15 @@ static void qemu_vfio_ram_block_added(RAMBlockNotifier 
*n, void *host,
   size_t size, size_t max_size)
 {
 QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+Error *local_err = NULL;
 int ret;
 
 trace_qemu_vfio_ram_block_added(s, host, max_size);
-ret = qemu_vfio_dma_map(s, host, max_size, false, NULL);
+ret = qemu_vfio_dma_map(s, host, max_size, false, NULL, &local_err);
 if (ret) {
-error_report("qemu_vfio_dma_map(%p, %zu) failed: %s", host, max_size,
- strerror(-ret));
+error_reportf_err(local_err,
+   

[PULL 11/11] block/nvme: Only report VFIO error on failed retry

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

We expect the first qemu_vfio_dma_map() to fail (indicating
DMA mappings exhaustion, see commit 15a730e7a3a). Do not
report the first failure as error, since we are going to
flush the mappings and retry.

This removes spurious error message displayed on the monitor:

  (qemu) c
  (qemu) qemu-kvm: VFIO_MAP_DMA failed: No space left on device
  (qemu) info status
  VM status: running

Reported-by: Tingting Mao 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-12-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 block/nvme.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/nvme.c b/block/nvme.c
index 80546b0bab..abfe305baf 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1019,6 +1019,7 @@ static coroutine_fn int 
nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd,
 uint64_t *pagelist = req->prp_list_page;
 int i, j, r;
 int entries = 0;
+Error *local_err = NULL, **errp = NULL;
 
 assert(qiov->size);
 assert(QEMU_IS_ALIGNED(qiov->size, s->page_size));
@@ -1031,7 +1032,7 @@ static coroutine_fn int 
nvme_cmd_map_qiov(BlockDriverState *bs, NvmeCmd *cmd,
 try_map:
 r = qemu_vfio_dma_map(s->vfio,
   qiov->iov[i].iov_base,
-  len, true, &iova, NULL);
+  len, true, &iova, errp);
 if (r == -ENOSPC) {
 /*
  * In addition to the -ENOMEM error, the VFIO_IOMMU_MAP_DMA
@@ -1066,6 +1067,8 @@ try_map:
 goto fail;
 }
 }
+errp = &local_err;
+
 goto try_map;
 }
 if (r) {
@@ -1109,6 +1112,9 @@ fail:
  * because they are already mapped before calling this function; for
  * temporary mappings, a later nvme_cmd_(un)map_qiov will reclaim by
  * calling qemu_vfio_dma_reset_temporary when necessary. */
+if (local_err) {
+error_reportf_err(local_err, "Cannot map buffer for DMA: ");
+}
 return r;
 }
 
-- 
2.31.1




[PULL 10/11] util/vfio-helpers: Let qemu_vfio_do_mapping() propagate Error

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Pass qemu_vfio_do_mapping() an Error* argument so it can propagate
any error to callers. Replace error_report() which only report
to the monitor by the more generic error_setg_errno().

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-11-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/vfio-helpers.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 66148bd381..00a80431a0 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -610,7 +610,7 @@ static IOVAMapping *qemu_vfio_add_mapping(QEMUVFIOState *s,
 
 /* Do the DMA mapping with VFIO. */
 static int qemu_vfio_do_mapping(QEMUVFIOState *s, void *host, size_t size,
-uint64_t iova)
+uint64_t iova, Error **errp)
 {
 struct vfio_iommu_type1_dma_map dma_map = {
 .argsz = sizeof(dma_map),
@@ -622,7 +622,7 @@ static int qemu_vfio_do_mapping(QEMUVFIOState *s, void *host, size_t size,
 trace_qemu_vfio_do_mapping(s, host, iova, size);
 
 if (ioctl(s->container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
-error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
+error_setg_errno(errp, errno, "VFIO_MAP_DMA failed");
 return -errno;
 }
 return 0;
@@ -772,7 +772,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 
 mapping = qemu_vfio_add_mapping(s, host, size, index + 1, iova0);
 assert(qemu_vfio_verify_mappings(s));
-ret = qemu_vfio_do_mapping(s, host, size, iova0);
+ret = qemu_vfio_do_mapping(s, host, size, iova0, errp);
 if (ret < 0) {
 qemu_vfio_undo_mapping(s, mapping, NULL);
 return ret;
@@ -782,7 +782,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 if (!qemu_vfio_find_temp_iova(s, size, &iova0, errp)) {
 return -ENOMEM;
 }
-ret = qemu_vfio_do_mapping(s, host, size, iova0);
+ret = qemu_vfio_do_mapping(s, host, size, iova0, errp);
 if (ret < 0) {
 return ret;
 }
-- 
2.31.1




[PULL 04/11] util/vfio-helpers: Remove unreachable code in qemu_vfio_dma_map()

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

qemu_vfio_add_mapping() returns a pointer to an indexed entry
in pre-allocated QEMUVFIOState::mappings[], thus can not be NULL.
Remove the pointless check.

Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-5-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/vfio-helpers.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index d956866b4e..e7909222cf 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -751,10 +751,6 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 }
 
 mapping = qemu_vfio_add_mapping(s, host, size, index + 1, iova0);
-if (!mapping) {
-ret = -ENOMEM;
-goto out;
-}
 assert(qemu_vfio_verify_mappings(s));
 ret = qemu_vfio_do_mapping(s, host, size, iova0);
 if (ret) {
-- 
2.31.1




[PULL 09/11] util/vfio-helpers: Simplify qemu_vfio_dma_map() returning directly

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

To simplify qemu_vfio_dma_map():
- reduce 'ret' (returned value) scope by returning errno directly,
- remove the goto 'out' label.

Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-10-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/vfio-helpers.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index b93a3d3578..66148bd381 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -748,7 +748,6 @@ static bool qemu_vfio_water_mark_reached(QEMUVFIOState *s, size_t size,
 int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
   bool temporary, uint64_t *iova, Error **errp)
 {
-int ret = 0;
 int index;
 IOVAMapping *mapping;
 uint64_t iova0;
@@ -761,32 +760,31 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 if (mapping) {
 iova0 = mapping->iova + ((uint8_t *)host - (uint8_t *)mapping->host);
 } else {
+int ret;
+
 if (qemu_vfio_water_mark_reached(s, size, errp)) {
-ret = -ENOMEM;
-goto out;
+return -ENOMEM;
 }
 if (!temporary) {
 if (!qemu_vfio_find_fixed_iova(s, size, &iova0, errp)) {
-ret = -ENOMEM;
-goto out;
+return -ENOMEM;
 }
 
 mapping = qemu_vfio_add_mapping(s, host, size, index + 1, iova0);
 assert(qemu_vfio_verify_mappings(s));
 ret = qemu_vfio_do_mapping(s, host, size, iova0);
-if (ret) {
+if (ret < 0) {
 qemu_vfio_undo_mapping(s, mapping, NULL);
-goto out;
+return ret;
 }
 qemu_vfio_dump_mappings(s);
 } else {
 if (!qemu_vfio_find_temp_iova(s, size, &iova0, errp)) {
-ret = -ENOMEM;
-goto out;
+return -ENOMEM;
 }
 ret = qemu_vfio_do_mapping(s, host, size, iova0);
-if (ret) {
-goto out;
+if (ret < 0) {
+return ret;
 }
 }
 }
@@ -794,8 +792,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 if (iova) {
 *iova = iova0;
 }
-out:
-return ret;
+return 0;
 }
 
 /* Reset the high watermark and free all "temporary" mappings. */
-- 
2.31.1




[PULL 07/11] util/vfio-helpers: Extract qemu_vfio_water_mark_reached()

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Extract qemu_vfio_water_mark_reached() for readability,
and have it provide an error hint in its Error* handle.

Suggested-by: Klaus Jensen 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-8-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/vfio-helpers.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 77cdec845d..306b5a83e4 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -721,6 +721,21 @@ qemu_vfio_find_temp_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
 return -ENOMEM;
 }
 
+/**
+ * qemu_vfio_water_mark_reached:
+ *
+ * Returns %true if high watermark has been reached, %false otherwise.
+ */
+static bool qemu_vfio_water_mark_reached(QEMUVFIOState *s, size_t size,
+ Error **errp)
+{
+if (s->high_water_mark - s->low_water_mark + 1 < size) {
+error_setg(errp, "iova exhausted (water mark reached)");
+return true;
+}
+return false;
+}
+
 /* Map [host, host + size) area into a contiguous IOVA address space, and store
  * the result in @iova if not NULL. The caller need to make sure the area is
  * aligned to page size, and mustn't overlap with existing mapping areas (split
@@ -742,7 +757,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 if (mapping) {
 iova0 = mapping->iova + ((uint8_t *)host - (uint8_t *)mapping->host);
 } else {
-if (s->high_water_mark - s->low_water_mark + 1 < size) {
+if (qemu_vfio_water_mark_reached(s, size, errp)) {
 ret = -ENOMEM;
 goto out;
 }
-- 
2.31.1




[PULL 03/11] util/vfio-helpers: Replace qemu_mutex_lock() calls with QEMU_LOCK_GUARD

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Simplify the qemu_vfio_dma_[un]map() handlers by replacing each pair of
qemu_mutex_lock()/qemu_mutex_unlock() calls with the QEMU_LOCK_GUARD
macro.

Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-4-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/vfio-helpers.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 1d14913629..d956866b4e 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -735,7 +735,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 assert(QEMU_PTR_IS_ALIGNED(host, qemu_real_host_page_size));
 assert(QEMU_IS_ALIGNED(size, qemu_real_host_page_size));
 trace_qemu_vfio_dma_map(s, host, size, temporary, iova);
-qemu_mutex_lock(&s->lock);
+QEMU_LOCK_GUARD(&s->lock);
 mapping = qemu_vfio_find_mapping(s, host, &index);
 if (mapping) {
 iova0 = mapping->iova + ((uint8_t *)host - (uint8_t *)mapping->host);
@@ -778,7 +778,6 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 *iova = iova0;
 }
 out:
-qemu_mutex_unlock(>lock);
 return ret;
 }
 
@@ -813,14 +812,12 @@ void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host)
 }
 
 trace_qemu_vfio_dma_unmap(s, host);
-qemu_mutex_lock(&s->lock);
+QEMU_LOCK_GUARD(&s->lock);
 m = qemu_vfio_find_mapping(s, host, &index);
 if (!m) {
-goto out;
+return;
 }
 qemu_vfio_undo_mapping(s, m, NULL);
-out:
-qemu_mutex_unlock(>lock);
 }
 
 static void qemu_vfio_reset(QEMUVFIOState *s)
-- 
2.31.1




[PULL 05/11] block/nvme: Have nvme_create_queue_pair() report errors consistently

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

nvme_create_queue_pair() does not return a boolean value (indicating
an eventual error) but a pointer, and it is inconsistent in how it
fills the error handler. To fulfill callers' expectations, always set
an error message on failure.

Reported-by: Auger Eric 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-6-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 block/nvme.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/nvme.c b/block/nvme.c
index e8dbbc2317..0786c501e4 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -220,6 +220,7 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState *s,
 
 q = g_try_new0(NVMeQueuePair, 1);
 if (!q) {
+error_setg(errp, "Cannot allocate queue pair");
 return NULL;
 }
 trace_nvme_create_queue_pair(idx, q, size, aio_context,
@@ -228,6 +229,7 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState *s,
   qemu_real_host_page_size);
 q->prp_list_pages = qemu_try_memalign(qemu_real_host_page_size, bytes);
 if (!q->prp_list_pages) {
+error_setg(errp, "Cannot allocate PRP page list");
 goto fail;
 }
 memset(q->prp_list_pages, 0, bytes);
@@ -239,6 +241,7 @@ static NVMeQueuePair *nvme_create_queue_pair(BDRVNVMeState *s,
 r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages, bytes,
   false, &prp_list_iova);
 if (r) {
+error_setg_errno(errp, -r, "Cannot map buffer for DMA");
 goto fail;
 }
 q->free_req_head = -1;
-- 
2.31.1




[PULL 01/11] block/nvme: Use safer trace format string

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Fix when building with -Wshorten-64-to-32:

  warning: implicit conversion loses integer precision: 'unsigned long' to 'int' [-Wshorten-64-to-32]

Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-2-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 block/trace-events | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/trace-events b/block/trace-events
index b3d2b1e62c..f4f1267c8c 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -156,7 +156,7 @@ nvme_dsm(void *s, uint64_t offset, uint64_t bytes) "s %p offset 0x%"PRIx64" byte
 nvme_dsm_done(void *s, uint64_t offset, uint64_t bytes, int ret) "s %p offset 0x%"PRIx64" bytes %"PRId64" ret %d"
 nvme_dma_map_flush(void *s) "s %p"
 nvme_free_req_queue_wait(void *s, unsigned q_index) "s %p q #%u"
-nvme_create_queue_pair(unsigned q_index, void *q, unsigned size, void *aio_context, int fd) "index %u q %p size %u aioctx %p fd %d"
+nvme_create_queue_pair(unsigned q_index, void *q, size_t size, void *aio_context, int fd) "index %u q %p size %zu aioctx %p fd %d"
 nvme_free_queue_pair(unsigned q_index, void *q) "index %u q %p"
 nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s %p cmd %p req %p qiov %p entries %d"
 nvme_cmd_map_qiov_pages(void *s, int i, uint64_t page) "s %p page[%d] 0x%"PRIx64
-- 
2.31.1




[PULL 02/11] util/vfio-helpers: Let qemu_vfio_verify_mappings() use error_report()

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Instead of displaying the error on stderr, use error_report(),
which also reports to the monitor.

Reviewed-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-3-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/vfio-helpers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 95b86e..1d14913629 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -660,13 +660,13 @@ static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
 if (QEMU_VFIO_DEBUG) {
 for (i = 0; i < s->nr_mappings - 1; ++i) {
 if (!(s->mappings[i].host < s->mappings[i + 1].host)) {
-fprintf(stderr, "item %d not sorted!\n", i);
+error_report("item %d not sorted!", i);
 qemu_vfio_dump_mappings(s);
 return false;
 }
 if (!(s->mappings[i].host + s->mappings[i].size <=
   s->mappings[i + 1].host)) {
-fprintf(stderr, "item %d overlap with next!\n", i);
+error_report("item %d overlap with next!", i);
 qemu_vfio_dump_mappings(s);
 return false;
 }
-- 
2.31.1




[PULL 08/11] util/vfio-helpers: Use error_setg in qemu_vfio_find_[fixed/temp]_iova

2021-09-07 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Both qemu_vfio_find_fixed_iova() and qemu_vfio_find_temp_iova()
return an errno which is unused (or overwritten). Have them propagate
any errors to their callers, returning a boolean (which is what the
Error API recommends; see commit e3fe3988d78 "error: Document Error
API usage rules" for the rationale).

Suggested-by: Klaus Jensen 
Reviewed-by: Klaus Jensen 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20210902070025.197072-9-phi...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 util/vfio-helpers.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 306b5a83e4..b93a3d3578 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -677,8 +677,8 @@ static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
 return true;
 }
 
-static int
-qemu_vfio_find_fixed_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
+static bool qemu_vfio_find_fixed_iova(QEMUVFIOState *s, size_t size,
+  uint64_t *iova, Error **errp)
 {
 int i;
 
@@ -693,14 +693,16 @@ qemu_vfio_find_fixed_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
 s->usable_iova_ranges[i].end - s->low_water_mark + 1 == 0) {
 *iova = s->low_water_mark;
 s->low_water_mark += size;
-return 0;
+return true;
 }
 }
-return -ENOMEM;
+error_setg(errp, "fixed iova range not found");
+
+return false;
 }
 
-static int
-qemu_vfio_find_temp_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
+static bool qemu_vfio_find_temp_iova(QEMUVFIOState *s, size_t size,
+ uint64_t *iova, Error **errp)
 {
 int i;
 
@@ -715,10 +717,12 @@ qemu_vfio_find_temp_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
 s->high_water_mark - s->usable_iova_ranges[i].start + 1 == 0) {
 *iova = s->high_water_mark - size;
 s->high_water_mark = *iova;
-return 0;
+return true;
 }
 }
-return -ENOMEM;
+error_setg(errp, "temporary iova range not found");
+
+return false;
 }
 
 /**
@@ -762,7 +766,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 goto out;
 }
 if (!temporary) {
-if (qemu_vfio_find_fixed_iova(s, size, &iova0)) {
+if (!qemu_vfio_find_fixed_iova(s, size, &iova0, errp)) {
 ret = -ENOMEM;
 goto out;
 }
@@ -776,7 +780,7 @@ int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
 }
 qemu_vfio_dump_mappings(s);
 } else {
-if (qemu_vfio_find_temp_iova(s, size, &iova0)) {
+if (!qemu_vfio_find_temp_iova(s, size, &iova0, errp)) {
 ret = -ENOMEM;
 goto out;
 }
-- 
2.31.1




[PULL 00/11] Block patches

2021-09-07 Thread Stefan Hajnoczi
The following changes since commit 88afdc92b644120e0182c8567e1b1d236e471b12:

  Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging 
(2021-09-05 15:48:42 +0100)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 9bd2788f49c331b02372cc257b11e4c984d39708:

  block/nvme: Only report VFIO error on failed retry (2021-09-07 09:08:24 +0100)


Pull request

Userspace NVMe driver patches.



Philippe Mathieu-Daudé (11):
  block/nvme: Use safer trace format string
  util/vfio-helpers: Let qemu_vfio_verify_mappings() use error_report()
  util/vfio-helpers: Replace qemu_mutex_lock() calls with
QEMU_LOCK_GUARD
  util/vfio-helpers: Remove unreachable code in qemu_vfio_dma_map()
  block/nvme: Have nvme_create_queue_pair() report errors consistently
  util/vfio-helpers: Pass Error handle to qemu_vfio_dma_map()
  util/vfio-helpers: Extract qemu_vfio_water_mark_reached()
  util/vfio-helpers: Use error_setg in qemu_vfio_find_[fixed/temp]_iova
  util/vfio-helpers: Simplify qemu_vfio_dma_map() returning directly
  util/vfio-helpers: Let qemu_vfio_do_mapping() propagate Error
  block/nvme: Only report VFIO error on failed retry

 include/qemu/vfio-helpers.h |  2 +-
 block/nvme.c| 29 +++
 util/vfio-helpers.c | 99 -
 block/trace-events  |  2 +-
 4 files changed, 76 insertions(+), 56 deletions(-)

-- 
2.31.1





Re: [PATCH] gluster: Align block-status tail

2021-09-07 Thread Hanna Reitz

On 05.08.21 16:36, Max Reitz wrote:

gluster's block-status implementation is basically a copy of that in
block/file-posix.c, there is only one thing missing, and that is
aligning trailing data extents to the request alignment (as added by
commit 9c3db310ff0).

Note that 9c3db310ff0 mentions that "there seems to be no other block
driver that sets request_alignment and [...]", but while block/gluster.c
does indeed not set request_alignment, block/io.c's
bdrv_refresh_limits() will still default to an alignment of 512 because
block/gluster.c does not provide a byte-aligned read function.
Therefore, unaligned tails can conceivably occur, and so we should apply
the change from 9c3db310ff0 to gluster's block-status implementation.

Reported-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Max Reitz 
---
  block/gluster.c | 16 
  1 file changed, 16 insertions(+)


Thanks for the review, applied to my block branch:

https://github.com/XanClic/qemu/commits/block

Hanna




Re: [PATCH v3 2/9] qapi: make blockdev-add a coroutine command

2021-09-07 Thread Markus Armbruster
Vladimir Sementsov-Ogievskiy  writes:

> 06.09.2021 22:28, Markus Armbruster wrote:
>> Vladimir Sementsov-Ogievskiy  writes:
>> 
>>> We are going to support nbd reconnect on open in a next commit. This
>>> means that we want to do several connection attempts during some time.
>>> And this should be done in a coroutine, otherwise we'll stuck.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>>> ---
>>>   qapi/block-core.json | 3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>> index 06674c25c9..6e4042530a 100644
>>> --- a/qapi/block-core.json
>>> +++ b/qapi/block-core.json
>>> @@ -4219,7 +4219,8 @@
>>>   # <- { "return": {} }
>>>   #
>>>   ##
>>> -{ 'command': 'blockdev-add', 'data': 'BlockdevOptions', 'boxed': true }
>>> +{ 'command': 'blockdev-add', 'data': 'BlockdevOptions', 'boxed': true,
>>> +  'coroutine': true }
>>>   
>>>   ##
>>>   # @blockdev-reopen:
>> 
>> Why is this safe?
>> 
>> Prior discusson:
>> Message-ID: <87lfq0yp9v@dusky.pond.sub.org>
>> https://lists.gnu.org/archive/html/qemu-devel/2020-01/msg04921.html
>> 
>
> Hmm.. I'm afraid that I can't prove that it's safe. At least it would mean 
> auditing .bdrv_open() of all block drivers.. And nothing prevents creating 
> new incompatible drivers in the future..

Breaking existing block drivers is more serious than adding new drivers
that are broken.

> On the other hand, looking at qmp_blockdev_add, bdrv_open() is the only thing 
> of interest.
>
> And theoretically, bdrv_open() should work in coroutine context. We do call 
> this function from coroutine_fn functions sometimes. So maybe, if in some 
> circumstances bdrv_open() is not compatible with coroutine context, we can 
> consider it a bug? And fix it later, if it happens?

If it's already used in ways that require coroutine-compatibility, we'd
merely add another way for existing bugs to bite.  Kevin, is it?

In general, the less coroutine-incompatibility we have, the better.
From the thread I quoted:

Kevin Wolf  writes:

> On 22.01.2020 at 13:15, Markus Armbruster wrote:
[...]
>> Is coroutine-incompatible command handler code necessary or accidental?
>> 
>> By "necessary" I mean there are (and likely always will be) commands
>> that need to do stuff that cannot or should not be done on coroutine
>> context.
>> 
>> "Accidental" is the opposite: coroutine-incompatibility can be regarded
>> as a fixable flaw.
>
> I expect it's mostly accidental, but maybe not all of it.

I'm inclined to regard accidental coroutine-incompatibility as a bug.

As long as the command doesn't have the coroutine flag set, it's a
latent bug.

Setting the coroutine flag without auditing the code risks making such
latent bugs actual bugs.  A typical outcome is deadlock.

Whether we're willing to accept such risk is not for me to decide.

We weren't when we merged the coroutine work, at least not wholesale.
The risk is perhaps less scary for a single command.  On the other hand,
code audit is less daunting, too.

Kevin, what do you think?