Re: [PATCH v3 2/2] qemu-img: map: report compressed data blocks

2023-09-15 Thread Kevin Wolf
Am 14.09.2023 um 23:40 hat Andrey Drobyshev geschrieben:
> On 9/15/23 00:17, Eric Blake wrote:
> > On Fri, Sep 08, 2023 at 12:02:26AM +0300, Andrey Drobyshev wrote:
> >> Right now "qemu-img map" reports compressed blocks as containing data
> >> but having no host offset.  This is not very informative.  Instead,
> >> let's add another boolean field named "compressed" in case JSON output
> >> mode is specified.  This is achieved by utilizing new allocation status
> >> flag BDRV_BLOCK_COMPRESSED for bdrv_block_status().
> >>
> >> Also update the expected qemu-iotests outputs to contain the new field.
> >>
> >> Signed-off-by: Andrey Drobyshev 
> >> ---
> > 
> >> +++ b/qapi/block-core.json
> >> @@ -409,6 +409,9 @@
> >>  #
> >>  # @zero: whether the virtual blocks read as zeroes
> >>  #
> >> +# @compressed: true indicates that data is stored compressed.  Only valid
> >> +# for the formats whith support compression (since 8.2)
> > 
> > s/whith/which/
> > 
> > "compressed":false seems universally valid for all other file formats,
> > and the field is not marked as optional.  Do we really need the
> > disclaimer?  Could we get by with the shorter 'Will be false for
> > formats that do not support compression', or by omitting it
> > altogether?
> > 
> 
> You're right, this remark should've been removed as it only makes sense
> in case of the field being optional.  Feel free to remove it altogether,
> or I can send a follow-up if you prefer.

I'm updating it in my queue to read:

# @compressed: true if the data is stored compressed (since 8.2)
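
For illustration only (hypothetical output, not taken from the patch;
exact field order and values depend on the image), a compressed qcow2
cluster would then show up in "qemu-img map --output=json" roughly like
this:

    $ qemu-img map --output=json compressed.qcow2
    [{ "start": 0, "length": 65536, "depth": 0, "present": true,
       "zero": false, "data": true, "compressed": true},
    ...]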

Kevin




Re: [PATCH 5/7] block/vdi: Clean up local variable shadowing

2023-09-15 Thread Kevin Wolf
Am 31.08.2023 um 15:25 hat Markus Armbruster geschrieben:
> Local variables shadowing other local variables or parameters make the
> code needlessly hard to understand.  Tracked down with -Wshadow=local.
> Clean up: delete inner declarations when they are actually redundant,
> else rename variables.
> 
> Signed-off-by: Markus Armbruster 

> @@ -700,7 +699,7 @@ nonallocating_write:
>  /* One or more new blocks were allocated. */
>  VdiHeader *header;
>  uint8_t *base;
> -uint64_t offset;
> +uint64_t offs;
>  uint32_t n_sectors;
>  
>  g_free(block);
> @@ -723,11 +722,11 @@ nonallocating_write:
>  bmap_first /= (SECTOR_SIZE / sizeof(uint32_t));
>  bmap_last /= (SECTOR_SIZE / sizeof(uint32_t));
>  n_sectors = bmap_last - bmap_first + 1;
> -offset = s->bmap_sector + bmap_first;
> +offs = s->bmap_sector + bmap_first;
>  base = ((uint8_t *)&s->bmap[0]) + bmap_first * SECTOR_SIZE;
>  logout("will write %u block map sectors starting from entry %u\n",
> n_sectors, bmap_first);
> -ret = bdrv_co_pwrite(bs->file, offset * SECTOR_SIZE,
> +ret = bdrv_co_pwrite(bs->file, offs * SECTOR_SIZE,
>   n_sectors * SECTOR_SIZE, base, 0);
>  }

Having two variables 'offset' and 'offs' doesn't really help with
clarity either. Can we be more specific and use something like
'bmap_offset' here?
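
I.e. something like this (untested sketch of the suggested rename,
based only on the hunk quoted above):

    bmap_offset = s->bmap_sector + bmap_first;
    ...
    ret = bdrv_co_pwrite(bs->file, bmap_offset * SECTOR_SIZE,
                         n_sectors * SECTOR_SIZE, base, 0);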

Kevin




Re: [PATCH 4/7] block/dirty-bitmap: Clean up local variable shadowing

2023-09-15 Thread Kevin Wolf
Am 31.08.2023 um 15:25 hat Markus Armbruster geschrieben:
> Local variables shadowing other local variables or parameters make the
> code needlessly hard to understand.  Tracked down with -Wshadow=local.
> Clean up: delete inner declarations when they are actually redundant,
> else rename variables.
> 
> Signed-off-by: Markus Armbruster 
> ---
>  block/monitor/bitmap-qmp-cmds.c | 2 +-
>  block/qcow2-bitmap.c| 3 +--
>  2 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/block/monitor/bitmap-qmp-cmds.c b/block/monitor/bitmap-qmp-cmds.c
> index 55f778f5af..4d018423d8 100644
> --- a/block/monitor/bitmap-qmp-cmds.c
> +++ b/block/monitor/bitmap-qmp-cmds.c
> @@ -276,7 +276,7 @@ BdrvDirtyBitmap *block_dirty_bitmap_merge(const char 
> *node, const char *target,
>  
>  for (lst = bms; lst; lst = lst->next) {
>  switch (lst->value->type) {
> -const char *name, *node;
> +const char *name;
>  case QTYPE_QSTRING:
>  name = lst->value->u.local;
>  src = bdrv_find_dirty_bitmap(bs, name);

The names in this function are all over the place... A more ambitious
patch could rename the parameters to dst_node/dst_bitmap and these
variables to src_node/src_bitmap to get some more consistency (both with
each other and with the existing src/dst variables).

Preexisting, so I'm not insisting that you should do this.

> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
> index 037fa2d435..ffd5cd3b23 100644
> --- a/block/qcow2-bitmap.c
> +++ b/block/qcow2-bitmap.c
> @@ -1555,7 +1555,6 @@ bool 
> qcow2_store_persistent_dirty_bitmaps(BlockDriverState *bs,
>  FOR_EACH_DIRTY_BITMAP(bs, bitmap) {
>  const char *name = bdrv_dirty_bitmap_name(bitmap);
>  uint32_t granularity = bdrv_dirty_bitmap_granularity(bitmap);
> -Qcow2Bitmap *bm;
>  
>  if (!bdrv_dirty_bitmap_get_persistence(bitmap) ||
>  bdrv_dirty_bitmap_inconsistent(bitmap)) {
> @@ -1625,7 +1624,7 @@ bool 
> qcow2_store_persistent_dirty_bitmaps(BlockDriverState *bs,
>  
>  /* allocate clusters and store bitmaps */
>  QSIMPLEQ_FOREACH(bm, bm_list, entry) {
> -BdrvDirtyBitmap *bitmap = bm->dirty_bitmap;
> +bitmap = bm->dirty_bitmap;
>  
>  if (bitmap == NULL || bdrv_dirty_bitmap_readonly(bitmap)) {
>  continue;

Reviewed-by: Kevin Wolf 




Re: [PULL 4/5] hw/ufs: Support for UFS logical unit

2023-09-15 Thread Paolo Bonzini

On 9/15/23 00:19, Jeuk Kim wrote:
> First, ufs-lu has a feature called "unit descriptor". This feature
> shows the status of the ufs-lu and only works with UFS-specific
> "query request" commands, not SCSI commands.

This looks like something that can be implemented in the UFS subsystem.

> UFS also has something called a well-known lu. Unlike typical SCSI
> devices, where each lu is independent, UFS can control other lu's
> through the well-known lu.

This can also be implemented in UfsBus.

> Finally, UFS-LU will have features that SCSI-HD does not have, such as
> the zone block command.

These should be implemented in scsi-hd as well.

> In addition to this, I wanted some scsi commands to behave differently
> from scsi-hd, for example, the Inquiry command should read "QEMU UFS"
> instead of "QEMU HARDDISK", and the mode_sense_page command should
> have a different result.

Some of these don't have much justification, and others (such as the
control page) could be done in scsi-hd as well.


We should look into cleaning this up and making ufs-lu share a lot more 
code with scsi-hd; possibly even supporting -device scsi-hd with UFS 
devices.  I am not going to ask you for a revert, but if this is not 
done before 8.2 is out, I will ask you to disable it by default in 
hw/ufs/Kconfig.


In the future, please Cc the SCSI maintainers for UFS patches.

Paolo




Re: [PATCH 6/7] block: Clean up local variable shadowing

2023-09-15 Thread Kevin Wolf
Am 31.08.2023 um 15:25 hat Markus Armbruster geschrieben:
> Local variables shadowing other local variables or parameters make the
> code needlessly hard to understand.  Tracked down with -Wshadow=local.
> Clean up: delete inner declarations when they are actually redundant,
> else rename variables.
> 
> Signed-off-by: Markus Armbruster 
> ---
>  block.c  |  7 ---
>  block/rbd.c  |  2 +-
>  block/stream.c   |  1 -
>  block/vvfat.c| 34 +-
>  hw/block/xen-block.c |  6 +++---
>  5 files changed, 25 insertions(+), 25 deletions(-)

I wonder why you made vdi a separate patch, but not vvfat, even though
that has more changes. (Of course, my selfish motivation for asking this
is that I could have given a R-b for it and wouldn't have to look at it
again in a v2 :-))

> diff --git a/block.c b/block.c
> index a307c151a8..7f0003d8ac 100644
> --- a/block.c
> +++ b/block.c
> @@ -3001,7 +3001,8 @@ static BdrvChild 
> *bdrv_attach_child_common(BlockDriverState *child_bs,
> BdrvChildRole child_role,
> uint64_t perm, uint64_t 
> shared_perm,
> void *opaque,
> -   Transaction *tran, Error **errp)
> +   Transaction *transaction,
> +   Error **errp)
>  {
>  BdrvChild *new_child;
>  AioContext *parent_ctx, *new_child_ctx;
> @@ -3088,7 +3089,7 @@ static BdrvChild 
> *bdrv_attach_child_common(BlockDriverState *child_bs,
>  .old_parent_ctx = parent_ctx,
>  .old_child_ctx = child_ctx,
>  };
> -tran_add(tran, &bdrv_attach_child_common_drv, s);
> +tran_add(transaction, &bdrv_attach_child_common_drv, s);
>  
>  if (new_child_ctx != child_ctx) {
>  aio_context_release(new_child_ctx);

I think I would resolve this one the other way around. 'tran' is the
typical name for the parameter and it is the transaction that this
function should add things to.

The other one that shadows it is a local transaction that is completed
within the function. I think it's better if that one has a different
name.

As usual, being more specific than just 'tran' vs. 'transaction' would
be nice. Maybe 'aio_ctx_tran' for the nested one?

The rest looks okay.

Kevin




Re: [PATCH 0/4] virtio-blk: prepare for the multi-queue block layer

2023-09-15 Thread Michael S. Tsirkin
On Thu, Sep 14, 2023 at 10:00:57AM -0400, Stefan Hajnoczi wrote:
> The virtio-blk device will soon be able to assign virtqueues to IOThreads,
> eliminating the single IOThread bottleneck. In order to do that, the I/O code
> path must support running in multiple threads.
> 
> This patch series removes the AioContext lock from the virtio-blk I/O code
> path, adds thread-safety where it is required, and ensures that Linux AIO and
> io_uring are available regardless of which thread calls into the block driver.
> With these changes virtio-blk is ready for the iothread-vq-mapping feature,
> which will be introduced in the next patch series.
> 
> Based-on: 20230913200045.1024233-1-stefa...@redhat.com ("[PATCH v3 0/4] 
> virtio-blk: use blk_io_plug_call() instead of notification BH")
> Based-on: 20230912231037.826804-1-stefa...@redhat.com ("[PATCH v3 0/5] 
> block-backend: process I/O in the current AioContext")


virtio bits:

Reviewed-by: Michael S. Tsirkin 

feel free to merge

> Stefan Hajnoczi (4):
>   block/file-posix: set up Linux AIO and io_uring in the current thread
>   virtio-blk: add lock to protect s->rq
>   virtio-blk: don't lock AioContext in the completion code path
>   virtio-blk: don't lock AioContext in the submission code path
> 
>  include/hw/virtio/virtio-blk.h |   3 +-
>  block/file-posix.c |  99 +++---
>  hw/block/virtio-blk.c  | 106 +++--
>  3 files changed, 98 insertions(+), 110 deletions(-)
> 
> -- 
> 2.41.0




[PATCH 2/3] backends: Initial support for SPDM socket support

2023-09-15 Thread Alistair Francis
From: Huai-Cheng Kuo 

SPDM enables authentication, attestation and key exchange to assist in
providing infrastructure security enablement. It's a standard published
by the DMTF [1].

SPDM currently supports PCIe DOE and MCTP transports, but it can be
extended to support others in the future. This patch adds
support to QEMU to connect to an external SPDM instance.

SPDM support can be added to any QEMU device by exposing a
TCP socket to a SPDM server. The server can then implement the SPDM
decoding/encoding support, generally using libspdm [2].

This is similar to how the current TPM implementation works and means
that the heavy lifting of setting up certificate chains, capabilities,
measurements and complex crypto can be done outside QEMU by a well
supported and tested library.

1: https://www.dmtf.org/standards/SPDM
2: https://github.com/DMTF/libspdm

Signed-off-by: Huai-Cheng Kuo 
Signed-off-by: Chris Browy 
Co-developed-by: Jonathan Cameron 
Signed-off-by: Jonathan Cameron 
[ Changes by AF:
 - Convert to be more QEMU-ified
 - Move to backends as it isn't PCIe specific
]
Signed-off-by: Alistair Francis 
Signed-off-by: Wilfred Mallawa 
---
 include/sysemu/spdm-socket.h |  44 +++
 backends/spdm-socket.c   | 215 +++
 backends/Kconfig |   4 +
 backends/meson.build |   2 +
 4 files changed, 265 insertions(+)
 create mode 100644 include/sysemu/spdm-socket.h
 create mode 100644 backends/spdm-socket.c

diff --git a/include/sysemu/spdm-socket.h b/include/sysemu/spdm-socket.h
new file mode 100644
index 00..24e6fccb83
--- /dev/null
+++ b/include/sysemu/spdm-socket.h
@@ -0,0 +1,44 @@
+/*
+ * QEMU SPDM socket support
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#ifndef SPDM_REQUESTER_H
+#define SPDM_REQUESTER_H
+
+int spdm_socket_connect(uint16_t port, Error **errp);
+uint32_t spdm_socket_rsp(const int socket, uint32_t transport_type,
+ void *req, uint32_t req_len,
+ void *rsp, uint32_t rsp_len);
+void spdm_socket_close(const int socket, uint32_t transport_type);
+
+#define SPDM_SOCKET_COMMAND_NORMAL0x0001
+#define SPDM_SOCKET_COMMAND_OOB_ENCAP_KEY_UPDATE  0x8001
+#define SPDM_SOCKET_COMMAND_CONTINUE  0xFFFD
+#define SPDM_SOCKET_COMMAND_SHUTDOWN  0xFFFE
+#define SPDM_SOCKET_COMMAND_UNKOWN0x
+#define SPDM_SOCKET_COMMAND_TEST  0xDEAD
+
+#define SPDM_SOCKET_TRANSPORT_TYPE_MCTP   0x01
+#define SPDM_SOCKET_TRANSPORT_TYPE_PCI_DOE0x02
+
+#define SPDM_SOCKET_MAX_MESSAGE_BUFFER_SIZE   0x1200
+
+#endif
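
For context, a device model is expected to drive this API roughly as
sketched below. This is an illustrative, untested fragment: the device
state, the spdm_port property and the error handling are placeholders
invented for the example and are not part of this patch (the real user
is the NVMe DOE glue added in patch 3/3).

    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include "sysemu/spdm-socket.h"

    /* Hypothetical device state; only the fields used here. */
    typedef struct MyDeviceState {
        uint16_t spdm_port;  /* TCP port of the external SPDM server */
        int spdm_socket;     /* handle returned by spdm_socket_connect() */
    } MyDeviceState;

    static bool my_device_spdm_init(MyDeviceState *s, Error **errp)
    {
        s->spdm_socket = spdm_socket_connect(s->spdm_port, errp);
        /* assumption: a negative value indicates failure */
        return s->spdm_socket >= 0;
    }

    static uint32_t my_device_spdm_exchange(MyDeviceState *s,
                                            void *req, uint32_t req_len,
                                            void *rsp, uint32_t rsp_len)
    {
        /* Forward one request to the SPDM server; returns the number of
         * response bytes received, or 0 on failure. */
        return spdm_socket_rsp(s->spdm_socket,
                               SPDM_SOCKET_TRANSPORT_TYPE_PCI_DOE,
                               req, req_len, rsp, rsp_len);
    }

    static void my_device_spdm_fini(MyDeviceState *s)
    {
        spdm_socket_close(s->spdm_socket,
                          SPDM_SOCKET_TRANSPORT_TYPE_PCI_DOE);
    }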
diff --git a/backends/spdm-socket.c b/backends/spdm-socket.c
new file mode 100644
index 00..2f31ba80ba
--- /dev/null
+++ b/backends/spdm-socket.c
@@ -0,0 +1,215 @@
+/* SPDX-License-Identifier: BSD-3-Clause */
+/*
+ * QEMU SPDM socket support
+ *
+ * This is based on:
+ * 
https://github.com/DMTF/spdm-emu/blob/07c0a838bcc1c6207c656ac75885c0603e344b6f/spdm_emu/spdm_emu_common/command.c
+ * but has been re-written to match QEMU style
+ *
+ * Copyright (c) 2021, DMTF. All rights reserved.
+ * Copyright (c) 2023. Western Digital Corporation or its affiliates.
+ */
+
+#include "qemu/osdep.h"
+#include "sysemu/spdm-socket.h"
+#include "qapi/error.h"
+
+static bool read_bytes(const int socket, uint8_t *buffer,
+   size_t number_of_bytes)
+{
+ssize_t number_received = 0;
+ssize_t result;
+
+while (number_received < number_of_bytes) {
+result = recv(socket, buffer + number_received,
+  number_of_bytes - number_received, 0);
+if (result <= 0) {
+return false;
+}
+number_received += result;
+}
+return true;
+}
+
+static bool read_data32(const int socket, uint32_t *data)
+{
+bool result;
+
+result = read_bytes(socket, (uint8_t *)data, sizeof(uint32_t));
+if (!result) {
+   

[PATCH 3/3] hw/nvme: Add SPDM over DOE support

2023-09-15 Thread Alistair Francis
From: Wilfred Mallawa 

Set up Data Object Exchange (DOE) as an extended capability for the NVMe
controller and connect SPDM (CMA) to it.

Signed-off-by: Wilfred Mallawa 
Signed-off-by: Alistair Francis 
---
 docs/specs/index.rst|  1 +
 docs/specs/spdm.rst | 56 +
 include/hw/pci/pci_device.h |  5 
 include/hw/pci/pcie_doe.h   |  3 ++
 hw/nvme/ctrl.c  | 52 ++
 hw/nvme/trace-events|  1 +
 6 files changed, 118 insertions(+)
 create mode 100644 docs/specs/spdm.rst

diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index e58be38c41..c398541388 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -24,3 +24,4 @@ guest hardware that is specific to QEMU.
acpi_erst
sev-guest-firmware
fw_cfg
+   spdm
diff --git a/docs/specs/spdm.rst b/docs/specs/spdm.rst
new file mode 100644
index 00..0f96d618ef
--- /dev/null
+++ b/docs/specs/spdm.rst
@@ -0,0 +1,56 @@
+======================================================
+QEMU Security Protocols and Data Models (SPDM) Support
+======================================================
+
+SPDM enables authentication, attestation and key exchange to assist in
+providing infrastructure security enablement. It's a standard published
+by the DMTF https://www.dmtf.org/standards/SPDM.
+
+Setting up a SPDM server
+
+
+When using QEMU with SPDM devices QEMU will connect to a server which
+implements the SPDM functionality.
+
+spdm-emu
+
+
+You can use spdm-emu https://github.com/dmtf/spdm-emu to model the
+SPDM responder.
+
+.. code-block:: shell
+
+$ cd spdm-emu
+$ git submodule init; git submodule update --recursive
+$ mkdir build; cd build
+$ cmake -DARCH=x64 -DTOOLCHAIN=GCC -DTARGET=Debug -DCRYPTO=openssl ..
+$ make -j32
+$ make copy_sample_key # Build certificates, required for SPDM 
authentication.
+
+The responder can then be launched with
+
+.. code-block:: shell
+
+$ cd bin
+$ ./spdm_responder_emu --trans PCI_DOE
+
+Connecting an SPDM NVMe device
+==============================
+
+Once a SPDM server is running we can start QEMU and connect to the server.
+
+For an NVMe device first let's setup a block we can use
+
+.. code-block:: shell
+
+$ cd qemu-spdm/linux/image
+$ dd if=/dev/zero of=blknvme bs=1M count=2096 # 2GB NVMe Drive
+
+Then you can add this to your QEMU command line:
+
+.. code-block:: shell
+
+-drive file=blknvme,if=none,id=mynvme,format=raw \
+-device nvme,drive=mynvme,serial=deadbeef,spdm=2323
+
+At which point QEMU will connect to the SPDM server.
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..b8379c78f1 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -3,6 +3,7 @@
 
 #include "hw/pci/pci.h"
 #include "hw/pci/pcie.h"
+#include "hw/pci/pcie_doe.h"
 
 #define TYPE_PCI_DEVICE "pci-device"
 typedef struct PCIDeviceClass PCIDeviceClass;
@@ -157,6 +158,10 @@ struct PCIDevice {
 MSIVectorReleaseNotifier msix_vector_release_notifier;
 MSIVectorPollNotifier msix_vector_poll_notifier;
 
+/* DOE */
+DOECap doe_spdm;
+uint16_t spdm_port;
+
 /* ID of standby device in net_failover pair */
 char *failover_pair_id;
 uint32_t acpi_index;
diff --git a/include/hw/pci/pcie_doe.h b/include/hw/pci/pcie_doe.h
index 15d94661f9..eb8f4e393d 100644
--- a/include/hw/pci/pcie_doe.h
+++ b/include/hw/pci/pcie_doe.h
@@ -108,6 +108,9 @@ struct DOECap {
 /* Protocols and its callback response */
 DOEProtocol *protocols;
 uint16_t protocol_num;
+
+/* Used for spdm-socket */
+int socket;
 };
 
 void pcie_doe_init(PCIDevice *pdev, DOECap *doe_cap, uint16_t offset,
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90687b168a..1ff30a9ad4 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -203,6 +203,7 @@
 #include "sysemu/hostmem.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pcie_sriov.h"
+#include "sysemu/spdm-socket.h"
 #include "migration/vmstate.h"
 
 #include "nvme.h"
@@ -8077,6 +8078,28 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, 
uint8_t offset)
 return 0;
 }
 
+static bool pcie_doe_spdm_rsp(DOECap *doe_cap)
+{
+void *req = pcie_doe_get_write_mbox_ptr(doe_cap);
+uint32_t req_len = pcie_doe_get_obj_len(req) * 4;
+void *rsp = doe_cap->read_mbox;
+uint32_t rsp_len = SPDM_SOCKET_MAX_MESSAGE_BUFFER_SIZE;
+uint32_t recvd;
+
+recvd = spdm_socket_rsp(doe_cap->socket,
+ SPDM_SOCKET_TRANSPORT_TYPE_PCI_DOE,
+ req, req_len, rsp, rsp_len);
+doe_cap->read_mbox_len += DIV_ROUND_UP(recvd, 4);
+
+return (recvd == 0) ? false : true;
+}
+
+static DOEProtocol doe_spdm_prot[] = {
+{ PCI_VENDOR_ID_PCI_SIG, PCI_SIG_DOE_CMA, pcie_doe_spdm_rsp },
+{ PCI_VENDOR_ID_PCI_SIG, PCI_SIG_DOE_SECURED_CMA, pcie_doe_spdm_rsp },
+{ }
+};
+
 static bool nvme_init_pci(Nvme

[PATCH 1/3] hw/pci: Add all Data Object Types

2023-09-15 Thread Alistair Francis
Add all of the defined protocols/features from the PCI-SIG
"Table 6-32 PCI-SIG defined Data Object Types (Vendor ID = 0001h)"
table.

Signed-off-by: Alistair Francis 
---
 include/hw/pci/pcie_doe.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/hw/pci/pcie_doe.h b/include/hw/pci/pcie_doe.h
index 87dc17dcef..15d94661f9 100644
--- a/include/hw/pci/pcie_doe.h
+++ b/include/hw/pci/pcie_doe.h
@@ -46,6 +46,8 @@ REG32(PCI_DOE_CAP_STATUS, 0)
 
 /* PCI-SIG defined Data Object Types - r6.0 Table 6-32 */
 #define PCI_SIG_DOE_DISCOVERY   0x00
+#define PCI_SIG_DOE_CMA 0x01
+#define PCI_SIG_DOE_SECURED_CMA 0x02
 
 #define PCI_DOE_DW_SIZE_MAX (1 << 18)
 #define PCI_DOE_PROTOCOL_NUM_MAX256
-- 
2.41.0




Re: [PATCH v3 0/5] block-backend: process I/O in the current AioContext

2023-09-15 Thread Kevin Wolf
Am 13.09.2023 um 01:10 hat Stefan Hajnoczi geschrieben:
> v3
> - Add Patch 2 to fix a race condition in test-bdrv-drain. This was the CI
>   failure that bumped this patch series from Kevin's pull request.
> - Add missing 051.pc.out file. I tried qemu-system-aarch64 to see if 051.out
>   also needs to be updated, but no changes were necessary. [Kevin]
> v2
> - Add patch to remove AIOCBInfo->get_aio_context() [Kevin]
> - Add patch to use qemu_get_current_aio_context() in block-coroutine-wrapper 
> so
>   that the wrappers use the current AioContext instead of
>   bdrv_get_aio_context().
> 
> Switch blk_aio_*() APIs over to multi-queue by using
> qemu_get_current_aio_context() instead of blk_get_aio_context(). This change
> will allow devices to process I/O in multiple IOThreads in the future.
> 
> The final patch requires my QIOChannel AioContext series to pass
> tests/qemu-iotests/check -qcow2 281 because the nbd block driver is now
> accessed from the main loop thread in addition to the IOThread:
> https://lore.kernel.org/qemu-devel/20230823234504.1387239-1-stefa...@redhat.com/T/#t
> 
> Based-on: 20230823234504.1387239-1-stefa...@redhat.com

Thanks, applied to the block branch.

Kevin




[PULL 14/28] block: Mark bdrv_get_cumulative_perm() and callers GRAPH_RDLOCK

2023-09-15 Thread Kevin Wolf
The function reads the parents list, so it needs to hold the graph lock.

This happens to result in BlockDriver.bdrv_set_perm() being called with
the graph lock held. For consistency, make it the same for all of the
BlockDriver callbacks for updating permissions and annotate the function
pointers with GRAPH_RDLOCK_PTR.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-15-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/block/block_int-common.h   |  9 ---
 include/block/block_int-global-state.h |  4 +--
 block.c| 35 --
 blockdev.c |  6 +
 4 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index fda9d8b5c8..f82c14fb9c 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -413,8 +413,8 @@ struct BlockDriver {
  * If both conditions are met, 0 is returned. Otherwise, -errno is returned
  * and errp is set to an error describing the conflict.
  */
-int (*bdrv_check_perm)(BlockDriverState *bs, uint64_t perm,
-   uint64_t shared, Error **errp);
+int GRAPH_RDLOCK_PTR (*bdrv_check_perm)(BlockDriverState *bs, uint64_t 
perm,
+uint64_t shared, Error **errp);
 
 /**
  * Called to inform the driver that the set of cumulative set of used
@@ -426,7 +426,8 @@ struct BlockDriver {
  * This function is only invoked after bdrv_check_perm(), so block drivers
  * may rely on preparations made in their .bdrv_check_perm implementation.
  */
-void (*bdrv_set_perm)(BlockDriverState *bs, uint64_t perm, uint64_t 
shared);
+void GRAPH_RDLOCK_PTR (*bdrv_set_perm)(
+BlockDriverState *bs, uint64_t perm, uint64_t shared);
 
 /*
  * Called to inform the driver that after a previous bdrv_check_perm()
@@ -436,7 +437,7 @@ struct BlockDriver {
  * This function can be called even for nodes that never saw a
  * bdrv_check_perm() call. It is a no-op then.
  */
-void (*bdrv_abort_perm_update)(BlockDriverState *bs);
+void GRAPH_RDLOCK_PTR (*bdrv_abort_perm_update)(BlockDriverState *bs);
 
 /**
  * Returns in @nperm and @nshared the permissions that the driver for @bs
diff --git a/include/block/block_int-global-state.h 
b/include/block/block_int-global-state.h
index bebcc08bce..e2304db58b 100644
--- a/include/block/block_int-global-state.h
+++ b/include/block/block_int-global-state.h
@@ -204,8 +204,8 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState 
*child_bs,
   void *opaque, Error **errp);
 void bdrv_root_unref_child(BdrvChild *child);
 
-void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
-  uint64_t *shared_perm);
+void GRAPH_RDLOCK bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t 
*perm,
+   uint64_t *shared_perm);
 
 /**
  * Sets a BdrvChild's permissions.  Avoid if the parent is a BDS; use
diff --git a/block.c b/block.c
index 6720bc4f8a..186efda70f 100644
--- a/block.c
+++ b/block.c
@@ -2320,7 +2320,7 @@ static void bdrv_child_set_perm(BdrvChild *c, uint64_t 
perm,
 tran_add(tran, &bdrv_child_set_pem_drv, s);
 }
 
-static void bdrv_drv_set_perm_commit(void *opaque)
+static void GRAPH_RDLOCK bdrv_drv_set_perm_commit(void *opaque)
 {
 BlockDriverState *bs = opaque;
 uint64_t cumulative_perms, cumulative_shared_perms;
@@ -2333,7 +2333,7 @@ static void bdrv_drv_set_perm_commit(void *opaque)
 }
 }
 
-static void bdrv_drv_set_perm_abort(void *opaque)
+static void GRAPH_RDLOCK bdrv_drv_set_perm_abort(void *opaque)
 {
 BlockDriverState *bs = opaque;
 GLOBAL_STATE_CODE();
@@ -2348,9 +2348,13 @@ TransactionActionDrv bdrv_drv_set_perm_drv = {
 .commit = bdrv_drv_set_perm_commit,
 };
 
-static int bdrv_drv_set_perm(BlockDriverState *bs, uint64_t perm,
- uint64_t shared_perm, Transaction *tran,
- Error **errp)
+/*
+ * After calling this function, the transaction @tran may only be completed
+ * while holding a reader lock for the graph.
+ */
+static int GRAPH_RDLOCK
+bdrv_drv_set_perm(BlockDriverState *bs, uint64_t perm, uint64_t shared_perm,
+  Transaction *tran, Error **errp)
 {
 GLOBAL_STATE_CODE();
 if (!bs->drv) {
@@ -2457,9 +2461,13 @@ bdrv_replace_child_tran(BdrvChild *child, 
BlockDriverState *new_bs,
 /*
  * Refresh permissions in @bs subtree. The function is intended to be called
  * after some graph modification that was done without permission update.
+ *
+ * After calling this function, the transaction @tran may only be completed
+ * while holding a reader lock for the graph.
  */
-static int bdrv_node_refresh_perm(BlockDriverState *bs, BlockReopenQueue *

[PULL 02/28] preallocate: Factor out preallocate_truncate_to_real_size()

2023-09-15 Thread Kevin Wolf
It's essentially the same code in preallocate_check_perm() and
preallocate_close(), except that the latter ignores errors.

Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-3-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block/preallocate.c | 48 +
 1 file changed, 22 insertions(+), 26 deletions(-)

diff --git a/block/preallocate.c b/block/preallocate.c
index 3d0f621003..3173d80534 100644
--- a/block/preallocate.c
+++ b/block/preallocate.c
@@ -162,26 +162,39 @@ static int preallocate_open(BlockDriverState *bs, QDict 
*options, int flags,
 return 0;
 }
 
-static void preallocate_close(BlockDriverState *bs)
+static int preallocate_truncate_to_real_size(BlockDriverState *bs, Error 
**errp)
 {
-int ret;
 BDRVPreallocateState *s = bs->opaque;
-
-if (s->data_end < 0) {
-return;
-}
+int ret;
 
 if (s->file_end < 0) {
 s->file_end = bdrv_getlength(bs->file->bs);
 if (s->file_end < 0) {
-return;
+error_setg_errno(errp, -s->file_end, "Failed to get file length");
+return s->file_end;
 }
 }
 
 if (s->data_end < s->file_end) {
 ret = bdrv_truncate(bs->file, s->data_end, true, PREALLOC_MODE_OFF, 0,
 NULL);
-s->file_end = ret < 0 ? ret : s->data_end;
+if (ret < 0) {
+error_setg_errno(errp, -ret, "Failed to drop preallocation");
+s->file_end = ret;
+return ret;
+}
+s->file_end = s->data_end;
+}
+
+return 0;
+}
+
+static void preallocate_close(BlockDriverState *bs)
+{
+BDRVPreallocateState *s = bs->opaque;
+
+if (s->data_end >= 0) {
+preallocate_truncate_to_real_size(bs, NULL);
 }
 }
 
@@ -473,24 +486,7 @@ static int preallocate_check_perm(BlockDriverState *bs,
  * We should truncate in check_perm, as in set_perm bs->file->perm will
  * be already changed, and we should not violate it.
  */
-if (s->file_end < 0) {
-s->file_end = bdrv_getlength(bs->file->bs);
-if (s->file_end < 0) {
-error_setg(errp, "Failed to get file length");
-return s->file_end;
-}
-}
-
-if (s->data_end < s->file_end) {
-int ret = bdrv_truncate(bs->file, s->data_end, true,
-PREALLOC_MODE_OFF, 0, NULL);
-if (ret < 0) {
-error_setg(errp, "Failed to drop preallocation");
-s->file_end = ret;
-return ret;
-}
-s->file_end = s->data_end;
-}
+return preallocate_truncate_to_real_size(bs, errp);
 }
 
 return 0;
-- 
2.41.0




[PULL 04/28] block: Take AioContext lock for bdrv_append() more consistently

2023-09-15 Thread Kevin Wolf
The documentation for bdrv_append() says that the caller must hold the
AioContext lock for bs_top. Change all callers to actually adhere to the
contract.

Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-5-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 tests/unit/test-bdrv-drain.c | 3 +++
 tests/unit/test-bdrv-graph-mod.c | 6 ++
 tests/unit/test-block-iothread.c | 3 +++
 3 files changed, 12 insertions(+)

diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index ccc453c29e..89c8fa6780 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -1359,7 +1359,10 @@ static void test_append_to_drained(void)
 g_assert_cmpint(base_s->drain_count, ==, 1);
 g_assert_cmpint(base->in_flight, ==, 0);
 
+aio_context_acquire(qemu_get_aio_context());
 bdrv_append(overlay, base, &error_abort);
+aio_context_release(qemu_get_aio_context());
+
 g_assert_cmpint(base->in_flight, ==, 0);
 g_assert_cmpint(overlay->in_flight, ==, 0);
 
diff --git a/tests/unit/test-bdrv-graph-mod.c b/tests/unit/test-bdrv-graph-mod.c
index 36eed4b464..d8503165b3 100644
--- a/tests/unit/test-bdrv-graph-mod.c
+++ b/tests/unit/test-bdrv-graph-mod.c
@@ -140,8 +140,10 @@ static void test_update_perm_tree(void)
 bdrv_attach_child(filter, bs, "child", &child_of_bds,
   BDRV_CHILD_DATA, &error_abort);
 
+aio_context_acquire(qemu_get_aio_context());
 ret = bdrv_append(filter, bs, NULL);
 g_assert_cmpint(ret, <, 0);
+aio_context_release(qemu_get_aio_context());
 
 bdrv_unref(filter);
 blk_unref(root);
@@ -205,7 +207,9 @@ static void test_should_update_child(void)
 g_assert(target->backing->bs == bs);
 bdrv_attach_child(filter, target, "target", &child_of_bds,
   BDRV_CHILD_DATA, &error_abort);
+aio_context_acquire(qemu_get_aio_context());
 bdrv_append(filter, bs, &error_abort);
+aio_context_release(qemu_get_aio_context());
 g_assert(target->backing->bs == bs);
 
 bdrv_unref(filter);
@@ -410,7 +414,9 @@ static void test_append_greedy_filter(void)
   BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
   &error_abort);
 
+aio_context_acquire(qemu_get_aio_context());
 bdrv_append(fl, base, &error_abort);
+aio_context_release(qemu_get_aio_context());
 bdrv_unref(fl);
 bdrv_unref(top);
 }
diff --git a/tests/unit/test-block-iothread.c b/tests/unit/test-block-iothread.c
index d727a5fee8..9155547313 100644
--- a/tests/unit/test-block-iothread.c
+++ b/tests/unit/test-block-iothread.c
@@ -756,11 +756,14 @@ static void test_propagate_mirror(void)
   &error_abort);
 
 /* Start a mirror job */
+aio_context_acquire(main_ctx);
 mirror_start("job0", src, target, NULL, JOB_DEFAULT, 0, 0, 0,
  MIRROR_SYNC_MODE_NONE, MIRROR_OPEN_BACKING_CHAIN, false,
  BLOCKDEV_ON_ERROR_REPORT, BLOCKDEV_ON_ERROR_REPORT,
  false, "filter_node", MIRROR_COPY_MODE_BACKGROUND,
  &error_abort);
+aio_context_release(main_ctx);
+
 WITH_JOB_LOCK_GUARD() {
 job = job_get_locked("job0");
 }
-- 
2.41.0




[PULL 08/28] block: Mark bdrv_replace_child_noperm() GRAPH_WRLOCK

2023-09-15 Thread Kevin Wolf
Instead of taking the writer lock internally, require callers to already
hold it when calling bdrv_replace_child_noperm(). These callers will
typically already hold the graph lock once the locking work is
completed, which means that they can't call functions that take it
internally.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-9-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/block.c b/block.c
index c8ac7cfac4..61856f5c33 100644
--- a/block.c
+++ b/block.c
@@ -91,8 +91,9 @@ static BlockDriverState *bdrv_open_inherit(const char 
*filename,
 static bool bdrv_recurse_has_child(BlockDriverState *bs,
BlockDriverState *child);
 
-static void bdrv_replace_child_noperm(BdrvChild *child,
-  BlockDriverState *new_bs);
+static void GRAPH_WRLOCK
+bdrv_replace_child_noperm(BdrvChild *child, BlockDriverState *new_bs);
+
 static void bdrv_remove_child(BdrvChild *child, Transaction *tran);
 
 static int bdrv_reopen_prepare(BDRVReopenState *reopen_state,
@@ -2387,6 +2388,8 @@ static void bdrv_replace_child_abort(void *opaque)
 BlockDriverState *new_bs = s->child->bs;
 
 GLOBAL_STATE_CODE();
+bdrv_graph_wrlock(s->old_bs);
+
 /* old_bs reference is transparently moved from @s to @s->child */
 if (!s->child->bs) {
 /*
@@ -2403,6 +2406,8 @@ static void bdrv_replace_child_abort(void *opaque)
 }
 assert(s->child->quiesced_parent);
 bdrv_replace_child_noperm(s->child, s->old_bs);
+
+bdrv_graph_wrunlock();
 bdrv_unref(new_bs);
 }
 
@@ -2439,7 +2444,10 @@ static void bdrv_replace_child_tran(BdrvChild *child, 
BlockDriverState *new_bs,
 if (new_bs) {
 bdrv_ref(new_bs);
 }
+
+bdrv_graph_wrlock(new_bs);
 bdrv_replace_child_noperm(child, new_bs);
+bdrv_graph_wrunlock();
 /* old_bs reference is transparently moved from @child to @s */
 }
 
@@ -2858,8 +2866,8 @@ uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission 
qapi_perm)
  * If @new_bs is non-NULL, the parent of @child must already be drained through
  * @child and the caller must hold the AioContext lock for @new_bs.
  */
-static void bdrv_replace_child_noperm(BdrvChild *child,
-  BlockDriverState *new_bs)
+static void GRAPH_WRLOCK
+bdrv_replace_child_noperm(BdrvChild *child, BlockDriverState *new_bs)
 {
 BlockDriverState *old_bs = child->bs;
 int new_bs_quiesce_counter;
@@ -2894,8 +2902,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
 }
 
-/* TODO Pull this up into the callers to avoid polling here */
-bdrv_graph_wrlock(new_bs);
 if (old_bs) {
 if (child->klass->detach) {
 child->klass->detach(child);
@@ -2911,7 +2917,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 child->klass->attach(child);
 }
 }
-bdrv_graph_wrunlock();
 
 /*
  * If the parent was drained through this BdrvChild previously, but new_bs
@@ -2952,7 +2957,10 @@ static void bdrv_attach_child_common_abort(void *opaque)
 BlockDriverState *bs = s->child->bs;
 
 GLOBAL_STATE_CODE();
+
+bdrv_graph_wrlock(NULL);
 bdrv_replace_child_noperm(s->child, NULL);
+bdrv_graph_wrunlock();
 
 if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
 bdrv_try_change_aio_context(bs, s->old_child_ctx, NULL, &error_abort);
@@ -3080,8 +3088,10 @@ static BdrvChild 
*bdrv_attach_child_common(BlockDriverState *child_bs,
  * a problem, we already did this), but it will still poll until the parent
  * is fully quiesced, so it will not be negatively affected either.
  */
+bdrv_graph_wrlock(child_bs);
 bdrv_parent_drained_begin_single(new_child);
 bdrv_replace_child_noperm(new_child, child_bs);
+bdrv_graph_wrunlock();
 
 BdrvAttachChildCommonState *s = g_new(BdrvAttachChildCommonState, 1);
 *s = (BdrvAttachChildCommonState) {
@@ -3225,7 +3235,9 @@ void bdrv_root_unref_child(BdrvChild *child)
 BlockDriverState *child_bs = child->bs;
 
 GLOBAL_STATE_CODE();
+bdrv_graph_wrlock(NULL);
 bdrv_replace_child_noperm(child, NULL);
+bdrv_graph_wrunlock();
 bdrv_child_free(child);
 
 if (child_bs) {
-- 
2.41.0




[PULL 13/28] block: Mark bdrv_parent_perms_conflict() and callers GRAPH_RDLOCK

2023-09-15 Thread Kevin Wolf
The function reads the parents list, so it needs to hold the graph lock.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-14-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/block/block_int-common.h|  6 ++---
 include/block/block_int-global-state.h  |  8 +++---
 include/sysemu/block-backend-global-state.h |  4 +--
 block.c | 28 +
 block/block-backend.c   | 26 ++-
 block/crypto.c  |  6 +++--
 block/mirror.c  |  8 ++
 block/vmdk.c|  2 ++
 tests/unit/test-bdrv-graph-mod.c|  4 +++
 9 files changed, 66 insertions(+), 26 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 85be256c09..fda9d8b5c8 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -311,7 +311,7 @@ struct BlockDriver {
  */
 void (*bdrv_cancel_in_flight)(BlockDriverState *bs);
 
-int (*bdrv_inactivate)(BlockDriverState *bs);
+int GRAPH_RDLOCK_PTR (*bdrv_inactivate)(BlockDriverState *bs);
 
 int (*bdrv_snapshot_create)(BlockDriverState *bs,
 QEMUSnapshotInfo *sn_info);
@@ -944,8 +944,8 @@ struct BdrvChildClass {
  * when migration is completing) and it can start/stop requesting
  * permissions and doing I/O on it.
  */
-void (*activate)(BdrvChild *child, Error **errp);
-int (*inactivate)(BdrvChild *child);
+void GRAPH_RDLOCK_PTR (*activate)(BdrvChild *child, Error **errp);
+int GRAPH_RDLOCK_PTR (*inactivate)(BdrvChild *child);
 
 void GRAPH_WRLOCK_PTR (*attach)(BdrvChild *child);
 void GRAPH_WRLOCK_PTR (*detach)(BdrvChild *child);
diff --git a/include/block/block_int-global-state.h 
b/include/block/block_int-global-state.h
index da5fb31089..bebcc08bce 100644
--- a/include/block/block_int-global-state.h
+++ b/include/block/block_int-global-state.h
@@ -212,8 +212,9 @@ void bdrv_get_cumulative_perm(BlockDriverState *bs, 
uint64_t *perm,
  * bdrv_child_refresh_perms() instead and make the parent's
  * .bdrv_child_perm() implementation return the correct values.
  */
-int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
-Error **errp);
+int GRAPH_RDLOCK
+bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
+Error **errp);
 
 /**
  * Calls bs->drv->bdrv_child_perm() and updates the child's permission
@@ -223,7 +224,8 @@ int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, 
uint64_t shared,
  * values than before, but which will not result in the block layer
  * automatically refreshing the permissions.
  */
-int bdrv_child_refresh_perms(BlockDriverState *bs, BdrvChild *c, Error **errp);
+int GRAPH_RDLOCK
+bdrv_child_refresh_perms(BlockDriverState *bs, BdrvChild *c, Error **errp);
 
 bool GRAPH_RDLOCK bdrv_recurse_can_replace(BlockDriverState *bs,
BlockDriverState *to_replace);
diff --git a/include/sysemu/block-backend-global-state.h 
b/include/sysemu/block-backend-global-state.h
index 184e667ebd..d5f675493a 100644
--- a/include/sysemu/block-backend-global-state.h
+++ b/include/sysemu/block-backend-global-state.h
@@ -61,8 +61,8 @@ int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, 
Error **errp);
 int blk_replace_bs(BlockBackend *blk, BlockDriverState *new_bs, Error **errp);
 bool bdrv_has_blk(BlockDriverState *bs);
 bool bdrv_is_root_node(BlockDriverState *bs);
-int blk_set_perm(BlockBackend *blk, uint64_t perm, uint64_t shared_perm,
- Error **errp);
+int GRAPH_UNLOCKED blk_set_perm(BlockBackend *blk, uint64_t perm,
+uint64_t shared_perm, Error **errp);
 void blk_get_perm(BlockBackend *blk, uint64_t *perm, uint64_t *shared_perm);
 
 void blk_iostatus_enable(BlockBackend *blk);
diff --git a/block.c b/block.c
index 369023872a..6720bc4f8a 100644
--- a/block.c
+++ b/block.c
@@ -2202,7 +2202,8 @@ static bool bdrv_a_allow_b(BdrvChild *a, BdrvChild *b, 
Error **errp)
 return false;
 }
 
-static bool bdrv_parent_perms_conflict(BlockDriverState *bs, Error **errp)
+static bool GRAPH_RDLOCK
+bdrv_parent_perms_conflict(BlockDriverState *bs, Error **errp)
 {
 BdrvChild *a, *b;
 GLOBAL_STATE_CODE();
@@ -2255,8 +2256,8 @@ static void bdrv_child_perm(BlockDriverState *bs, 
BlockDriverState *child_bs,
  * simplest way to satisfy this criteria: use only result of
  * bdrv_topological_dfs() or NULL as @list parameter.
  */
-static GSList *bdrv_topological_dfs(GSList *list, GHashTable *found,
-BlockDriverState *bs)
+static GSList * GRAPH_RDLOCK
+bdrv_topological_dfs(GSList *list, GHashTable *found, BlockDriverState *bs)
 {
 BdrvChild *child;
 g_autoptr(GHashTab

[PULL 11/28] block: Call transaction callbacks with lock held

2023-09-15 Thread Kevin Wolf
In previous patches, we changed some transactionable functions to be
marked as GRAPH_WRLOCK, but required that tran_finalize() is still
called without the lock. This was because all callbacks that can be in
the same transaction need to follow the same convention.

Now that we don't have conflicting requirements any more, we can switch
all of the transaction callbacks to be declared GRAPH_WRLOCK, too, and
call tran_finalize() with the lock held.

Document for each of these transactionable functions that the lock needs
to be held when completing the transaction, and make sure that all
callers down to the place where the transaction is finalised actually
have the writer lock.

Signed-off-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-12-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block.c | 61 +
 1 file changed, 44 insertions(+), 17 deletions(-)

diff --git a/block.c b/block.c
index f6e7cf4fb9..f06de58a3b 100644
--- a/block.c
+++ b/block.c
@@ -2375,21 +2375,21 @@ typedef struct BdrvReplaceChildState {
 BlockDriverState *old_bs;
 } BdrvReplaceChildState;
 
-static void bdrv_replace_child_commit(void *opaque)
+static void GRAPH_WRLOCK bdrv_replace_child_commit(void *opaque)
 {
 BdrvReplaceChildState *s = opaque;
 GLOBAL_STATE_CODE();
 
-bdrv_unref(s->old_bs);
+bdrv_schedule_unref(s->old_bs);
 }
 
-static void bdrv_replace_child_abort(void *opaque)
+static void GRAPH_WRLOCK bdrv_replace_child_abort(void *opaque)
 {
 BdrvReplaceChildState *s = opaque;
 BlockDriverState *new_bs = s->child->bs;
 
 GLOBAL_STATE_CODE();
-bdrv_graph_wrlock(s->old_bs);
+assert_bdrv_graph_writable();
 
 /* old_bs reference is transparently moved from @s to @s->child */
 if (!s->child->bs) {
@@ -2408,7 +2408,6 @@ static void bdrv_replace_child_abort(void *opaque)
 assert(s->child->quiesced_parent);
 bdrv_replace_child_noperm(s->child, s->old_bs);
 
-bdrv_graph_wrunlock();
 bdrv_unref(new_bs);
 }
 
@@ -2426,6 +2425,9 @@ static TransactionActionDrv bdrv_replace_child_drv = {
  * Both @child->bs and @new_bs (if non-NULL) must be drained. @new_bs must be
  * kept drained until the transaction is completed.
  *
+ * After calling this function, the transaction @tran may only be completed
+ * while holding a writer lock for the graph.
+ *
  * The function doesn't update permissions, caller is responsible for this.
  */
 static void GRAPH_WRLOCK
@@ -2951,16 +2953,15 @@ typedef struct BdrvAttachChildCommonState {
 AioContext *old_child_ctx;
 } BdrvAttachChildCommonState;
 
-static void bdrv_attach_child_common_abort(void *opaque)
+static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
 {
 BdrvAttachChildCommonState *s = opaque;
 BlockDriverState *bs = s->child->bs;
 
 GLOBAL_STATE_CODE();
+assert_bdrv_graph_writable();
 
-bdrv_graph_wrlock(NULL);
 bdrv_replace_child_noperm(s->child, NULL);
-bdrv_graph_wrunlock();
 
 if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
 bdrv_try_change_aio_context(bs, s->old_child_ctx, NULL, &error_abort);
@@ -2984,7 +2985,7 @@ static void bdrv_attach_child_common_abort(void *opaque)
 tran_commit(tran);
 }
 
-bdrv_unref(bs);
+bdrv_schedule_unref(bs);
 bdrv_child_free(s->child);
 }
 
@@ -2998,6 +2999,9 @@ static TransactionActionDrv bdrv_attach_child_common_drv 
= {
  *
  * Function doesn't update permissions, caller is responsible for this.
  *
+ * After calling this function, the transaction @tran may only be completed
+ * while holding a writer lock for the graph.
+ *
  * Returns new created child.
  *
  * The caller must hold the AioContext lock for @child_bs. Both @parent_bs and
@@ -3114,6 +3118,9 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
  * The caller must hold the AioContext lock for @child_bs. Both @parent_bs and
  * @child_bs can move to a different AioContext in this function. Callers must
  * make sure that their AioContext locking is still correct after this.
+ *
+ * After calling this function, the transaction @tran may only be completed
+ * while holding a writer lock for the graph.
  */
 static BdrvChild * GRAPH_WRLOCK
 bdrv_attach_child_noperm(BlockDriverState *parent_bs,
@@ -3180,8 +3187,8 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState 
*child_bs,
 ret = bdrv_refresh_perms(child_bs, tran, errp);
 
 out:
-bdrv_graph_wrunlock();
 tran_finalize(tran, ret);
+bdrv_graph_wrunlock();
 
 bdrv_unref(child_bs);
 
@@ -3227,8 +3234,8 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 }
 
 out:
-bdrv_graph_wrunlock();
 tran_finalize(tran, ret);
+bdrv_graph_wrunlock();
 
 bdrv_unref(child_bs);
 
@@ -3393,6 +3400,9 @@ static BdrvChildRole bdrv_backing_role(BlockDriverState 
*bs)
  * The caller must hold the AioContext lock for @child_bs. Both @parent_bs and
  * @child_bs can move to a different AioContext 

[PULL 24/28] block: remove AIOCBInfo->get_aio_context()

2023-09-15 Thread Kevin Wolf
From: Stefan Hajnoczi 

The synchronous bdrv_aio_cancel() function needs the acb's AioContext so
it can call aio_poll() to wait for cancellation.

It turns out that all users run under the BQL in the main AioContext, so
this callback is not needed.

Remove the callback, mark bdrv_aio_cancel() GLOBAL_STATE_CODE just like
its blk_aio_cancel() caller, and poll the main loop AioContext.

The purpose of this cleanup is to identify bdrv_aio_cancel() as an API
that does not work with the multi-queue block layer.

Signed-off-by: Stefan Hajnoczi 
Message-ID: <20230912231037.826804-2-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Klaus Jensen 
Signed-off-by: Kevin Wolf 
---
 include/block/aio.h|  1 -
 include/block/block-global-state.h |  2 ++
 include/block/block-io.h   |  1 -
 block/block-backend.c  | 17 -
 block/io.c | 23 ---
 hw/nvme/ctrl.c |  7 ---
 softmmu/dma-helpers.c  |  8 
 util/thread-pool.c |  8 
 8 files changed, 10 insertions(+), 57 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 32042e8905..bcc165c974 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -31,7 +31,6 @@ typedef void BlockCompletionFunc(void *opaque, int ret);
 
 typedef struct AIOCBInfo {
 void (*cancel_async)(BlockAIOCB *acb);
-AioContext *(*get_aio_context)(BlockAIOCB *acb);
 size_t aiocb_size;
 } AIOCBInfo;
 
diff --git a/include/block/block-global-state.h 
b/include/block/block-global-state.h
index f31660c7b1..6061220a6c 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -185,6 +185,8 @@ void bdrv_drain_all_begin_nopoll(void);
 void bdrv_drain_all_end(void);
 void bdrv_drain_all(void);
 
+void bdrv_aio_cancel(BlockAIOCB *acb);
+
 int bdrv_has_zero_init_1(BlockDriverState *bs);
 int bdrv_has_zero_init(BlockDriverState *bs);
 BlockDriverState *bdrv_find_node(const char *node_name);
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 6db48f2d35..f1c796a1ce 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -101,7 +101,6 @@ bdrv_co_delete_file_noerr(BlockDriverState *bs);
 
 
 /* async block I/O */
-void bdrv_aio_cancel(BlockAIOCB *acb);
 void bdrv_aio_cancel_async(BlockAIOCB *acb);
 
 /* sg packet commands */
diff --git a/block/block-backend.c b/block/block-backend.c
index c2636f4351..24b9449712 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -33,8 +33,6 @@
 
 #define NOT_DONE 0x7fff /* used while emulated sync operation in progress 
*/
 
-static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb);
-
 typedef struct BlockBackendAioNotifier {
 void (*attached_aio_context)(AioContext *new_context, void *opaque);
 void (*detach_aio_context)(void *opaque);
@@ -103,7 +101,6 @@ typedef struct BlockBackendAIOCB {
 } BlockBackendAIOCB;
 
 static const AIOCBInfo block_backend_aiocb_info = {
-.get_aio_context = blk_aiocb_get_aio_context,
 .aiocb_size = sizeof(BlockBackendAIOCB),
 };
 
@@ -1562,16 +1559,8 @@ typedef struct BlkAioEmAIOCB {
 bool has_returned;
 } BlkAioEmAIOCB;
 
-static AioContext *blk_aio_em_aiocb_get_aio_context(BlockAIOCB *acb_)
-{
-BlkAioEmAIOCB *acb = container_of(acb_, BlkAioEmAIOCB, common);
-
-return blk_get_aio_context(acb->rwco.blk);
-}
-
 static const AIOCBInfo blk_aio_em_aiocb_info = {
 .aiocb_size = sizeof(BlkAioEmAIOCB),
-.get_aio_context= blk_aio_em_aiocb_get_aio_context,
 };
 
 static void blk_aio_complete(BlkAioEmAIOCB *acb)
@@ -2451,12 +2440,6 @@ AioContext *blk_get_aio_context(BlockBackend *blk)
 return blk->ctx;
 }
 
-static AioContext *blk_aiocb_get_aio_context(BlockAIOCB *acb)
-{
-BlockBackendAIOCB *blk_acb = DO_UPCAST(BlockBackendAIOCB, common, acb);
-return blk_get_aio_context(blk_acb->blk);
-}
-
 int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
 Error **errp)
 {
diff --git a/block/io.c b/block/io.c
index ba23a9bcd3..209a6da0c8 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2950,25 +2950,18 @@ int bdrv_load_vmstate(BlockDriverState *bs, uint8_t 
*buf,
 /**/
 /* async I/Os */
 
+/**
+ * Synchronously cancels an acb. Must be called with the BQL held and the acb
+ * must be processed with the BQL held too (IOThreads are not allowed).
+ *
+ * Use bdrv_aio_cancel_async() instead when possible.
+ */
 void bdrv_aio_cancel(BlockAIOCB *acb)
 {
-IO_CODE();
+GLOBAL_STATE_CODE();
 qemu_aio_ref(acb);
 bdrv_aio_cancel_async(acb);
-while (acb->refcnt > 1) {
-if (acb->aiocb_info->get_aio_context) {
-aio_poll(acb->aiocb_info->get_aio_context(acb), true);
-} else if (acb->bs) {
-/* qemu_aio_ref and qemu_aio_unref are not thread-safe, so
- * assert 

[PULL 12/28] block: Mark bdrv_attach_child() GRAPH_WRLOCK

2023-09-15 Thread Kevin Wolf
Instead of taking the writer lock internally, require callers to already
hold it when calling bdrv_attach_child_common(). These callers will
typically already hold the graph lock once the locking work is
completed, which means that they can't call functions that take it
internally.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-13-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/block/block-global-state.h | 14 --
 block.c|  7 +++
 block/quorum.c |  2 ++
 block/replication.c|  6 ++
 tests/unit/test-bdrv-drain.c   | 14 ++
 tests/unit/test-bdrv-graph-mod.c   | 10 ++
 6 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/block/block-global-state.h 
b/include/block/block-global-state.h
index e570799f85..eb12a35439 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -226,12 +226,14 @@ void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
 void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
 void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
-BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
- BlockDriverState *child_bs,
- const char *child_name,
- const BdrvChildClass *child_class,
- BdrvChildRole child_role,
- Error **errp);
+
+BdrvChild * GRAPH_WRLOCK
+bdrv_attach_child(BlockDriverState *parent_bs,
+  BlockDriverState *child_bs,
+  const char *child_name,
+  const BdrvChildClass *child_class,
+  BdrvChildRole child_role,
+  Error **errp);
 
 bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp);
 void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason);
diff --git a/block.c b/block.c
index f06de58a3b..369023872a 100644
--- a/block.c
+++ b/block.c
@@ -3219,8 +3219,6 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 
 GLOBAL_STATE_CODE();
 
-bdrv_graph_wrlock(child_bs);
-
 child = bdrv_attach_child_noperm(parent_bs, child_bs, child_name,
  child_class, child_role, tran, errp);
 if (!child) {
@@ -3235,9 +3233,8 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 
 out:
 tran_finalize(tran, ret);
-bdrv_graph_wrunlock();
 
-bdrv_unref(child_bs);
+bdrv_schedule_unref(child_bs);
 
 return ret < 0 ? NULL : child;
 }
@@ -3758,11 +3755,13 @@ BdrvChild *bdrv_open_child(const char *filename,
 return NULL;
 }
 
+bdrv_graph_wrlock(NULL);
 ctx = bdrv_get_aio_context(bs);
 aio_context_acquire(ctx);
 child = bdrv_attach_child(parent, bs, bdref_key, child_class, child_role,
   errp);
 aio_context_release(ctx);
+bdrv_graph_wrunlock();
 
 return child;
 }
diff --git a/block/quorum.c b/block/quorum.c
index f28758cf2b..def0539fda 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1094,8 +1094,10 @@ static void quorum_add_child(BlockDriverState *bs, 
BlockDriverState *child_bs,
 /* We can safely add the child now */
 bdrv_ref(child_bs);
 
+bdrv_graph_wrlock(child_bs);
 child = bdrv_attach_child(bs, child_bs, indexstr, &child_of_bds,
   BDRV_CHILD_DATA, errp);
+bdrv_graph_wrunlock();
 if (child == NULL) {
 s->next_child_index--;
 goto out;
diff --git a/block/replication.c b/block/replication.c
index ea4bf1aa80..eec9819625 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -542,12 +542,15 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 return;
 }
 
+bdrv_graph_wrlock(bs);
+
 bdrv_ref(hidden_disk->bs);
 s->hidden_disk = bdrv_attach_child(bs, hidden_disk->bs, "hidden disk",
&child_of_bds, BDRV_CHILD_DATA,
&local_err);
 if (local_err) {
 error_propagate(errp, local_err);
+bdrv_graph_wrunlock();
 aio_context_release(aio_context);
 return;
 }
@@ -558,10 +561,13 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
   BDRV_CHILD_DATA, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
+bdrv_graph_wrunlock();
 aio_context_release(aio_context);
 return;
 }
 
+bdrv_graph_wrunlock();
+
 /* start backup job now */
 error_setg(&s->blocker,
"Block device is in use by internal backup job")

[PULL 05/28] block: Introduce bdrv_schedule_unref()

2023-09-15 Thread Kevin Wolf
bdrv_unref() is called by a lot of places that need to hold the graph
lock (it naturally happens in the context of operations that change the
graph). However, bdrv_unref() takes the graph writer lock internally, so
it can't actually be called while already holding a graph lock without
causing a deadlock.

bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
node before closing it, and draining requires that the graph is
unlocked.

The solution is to defer deleting the node until we don't hold the lock
any more and draining is possible again.

Note that keeping images open for longer than necessary can create
problems, too: You can't open an image again before it is really closed
(if image locking didn't prevent it, it would cause corruption).
Reopening an image immediately happens at least during bdrv_open() and
bdrv_co_create().

In order to solve this problem, make sure to run the deferred unref in
bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
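
For illustration, the intended calling convention looks roughly like
this (hypothetical caller, not a hunk from this patch):

    bdrv_graph_wrlock(NULL);
    /* ... graph changes that drop the last user of bs ... */
    bdrv_schedule_unref(bs);   /* bdrv_unref() would deadlock here */
    bdrv_graph_wrunlock();     /* deferred unref (and close) runs here */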

The output of iotest 051 is updated because the additional polling
changes the order of HMP output, resulting in a new "(qemu)" prompt in
the test output that was previously on a separate line and filtered out.

Signed-off-by: Kevin Wolf 
Message-ID: <20230911094620.45040-6-kw...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 include/block/block-global-state.h |  1 +
 block.c| 17 +
 block/graph-lock.c | 26 +++---
 tests/qemu-iotests/051.pc.out  |  6 +++---
 4 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/include/block/block-global-state.h 
b/include/block/block-global-state.h
index f347199bff..e570799f85 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -224,6 +224,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
 void bdrv_ref(BlockDriverState *bs);
 void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
 void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
+void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
 BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
  BlockDriverState *child_bs,
diff --git a/block.c b/block.c
index 9029ddd9ff..c8ac7cfac4 100644
--- a/block.c
+++ b/block.c
@@ -7044,6 +7044,23 @@ void bdrv_unref(BlockDriverState *bs)
 }
 }
 
+/*
+ * Release a BlockDriverState reference while holding the graph write lock.
+ *
+ * Calling bdrv_unref() directly is forbidden while holding the graph lock
+ * because bdrv_close() both involves polling and taking the graph lock
+ * internally. bdrv_schedule_unref() instead delays decreasing the refcount and
+ * possibly closing @bs until the graph lock is released.
+ */
+void bdrv_schedule_unref(BlockDriverState *bs)
+{
+if (!bs) {
+return;
+}
+aio_bh_schedule_oneshot(qemu_get_aio_context(),
+(QEMUBHFunc *) bdrv_unref, bs);
+}
+
 struct BdrvOpBlocker {
 Error *reason;
 QLIST_ENTRY(BdrvOpBlocker) list;
diff --git a/block/graph-lock.c b/block/graph-lock.c
index f357a2c0b1..58a799065f 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -163,17 +163,29 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
 void bdrv_graph_wrunlock(void)
 {
 GLOBAL_STATE_CODE();
-QEMU_LOCK_GUARD(&aio_context_list_lock);
 assert(qatomic_read(&has_writer));
 
+WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
+/*
+ * No need for memory barriers, this works in pair with
+ * the slow path of rdlock() and both take the lock.
+ */
+qatomic_store_release(&has_writer, 0);
+
+/* Wake up all coroutines that are waiting to read the graph */
+qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
+}
+
 /*
- * No need for memory barriers, this works in pair with
- * the slow path of rdlock() and both take the lock.
+ * Run any BHs that were scheduled during the wrlock section and that
+ * callers might expect to have finished (in particular, this is important
+ * for bdrv_schedule_unref()).
+ *
+ * Do this only after restarting coroutines so that nested event loops in
+ * BHs don't deadlock if their condition relies on the coroutine making
+ * progress.
  */
-qatomic_store_release(&has_writer, 0);
-
-/* Wake up all coroutine that are waiting to read the graph */
-qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
+aio_bh_poll(qemu_get_aio_context());
 }
 
 void coroutine_fn bdrv_graph_co_rdlock(void)
diff --git a/tests/qemu-iotests/051.pc.out b/tests/qemu-iotests/051.pc.out
index 4d4af5a486..7e10c5fa1b 100644
--- a/tests/qemu-iotests/051.pc.out
+++ b/tests/qemu-iotests/051.pc.out
@@ -169,11 +169,11 @@ QEMU_PROG: -device scsi

[PULL 06/28] block-coroutine-wrapper: Add no_co_wrapper_bdrv_wrlock functions

2023-09-15 Thread Kevin Wolf
Add a new wrapper type for GRAPH_WRLOCK functions that should be called
from coroutine context.
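
For illustration, a later patch in this series declares such a wrapper like
this (sketch in the spirit of block-global-state.h; the generator then emits
a wrapper that takes the graph write lock around the wrapped call):

    void coroutine_fn no_co_wrapper_bdrv_wrlock
    bdrv_co_unref_child(BlockDriverState *parent, BdrvChild *child);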

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-7-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/block/block-common.h   |  4 
 scripts/block-coroutine-wrapper.py | 11 +++
 2 files changed, 15 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index df5ffc8d09..3bbc5d9294 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -66,10 +66,14 @@
  * function. The coroutine yields after scheduling the BH and is reentered when
  * the wrapped function returns.
  *
+ * A no_co_wrapper_bdrv_wrlock function is a no_co_wrapper function that
+ * automatically takes the graph wrlock when calling the wrapped function.
+ *
  * If the first parameter of the function is a BlockDriverState, BdrvChild or
  * BlockBackend pointer, the AioContext lock for it is taken in the wrapper.
  */
 #define no_co_wrapper
+#define no_co_wrapper_bdrv_wrlock
 
 #include "block/blockjob.h"
 
diff --git a/scripts/block-coroutine-wrapper.py 
b/scripts/block-coroutine-wrapper.py
index d4a183db61..fa01c06567 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -71,10 +71,13 @@ def __init__(self, wrapper_type: str, return_type: str, 
name: str,
 self.args = [ParamDecl(arg.strip()) for arg in args.split(',')]
 self.create_only_co = 'mixed' not in variant
 self.graph_rdlock = 'bdrv_rdlock' in variant
+self.graph_wrlock = 'bdrv_wrlock' in variant
 
 self.wrapper_type = wrapper_type
 
 if wrapper_type == 'co':
+if self.graph_wrlock:
+raise ValueError(f"co function can't be wrlock: {self.name}")
 subsystem, subname = self.name.split('_', 1)
 self.target_name = f'{subsystem}_co_{subname}'
 else:
@@ -250,6 +253,12 @@ def gen_no_co_wrapper(func: FuncDecl) -> str:
 name = func.target_name
 struct_name = func.struct_name
 
+graph_lock=''
+graph_unlock=''
+if func.graph_wrlock:
+graph_lock='bdrv_graph_wrlock(NULL);'
+graph_unlock='bdrv_graph_wrunlock();'
+
 return f"""\
 /*
  * Wrappers for {name}
@@ -266,9 +275,11 @@ def gen_no_co_wrapper(func: FuncDecl) -> str:
 {struct_name} *s = opaque;
 AioContext *ctx = {func.gen_ctx('s->')};
 
+{graph_lock}
 aio_context_acquire(ctx);
 {func.get_result}{name}({ func.gen_list('s->{name}') });
 aio_context_release(ctx);
+{graph_unlock}
 
 aio_co_wake(s->co);
 }}
-- 
2.41.0




[PULL 27/28] block-backend: process zoned requests in the current AioContext

2023-09-15 Thread Kevin Wolf
From: Stefan Hajnoczi 

Process zoned requests in the current thread's AioContext instead of in
the BlockBackend's AioContext.

There is no need to use the BlockBackend's AioContext thanks to CoMutex
bs->wps->colock, which protects zone metadata.

Signed-off-by: Stefan Hajnoczi 
Message-ID: <20230912231037.826804-5-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/block-backend.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 6f0a6084f1..9dbdbe5545 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1907,11 +1907,11 @@ BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, 
int64_t offset,
 acb->has_returned = false;
 
 co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
-aio_co_enter(blk_get_aio_context(blk), co);
+aio_co_enter(qemu_get_current_aio_context(), co);
 
 acb->has_returned = true;
 if (acb->rwco.ret != NOT_DONE) {
-replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
  blk_aio_complete_bh, acb);
 }
 
@@ -1948,11 +1948,11 @@ BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, 
BlockZoneOp op,
 acb->has_returned = false;
 
 co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
-aio_co_enter(blk_get_aio_context(blk), co);
+aio_co_enter(qemu_get_current_aio_context(), co);
 
 acb->has_returned = true;
 if (acb->rwco.ret != NOT_DONE) {
-replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
  blk_aio_complete_bh, acb);
 }
 
@@ -1988,10 +1988,10 @@ BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, 
int64_t *offset,
 acb->has_returned = false;
 
 co = qemu_coroutine_create(blk_aio_zone_append_entry, acb);
-aio_co_enter(blk_get_aio_context(blk), co);
+aio_co_enter(qemu_get_current_aio_context(), co);
 acb->has_returned = true;
 if (acb->rwco.ret != NOT_DONE) {
-replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
  blk_aio_complete_bh, acb);
 }
 
-- 
2.41.0




[PULL 00/28] Block layer patches

2023-09-15 Thread Kevin Wolf
The following changes since commit 005ad32358f12fe9313a4a01918a55e60d4f39e5:

  Merge tag 'pull-tpm-2023-09-12-3' of https://github.com/stefanberger/qemu-tpm 
into staging (2023-09-13 13:41:57 -0400)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 5d96864b73225ee61b0dad7e928f0cddf14270fc:

  block-coroutine-wrapper: use qemu_get_current_aio_context() (2023-09-15 
15:49:14 +0200)


Block layer patches

- Graph locking part 4 (node management)
- qemu-img map: report compressed data blocks
- block-backend: process I/O in the current AioContext


Andrey Drobyshev via (2):
  block: add BDRV_BLOCK_COMPRESSED flag for bdrv_block_status()
  qemu-img: map: report compressed data blocks

Kevin Wolf (21):
  block: Remove unused BlockReopenQueueEntry.perms_checked
  preallocate: Factor out preallocate_truncate_to_real_size()
  preallocate: Don't poll during permission updates
  block: Take AioContext lock for bdrv_append() more consistently
  block: Introduce bdrv_schedule_unref()
  block-coroutine-wrapper: Add no_co_wrapper_bdrv_wrlock functions
  block-coroutine-wrapper: Allow arbitrary parameter names
  block: Mark bdrv_replace_child_noperm() GRAPH_WRLOCK
  block: Mark bdrv_replace_child_tran() GRAPH_WRLOCK
  block: Mark bdrv_attach_child_common() GRAPH_WRLOCK
  block: Call transaction callbacks with lock held
  block: Mark bdrv_attach_child() GRAPH_WRLOCK
  block: Mark bdrv_parent_perms_conflict() and callers GRAPH_RDLOCK
  block: Mark bdrv_get_cumulative_perm() and callers GRAPH_RDLOCK
  block: Mark bdrv_child_perm() GRAPH_RDLOCK
  block: Mark bdrv_parent_cb_change_media() GRAPH_RDLOCK
  block: Take graph rdlock in bdrv_drop_intermediate()
  block: Take graph rdlock in bdrv_change_aio_context()
  block: Mark bdrv_root_unref_child() GRAPH_WRLOCK
  block: Mark bdrv_unref_child() GRAPH_WRLOCK
  block: Mark bdrv_add/del_child() and caller GRAPH_WRLOCK

Stefan Hajnoczi (5):
  block: remove AIOCBInfo->get_aio_context()
  test-bdrv-drain: avoid race with BH in IOThread drain test
  block-backend: process I/O in the current AioContext
  block-backend: process zoned requests in the current AioContext
  block-coroutine-wrapper: use qemu_get_current_aio_context()

 qapi/block-core.json |   6 +-
 include/block/aio.h  |   1 -
 include/block/block-common.h |   7 +
 include/block/block-global-state.h   |  32 +-
 include/block/block-io.h |   1 -
 include/block/block_int-common.h |  34 +-
 include/block/block_int-global-state.h   |  14 +-
 include/sysemu/block-backend-global-state.h  |   4 +-
 block.c  | 348 +++---
 block/blklogwrites.c |   4 +
 block/blkverify.c|   2 +
 block/block-backend.c|  64 +-
 block/copy-before-write.c|  10 +-
 block/crypto.c   |   6 +-
 block/graph-lock.c   |  26 +-
 block/io.c   |  23 +-
 block/mirror.c   |   8 +
 block/preallocate.c  | 133 ++--
 block/qcow.c |   5 +-
 block/qcow2.c|   7 +-
 block/quorum.c   |  23 +-
 block/replication.c  |   9 +
 block/snapshot.c |   2 +
 block/stream.c   |  20 +-
 block/vmdk.c |  15 +
 blockdev.c   |  23 +-
 blockjob.c   |   2 +
 hw/nvme/ctrl.c   |   7 -
 qemu-img.c   |   8 +-
 softmmu/dma-helpers.c|   8 -
 tests/unit/test-bdrv-drain.c |  31 +-
 tests/unit/test-bdrv-graph-mod.c |  20 +
 tests/unit/test-block-iothread.c |   3 +
 util/thread-pool.c   |   8 -
 scripts/block-coroutine-wrapper.py   |  24 +-
 tests/qemu-iotests/051.pc.out|   6 +-
 tests/qemu-iotests/122.out   |  84 +--
 tests/qemu-iotests/146.out   | 780 +++
 tests/qemu-iotests/154.out   | 194 +++---
 tests/qemu-iotests/179.out   | 178 +++---
 tests/qemu-iotests/209.out   |   4 +-
 tests/qemu-iotests/221.out   

[PULL 18/28] block: Take graph rdlock in bdrv_change_aio_context()

2023-09-15 Thread Kevin Wolf
The function reads the parents list, so it needs to hold the graph lock.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Message-ID: <20230911094620.45040-19-kw...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/block.c b/block.c
index e024a6ccec..8e589bb2e4 100644
--- a/block.c
+++ b/block.c
@@ -7688,17 +7688,21 @@ static bool bdrv_change_aio_context(BlockDriverState 
*bs, AioContext *ctx,
 return true;
 }
 
+bdrv_graph_rdlock_main_loop();
 QLIST_FOREACH(c, &bs->parents, next_parent) {
 if (!bdrv_parent_change_aio_context(c, ctx, visited, tran, errp)) {
+bdrv_graph_rdunlock_main_loop();
 return false;
 }
 }
 
 QLIST_FOREACH(c, &bs->children, next) {
 if (!bdrv_child_change_aio_context(c, ctx, visited, tran, errp)) {
+bdrv_graph_rdunlock_main_loop();
 return false;
 }
 }
+bdrv_graph_rdunlock_main_loop();
 
 state = g_new(BdrvStateSetAioContext, 1);
 *state = (BdrvStateSetAioContext) {
-- 
2.41.0




[PULL 19/28] block: Mark bdrv_root_unref_child() GRAPH_WRLOCK

2023-09-15 Thread Kevin Wolf
Instead of taking the writer lock internally, require callers to already
hold it when calling bdrv_root_unref_child(). These callers will
typically already hold the graph lock once the locking work is
completed, which means that they can't call functions that take it
internally.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-20-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/block/block_int-global-state.h | 2 +-
 block.c| 6 +++---
 block/block-backend.c  | 3 +++
 blockjob.c | 2 ++
 4 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/block/block_int-global-state.h 
b/include/block/block_int-global-state.h
index e2304db58b..074b677838 100644
--- a/include/block/block_int-global-state.h
+++ b/include/block/block_int-global-state.h
@@ -202,7 +202,7 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState 
*child_bs,
   BdrvChildRole child_role,
   uint64_t perm, uint64_t shared_perm,
   void *opaque, Error **errp);
-void bdrv_root_unref_child(BdrvChild *child);
+void GRAPH_WRLOCK bdrv_root_unref_child(BdrvChild *child);
 
 void GRAPH_RDLOCK bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t 
*perm,
uint64_t *shared_perm);
diff --git a/block.c b/block.c
index 8e589bb2e4..9ea8333a28 100644
--- a/block.c
+++ b/block.c
@@ -3268,7 +3268,6 @@ void bdrv_root_unref_child(BdrvChild *child)
 BlockDriverState *child_bs = child->bs;
 
 GLOBAL_STATE_CODE();
-bdrv_graph_wrlock(NULL);
 bdrv_replace_child_noperm(child, NULL);
 bdrv_child_free(child);
 
@@ -3288,8 +3287,7 @@ void bdrv_root_unref_child(BdrvChild *child)
 NULL);
 }
 
-bdrv_graph_wrunlock();
-bdrv_unref(child_bs);
+bdrv_schedule_unref(child_bs);
 }
 
 typedef struct BdrvSetInheritsFrom {
@@ -3366,8 +3364,10 @@ void bdrv_unref_child(BlockDriverState *parent, 
BdrvChild *child)
 return;
 }
 
+bdrv_graph_wrlock(NULL);
 bdrv_unset_inherits_from(parent, child, NULL);
 bdrv_root_unref_child(child);
+bdrv_graph_wrunlock();
 }
 
 
diff --git a/block/block-backend.c b/block/block-backend.c
index 8d0282a5d9..c2636f4351 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -915,7 +915,10 @@ void blk_remove_bs(BlockBackend *blk)
 blk_drain(blk);
 root = blk->root;
 blk->root = NULL;
+
+bdrv_graph_wrlock(NULL);
 bdrv_root_unref_child(root);
+bdrv_graph_wrunlock();
 }
 
 /*
diff --git a/blockjob.c b/blockjob.c
index 25fe8e625d..58c5d64539 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
  * one to make sure that such a concurrent access does not attempt
  * to process an already freed BdrvChild.
  */
+bdrv_graph_wrlock(NULL);
 while (job->nodes) {
 GSList *l = job->nodes;
 BdrvChild *c = l->data;
@@ -209,6 +210,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
 
 g_slist_free_1(l);
 }
+bdrv_graph_wrunlock();
 }
 
 bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs)
-- 
2.41.0




[PULL 03/28] preallocate: Don't poll during permission updates

2023-09-15 Thread Kevin Wolf
When the permission related BlockDriver callbacks are called, we are in
the middle of an operation traversing the block graph. Polling in such a
place is a very bad idea because the graph could change in unexpected
ways. In the future, callers will also hold the graph lock, which is
likely to turn polling into a deadlock.

So we need to get rid of calls to functions like bdrv_getlength() or
bdrv_truncate() there as these functions poll internally. They are
currently used so that when no parent has write/resize permissions on
the image any more, the preallocate filter drops the extra preallocated
area in the image file and gives up write/resize permissions itself.

In order to achieve this without polling in .bdrv_check_perm, don't
immediately truncate the image, but only schedule a BH to do so. The
filter keeps the write/resize permissions a bit longer now until the BH
has executed.

There is one case in which delaying doesn't work: Reopening the image
read-only. In this case, bs->file will likely be reopened read-only,
too, so keeping write permissions a bit longer on it doesn't work. But
we can already cover this case in preallocate_reopen_prepare() and not
rely on the permission updates for it.
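
The change boils down to the usual "defer the work to a BH" pattern; a
simplified sketch of the three pieces (not a verbatim excerpt of the patch):

    /* .bdrv_open(): create the BH once */
    s->drop_resize_bh = qemu_bh_new(preallocate_drop_resize_bh, bs);

    /* Permission callback: never poll while the graph is being traversed,
     * only schedule the truncation for later. */
    qemu_bh_schedule(s->drop_resize_bh);

    /* BH handler: outside the graph traversal, truncating is safe again */
    static void preallocate_drop_resize_bh(void *opaque)
    {
        preallocate_drop_resize(opaque, NULL);
    }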

Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-4-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block/preallocate.c | 89 +++--
 1 file changed, 69 insertions(+), 20 deletions(-)

diff --git a/block/preallocate.c b/block/preallocate.c
index 3173d80534..bfb638d8b1 100644
--- a/block/preallocate.c
+++ b/block/preallocate.c
@@ -75,8 +75,14 @@ typedef struct BDRVPreallocateState {
  * be invalid (< 0) when we don't have both exclusive BLK_PERM_RESIZE and
  * BLK_PERM_WRITE permissions on file child.
  */
+
+/* Gives up the resize permission on children when parents don't need it */
+QEMUBH *drop_resize_bh;
 } BDRVPreallocateState;
 
+static int preallocate_drop_resize(BlockDriverState *bs, Error **errp);
+static void preallocate_drop_resize_bh(void *opaque);
+
 #define PREALLOCATE_OPT_PREALLOC_ALIGN "prealloc-align"
 #define PREALLOCATE_OPT_PREALLOC_SIZE "prealloc-size"
 static QemuOptsList runtime_opts = {
@@ -142,6 +148,7 @@ static int preallocate_open(BlockDriverState *bs, QDict 
*options, int flags,
  * For this to work, mark them invalid.
  */
 s->file_end = s->zero_start = s->data_end = -EINVAL;
+s->drop_resize_bh = qemu_bh_new(preallocate_drop_resize_bh, bs);
 
 ret = bdrv_open_file_child(NULL, options, "file", bs, errp);
 if (ret < 0) {
@@ -193,6 +200,9 @@ static void preallocate_close(BlockDriverState *bs)
 {
 BDRVPreallocateState *s = bs->opaque;
 
+qemu_bh_cancel(s->drop_resize_bh);
+qemu_bh_delete(s->drop_resize_bh);
+
 if (s->data_end >= 0) {
 preallocate_truncate_to_real_size(bs, NULL);
 }
@@ -211,6 +221,7 @@ static int preallocate_reopen_prepare(BDRVReopenState 
*reopen_state,
   BlockReopenQueue *queue, Error **errp)
 {
 PreallocateOpts *opts = g_new0(PreallocateOpts, 1);
+int ret;
 
 if (!preallocate_absorb_opts(opts, reopen_state->options,
  reopen_state->bs->file->bs, errp)) {
@@ -218,6 +229,19 @@ static int preallocate_reopen_prepare(BDRVReopenState 
*reopen_state,
 return -EINVAL;
 }
 
+/*
+ * Drop the preallocation already here if reopening read-only. The child
+ * might also be reopened read-only and then scheduling a BH during the
+ * permission update is too late.
+ */
+if ((reopen_state->flags & BDRV_O_RDWR) == 0) {
+ret = preallocate_drop_resize(reopen_state->bs, errp);
+if (ret < 0) {
+g_free(opts);
+return ret;
+}
+}
+
 reopen_state->opaque = opts;
 
 return 0;
@@ -475,41 +499,61 @@ preallocate_co_getlength(BlockDriverState *bs)
 return ret;
 }
 
-static int preallocate_check_perm(BlockDriverState *bs,
-  uint64_t perm, uint64_t shared, Error **errp)
+static int preallocate_drop_resize(BlockDriverState *bs, Error **errp)
 {
 BDRVPreallocateState *s = bs->opaque;
+int ret;
 
-if (s->data_end >= 0 && !can_write_resize(perm)) {
-/*
- * Lose permissions.
- * We should truncate in check_perm, as in set_perm bs->file->perm will
- * be already changed, and we should not violate it.
- */
-return preallocate_truncate_to_real_size(bs, errp);
+if (s->data_end < 0) {
+return 0;
+}
+
+/*
+ * Before switching children to be read-only, truncate them to remove
+ * the preallocation and let them have the real size.
+ */
+ret = preallocate_truncate_to_real_size(bs, errp);
+if (ret < 0) {
+return ret;
 }
 
+/*
+ * We'll drop our permissions and will allow other users to take write and
+ * resize p

[PULL 28/28] block-coroutine-wrapper: use qemu_get_current_aio_context()

2023-09-15 Thread Kevin Wolf
From: Stefan Hajnoczi 

Use qemu_get_current_aio_context() in mixed wrappers and coroutine
wrappers so that code runs in the caller's AioContext instead of moving
to the BlockDriverState's AioContext. This change is necessary for the
multi-queue block layer where any thread can call into the block layer.

Most wrappers are IO_CODE where it's safe to use the current AioContext
nowadays. BlockDrivers and the core block layer use their own locks and
no longer depend on the AioContext lock for thread-safety.

The bdrv_create() wrapper invokes GLOBAL_STATE code. Using the current
AioContext is safe because this code is only called with the BQL held
from the main loop thread.

The output of qemu-iotests 051 is sensitive to event loop activity.
Update the output because the monitor BH runs at a different time,
causing prompts to be printed differently in the output.

Signed-off-by: Stefan Hajnoczi 
Message-ID: <20230912231037.826804-6-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 scripts/block-coroutine-wrapper.py | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/scripts/block-coroutine-wrapper.py 
b/scripts/block-coroutine-wrapper.py
index 685d0b4ed4..66cda6b8db 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -91,8 +91,6 @@ def __init__(self, wrapper_type: str, return_type: str, name: 
str,
 raise ValueError(f"no_co function can't be rdlock: 
{self.name}")
 self.target_name = f'{subsystem}_{subname}'
 
-self.ctx = self.gen_ctx()
-
 self.get_result = 's->ret = '
 self.ret = 'return s.ret;'
 self.co_ret = 'return '
@@ -166,7 +164,7 @@ def create_mixed_wrapper(func: FuncDecl) -> str:
 {func.co_ret}{name}({ func.gen_list('{name}') });
 }} else {{
 {struct_name} s = {{
-.poll_state.ctx = {func.ctx},
+.poll_state.ctx = qemu_get_current_aio_context(),
 .poll_state.in_progress = true,
 
 { func.gen_block('.{name} = {name},') }
@@ -190,7 +188,7 @@ def create_co_wrapper(func: FuncDecl) -> str:
 {func.return_type} {func.name}({ func.gen_list('{decl}') })
 {{
 {struct_name} s = {{
-.poll_state.ctx = {func.ctx},
+.poll_state.ctx = qemu_get_current_aio_context(),
 .poll_state.in_progress = true,
 
 { func.gen_block('.{name} = {name},') }
-- 
2.41.0




[PULL 17/28] block: Take graph rdlock in bdrv_drop_intermediate()

2023-09-15 Thread Kevin Wolf
The function reads the parents list, so it needs to hold the graph lock.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Message-ID: <20230911094620.45040-18-kw...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block.c b/block.c
index 06d2a1b256..e024a6ccec 100644
--- a/block.c
+++ b/block.c
@@ -5938,9 +5938,11 @@ int bdrv_drop_intermediate(BlockDriverState *top, 
BlockDriverState *base,
 backing_file_str = base->filename;
 }
 
+bdrv_graph_rdlock_main_loop();
 QLIST_FOREACH(c, &top->parents, next_parent) {
 updated_children = g_slist_prepend(updated_children, c);
 }
+bdrv_graph_rdunlock_main_loop();
 
 /*
  * It seems correct to pass detach_subchain=true here, but it triggers
-- 
2.41.0




[PULL 01/28] block: Remove unused BlockReopenQueueEntry.perms_checked

2023-09-15 Thread Kevin Wolf
This field has been unused since commit 72373e40fbc ('block:
bdrv_reopen_multiple: refresh permissions on updated graph').
Remove it.

Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-2-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block.c b/block.c
index 8da89aaa62..9029ddd9ff 100644
--- a/block.c
+++ b/block.c
@@ -2115,7 +2115,6 @@ static int bdrv_fill_options(QDict **options, const char 
*filename,
 
 typedef struct BlockReopenQueueEntry {
  bool prepared;
- bool perms_checked;
  BDRVReopenState state;
  QTAILQ_ENTRY(BlockReopenQueueEntry) entry;
 } BlockReopenQueueEntry;
-- 
2.41.0




Re: [PATCH 1/3] hw/pci: Add all Data Object Types

2023-09-15 Thread Jonathan Cameron via
On Fri, 15 Sep 2023 21:27:21 +1000
Alistair Francis  wrote:

> Add all of the defined protocols/features from the PCIe-SIG
> "Table 6-32 PCI-SIG defined Data Object Types (Vendor ID = 0001h)"

Which version of the specification?  These references can rot.
Obviously it's below, but who knows if anyone will look there ;)
It's already changed in 6.1 and the table has more entries.

I'd just change this to say: "Add all Data Object Types defined in PCIe r6.0".


> table.
> 
> Signed-off-by: Alistair Francis 
> ---
>  include/hw/pci/pcie_doe.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/hw/pci/pcie_doe.h b/include/hw/pci/pcie_doe.h
> index 87dc17dcef..15d94661f9 100644
> --- a/include/hw/pci/pcie_doe.h
> +++ b/include/hw/pci/pcie_doe.h
> @@ -46,6 +46,8 @@ REG32(PCI_DOE_CAP_STATUS, 0)
>  
>  /* PCI-SIG defined Data Object Types - r6.0 Table 6-32 */
>  #define PCI_SIG_DOE_DISCOVERY   0x00
> +#define PCI_SIG_DOE_CMA 0x01
> +#define PCI_SIG_DOE_SECURED_CMA 0x02
>  
>  #define PCI_DOE_DW_SIZE_MAX (1 << 18)
>  #define PCI_DOE_PROTOCOL_NUM_MAX256




[PULL 15/28] block: Mark bdrv_child_perm() GRAPH_RDLOCK

2023-09-15 Thread Kevin Wolf
This adds GRAPH_RDLOCK annotations to declare that callers of
bdrv_child_perm() need to hold a reader lock for the graph because
some implementations access the children list of a node.

The callers of bdrv_child_perm() conveniently already hold the lock.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-16-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/block/block_int-common.h | 10 +-
 block.c  | 11 ++-
 block/copy-before-write.c| 10 +-
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index f82c14fb9c..3feb67ec4a 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -451,11 +451,11 @@ struct BlockDriver {
  * permissions, but those that will be needed after applying the
  * @reopen_queue.
  */
- void (*bdrv_child_perm)(BlockDriverState *bs, BdrvChild *c,
- BdrvChildRole role,
- BlockReopenQueue *reopen_queue,
- uint64_t parent_perm, uint64_t parent_shared,
- uint64_t *nperm, uint64_t *nshared);
+ void GRAPH_RDLOCK_PTR (*bdrv_child_perm)(
+BlockDriverState *bs, BdrvChild *c, BdrvChildRole role,
+BlockReopenQueue *reopen_queue,
+uint64_t parent_perm, uint64_t parent_shared,
+uint64_t *nperm, uint64_t *nshared);
 
 /**
  * Register/unregister a buffer for I/O. For example, when the driver is
diff --git a/block.c b/block.c
index 186efda70f..0f7f78f8de 100644
--- a/block.c
+++ b/block.c
@@ -2228,11 +2228,12 @@ bdrv_parent_perms_conflict(BlockDriverState *bs, Error 
**errp)
 return false;
 }
 
-static void bdrv_child_perm(BlockDriverState *bs, BlockDriverState *child_bs,
-BdrvChild *c, BdrvChildRole role,
-BlockReopenQueue *reopen_queue,
-uint64_t parent_perm, uint64_t parent_shared,
-uint64_t *nperm, uint64_t *nshared)
+static void GRAPH_RDLOCK
+bdrv_child_perm(BlockDriverState *bs, BlockDriverState *child_bs,
+BdrvChild *c, BdrvChildRole role,
+BlockReopenQueue *reopen_queue,
+uint64_t parent_perm, uint64_t parent_shared,
+uint64_t *nperm, uint64_t *nshared)
 {
 assert(bs->drv && bs->drv->bdrv_child_perm);
 GLOBAL_STATE_CODE();
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 9a0e2b69d9..aeaff3bb82 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -341,11 +341,11 @@ static void cbw_refresh_filename(BlockDriverState *bs)
 bs->file->bs->filename);
 }
 
-static void cbw_child_perm(BlockDriverState *bs, BdrvChild *c,
-   BdrvChildRole role,
-   BlockReopenQueue *reopen_queue,
-   uint64_t perm, uint64_t shared,
-   uint64_t *nperm, uint64_t *nshared)
+static void GRAPH_RDLOCK
+cbw_child_perm(BlockDriverState *bs, BdrvChild *c, BdrvChildRole role,
+   BlockReopenQueue *reopen_queue,
+   uint64_t perm, uint64_t shared,
+   uint64_t *nperm, uint64_t *nshared)
 {
 if (!(role & BDRV_CHILD_FILTERED)) {
 /*
-- 
2.41.0




[PULL 07/28] block-coroutine-wrapper: Allow arbitrary parameter names

2023-09-15 Thread Kevin Wolf
Don't assume specific parameter names like 'bs' or 'blk' in the
generated code, but use the actual name.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-8-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 scripts/block-coroutine-wrapper.py | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/scripts/block-coroutine-wrapper.py 
b/scripts/block-coroutine-wrapper.py
index fa01c06567..685d0b4ed4 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -105,12 +105,13 @@ def __init__(self, wrapper_type: str, return_type: str, 
name: str,
 
 def gen_ctx(self, prefix: str = '') -> str:
 t = self.args[0].type
+name = self.args[0].name
 if t == 'BlockDriverState *':
-return f'bdrv_get_aio_context({prefix}bs)'
+return f'bdrv_get_aio_context({prefix}{name})'
 elif t == 'BdrvChild *':
-return f'bdrv_get_aio_context({prefix}child->bs)'
+return f'bdrv_get_aio_context({prefix}{name}->bs)'
 elif t == 'BlockBackend *':
-return f'blk_get_aio_context({prefix}blk)'
+return f'blk_get_aio_context({prefix}{name})'
 else:
 return 'qemu_get_aio_context()'
 
-- 
2.41.0




[PULL 16/28] block: Mark bdrv_parent_cb_change_media() GRAPH_RDLOCK

2023-09-15 Thread Kevin Wolf
The function reads the parents list, so it needs to hold the graph lock.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-17-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 0f7f78f8de..06d2a1b256 100644
--- a/block.c
+++ b/block.c
@@ -3371,7 +3371,8 @@ void bdrv_unref_child(BlockDriverState *parent, BdrvChild 
*child)
 }
 
 
-static void bdrv_parent_cb_change_media(BlockDriverState *bs, bool load)
+static void GRAPH_RDLOCK
+bdrv_parent_cb_change_media(BlockDriverState *bs, bool load)
 {
 BdrvChild *c;
 GLOBAL_STATE_CODE();
@@ -3969,6 +3970,9 @@ bdrv_open_inherit(const char *filename, const char 
*reference, QDict *options,
 GLOBAL_STATE_CODE();
 assert(!qemu_in_coroutine());
 
+/* TODO We'll eventually have to take a writer lock in this function */
+GRAPH_RDLOCK_GUARD_MAINLOOP();
+
 if (reference) {
 bool options_non_empty = options ? qdict_size(options) : false;
 qobject_unref(options);
-- 
2.41.0




[PULL 21/28] block: Mark bdrv_add/del_child() and caller GRAPH_WRLOCK

2023-09-15 Thread Kevin Wolf
The functions read the parents list in the generic block layer, so we
need to hold the graph lock already there. The BlockDriver
implementations actually modify the graph, so it has to be a writer
lock.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-22-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/block/block-global-state.h |  8 +---
 include/block/block_int-common.h   |  9 +
 block/quorum.c | 23 ++-
 blockdev.c | 17 +++--
 4 files changed, 27 insertions(+), 30 deletions(-)

diff --git a/include/block/block-global-state.h 
b/include/block/block-global-state.h
index 0f6df8f1a2..f31660c7b1 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -276,9 +276,11 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, 
AioContext *ctx,
 int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz);
 int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo);
 
-void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child,
-Error **errp);
-void bdrv_del_child(BlockDriverState *parent, BdrvChild *child, Error **errp);
+void GRAPH_WRLOCK
+bdrv_add_child(BlockDriverState *parent, BlockDriverState *child, Error 
**errp);
+
+void GRAPH_WRLOCK
+bdrv_del_child(BlockDriverState *parent, BdrvChild *child, Error **errp);
 
 /**
  *
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 3feb67ec4a..2ca3758cb8 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -393,10 +393,11 @@ struct BlockDriver {
  */
 int (*bdrv_probe_geometry)(BlockDriverState *bs, HDGeometry *geo);
 
-void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
-   Error **errp);
-void (*bdrv_del_child)(BlockDriverState *parent, BdrvChild *child,
-   Error **errp);
+void GRAPH_WRLOCK_PTR (*bdrv_add_child)(
+BlockDriverState *parent, BlockDriverState *child, Error **errp);
+
+void GRAPH_WRLOCK_PTR (*bdrv_del_child)(
+BlockDriverState *parent, BdrvChild *child, Error **errp);
 
 /**
  * Informs the block driver that a permission change is intended. The
diff --git a/block/quorum.c b/block/quorum.c
index 620a50ba2c..05220cab7f 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1066,8 +1066,8 @@ static void quorum_close(BlockDriverState *bs)
 g_free(s->children);
 }
 
-static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
- Error **errp)
+static void GRAPH_WRLOCK
+quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs, Error 
**errp)
 {
 BDRVQuorumState *s = bs->opaque;
 BdrvChild *child;
@@ -1093,29 +1093,22 @@ static void quorum_add_child(BlockDriverState *bs, 
BlockDriverState *child_bs,
 }
 s->next_child_index++;
 
-bdrv_drained_begin(bs);
-
 /* We can safely add the child now */
 bdrv_ref(child_bs);
 
-bdrv_graph_wrlock(child_bs);
 child = bdrv_attach_child(bs, child_bs, indexstr, &child_of_bds,
   BDRV_CHILD_DATA, errp);
-bdrv_graph_wrunlock();
 if (child == NULL) {
 s->next_child_index--;
-goto out;
+return;
 }
 s->children = g_renew(BdrvChild *, s->children, s->num_children + 1);
 s->children[s->num_children++] = child;
 quorum_refresh_flags(bs);
-
-out:
-bdrv_drained_end(bs);
 }
 
-static void quorum_del_child(BlockDriverState *bs, BdrvChild *child,
- Error **errp)
+static void GRAPH_WRLOCK
+quorum_del_child(BlockDriverState *bs, BdrvChild *child, Error **errp)
 {
 BDRVQuorumState *s = bs->opaque;
 char indexstr[INDEXSTR_LEN];
@@ -1145,18 +1138,14 @@ static void quorum_del_child(BlockDriverState *bs, 
BdrvChild *child,
 s->next_child_index--;
 }
 
-bdrv_drained_begin(bs);
-
 /* We can safely remove this child now */
 memmove(&s->children[i], &s->children[i + 1],
 (s->num_children - i - 1) * sizeof(BdrvChild *));
 s->children = g_renew(BdrvChild *, s->children, --s->num_children);
-bdrv_graph_wrlock(NULL);
+
 bdrv_unref_child(bs, child);
-bdrv_graph_wrunlock();
 
 quorum_refresh_flags(bs);
-bdrv_drained_end(bs);
 }
 
 static void quorum_gather_child_options(BlockDriverState *bs, QDict *target,
diff --git a/blockdev.c b/blockdev.c
index 372eaf198c..325b7a3bef 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3545,8 +3545,8 @@ out:
 aio_context_release(aio_context);
 }
 
-static BdrvChild *bdrv_find_child(BlockDriverState *parent_bs,
-  const char *child_name)
+static BdrvChild * GRAPH_RDLOCK
+bdrv_find_child(BlockDriverState *parent_bs, const char *child_name)
 {
 BdrvChild *child;
 
@@ -3565,9 +3565,11 @@ void

[PULL 26/28] block-backend: process I/O in the current AioContext

2023-09-15 Thread Kevin Wolf
From: Stefan Hajnoczi 

Switch blk_aio_*() APIs over to multi-queue by using
qemu_get_current_aio_context() instead of blk_get_aio_context(). This
change will allow devices to process I/O in multiple IOThreads in the
future.

I audited existing blk_aio_*() callers:
- migration/block.c: blk_mig_lock() protects the data accessed by the
  completion callback.
- The remaining emulated devices and exports run with
  qemu_get_aio_context() == blk_get_aio_context().

Signed-off-by: Stefan Hajnoczi 
Message-ID: <20230912231037.826804-4-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/block-backend.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 24b9449712..6f0a6084f1 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1547,7 +1547,7 @@ BlockAIOCB *blk_abort_aio_request(BlockBackend *blk,
 acb->blk = blk;
 acb->ret = ret;
 
-replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
  error_callback_bh, acb);
 return &acb->common;
 }
@@ -1601,11 +1601,11 @@ static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, 
int64_t offset,
 acb->has_returned = false;
 
 co = qemu_coroutine_create(co_entry, acb);
-aio_co_enter(blk_get_aio_context(blk), co);
+aio_co_enter(qemu_get_current_aio_context(), co);
 
 acb->has_returned = true;
 if (acb->rwco.ret != NOT_DONE) {
-replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
  blk_aio_complete_bh, acb);
 }
 
-- 
2.41.0




[PULL 22/28] block: add BDRV_BLOCK_COMPRESSED flag for bdrv_block_status()

2023-09-15 Thread Kevin Wolf
From: Andrey Drobyshev via 

Functions qcow2_get_host_offset(), get_cluster_offset(),
vmdk_co_block_status() explicitly report compressed cluster types when data
is compressed.  However, this information is never passed further.  Let's
make use of it by adding new BDRV_BLOCK_COMPRESSED flag for
bdrv_block_status(), so that caller may know that the data range is
compressed.  In particular, we're going to use this flag to tweak
"qemu-img map" output.

This new flag is only being utilized by qcow, qcow2 and vmdk formats, as only
those support compression.
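
Callers can test the new bit like any other block status flag, e.g.
(illustrative sketch, not part of the patch):

    int64_t pnum, map;
    BlockDriverState *file;
    int ret = bdrv_block_status(bs, offset, bytes, &pnum, &map, &file);
    if (ret >= 0 && (ret & BDRV_BLOCK_COMPRESSED)) {
        /* the first pnum bytes at offset are stored compressed */
    }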

Reviewed-by: Denis V. Lunev 
Reviewed-by: Hanna Czenczek 
Signed-off-by: Andrey Drobyshev 
Message-ID: <20230907210226.953821-2-andrey.drobys...@virtuozzo.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 include/block/block-common.h | 3 +++
 block/qcow.c | 5 -
 block/qcow2.c| 3 +++
 block/vmdk.c | 2 ++
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index 3bbc5d9294..2d2af7230d 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -291,6 +291,8 @@ typedef enum {
  *   layer rather than any backing, set by block layer
  * BDRV_BLOCK_EOF: the returned pnum covers through end of file for this
  * layer, set by block layer
+ * BDRV_BLOCK_COMPRESSED: the underlying data is compressed; only valid for
+ *the formats supporting compression: qcow, qcow2
  *
  * Internal flags:
  * BDRV_BLOCK_RAW: for use by passthrough drivers, such as raw, to request
@@ -326,6 +328,7 @@ typedef enum {
 #define BDRV_BLOCK_ALLOCATED0x10
 #define BDRV_BLOCK_EOF  0x20
 #define BDRV_BLOCK_RECURSE  0x40
+#define BDRV_BLOCK_COMPRESSED   0x80
 
 typedef QTAILQ_HEAD(BlockReopenQueue, BlockReopenQueueEntry) BlockReopenQueue;
 
diff --git a/block/qcow.c b/block/qcow.c
index 577bd70324..d56d24ab6d 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -549,7 +549,10 @@ qcow_co_block_status(BlockDriverState *bs, bool want_zero,
 if (!cluster_offset) {
 return 0;
 }
-if ((cluster_offset & QCOW_OFLAG_COMPRESSED) || s->crypto) {
+if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
+return BDRV_BLOCK_DATA | BDRV_BLOCK_COMPRESSED;
+}
+if (s->crypto) {
 return BDRV_BLOCK_DATA;
 }
 *map = cluster_offset | index_in_cluster;
diff --git a/block/qcow2.c b/block/qcow2.c
index 071004b302..af43d59d76 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2162,6 +2162,9 @@ qcow2_co_block_status(BlockDriverState *bs, bool 
want_zero, int64_t offset,
 {
 status |= BDRV_BLOCK_RECURSE;
 }
+if (type == QCOW2_SUBCLUSTER_COMPRESSED) {
+status |= BDRV_BLOCK_COMPRESSED;
+}
 return status;
 }
 
diff --git a/block/vmdk.c b/block/vmdk.c
index 78baa04c0c..e90649c8bf 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -1783,6 +1783,8 @@ vmdk_co_block_status(BlockDriverState *bs, bool want_zero,
 if (extent->flat) {
 ret |= BDRV_BLOCK_RECURSE;
 }
+} else {
+ret |= BDRV_BLOCK_COMPRESSED;
 }
 *file = extent->file->bs;
 break;
-- 
2.41.0




[PULL 20/28] block: Mark bdrv_unref_child() GRAPH_WRLOCK

2023-09-15 Thread Kevin Wolf
Instead of taking the writer lock internally, require callers to already
hold it when calling bdrv_unref_child(). These callers will typically
already hold the graph lock once the locking work is completed, which
means that they can't call functions that take it internally.

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-21-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/block/block-global-state.h |  7 ++-
 block.c| 11 +++
 block/blklogwrites.c   |  4 
 block/blkverify.c  |  2 ++
 block/qcow2.c  |  4 +++-
 block/quorum.c |  6 ++
 block/replication.c|  3 +++
 block/snapshot.c   |  2 ++
 block/vmdk.c   | 11 +++
 tests/unit/test-bdrv-drain.c   |  8 ++--
 10 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/include/block/block-global-state.h 
b/include/block/block-global-state.h
index eb12a35439..0f6df8f1a2 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -225,7 +225,12 @@ void bdrv_ref(BlockDriverState *bs);
 void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
 void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
 void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
-void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
+
+void GRAPH_WRLOCK
+bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
+
+void coroutine_fn no_co_wrapper_bdrv_wrlock
+bdrv_co_unref_child(BlockDriverState *parent, BdrvChild *child);
 
 BdrvChild * GRAPH_WRLOCK
 bdrv_attach_child(BlockDriverState *parent_bs,
diff --git a/block.c b/block.c
index 9ea8333a28..e7f349b25c 100644
--- a/block.c
+++ b/block.c
@@ -1701,7 +1701,9 @@ bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, 
const char *node_name,
 open_failed:
 bs->drv = NULL;
 if (bs->file != NULL) {
+bdrv_graph_wrlock(NULL);
 bdrv_unref_child(bs, bs->file);
+bdrv_graph_wrunlock();
 assert(!bs->file);
 }
 g_free(bs->opaque);
@@ -3331,8 +,9 @@ static void bdrv_set_inherits_from(BlockDriverState *bs,
  * @root that point to @root, where necessary.
  * @tran is allowed to be NULL. In this case no rollback is possible
  */
-static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child,
- Transaction *tran)
+static void GRAPH_WRLOCK
+bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child,
+ Transaction *tran)
 {
 BdrvChild *c;
 
@@ -3364,10 +3367,8 @@ void bdrv_unref_child(BlockDriverState *parent, 
BdrvChild *child)
 return;
 }
 
-bdrv_graph_wrlock(NULL);
 bdrv_unset_inherits_from(parent, child, NULL);
 bdrv_root_unref_child(child);
-bdrv_graph_wrunlock();
 }
 
 
@@ -5164,9 +5165,11 @@ static void bdrv_close(BlockDriverState *bs)
 bs->drv = NULL;
 }
 
+bdrv_graph_wrlock(NULL);
 QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
 bdrv_unref_child(bs, child);
 }
+bdrv_graph_wrunlock();
 
 assert(!bs->backing);
 assert(!bs->file);
diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index 3ea7141cb5..a0d70729bb 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -251,7 +251,9 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict 
*options, int flags,
 ret = 0;
 fail_log:
 if (ret < 0) {
+bdrv_graph_wrlock(NULL);
 bdrv_unref_child(bs, s->log_file);
+bdrv_graph_wrunlock();
 s->log_file = NULL;
 }
 fail:
@@ -263,8 +265,10 @@ static void blk_log_writes_close(BlockDriverState *bs)
 {
 BDRVBlkLogWritesState *s = bs->opaque;
 
+bdrv_graph_wrlock(NULL);
 bdrv_unref_child(bs, s->log_file);
 s->log_file = NULL;
+bdrv_graph_wrunlock();
 }
 
 static int64_t coroutine_fn GRAPH_RDLOCK
diff --git a/block/blkverify.c b/block/blkverify.c
index 7326461f30..dae9716a26 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -151,8 +151,10 @@ static void blkverify_close(BlockDriverState *bs)
 {
 BDRVBlkverifyState *s = bs->opaque;
 
+bdrv_graph_wrlock(NULL);
 bdrv_unref_child(bs, s->test_file);
 s->test_file = NULL;
+bdrv_graph_wrunlock();
 }
 
 static int64_t coroutine_fn GRAPH_RDLOCK
diff --git a/block/qcow2.c b/block/qcow2.c
index b48cd9ce63..071004b302 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1880,7 +1880,7 @@ qcow2_do_open(BlockDriverState *bs, QDict *options, int 
flags,
 g_free(s->image_data_file);
 if (open_data_file && has_data_file(bs)) {
 bdrv_graph_co_rdunlock();
-bdrv_unref_child(bs, s->data_file);
+bdrv_co_unref_child(bs, s->data_file);
 bdrv_graph_co_rdlock();
 s->data_file = NULL;
 }
@@ -2790,7 +2790,9 @@ static void qcow2_do_close(Bl

[PULL 25/28] test-bdrv-drain: avoid race with BH in IOThread drain test

2023-09-15 Thread Kevin Wolf
From: Stefan Hajnoczi 

This patch fixes a race condition in test-bdrv-drain that is difficult
to reproduce. test-bdrv-drain sometimes fails without an error message
on the block pull request sent by Kevin Wolf on Sep 4, 2023. I was able
to reproduce it locally and found that "block-backend: process I/O in
the current AioContext" (in this patch series) is the first commit where
it reproduces.

I do not know why "block-backend: process I/O in the current AioContext"
exposes this bug. It might be related to the fact that the test's preadv
request runs in the main thread instead of IOThread a after my commit.
That might simply change the timing of the test.

Now on to the race condition in test-bdrv-drain. The main thread
schedules a BH in IOThread a and then drains the BDS:

  aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data);

  /* The request is running on the IOThread a. Draining its block device
   * will make sure that it has completed as far as the BDS is concerned,
   * but the drain in this thread can continue immediately after
   * bdrv_dec_in_flight() and aio_ret might be assigned only slightly
   * later. */
  do_drain_begin(drain_type, bs);

If the BH completes before do_drain_begin() then there is nothing to
worry about.

If the BH invokes bdrv_flush() before do_drain_begin(), then
do_drain_begin() waits for it to complete.

The problematic case is when do_drain_begin() runs before the BH enters
bdrv_flush(). Then do_drain_begin() misses the BH and the drain
mechanism has failed in quiescing I/O.

Fix this by incrementing the in_flight counter so that do_drain_begin()
waits for test_iothread_main_thread_bh().

Signed-off-by: Stefan Hajnoczi 
Message-ID: <20230912231037.826804-3-stefa...@redhat.com>
Reviewed-by: Eric Blake 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/unit/test-bdrv-drain.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index b040a73bb9..0b603e7c57 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -512,6 +512,7 @@ static void test_iothread_main_thread_bh(void *opaque)
  * executed during drain, otherwise this would deadlock. */
 aio_context_acquire(bdrv_get_aio_context(data->bs));
 bdrv_flush(data->bs);
+bdrv_dec_in_flight(data->bs); /* incremented by test_iothread_common() */
 aio_context_release(bdrv_get_aio_context(data->bs));
 }
 
@@ -583,6 +584,13 @@ static void test_iothread_common(enum drain_type 
drain_type, int drain_thread)
 aio_context_acquire(ctx_a);
 }
 
+/*
+ * Increment in_flight so that do_drain_begin() waits for
+ * test_iothread_main_thread_bh(). This prevents the race between
+ * test_iothread_main_thread_bh() in IOThread a and do_drain_begin() in
+ * this thread. test_iothread_main_thread_bh() decrements in_flight.
+ */
+bdrv_inc_in_flight(bs);
 aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data);
 
 /* The request is running on the IOThread a. Draining its block device
-- 
2.41.0




[PULL 09/28] block: Mark bdrv_replace_child_tran() GRAPH_WRLOCK

2023-09-15 Thread Kevin Wolf
Instead of taking the writer lock internally, require callers to already
hold it when calling bdrv_replace_child_tran(). These callers will
typically already hold the graph lock once the locking work is
completed, which means that they can't call functions that take it
internally.

While a graph lock is held, polling is not allowed. Therefore draining
the necessary nodes can no longer be done in bdrv_remove_child() and
bdrv_replace_node_noperm(), but the callers must already make sure that
they are drained.

Note that the transaction callbacks still take the lock internally, so
tran_finalize() must be called without the lock held. This is because
bdrv_append() also calls bdrv_attach_child_noperm(), which currently
requires to be called unlocked. Once it changes, the transaction
callbacks can be changed, too.
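
The resulting caller pattern (sketch, mirroring the
bdrv_set_file_or_backing_noperm() hunk below):

    bdrv_drained_begin(child->bs);
    bdrv_graph_wrlock(NULL);

    bdrv_remove_child(child, tran);   /* ends up in bdrv_replace_child_tran() */

    bdrv_graph_wrunlock();
    bdrv_drained_end(child->bs);

    /* tran_finalize() is still called without the graph lock held */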

Signed-off-by: Kevin Wolf 
Reviewed-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-10-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block.c | 78 -
 1 file changed, 50 insertions(+), 28 deletions(-)

diff --git a/block.c b/block.c
index 61856f5c33..0973b91d98 100644
--- a/block.c
+++ b/block.c
@@ -94,7 +94,8 @@ static bool bdrv_recurse_has_child(BlockDriverState *bs,
 static void GRAPH_WRLOCK
 bdrv_replace_child_noperm(BdrvChild *child, BlockDriverState *new_bs);
 
-static void bdrv_remove_child(BdrvChild *child, Transaction *tran);
+static void GRAPH_WRLOCK
+bdrv_remove_child(BdrvChild *child, Transaction *tran);
 
 static int bdrv_reopen_prepare(BDRVReopenState *reopen_state,
BlockReopenQueue *queue,
@@ -2427,8 +2428,9 @@ static TransactionActionDrv bdrv_replace_child_drv = {
  *
  * The function doesn't update permissions, caller is responsible for this.
  */
-static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
-Transaction *tran)
+static void GRAPH_WRLOCK
+bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
+Transaction *tran)
 {
 BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
 
@@ -2445,9 +2447,7 @@ static void bdrv_replace_child_tran(BdrvChild *child, 
BlockDriverState *new_bs,
 bdrv_ref(new_bs);
 }
 
-bdrv_graph_wrlock(new_bs);
 bdrv_replace_child_noperm(child, new_bs);
-bdrv_graph_wrunlock();
 /* old_bs reference is transparently moved from @child to @s */
 }
 
@@ -3439,8 +3439,14 @@ static int 
bdrv_set_file_or_backing_noperm(BlockDriverState *parent_bs,
 }
 
 if (child) {
+bdrv_drained_begin(child->bs);
+bdrv_graph_wrlock(NULL);
+
 bdrv_unset_inherits_from(parent_bs, child, tran);
 bdrv_remove_child(child, tran);
+
+bdrv_graph_wrunlock();
+bdrv_drained_end(child->bs);
 }
 
 if (!child_bs) {
@@ -5133,7 +5139,7 @@ void bdrv_close_all(void)
 assert(QTAILQ_EMPTY(&all_bdrv_states));
 }
 
-static bool should_update_child(BdrvChild *c, BlockDriverState *to)
+static bool GRAPH_RDLOCK should_update_child(BdrvChild *c, BlockDriverState 
*to)
 {
 GQueue *queue;
 GHashTable *found;
@@ -5222,45 +5228,41 @@ static TransactionActionDrv bdrv_remove_child_drv = {
 .commit = bdrv_remove_child_commit,
 };
 
-/* Function doesn't update permissions, caller is responsible for this. */
-static void bdrv_remove_child(BdrvChild *child, Transaction *tran)
+/*
+ * Function doesn't update permissions, caller is responsible for this.
+ *
+ * @child->bs (if non-NULL) must be drained.
+ */
+static void GRAPH_WRLOCK bdrv_remove_child(BdrvChild *child, Transaction *tran)
 {
 if (!child) {
 return;
 }
 
 if (child->bs) {
-BlockDriverState *bs = child->bs;
-bdrv_drained_begin(bs);
+assert(child->quiesced_parent);
 bdrv_replace_child_tran(child, NULL, tran);
-bdrv_drained_end(bs);
 }
 
 tran_add(tran, &bdrv_remove_child_drv, child);
 }
 
-static void undrain_on_clean_cb(void *opaque)
-{
-bdrv_drained_end(opaque);
-}
-
-static TransactionActionDrv undrain_on_clean = {
-.clean = undrain_on_clean_cb,
-};
-
-static int bdrv_replace_node_noperm(BlockDriverState *from,
-BlockDriverState *to,
-bool auto_skip, Transaction *tran,
-Error **errp)
+/*
+ * Both @from and @to (if non-NULL) must be drained. @to must be kept drained
+ * until the transaction is completed.
+ */
+static int GRAPH_WRLOCK
+bdrv_replace_node_noperm(BlockDriverState *from,
+ BlockDriverState *to,
+ bool auto_skip, Transaction *tran,
+ Error **errp)
 {
 BdrvChild *c, *next;
 
 GLOBAL_STATE_CODE();
 
-bdrv_drained_begin(from);
-bdrv_drained_begin(to);
-tran_add(tran, &undrain_on_clean, from);
-tran_add(tran, &undrain_on_clean, to)

[PULL 10/28] block: Mark bdrv_attach_child_common() GRAPH_WRLOCK

2023-09-15 Thread Kevin Wolf
Instead of taking the writer lock internally, require callers to already
hold it when calling bdrv_attach_child_common(). These callers will
typically already hold the graph lock once the locking work is
completed, which means that they can't call functions that take it
internally.

Note that the transaction callbacks still take the lock internally, so
tran_finalize() must be called without the lock held. This is because
bdrv_append() also calls bdrv_replace_node_noperm(), which currently
requires the transaction callbacks to be called unlocked. In the next
step, both of them can be switched to locked tran_finalize() calls
together.

Signed-off-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Message-ID: <20230911094620.45040-11-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block.c| 133 +++--
 block/stream.c |  20 ++--
 2 files changed, 100 insertions(+), 53 deletions(-)

diff --git a/block.c b/block.c
index 0973b91d98..f6e7cf4fb9 100644
--- a/block.c
+++ b/block.c
@@ -3004,13 +3004,14 @@ static TransactionActionDrv 
bdrv_attach_child_common_drv = {
  * @child_bs can move to a different AioContext in this function. Callers must
  * make sure that their AioContext locking is still correct after this.
  */
-static BdrvChild *bdrv_attach_child_common(BlockDriverState *child_bs,
-   const char *child_name,
-   const BdrvChildClass *child_class,
-   BdrvChildRole child_role,
-   uint64_t perm, uint64_t shared_perm,
-   void *opaque,
-   Transaction *tran, Error **errp)
+static BdrvChild * GRAPH_WRLOCK
+bdrv_attach_child_common(BlockDriverState *child_bs,
+ const char *child_name,
+ const BdrvChildClass *child_class,
+ BdrvChildRole child_role,
+ uint64_t perm, uint64_t shared_perm,
+ void *opaque,
+ Transaction *tran, Error **errp)
 {
 BdrvChild *new_child;
 AioContext *parent_ctx, *new_child_ctx;
@@ -3088,10 +3089,8 @@ static BdrvChild 
*bdrv_attach_child_common(BlockDriverState *child_bs,
  * a problem, we already did this), but it will still poll until the parent
  * is fully quiesced, so it will not be negatively affected either.
  */
-bdrv_graph_wrlock(child_bs);
 bdrv_parent_drained_begin_single(new_child);
 bdrv_replace_child_noperm(new_child, child_bs);
-bdrv_graph_wrunlock();
 
 BdrvAttachChildCommonState *s = g_new(BdrvAttachChildCommonState, 1);
 *s = (BdrvAttachChildCommonState) {
@@ -3116,13 +3115,14 @@ static BdrvChild 
*bdrv_attach_child_common(BlockDriverState *child_bs,
  * @child_bs can move to a different AioContext in this function. Callers must
  * make sure that their AioContext locking is still correct after this.
  */
-static BdrvChild *bdrv_attach_child_noperm(BlockDriverState *parent_bs,
-   BlockDriverState *child_bs,
-   const char *child_name,
-   const BdrvChildClass *child_class,
-   BdrvChildRole child_role,
-   Transaction *tran,
-   Error **errp)
+static BdrvChild * GRAPH_WRLOCK
+bdrv_attach_child_noperm(BlockDriverState *parent_bs,
+ BlockDriverState *child_bs,
+ const char *child_name,
+ const BdrvChildClass *child_class,
+ BdrvChildRole child_role,
+ Transaction *tran,
+ Error **errp)
 {
 uint64_t perm, shared_perm;
 
@@ -3167,6 +3167,8 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState 
*child_bs,
 
 GLOBAL_STATE_CODE();
 
+bdrv_graph_wrlock(child_bs);
+
 child = bdrv_attach_child_common(child_bs, child_name, child_class,
child_role, perm, shared_perm, opaque,
tran, errp);
@@ -3178,6 +3180,7 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState 
*child_bs,
 ret = bdrv_refresh_perms(child_bs, tran, errp);
 
 out:
+bdrv_graph_wrunlock();
 tran_finalize(tran, ret);
 
 bdrv_unref(child_bs);
@@ -3209,6 +3212,8 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 
 GLOBAL_STATE_CODE();
 
+bdrv_graph_wrlock(child_bs);
+
 child = bdrv_attach_child_noperm(parent_bs, child_bs, child_name,
  child_class, child_role, tran, errp);
 if (!child) {
@@ -3222,6 +3227,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 }
 
 out:
+bdrv_graph_wrunlock();
   

Re: [PATCH 3/3] hw/nvme: Add SPDM over DOE support

2023-09-15 Thread Jonathan Cameron via
On Fri, 15 Sep 2023 21:27:23 +1000
Alistair Francis  wrote:

> From: Wilfred Mallawa 
> 
> Set up Data Object Exchange (DOE) as an extended capability for the NVMe
> controller and connect SPDM (CMA) to it.
> 
> Signed-off-by: Wilfred Mallawa 
> Signed-off-by: Alistair Francis 
A few comments inline.  

> ---
>  docs/specs/index.rst|  1 +
>  docs/specs/spdm.rst | 56 +
>  include/hw/pci/pci_device.h |  5 
>  include/hw/pci/pcie_doe.h   |  3 ++
>  hw/nvme/ctrl.c  | 52 ++
>  hw/nvme/trace-events|  1 +
>  6 files changed, 118 insertions(+)
>  create mode 100644 docs/specs/spdm.rst
> 
> diff --git a/docs/specs/index.rst b/docs/specs/index.rst
> index e58be38c41..c398541388 100644
> --- a/docs/specs/index.rst
> +++ b/docs/specs/index.rst
> @@ -24,3 +24,4 @@ guest hardware that is specific to QEMU.
> acpi_erst
> sev-guest-firmware
> fw_cfg
> +   spdm
> diff --git a/docs/specs/spdm.rst b/docs/specs/spdm.rst
> new file mode 100644
> index 00..0f96d618ef
> --- /dev/null
> +++ b/docs/specs/spdm.rst
> @@ -0,0 +1,56 @@
> +==
> +QEMU Security Protocols and Data Models (SPDM) Support
> +==
> +
> +SPDM enables authentication, attestation and key exchange to assist in
> +providing infrastructure security enablement. It's a standard published
> +by the DMTF https://www.dmtf.org/standards/SPDM.
> +
> +Setting up a SPDM server
> +
> +
> +When using QEMU with SPDM devices QEMU will connect to a server which
> +implements the SPDM functionality.
> +
> +spdm-emu
> +
> +
> +You can use spdm-emu https://github.com/dmtf/spdm-emu to model the
> +SPDM responder.
> +
> +.. code-block:: shell
> +
> +$ cd spdm-emu
> +$ git submodule init; git submodule update --recursive
> +$ mkdir build; cd build
> +$ cmake -DARCH=x64 -DTOOLCHAIN=GCC -DTARGET=Debug -DCRYPTO=openssl ..
> +$ make -j32
> +$ make copy_sample_key # Build certificates, required for SPDM 
> authentication.
> +
> +The responder can then be launched with
> +
> +.. code-block:: shell
> +
> +$ cd bin
> +$ ./spdm_responder_emu --trans PCI_DOE
> +
> +Connecting an SPDM NVMe device
> +==
> +
> +Once a SPDM server is running we can start QEMU and connect to the server.
> +
> +For an NVMe device first let's setup a block we can use
> +
> +.. code-block:: shell
> +
> +$ cd qemu-spdm/linux/image
> +$ dd if=/dev/zero of=blknvme bs=1M count=2096 # 2GB NVMe Drive
> +
> +Then you can add this to your QEMU command line:
> +
> +.. code-block:: shell
> +
> +-drive file=blknvme,if=none,id=mynvme,format=raw \
> +-device nvme,drive=mynvme,serial=deadbeef,spdm=2323
> +
> +At which point QEMU will connect to the SPDM server.

try to connect.

...

>  
>  void pcie_doe_init(PCIDevice *pdev, DOECap *doe_cap, uint16_t offset,
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 90687b168a..1ff30a9ad4 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -203,6 +203,7 @@
>  #include "sysemu/hostmem.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pcie_sriov.h"
> +#include "sysemu/spdm-socket.h"
>  #include "migration/vmstate.h"
>  
>  #include "nvme.h"
> @@ -8077,6 +8078,28 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, 
> uint8_t offset)
>  return 0;
>  }
>  
> +static bool pcie_doe_spdm_rsp(DOECap *doe_cap)
> +{
> +void *req = pcie_doe_get_write_mbox_ptr(doe_cap);
> +uint32_t req_len = pcie_doe_get_obj_len(req) * 4;
> +void *rsp = doe_cap->read_mbox;
> +uint32_t rsp_len = SPDM_SOCKET_MAX_MESSAGE_BUFFER_SIZE;
> +uint32_t recvd;
> +
> +recvd = spdm_socket_rsp(doe_cap->socket,
> + SPDM_SOCKET_TRANSPORT_TYPE_PCI_DOE,
> + req, req_len, rsp, rsp_len);
> +doe_cap->read_mbox_len += DIV_ROUND_UP(recvd, 4);
> +
> +return (recvd == 0) ? false : true;

return recvd != 0;

> +}
> +
> +static DOEProtocol doe_spdm_prot[] = {
> +{ PCI_VENDOR_ID_PCI_SIG, PCI_SIG_DOE_CMA, pcie_doe_spdm_rsp },
> +{ PCI_VENDOR_ID_PCI_SIG, PCI_SIG_DOE_SECURED_CMA, pcie_doe_spdm_rsp },
> +{ }
> +};
> +
>  static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
>  {
>  ERRP_GUARD();
> @@ -8133,6 +8156,23 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice 
> *pci_dev, Error **errp)
>  
>  nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
>  
> +pcie_cap_deverr_init(pci_dev);

Unrelated. Or I can't tell why it is related anyway.

> +
> +/* DOE Initialisation */
> +if (pci_dev->spdm_port) {
> +uint16_t doe_offset = n->params.sriov_max_vfs ?
> +  PCI_CONFIG_SPACE_SIZE + PCI_ARI_SIZEOF
> +  : PCI_CONFIG_SPACE_SIZE;
> +
> +pcie_doe_init(pci_dev, &pci_dev->doe_spdm, doe_

Re: [PATCH 2/3] backends: Initial support for SPDM socket support

2023-09-15 Thread Jonathan Cameron via
On Fri, 15 Sep 2023 21:27:22 +1000
Alistair Francis  wrote:

> From: Huai-Cheng Kuo 

Great to see you taking this forwards!


> 
> SPDM enables authentication, attestation and key exchange to assist in
> providing infrastructure security enablement. It's a standard published
> by the DMTF [1].
> 
> SPDM currently supports PCIe DOE and MCTP transports, but it can be
> extended to support others in the future. This patch adds
> support to QEMU to connect to an external SPDM instance.

It supports way more than that these days.  I'd just say 'multiple'
transports.

> 
> SPDM support can be added to any QEMU device by exposing a
> TCP socket to a SPDM server. The server can then implement the SPDM
> decoding/encoding support, generally using libspdm [2].
> 
> This is similar to how the current TPM implementation works and means
> that the heavy lifting of setting up certificate chains, capabilities,
> measurements and complex crypto can be done outside QEMU by a well
> supported and tested library.

Is this sufficient for usecases beyond initial attestation flows?
How does measurement work for example?  We need settings from the
emulated device to squirt into the SPDM agent so that it can be
encrypted and signed etc.

Measurement reports often need to include the status of various config
space registers + any device specific additional stuff - not sure
what is defined for NVME but I suspect the list will grow, particularly
when tdisp is included.  There are some things called out in the PCIe
state as must haves, like any debug features must be reported.
Also we need a way to mess with firmware revisions reported
as those are likely to be checked.

I'm not sure that model will work with the spdm-emu approach.

Anyhow, I think we need to have gotten a little further figuring that
out before we merge a solution.  I've been carrying this on the CXL
staging tree for a long time because I couldn't figure out a good solution
to the amount of information that needs to go between them.

For those not familiar with the fun of libSPDM, it is a pain to work with,
which is why Huai-Cheng instead connected to the demo app.

Any more luck getting a reliable build to work?

> 
> 1: https://www.dmtf.org/standards/SPDM
> 2: https://github.com/DMTF/libspdm
> 
> Signed-off-by: Huai-Cheng Kuo 
> Signed-off-by: Chris Browy 
> Co-developed-by: Jonathan Cameron 
> Signed-off-by: Jonathan Cameron 
> [ Changes by AF:
>  - Convert to be more QEMU-ified
>  - Move to backends as it isn't PCIe specific
> ]
> Signed-off-by: Alistair Francis 
Alistair, you sent this so I think your sign off should be last
+ some indication of Wilfred's involvement would be good?
Probably another Co-developed-by



> Signed-off-by: Wilfred Mallawa 
> ---

I've looked at this code too much in the past to give much
real review.  Still a few comments inline.
I'm very keen to get a solution to this upstream, though I think
we do need to discuss a few general points (no cover letter so I'll
do it here).


...

> diff --git a/backends/spdm-socket.c b/backends/spdm-socket.c
> new file mode 100644
> index 00..2f31ba80ba
> --- /dev/null
> +++ b/backends/spdm-socket.c
> @@ -0,0 +1,215 @@


> +
> +int spdm_socket_connect(uint16_t port, Error **errp)
> +{
> +int client_socket;
> +struct sockaddr_in server_addr;
> +
> +client_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
> +if (client_socket < 0) {
> +error_setg(errp, "cannot create socket: %s", strerror(errno));
> +return -1;
> +}
> +
> +memset((char *)&server_addr, 0, sizeof(server_addr));
> +server_addr.sin_family = AF_INET;
> +server_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> +server_addr.sin_port = htons(port);
> +
> +
> +if (connect(client_socket, (struct sockaddr *)&server_addr, 
> sizeof(server_addr)) < 0) {
Wrap the line.

> +error_setg(errp, "cannot connect: %s", strerror(errno));
> +close(client_socket);
> +return -1;
> +}
> +
> +return client_socket;
> +}





[PATCH v2 1/8] qemu-img: rebase: stop when reaching EOF of old backing file

2023-09-15 Thread Andrey Drobyshev via
In the case when we're rebasing within one backing chain and the target image
is larger than the old backing file, bdrv_is_allocated_above() ends up setting
*pnum = 0.  As a result, the target offset isn't getting incremented, and we
get stuck in an infinite for loop.  Let's detect this case and proceed
further down the loop body, as the offsets beyond the old backing size need
to be explicitly zeroed.
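
The shape of the fix, in context (a simplified sketch of the img_rebase()
loop, not the literal hunk below; variable names are illustrative and error
handling is trimmed):

    for (offset = 0; offset < size; offset += n) {
        n = MIN(IO_BUF_SIZE, size - offset);
        ...
        if (prefix_chain_bs) {
            uint64_t bytes = n;   /* remember the intended request length */

            ret = bdrv_is_allocated_above(unfiltered_bs, prefix_chain_bs,
                                          false, offset, n, &n);
            if (!ret && n) {
                continue;         /* unchanged since prefix chain, skip */
            }
            if (!n) {
                n = bytes;        /* past EOF of the old backing: fall
                                   * through and zero this range */
            }
        }
        /* ... read backings, compare, write/zero the target ... */
    }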

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Hanna Czenczek 
---
 qemu-img.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index a48edb7101..50660ba920 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3805,6 +3805,8 @@ static int img_rebase(int argc, char **argv)
 }
 
 if (prefix_chain_bs) {
+uint64_t bytes = n;
+
 /*
  * If cluster wasn't changed since prefix_chain, we don't need
  * to take action
@@ -3817,9 +3819,18 @@ static int img_rebase(int argc, char **argv)
  strerror(-ret));
 goto out;
 }
-if (!ret) {
+if (!ret && n) {
 continue;
 }
+if (!n) {
+/*
+ * If we've reached EOF of the old backing, it means that
+ * offsets beyond the old backing size were read as zeroes.
+ * Now we will need to explicitly zero the cluster in
+ * order to preserve that state after the rebase.
+ */
+n = bytes;
+}
 }
 
 /*
-- 
2.39.3




[PATCH v2 2/8] qemu-iotests: 024: add rebasing test case for overlay_size > backing_size

2023-09-15 Thread Andrey Drobyshev via
Before the previous commit, rebase was getting infinitely stuck in case of
rebasing within the same backing chain when overlay_size > backing_size.
Let's add this case to the rebasing test 024 to make sure it doesn't
break again.

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Hanna Czenczek 
---
 tests/qemu-iotests/024 | 57 ++
 tests/qemu-iotests/024.out | 30 
 2 files changed, 87 insertions(+)

diff --git a/tests/qemu-iotests/024 b/tests/qemu-iotests/024
index 25a564a150..98a7c8fd65 100755
--- a/tests/qemu-iotests/024
+++ b/tests/qemu-iotests/024
@@ -199,6 +199,63 @@ echo
 # $BASE_OLD and $BASE_NEW)
 $QEMU_IMG map "$OVERLAY" | _filter_qemu_img_map
 
+# Check that rebase within the chain is working when
+# overlay_size > old_backing_size
+#
+# base_new <-- base_old <-- overlay
+#
+# Backing (new): 11 11 11 11 11
+# Backing (old): 22 22 22 22
+# Overlay:   -- -- -- -- --
+#
+# As a result, overlay should contain data identical to base_old, with the
+# last cluster remaining unallocated.
+
+echo
+echo "=== Test rebase within one backing chain ==="
+echo
+
+echo "Creating backing chain"
+echo
+
+TEST_IMG=$BASE_NEW _make_test_img $(( CLUSTER_SIZE * 5 ))
+TEST_IMG=$BASE_OLD _make_test_img -b "$BASE_NEW" -F $IMGFMT \
+$(( CLUSTER_SIZE * 4 ))
+TEST_IMG=$OVERLAY _make_test_img -b "$BASE_OLD" -F $IMGFMT \
+$(( CLUSTER_SIZE * 5 ))
+
+echo
+echo "Fill backing files with data"
+echo
+
+$QEMU_IO "$BASE_NEW" -c "write -P 0x11 0 $(( CLUSTER_SIZE * 5 ))" \
+| _filter_qemu_io
+$QEMU_IO "$BASE_OLD" -c "write -P 0x22 0 $(( CLUSTER_SIZE * 4 ))" \
+| _filter_qemu_io
+
+echo
+echo "Check the last cluster is zeroed in overlay before the rebase"
+echo
+$QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
+| _filter_qemu_io
+
+echo
+echo "Rebase onto another image in the same chain"
+echo
+
+$QEMU_IMG rebase -b "$BASE_NEW" -F $IMGFMT "$OVERLAY"
+
+echo "Verify that data is read the same before and after rebase"
+echo
+
+# Verify the first 4 clusters are still read the same as in the old base
+$QEMU_IO "$OVERLAY" -c "read -P 0x22 0 $(( CLUSTER_SIZE * 4 ))" \
+| _filter_qemu_io
+# Verify the last cluster still reads as zeroes
+$QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
+| _filter_qemu_io
+
+echo
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/024.out b/tests/qemu-iotests/024.out
index 973a5a3711..245fe8b1d1 100644
--- a/tests/qemu-iotests/024.out
+++ b/tests/qemu-iotests/024.out
@@ -171,4 +171,34 @@ read 65536/65536 bytes at offset 196608
 Offset  Length  File
 0   0x3 TEST_DIR/subdir/t.IMGFMT
 0x3 0x1 TEST_DIR/subdir/t.IMGFMT.base_new
+
+=== Test rebase within one backing chain ===
+
+Creating backing chain
+
+Formatting 'TEST_DIR/subdir/t.IMGFMT.base_new', fmt=IMGFMT size=327680
+Formatting 'TEST_DIR/subdir/t.IMGFMT.base_old', fmt=IMGFMT size=262144 
backing_file=TEST_DIR/subdir/t.IMGFMT.base_new backing_fmt=IMGFMT
+Formatting 'TEST_DIR/subdir/t.IMGFMT', fmt=IMGFMT size=327680 
backing_file=TEST_DIR/subdir/t.IMGFMT.base_old backing_fmt=IMGFMT
+
+Fill backing files with data
+
+wrote 327680/327680 bytes at offset 0
+320 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 262144/262144 bytes at offset 0
+256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Check the last cluster is zeroed in overlay before the rebase
+
+read 65536/65536 bytes at offset 262144
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Rebase onto another image in the same chain
+
+Verify that data is read the same before and after rebase
+
+read 262144/262144 bytes at offset 0
+256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 262144
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 *** done
-- 
2.39.3




[PATCH v2 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment

2023-09-15 Thread Andrey Drobyshev via
Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for
the data read from the old and new backing files are aligned using
BlockDriverState (or BlockBackend later on) referring to the target image.
However, this isn't quite right, because buf_new is only being used for
reading from the new backing, while buf_old is being used for both reading
from the old backing and writing to the target.  Let's take that into account
and use more appropriate values as alignments.
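
To illustrate the policy (a hedged sketch, not the literal hunk below, which
picks between the two BlockBackends with an if; alloc_io_buf() is a
hypothetical helper used only for illustration):

    /* Align each buffer to the strictest requirement among the
     * BlockBackends it is actually used with. */
    static void *alloc_io_buf(BlockBackend *a, BlockBackend *b, size_t size)
    {
        size_t align = 1;

        if (a) {
            align = MAX(align, bdrv_opt_mem_align(blk_bs(a)));
        }
        if (b) {
            align = MAX(align, bdrv_opt_mem_align(blk_bs(b)));
        }
        return qemu_memalign(align, size);
    }

    /* buf_old is read from the old backing *and* written to the target */
    buf_old = alloc_io_buf(blk, blk_old_backing, IO_BUF_SIZE);
    /* buf_new is only ever read from the new backing */
    buf_new = alloc_io_buf(blk_new_backing, NULL, IO_BUF_SIZE);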

Signed-off-by: Andrey Drobyshev 
---
 qemu-img.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 50660ba920..d12e4a4753 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv)
 int64_t n;
 float local_progress = 0;
 
-buf_old = blk_blockalign(blk, IO_BUF_SIZE);
-buf_new = blk_blockalign(blk, IO_BUF_SIZE);
+if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) >
+bdrv_opt_mem_align(blk_bs(blk_old_backing))) {
+buf_old = blk_blockalign(blk, IO_BUF_SIZE);
+} else {
+buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
+}
+buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE);
 
 size = blk_getlength(blk);
 if (size < 0) {
-- 
2.39.3




[PATCH v2 0/8] qemu-img: rebase: add compression support

2023-09-15 Thread Andrey Drobyshev via
v1 --> v2:
 * Choose proper BlockBackend when aligning buf_old;
 * Add new patch ("qemu-img: add chunk size parameter to
   compare_buffers()");
 * Rework write alignment logic; now writes are aligned to either
   subcluster or cluster size, depending on whether compression is enabled;
 * Add new patch ("iotests/{024, 271}: add testcases for qemu-img
   rebase");
 * Add another compressed rebase testcase for images having subclusters.

v1: https://lists.nongnu.org/archive/html/qemu-block/2023-06/msg00068.html

NOTE: the compressed rebase testcase for subclusters assumes the "compressed"
field in "qemu-img map" output.  That series is currently in the block
branch and is likely to be merged into master soon:

https://lists.nongnu.org/archive/html/qemu-devel/2023-09/msg01489.html


Andrey Drobyshev (8):
  qemu-img: rebase: stop when reaching EOF of old backing file
  qemu-iotests: 024: add rebasing test case for overlay_size >
backing_size
  qemu-img: rebase: use backing files' BlockBackend for buffer alignment
  qemu-img: add chunk size parameter to compare_buffers()
  qemu-img: rebase: avoid unnecessary COW operations
  iotests/{024, 271}: add testcases for qemu-img rebase
  qemu-img: add compression option to rebase subcommand
  iotests: add tests for "qemu-img rebase" with compression

 docs/tools/qemu-img.rst|   6 +-
 qemu-img-cmds.hx   |   4 +-
 qemu-img.c | 136 +++---
 tests/qemu-iotests/024 | 117 ++
 tests/qemu-iotests/024.out |  73 
 tests/qemu-iotests/271 | 131 +
 tests/qemu-iotests/271.out |  82 ++
 tests/qemu-iotests/314 | 165 +
 tests/qemu-iotests/314.out |  75 +
 9 files changed, 753 insertions(+), 36 deletions(-)
 create mode 100755 tests/qemu-iotests/314
 create mode 100644 tests/qemu-iotests/314.out

-- 
2.39.3




[PATCH v2 8/8] iotests: add tests for "qemu-img rebase" with compression

2023-09-15 Thread Andrey Drobyshev via
The test cases considered so far:

314 (new test suite):

1. Check that compression mode isn't compatible with "-f raw" (raw
   format doesn't support compression).
2. Check that rebasing an image onto no backing file preserves the data
   and writes the copied clusters actually compressed.
3. Same as 2, but with a raw backing file (i.e. the clusters copied from the
   backing are originally uncompressed -- we check they end up compressed
   after being merged).
4. Remove a single delta from a backing chain, perform the same checks
   as in 2.
5. Check that even when backing and overlay are initially uncompressed,
   copied clusters end up compressed when rebase with compression is
   performed.

271:

1. Check that when target image has subclusters, rebase with compression
   will make an entire cluster containing the written subcluster
   compressed.

Signed-off-by: Andrey Drobyshev 
---
 tests/qemu-iotests/271 |  65 +++
 tests/qemu-iotests/271.out |  40 +
 tests/qemu-iotests/314 | 165 +
 tests/qemu-iotests/314.out |  75 +
 4 files changed, 345 insertions(+)
 create mode 100755 tests/qemu-iotests/314
 create mode 100644 tests/qemu-iotests/314.out

diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
index e243f57ba7..59a6fafa2f 100755
--- a/tests/qemu-iotests/271
+++ b/tests/qemu-iotests/271
@@ -965,6 +965,71 @@ echo
 
 TEST_IMG="$TEST_IMG.top" alloc="1 30" zero="" _verify_l2_bitmap 0
 
+# Check that rebase with compression works correctly with images containing
+# subclusters.  When compression is enabled and we allocate a new
+# subcluster within the target (overlay) image, we expect the entire cluster
+# containing that subcluster to become compressed.
+#
+# Here we expect 1st and 3rd clusters of the top (overlay) image to become
+# compressed after the rebase, while cluster 2 to remain unallocated and
+# be read from the base (new backing) image.
+#
+# Base (new backing): |-- -- .. -- --|11 11 .. 11 11|-- -- .. -- --|
+# Mid (old backing):  |-- -- .. -- 22|-- -- .. -- --|33 -- .. -- --|
+# Top:|-- -- .. -- --|-- -- -- -- --|-- -- .. -- --|
+
+echo
+echo "### Rebase with compression for images with subclusters ###"
+echo
+
+echo "# create backing chain"
+echo
+
+TEST_IMG="$TEST_IMG.base" _make_test_img -o cluster_size=1M,extended_l2=on 3M
+TEST_IMG="$TEST_IMG.mid" _make_test_img -o cluster_size=1M,extended_l2=on \
+-b "$TEST_IMG.base" -F qcow2 3M
+TEST_IMG="$TEST_IMG.top" _make_test_img -o cluster_size=1M,extended_l2=on \
+-b "$TEST_IMG.mid" -F qcow2 3M
+
+echo
+echo "# fill old and new backing with data"
+echo
+
+$QEMU_IO -c "write -P 0x11 1M 1M" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c "write -P 0x22 $(( 31 * 32 ))k 32k" \
+ -c "write -P 0x33 $(( 64 * 32 ))k 32k" \
+ "$TEST_IMG.mid" | _filter_qemu_io
+
+echo
+echo "# rebase topmost image onto the new backing, with compression"
+echo
+
+$QEMU_IMG rebase -c -b "$TEST_IMG.base" -F qcow2 "$TEST_IMG.top"
+
+echo "# verify that the 1st and 3rd clusters've become compressed"
+echo
+
+$QEMU_IMG map --output=json "$TEST_IMG.top" | _filter_testdir
+
+echo
+echo "# verify that data is read the same before and after rebase"
+echo
+
+$QEMU_IO -c "read -P 0x22 $(( 31 * 32 ))k 32k" \
+ -c "read -P 0x11 1M 1M" \
+ -c "read -P 0x33 $(( 64 * 32 ))k 32k" \
+ "$TEST_IMG.top" | _filter_qemu_io
+
+echo
+echo "# verify image bitmap"
+echo
+
+# For compressed clusters bitmap is always 0.  For unallocated cluster
+# there should be no entry at all, thus bitmap is also 0.
+TEST_IMG="$TEST_IMG.top" alloc="" zero="" _verify_l2_bitmap 0
+TEST_IMG="$TEST_IMG.top" alloc="" zero="" _verify_l2_bitmap 1
+TEST_IMG="$TEST_IMG.top" alloc="" zero="" _verify_l2_bitmap 2
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out
index c335a6c608..0b24d50159 100644
--- a/tests/qemu-iotests/271.out
+++ b/tests/qemu-iotests/271.out
@@ -765,4 +765,44 @@ Offset  Length  Mapped to   File
 # verify image bitmap
 
 L2 entry #0: 0x8050 4002
+
+### Rebase with compression for images with subclusters ###
+
+# create backing chain
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=3145728
+Formatting 'TEST_DIR/t.IMGFMT.mid', fmt=IMGFMT size=3145728 
backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=IMGFMT
+Formatting 'TEST_DIR/t.IMGFMT.top', fmt=IMGFMT size=3145728 
backing_file=TEST_DIR/t.IMGFMT.mid backing_fmt=IMGFMT
+
+# fill old and new backing with data
+
+wrote 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 32768/32768 bytes at offset 1015808
+32 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 32768/32768 bytes at offset 2097152
+32 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+# rebase topmost image onto the new backing, with compression
+
+# verify

[PATCH v2 4/8] qemu-img: add chunk size parameter to compare_buffers()

2023-09-15 Thread Andrey Drobyshev via
Add a @chsize parameter to the function which, if non-zero, represents
the chunk size to be used for comparison.  If it's zero, then
BDRV_SECTOR_SIZE is used as the default chunk size, which is the previous
behaviour.

In particular, we're going to use this param in img_rebase() to make the
write requests aligned to a predefined alignment value.
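
For illustration, the intended caller side would look roughly like the
following (a hedged sketch, not a hunk from this series; write_align is the
overlay (sub)cluster alignment introduced in a later patch):

    /* Compare in write_align-sized chunks so the matching/mismatching
     * prefix reported in pnum always covers whole allocation units and
     * the subsequent write into the target stays aligned. */
    if (compare_buffers(buf_old + written, buf_new + written,
                        n - written, write_align, &pnum)) {
        /* mismatch: copy pnum bytes from the old backing to the target */
    } else {
        /* match: skip pnum bytes */
    }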

Signed-off-by: Andrey Drobyshev 
---
 qemu-img.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index d12e4a4753..fcd31d7b5b 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1274,23 +1274,29 @@ static int is_allocated_sectors_min(const uint8_t *buf, 
int n, int *pnum,
 }
 
 /*
- * Compares two buffers sector by sector. Returns 0 if the first
- * sector of each buffer matches, non-zero otherwise.
+ * Compares two buffers chunk by chunk, where @chsize is the chunk size.
+ * If @chsize is 0, default chunk size of BDRV_SECTOR_SIZE is used.
+ * Returns 0 if the first chunk of each buffer matches, non-zero otherwise.
  *
- * pnum is set to the sector-aligned size of the buffer prefix that
+ * @pnum is set to the size of the buffer prefix aligned to @chsize that
  * has the same matching status as the first sector.
  */
 static int compare_buffers(const uint8_t *buf1, const uint8_t *buf2,
-   int64_t bytes, int64_t *pnum)
+   int64_t bytes, uint64_t chsize, int64_t *pnum)
 {
 bool res;
-int64_t i = MIN(bytes, BDRV_SECTOR_SIZE);
+int64_t i;
 
 assert(bytes > 0);
 
+if (!chsize) {
+chsize = BDRV_SECTOR_SIZE;
+}
+i = MIN(bytes, chsize);
+
 res = !!memcmp(buf1, buf2, i);
 while (i < bytes) {
-int64_t len = MIN(bytes - i, BDRV_SECTOR_SIZE);
+int64_t len = MIN(bytes - i, chsize);
 
 if (!!memcmp(buf1 + i, buf2 + i, len) != res) {
 break;
@@ -1559,7 +1565,7 @@ static int img_compare(int argc, char **argv)
 ret = 4;
 goto out;
 }
-ret = compare_buffers(buf1, buf2, chunk, &pnum);
+ret = compare_buffers(buf1, buf2, chunk, 0, &pnum);
 if (ret || pnum != chunk) {
 qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
 offset + (ret ? 0 : pnum));
@@ -3878,7 +3884,7 @@ static int img_rebase(int argc, char **argv)
 int64_t pnum;
 
 if (compare_buffers(buf_old + written, buf_new + written,
-n - written, &pnum))
+n - written, 0, &pnum))
 {
 if (buf_old_is_zero) {
 ret = blk_pwrite_zeroes(blk, offset + written, pnum, 
0);
-- 
2.39.3




[PATCH v2 5/8] qemu-img: rebase: avoid unnecessary COW operations

2023-09-15 Thread Andrey Drobyshev via
When rebasing an image from one backing file to another, we need to
compare data from old and new backings.  If the diff between that data
happens to be unaligned to the target cluster size, we might end up
doing partial writes, which would lead to copy-on-write and additional IO.

Consider the following simple case (virtual_size == cluster_size == 64K):

base <-- inc1 <-- inc2

qemu-io -c "write -P 0xaa 0 32K" base.qcow2
qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2

While doing the rebase, we'll write half of the cluster to inc2, and the
block layer will have to read the 2nd half of the same cluster from the
backing image (inc1) while doing this write operation, although the whole
cluster was already read earlier to perform the data comparison.

In order to avoid these unnecessary IO cycles, let's make sure every
write request is aligned to the overlay subcluster boundaries.  Using
the subcluster size is universal, as for images which don't have
subclusters this size equals the cluster size, so in any case we end up
aligning to the smallest unit of allocation.
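
A self-contained illustration of the widening arithmetic (QEMU uses
QEMU_ALIGN_DOWN()/QEMU_ALIGN_UP() for this; the local macros and the
concrete numbers below are only there so the snippet compiles and runs on
its own):

    #include <stdint.h>
    #include <stdio.h>

    #define ALIGN_DOWN(x, a) ((x) / (a) * (a))
    #define ALIGN_UP(x, a)   (ALIGN_DOWN((x) + (a) - 1, (a)))

    int main(void)
    {
        int64_t write_align = 32 * 1024;            /* overlay subcluster */
        int64_t size = 128 * 1024;                  /* virtual disk size  */
        int64_t offset = 40 * 1024, n = 20 * 1024;  /* unaligned region   */

        /* widen [offset, offset + n) to whole subclusters */
        n += offset - ALIGN_DOWN(offset, write_align);
        offset = ALIGN_DOWN(offset, write_align);
        n += ALIGN_UP(offset + n, write_align) - (offset + n);
        if (n > size - offset) {
            n = size - offset;
        }
        /* prints "write 32768 bytes at 32768": one full subcluster,
         * so the later write triggers no copy-on-write */
        printf("write %lld bytes at %lld\n", (long long)n, (long long)offset);
        return 0;
    }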

Signed-off-by: Andrey Drobyshev 
---
 qemu-img.c | 76 --
 1 file changed, 56 insertions(+), 20 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index fcd31d7b5b..83950af42b 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3523,6 +3523,7 @@ static int img_rebase(int argc, char **argv)
 uint8_t *buf_new = NULL;
 BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
 BlockDriverState *unfiltered_bs;
+BlockDriverInfo bdi = {0};
 char *filename;
 const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
 int c, flags, src_flags, ret;
@@ -3533,6 +3534,7 @@ static int img_rebase(int argc, char **argv)
 bool quiet = false;
 Error *local_err = NULL;
 bool image_opts = false;
+int64_t write_align;
 
 /* Parse commandline parameters */
 fmt = NULL;
@@ -3656,6 +3658,20 @@ static int img_rebase(int argc, char **argv)
 }
 }
 
+/*
+ * We need overlay subcluster size to make sure write requests are
+ * aligned.
+ */
+ret = bdrv_get_info(unfiltered_bs, &bdi);
+if (ret < 0) {
+error_report("could not get block driver info");
+goto out;
+} else if (bdi.subcluster_size == 0) {
+bdi.subcluster_size = 1;
+}
+
+write_align = bdi.subcluster_size;
+
 /* For safe rebasing we need to compare old and new backing file */
 if (!unsafe) {
 QDict *options = NULL;
@@ -3753,7 +3769,7 @@ static int img_rebase(int argc, char **argv)
 int64_t old_backing_size = 0;
 int64_t new_backing_size = 0;
 uint64_t offset;
-int64_t n;
+int64_t n, n_old = 0, n_new = 0;
 float local_progress = 0;
 
 if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) >
@@ -3799,7 +3815,8 @@ static int img_rebase(int argc, char **argv)
 }
 
 for (offset = 0; offset < size; offset += n) {
-bool buf_old_is_zero = false;
+bool old_backing_eof = false;
+int64_t n_alloc;
 
 /* How many bytes can we handle with the next read? */
 n = MIN(IO_BUF_SIZE, size - offset);
@@ -3844,33 +3861,48 @@ static int img_rebase(int argc, char **argv)
 }
 }
 
+/*
+ * At this point we know that the region [offset; offset + n)
+ * is unallocated within the target image.  This region might be
+ * unaligned to the target image's (sub)cluster boundaries, as
+ * old backing may have smaller clusters (or have subclusters).
+ * We extend it to the aligned boundaries to avoid CoW on
+ * partial writes in blk_pwrite(),
+ */
+n += offset - QEMU_ALIGN_DOWN(offset, write_align);
+offset = QEMU_ALIGN_DOWN(offset, write_align);
+n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n);
+n = MIN(n, size - offset);
+assert(!bdrv_is_allocated(unfiltered_bs, offset, n, &n_alloc) &&
+   n_alloc == n);
+
+/*
+ * Much like the with the target image, we'll try to read as much
+ * of the old and new backings as we can.
+ */
+n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
+if (blk_new_backing) {
+n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset));
+}
+
 /*
  * Read old and new backing file and take into consideration that
  * backing files may be smaller than the COW image.
  */
-if (offset >= old_backing_size) {
-memset(buf_old, 0, n);
-buf_old_is_zero = true;
+memset(buf_old + n_old, 0, n

[PATCH v2 7/8] qemu-img: add compression option to rebase subcommand

2023-09-15 Thread Andrey Drobyshev via
If we rebase an image whose backing file has compressed clusters, we
might end up wasting disk space since the copied clusters are now
uncompressed.  In order to have better control over this, let's add
a "--compress" option to the "qemu-img rebase" command.

Note that this option affects only the clusters which are actually being
copied from the original backing file.  The clusters which were
uncompressed in the target image will remain so.
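
A sketch of how the option is wired up (it mirrors the hunks below; the
note about switching the write alignment to a whole cluster is an
assumption here, based on compressed writes not being able to cover
partial clusters):

    BdrvRequestFlags write_flags = 0;

    if (compress && !block_driver_can_compress(unfiltered_bs->drv)) {
        error_report("Compression not supported for this file format");
        return -1;
    } else if (compress) {
        write_flags |= BDRV_REQ_WRITE_COMPRESSED;
        /* assumed: writes are then aligned to the cluster size rather
         * than the subcluster size, so each compressed write covers a
         * full cluster */
    }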

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Hanna Czenczek 
---
 docs/tools/qemu-img.rst |  6 --
 qemu-img-cmds.hx|  4 ++--
 qemu-img.c  | 26 --
 3 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index ca5a2773cf..4459c065f1 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -667,7 +667,7 @@ Command description:
 
   List, apply, create or delete snapshots in image *FILENAME*.
 
-.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] -b BACKING_FILE [-F BACKING_FMT] FILENAME
+.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] [-c] -b BACKING_FILE [-F BACKING_FMT] FILENAME
 
   Changes the backing file of an image. Only the formats ``qcow2`` and
   ``qed`` support changing the backing file.
@@ -694,7 +694,9 @@ Command description:
 
 In order to achieve this, any clusters that differ between
 *BACKING_FILE* and the old backing file of *FILENAME* are merged
-into *FILENAME* before actually changing the backing file.
+into *FILENAME* before actually changing the backing file. With the
+``-c`` option specified, the clusters which are being merged (but not
+the entire *FILENAME* image) are compressed when written.
 
 Note that the safe mode is an expensive operation, comparable to
 converting an image. It only works if the old backing file still
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 1b1dab5b17..068692d13e 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -88,9 +88,9 @@ SRST
 ERST
 
 DEF("rebase", img_rebase,
-"rebase [--object objectdef] [--image-opts] [-U] [-q] [-f fmt] [-t cache] 
[-T src_cache] [-p] [-u] -b backing_file [-F backing_fmt] filename")
+"rebase [--object objectdef] [--image-opts] [-U] [-q] [-f fmt] [-t cache] 
[-T src_cache] [-p] [-u] [-c] -b backing_file [-F backing_fmt] filename")
 SRST
-.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] -b BACKING_FILE [-F BACKING_FMT] FILENAME
+.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] [-c] -b BACKING_FILE [-F BACKING_FMT] FILENAME
 ERST
 
 DEF("resize", img_resize,
diff --git a/qemu-img.c b/qemu-img.c
index 83950af42b..8f39dae187 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3527,11 +3527,13 @@ static int img_rebase(int argc, char **argv)
 char *filename;
 const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
 int c, flags, src_flags, ret;
+BdrvRequestFlags write_flags = 0;
 bool writethrough, src_writethrough;
 int unsafe = 0;
 bool force_share = false;
 int progress = 0;
 bool quiet = false;
+bool compress = false;
 Error *local_err = NULL;
 bool image_opts = false;
 int64_t write_align;
@@ -3548,9 +3550,10 @@ static int img_rebase(int argc, char **argv)
 {"object", required_argument, 0, OPTION_OBJECT},
 {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
 {"force-share", no_argument, 0, 'U'},
+{"compress", no_argument, 0, 'c'},
 {0, 0, 0, 0}
 };
-c = getopt_long(argc, argv, ":hf:F:b:upt:T:qU",
+c = getopt_long(argc, argv, ":hf:F:b:upt:T:qUc",
 long_options, NULL);
 if (c == -1) {
 break;
@@ -3598,6 +3601,9 @@ static int img_rebase(int argc, char **argv)
 case 'U':
 force_share = true;
 break;
+case 'c':
+compress = true;
+break;
 }
 }
 
@@ -3650,6 +3656,14 @@ static int img_rebase(int argc, char **argv)
 
 unfiltered_bs = bdrv_skip_filters(bs);
 
+if (compress && !block_driver_can_compress(unfiltered_bs->drv)) {
+error_report("Compression not supported for this file format");
+ret = -1;
+goto out;
+} else if (compress) {
+write_flags |= BDRV_REQ_WRITE_COMPRESSED;
+}
+
 if (out_basefmt != NULL) {
 if (bdrv_find_format(out_basefmt) == NULL) {
 error_report("Invalid format name: '%s'", out_basefmt);
@@ -3659,18 +3673,18 @@ static int img_rebase(int argc, char **argv)
 }
 
 /*
- * We need overlay subcluster size to make sure write requests are
- * aligned.
+ * We need overlay subcluster size (or cluster size in

[PATCH v2 6/8] iotests/{024, 271}: add testcases for qemu-img rebase

2023-09-15 Thread Andrey Drobyshev via
As the previous commit changes the logic of "qemu-img rebase" (it's using
write alignment now), let's add a couple more test cases to ensure it
works correctly.  In particular, the following scenarios:

024: add test case for rebase within one backing chain when the overlay
 cluster size > backing's cluster size;
271: add test case for rebasing images that contain subclusters.  Check
 that no extra allocations are being made.

Signed-off-by: Andrey Drobyshev 
---
 tests/qemu-iotests/024 | 60 ++
 tests/qemu-iotests/024.out | 43 +
 tests/qemu-iotests/271 | 66 ++
 tests/qemu-iotests/271.out | 42 
 4 files changed, 211 insertions(+)

diff --git a/tests/qemu-iotests/024 b/tests/qemu-iotests/024
index 98a7c8fd65..285f17e79f 100755
--- a/tests/qemu-iotests/024
+++ b/tests/qemu-iotests/024
@@ -257,6 +257,66 @@ $QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 
)) $CLUSTER_SIZE" \
 
 echo
 
+# Check that rebase within the chain is working when
+# overlay cluster size > backings cluster size
+# (here overlay cluster size == 2 * backings cluster size)
+#
+# base_new <-- base_old <-- overlay
+#
+# Backing (new): -- -- -- -- -- --
+# Backing (old): -- 11 -- -- 22 --
+# Overlay:  |-- --|-- --|-- --|
+#
+# We should end up having 1st and 3rd cluster allocated, and their halves
+# being read as zeroes.
+
+echo
+echo "=== Test rebase with different cluster sizes ==="
+echo
+
+echo "Creating backing chain"
+echo
+
+TEST_IMG=$BASE_NEW _make_test_img $(( CLUSTER_SIZE * 6 ))
+TEST_IMG=$BASE_OLD _make_test_img -b "$BASE_NEW" -F $IMGFMT \
+$(( CLUSTER_SIZE * 6 ))
+CLUSTER_SIZE=$(( CLUSTER_SIZE * 2 )) TEST_IMG=$OVERLAY \
+_make_test_img -b "$BASE_OLD" -F $IMGFMT $(( CLUSTER_SIZE * 6 ))
+
+TEST_IMG=$OVERLAY _img_info
+
+echo
+echo "Fill backing files with data"
+echo
+
+$QEMU_IO "$BASE_OLD" -c "write -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" \
+-c "write -P 0x22 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
+| _filter_qemu_io
+
+echo
+echo "Rebase onto another image in the same chain"
+echo
+
+$QEMU_IMG rebase -b "$BASE_NEW" -F $IMGFMT "$OVERLAY"
+
+echo "Verify that data is read the same before and after rebase"
+echo
+
+$QEMU_IO "$OVERLAY" -c "read -P 0x00 0 $CLUSTER_SIZE" \
+-c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" \
+-c "read -P 0x00 $(( CLUSTER_SIZE * 2 )) $(( CLUSTER_SIZE * 2 ))" \
+-c "read -P 0x22 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
+-c "read -P 0x00 $(( CLUSTER_SIZE * 5 )) $CLUSTER_SIZE" \
+| _filter_qemu_io
+
+echo
+echo "Verify that untouched cluster remains unallocated"
+echo
+
+$QEMU_IMG map "$OVERLAY" | _filter_qemu_img_map
+
+echo
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/024.out b/tests/qemu-iotests/024.out
index 245fe8b1d1..e1e8eea863 100644
--- a/tests/qemu-iotests/024.out
+++ b/tests/qemu-iotests/024.out
@@ -201,4 +201,47 @@ read 262144/262144 bytes at offset 0
 read 65536/65536 bytes at offset 262144
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+
+=== Test rebase with different cluster sizes ===
+
+Creating backing chain
+
+Formatting 'TEST_DIR/subdir/t.IMGFMT.base_new', fmt=IMGFMT size=393216
+Formatting 'TEST_DIR/subdir/t.IMGFMT.base_old', fmt=IMGFMT size=393216 
backing_file=TEST_DIR/subdir/t.IMGFMT.base_new backing_fmt=IMGFMT
+Formatting 'TEST_DIR/subdir/t.IMGFMT', fmt=IMGFMT size=393216 
backing_file=TEST_DIR/subdir/t.IMGFMT.base_old backing_fmt=IMGFMT
+image: TEST_DIR/subdir/t.IMGFMT
+file format: IMGFMT
+virtual size: 384 KiB (393216 bytes)
+cluster_size: 131072
+backing file: TEST_DIR/subdir/t.IMGFMT.base_old
+backing file format: IMGFMT
+
+Fill backing files with data
+
+wrote 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 262144
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Rebase onto another image in the same chain
+
+Verify that data is read the same before and after rebase
+
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 131072/131072 bytes at offset 131072
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 262144
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 327680
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Verify that untouched cluster remains unallocated
+
+Offset  Length  File
+0   0x2 TEST_DIR/subdir/t.IMGFMT
+0x4 0x2 TEST_DIR/subdir/t.IMGFMT
+
 *** done
diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
index c7c2cadda0..e243f57ba7 100755
--- a/tests/qemu-iotests/271
+++ b/tests/qemu-iotests/271
@@ -899,6 +899,72 @@ _concurrent_io | $

Re: [PATCH v2 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment

2023-09-15 Thread Eric Blake
On Fri, Sep 15, 2023 at 07:20:11PM +0300, Andrey Drobyshev wrote:
> Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for
> the data read from the old and new backing files are aligned using
> BlockDriverState (or BlockBackend later on) referring to the target image.
> However, this isn't quite right, because buf_new is only being used for
> reading from the new backing, while buf_old is being used for both reading
> from the old backing and writing to the target.  Let's take that into account
> and use more appropriate values as alignments.
> 
> Signed-off-by: Andrey Drobyshev 
> ---
>  qemu-img.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index 50660ba920..d12e4a4753 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv)
>  int64_t n;
>  float local_progress = 0;
>  
> -buf_old = blk_blockalign(blk, IO_BUF_SIZE);
> -buf_new = blk_blockalign(blk, IO_BUF_SIZE);
> +if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) >
> +bdrv_opt_mem_align(blk_bs(blk_old_backing))) {
> +buf_old = blk_blockalign(blk, IO_BUF_SIZE);
> +} else {
> +buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
> +}

Since bdrv_opt_mem_align(NULL) is safe, could we just simplify this to:

buf_old = qemu_memalign(MAX(bdrv_opt_mem_align(blk_old_backing),
bdrv_opt_mem_align(blk)), IO_BUF_SIZE);

instead of going through an if statement?  Or is the problem that
bdrv_opt_mem_align(NULL) can return the host page size (perhaps 64k),
which may be larger than technically needed in some scenarios?

> +buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE);
>  
>  size = blk_getlength(blk);
>  if (size < 0) {
> -- 
> 2.39.3

At any rate, aligning the buffers by how they will be used makes sense
(if the destination blk has looser requirements than the source
blk_old_backing, then accesses into blk_old are suspect).

Reviewed-by: Eric Blake 

[PATCH 05/21] parallels: return earlier from parallels_open() function on error

2023-09-15 Thread Denis V. Lunev
At the beginning of the function we can return immediately on error, up
to the point where we actually allocate s->header, since there is nothing
to clean up before that.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 0f127427bf..8f223bfd89 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -1084,7 +1084,7 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 
 ret = bdrv_pread(bs->file, 0, sizeof(ph), &ph, 0);
 if (ret < 0) {
-goto fail;
+return ret;
 }
 
 bs->total_sectors = le64_to_cpu(ph.nb_sectors);
@@ -1104,13 +1104,11 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 s->tracks = le32_to_cpu(ph.tracks);
 if (s->tracks == 0) {
 error_setg(errp, "Invalid image: Zero sectors per track");
-ret = -EINVAL;
-goto fail;
+return -EINVAL;
 }
 if (s->tracks > INT32_MAX/513) {
 error_setg(errp, "Invalid image: Too big cluster");
-ret = -EFBIG;
-goto fail;
+return -EFBIG;
 }
 s->prealloc_size = MAX(s->tracks, s->prealloc_size);
 s->cluster_size = s->tracks << BDRV_SECTOR_BITS;
@@ -1118,16 +1116,14 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 s->bat_size = le32_to_cpu(ph.bat_entries);
 if (s->bat_size > INT_MAX / sizeof(uint32_t)) {
 error_setg(errp, "Catalog too large");
-ret = -EFBIG;
-goto fail;
+return -EFBIG;
 }
 
 size = bat_entry_off(s->bat_size);
 s->header_size = ROUND_UP(size, bdrv_opt_mem_align(bs->file->bs));
 s->header = qemu_try_blockalign(bs->file->bs, s->header_size);
 if (s->header == NULL) {
-ret = -ENOMEM;
-goto fail;
+return -ENOMEM;
 }
 
 ret = bdrv_pread(bs->file, 0, s->header_size, s->header, 0);
-- 
2.34.1




[PATCH 02/21] parallels: mark driver as supporting CBT

2023-09-15 Thread Denis V. Lunev
The Parallels driver does indeed support the Parallels Dirty Bitmap feature
in read-only mode. The patch adds the bdrv_supports_persistent_dirty_bitmap()
callback, which always returns true to indicate that.

This will allow copying CBT from a Parallels image with qemu-img.

Note: read-write support is signalled through
bdrv_co_can_store_new_dirty_bitmap() and is different.
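
For context, the generic block layer consults the new callback roughly as
follows (a sketch of the wrapper in block/dirty-bitmap.c from memory, not
part of this patch):

    bool bdrv_supports_persistent_dirty_bitmap(BlockDriverState *bs)
    {
        if (bs->drv && bs->drv->bdrv_supports_persistent_dirty_bitmap) {
            return bs->drv->bdrv_supports_persistent_dirty_bitmap(bs);
        }
        return false;
    }

so simply wiring the callback up is enough for read-only consumers such as
qemu-img to treat the format as bitmap-capable.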

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index 2ebd8e1301..428f72de1c 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -1248,6 +1248,11 @@ static void parallels_close(BlockDriverState *bs)
 error_free(s->migration_blocker);
 }
 
+static bool parallels_is_support_dirty_bitmaps(BlockDriverState *bs)
+{
+return 1;
+}
+
 static BlockDriver bdrv_parallels = {
 .format_name= "parallels",
 .instance_size  = sizeof(BDRVParallelsState),
@@ -1256,6 +1261,7 @@ static BlockDriver bdrv_parallels = {
 .supports_backing   = true,
 
 .bdrv_has_zero_init = bdrv_has_zero_init_1,
+.bdrv_supports_persistent_dirty_bitmap = 
parallels_is_support_dirty_bitmaps,
 
 .bdrv_probe = parallels_probe,
 .bdrv_open  = parallels_open,
-- 
2.34.1




[PATCH 06/21] parallels: refactor path when we need to re-check image in parallels_open

2023-09-15 Thread Denis V. Lunev
More conditions follow, so the check should be more scalable.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 8f223bfd89..aa29df9f77 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -1065,7 +1065,7 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 int ret, size, i;
 int64_t file_nb_sectors, sector;
 uint32_t data_start;
-bool data_off_is_correct;
+bool need_check = false;
 
 ret = parallels_opts_prealloc(bs, options, errp);
 if (ret < 0) {
@@ -1133,11 +1133,12 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 s->bat_bitmap = (uint32_t *)(s->header + 1);
 
 if (le32_to_cpu(ph.inuse) == HEADER_INUSE_MAGIC) {
-s->header_unclean = true;
+need_check = s->header_unclean = true;
 }
 
-data_off_is_correct = parallels_test_data_off(s, file_nb_sectors,
-  &data_start);
+need_check = need_check ||
+ !parallels_test_data_off(s, file_nb_sectors, &data_start);
+
 s->data_start = data_start;
 s->data_end = s->data_start;
 if (s->data_end < (s->header_size >> BDRV_SECTOR_BITS)) {
@@ -1194,6 +1195,7 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 s->data_end = sector + s->tracks;
 }
 }
+need_check = need_check || s->data_end > file_nb_sectors;
 
 /*
  * We don't repair the image here if it's opened for checks. Also we don't
@@ -1203,12 +1205,8 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 return 0;
 }
 
-/*
- * Repair the image if it's dirty or
- * out-of-image corruption was detected.
- */
-if (s->data_end > file_nb_sectors || s->header_unclean
-|| !data_off_is_correct) {
+/* Repair the image if corruption was detected. */
+if (need_check) {
 BdrvCheckResult res;
 ret = bdrv_check(bs, &res, BDRV_FIX_ERRORS | BDRV_FIX_LEAKS);
 if (ret < 0) {
@@ -1217,7 +1215,6 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 goto fail;
 }
 }
-
 return 0;
 
 fail_format:
-- 
2.34.1




[PATCH 04/21] parallels: return earlier in fail_format branch in parallels_open()

2023-09-15 Thread Denis V. Lunev
We do not need to perform any deallocation/cleanup if a wrong format is
detected.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/parallels.c b/block/parallels.c
index 1d5409f2ba..0f127427bf 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -1226,7 +1226,8 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 
 fail_format:
 error_setg(errp, "Image not in Parallels format");
-ret = -EINVAL;
+return -EINVAL;
+
 fail:
 /*
  * "s" object was allocated by g_malloc0 so we can safely
-- 
2.34.1




[PATCH 09/21] parallels: fix broken parallels_check_data_off()

2023-09-15 Thread Denis V. Lunev
Once we have repaired the data_off field in the header, we should update
s->data_start, which is calculated based on it.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/parallels.c b/block/parallels.c
index 60ad41b49b..bdc4dd081b 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -531,6 +531,7 @@ parallels_check_data_off(BlockDriverState *bs, 
BdrvCheckResult *res,
 res->corruptions++;
 if (fix & BDRV_FIX_ERRORS) {
 s->header->data_off = cpu_to_le32(data_off);
+s->data_start = data_off;
 res->corruptions_fixed++;
 }
 
-- 
2.34.1




[PATCH 16/21] parallels: naive implementation of allocate_clusters with used bitmap

2023-09-15 Thread Denis V. Lunev
Allocate clusters from the first free run found in the used-clusters
bitmap, and fall back to extending the file only when the bitmap has no
free bits left.  The access to the bitmap is not fully optimized yet.
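
The allocation policy, condensed from the hunk below (sketch only; it
assumes the used bitmap and helpers introduced earlier in this series):

    first_free = find_first_zero_bit(s->used_bmap, s->used_bmap_size);
    if (first_free == s->used_bmap_size) {
        /* no hole inside the image: allocate at the current end of data,
         * growing/preallocating the file as before */
        host_off = s->data_end * BDRV_SECTOR_SIZE;
    } else {
        /* reuse a hole: clamp the request to the length of the free run */
        next_used = find_next_bit(s->used_bmap, s->used_bmap_size, first_free);
        to_allocate = MIN(to_allocate, next_used - first_free);
        host_off = s->data_start * BDRV_SECTOR_SIZE
                   + first_free * s->cluster_size;
    }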

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 51 ---
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index a6d2f05863..2efa578e21 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -253,7 +253,7 @@ allocate_clusters(BlockDriverState *bs, int64_t sector_num,
 {
 int ret = 0;
 BDRVParallelsState *s = bs->opaque;
-int64_t pos, space, idx, to_allocate, i, len;
+int64_t i, pos, idx, to_allocate, first_free, host_off;
 
 pos = block_status(s, sector_num, nb_sectors, pnum);
 if (pos > 0) {
@@ -276,15 +276,13 @@ allocate_clusters(BlockDriverState *bs, int64_t 
sector_num,
  */
 assert(idx < s->bat_size && idx + to_allocate <= s->bat_size);
 
-space = to_allocate * s->tracks;
-len = bdrv_co_getlength(bs->file->bs);
-if (len < 0) {
-return len;
-}
-if (s->data_end + space > (len >> BDRV_SECTOR_BITS)) {
+first_free = find_first_zero_bit(s->used_bmap, s->used_bmap_size);
+if (first_free == s->used_bmap_size) {
 uint32_t new_usedsize;
+int64_t space = to_allocate * s->tracks + s->prealloc_size;
+
+host_off = s->data_end * BDRV_SECTOR_SIZE;
 
-space += s->prealloc_size;
 /*
  * We require the expanded size to read back as zero. If the
  * user permitted truncation, we try that; but if it fails, we
@@ -313,6 +311,32 @@ allocate_clusters(BlockDriverState *bs, int64_t sector_num,
 s->used_bmap = bitmap_zero_extend(s->used_bmap, s->used_bmap_size,
   new_usedsize);
 s->used_bmap_size = new_usedsize;
+} else {
+int64_t next_used;
+next_used = find_next_bit(s->used_bmap, s->used_bmap_size, first_free);
+
+/* Not enough continuous clusters in the middle, adjust the size */
+if (next_used - first_free < to_allocate) {
+to_allocate = next_used - first_free;
+*pnum = (idx + to_allocate) * s->tracks - sector_num;
+}
+
+host_off = s->data_start * BDRV_SECTOR_SIZE;
+host_off += first_free * s->cluster_size;
+
+/*
+ * No need to preallocate if we are using tail area from the above
+ * branch. In the other case we are likely re-using hole. Preallocate
+ * the space if required by the prealloc_mode.
+ */
+if (s->prealloc_mode == PRL_PREALLOC_MODE_FALLOCATE &&
+host_off < s->data_end * BDRV_SECTOR_SIZE) {
+ret = bdrv_co_pwrite_zeroes(bs->file, host_off,
+s->cluster_size * to_allocate, 0);
+if (ret < 0) {
+return ret;
+}
+}
 }
 
 /*
@@ -344,15 +368,18 @@ allocate_clusters(BlockDriverState *bs, int64_t 
sector_num,
 }
 }
 
-ret = mark_used(bs, s->used_bmap, s->used_bmap_size,
-s->data_end << BDRV_SECTOR_BITS, to_allocate);
+ret = mark_used(bs, s->used_bmap, s->used_bmap_size, host_off, 
to_allocate);
 if (ret < 0) {
 /* Image consistency is broken. Alarm! */
 return ret;
 }
 for (i = 0; i < to_allocate; i++) {
-parallels_set_bat_entry(s, idx + i, s->data_end / s->off_multiplier);
-s->data_end += s->tracks;
+parallels_set_bat_entry(s, idx + i,
+host_off / BDRV_SECTOR_SIZE / s->off_multiplier);
+host_off += s->cluster_size;
+}
+if (host_off > s->data_end * BDRV_SECTOR_SIZE) {
+s->data_end = host_off / BDRV_SECTOR_SIZE;
 }
 
 return bat2sect(s, idx) + sector_num % s->tracks;
-- 
2.34.1




[PATCH 20/21] parallels: naive implementation of parallels_co_pwrite_zeroes

2023-09-15 Thread Denis V. Lunev
The zero flag is missing from the Parallels format specification. We can
resort to discard if we have no backing file.
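
Why the backing-file restriction matters: once a cluster's BAT entry is
cleared by the discard, reads of that range would fall through to the
backing image if one were attached, so zeroes can only be guaranteed
without a backing file.  A guard of roughly this shape is assumed to live
in parallels_co_pdiscard(), per the comment in the hunk below (that patch
is not shown in this digest):

    if (bs->backing) {
        /* cannot represent zeroes by deallocating: let the generic
         * block layer fall back to an explicit write of zeroes */
        return -ENOTSUP;
    }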

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index 83cb8d6722..a098e2cbc2 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -583,6 +583,19 @@ done:
 return ret;
 }
 
+static int coroutine_fn GRAPH_RDLOCK
+parallels_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int64_t bytes,
+   BdrvRequestFlags flags)
+{
+/*
+ * The zero flag is missed in the Parallels format specification. We can
+ * resort to discard if we have no backing file (this condition is checked
+ * inside parallels_co_pdiscard().
+ */
+return parallels_co_pdiscard(bs, offset, bytes);
+}
+
+
 static void parallels_check_unclean(BlockDriverState *bs,
 BdrvCheckResult *res,
 BdrvCheckMode fix)
@@ -1456,6 +1469,7 @@ static BlockDriver bdrv_parallels = {
 .bdrv_co_create_opts= parallels_co_create_opts,
 .bdrv_co_check  = parallels_co_check,
 .bdrv_co_pdiscard   = parallels_co_pdiscard,
+.bdrv_co_pwrite_zeroes  = parallels_co_pwrite_zeroes,
 };
 
 static void bdrv_parallels_init(void)
-- 
2.34.1




[PATCH 07/21] parallels: create mark_used() helper which sets bit in used bitmap

2023-09-15 Thread Denis V. Lunev
This functionality is already used twice, and the next patch will add more
code using it.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 34 +-
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index aa29df9f77..60ad41b49b 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -178,6 +178,21 @@ static void parallels_set_bat_entry(BDRVParallelsState *s,
 bitmap_set(s->bat_dirty_bmap, bat_entry_off(index) / s->bat_dirty_block, 
1);
 }
 
+static int mark_used(BlockDriverState *bs,
+ unsigned long *bitmap, uint32_t bitmap_size, int64_t off)
+{
+BDRVParallelsState *s = bs->opaque;
+uint32_t cluster_index = host_cluster_index(s, off);
+if (cluster_index >= bitmap_size) {
+return -E2BIG;
+}
+if (test_bit(cluster_index, bitmap)) {
+return -EBUSY;
+}
+bitmap_set(bitmap, cluster_index, 1);
+return 0;
+}
+
 static int64_t coroutine_fn GRAPH_RDLOCK
 allocate_clusters(BlockDriverState *bs, int64_t sector_num,
   int nb_sectors, int *pnum)
@@ -621,7 +636,7 @@ parallels_check_duplicate(BlockDriverState *bs, 
BdrvCheckResult *res,
 BDRVParallelsState *s = bs->opaque;
 int64_t host_off, host_sector, guest_sector;
 unsigned long *bitmap;
-uint32_t i, bitmap_size, cluster_index, bat_entry;
+uint32_t i, bitmap_size, bat_entry;
 int n, ret = 0;
 uint64_t *buf = NULL;
 bool fixed = false;
@@ -655,10 +670,9 @@ parallels_check_duplicate(BlockDriverState *bs, 
BdrvCheckResult *res,
 continue;
 }
 
-cluster_index = host_cluster_index(s, host_off);
-assert(cluster_index < bitmap_size);
-if (!test_bit(cluster_index, bitmap)) {
-bitmap_set(bitmap, cluster_index, 1);
+ret = mark_used(bs, bitmap, bitmap_size, host_off);
+assert(ret != -E2BIG);
+if (ret == 0) {
 continue;
 }
 
@@ -713,11 +727,13 @@ parallels_check_duplicate(BlockDriverState *bs, 
BdrvCheckResult *res,
  * consistent for the new allocated clusters too.
  *
  * Note, clusters allocated outside the current image are not
- * considered, and the bitmap size doesn't change.
+ * considered, and the bitmap size doesn't change. This specifically
+ * means that -E2BIG is OK.
  */
-cluster_index = host_cluster_index(s, host_off);
-if (cluster_index < bitmap_size) {
-bitmap_set(bitmap, cluster_index, 1);
+ret = mark_used(bs, bitmap, bitmap_size, host_off);
+if (ret == -EBUSY) {
+res->check_errors++;
+goto out_repair_bat;
 }
 
 fixed = true;
-- 
2.34.1




[PATCH 08/21] tests: ensure that image validation will not cure the corruption

2023-09-15 Thread Denis V. Lunev
Since
commit cfce1091d55322789582480798a891cbaf66924e
Author: Alexander Ivanov 
Date:   Tue Jul 18 12:44:29 2023 +0200
parallels: Image repairing in parallels_open()
there is a potential pitfall with calling
qemu-io -c "read"
The image is opened in read-write mode and thus could potentially be
repaired. This could ruin the testing process.

The patch forces read-only opening for reads. In that case repairing
is impossible.

Signed-off-by: Denis V. Lunev 
---
 tests/qemu-iotests/tests/parallels-checks | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/tests/parallels-checks 
b/tests/qemu-iotests/tests/parallels-checks
index a7a1b357b5..5917ee079d 100755
--- a/tests/qemu-iotests/tests/parallels-checks
+++ b/tests/qemu-iotests/tests/parallels-checks
@@ -91,7 +91,7 @@ file_size=`stat --printf="%s" "$TEST_IMG"`
 echo "file size: $file_size"
 
 echo "== check last cluster =="
-{ $QEMU_IO -c "read -P 0x11 $LAST_CLUSTER_OFF $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IO -r -c "read -P 0x11 $LAST_CLUSTER_OFF $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
 
 # Clear image
 _make_test_img $SIZE
@@ -105,19 +105,20 @@ echo "== write another pattern to second cluster =="
 { $QEMU_IO -c "write -P 0x55 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 
| _filter_qemu_io | _filter_testdir
 
 echo "== check second cluster =="
-{ $QEMU_IO -c "read -P 0x55 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+{ $QEMU_IO -r -c "read -P 0x55 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
+
 
 echo "== corrupt image =="
 poke_file "$TEST_IMG" "$(($BAT_OFFSET + 4))" "\x01\x00\x00\x00"
 
 echo "== check second cluster =="
-{ $QEMU_IO -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+{ $QEMU_IO -r -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
 
 echo "== repair image =="
 _check_test_img -r all
 
 echo "== check second cluster =="
-{ $QEMU_IO -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+{ $QEMU_IO -r -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
 
 echo "== check first cluster on host =="
 printf "content: 0x%02x\n" `peek_file_le $TEST_IMG $(($CLUSTER_SIZE)) 1`
-- 
2.34.1




[PATCH 12/21] tests: fix broken deduplication check in parallels format test

2023-09-15 Thread Denis V. Lunev
The original check is broken: the supposed read from 2 different clusters
results in reading from the same file offset twice, which is definitely
wrong.

We should be sure that
* the content of both clusters is correct after repair
* clusters are at the different offsets after repair
In order to check the latter we write some content into the first one
and validate that fact.

Signed-off-by: Denis V. Lunev 
---
 tests/qemu-iotests/tests/parallels-checks | 14 ++
 tests/qemu-iotests/tests/parallels-checks.out | 16 
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/tests/qemu-iotests/tests/parallels-checks 
b/tests/qemu-iotests/tests/parallels-checks
index f4ca50295e..df99558486 100755
--- a/tests/qemu-iotests/tests/parallels-checks
+++ b/tests/qemu-iotests/tests/parallels-checks
@@ -117,14 +117,20 @@ echo "== check second cluster =="
 echo "== repair image =="
 _check_test_img -r all
 
+echo "== check the first cluster =="
+{ $QEMU_IO -r -c "read -P 0x11 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+
 echo "== check second cluster =="
 { $QEMU_IO -r -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
 
-echo "== check first cluster on host =="
-printf "content: 0x%02x\n" `peek_file_le $TEST_IMG $(($CLUSTER_SIZE)) 1`
+echo "== write another pattern to the first clusters =="
+{ $QEMU_IO -c "write -P 0x66 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+
+echo "== check the first cluster =="
+{ $QEMU_IO -r -c "read -P 0x66 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
 
-echo "== check second cluster on host =="
-printf "content: 0x%02x\n" `peek_file_le $TEST_IMG $(($CLUSTER_SIZE)) 1`
+echo "== check the second cluster (deduplicated) =="
+{ $QEMU_IO -r -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
 
 # Clear image
 _make_test_img $SIZE
diff --git a/tests/qemu-iotests/tests/parallels-checks.out 
b/tests/qemu-iotests/tests/parallels-checks.out
index 74a5e29260..1325d2b611 100644
--- a/tests/qemu-iotests/tests/parallels-checks.out
+++ b/tests/qemu-iotests/tests/parallels-checks.out
@@ -55,13 +55,21 @@ The following inconsistencies were found and repaired:
 
 Double checking the fixed image now...
 No errors were found on the image.
+== check the first cluster ==
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 == check second cluster ==
 read 1048576/1048576 bytes at offset 1048576
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-== check first cluster on host ==
-content: 0x11
-== check second cluster on host ==
-content: 0x11
+== write another pattern to the first clusters ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check the first cluster ==
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check the second cluster (deduplicated) ==
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=4194304
 == TEST DATA_OFF CHECK ==
 == write pattern to first cluster ==
-- 
2.34.1




[PATCH 00/21] implement discard operation for Parallels images

2023-09-15 Thread Denis V. Lunev
This series introduces a new block allocation scheme: clusters are allocated
into unused data blocks inside the image first, and only after that is the
file extended. On top of that, naive implementations of discard and
write-zeroes (through the discard) are added.

There are also a bunch of bugs revealed in the code during the
implementation and testing.

Signed-off-by: Denis V. Lunev 

Denis V. Lunev (21):
  parallels: fix formatting in bdrv_parallels initialization
  parallels: mark driver as supporting CBT
  parallels: invent parallels_opts_prealloc() helper to parse prealloc
opts
  parallels: return earler in fail_format branch in parallels_open()
  parallels: return earlier from parallels_open() function on error
  parallels: refactor path when we need to re-check image in
parallels_open
  parallels: create mark_used() helper which sets bit in used bitmap
  tests: ensure that image validation will not cure the corruption
  parallels: fix broken parallels_check_data_off()
  parallels: add test which will validate data_off fixes through repair
  parallels: collect bitmap of used clusters at open
  tests: fix broken deduplication check in parallels format test
  tests: test self-cure of parallels image with duplicated clusters
  parallels: accept multiple clusters in mark_used()
  parallels: update used bitmap in allocate_cluster
  parallels: naive implementation of allocate_clusters with used bitmap
  parallels: improve readability of allocate_clusters
  parallels: naive implementation of parallels_co_pdiscard
  tests: extend test 131 to cover availability of the discard operation
  parallels: naive implementation of parallels_co_pwrite_zeroes
  tests: extend test 131 to cover availability of the write-zeroes

 block/parallels.c | 382 ++
 block/parallels.h |   3 +
 tests/qemu-iotests/131|  51 +++
 tests/qemu-iotests/131.out|  58 +++
 tests/qemu-iotests/tests/parallels-checks |  76 +++-
 tests/qemu-iotests/tests/parallels-checks.out |  65 ++-
 6 files changed, 534 insertions(+), 101 deletions(-)

-- 
2.34.1
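
A minimal, self-contained sketch of the hole-first allocation idea described
in the cover letter above. The names (used_bmap, cluster_size, file_size) are
illustrative only and do not match the driver's internals exactly; the real
code also accounts for the data_off header area.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative state: one bit per host cluster, 1 = occupied. */
    typedef struct {
        uint8_t *used_bmap;       /* bitmap of used host clusters */
        size_t   used_bmap_size;  /* number of tracked clusters */
        uint64_t file_size;       /* current image file size, bytes */
        uint64_t cluster_size;    /* host cluster size, bytes */
    } Allocator;

    static bool bit_is_set(const uint8_t *bm, size_t i)
    {
        return bm[i / 8] & (1u << (i % 8));
    }

    static void bit_set(uint8_t *bm, size_t i)
    {
        bm[i / 8] |= 1u << (i % 8);
    }

    /*
     * Pick the host offset for a new cluster: reuse the first hole in the
     * used bitmap if there is one, otherwise append at the end of the file
     * (the caller then grows the file and the bitmap).
     */
    static uint64_t alloc_cluster(Allocator *a, bool *extend_file)
    {
        for (size_t i = 0; i < a->used_bmap_size; i++) {
            if (!bit_is_set(a->used_bmap, i)) {
                bit_set(a->used_bmap, i);
                *extend_file = false;
                return i * a->cluster_size;   /* reuse a discarded hole */
            }
        }
        *extend_file = true;
        return a->file_size;                  /* no hole: extend the file */
    }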




[PATCH 02/21] parallels: mark driver as supporting CBT

2023-09-15 Thread Denis V. Lunev
The Parallels driver indeed supports the Parallels Dirty Bitmap Feature in
read-only mode. The patch adds a bdrv_supports_persistent_dirty_bitmap()
callback which always returns true to indicate that.

This will allow copying CBT from a Parallels image with qemu-img.

Note: read-write support is signalled through
bdrv_co_can_store_new_dirty_bitmap() and is different.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index 2ebd8e1301..428f72de1c 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -1248,6 +1248,11 @@ static void parallels_close(BlockDriverState *bs)
 error_free(s->migration_blocker);
 }
 
+static bool parallels_is_support_dirty_bitmaps(BlockDriverState *bs)
+{
+return 1;
+}
+
 static BlockDriver bdrv_parallels = {
 .format_name= "parallels",
 .instance_size  = sizeof(BDRVParallelsState),
@@ -1256,6 +1261,7 @@ static BlockDriver bdrv_parallels = {
 .supports_backing   = true,
 
 .bdrv_has_zero_init = bdrv_has_zero_init_1,
+.bdrv_supports_persistent_dirty_bitmap = 
parallels_is_support_dirty_bitmaps,
 
 .bdrv_probe = parallels_probe,
 .bdrv_open  = parallels_open,
-- 
2.34.1




[PATCH 18/21] parallels: naive implementation of parallels_co_pdiscard

2023-09-15 Thread Denis V. Lunev
* Discarding with backing stores is not supported by the format.
* There is no buffering/queueing of the discard operation.
* Only operations aligned to the cluster are supported.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 47 +++
 1 file changed, 47 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index 76aedfd7c4..83cb8d6722 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -537,6 +537,52 @@ parallels_co_readv(BlockDriverState *bs, int64_t 
sector_num, int nb_sectors,
 return ret;
 }
 
+
+static int coroutine_fn GRAPH_RDLOCK_PTR
+parallels_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+int ret = 0;
+uint32_t cluster, count;
+BDRVParallelsState *s = bs->opaque;
+
+/*
+ * The image does not support ZERO mark inside the BAT, which means that
+ * stale data could be exposed from the backing file.
+ */
+if (bs->backing) {
+return -ENOTSUP;
+}
+
+if (!QEMU_IS_ALIGNED(offset, s->cluster_size)) {
+return -ENOTSUP;
+} else if (!QEMU_IS_ALIGNED(bytes, s->cluster_size)) {
+return -ENOTSUP;
+}
+
+cluster = offset / s->cluster_size;
+count = bytes / s->cluster_size;
+
+qemu_co_mutex_lock(&s->lock);
+for (; count > 0; cluster++, count--) {
+int64_t host_off = bat2sect(s, cluster) << BDRV_SECTOR_BITS;
+if (host_off == 0) {
+continue;
+}
+
+ret = bdrv_co_pdiscard(bs->file, cluster * s->cluster_size,
+   s->cluster_size);
+if (ret < 0) {
+goto done;
+}
+
+parallels_set_bat_entry(s, cluster, 0);
+bitmap_clear(s->used_bmap, host_cluster_index(s, host_off), 1);
+}
+done:
+qemu_co_mutex_unlock(&s->lock);
+return ret;
+}
+
 static void parallels_check_unclean(BlockDriverState *bs,
 BdrvCheckResult *res,
 BdrvCheckMode fix)
@@ -1409,6 +1455,7 @@ static BlockDriver bdrv_parallels = {
 .bdrv_co_create = parallels_co_create,
 .bdrv_co_create_opts= parallels_co_create_opts,
 .bdrv_co_check  = parallels_co_check,
+.bdrv_co_pdiscard   = parallels_co_pdiscard,
 };
 
 static void bdrv_parallels_init(void)
-- 
2.34.1
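
A standalone illustration of the restrictions listed above (a sketch, not
driver code): a request only takes the naive discard path when there is no
backing file and both offset and length are whole clusters.

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true when the naive discard path would accept the request. */
    static bool discard_request_supported(uint64_t offset, uint64_t bytes,
                                          uint64_t cluster_size,
                                          bool has_backing)
    {
        if (has_backing) {
            return false;   /* stale data could show through the backing file */
        }
        return offset % cluster_size == 0 && bytes % cluster_size == 0;
    }

    /*
     * With cluster_size = 1 MiB:
     *   offset = 0,       bytes = 1 MiB  -> true
     *   offset = 512 KiB, bytes = 1 MiB  -> false, i.e. -ENOTSUP in the driver
     */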




[PATCH 13/21] tests: test self-cure of parallels image with duplicated clusters

2023-09-15 Thread Denis V. Lunev
The test is quite similar to the original one for duplicated clusters.
The only difference is in the operation which should fix the image.

Signed-off-by: Denis V. Lunev 
---
 tests/qemu-iotests/tests/parallels-checks | 36 +++
 tests/qemu-iotests/tests/parallels-checks.out | 31 
 2 files changed, 67 insertions(+)

diff --git a/tests/qemu-iotests/tests/parallels-checks 
b/tests/qemu-iotests/tests/parallels-checks
index df99558486..b281246a42 100755
--- a/tests/qemu-iotests/tests/parallels-checks
+++ b/tests/qemu-iotests/tests/parallels-checks
@@ -135,6 +135,42 @@ echo "== check the second cluster (deduplicated) =="
 # Clear image
 _make_test_img $SIZE
 
+echo "== TEST DUPLICATION SELF-CURE =="
+
+echo "== write pattern to whole image =="
+{ $QEMU_IO -c "write -P 0x11 0 $SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | 
_filter_testdir
+
+echo "== write another pattern to second cluster =="
+{ $QEMU_IO -c "write -P 0x55 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 
| _filter_qemu_io | _filter_testdir
+
+echo "== check second cluster =="
+{ $QEMU_IO -r -c "read -P 0x55 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
+
+
+echo "== corrupt image =="
+poke_file "$TEST_IMG" "$(($BAT_OFFSET + 4))" "\x01\x00\x00\x00"
+
+echo "== check second cluster =="
+{ $QEMU_IO -r -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
+
+echo "== check the first cluster with self-repair =="
+{ $QEMU_IO -c "read -P 0x11 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+
+echo "== check second cluster =="
+{ $QEMU_IO -r -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
+
+echo "== write another pattern to the first clusters =="
+{ $QEMU_IO -c "write -P 0x66 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+
+echo "== check the first cluster =="
+{ $QEMU_IO -r -c "read -P 0x66 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+
+echo "== check the second cluster (deduplicated) =="
+{ $QEMU_IO -r -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
+
+# Clear image
+_make_test_img $SIZE
+
 echo "== TEST DATA_OFF CHECK =="
 
 echo "== write pattern to first cluster =="
diff --git a/tests/qemu-iotests/tests/parallels-checks.out 
b/tests/qemu-iotests/tests/parallels-checks.out
index 1325d2b611..9793423111 100644
--- a/tests/qemu-iotests/tests/parallels-checks.out
+++ b/tests/qemu-iotests/tests/parallels-checks.out
@@ -71,6 +71,37 @@ read 1048576/1048576 bytes at offset 0
 read 1048576/1048576 bytes at offset 1048576
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=4194304
+== TEST DUPLICATION SELF-CURE ==
+== write pattern to whole image ==
+wrote 4194304/4194304 bytes at offset 0
+4 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== write another pattern to second cluster ==
+wrote 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check second cluster ==
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== corrupt image ==
+== check second cluster ==
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check the first cluster with self-repair ==
+Repairing duplicate offset in BAT entry 1
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check second cluster ==
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== write another pattern to the first clusters ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check the first cluster ==
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check the second cluster (deduplicated) ==
+read 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=4194304
 == TEST DATA_OFF CHECK ==
 == write pattern to first cluster ==
 wrote 1048576/1048576 bytes at offset 0
-- 
2.34.1




[PATCH 21/21] tests: extend test 131 to cover availability of the write-zeroes

2023-09-15 Thread Denis V. Lunev
This patch contains a test which minimally tests write-zeroes on top of
working discard.

The following checks are added:
* write 2 clusters, write-zero the first allocated cluster
* write 2 clusters, write-zero half of the first allocated cluster

Signed-off-by: Denis V. Lunev 
---
 tests/qemu-iotests/131 | 20 
 tests/qemu-iotests/131.out | 20 
 2 files changed, 40 insertions(+)

diff --git a/tests/qemu-iotests/131 b/tests/qemu-iotests/131
index e50a658f22..308732d84b 100755
--- a/tests/qemu-iotests/131
+++ b/tests/qemu-iotests/131
@@ -105,6 +105,26 @@ _make_test_img $size
 { $QEMU_IO -c "read -P 0 0 $CLUSTER_HALF_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
 { $QEMU_IO -c "read -P 0 $((CLUSTER_SIZE + CLUSTER_HALF_SIZE)) 
$CLUSTER_DBL_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
 
+echo "== check write-zeroes =="
+
+# Clear image
+_make_test_img $size
+
+{ $QEMU_IO -c "write -P 0x11 0 $CLUSTER_DBL_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+{ $QEMU_IO -c "write -z 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io 
| _filter_testdir
+{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
+{ $QEMU_IO -c "read -P 0 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+
+echo "== check cluster-partial write-zeroes =="
+
+# Clear image
+_make_test_img $size
+
+{ $QEMU_IO -c "write -P 0x11 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+{ $QEMU_IO -c "write -z 0 $CLUSTER_HALF_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+{ $QEMU_IO -c "read -P 0 0 $CLUSTER_HALF_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+{ $QEMU_IO -c "read -P 0x11 $CLUSTER_HALF_SIZE $CLUSTER_HALF_SIZE" 
"$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+
 echo "== allocate with backing =="
 # Verify that allocating clusters works fine even when there is a backing 
image.
 # Regression test for a bug where we would pass a buffer read from the backing
diff --git a/tests/qemu-iotests/131.out b/tests/qemu-iotests/131.out
index 9882f9df6c..8493561bab 100644
--- a/tests/qemu-iotests/131.out
+++ b/tests/qemu-iotests/131.out
@@ -64,6 +64,26 @@ read 524288/524288 bytes at offset 0
 512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 2097152/2097152 bytes at offset 1572864
 2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check write-zeroes ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset  Length  File
+0x100x10TEST_DIR/t.IMGFMT
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check cluster-partial write-zeroes ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 524288/524288 bytes at offset 0
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 0
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 524288
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 == allocate with backing ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
-- 
2.34.1




[PATCH 17/21] parallels: improve readability of allocate_clusters

2023-09-15 Thread Denis V. Lunev
Replace 'space', representing the amount of data to preallocate, with
'bytes'.

Rationale:
* 'space' is converted to bytes at each place it is used
* the new name is closer to the unit of the value

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 2efa578e21..76aedfd7c4 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -279,7 +279,8 @@ allocate_clusters(BlockDriverState *bs, int64_t sector_num,
 first_free = find_first_zero_bit(s->used_bmap, s->used_bmap_size);
 if (first_free == s->used_bmap_size) {
 uint32_t new_usedsize;
-int64_t space = to_allocate * s->tracks + s->prealloc_size;
+int64_t bytes = to_allocate * s->cluster_size;
+bytes += s->prealloc_size * BDRV_SECTOR_SIZE;
 
 host_off = s->data_end * BDRV_SECTOR_SIZE;
 
@@ -289,8 +290,7 @@ allocate_clusters(BlockDriverState *bs, int64_t sector_num,
  * force the safer-but-slower fallocate.
  */
 if (s->prealloc_mode == PRL_PREALLOC_MODE_TRUNCATE) {
-ret = bdrv_co_truncate(bs->file,
-   (s->data_end + space) << BDRV_SECTOR_BITS,
+ret = bdrv_co_truncate(bs->file, host_off + bytes,
false, PREALLOC_MODE_OFF,
BDRV_REQ_ZERO_WRITE, NULL);
 if (ret == -ENOTSUP) {
@@ -298,16 +298,13 @@ allocate_clusters(BlockDriverState *bs, int64_t 
sector_num,
 }
 }
 if (s->prealloc_mode == PRL_PREALLOC_MODE_FALLOCATE) {
-ret = bdrv_co_pwrite_zeroes(bs->file,
-s->data_end << BDRV_SECTOR_BITS,
-space << BDRV_SECTOR_BITS, 0);
+ret = bdrv_co_pwrite_zeroes(bs->file, host_off, bytes, 0);
 }
 if (ret < 0) {
 return ret;
 }
 
-new_usedsize = s->used_bmap_size +
-   (space << BDRV_SECTOR_BITS) / s->cluster_size;
+new_usedsize = s->used_bmap_size + bytes / s->cluster_size;
 s->used_bmap = bitmap_zero_extend(s->used_bmap, s->used_bmap_size,
   new_usedsize);
 s->used_bmap_size = new_usedsize;
-- 
2.34.1




[PATCH 10/21] parallels: add test which will validate data_off fixes through repair

2023-09-15 Thread Denis V. Lunev
We only have a check through self-repair and that has proven to be not enough.

Signed-off-by: Denis V. Lunev 
---
 tests/qemu-iotests/tests/parallels-checks | 17 +
 tests/qemu-iotests/tests/parallels-checks.out | 18 ++
 2 files changed, 35 insertions(+)

diff --git a/tests/qemu-iotests/tests/parallels-checks 
b/tests/qemu-iotests/tests/parallels-checks
index 5917ee079d..f4ca50295e 100755
--- a/tests/qemu-iotests/tests/parallels-checks
+++ b/tests/qemu-iotests/tests/parallels-checks
@@ -140,6 +140,23 @@ poke_file "$TEST_IMG" "$DATA_OFF_OFFSET" "\xff\xff\xff\xff"
 echo "== check first cluster =="
 { $QEMU_IO -c "read -P 0x55 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
 
+# Clear image
+_make_test_img $SIZE
+
+echo "== TEST DATA_OFF THROUGH REPAIR =="
+
+echo "== write pattern to first cluster =="
+{ $QEMU_IO -c "write -P 0x55 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+
+echo "== spoil data_off field =="
+poke_file "$TEST_IMG" "$DATA_OFF_OFFSET" "\xff\xff\xff\xff"
+
+echo "== repair image =="
+_check_test_img -r all
+
+echo "== check first cluster =="
+{ $QEMU_IO -r -c "read -P 0x55 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/tests/parallels-checks.out 
b/tests/qemu-iotests/tests/parallels-checks.out
index 98a3a7f55e..74a5e29260 100644
--- a/tests/qemu-iotests/tests/parallels-checks.out
+++ b/tests/qemu-iotests/tests/parallels-checks.out
@@ -72,4 +72,22 @@ wrote 1048576/1048576 bytes at offset 0
 Repairing data_off field has incorrect value
 read 1048576/1048576 bytes at offset 0
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=4194304
+== TEST DATA_OFF THROUGH REPAIR ==
+== write pattern to first cluster ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== spoil data_off field ==
+== repair image ==
+Repairing data_off field has incorrect value
+The following inconsistencies were found and repaired:
+
+0 leaked clusters
+1 corruptions
+
+Double checking the fixed image now...
+No errors were found on the image.
+== check first cluster ==
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 *** done
-- 
2.34.1




[PATCH 11/21] parallels: collect bitmap of used clusters at open

2023-09-15 Thread Denis V. Lunev
If the operation fails, we need to check image consistency unless the
problem is about memory allocation.

Bitmap adjustments in allocate_clusters are not performed yet.
They are worth a separate patch. This was proven useful during debugging
of this series. Kept as is for future bisecting.

It should be specifically noted that the used bitmap must be recalculated
if data_off has been fixed during the image consistency check.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 73 +++
 block/parallels.h |  3 ++
 2 files changed, 76 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index bdc4dd081b..2517f35581 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -193,6 +193,58 @@ static int mark_used(BlockDriverState *bs,
 return 0;
 }
 
+/*
+ * Collect used bitmap. The image can contain errors, we should fill the
+ * bitmap anyway, as much as we can. This information will be used for
+ * error resolution.
+ */
+static int parallels_fill_used_bitmap(BlockDriverState *bs)
+{
+BDRVParallelsState *s = bs->opaque;
+int64_t payload_bytes;
+uint32_t i;
+int err = 0;
+
+payload_bytes = bdrv_co_getlength(bs->file->bs);
+if (payload_bytes < 0) {
+return payload_bytes;
+}
+payload_bytes -= s->data_start * BDRV_SECTOR_SIZE;
+if (payload_bytes < 0) {
+return -EINVAL;
+}
+
+s->used_bmap_size = DIV_ROUND_UP(payload_bytes, s->cluster_size);
+if (s->used_bmap_size == 0) {
+return 0;
+}
+s->used_bmap = bitmap_try_new(s->used_bmap_size);
+if (s->used_bmap == NULL) {
+return -ENOMEM;
+}
+
+for (i = 0; i < s->bat_size; i++) {
+int err2;
+int64_t host_off = bat2sect(s, i) << BDRV_SECTOR_BITS;
+if (host_off == 0) {
+continue;
+}
+
+err2 = mark_used(bs, s->used_bmap, s->used_bmap_size, host_off);
+if (err2 < 0 && err == 0) {
+err = err2;
+}
+}
+return err;
+}
+
+static void parallels_free_used_bitmap(BlockDriverState *bs)
+{
+BDRVParallelsState *s = bs->opaque;
+s->used_bmap_size = 0;
+g_free(s->used_bmap);
+}
+
 static int64_t coroutine_fn GRAPH_RDLOCK
 allocate_clusters(BlockDriverState *bs, int64_t sector_num,
   int nb_sectors, int *pnum)
@@ -530,8 +582,17 @@ parallels_check_data_off(BlockDriverState *bs, 
BdrvCheckResult *res,
 
 res->corruptions++;
 if (fix & BDRV_FIX_ERRORS) {
+int err;
 s->header->data_off = cpu_to_le32(data_off);
 s->data_start = data_off;
+
+parallels_free_used_bitmap(bs);
+err = parallels_fill_used_bitmap(bs);
+if (err == -ENOMEM) {
+res->check_errors++;
+return err;
+}
+
 res->corruptions_fixed++;
 }
 
@@ -1214,6 +1275,14 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 need_check = need_check || s->data_end > file_nb_sectors;
 
+if (!need_check) {
+ret = parallels_fill_used_bitmap(bs);
+if (ret == -ENOMEM) {
+goto fail;
+}
+need_check = need_check || ret < 0; /* These are correctable errors */
+}
+
 /*
  * We don't repair the image here if it's opened for checks. Also we don't
  * want to change inactive images and can't change readonly images.
@@ -1243,6 +1312,8 @@ fail:
  * "s" object was allocated by g_malloc0 so we can safely
  * try to free its fields even they were not allocated.
  */
+parallels_free_used_bitmap(bs);
+
 error_free(s->migration_blocker);
 g_free(s->bat_dirty_bmap);
 qemu_vfree(s->header);
@@ -1263,6 +1334,8 @@ static void parallels_close(BlockDriverState *bs)
   PREALLOC_MODE_OFF, 0, NULL);
 }
 
+parallels_free_used_bitmap(bs);
+
 g_free(s->bat_dirty_bmap);
 qemu_vfree(s->header);
 
diff --git a/block/parallels.h b/block/parallels.h
index 4e53e9572d..6b199443cf 100644
--- a/block/parallels.h
+++ b/block/parallels.h
@@ -72,6 +72,9 @@ typedef struct BDRVParallelsState {
 unsigned long *bat_dirty_bmap;
 unsigned int  bat_dirty_block;
 
+unsigned long *used_bmap;
+unsigned long used_bmap_size;
+
 uint32_t *bat_bitmap;
 unsigned int bat_size;
 
-- 
2.34.1
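
To make the bitmap sizing above concrete, here is a standalone sketch with
example numbers. The index formula is an assumption meant to mirror what
host_cluster_index() does (offsets counted from data_start), not a copy of it.

    #include <stdint.h>

    #define BDRV_SECTOR_SIZE 512ull
    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

    /* Number of bits needed in the used bitmap for a given file length. */
    static uint64_t used_bmap_bits(uint64_t file_len,
                                   uint64_t data_start_sectors,
                                   uint64_t cluster_size)
    {
        uint64_t payload = file_len - data_start_sectors * BDRV_SECTOR_SIZE;
        return DIV_ROUND_UP(payload, cluster_size);
    }

    /* Assumed mapping of a host offset to its bit index. */
    static uint64_t host_cluster_idx(uint64_t host_off,
                                     uint64_t data_start_sectors,
                                     uint64_t cluster_size)
    {
        return (host_off - data_start_sectors * BDRV_SECTOR_SIZE) / cluster_size;
    }

    /*
     * Example: a 10 MiB file, data_start = 2048 sectors (1 MiB of header and
     * BAT), cluster_size = 1 MiB -> 9 bits in the bitmap; the cluster at host
     * offset 3 MiB maps to bit index 2.
     */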




[PATCH 03/21] parallels: invent parallels_opts_prealloc() helper to parse prealloc opts

2023-09-15 Thread Denis V. Lunev
This patch creates the above-mentioned helper and moves its usage to the
beginning of parallels_open(). This simplifies parallels_open() a bit.

The patch also ensures that we always store prealloc_size on the block
driver state in sectors. This makes the code cleaner and avoids the wrong
impression at the assignment point that the value is in bytes.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 65 +++
 1 file changed, 38 insertions(+), 27 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 428f72de1c..1d5409f2ba 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -1025,6 +1025,38 @@ static int parallels_update_header(BlockDriverState *bs)
 return bdrv_pwrite_sync(bs->file, 0, size, s->header, 0);
 }
 
+
+static int parallels_opts_prealloc(BlockDriverState *bs, QDict *options,
+   Error **errp)
+{
+char *buf;
+int64_t bytes;
+BDRVParallelsState *s = bs->opaque;
+Error *local_err = NULL;
+QemuOpts *opts = qemu_opts_create(&parallels_runtime_opts, NULL, 0, errp);
+if (!opts) {
+return -ENOMEM;
+}
+
+if (!qemu_opts_absorb_qdict(opts, options, errp)) {
+return -EINVAL;
+}
+
+bytes = qemu_opt_get_size_del(opts, PARALLELS_OPT_PREALLOC_SIZE, 0);
+s->prealloc_size = bytes >> BDRV_SECTOR_BITS;
+buf = qemu_opt_get_del(opts, PARALLELS_OPT_PREALLOC_MODE);
+/* prealloc_mode can be downgraded later during allocate_clusters */
+s->prealloc_mode = qapi_enum_parse(&prealloc_mode_lookup, buf,
+   PRL_PREALLOC_MODE_FALLOCATE,
+   &local_err);
+g_free(buf);
+if (local_err != NULL) {
+error_propagate(errp, local_err);
+return -EINVAL;
+}
+return 0;
+}
+
 static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
   Error **errp)
 {
@@ -1033,11 +1065,13 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 int ret, size, i;
 int64_t file_nb_sectors, sector;
 uint32_t data_start;
-QemuOpts *opts = NULL;
-Error *local_err = NULL;
-char *buf;
 bool data_off_is_correct;
 
+ret = parallels_opts_prealloc(bs, options, errp);
+if (ret < 0) {
+return ret;
+}
+
 ret = bdrv_open_file_child(NULL, options, "file", bs, errp);
 if (ret < 0) {
 return ret;
@@ -1078,6 +1112,7 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 ret = -EFBIG;
 goto fail;
 }
+s->prealloc_size = MAX(s->tracks, s->prealloc_size);
 s->cluster_size = s->tracks << BDRV_SECTOR_BITS;
 
 s->bat_size = le32_to_cpu(ph.bat_entries);
@@ -1117,29 +1152,6 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 s->header_size = size;
 }
 
-opts = qemu_opts_create(&parallels_runtime_opts, NULL, 0, errp);
-if (!opts) {
-goto fail_options;
-}
-
-if (!qemu_opts_absorb_qdict(opts, options, errp)) {
-goto fail_options;
-}
-
-s->prealloc_size =
-qemu_opt_get_size_del(opts, PARALLELS_OPT_PREALLOC_SIZE, 0);
-s->prealloc_size = MAX(s->tracks, s->prealloc_size >> BDRV_SECTOR_BITS);
-buf = qemu_opt_get_del(opts, PARALLELS_OPT_PREALLOC_MODE);
-/* prealloc_mode can be downgraded later during allocate_clusters */
-s->prealloc_mode = qapi_enum_parse(&prealloc_mode_lookup, buf,
-   PRL_PREALLOC_MODE_FALLOCATE,
-   &local_err);
-g_free(buf);
-if (local_err != NULL) {
-error_propagate(errp, local_err);
-goto fail_options;
-}
-
 if (ph.ext_off) {
 if (flags & BDRV_O_RDWR) {
 /*
@@ -1214,7 +1226,6 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 
 fail_format:
 error_setg(errp, "Image not in Parallels format");
-fail_options:
 ret = -EINVAL;
 fail:
 /*
-- 
2.34.1




[PATCH 14/21] parallels: accept multiple clusters in mark_used()

2023-09-15 Thread Denis V. Lunev
This will be useful in the next patch in allocate_clusters(). This
change does not imply serious performance drawbacks, as usually the image
is full of data or the allocations are at the end of the bitmap.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 2517f35581..a2ba5a9353 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -178,18 +178,20 @@ static void parallels_set_bat_entry(BDRVParallelsState *s,
 bitmap_set(s->bat_dirty_bmap, bat_entry_off(index) / s->bat_dirty_block, 
1);
 }
 
-static int mark_used(BlockDriverState *bs,
- unsigned long *bitmap, uint32_t bitmap_size, int64_t off)
+static int mark_used(BlockDriverState *bs, unsigned long *bitmap,
+ uint32_t bitmap_size, int64_t off, uint32_t count)
 {
 BDRVParallelsState *s = bs->opaque;
 uint32_t cluster_index = host_cluster_index(s, off);
-if (cluster_index >= bitmap_size) {
+unsigned long next_used;
+if (cluster_index + count > bitmap_size) {
 return -E2BIG;
 }
-if (test_bit(cluster_index, bitmap)) {
+next_used = find_next_bit(bitmap, bitmap_size, cluster_index);
+if (next_used < cluster_index + count) {
 return -EBUSY;
 }
-bitmap_set(bitmap, cluster_index, 1);
+bitmap_set(bitmap, cluster_index, count);
 return 0;
 }
 
@@ -230,7 +232,7 @@ static int parallels_fill_used_bitmap(BlockDriverState *bs)
 continue;
 }
 
-err2 = mark_used(bs, s->used_bmap, s->used_bmap_size, host_off);
+err2 = mark_used(bs, s->used_bmap, s->used_bmap_size, host_off, 1);
 if (err2 < 0 && err == 0) {
 err = err2;
 }
@@ -732,7 +734,7 @@ parallels_check_duplicate(BlockDriverState *bs, 
BdrvCheckResult *res,
 continue;
 }
 
-ret = mark_used(bs, bitmap, bitmap_size, host_off);
+ret = mark_used(bs, bitmap, bitmap_size, host_off, 1);
 assert(ret != -E2BIG);
 if (ret == 0) {
 continue;
@@ -792,7 +794,7 @@ parallels_check_duplicate(BlockDriverState *bs, 
BdrvCheckResult *res,
  * considered, and the bitmap size doesn't change. This specifically
  * means that -E2BIG is OK.
  */
-ret = mark_used(bs, bitmap, bitmap_size, host_off);
+ret = mark_used(bs, bitmap, bitmap_size, host_off, 1);
 if (ret == -EBUSY) {
 res->check_errors++;
 goto out_repair_bat;
-- 
2.34.1
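
A standalone illustration of the range check described above, assuming the
usual find_next_bit semantics (it returns the first set bit at or after the
start position, or the size if none is found); the names are illustrative:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Toy find_next_bit over a byte-backed bitmap. */
    static size_t find_next_bit_toy(const uint8_t *bm, size_t size, size_t start)
    {
        for (size_t i = start; i < size; i++) {
            if (bm[i / 8] & (1u << (i % 8))) {
                return i;
            }
        }
        return size;
    }

    /*
     * Range variant of the "is it already used?" check: the range
     * [first, first + count) is free only if no set bit falls inside it.
     */
    static bool range_is_free(const uint8_t *bm, size_t size,
                              size_t first, size_t count)
    {
        if (first + count > size) {
            return false;                                 /* -E2BIG in the driver */
        }
        return find_next_bit_toy(bm, size, first) >= first + count;  /* else -EBUSY */
    }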




[PATCH 15/21] parallels: update used bitmap in allocate_cluster

2023-09-15 Thread Denis V. Lunev
We should extend the bitmap if the file is extended and set the bit in
the image used bitmap once the cluster is allocated. A sanity check at
that moment also looks like a good idea.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index a2ba5a9353..a6d2f05863 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -282,6 +282,8 @@ allocate_clusters(BlockDriverState *bs, int64_t sector_num,
 return len;
 }
 if (s->data_end + space > (len >> BDRV_SECTOR_BITS)) {
+uint32_t new_usedsize;
+
 space += s->prealloc_size;
 /*
  * We require the expanded size to read back as zero. If the
@@ -305,6 +307,12 @@ allocate_clusters(BlockDriverState *bs, int64_t sector_num,
 if (ret < 0) {
 return ret;
 }
+
+new_usedsize = s->used_bmap_size +
+   (space << BDRV_SECTOR_BITS) / s->cluster_size;
+s->used_bmap = bitmap_zero_extend(s->used_bmap, s->used_bmap_size,
+  new_usedsize);
+s->used_bmap_size = new_usedsize;
 }
 
 /*
@@ -336,6 +344,12 @@ allocate_clusters(BlockDriverState *bs, int64_t sector_num,
 }
 }
 
+ret = mark_used(bs, s->used_bmap, s->used_bmap_size,
+s->data_end << BDRV_SECTOR_BITS, to_allocate);
+if (ret < 0) {
+/* Image consistency is broken. Alarm! */
+return ret;
+}
 for (i = 0; i < to_allocate; i++) {
 parallels_set_bat_entry(s, idx + i, s->data_end / s->off_multiplier);
 s->data_end += s->tracks;
-- 
2.34.1




[PATCH 19/21] tests: extend test 131 to cover availability of the discard operation

2023-09-15 Thread Denis V. Lunev
This patch contains a test which minimally tests discard and the new
cluster allocation logic.

The following checks are added:
* write 2 clusters, discard the first allocated one
* write another cluster, check that the hole is filled
* write 2 clusters, discard the first allocated one, write 1 cluster at
  an offset not aligned to the cluster size (2 new clusters should be allocated)

Signed-off-by: Denis V. Lunev 
---
 tests/qemu-iotests/131 | 31 +++
 tests/qemu-iotests/131.out | 38 ++
 2 files changed, 69 insertions(+)

diff --git a/tests/qemu-iotests/131 b/tests/qemu-iotests/131
index 304bbb3f61..e50a658f22 100755
--- a/tests/qemu-iotests/131
+++ b/tests/qemu-iotests/131
@@ -74,6 +74,37 @@ poke_file "$TEST_IMG" "$inuse_offset" "\x59\x6e\x6f\x74"
 echo "== read corrupted image with repairing =="
 { $QEMU_IO -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
 
+echo "== check discard =="
+
+# Clear image
+_make_test_img $size
+
+{ $QEMU_IO -c "write -P 0x11 0 $CLUSTER_DBL_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
+{ $QEMU_IO -c "discard 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io 
| _filter_testdir
+{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
+{ $QEMU_IO -c "read -P 0 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+
+echo "== check simple allocation over the discarded hole =="
+
+{ $QEMU_IO -c "write -P 0x11 $CLUSTER_DBL_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
+{ $QEMU_IO -c "read -P 0x11 $CLUSTER_DBL_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
+
+echo "== check more complex allocation over the discard hole =="
+
+# Clear image
+_make_test_img $size
+
+{ $QEMU_IO -c "write -P 0x11 $CLUSTER_DBL_SIZE $CLUSTER_DBL_SIZE" "$TEST_IMG"; 
} 2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IO -c "discard $CLUSTER_DBL_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+# There is 1 cluster hole. Fill it fully and allocate 1 cluster at the end
+{ $QEMU_IO -c "write -P 0x12 $CLUSTER_HALF_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
+{ $QEMU_IO -c "read -P 0x12 $CLUSTER_HALF_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 
2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IO -c "read -P 0 0 $CLUSTER_HALF_SIZE" "$TEST_IMG"; } 2>&1 | 
_filter_qemu_io | _filter_testdir
+{ $QEMU_IO -c "read -P 0 $((CLUSTER_SIZE + CLUSTER_HALF_SIZE)) 
$CLUSTER_DBL_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+
 echo "== allocate with backing =="
 # Verify that allocating clusters works fine even when there is a backing 
image.
 # Regression test for a bug where we would pass a buffer read from the backing
diff --git a/tests/qemu-iotests/131.out b/tests/qemu-iotests/131.out
index d2904578df..9882f9df6c 100644
--- a/tests/qemu-iotests/131.out
+++ b/tests/qemu-iotests/131.out
@@ -26,6 +26,44 @@ read 524288/524288 bytes at offset 0
 Repairing image was not closed correctly
 read 1048576/1048576 bytes at offset 1048576
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check discard ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset  Length  File
+0   0x20TEST_DIR/t.IMGFMT
+discard 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset  Length  File
+0x100x10TEST_DIR/t.IMGFMT
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check simple allocation over the discarded hole ==
+wrote 1048576/1048576 bytes at offset 2097152
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset  Length  File
+0x100x10TEST_DIR/t.IMGFMT
+0x200x10TEST_DIR/t.IMGFMT
+read 1048576/1048576 bytes at offset 2097152
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check more complex allocation over the discard hole ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+wrote 2097152/2097152 bytes at offset 2097152
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 1048576/1048576 bytes at offset 2097152
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1048576/1048576 bytes at offset 524288
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset  Length  File
+0   0x10TEST_DIR/t.IMGFMT
+0x100x10TEST_DIR/t.IMGFMT
+0x300x10TEST_DIR/t.IMGFMT
+read 1048576/1048576 bytes at offset 524288
+1 MiB, X ops; XX

[PATCH 11/21] parallels: collect bitmap of used clusters at open

2023-09-15 Thread Denis V. Lunev
If the operation fails, we need to check image consistency unless the
problem is about memory allocation.

Bitmap adjustments in allocate_clusters are not performed yet.
They are worth a separate patch. This was proven useful during debugging
of this series. Kept as is for future bisecting.

It should be specifically noted that the used bitmap must be recalculated
if data_off has been fixed during the image consistency check.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 73 +++
 block/parallels.h |  3 ++
 2 files changed, 76 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index 182ef98872..d677a1a253 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -193,6 +193,58 @@ static int mark_used(BlockDriverState *bs,
 return 0;
 }
 
+/*
+ * Collect used bitmap. The image can contain errors, we should fill the
+ * bitmap anyway, as much as we can. This information will be used for
+ * error resolution.
+ */
+static int parallels_fill_used_bitmap(BlockDriverState *bs)
+{
+BDRVParallelsState *s = bs->opaque;
+int64_t payload_bytes;
+uint32_t i;
+int err = 0;
+
+payload_bytes = bdrv_co_getlength(bs->file->bs);
+if (payload_bytes < 0) {
+return payload_bytes;
+}
+payload_bytes -= s->data_start * BDRV_SECTOR_SIZE;
+if (payload_bytes < 0) {
+return -EINVAL;
+}
+
+s->used_bmap_size = DIV_ROUND_UP(payload_bytes, s->cluster_size);
+if (s->used_bmap_size == 0) {
+return 0;
+}
+s->used_bmap = bitmap_try_new(s->used_bmap_size);
+if (s->used_bmap == NULL) {
+return -ENOMEM;
+}
+
+for (i = 0; i < s->bat_size; i++) {
+int err2;
+int64_t host_off = bat2sect(s, i) << BDRV_SECTOR_BITS;
+if (host_off == 0) {
+continue;
+}
+
+err2 = mark_used(bs, s->used_bmap, s->used_bmap_size, host_off);
+if (err2 < 0 && err == 0) {
+err = err2;
+}
+}
+return err;
+}
+
+static void parallels_free_used_bitmap(BlockDriverState *bs)
+{
+BDRVParallelsState *s = bs->opaque;
+s->used_bmap_size = 0;
+g_free(s->used_bmap);
+}
+
 static int64_t coroutine_fn GRAPH_RDLOCK
 allocate_clusters(BlockDriverState *bs, int64_t sector_num,
   int nb_sectors, int *pnum)
@@ -530,8 +582,17 @@ parallels_check_data_off(BlockDriverState *bs, 
BdrvCheckResult *res,
 
 res->corruptions++;
 if (fix & BDRV_FIX_ERRORS) {
+int err;
 s->header->data_off = cpu_to_le32(data_off);
 s->data_start = data_off;
+
+parallels_free_used_bitmap(bs);
+err = parallels_fill_used_bitmap(bs);
+if (err == -ENOMEM) {
+res->check_errors++;
+return err;
+}
+
 res->corruptions_fixed++;
 }
 
@@ -1214,6 +1275,14 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 need_check = need_check || s->data_end > file_nb_sectors;
 
+if (!need_check) {
+ret = parallels_fill_used_bitmap(bs);
+if (ret == -ENOMEM) {
+goto fail;
+}
+need_check = need_check || ret < 0; /* These are correctable errors */
+}
+
 /*
  * We don't repair the image here if it's opened for checks. Also we don't
  * want to change inactive images and can't change readonly images.
@@ -1243,6 +1312,8 @@ fail:
  * "s" object was allocated by g_malloc0 so we can safely
  * try to free its fields even they were not allocated.
  */
+parallels_free_used_bitmap(bs);
+
 error_free(s->migration_blocker);
 g_free(s->bat_dirty_bmap);
 qemu_vfree(s->header);
@@ -1263,6 +1334,8 @@ static void parallels_close(BlockDriverState *bs)
   PREALLOC_MODE_OFF, 0, NULL);
 }
 
+parallels_free_used_bitmap(bs);
+
 g_free(s->bat_dirty_bmap);
 qemu_vfree(s->header);
 
diff --git a/block/parallels.h b/block/parallels.h
index 4e53e9572d..6b199443cf 100644
--- a/block/parallels.h
+++ b/block/parallels.h
@@ -72,6 +72,9 @@ typedef struct BDRVParallelsState {
 unsigned long *bat_dirty_bmap;
 unsigned int  bat_dirty_block;
 
+unsigned long *used_bmap;
+unsigned long used_bmap_size;
+
 uint32_t *bat_bitmap;
 unsigned int bat_size;
 
-- 
2.34.1




[PATCH 01/21] parallels: fix formatting in bdrv_parallels initialization

2023-09-15 Thread Denis V. Lunev
The old code is poorly formatted and contains tabulations. There are no
functional changes in this patch.

Signed-off-by: Denis V. Lunev 
---
 block/parallels.c | 36 +++-
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 48c32d6821..2ebd8e1301 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -1249,23 +1249,25 @@ static void parallels_close(BlockDriverState *bs)
 }
 
 static BlockDriver bdrv_parallels = {
-.format_name   = "parallels",
-.instance_size = sizeof(BDRVParallelsState),
-.bdrv_probe= parallels_probe,
-.bdrv_open = parallels_open,
-.bdrv_close= parallels_close,
-.bdrv_child_perm  = bdrv_default_perms,
-.bdrv_co_block_status = parallels_co_block_status,
-.bdrv_has_zero_init   = bdrv_has_zero_init_1,
-.bdrv_co_flush_to_os  = parallels_co_flush_to_os,
-.bdrv_co_readv  = parallels_co_readv,
-.bdrv_co_writev = parallels_co_writev,
-.is_format  = true,
-.supports_backing = true,
-.bdrv_co_create  = parallels_co_create,
-.bdrv_co_create_opts = parallels_co_create_opts,
-.bdrv_co_check  = parallels_co_check,
-.create_opts= &parallels_create_opts,
+.format_name= "parallels",
+.instance_size  = sizeof(BDRVParallelsState),
+.create_opts= &parallels_create_opts,
+.is_format  = true,
+.supports_backing   = true,
+
+.bdrv_has_zero_init = bdrv_has_zero_init_1,
+
+.bdrv_probe = parallels_probe,
+.bdrv_open  = parallels_open,
+.bdrv_close = parallels_close,
+.bdrv_child_perm= bdrv_default_perms,
+.bdrv_co_block_status   = parallels_co_block_status,
+.bdrv_co_flush_to_os= parallels_co_flush_to_os,
+.bdrv_co_readv  = parallels_co_readv,
+.bdrv_co_writev = parallels_co_writev,
+.bdrv_co_create = parallels_co_create,
+.bdrv_co_create_opts= parallels_co_create_opts,
+.bdrv_co_check  = parallels_co_check,
 };
 
 static void bdrv_parallels_init(void)
-- 
2.34.1




Re: [PATCH v2 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment

2023-09-15 Thread Andrey Drobyshev
On 9/15/23 21:39, Eric Blake wrote:
> On Fri, Sep 15, 2023 at 07:20:11PM +0300, Andrey Drobyshev wrote:
>> Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for
>> the data read from the old and new backing files are aligned using
>> BlockDriverState (or BlockBackend later on) referring to the target image.
>> However, this isn't quite right, because buf_new is only being used for
>> reading from the new backing, while buf_old is being used for both reading
>> from the old backing and writing to the target.  Let's take that into account
>> and use more appropriate values as alignments.
>>
>> Signed-off-by: Andrey Drobyshev 
>> ---
>>  qemu-img.c | 9 +++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/qemu-img.c b/qemu-img.c
>> index 50660ba920..d12e4a4753 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv)
>>  int64_t n;
>>  float local_progress = 0;
>>  
>> -buf_old = blk_blockalign(blk, IO_BUF_SIZE);
>> -buf_new = blk_blockalign(blk, IO_BUF_SIZE);
>> +if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) >
>> +bdrv_opt_mem_align(blk_bs(blk_old_backing))) {
>> +buf_old = blk_blockalign(blk, IO_BUF_SIZE);
>> +} else {
>> +buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
>> +}
> 
> Since bdrv_opt_mem_align(NULL) is safe, could we just simplify this to:
> 
> buf_old = qemu_memalign(MAX(bdrv_opt_mem_align(blk_old_backing),
> bdrv_opt_mem_align(blk)), IO_BUF_SIZE);
> 
> instead of going through an if statement?  Or is the problem that
> bdrv_opt_mem_align(NULL) can return the host page size (perhaps 64k),
> which may be larger than technically needed in some scenarios?
>

Although bdrv_opt_mem_align(NULL) is safe, blk_bs(NULL) is not.  And
bdrv_opt_mem_align() takes BlockDriverState* not BlockBackend*, so we
would have to perform the same check and there would be no simplification.
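
For illustration only, a sketch of the MAX-based variant being discussed,
assuming QEMU's bdrv_opt_mem_align(), blk_bs() and qemu_memalign() with their
usual signatures; the explicit NULL guard is exactly the extra check pointed
out above, which is why the patch keeps the if-based form:

    /* Sketch only (not the patch): MAX-based alignment with the NULL guard. */
    static void *blockalign_for_rebase(BlockBackend *target,
                                       BlockBackend *old_backing, size_t size)
    {
        size_t align = bdrv_opt_mem_align(blk_bs(target));

        if (old_backing) {                  /* blk_bs(NULL) is not safe */
            align = MAX(align, bdrv_opt_mem_align(blk_bs(old_backing)));
        }
        return qemu_memalign(align, size);
    }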

>> +buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE);
>>  
>>  size = blk_getlength(blk);
>>  if (size < 0) {
>> -- 
>> 2.39.3
> 
> At any rate, aligning the buffers by how they will be used makes sense
> (if the destination blk has looser requirements than the source
> blk_old_backing, then accesses into blk_old are suspect).
> 
> Reviewed-by: Eric Blake  




Re: [PATCH v2 4/8] qemu-img: add chunk size parameter to compare_buffers()

2023-09-15 Thread Eric Blake
On Fri, Sep 15, 2023 at 07:20:12PM +0300, Andrey Drobyshev wrote:
> Add @chsize param to the function which, if non-zero, would represent
> the chunk size to be used for comparison.  If it's zero, then
> BDRV_SECTOR_SIZE is used as default chunk size, which is the previous
> behaviour.
> 
> In particular, we're going to use this param in img_rebase() to make the
> write requests aligned to a predefined alignment value.
> 
> Signed-off-by: Andrey Drobyshev 
> ---
>  qemu-img.c | 22 ++
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index d12e4a4753..fcd31d7b5b 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1274,23 +1274,29 @@ static int is_allocated_sectors_min(const uint8_t 
> *buf, int n, int *pnum,
>  }
>  
>  /*
> - * Compares two buffers sector by sector. Returns 0 if the first
> - * sector of each buffer matches, non-zero otherwise.
> + * Compares two buffers chunk by chunk, where @chsize is the chunk size.
> + * If @chsize is 0, default chunk size of BDRV_SECTOR_SIZE is used.
> + * Returns 0 if the first chunk of each buffer matches, non-zero otherwise.
>   *
> - * pnum is set to the sector-aligned size of the buffer prefix that
> + * @pnum is set to the size of the buffer prefix aligned to @chsize that
>   * has the same matching status as the first sector.

s/sector/chunk/

With that,
Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org
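
For reference, a standalone sketch of the chunked comparison described above,
assuming the documented semantics (the return value reflects the first chunk,
and *pnum covers the longest prefix whose chunks share that matching status);
names and the fallback chunk size are illustrative:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define SECTOR_SIZE_SKETCH 512
    #define MIN_SKETCH(a, b) ((a) < (b) ? (a) : (b))

    /* Returns non-zero if the first chunk differs; *pnum is the prefix length. */
    static int compare_buffers_sketch(const uint8_t *buf1, const uint8_t *buf2,
                                      int64_t bytes, int64_t chsize,
                                      int64_t *pnum)
    {
        bool first_differs;
        int64_t i;

        if (chsize <= 0) {
            chsize = SECTOR_SIZE_SKETCH;   /* default chunk size */
        }
        first_differs = memcmp(buf1, buf2, MIN_SKETCH(chsize, bytes)) != 0;

        for (i = chsize; i < bytes; i += chsize) {
            int64_t len = MIN_SKETCH(chsize, bytes - i);
            if ((memcmp(buf1 + i, buf2 + i, len) != 0) != first_differs) {
                break;                     /* matching status changes here */
            }
        }
        *pnum = MIN_SKETCH(i, bytes);
        return first_differs;
    }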




Re: [PATCH v2 5/8] qemu-img: rebase: avoid unnecessary COW operations

2023-09-15 Thread Eric Blake
On Fri, Sep 15, 2023 at 07:20:13PM +0300, Andrey Drobyshev wrote:
> When rebasing an image from one backing file to another, we need to
> compare data from old and new backings.  If the diff between that data
> happens to be unaligned to the target cluster size, we might end up
> doing partial writes, which would lead to copy-on-write and additional IO.
> 
> Consider the following simple case (virtual_size == cluster_size == 64K):
> 
> base <-- inc1 <-- inc2
> 
> qemu-io -c "write -P 0xaa 0 32K" base.qcow2
> qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
> qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
> qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
> qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2
> 
> While doing rebase, we'll write a half of the cluster to inc2, and block
> layer will have to read the 2nd half of the same cluster from the base image
> inc1 while doing this write operation, although the whole cluster is already
> read earlier to perform data comparison.
> 
> In order to avoid these unnecessary IO cycles, let's make sure every
> write request is aligned to the overlay subcluster boundaries.  Using
> subcluster size is universal as for the images which don't have them
> this size equals to the cluster size, so in any case we end up aligning
> to the smallest unit of allocation.
> 
> Signed-off-by: Andrey Drobyshev 
> ---
>  qemu-img.c | 76 --
>  1 file changed, 56 insertions(+), 20 deletions(-)
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index fcd31d7b5b..83950af42b 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -3523,6 +3523,7 @@ static int img_rebase(int argc, char **argv)
>  uint8_t *buf_new = NULL;
>  BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
>  BlockDriverState *unfiltered_bs;
> +BlockDriverInfo bdi = {0};
>  char *filename;
>  const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
>  int c, flags, src_flags, ret;
> @@ -3533,6 +3534,7 @@ static int img_rebase(int argc, char **argv)
>  bool quiet = false;
>  Error *local_err = NULL;
>  bool image_opts = false;
> +int64_t write_align;
>  
>  /* Parse commandline parameters */
>  fmt = NULL;
> @@ -3656,6 +3658,20 @@ static int img_rebase(int argc, char **argv)
>  }
>  }
>  
> +/*
> + * We need overlay subcluster size to make sure write requests are
> + * aligned.
> + */
> +ret = bdrv_get_info(unfiltered_bs, &bdi);
> +if (ret < 0) {
> +error_report("could not get block driver info");
> +goto out;
> +} else if (bdi.subcluster_size == 0) {
> +bdi.subcluster_size = 1;
> +}
> +
> +write_align = bdi.subcluster_size;
> +
>  /* For safe rebasing we need to compare old and new backing file */
>  if (!unsafe) {
>  QDict *options = NULL;
> @@ -3753,7 +3769,7 @@ static int img_rebase(int argc, char **argv)
>  int64_t old_backing_size = 0;
>  int64_t new_backing_size = 0;
>  uint64_t offset;
> -int64_t n;
> +int64_t n, n_old = 0, n_new = 0;
>  float local_progress = 0;
>  
>  if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) >
> @@ -3799,7 +3815,8 @@ static int img_rebase(int argc, char **argv)
>  }
>  
>  for (offset = 0; offset < size; offset += n) {
> -bool buf_old_is_zero = false;
> +bool old_backing_eof = false;
> +int64_t n_alloc;
>  
>  /* How many bytes can we handle with the next read? */
>  n = MIN(IO_BUF_SIZE, size - offset);
> @@ -3844,33 +3861,48 @@ static int img_rebase(int argc, char **argv)
>  }
>  }
>  
> +/*
> + * At this point we know that the region [offset; offset + n)
> + * is unallocated within the target image.  This region might be
> + * unaligned to the target image's (sub)cluster boundaries, as
> + * old backing may have smaller clusters (or have subclusters).
> + * We extend it to the aligned boundaries to avoid CoW on
> + * partial writes in blk_pwrite(),
> + */
> +n += offset - QEMU_ALIGN_DOWN(offset, write_align);
> +offset = QEMU_ALIGN_DOWN(offset, write_align);

If we are always aligning to write_align on each iteration of this
loop, won't this round down always be a no-op?

> +n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n);
> +n = MIN(n, size - offset);

However, I can see how this round up can matter.
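
To make the arithmetic concrete, a worked example with made-up numbers (this
says nothing about when the round-down actually triggers in the loop, which is
the open question above); the macros are redefined locally to mirror osdep.h:

    #include <stdint.h>

    #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
    #define QEMU_ALIGN_UP(n, m)   QEMU_ALIGN_DOWN((n) + (m) - 1, (m))

    static void alignment_example(void)
    {
        int64_t write_align = 64 * 1024;            /* subcluster size */
        int64_t offset = 96 * 1024;                 /* unallocated range */
        int64_t n = 64 * 1024;                      /* [96K, 160K)       */

        n += offset - QEMU_ALIGN_DOWN(offset, write_align);        /* n = 96K   */
        offset = QEMU_ALIGN_DOWN(offset, write_align);             /* offset = 64K */
        n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n); /* n = 128K  */
        /* resulting write range: [64K, 192K), aligned on both ends */
    }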

> +assert(!bdrv_is_allocated(unfiltered_bs, offset, n, &n_alloc) &&
> +   n_alloc == n);

This assertion feels a bit heavyweight.  I see what you're trying to
say: if we found a (partial) unallocated region in the destination,
then since write_align is the minimum alignment of such allocation,
our rounding up to alignment boundaries should not change the f