Re: [PATCH 3/4] qemu-img bitmap: Report errors while closing the image

2023-01-12 Thread Philippe Mathieu-Daudé

On 12/1/23 20:14, Kevin Wolf wrote:

blk_unref() can't report any errors that happen while closing the image.
For example, if qcow2 hits an -ENOSPC error while writing out dirty
bitmaps when it's closed, it prints error messages to stderr, but
'qemu-img bitmap' won't see any error return value and will therefore
look successful with exit code 0.

In order to fix this, manually inactivate the image first before calling
blk_unref(). This already performs the operations that would be most
likely to fail while closing the image, but it can still return errors.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1330
Signed-off-by: Kevin Wolf 
---
  qemu-img.c | 11 +++
  1 file changed, 11 insertions(+)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH 0/4] qemu-img: Fix exit code for errors closing the image

2023-01-12 Thread Markus Armbruster
Drive-by comment...

Kevin Wolf  writes:

> This series addresses the problem described in these bug reports:
> https://gitlab.com/qemu-project/qemu/-/issues/1330
> https://bugzilla.redhat.com/show_bug.cgi?id=2147617
>
> qcow2 can fail when writing back dirty bitmaps in qcow2_inactivate().
> However, when the function is called through blk_unref(), in the case of
> such errors, while an error message is written to stderr, the callers
> never see an error return. Specifically, 'qemu-img bitmap/commit' are
> reported to exit with an exit code 0 despite the errors.

After having tead the "potential alternative" below, I figure this
failure happens within blk_unref().  But I can't see a call chain.  Am I
confused?

> The solution taken here is inactivating the images first, which can
> still return errors, but already performs all of the write operations.
> Only then the images are actually blk_unref()-ed.
>
> If we agree that this is the way to go (as a potential alternative,
> allowing blk_unref() to fail would require changes in all kinds of
> places, many of which probably wouldn't even know what to do with the
> error),

blk_unref() could fail only when it destroys @blk (refcnt goes to zero).
Correct?

We have a bunch of "unref" functions in the tree, and, as far as I can
tell from a quick grep, none of them can fail.  Supports your apparent
preference for not changing blk_unref().

> then I suppose doing the same for other qemu-img subcommands
> would make sense, too.

I was about to ask whether there could be more silent failures like the
ones in commit and bitmap.  This suggests there are.

Say we do the same for all known such failures.  Would any remaining (or
new) such failures be programming errors?

[...]




Re: [PATCH 1/4] qcow2: Fix theoretical corruption in store_bitmap() error path

2023-01-12 Thread Philippe Mathieu-Daudé

On 12/1/23 20:14, Kevin Wolf wrote:

In order to write the bitmap table to the image file, it is converted to
big endian. If the write fails, it is passed to clear_bitmap_table() to
free all of the clusters it had allocated before. However, if we don't
convert it back to native endianness first, we'll free things at a wrong
offset.

In practical terms, the offsets will be so high that we won't actually
free any allocated clusters, but just run into an error, but in theory
this can cause image corruption.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
---
  block/qcow2-bitmap.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index bcad567c0c..3dff99ba06 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -115,7 +115,7 @@ static int update_header_sync(BlockDriverState *bs)
  return bdrv_flush(bs->file->bs);
  }
  


Maybe add a comment here remembering to bswap back to native endianness?


-static inline void bitmap_table_to_be(uint64_t *bitmap_table, size_t size)
+static inline void bitmap_table_bswap_be(uint64_t *bitmap_table, size_t size)
  {


This function uses cpu_to_be64(), semantically we convert back calling
be64_to_cpu(), but technically both functions end up being the same.

Alternatively:

 for (i = 0; i < size; ++i) {
-bitmap_table[i] = cpu_to_be64(bitmap_table[i]);
+bswap64s(&bitmap_table[i]);
 }


@@ -1401,9 +1401,10 @@ static int store_bitmap(BlockDriverState *bs, 
Qcow2Bitmap *bm, Error **errp)
  goto fail;
  }
  
-bitmap_table_to_be(tb, tb_size);

+bitmap_table_bswap_be(tb, tb_size);
  ret = bdrv_pwrite(bs->file, tb_offset, tb_size * sizeof(tb[0]), tb, 0);
  if (ret < 0) {
+bitmap_table_bswap_be(tb, tb_size);
  error_setg_errno(errp, -ret, "Failed to write bitmap '%s' to file",
   bm_name);
  goto fail;


Pre-existing, but consider using g_autofree for 'tb'.

Reviewed-by: Philippe Mathieu-Daudé 




Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Guenter Roeck

On 1/12/23 11:27, Keith Busch wrote:

On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:

On Jan 12 09:34, Keith Busch wrote:

On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:


The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts has also been a source of several issues in the
past).


Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.



A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,

diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
index 03760ddeae8c..0fc46dcb9ec4 100644
--- i/hw/nvme/ctrl.c
+++ w/hw/nvme/ctrl.c
@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
 return;
 }
 if (~intms & n->irq_status) {
+pci_irq_deassert(&n->parent_obj);
 pci_irq_assert(&n->parent_obj);
 } else {
 pci_irq_deassert(&n->parent_obj);


seems to do the trick (pulse is the other way around, assert, then
deassert).

Probably not the "correct" thing to do, but I'll take it since it seems
to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
on ~20 runs now and have not encountered it.

I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
machine/board(s) are you testing?


Could you try the below?



This works as well: 50 iterations with no failures after applying the patch
below on top of qemu v7.2.0. Tested with qemu-system-mipsel.

Guenter


---
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 2c85de4700..521c3c80c1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -558,6 +558,20 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
  }
  }
  
+static void nvme_irq_pulse(NvmeCtrl *n, NvmeCQueue *cq)

+{
+if (!cq->irq_enabled) {
+return;
+}
+
+if (msix_enabled(&(n->parent_obj))) {
+msix_notify(&(n->parent_obj), cq->vector);
+return;
+}
+
+pci_irq_pulse(&n->parent_obj);
+}
+
  static void nvme_req_clear(NvmeRequest *req)
  {
  req->ns = NULL;
@@ -6917,6 +6931,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, 
int val)
  }
  
  nvme_irq_deassert(n, cq);

+} else {
+/*
+ * Retrigger the irq just to make sure the host has no excuse for
+ * not knowing there's more work to complete on this CQ.
+ */
+nvme_irq_pulse(n, cq);
  }
  } else {
  /* Submission queue doorbell write */
--





Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Guenter Roeck

On 1/12/23 09:45, Klaus Jensen wrote:

On Jan 12 09:34, Keith Busch wrote:

On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:


The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts has also been a source of several issues in the
past).


Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.



A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,

diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
index 03760ddeae8c..0fc46dcb9ec4 100644
--- i/hw/nvme/ctrl.c
+++ w/hw/nvme/ctrl.c
@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
 return;
 }
 if (~intms & n->irq_status) {
+pci_irq_deassert(&n->parent_obj);
 pci_irq_assert(&n->parent_obj);
 } else {
 pci_irq_deassert(&n->parent_obj);


seems to do the trick (pulse is the other way around, assert, then
deassert).

Probably not the "correct" thing to do, but I'll take it since it seems
to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
on ~20 runs now and have not encountered it.

I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
machine/board(s) are you testing?


So, for mipsel, two sets of results for the above:

First, qemu v7.2 is already much better than qemu v7.1. With qemu v7.1,
the boot test fails roughly every other test. Failure rate with qemu v7.2
is much less.

Second, my nvme boot test with qemu 7.2 fails after ~5-10 iterations.
After the above change, I did not see a single failure in 50 boot tests.

I'll test the other suggested change next.

Guenter




Re: [PATCH v1 1/1] virtio-block: switch to blk_get_max_hw_transfer

2023-01-12 Thread Ilya Dryomov
On Thu, Dec 9, 2021 at 10:34 AM Or Ozeri  wrote:
>
> The blk_get_max_hw_transfer API was recently added in 6.1.0.
> It allows querying an underlying block device its max transfer capability.
> This commit changes virtio-blk to use this.
>
> Signed-off-by: Or Ozeri 
> ---
>  hw/block/virtio-blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index f139cd7cc9..1ba9a06888 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -458,7 +458,7 @@ static void virtio_blk_submit_multireq(BlockBackend *blk, 
> MultiReqBuffer *mrb)
>  return;
>  }
>
> -max_transfer = blk_get_max_transfer(mrb->reqs[0]->dev->blk);
> +max_transfer = blk_get_max_hw_transfer(mrb->reqs[0]->dev->blk);
>
>  qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs),
>&multireq_compare);
> --
> 2.25.1
>
>

Hi Or,

Superficially, this makes sense to me.  A couple of things to consider:

- Move the explanation from the cover letter into the patch
  description.  The current patch description is pretty much
  meaningless.
- Should the actual limit be consulted for the number of segments
  as well?  IOW should blk_get_max_hw_iov() be called instead of
  blk_get_max_iov() a dozen lines below?

I'm adding Stefan and Kevin to CC to get more eyes on this patch as it
fixes a regression.  I believe this was encountered with the following
NBD device, Or please correct me if I'm wrong:

$ cat /sys/block/nbd0/queue/max_sectors_kb
128
$ cat /sys/block/nbd0/queue/max_segments
65535

Thanks,

Ilya



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Keith Busch
On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
> > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > > 
> > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > > am wondering if there is something going on with the kernel driver (but
> > > I certainly do not rule out that hw/nvme is at fault here, since
> > > pin-based interrupts has also been a source of several issues in the
> > > past).
> > 
> > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> > While probably not the "correct" thing to do, it has better results in
> > my testing.
> > 
> 
> A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,
> 
>   diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
>   index 03760ddeae8c..0fc46dcb9ec4 100644
>   --- i/hw/nvme/ctrl.c
>   +++ w/hw/nvme/ctrl.c
>   @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
>return;
>}
>if (~intms & n->irq_status) {
>   +pci_irq_deassert(&n->parent_obj);
>pci_irq_assert(&n->parent_obj);
>} else {
>pci_irq_deassert(&n->parent_obj);
> 
> 
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
> 
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
> on ~20 runs now and have not encountered it.
> 
> I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
> machine/board(s) are you testing?

Could you try the below?

---
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 2c85de4700..521c3c80c1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -558,6 +558,20 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
 }
 }
 
+static void nvme_irq_pulse(NvmeCtrl *n, NvmeCQueue *cq)
+{
+if (!cq->irq_enabled) {
+return;
+}
+
+if (msix_enabled(&(n->parent_obj))) {
+msix_notify(&(n->parent_obj), cq->vector);
+return;
+}
+
+pci_irq_pulse(&n->parent_obj);
+}
+
 static void nvme_req_clear(NvmeRequest *req)
 {
 req->ns = NULL;
@@ -6917,6 +6931,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, 
int val)
 }
 
 nvme_irq_deassert(n, cq);
+} else {
+/*
+ * Retrigger the irq just to make sure the host has no excuse for
+ * not knowing there's more work to complete on this CQ.
+ */
+nvme_irq_pulse(n, cq);
 }
 } else {
 /* Submission queue doorbell write */
--



[PATCH 1/4] qcow2: Fix theoretical corruption in store_bitmap() error path

2023-01-12 Thread Kevin Wolf
In order to write the bitmap table to the image file, it is converted to
big endian. If the write fails, it is passed to clear_bitmap_table() to
free all of the clusters it had allocated before. However, if we don't
convert it back to native endianness first, we'll free things at a wrong
offset.

In practical terms, the offsets will be so high that we won't actually
free any allocated clusters, but just run into an error, but in theory
this can cause image corruption.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
---
 block/qcow2-bitmap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index bcad567c0c..3dff99ba06 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -115,7 +115,7 @@ static int update_header_sync(BlockDriverState *bs)
 return bdrv_flush(bs->file->bs);
 }
 
-static inline void bitmap_table_to_be(uint64_t *bitmap_table, size_t size)
+static inline void bitmap_table_bswap_be(uint64_t *bitmap_table, size_t size)
 {
 size_t i;
 
@@ -1401,9 +1401,10 @@ static int store_bitmap(BlockDriverState *bs, 
Qcow2Bitmap *bm, Error **errp)
 goto fail;
 }
 
-bitmap_table_to_be(tb, tb_size);
+bitmap_table_bswap_be(tb, tb_size);
 ret = bdrv_pwrite(bs->file, tb_offset, tb_size * sizeof(tb[0]), tb, 0);
 if (ret < 0) {
+bitmap_table_bswap_be(tb, tb_size);
 error_setg_errno(errp, -ret, "Failed to write bitmap '%s' to file",
  bm_name);
 goto fail;
-- 
2.38.1




[PATCH 3/4] qemu-img bitmap: Report errors while closing the image

2023-01-12 Thread Kevin Wolf
blk_unref() can't report any errors that happen while closing the image.
For example, if qcow2 hits an -ENOSPC error while writing out dirty
bitmaps when it's closed, it prints error messages to stderr, but
'qemu-img bitmap' won't see any error return value and will therefore
look successful with exit code 0.

In order to fix this, manually inactivate the image first before calling
blk_unref(). This already performs the operations that would be most
likely to fail while closing the image, but it can still return errors.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1330
Signed-off-by: Kevin Wolf 
---
 qemu-img.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index 22d6ecefd5..c3671d4890 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -4645,6 +4645,7 @@ static int img_bitmap(int argc, char **argv)
 QSIMPLEQ_HEAD(, ImgBitmapAction) actions;
 ImgBitmapAction *act, *act_next;
 const char *op;
+int inactivate_ret;
 
 QSIMPLEQ_INIT(&actions);
 
@@ -4829,6 +4830,16 @@ static int img_bitmap(int argc, char **argv)
 ret = 0;
 
  out:
+/*
+ * Manually inactivate the images first because this way we can know 
whether
+ * an error occurred. blk_unref() doesn't tell us about failures.
+ */
+inactivate_ret = bdrv_inactivate_all();
+if (inactivate_ret < 0) {
+error_report("Error while closing the image: %s", 
strerror(-inactivate_ret));
+ret = 1;
+}
+
 blk_unref(src);
 blk_unref(blk);
 qemu_opts_del(opts);
-- 
2.38.1




[PATCH 2/4] qemu-img commit: Report errors while closing the image

2023-01-12 Thread Kevin Wolf
blk_unref() can't report any errors that happen while closing the image.
For example, if qcow2 hits an -ENOSPC error while writing out dirty
bitmaps when it's closed, it prints error messages to stderr, but
'qemu-img commit' won't see any error return value and will therefore
look successful with exit code 0.

In order to fix this, manually inactivate the image first before calling
blk_unref(). This already performs the operations that would be most
likely to fail while closing the image, but it can still return errors.

Signed-off-by: Kevin Wolf 
---
 qemu-img.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index 439d8de1e3..22d6ecefd5 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -449,6 +449,11 @@ static BlockBackend *img_open(bool image_opts,
 blk = img_open_file(filename, NULL, fmt, flags, writethrough, quiet,
 force_share);
 }
+
+if (blk) {
+blk_set_force_allow_inactivate(blk);
+}
+
 return blk;
 }
 
@@ -1119,6 +1124,14 @@ unref_backing:
 done:
 qemu_progress_end();
 
+/*
+ * Manually inactivate the image first because this way we can know whether
+ * an error occurred. blk_unref() doesn't tell us about failures.
+ */
+ret = bdrv_inactivate_all();
+if (ret < 0 && !local_err) {
+error_setg_errno(&local_err, -ret, "Error while closing the image");
+}
 blk_unref(blk);
 
 if (local_err) {
-- 
2.38.1




[PATCH 4/4] qemu-iotests: Test qemu-img bitmap/commit exit code on error

2023-01-12 Thread Kevin Wolf
This tests that when an error happens while writing back bitmaps to the
image file in qcow2_inactivate(), 'qemu-img bitmap/commit' actually
return an error value in their exit code instead of making the operation
look successful to scripts.

Signed-off-by: Kevin Wolf 
---
 .../qemu-iotests/tests/qemu-img-close-errors  | 95 +++
 .../tests/qemu-img-close-errors.out   | 23 +
 2 files changed, 118 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/qemu-img-close-errors
 create mode 100644 tests/qemu-iotests/tests/qemu-img-close-errors.out

diff --git a/tests/qemu-iotests/tests/qemu-img-close-errors 
b/tests/qemu-iotests/tests/qemu-img-close-errors
new file mode 100755
index 00..662f726e5d
--- /dev/null
+++ b/tests/qemu-iotests/tests/qemu-img-close-errors
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+# group: rw auto quick
+#
+# Check that errors while closing the image, in particular writing back dirty
+# bitmaps, is correctly reported with a failing qemu-img exit code.
+#
+# Copyright (C) 2023 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+# creator
+owner=kw...@redhat.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+_cleanup()
+{
+_cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+cd ..
+. ./common.rc
+. ./common.filter
+
+_supported_fmt qcow2
+_supported_proto generic
+
+size=1G
+
+# The error we are going to use is ENOSPC. Depending on how many bitmaps we
+# create in the backing file (and therefore increase the used up space), we get
+# failures in different places. With a low number, only merging the bitmap
+# fails, whereas with a higher number, already 'qemu-img commit' fails.
+for max_bitmap in 6 7; do
+echo
+echo "=== Test with $max_bitmap bitmaps ==="
+
+TEST_IMG="$TEST_IMG.base" _make_test_img -q $size
+for i in $(seq 1 $max_bitmap); do
+$QEMU_IMG bitmap --add "$TEST_IMG.base" "stale-bitmap-$i"
+done
+
+# Simulate a block device of 128 MB by resizing the image file accordingly
+# and then enforcing the size with the raw driver
+truncate "$TEST_IMG.base" --size 128M
+BASE_JSON='json:{
+"driver": "qcow2",
+"file": {
+"driver": "raw",
+"size": 134217728,
+"file": {
+"driver": "file",
+"filename":"'"$TEST_IMG.base"'"
+}
+}
+}'
+
+_make_test_img -q -b "$BASE_JSON" -F $IMGFMT
+$QEMU_IMG bitmap --add "$TEST_IMG" "good-bitmap"
+
+$QEMU_IO -c 'write 0 126m' "$TEST_IMG" | _filter_qemu_io
+
+$QEMU_IMG commit -d "$TEST_IMG" 2>&1 | _filter_generated_node_ids
+echo "qemu-img commit exit code: ${PIPESTATUS[0]}"
+
+$QEMU_IMG bitmap --add "$BASE_JSON" "good-bitmap"
+echo "qemu-img bitmap --add exit code: $?"
+
+$QEMU_IMG bitmap --merge "good-bitmap" -b "$TEST_IMG" "$BASE_JSON" \
+"good-bitmap" 2>&1 | _filter_generated_node_ids
+echo "qemu-img bitmap --merge exit code:  ${PIPESTATUS[0]}"
+done
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
+
diff --git a/tests/qemu-iotests/tests/qemu-img-close-errors.out 
b/tests/qemu-iotests/tests/qemu-img-close-errors.out
new file mode 100644
index 00..1bfe88f176
--- /dev/null
+++ b/tests/qemu-iotests/tests/qemu-img-close-errors.out
@@ -0,0 +1,23 @@
+QA output created by qemu-img-close-errors
+
+=== Test with 6 bitmaps ===
+wrote 132120576/132120576 bytes at offset 0
+126 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image committed.
+qemu-img commit exit code: 0
+qemu-img bitmap --add exit code: 0
+qemu-img: Lost persistent bitmaps during inactivation of node 'NODE_NAME': 
Failed to write bitmap 'good-bitmap' to file: No space left on device
+qemu-img: Error while closing the image: Invalid argument
+qemu-img: Lost persistent bitmaps during inactivation of node 'NODE_NAME': 
Failed to write bitmap 'good-bitmap' to file: No space left on device
+qemu-img bitmap --merge exit code:  1
+
+=== Test with 7 bitmaps ===
+wrote 132120576/132120576 bytes at offset 0
+126 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+qemu-img: Lost persistent bitmaps during inactivation of node 'NODE_NAME': 
Failed to write bitmap 'stale-bitmap-7' to file: No space left on device
+qemu-img: Lost pers

[PATCH 0/4] qemu-img: Fix exit code for errors closing the image

2023-01-12 Thread Kevin Wolf
This series addresses the problem described in these bug reports:
https://gitlab.com/qemu-project/qemu/-/issues/1330
https://bugzilla.redhat.com/show_bug.cgi?id=2147617

qcow2 can fail when writing back dirty bitmaps in qcow2_inactivate().
However, when the function is called through blk_unref(), in the case of
such errors, while an error message is written to stderr, the callers
never see an error return. Specifically, 'qemu-img bitmap/commit' are
reported to exit with an exit code 0 despite the errors.

The solution taken here is inactivating the images first, which can
still return errors, but already performs all of the write operations.
Only then the images are actually blk_unref()-ed.

If we agree that this is the way to go (as a potential alternative,
allowing blk_unref() to fail would require changes in all kinds of
places, many of which probably wouldn't even know what to do with the
error), then I suppose doing the same for other qemu-img subcommands
would make sense, too.

As a bonus fix, I found an endianness confusion in an error path of
store_bitmap(). The reported error message "qcow2_free_clusters failed:
No space left on device" looked too suspicious to ignore this. Freeing
an actually existing cluster should never run into ENOSPC.

Kevin Wolf (4):
  qcow2: Fix theoretical corruption in store_bitmap() error path
  qemu-img commit: Report errors while closing the image
  qemu-img bitmap: Report errors while closing the image
  qemu-iotests: Test qemu-img bitmap/commit exit code on error

 block/qcow2-bitmap.c  |  5 +-
 qemu-img.c| 24 +
 .../qemu-iotests/tests/qemu-img-close-errors  | 95 +++
 .../tests/qemu-img-close-errors.out   | 23 +
 4 files changed, 145 insertions(+), 2 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/qemu-img-close-errors
 create mode 100644 tests/qemu-iotests/tests/qemu-img-close-errors.out

-- 
2.38.1




Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Guenter Roeck

On 1/12/23 09:45, Klaus Jensen wrote:

On Jan 12 09:34, Keith Busch wrote:

On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:


The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts has also been a source of several issues in the
past).


Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.



A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,

diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
index 03760ddeae8c..0fc46dcb9ec4 100644
--- i/hw/nvme/ctrl.c
+++ w/hw/nvme/ctrl.c
@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
 return;
 }
 if (~intms & n->irq_status) {
+pci_irq_deassert(&n->parent_obj);
 pci_irq_assert(&n->parent_obj);
 } else {
 pci_irq_deassert(&n->parent_obj);


seems to do the trick (pulse is the other way around, assert, then
deassert).

Probably not the "correct" thing to do, but I'll take it since it seems
to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
on ~20 runs now and have not encountered it.

I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
machine/board(s) are you testing?


See
https://github.com/groeck/linux-build-test/blob/master/rootfs/mips/run-qemu-mips.sh
and
https://github.com/groeck/linux-build-test/blob/master/rootfs/mipsel/run-qemu-mipsel.sh

I'll apply the change suggested above to my version of qemu (7.2.0)
and give it a try.

Guenter




Re: [PATCH v6 28/33] hw/isa/piix3: Merge hw/isa/piix4.c

2023-01-12 Thread Bernhard Beschow



Am 12. Januar 2023 13:32:02 UTC schrieb "Philippe Mathieu-Daudé" 
:
>On 12/1/23 13:50, Philippe Mathieu-Daudé wrote:
>> On 9/1/23 18:23, Bernhard Beschow wrote:
>>> Now that the PIIX3 and PIIX4 device models are sufficiently consolidated,
>>> their implementations can be merged into one file for further
>>> consolidation.
>>> 
>>> Signed-off-by: Bernhard Beschow 
>>> Reviewed-by: Michael S. Tsirkin 
>>> Message-Id: <20221022150508.26830-37-shen...@gmail.com>
>>> ---
>>>   hw/isa/{piix3.c => piix.c} | 158 
>>>   hw/isa/piix4.c | 285 -
>>>   MAINTAINERS    |   6 +-
>>>   hw/i386/Kconfig    |   2 +-
>>>   hw/isa/Kconfig |  12 +-
>>>   hw/isa/meson.build |   3 +-
>>>   hw/mips/Kconfig    |   2 +-
>>>   7 files changed, 165 insertions(+), 303 deletions(-)
>>>   rename hw/isa/{piix3.c => piix.c} (75%)
>>>   delete mode 100644 hw/isa/piix4.c
>> 
>> Reviewed-by: Philippe Mathieu-Daudé 
>
>BTW I wonder why PIIX4 isn't calling pci_bus_set_route_irq_fn()...
>Any clue?

Looks like it gets used for proxying: 
https://elixir.bootlin.com/qemu/v7.2.0/C/ident/pci_device_route_intx_to_irq



Re: [PATCH v6 28/33] hw/isa/piix3: Merge hw/isa/piix4.c

2023-01-12 Thread Bernhard Beschow



Am 12. Januar 2023 16:36:30 UTC schrieb "Philippe Mathieu-Daudé" 
:
>On 9/1/23 18:23, Bernhard Beschow wrote:
>> Now that the PIIX3 and PIIX4 device models are sufficiently consolidated,
>> their implementations can be merged into one file for further
>> consolidation.
>> 
>> Signed-off-by: Bernhard Beschow 
>> Reviewed-by: Michael S. Tsirkin 
>> Message-Id: <20221022150508.26830-37-shen...@gmail.com>
>> ---
>>   hw/isa/{piix3.c => piix.c} | 158 
>>   hw/isa/piix4.c | 285 -
>>   MAINTAINERS|   6 +-
>>   hw/i386/Kconfig|   2 +-
>>   hw/isa/Kconfig |  12 +-
>>   hw/isa/meson.build |   3 +-
>>   hw/mips/Kconfig|   2 +-
>>   7 files changed, 165 insertions(+), 303 deletions(-)
>>   rename hw/isa/{piix3.c => piix.c} (75%)
>>   delete mode 100644 hw/isa/piix4.c
>
>
>> @@ -489,11 +534,124 @@ static const TypeInfo piix3_xen_info = {
>>   .class_init= piix3_xen_class_init,
>>   };
>>   +static void piix4_realize(PCIDevice *dev, Error **errp)
>> +{
>
>> +/* initialize pit */
>> +i8254_pit_init(isa_bus, 0x40, 0, NULL);
>Pre-existing, why there is no equivalent PIT creation in the
>PIIX3 variant? Due to in-kernel PIT in KVM?

Correct, that's one reason. The other reason is for interrupt wiring in case an 
HPET is present (see our discussion about the intercept_irq for the rtc).

I would like to create and wire up the PIT in PIIX3 as well. Since we'd have to 
take KVM into account here we may need a similar solution as for I8259. This is 
another open question for PIIX4 to become a drop-in replacement for PIIX3. Any 
ideas?

Best regards,
Bernhard



Re: [PATCH v4 1/3] block/rbd: encryption nit fixes

2023-01-12 Thread Ilya Dryomov
On Thu, Jan 12, 2023 at 3:46 PM Daniel P. Berrangé  wrote:
>
> On Thu, Jan 12, 2023 at 03:26:56PM +0100, Ilya Dryomov wrote:
> > On Thu, Jan 12, 2023 at 1:35 PM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Sun, Nov 20, 2022 at 04:28:34AM -0600, Or Ozeri wrote:
> > > > Add const modifier to passphrases,
> > > > and remove redundant stack variable passphrase_len.
> > > >
> > > > Signed-off-by: Or Ozeri 
> > > > ---
> > > >  block/rbd.c | 24 ++--
> > > >  1 file changed, 10 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/block/rbd.c b/block/rbd.c
> > > > index f826410f40..e575105e6d 100644
> > > > --- a/block/rbd.c
> > > > +++ b/block/rbd.c
> > > > @@ -330,7 +330,7 @@ static int qemu_rbd_set_keypairs(rados_t cluster, 
> > > > const char *keypairs_json,
> > > >  #ifdef LIBRBD_SUPPORTS_ENCRYPTION
> > > >  static int qemu_rbd_convert_luks_options(
> > > >  RbdEncryptionOptionsLUKSBase *luks_opts,
> > > > -char **passphrase,
> > > > +const char **passphrase,
> > > >  size_t *passphrase_len,
> > > >  Error **errp)
> > > >  {
> > > > @@ -341,7 +341,7 @@ static int qemu_rbd_convert_luks_options(
> > > >  static int qemu_rbd_convert_luks_create_options(
> > > >  RbdEncryptionCreateOptionsLUKSBase *luks_opts,
> > > >  rbd_encryption_algorithm_t *alg,
> > > > -char **passphrase,
> > > > +const char **passphrase,
> > > >  size_t *passphrase_len,
> > > >  Error **errp)
> > > >  {
> > > > @@ -384,8 +384,7 @@ static int qemu_rbd_encryption_format(rbd_image_t 
> > > > image,
> > > >Error **errp)
> > > >  {
> > > >  int r = 0;
> > > > -g_autofree char *passphrase = NULL;
> > > > -size_t passphrase_len;
> > > > +g_autofree const char *passphrase = NULL;
> > >
> > > This looks wierd.  If it is as const string, why are
> > > we free'ing it ?  Either want g_autofree, or const,
> > > but not both.
> >
> > Just curious, is it a requirement imposed by g_autofree?  Otherwise
> > pointer constness and pointee lifetime are completely orthogonal and
> > freeing (or, in this case, wanting to auto-free) an object referred to
> > by a const pointer seems perfectly fine to me.
>
> Free'ing a const point is not OK
>
> $ cat c.c
> #include 
> void bar(const char *foo) {
> free(foo);
> }
>
> $ gcc -Wall -c c.c
> c.c: In function ‘bar’:
> c.c:5:10: warning: passing argument 1 of ‘free’ discards ‘const’ qualifier 
> from pointer target type [-Wdiscarded-qualifiers]
> 5 | free(foo);
>   |  ^~~
> In file included from c.c:2:
> /usr/include/stdlib.h:568:25: note: expected ‘void *’ but argument is of type 
> ‘const char *’
>   568 | extern void free (void *__ptr) __THROW;
>   |   ~~^
>
> The g_autofree happens to end up hiding this warning, because the const
> annotation isn't propagated to the registere callback, but that doesn't
> mean we should do that.
>
> When a programmer sees a variable annotated const, they expect that
> either someone else is responsible for free'ing it, or that the data
> is statically initialized or stack allocated and thus doesn't need
> free'ing.  So g_autofree + const is just wrong.

FWIW many believe that this specification of free() was a mistake and
that it should have been specified to take const void *.  Some projects
actually went ahead and fixed that: kfree() and friends in the Linux
kernel take const void *, for example.  C++ delete operator works on
const pointers as well -- because object creation and destruction is
fundamentally independent of modification.

But this is more of a philosophical thing...  I asked about g_autofree
because a quick grep revealed a bunch of g_autofree const char * locals
in the tree.  Or would probably prefer to just drop const here ;)

Thanks,

Ilya



Re: [PATCH v6 28/33] hw/isa/piix3: Merge hw/isa/piix4.c

2023-01-12 Thread Bernhard Beschow



Am 12. Januar 2023 16:31:23 UTC schrieb "Philippe Mathieu-Daudé" 
:
>On 12/1/23 16:04, Philippe Mathieu-Daudé wrote:
>> On 9/1/23 18:23, Bernhard Beschow wrote:
>>> Now that the PIIX3 and PIIX4 device models are sufficiently consolidated,
>>> their implementations can be merged into one file for further
>>> consolidation.
>>> 
>>> Signed-off-by: Bernhard Beschow 
>>> Reviewed-by: Michael S. Tsirkin 
>>> Message-Id: <20221022150508.26830-37-shen...@gmail.com>
>>> ---
>>>   hw/isa/{piix3.c => piix.c} | 158 
>>>   hw/isa/piix4.c | 285 -
>>>   MAINTAINERS    |   6 +-
>>>   hw/i386/Kconfig    |   2 +-
>>>   hw/isa/Kconfig |  12 +-
>>>   hw/isa/meson.build |   3 +-
>>>   hw/mips/Kconfig    |   2 +-
>>>   7 files changed, 165 insertions(+), 303 deletions(-)
>>>   rename hw/isa/{piix3.c => piix.c} (75%)
>>>   delete mode 100644 hw/isa/piix4.c
>> 
>>> +static void piix4_realize(PCIDevice *dev, Error **errp)
>>> +{
>>> +    PIIXState *s = PIIX_PCI_DEVICE(dev);
>>> +    PCIBus *pci_bus = pci_get_bus(dev);
>>> +    ISABus *isa_bus;
>>> +
>>> +    isa_bus = isa_bus_new(DEVICE(dev), pci_address_space(dev),
>>> +  pci_address_space_io(dev), errp);
>>> +    if (!isa_bus) {
>>> +    return;
>>> +    }
>>> +
>>> +    memory_region_init_io(&s->rcr_mem, OBJECT(dev), &rcr_ops, s,
>>> +  "reset-control", 1);
>>> +    memory_region_add_subregion_overlap(pci_address_space_io(dev),
>>> +    PIIX_RCR_IOPORT, &s->rcr_mem, 1);
>>> +
>>> +    /* initialize i8259 pic */
>>> +    if (!qdev_realize(DEVICE(&s->pic), NULL, errp)) {
>>> +    return;
>>> +    }
>>> +
>>> +    /* initialize ISA irqs */
>>> +    isa_bus_irqs(isa_bus, s->pic.in_irqs);
>>> +
>>> +    /* initialize pit */
>>> +    i8254_pit_init(isa_bus, 0x40, 0, NULL);
>>> +
>>> +    /* DMA */
>>> +    i8257_dma_init(isa_bus, 0);
>>> +
>>> +    /* RTC */
>>> +    qdev_prop_set_int32(DEVICE(&s->rtc), "base_year", 2000);
>>> +    if (!qdev_realize(DEVICE(&s->rtc), BUS(isa_bus), errp)) {
>>> +    return;
>>> +    }
>>> +    s->rtc.irq = qdev_get_gpio_in(DEVICE(&s->pic), s->rtc.isairq);
>> 
>> Pre-existing; it seems this difference with PIIX3 can be removed
>> because already taken care by calling isa_connect_gpio_out() in 
>> mc146818_rtc_init()?
>> 
>> ISADevice *mc146818_rtc_init(ISABus *bus, int base_year, qemu_irq 
>> intercept_irq)
>> {
>>      DeviceState *dev;
>>      ISADevice *isadev;
>>      RTCState *s;
>> 
>>      isadev = isa_new(TYPE_MC146818_RTC);
>>      dev = DEVICE(isadev);
>>      s = MC146818_RTC(isadev);
>>      qdev_prop_set_int32(dev, "base_year", base_year);
>>      isa_realize_and_unref(isadev, bus, &error_fatal);
>>      if (intercept_irq) {
>>      qdev_connect_gpio_out(dev, 0, intercept_irq);
>>      } else {
>>      isa_connect_gpio_out(isadev, 0, s->isairq);
>> 
>
>I meant to paste:
>
>static void rtc_realizefn(DeviceState *dev, Error **errp)
>{
>...
>qdev_init_gpio_out(dev, &s->irq, 1);
>
>
>> Having:
>> 
>> void isa_connect_gpio_out(ISADevice *isadev, int gpioirq, unsigned isairq)
>> {
>>      qemu_irq irq = isa_get_irq(isadev, isairq);
>>      qdev_connect_gpio_out(DEVICE(isadev), gpioirq, irq);
>> }
>> 
>> What do you think?
>

In "[PATCH v6 11/33] hw/i386/pc: Create RTC controllers in south bridges" 
mc146818_rtc_init() got inlined, taking into account the intercept_irq, which 
required the rtc interrupt to be wired up in board code. Since we don't have to 
deal with intercept_irq in the Malta code, wiring up of the rtc interrupt could 
be moved into PIIX4.

I would prefer to wire up the rtc interrupt in PIIX3 as well, and to re-wire it 
in board code in case of intercept_irq != NULL. That's still an open question 
which needs to be solved for PIIX4 to become a drop-in replacement for PIIX3. 
Any ideas?

Best regards,
Bernhard



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Klaus Jensen
On Jan 12 09:34, Keith Busch wrote:
> On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > 
> > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > am wondering if there is something going on with the kernel driver (but
> > I certainly do not rule out that hw/nvme is at fault here, since
> > pin-based interrupts has also been a source of several issues in the
> > past).
> 
> Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> While probably not the "correct" thing to do, it has better results in
> my testing.
> 

A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,

diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
index 03760ddeae8c..0fc46dcb9ec4 100644
--- i/hw/nvme/ctrl.c
+++ w/hw/nvme/ctrl.c
@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
 return;
 }
 if (~intms & n->irq_status) {
+pci_irq_deassert(&n->parent_obj);
 pci_irq_assert(&n->parent_obj);
 } else {
 pci_irq_deassert(&n->parent_obj);


seems to do the trick (pulse is the other way around, assert, then
deassert).

Probably not the "correct" thing to do, but I'll take it since it seems
to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
on ~20 runs now and have not encountered it.

I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
machine/board(s) are you testing?


signature.asc
Description: PGP signature


Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Keith Busch
On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
> > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > > 
> > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > > am wondering if there is something going on with the kernel driver (but
> > > I certainly do not rule out that hw/nvme is at fault here, since
> > > pin-based interrupts has also been a source of several issues in the
> > > past).
> > 
> > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> > While probably not the "correct" thing to do, it has better results in
> > my testing.
> > 
> 
> A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,
> 
>   diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
>   index 03760ddeae8c..0fc46dcb9ec4 100644
>   --- i/hw/nvme/ctrl.c
>   +++ w/hw/nvme/ctrl.c
>   @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
>return;
>}
>if (~intms & n->irq_status) {
>   +pci_irq_deassert(&n->parent_obj);
>pci_irq_assert(&n->parent_obj);
>} else {
>pci_irq_deassert(&n->parent_obj);
> 
> 
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
> 
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
> on ~20 runs now and have not encountered it.

Yep, that looks good. It's sounding like something with the CPU irq
handling is treating the level interrupt as edge triggered.



Re: [PATCH v4 1/3] block/rbd: encryption nit fixes

2023-01-12 Thread Daniel P . Berrangé
On Thu, Jan 12, 2023 at 06:07:58PM +0100, Ilya Dryomov wrote:
> On Thu, Jan 12, 2023 at 3:46 PM Daniel P. Berrangé  
> wrote:
> >
> > On Thu, Jan 12, 2023 at 03:26:56PM +0100, Ilya Dryomov wrote:
> > > On Thu, Jan 12, 2023 at 1:35 PM Daniel P. Berrangé  
> > > wrote:
> > > >
> > > > On Sun, Nov 20, 2022 at 04:28:34AM -0600, Or Ozeri wrote:
> > > > > Add const modifier to passphrases,
> > > > > and remove redundant stack variable passphrase_len.
> > > > >
> > > > > Signed-off-by: Or Ozeri 
> > > > > ---
> > > > >  block/rbd.c | 24 ++--
> > > > >  1 file changed, 10 insertions(+), 14 deletions(-)
> > > > >
> > > > > diff --git a/block/rbd.c b/block/rbd.c
> > > > > index f826410f40..e575105e6d 100644
> > > > > --- a/block/rbd.c
> > > > > +++ b/block/rbd.c
> > > > > @@ -330,7 +330,7 @@ static int qemu_rbd_set_keypairs(rados_t cluster, 
> > > > > const char *keypairs_json,
> > > > >  #ifdef LIBRBD_SUPPORTS_ENCRYPTION
> > > > >  static int qemu_rbd_convert_luks_options(
> > > > >  RbdEncryptionOptionsLUKSBase *luks_opts,
> > > > > -char **passphrase,
> > > > > +const char **passphrase,
> > > > >  size_t *passphrase_len,
> > > > >  Error **errp)
> > > > >  {
> > > > > @@ -341,7 +341,7 @@ static int qemu_rbd_convert_luks_options(
> > > > >  static int qemu_rbd_convert_luks_create_options(
> > > > >  RbdEncryptionCreateOptionsLUKSBase *luks_opts,
> > > > >  rbd_encryption_algorithm_t *alg,
> > > > > -char **passphrase,
> > > > > +const char **passphrase,
> > > > >  size_t *passphrase_len,
> > > > >  Error **errp)
> > > > >  {
> > > > > @@ -384,8 +384,7 @@ static int qemu_rbd_encryption_format(rbd_image_t 
> > > > > image,
> > > > >Error **errp)
> > > > >  {
> > > > >  int r = 0;
> > > > > -g_autofree char *passphrase = NULL;
> > > > > -size_t passphrase_len;
> > > > > +g_autofree const char *passphrase = NULL;
> > > >
> > > > This looks wierd.  If it is as const string, why are
> > > > we free'ing it ?  Either want g_autofree, or const,
> > > > but not both.
> > >
> > > Just curious, is it a requirement imposed by g_autofree?  Otherwise
> > > pointer constness and pointee lifetime are completely orthogonal and
> > > freeing (or, in this case, wanting to auto-free) an object referred to
> > > by a const pointer seems perfectly fine to me.
> >
> > Free'ing a const point is not OK
> >
> > $ cat c.c
> > #include 
> > void bar(const char *foo) {
> > free(foo);
> > }
> >
> > $ gcc -Wall -c c.c
> > c.c: In function ‘bar’:
> > c.c:5:10: warning: passing argument 1 of ‘free’ discards ‘const’ qualifier 
> > from pointer target type [-Wdiscarded-qualifiers]
> > 5 | free(foo);
> >   |  ^~~
> > In file included from c.c:2:
> > /usr/include/stdlib.h:568:25: note: expected ‘void *’ but argument is of 
> > type ‘const char *’
> >   568 | extern void free (void *__ptr) __THROW;
> >   |   ~~^
> >
> > The g_autofree happens to end up hiding this warning, because the const
> > annotation isn't propagated to the registere callback, but that doesn't
> > mean we should do that.
> >
> > When a programmer sees a variable annotated const, they expect that
> > either someone else is responsible for free'ing it, or that the data
> > is statically initialized or stack allocated and thus doesn't need
> > free'ing.  So g_autofree + const is just wrong.
> 
> FWIW many believe that this specification of free() was a mistake and
> that it should have been specified to take const void *.  Some projects
> actually went ahead and fixed that: kfree() and friends in the Linux
> kernel take const void *, for example.  C++ delete operator works on
> const pointers as well -- because object creation and destruction is
> fundamentally independent of modification.

I'd really not like that as IMHO seeing the 'const' gives an important
hint to developers as to who is responsible for the releasing the pointer

> But this is more of a philosophical thing...  I asked about g_autofree
> because a quick grep revealed a bunch of g_autofree const char * locals
> in the tree.  Or would probably prefer to just drop const here ;)

IMHO those existing cases are all bugs that we should fix, along with
adding a rule to checkpatch.pl to detect this mistake.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v6 28/33] hw/isa/piix3: Merge hw/isa/piix4.c

2023-01-12 Thread Philippe Mathieu-Daudé

On 9/1/23 18:23, Bernhard Beschow wrote:

Now that the PIIX3 and PIIX4 device models are sufficiently consolidated,
their implementations can be merged into one file for further
consolidation.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Message-Id: <20221022150508.26830-37-shen...@gmail.com>
---
  hw/isa/{piix3.c => piix.c} | 158 
  hw/isa/piix4.c | 285 -
  MAINTAINERS|   6 +-
  hw/i386/Kconfig|   2 +-
  hw/isa/Kconfig |  12 +-
  hw/isa/meson.build |   3 +-
  hw/mips/Kconfig|   2 +-
  7 files changed, 165 insertions(+), 303 deletions(-)
  rename hw/isa/{piix3.c => piix.c} (75%)
  delete mode 100644 hw/isa/piix4.c




@@ -489,11 +534,124 @@ static const TypeInfo piix3_xen_info = {
  .class_init= piix3_xen_class_init,
  };
  
+static void piix4_realize(PCIDevice *dev, Error **errp)

+{



+/* initialize pit */
+i8254_pit_init(isa_bus, 0x40, 0, NULL);

Pre-existing, why there is no equivalent PIT creation in the
PIIX3 variant? Due to in-kernel PIT in KVM?



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Keith Busch
On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> 
> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> am wondering if there is something going on with the kernel driver (but
> I certainly do not rule out that hw/nvme is at fault here, since
> pin-based interrupts has also been a source of several issues in the
> past).

Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.



Re: [PATCH v6 28/33] hw/isa/piix3: Merge hw/isa/piix4.c

2023-01-12 Thread Philippe Mathieu-Daudé

On 12/1/23 16:04, Philippe Mathieu-Daudé wrote:

On 9/1/23 18:23, Bernhard Beschow wrote:

Now that the PIIX3 and PIIX4 device models are sufficiently consolidated,
their implementations can be merged into one file for further
consolidation.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Message-Id: <20221022150508.26830-37-shen...@gmail.com>
---
  hw/isa/{piix3.c => piix.c} | 158 
  hw/isa/piix4.c | 285 -
  MAINTAINERS    |   6 +-
  hw/i386/Kconfig    |   2 +-
  hw/isa/Kconfig |  12 +-
  hw/isa/meson.build |   3 +-
  hw/mips/Kconfig    |   2 +-
  7 files changed, 165 insertions(+), 303 deletions(-)
  rename hw/isa/{piix3.c => piix.c} (75%)
  delete mode 100644 hw/isa/piix4.c



+static void piix4_realize(PCIDevice *dev, Error **errp)
+{
+    PIIXState *s = PIIX_PCI_DEVICE(dev);
+    PCIBus *pci_bus = pci_get_bus(dev);
+    ISABus *isa_bus;
+
+    isa_bus = isa_bus_new(DEVICE(dev), pci_address_space(dev),
+  pci_address_space_io(dev), errp);
+    if (!isa_bus) {
+    return;
+    }
+
+    memory_region_init_io(&s->rcr_mem, OBJECT(dev), &rcr_ops, s,
+  "reset-control", 1);
+    memory_region_add_subregion_overlap(pci_address_space_io(dev),
+    PIIX_RCR_IOPORT, &s->rcr_mem, 
1);

+
+    /* initialize i8259 pic */
+    if (!qdev_realize(DEVICE(&s->pic), NULL, errp)) {
+    return;
+    }
+
+    /* initialize ISA irqs */
+    isa_bus_irqs(isa_bus, s->pic.in_irqs);
+
+    /* initialize pit */
+    i8254_pit_init(isa_bus, 0x40, 0, NULL);
+
+    /* DMA */
+    i8257_dma_init(isa_bus, 0);
+
+    /* RTC */
+    qdev_prop_set_int32(DEVICE(&s->rtc), "base_year", 2000);
+    if (!qdev_realize(DEVICE(&s->rtc), BUS(isa_bus), errp)) {
+    return;
+    }
+    s->rtc.irq = qdev_get_gpio_in(DEVICE(&s->pic), s->rtc.isairq);


Pre-existing; it seems this difference with PIIX3 can be removed
because already taken care by calling isa_connect_gpio_out() in 
mc146818_rtc_init()?


ISADevice *mc146818_rtc_init(ISABus *bus, int base_year, qemu_irq 
intercept_irq)

{
     DeviceState *dev;
     ISADevice *isadev;
     RTCState *s;

     isadev = isa_new(TYPE_MC146818_RTC);
     dev = DEVICE(isadev);
     s = MC146818_RTC(isadev);
     qdev_prop_set_int32(dev, "base_year", base_year);
     isa_realize_and_unref(isadev, bus, &error_fatal);
     if (intercept_irq) {
     qdev_connect_gpio_out(dev, 0, intercept_irq);
     } else {
     isa_connect_gpio_out(isadev, 0, s->isairq);



I meant to paste:

static void rtc_realizefn(DeviceState *dev, Error **errp)
{
...
qdev_init_gpio_out(dev, &s->irq, 1);



Having:

void isa_connect_gpio_out(ISADevice *isadev, int gpioirq, unsigned isairq)
{
     qemu_irq irq = isa_get_irq(isadev, isairq);
     qdev_connect_gpio_out(DEVICE(isadev), gpioirq, irq);
}

What do you think?





Re: [PATCH v6 05/13] migration/qemu-file: Add qemu_file_get_to_fd()

2023-01-12 Thread Cédric Le Goater

On 1/12/23 09:50, Avihai Horon wrote:

Add new function qemu_file_get_to_fd() that allows reading data from
QEMUFile and writing it straight into a given fd.

This will be used later in VFIO migration code.

Signed-off-by: Avihai Horon 
Reviewed-by: Vladimir Sementsov-Ogievskiy 


Reviewed-by: Cédric Le Goater 

Thanks,

C.



---
  migration/qemu-file.h |  1 +
  migration/qemu-file.c | 34 ++
  2 files changed, 35 insertions(+)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index fa13d04d78..9d0155a2a1 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -148,6 +148,7 @@ int qemu_file_shutdown(QEMUFile *f);
  QEMUFile *qemu_file_get_return_path(QEMUFile *f);
  void qemu_fflush(QEMUFile *f);
  void qemu_file_set_blocking(QEMUFile *f, bool block);
+int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
  
  void ram_control_before_iterate(QEMUFile *f, uint64_t flags);

  void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 2d5f74ffc2..102ab3b439 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -940,3 +940,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
  {
  return file->ioc;
  }
+
+/*
+ * Read size bytes from QEMUFile f and write them to fd.
+ */
+int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
+{
+while (size) {
+size_t pending = f->buf_size - f->buf_index;
+ssize_t rc;
+
+if (!pending) {
+rc = qemu_fill_buffer(f);
+if (rc < 0) {
+return rc;
+}
+if (rc == 0) {
+return -EIO;
+}
+continue;
+}
+
+rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
+if (rc < 0) {
+return -errno;
+}
+if (rc == 0) {
+return -EIO;
+}
+f->buf_index += rc;
+size -= rc;
+}
+
+return 0;
+}





Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Guenter Roeck

On 1/12/23 05:10, Klaus Jensen wrote:

Hi all (linux-nvme, qemu-devel, maintainers),

On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
pin-based interrupts, I'm seeing occasional completion timeouts, i.e.

   nvme nvme0: I/O 333 QID 1 timeout, completion polled

To rule out issues with shadow doorbells (which have been a source of
frustration in the past), those are disabled. FWIW I'm also seeing the
issue with shadow doorbells.

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f25cc2c235e9..28d8e7f4b56c 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
*pci_dev)
 id->mdts = n->params.mdts;
 id->ver = cpu_to_le32(NVME_SPEC_VER);
 id->oacs =
-cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | 
NVME_OACS_DBBUF);
+cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
 id->cntrltype = 0x1;

 /*


I captured a trace from QEMU when this happens:

pci_nvme_mmio_write addr 0x1008 data 0x4e size 4
pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78
pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324
pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 
num_prps 5
pci_nvme_map_addr addr 0x80aca000 len 4096
pci_nvme_map_addr addr 0x80ac9000 len 4096
pci_nvme_map_addr addr 0x80ac8000 len 4096
pci_nvme_map_addr addr 0x80ac7000 len 4096
pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242
pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 
num_prps 29
pci_nvme_map_addr addr 0x80ae6000 len 4096
pci_nvme_map_addr addr 0x80ae5000 len 4096
pci_nvme_map_addr addr 0x80ae4000 len 4096
pci_nvme_map_addr addr 0x80ae3000 len 4096
pci_nvme_map_addr addr 0x80ae2000 len 4096
pci_nvme_map_addr addr 0x80ae1000 len 4096
pci_nvme_map_addr addr 0x80ae len 4096
pci_nvme_map_addr addr 0x80adf000 len 4096
pci_nvme_map_addr addr 0x80ade000 len 4096
pci_nvme_map_addr addr 0x80add000 len 4096
pci_nvme_map_addr addr 0x80adc000 len 4096
pci_nvme_map_addr addr 0x80adb000 len 4096
pci_nvme_map_addr addr 0x80ada000 len 4096
pci_nvme_map_addr addr 0x80ad9000 len 4096
pci_nvme_map_addr addr 0x80ad8000 len 4096
pci_nvme_map_addr addr 0x80ad7000 len 4096
pci_nvme_map_addr addr 0x80ad6000 len 4096
pci_nvme_map_addr addr 0x80ad5000 len 4096
pci_nvme_map_addr addr 0x80ad4000 len 4096
pci_nvme_map_addr addr 0x80ad3000 len 4096
pci_nvme_map_addr addr 0x80ad2000 len 4096
pci_nvme_map_addr addr 0x80ad1000 len 4096
pci_nvme_map_addr addr 0x80ad len 4096
pci_nvme_map_addr addr 0x80acf000 len 4096
pci_nvme_map_addr addr 0x80ace000 len 4096
pci_nvme_map_addr addr 0x80acd000 len 4096
pci_nvme_map_addr addr 0x80acc000 len 4096
pci_nvme_map_addr addr 0x80acb000 len 4096
pci_nvme_rw_cb cid 4428 blk 'd0'
pci_nvme_rw_complete_cb cid 4428 blk 'd0'
pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0
[1]: pci_nvme_irq_pin pulsing IRQ pin
pci_nvme_rw_cb cid 4429 blk 'd0'
pci_nvme_rw_complete_cb cid 4429 blk 'd0'
pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0
[2]: pci_nvme_irq_pin pulsing IRQ pin
[3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4
[4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77
 TIMEOUT HERE (30s) ---
[5]: pci_nvme_mmio_read addr 0x1c size 4
[6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4
[7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78
--- Interrupt deasserted (cq->tail == cq->head)
[   31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled

Following the timeout, everything returns to "normal" and device/driver
happily continues.

The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts has also been a source of several issues in the
past).

What I'm thinking is that following the interrupt in [1], the driver
picks up completion for cid 4428 but does not find cid 4429 in the queue
since it has not been posted yet. Before getting a cq head doorbell
write (which would cause the pin to be deasserted), the device posts the
completion for cid 4429 which just keeps the interrupt asserted in [2].
The trace then shows the cq head doorbell update in [3,4] for cid 4428
and then we hit the timeout since the driver is not aware that cid 4429
has been posted in between this (why is it not aware of this?) Timing
out, the driver then polls the queue and notices cid 4429 and updates
the cq head doorbell in [5-7], causing the device to deassert the
interrupt and we are "back in shape".

I'm observing this on 6.0 kernels and v6.2-rc3 (have not tested <6.0).
Tested on QEMU v7.0.0 (to rule out all the sha

[PATCH v2 4/5] parallels: Replace fprintf by qemu_log in check

2023-01-12 Thread Alexander Ivanov
If the check is called during normal work, tracking of the check must be
present in VM logs to have some clues if something going wrong with user's
data.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 73e992875a..5c9568f197 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -42,6 +42,7 @@
 #include "qemu/bswap.h"
 #include "qemu/bitmap.h"
 #include "qemu/memalign.h"
+#include "qemu/log-for-trace.h"
 #include "migration/blocker.h"
 #include "parallels.h"
 
@@ -448,8 +449,8 @@ static void parallels_check_unclean(BlockDriverState *bs,
 return;
 }
 
-fprintf(stderr, "%s image was not closed correctly\n",
-fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR");
+qemu_log("%s image was not closed correctly\n",
+ fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR");
 res->corruptions++;
 if (fix & BDRV_FIX_ERRORS) {
 /* parallels_close will do the job right */
@@ -476,8 +477,8 @@ static int parallels_check_outside_image(BlockDriverState 
*bs,
 for (i = 0; i < s->bat_size; i++) {
 off = bat2sect(s, i) << BDRV_SECTOR_BITS;
 if (off >= size) {
-fprintf(stderr, "%s cluster %u is outside image\n",
-fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
+qemu_log("%s cluster %u is outside image\n",
+ fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
 res->corruptions++;
 if (fix & BDRV_FIX_ERRORS) {
 parallels_set_bat_entry(s, i, 0);
@@ -554,8 +555,8 @@ static int parallels_check_leak(BlockDriverState *bs,
 }
 
 count = DIV_ROUND_UP(leak_size, s->cluster_size);
-fprintf(stderr, "%s space leaked at the end of the image %" PRId64 "\n",
-fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR", leak_size);
+qemu_log("%s space leaked at the end of the image %" PRId64 "\n",
+ fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR", leak_size);
 
 res->leaks += count;
 if (fix & BDRV_FIX_LEAKS) {
@@ -608,9 +609,8 @@ static int parallels_check_duplicate(BlockDriverState *bs,
 cluster_index = host_cluster_index(s, off);
 if (test_bit(cluster_index, bitmap)) {
 /* this cluster duplicates another one */
-fprintf(stderr,
-"%s duplicate offset in BAT entry %u\n",
-*fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
+qemu_log("%s duplicate offset in BAT entry %u\n",
+ *fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
 
 res->corruptions++;
 
-- 
2.34.1




Re: [PATCH v6 28/33] hw/isa/piix3: Merge hw/isa/piix4.c

2023-01-12 Thread Philippe Mathieu-Daudé

On 9/1/23 18:23, Bernhard Beschow wrote:

Now that the PIIX3 and PIIX4 device models are sufficiently consolidated,
their implementations can be merged into one file for further
consolidation.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Message-Id: <20221022150508.26830-37-shen...@gmail.com>
---
  hw/isa/{piix3.c => piix.c} | 158 
  hw/isa/piix4.c | 285 -
  MAINTAINERS|   6 +-
  hw/i386/Kconfig|   2 +-
  hw/isa/Kconfig |  12 +-
  hw/isa/meson.build |   3 +-
  hw/mips/Kconfig|   2 +-
  7 files changed, 165 insertions(+), 303 deletions(-)
  rename hw/isa/{piix3.c => piix.c} (75%)
  delete mode 100644 hw/isa/piix4.c



+static void piix4_realize(PCIDevice *dev, Error **errp)
+{
+PIIXState *s = PIIX_PCI_DEVICE(dev);
+PCIBus *pci_bus = pci_get_bus(dev);
+ISABus *isa_bus;
+
+isa_bus = isa_bus_new(DEVICE(dev), pci_address_space(dev),
+  pci_address_space_io(dev), errp);
+if (!isa_bus) {
+return;
+}
+
+memory_region_init_io(&s->rcr_mem, OBJECT(dev), &rcr_ops, s,
+  "reset-control", 1);
+memory_region_add_subregion_overlap(pci_address_space_io(dev),
+PIIX_RCR_IOPORT, &s->rcr_mem, 1);
+
+/* initialize i8259 pic */
+if (!qdev_realize(DEVICE(&s->pic), NULL, errp)) {
+return;
+}
+
+/* initialize ISA irqs */
+isa_bus_irqs(isa_bus, s->pic.in_irqs);
+
+/* initialize pit */
+i8254_pit_init(isa_bus, 0x40, 0, NULL);
+
+/* DMA */
+i8257_dma_init(isa_bus, 0);
+
+/* RTC */
+qdev_prop_set_int32(DEVICE(&s->rtc), "base_year", 2000);
+if (!qdev_realize(DEVICE(&s->rtc), BUS(isa_bus), errp)) {
+return;
+}
+s->rtc.irq = qdev_get_gpio_in(DEVICE(&s->pic), s->rtc.isairq);


Pre-existing; it seems this difference with PIIX3 can be removed
because already taken care by calling isa_connect_gpio_out() in 
mc146818_rtc_init()?


ISADevice *mc146818_rtc_init(ISABus *bus, int base_year, qemu_irq 
intercept_irq)

{
DeviceState *dev;
ISADevice *isadev;
RTCState *s;

isadev = isa_new(TYPE_MC146818_RTC);
dev = DEVICE(isadev);
s = MC146818_RTC(isadev);
qdev_prop_set_int32(dev, "base_year", base_year);
isa_realize_and_unref(isadev, bus, &error_fatal);
if (intercept_irq) {
qdev_connect_gpio_out(dev, 0, intercept_irq);
} else {
isa_connect_gpio_out(isadev, 0, s->isairq);

Having:

void isa_connect_gpio_out(ISADevice *isadev, int gpioirq, unsigned isairq)
{
qemu_irq irq = isa_get_irq(isadev, isairq);
qdev_connect_gpio_out(DEVICE(isadev), gpioirq, irq);
}

What do you think?



[PATCH v2 3/5] parallels: Add checking and repairing duplicate offsets in BAT

2023-01-12 Thread Alexander Ivanov
Cluster offsets must be unique among all the BAT entries. Find duplicate
offsets in the BAT and fix it by copying the content of the relevant
cluster to a newly allocated cluster and set the new cluster offset to the
duplicated entry.

Add host_cluster_index() and highest_offset() helpers to deduplicate the
code.

Move parallels_fix_leak() call to parallels_co_check() to fix both types
of leak: real corruption and a leak produced by allocate_clusters()
during deduplication.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 168 +-
 1 file changed, 151 insertions(+), 17 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index da1e75096c..73e992875a 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -136,6 +136,26 @@ static int cluster_remainder(BDRVParallelsState *s, 
int64_t sector_num,
 return MIN(nb_sectors, ret);
 }
 
+static uint32_t host_cluster_index(BDRVParallelsState *s, int64_t off)
+{
+off -= s->header->data_off << BDRV_SECTOR_BITS;
+return off / s->cluster_size;
+}
+
+static int64_t highest_offset(BDRVParallelsState *s)
+{
+int64_t off, high_off = 0;
+int i;
+
+for (i = 0; i < s->bat_size; i++) {
+off = bat2sect(s, i) << BDRV_SECTOR_BITS;
+if (off > high_off) {
+high_off = off;
+}
+}
+return high_off;
+}
+
 static int64_t block_status(BDRVParallelsState *s, int64_t sector_num,
 int nb_sectors, int *pnum)
 {
@@ -518,17 +538,9 @@ static int parallels_check_leak(BlockDriverState *bs,
 BdrvCheckMode fix)
 {
 BDRVParallelsState *s = bs->opaque;
-int64_t off, high_off, count, leak_size;
-uint32_t i;
-int ret;
+int64_t high_off, count, leak_size;
 
-high_off = 0;
-for (i = 0; i < s->bat_size; i++) {
-off = bat2sect(s, i) << BDRV_SECTOR_BITS;
-if (off > high_off) {
-high_off = off;
-}
-}
+high_off = highest_offset(s);
 
 res->image_end_offset = high_off + s->cluster_size;
 
@@ -541,13 +553,6 @@ static int parallels_check_leak(BlockDriverState *bs,
 return 0;
 }
 
-if (fix & BDRV_FIX_LEAKS) {
-ret = parallels_fix_leak(bs, res);
-if (ret < 0) {
-return ret;
-}
-}
-
 count = DIV_ROUND_UP(leak_size, s->cluster_size);
 fprintf(stderr, "%s space leaked at the end of the image %" PRId64 "\n",
 fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR", leak_size);
@@ -560,6 +565,122 @@ static int parallels_check_leak(BlockDriverState *bs,
 return 0;
 }
 
+static int parallels_check_duplicate(BlockDriverState *bs,
+ BdrvCheckResult *res,
+ BdrvCheckMode *fix)
+{
+BDRVParallelsState *s = bs->opaque;
+QEMUIOVector qiov;
+int64_t off, high_off, sector;
+unsigned long *bitmap;
+uint32_t i, bitmap_size, cluster_index;
+int n, ret = 0;
+uint64_t *buf = NULL;
+
+high_off = highest_offset(s);
+if (high_off == 0) {
+return 0;
+}
+
+/*
+ * Create a bitmap of used clusters.
+ * If a bit is set, there is a BAT entry pointing to this cluster.
+ * Loop through the BAT entries, check bits relevant to an entry offset.
+ * If bit is set, this entry is duplicated. Otherwise set the bit.
+ *
+ * We shouldn't worry about newly allocated clusters outside the image
+ * because they are created higher then any existing cluster pointed by
+ * a BAT entry.
+ */
+bitmap_size = host_cluster_index(s, high_off) + 1;
+bitmap = bitmap_new(bitmap_size);
+
+buf = qemu_memalign(4096, s->cluster_size);
+qemu_iovec_init(&qiov, 0);
+qemu_iovec_add(&qiov, buf, s->cluster_size);
+
+for (i = 0; i < s->bat_size; i++) {
+off = bat2sect(s, i) << BDRV_SECTOR_BITS;
+if (off == 0) {
+continue;
+}
+
+cluster_index = host_cluster_index(s, off);
+if (test_bit(cluster_index, bitmap)) {
+/* this cluster duplicates another one */
+fprintf(stderr,
+"%s duplicate offset in BAT entry %u\n",
+*fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
+
+res->corruptions++;
+
+if (*fix & BDRV_FIX_ERRORS) {
+/*
+ * Reset the entry and allocate a new cluster
+ * for the relevant guest offset. In this way we let
+ * the lower layer to place the new cluster properly.
+ * Copy the original cluster to the allocated one.
+ */
+parallels_set_bat_entry(s, i, 0);
+
+ret = bdrv_co_pread(bs->file, off, s->cluster_size, buf, 0);
+if (ret < 0) {
+res->check_errors++;
+goto out;
+}
+
+sector = (i * s->cluster_size) >> BDRV

[PATCH v2 0/5] parallels: Add duplication check, repair at open, fix bugs

2023-01-12 Thread Alexander Ivanov
Fix incorrect data end calculation in parallels_open().

Split image leak handling to separate check and fix helpers.

Add checking and repairing duplicate offsets in BAT

Replace fprintf() by qemu_log().

Image repairing in parallels_open().

v2:
2: Moved outsude parallels_check_leak() 2 helpers:
   parallels_get_leak_size() and parallels_fix_leak().
   
3: Used highest_offset() helper in parallels_check_leak(). Fixed a typo.
   Added comments. Replaced g_malloc() call by qemu_memalign(). Replaced
   bdrv_pread() call by bdrv_co_pread(). Got rid of keeping bytes and
   sectors in the same variable. Added setting the bitmap of the used
   clusters for a new allocated cluster if it isn't out of the bitmap.
   Moved the leak fix to the end of all the checks. Removed a dependence
   on image format for the duplicate check.
   
4 (old): Merged this patch to the previous.
4 (former 5): Fixed formatting.
5 (former 6): Fixed comments. Added O_INACTIVE check in the condition.
  Replaced inuse detection by header_unclean checking.
  Replaced playing with corutines by bdrv_check() usage.

Alexander Ivanov (5):
  parallels: Incorrect data end calculation in parallels_open()
  parallels: Split image leak handling to separate check and fix helpers
  parallels: Add checking and repairing duplicate offsets in BAT
  parallels: Replace fprintf by qemu_log in check
  parallels: Image repairing in parallels_open()

 block/parallels.c | 321 +++---
 1 file changed, 247 insertions(+), 74 deletions(-)

-- 
2.34.1




[PATCH v2 1/5] parallels: Incorrect data end calculation in parallels_open()

2023-01-12 Thread Alexander Ivanov
The BDRVParallelsState structure contains data_end field that is measured
in sectors. In parallels_open() initially this field is set by data_off
field from parallels image header.

According to the parallels format documentation, data_off field contains
an offset, in sectors, from the start of the file to the start of the
data area. For "WithoutFreeSpace" images: if data_off is zero, the offset
is calculated as the end of the BAT table plus some padding to ensure
sector size alignment.

The parallels_open() function has code for handling zero value in
data_off, but in the result data_end contains the offset in bytes.

Replace the alignment to sector size by division by sector size and fix
the comparision with s->header_size.

Signed-off-by: Alexander Ivanov 
Reviewed-by: Denis V. Lunev 
---
 block/parallels.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index eda3fb558d..ed2cf27abc 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -862,9 +862,9 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 s->data_end = le32_to_cpu(ph.data_off);
 if (s->data_end == 0) {
-s->data_end = ROUND_UP(bat_entry_off(s->bat_size), BDRV_SECTOR_SIZE);
+s->data_end = DIV_ROUND_UP(size, BDRV_SECTOR_SIZE);
 }
-if (s->data_end < s->header_size) {
+if (s->data_end < (s->header_size >> BDRV_SECTOR_BITS)) {
 /* there is not enough unused space to fit to block align between BAT
and actual data. We can't avoid read-modify-write... */
 s->header_size = size;
-- 
2.34.1




[PATCH v2 2/5] parallels: Split image leak handling to separate check and fix helpers

2023-01-12 Thread Alexander Ivanov
We need to fix leak after deduplication in the next patch. Move leak
fixing to a separate helper parallels_fix_leak() and add
parallels_get_leak_size() helper wich used in parallels_fix_leak() and
parallels_check_leak().

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 88 ---
 1 file changed, 61 insertions(+), 27 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index ed2cf27abc..da1e75096c 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -475,21 +475,53 @@ static int parallels_check_outside_image(BlockDriverState 
*bs,
 return 0;
 }
 
+static int64_t parallels_get_leak_size(BlockDriverState *bs,
+   BdrvCheckResult *res)
+{
+int64_t size;
+size = bdrv_getlength(bs->file->bs);
+/*
+ * Before any usage of this function out-of-image corruption has been
+ * fixed. If the function returns a negative value, it means an error.
+ */
+return (size < 0) ? size : (size - res->image_end_offset);
+}
+
+static int parallels_fix_leak(BlockDriverState *bs,
+  BdrvCheckResult *res)
+{
+Error *local_err = NULL;
+int64_t size;
+int ret;
+
+size = parallels_get_leak_size(bs, res);
+if (size <= 0) {
+return size;
+}
+
+/*
+ * In order to really repair the image, we must shrink it.
+ * That means we have to pass exact=true.
+ */
+ret = bdrv_co_truncate(bs->file, res->image_end_offset, true,
+   PREALLOC_MODE_OFF, 0, &local_err);
+if (ret < 0) {
+error_report_err(local_err);
+return ret;
+}
+
+return 0;
+}
+
 static int parallels_check_leak(BlockDriverState *bs,
 BdrvCheckResult *res,
 BdrvCheckMode fix)
 {
 BDRVParallelsState *s = bs->opaque;
-int64_t size, off, high_off, count;
+int64_t off, high_off, count, leak_size;
 uint32_t i;
 int ret;
 
-size = bdrv_getlength(bs->file->bs);
-if (size < 0) {
-res->check_errors++;
-return size;
-}
-
 high_off = 0;
 for (i = 0; i < s->bat_size; i++) {
 off = bat2sect(s, i) << BDRV_SECTOR_BITS;
@@ -499,30 +531,32 @@ static int parallels_check_leak(BlockDriverState *bs,
 }
 
 res->image_end_offset = high_off + s->cluster_size;
-if (size > res->image_end_offset) {
-count = DIV_ROUND_UP(size - res->image_end_offset, s->cluster_size);
-fprintf(stderr, "%s space leaked at the end of the image %" PRId64 
"\n",
-fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR",
-size - res->image_end_offset);
-res->leaks += count;
-if (fix & BDRV_FIX_LEAKS) {
-Error *local_err = NULL;
 
-/*
- * In order to really repair the image, we must shrink it.
- * That means we have to pass exact=true.
- */
-ret = bdrv_co_truncate(bs->file, res->image_end_offset, true,
-   PREALLOC_MODE_OFF, 0, &local_err);
-if (ret < 0) {
-error_report_err(local_err);
-res->check_errors++;
-return ret;
-}
-res->leaks_fixed += count;
+leak_size = parallels_get_leak_size(bs, res);
+if (leak_size < 0) {
+res->check_errors++;
+return leak_size;
+}
+if (leak_size == 0) {
+return 0;
+}
+
+if (fix & BDRV_FIX_LEAKS) {
+ret = parallels_fix_leak(bs, res);
+if (ret < 0) {
+return ret;
 }
 }
 
+count = DIV_ROUND_UP(leak_size, s->cluster_size);
+fprintf(stderr, "%s space leaked at the end of the image %" PRId64 "\n",
+fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR", leak_size);
+
+res->leaks += count;
+if (fix & BDRV_FIX_LEAKS) {
+res->leaks_fixed += count;
+}
+
 return 0;
 }
 
-- 
2.34.1




[PATCH v2 5/5] parallels: Image repairing in parallels_open()

2023-01-12 Thread Alexander Ivanov
Repair an image at opening if the image is unclean or out-of-image
corruption was detected.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 67 +--
 1 file changed, 36 insertions(+), 31 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 5c9568f197..74f6d00ffb 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -753,7 +753,6 @@ static int coroutine_fn parallels_co_check(BlockDriverState 
*bs,
 return ret;
 }
 
-
 static int coroutine_fn parallels_co_create(BlockdevCreateOptions* opts,
 Error **errp)
 {
@@ -965,8 +964,8 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 {
 BDRVParallelsState *s = bs->opaque;
 ParallelsHeader ph;
-int ret, size, i;
-int64_t file_size;
+int ret, size;
+int64_t file_size, high_off;
 QemuOpts *opts = NULL;
 Error *local_err = NULL;
 char *buf;
@@ -1044,34 +1043,6 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 s->bat_bitmap = (uint32_t *)(s->header + 1);
 
-for (i = 0; i < s->bat_size; i++) {
-int64_t off = bat2sect(s, i);
-if (off >= file_size) {
-if (flags & BDRV_O_CHECK) {
-continue;
-}
-error_setg(errp, "parallels: Offset %" PRIi64 " in BAT[%d] entry "
-   "is larger than file size (%" PRIi64 ")",
-   off, i, file_size);
-ret = -EINVAL;
-goto fail;
-}
-if (off >= s->data_end) {
-s->data_end = off + s->tracks;
-}
-}
-
-if (le32_to_cpu(ph.inuse) == HEADER_INUSE_MAGIC) {
-/* Image was not closed correctly. The check is mandatory */
-s->header_unclean = true;
-if ((flags & BDRV_O_RDWR) && !(flags & BDRV_O_CHECK)) {
-error_setg(errp, "parallels: Image was not closed correctly; "
-   "cannot be opened read/write");
-ret = -EACCES;
-goto fail;
-}
-}
-
 opts = qemu_opts_create(¶llels_runtime_opts, NULL, 0, errp);
 if (!opts) {
 goto fail_options;
@@ -1133,7 +1104,41 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 error_free(s->migration_blocker);
 goto fail;
 }
+
 qemu_co_mutex_init(&s->lock);
+
+if (le32_to_cpu(ph.inuse) == HEADER_INUSE_MAGIC) {
+s->header_unclean = true;
+}
+
+high_off = highest_offset(s) >> BDRV_SECTOR_BITS;
+if (high_off >= s->data_end) {
+s->data_end = high_off + s->tracks;
+}
+
+/*
+ * We don't repair the image here if it is opened for checks and
+ * shouldn't change the image if BDRV_O_RDWR or BDRV_O_INACTIVE
+ * flag is present.
+ */
+if ((flags & (BDRV_O_CHECK | BDRV_O_INACTIVE)) || !(flags & BDRV_O_RDWR)) {
+return 0;
+}
+
+/*
+ * Repair the image if it's dirty or
+ * out-of-image corruption was detected.
+ */
+if (s->data_end > file_size || s->header_unclean) {
+BdrvCheckResult res;
+ret = bdrv_check(bs, &res, BDRV_FIX_ERRORS | BDRV_FIX_LEAKS);
+if (ret < 0) {
+error_setg_errno(errp, -ret,
+ "Could not repair corrupted image");
+goto fail;
+}
+}
+
 return 0;
 
 fail_format:
-- 
2.34.1




Re: [PATCH v4 1/3] block/rbd: encryption nit fixes

2023-01-12 Thread Daniel P . Berrangé
On Thu, Jan 12, 2023 at 03:26:56PM +0100, Ilya Dryomov wrote:
> On Thu, Jan 12, 2023 at 1:35 PM Daniel P. Berrangé  
> wrote:
> >
> > On Sun, Nov 20, 2022 at 04:28:34AM -0600, Or Ozeri wrote:
> > > Add const modifier to passphrases,
> > > and remove redundant stack variable passphrase_len.
> > >
> > > Signed-off-by: Or Ozeri 
> > > ---
> > >  block/rbd.c | 24 ++--
> > >  1 file changed, 10 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/block/rbd.c b/block/rbd.c
> > > index f826410f40..e575105e6d 100644
> > > --- a/block/rbd.c
> > > +++ b/block/rbd.c
> > > @@ -330,7 +330,7 @@ static int qemu_rbd_set_keypairs(rados_t cluster, 
> > > const char *keypairs_json,
> > >  #ifdef LIBRBD_SUPPORTS_ENCRYPTION
> > >  static int qemu_rbd_convert_luks_options(
> > >  RbdEncryptionOptionsLUKSBase *luks_opts,
> > > -char **passphrase,
> > > +const char **passphrase,
> > >  size_t *passphrase_len,
> > >  Error **errp)
> > >  {
> > > @@ -341,7 +341,7 @@ static int qemu_rbd_convert_luks_options(
> > >  static int qemu_rbd_convert_luks_create_options(
> > >  RbdEncryptionCreateOptionsLUKSBase *luks_opts,
> > >  rbd_encryption_algorithm_t *alg,
> > > -char **passphrase,
> > > +const char **passphrase,
> > >  size_t *passphrase_len,
> > >  Error **errp)
> > >  {
> > > @@ -384,8 +384,7 @@ static int qemu_rbd_encryption_format(rbd_image_t 
> > > image,
> > >Error **errp)
> > >  {
> > >  int r = 0;
> > > -g_autofree char *passphrase = NULL;
> > > -size_t passphrase_len;
> > > +g_autofree const char *passphrase = NULL;
> >
> > This looks wierd.  If it is as const string, why are
> > we free'ing it ?  Either want g_autofree, or const,
> > but not both.
> 
> Just curious, is it a requirement imposed by g_autofree?  Otherwise
> pointer constness and pointee lifetime are completely orthogonal and
> freeing (or, in this case, wanting to auto-free) an object referred to
> by a const pointer seems perfectly fine to me.

Free'ing a const point is not OK

$ cat c.c
#include 
void bar(const char *foo) {
free(foo);
}

$ gcc -Wall -c c.c
c.c: In function ‘bar’:
c.c:5:10: warning: passing argument 1 of ‘free’ discards ‘const’ qualifier from 
pointer target type [-Wdiscarded-qualifiers]
5 | free(foo);
  |  ^~~
In file included from c.c:2:
/usr/include/stdlib.h:568:25: note: expected ‘void *’ but argument is of type 
‘const char *’
  568 | extern void free (void *__ptr) __THROW;
  |   ~~^


The g_autofree happens to end up hiding this warning, because the const
annotation isn't propagated to the registere callback, but that doesn't
mean we should do that.

When a programmer sees a variable annotated const, they expect that
either someone else is responsible for free'ing it, or that the data
is statically initialized or stack allocated and thus doesn't need
free'ing.  So g_autofree + const is just wrong.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v4 1/3] block/rbd: encryption nit fixes

2023-01-12 Thread Ilya Dryomov
On Thu, Jan 12, 2023 at 1:35 PM Daniel P. Berrangé  wrote:
>
> On Sun, Nov 20, 2022 at 04:28:34AM -0600, Or Ozeri wrote:
> > Add const modifier to passphrases,
> > and remove redundant stack variable passphrase_len.
> >
> > Signed-off-by: Or Ozeri 
> > ---
> >  block/rbd.c | 24 ++--
> >  1 file changed, 10 insertions(+), 14 deletions(-)
> >
> > diff --git a/block/rbd.c b/block/rbd.c
> > index f826410f40..e575105e6d 100644
> > --- a/block/rbd.c
> > +++ b/block/rbd.c
> > @@ -330,7 +330,7 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
> > char *keypairs_json,
> >  #ifdef LIBRBD_SUPPORTS_ENCRYPTION
> >  static int qemu_rbd_convert_luks_options(
> >  RbdEncryptionOptionsLUKSBase *luks_opts,
> > -char **passphrase,
> > +const char **passphrase,
> >  size_t *passphrase_len,
> >  Error **errp)
> >  {
> > @@ -341,7 +341,7 @@ static int qemu_rbd_convert_luks_options(
> >  static int qemu_rbd_convert_luks_create_options(
> >  RbdEncryptionCreateOptionsLUKSBase *luks_opts,
> >  rbd_encryption_algorithm_t *alg,
> > -char **passphrase,
> > +const char **passphrase,
> >  size_t *passphrase_len,
> >  Error **errp)
> >  {
> > @@ -384,8 +384,7 @@ static int qemu_rbd_encryption_format(rbd_image_t image,
> >Error **errp)
> >  {
> >  int r = 0;
> > -g_autofree char *passphrase = NULL;
> > -size_t passphrase_len;
> > +g_autofree const char *passphrase = NULL;
>
> This looks wierd.  If it is as const string, why are
> we free'ing it ?  Either want g_autofree, or const,
> but not both.

Just curious, is it a requirement imposed by g_autofree?  Otherwise
pointer constness and pointee lifetime are completely orthogonal and
freeing (or, in this case, wanting to auto-free) an object referred to
by a const pointer seems perfectly fine to me.

Thanks,

Ilya



Re: [PULL 0/6] hw/nvme updates

2023-01-12 Thread Peter Maydell
On Wed, 11 Jan 2023 at 07:52, Klaus Jensen  wrote:
>
> From: Klaus Jensen 
>
> Hi Peter,
>
> The following changes since commit 528d9f33cad5245c1099d77084c78bb2244d5143:
>
>   Merge tag 'pull-tcg-20230106' of https://gitlab.com/rth7680/qemu into 
> staging (2023-01-08 11:23:17 +)
>
> are available in the Git repository at:
>
>   https://gitlab.com/birkelund/qemu.git tags/nvme-next-pull-request
>
> for you to fetch changes up to 973f76cf7743545a5d8a0a8bfdfe2cd02aa3e238:
>
>   hw/nvme: cleanup error reporting in nvme_init_pci() (2023-01-11 08:41:19 
> +0100)
>
> 
> hw/nvme updates
>
> 


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/8.0
for any user-visible changes.

-- PMM



Re: [PATCH v6 28/33] hw/isa/piix3: Merge hw/isa/piix4.c

2023-01-12 Thread Philippe Mathieu-Daudé

On 12/1/23 13:50, Philippe Mathieu-Daudé wrote:

On 9/1/23 18:23, Bernhard Beschow wrote:

Now that the PIIX3 and PIIX4 device models are sufficiently consolidated,
their implementations can be merged into one file for further
consolidation.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Message-Id: <20221022150508.26830-37-shen...@gmail.com>
---
  hw/isa/{piix3.c => piix.c} | 158 
  hw/isa/piix4.c | 285 -
  MAINTAINERS    |   6 +-
  hw/i386/Kconfig    |   2 +-
  hw/isa/Kconfig |  12 +-
  hw/isa/meson.build |   3 +-
  hw/mips/Kconfig    |   2 +-
  7 files changed, 165 insertions(+), 303 deletions(-)
  rename hw/isa/{piix3.c => piix.c} (75%)
  delete mode 100644 hw/isa/piix4.c


Reviewed-by: Philippe Mathieu-Daudé 


BTW I wonder why PIIX4 isn't calling pci_bus_set_route_irq_fn()...
Any clue?




Re: [PATCH v4 3/3] block/rbd: Add support for layered encryption

2023-01-12 Thread Daniel P . Berrangé
On Thu, Jan 12, 2023 at 01:06:51PM +, Or Ozeri wrote:
> > -Original Message-
> > From: Daniel P. Berrangé 
> > Sent: Thursday, 12 January 2023 14:50
> > To: Or Ozeri 
> > Cc: qemu-de...@nongnu.org; qemu-block@nongnu.org; Danny Harnik
> > ; idryo...@gmail.com
> > Subject: [EXTERNAL] Re: [PATCH v4 3/3] block/rbd: Add support for layered
> > encryption
> > 
> > I don't think we should be reporting this differently.
> > 
> > The layering is not a different encryption format. It is a configuration
> > convenience to avoid repeating the same passphrase for a stack of images
> > when opening an image.
> > 
> > In terms of encryption format it is still either using 'luks1' or 'luks2'.
> > 
> 
> I don’t think that's right.
> The simplest argument is that the magic for RBD layered-luks is not "LUKS".
> So, it's a different format, which cannot be opened by dm-crypt for example.
> I think this is important for the user to know that, and thus it is useful to 
> point it out
> in the output of qemu-img info.

This different magic is an internal implementation detail of RBD. The
on-disk encryption is still following either the luks1 or luks2 format
spec. On the QEMU side we're only needing to know what the on disk format
spec is, and whether or not the parents use a common key, so that apps
know what they need to provide to QEMU for disk config. 

Opening a volume  with dm-crypt is not relevant to QEMU's usage, and
if users are doing that, they should be using the RBD tools directly
and qemu-img info is unrelated to that.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Klaus Jensen
Hi all (linux-nvme, qemu-devel, maintainers),

On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
pin-based interrupts, I'm seeing occasional completion timeouts, i.e.

  nvme nvme0: I/O 333 QID 1 timeout, completion polled

To rule out issues with shadow doorbells (which have been a source of
frustration in the past), those are disabled. FWIW I'm also seeing the
issue with shadow doorbells.

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f25cc2c235e9..28d8e7f4b56c 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
*pci_dev)
 id->mdts = n->params.mdts;
 id->ver = cpu_to_le32(NVME_SPEC_VER);
 id->oacs =
-cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | 
NVME_OACS_DBBUF);
+cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
 id->cntrltype = 0x1;

 /*


I captured a trace from QEMU when this happens:

pci_nvme_mmio_write addr 0x1008 data 0x4e size 4
pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78
pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324
pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 
num_prps 5
pci_nvme_map_addr addr 0x80aca000 len 4096
pci_nvme_map_addr addr 0x80ac9000 len 4096
pci_nvme_map_addr addr 0x80ac8000 len 4096
pci_nvme_map_addr addr 0x80ac7000 len 4096
pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242
pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 
num_prps 29
pci_nvme_map_addr addr 0x80ae6000 len 4096
pci_nvme_map_addr addr 0x80ae5000 len 4096
pci_nvme_map_addr addr 0x80ae4000 len 4096
pci_nvme_map_addr addr 0x80ae3000 len 4096
pci_nvme_map_addr addr 0x80ae2000 len 4096
pci_nvme_map_addr addr 0x80ae1000 len 4096
pci_nvme_map_addr addr 0x80ae len 4096
pci_nvme_map_addr addr 0x80adf000 len 4096
pci_nvme_map_addr addr 0x80ade000 len 4096
pci_nvme_map_addr addr 0x80add000 len 4096
pci_nvme_map_addr addr 0x80adc000 len 4096
pci_nvme_map_addr addr 0x80adb000 len 4096
pci_nvme_map_addr addr 0x80ada000 len 4096
pci_nvme_map_addr addr 0x80ad9000 len 4096
pci_nvme_map_addr addr 0x80ad8000 len 4096
pci_nvme_map_addr addr 0x80ad7000 len 4096
pci_nvme_map_addr addr 0x80ad6000 len 4096
pci_nvme_map_addr addr 0x80ad5000 len 4096
pci_nvme_map_addr addr 0x80ad4000 len 4096
pci_nvme_map_addr addr 0x80ad3000 len 4096
pci_nvme_map_addr addr 0x80ad2000 len 4096
pci_nvme_map_addr addr 0x80ad1000 len 4096
pci_nvme_map_addr addr 0x80ad len 4096
pci_nvme_map_addr addr 0x80acf000 len 4096
pci_nvme_map_addr addr 0x80ace000 len 4096
pci_nvme_map_addr addr 0x80acd000 len 4096
pci_nvme_map_addr addr 0x80acc000 len 4096
pci_nvme_map_addr addr 0x80acb000 len 4096
pci_nvme_rw_cb cid 4428 blk 'd0'
pci_nvme_rw_complete_cb cid 4428 blk 'd0'
pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0
[1]: pci_nvme_irq_pin pulsing IRQ pin
pci_nvme_rw_cb cid 4429 blk 'd0'
pci_nvme_rw_complete_cb cid 4429 blk 'd0'
pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0
[2]: pci_nvme_irq_pin pulsing IRQ pin
[3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4
[4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77
 TIMEOUT HERE (30s) ---
[5]: pci_nvme_mmio_read addr 0x1c size 4
[6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4
[7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78
--- Interrupt deasserted (cq->tail == cq->head)
[   31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled

Following the timeout, everything returns to "normal" and device/driver
happily continues.

The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts has also been a source of several issues in the
past).

What I'm thinking is that following the interrupt in [1], the driver
picks up completion for cid 4428 but does not find cid 4429 in the queue
since it has not been posted yet. Before getting a cq head doorbell
write (which would cause the pin to be deasserted), the device posts the
completion for cid 4429 which just keeps the interrupt asserted in [2].
The trace then shows the cq head doorbell update in [3,4] for cid 4428
and then we hit the timeout since the driver is not aware that cid 4429
has been posted in between this (why is it not aware of this?) Timing
out, the driver then polls the queue and notices cid 4429 and updates
the cq head doorbell in [5-7], causing the device to deassert the
interrupt and we are "back in shape".

I'm observing this on 6.0 kernels and v6.2-rc3 (have not tested <6.0).
Tested on QEMU v7.0.0 (to rule out all the shadow doorbell
optimizations) as well as QE

RE: [PATCH v4 3/3] block/rbd: Add support for layered encryption

2023-01-12 Thread Or Ozeri
> -Original Message-
> From: Daniel P. Berrangé 
> Sent: Thursday, 12 January 2023 14:50
> To: Or Ozeri 
> Cc: qemu-de...@nongnu.org; qemu-block@nongnu.org; Danny Harnik
> ; idryo...@gmail.com
> Subject: [EXTERNAL] Re: [PATCH v4 3/3] block/rbd: Add support for layered
> encryption
> 
> I don't think we should be reporting this differently.
> 
> The layering is not a different encryption format. It is a configuration
> convenience to avoid repeating the same passphrase for a stack of images
> when opening an image.
> 
> In terms of encryption format it is still either using 'luks1' or 'luks2'.
> 

I don’t think that's right.
The simplest argument is that the magic for RBD layered-luks is not "LUKS".
So, it's a different format, which cannot be opened by dm-crypt for example.
I think this is important for the user to know that, and thus it is useful to 
point it out
in the output of qemu-img info.




Re: [PATCH v6 30/33] hw/isa/piix: Reuse PIIX3 base class' realize method in PIIX4

2023-01-12 Thread Philippe Mathieu-Daudé

On 9/1/23 18:23, Bernhard Beschow wrote:

Resolves duplicate code.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Message-Id: <20221022150508.26830-39-shen...@gmail.com>
---
  hw/isa/piix.c | 65 +++
  1 file changed, 9 insertions(+), 56 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v6 28/33] hw/isa/piix3: Merge hw/isa/piix4.c

2023-01-12 Thread Philippe Mathieu-Daudé

On 9/1/23 18:23, Bernhard Beschow wrote:

Now that the PIIX3 and PIIX4 device models are sufficiently consolidated,
their implementations can be merged into one file for further
consolidation.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Message-Id: <20221022150508.26830-37-shen...@gmail.com>
---
  hw/isa/{piix3.c => piix.c} | 158 
  hw/isa/piix4.c | 285 -
  MAINTAINERS|   6 +-
  hw/i386/Kconfig|   2 +-
  hw/isa/Kconfig |  12 +-
  hw/isa/meson.build |   3 +-
  hw/mips/Kconfig|   2 +-
  7 files changed, 165 insertions(+), 303 deletions(-)
  rename hw/isa/{piix3.c => piix.c} (75%)
  delete mode 100644 hw/isa/piix4.c


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v4 3/3] block/rbd: Add support for layered encryption

2023-01-12 Thread Daniel P . Berrangé
On Sun, Nov 20, 2022 at 04:28:36AM -0600, Or Ozeri wrote:
> Starting from ceph Reef, RBD has built-in support for layered encryption,
> where each ancestor image (in a cloned image setting) can be possibly
> encrypted using a unique passphrase.
> 
> A new function, rbd_encryption_load2, was added to librbd API.
> This new function supports an array of passphrases (via "spec" structs).
> 
> This commit extends the qemu rbd driver API to use this new librbd API,
> in order to support this new layered encryption feature.
> 
> Signed-off-by: Or Ozeri 
> ---
>  block/rbd.c  | 161 ++-
>  qapi/block-core.json |  17 -
>  2 files changed, 175 insertions(+), 3 deletions(-)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index 7feae45e82..157922e23a 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -71,6 +71,16 @@ static const char rbd_luks2_header_verification[
>  'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 2
>  };
>  
> +static const char rbd_layered_luks_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 1
> +};
> +
> +static const char rbd_layered_luks2_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 2
> +};
> +
>  typedef enum {
>  RBD_AIO_READ,
>  RBD_AIO_WRITE,
> @@ -537,6 +547,136 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  
>  return 0;
>  }
> +
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +static int qemu_rbd_encryption_load2(rbd_image_t image,
> + RbdEncryptionOptions *encrypt,
> + Error **errp)
> +{
> +int r = 0;
> +int encrypt_count = 1;
> +int i;
> +RbdEncryptionOptions *curr_encrypt;
> +rbd_encryption_spec_t *specs;
> +rbd_encryption_luks1_format_options_t* luks_opts;
> +rbd_encryption_luks2_format_options_t* luks2_opts;
> +rbd_encryption_luks_format_options_t* luks_any_opts;
> +
> +/* count encryption options */
> +for (curr_encrypt = encrypt; curr_encrypt->has_parent;
> + curr_encrypt = curr_encrypt->parent) {
> +++encrypt_count;
> +}
> +
> +specs = g_new0(rbd_encryption_spec_t, encrypt_count);
> +
> +curr_encrypt = encrypt;
> +for (i = 0; i < encrypt_count; ++i) {
> +switch (curr_encrypt->format) {
> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS: {
> +specs[i].format = RBD_ENCRYPTION_FORMAT_LUKS1;
> +specs[i].opts_size =
> +sizeof(rbd_encryption_luks1_format_options_t);
> +
> +luks_opts = g_new0(rbd_encryption_luks1_format_options_t, 1);
> +specs[i].opts = luks_opts;
> +
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKS_base(
> +&curr_encrypt->u.luks),
> +&luks_opts->passphrase,
> +&luks_opts->passphrase_size,
> +errp);
> +break;
> +}
> +
> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS2: {
> +specs[i].format = RBD_ENCRYPTION_FORMAT_LUKS2;
> +specs[i].opts_size =
> +sizeof(rbd_encryption_luks2_format_options_t);
> +
> +luks2_opts = g_new0(rbd_encryption_luks2_format_options_t, 
> 1);
> +specs[i].opts = luks2_opts;
> +
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKS2_base(
> +&curr_encrypt->u.luks2),
> +&luks2_opts->passphrase,
> +&luks2_opts->passphrase_size,
> +errp);
> +break;
> +}
> +
> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS_ANY: {
> +specs[i].format = RBD_ENCRYPTION_FORMAT_LUKS;
> +specs[i].opts_size =
> +sizeof(rbd_encryption_luks_format_options_t);
> +
> +luks_any_opts = g_new0(rbd_encryption_luks_format_options_t, 
> 1);
> +specs[i].opts = luks_any_opts;
> +
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKSAny_base(
> +&curr_encrypt->u.luks_any),
> +&luks_any_opts->passphrase,
> +&luks_any_opts->passphrase_size,
> +errp);
> +break;
> +}
> +
> +default: {
> +r = -ENOTSUP;
> +error_setg_errno(
> +errp, -r, "unknown image encryption format: %u",
> +curr_encrypt->format);
> +}
> +}
> +
> +if (r < 0) {
> +goto exit;
> +}
> +
> +curr_en

Re: [PATCH v6 22/33] hw/isa/piix3: Drop the "3" from PIIX base class

2023-01-12 Thread Philippe Mathieu-Daudé

On 9/1/23 18:23, Bernhard Beschow wrote:

This commit marks the finalization of the PIIX3 preparations
to be merged with PIIX4. In particular, PIIXState is prepared
to be reused in piix4.c.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Message-Id: <20221022150508.26830-25-shen...@gmail.com>
---
  include/hw/southbridge/piix.h |  6 ++--
  hw/isa/piix3.c| 60 +--
  2 files changed, 32 insertions(+), 34 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v6 15/33] hw/isa/piix3: Create power management controller in host device

2023-01-12 Thread Philippe Mathieu-Daudé

On 9/1/23 18:23, Bernhard Beschow wrote:

The power management controller is an integral part of PIIX3 (function
3). So create it as part of the south bridge.

Note that the ACPI function is optional in QEMU. This is why it gets
object_initialize_child()'ed in realize rather than in instance_init.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Message-Id: <20221022150508.26830-14-shen...@gmail.com>
---
  include/hw/southbridge/piix.h |  6 ++
  hw/i386/pc_piix.c | 24 ++--
  hw/isa/piix3.c| 14 ++
  hw/isa/Kconfig|  1 +
  4 files changed, 35 insertions(+), 10 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v6 14/33] hw/isa/piix3: Create USB controller in host device

2023-01-12 Thread Philippe Mathieu-Daudé

On 9/1/23 18:23, Bernhard Beschow wrote:

The USB controller is an integral part of PIIX3 (function 2). So create
it as part of the south bridge.

Note that the USB function is optional in QEMU. This is why it gets
object_initialize_child()'ed in realize rather than in instance_init.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Message-Id: <20221022150508.26830-13-shen...@gmail.com>
---
  include/hw/southbridge/piix.h |  4 
  hw/i386/pc_piix.c |  7 ++-
  hw/isa/piix3.c| 17 +
  hw/isa/Kconfig|  1 +
  4 files changed, 24 insertions(+), 5 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v6 13/33] hw/i386/pc_piix: Allow for setting properties before realizing PIIX3 south bridge

2023-01-12 Thread Philippe Mathieu-Daudé

On 9/1/23 18:23, Bernhard Beschow wrote:

The next patches will need to take advantage of it.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Peter Maydell 
Reviewed-by: Michael S. Tsirkin 
Message-Id: <20221022150508.26830-3-shen...@gmail.com>
---
  hw/i386/pc_piix.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v4 2/3] block/rbd: Add luks-any encryption opening option

2023-01-12 Thread Daniel P . Berrangé
On Sun, Nov 20, 2022 at 04:28:35AM -0600, Or Ozeri wrote:
> Ceph RBD encryption API required specifying the encryption format
> for loading encryption. The supported formats were LUKS (v1) and LUKS2.
> 
> Starting from Reef release, RBD also supports loading with "luks-any" format,
> which works for both versions of LUKS.
> 
> This commit extends the qemu rbd driver API to enable qemu users to use
> this luks-any wildcard format.
> 
> Signed-off-by: Or Ozeri 
> ---
>  block/rbd.c  | 19 +++
>  qapi/block-core.json | 20 ++--
>  2 files changed, 37 insertions(+), 2 deletions(-)

Reviewed-by: Daniel P. Berrangé 


With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v6 10/33] hw/intc/i8259: Introduce i8259 proxy TYPE_ISA_PIC

2023-01-12 Thread Philippe Mathieu-Daudé

On 9/1/23 18:23, Bernhard Beschow wrote:

Having an i8259 proxy allows for ISA PICs to be created and wired up in
southbridges. This is especially interesting for PIIX3 for two reasons:
First, the southbridge doesn't need to care about the virtualization
technology used (KVM, TCG, Xen) due to in-IRQs (where devices get
attached) and out-IRQs (which will trigger the IRQs of the respective
virtualization technology) are separated. Second, since the in-IRQs are
populated with fully initialized qemu_irq's, they can already be wired
up inside PIIX3.

Cc: Mark Cave-Ayland 
Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Reviewed-by: Mark Cave-Ayland 
---
  include/hw/intc/i8259.h | 18 ++
  hw/intc/i8259.c | 27 +++
  2 files changed, 45 insertions(+)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v6 09/33] hw/intc/i8259: Make using the isa_pic singleton more type-safe

2023-01-12 Thread Philippe Mathieu-Daudé

On 9/1/23 18:23, Bernhard Beschow wrote:

This even spares some casts in hot code paths along the way.

Signed-off-by: Bernhard Beschow 
Reviewed-by: Michael S. Tsirkin 
Reviewed-by: Mark Cave-Ayland 
---
Note: The next patch will introduce a class "isa-pic", which is
shall not be confused with the isa_pic singleton.
---
  include/hw/intc/i8259.h |  6 +++---
  include/qemu/typedefs.h |  1 +
  hw/intc/i8259.c | 11 ---
  3 files changed, 8 insertions(+), 10 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v4 1/3] block/rbd: encryption nit fixes

2023-01-12 Thread Daniel P . Berrangé
On Sun, Nov 20, 2022 at 04:28:34AM -0600, Or Ozeri wrote:
> Add const modifier to passphrases,
> and remove redundant stack variable passphrase_len.
> 
> Signed-off-by: Or Ozeri 
> ---
>  block/rbd.c | 24 ++--
>  1 file changed, 10 insertions(+), 14 deletions(-)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index f826410f40..e575105e6d 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -330,7 +330,7 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
> char *keypairs_json,
>  #ifdef LIBRBD_SUPPORTS_ENCRYPTION
>  static int qemu_rbd_convert_luks_options(
>  RbdEncryptionOptionsLUKSBase *luks_opts,
> -char **passphrase,
> +const char **passphrase,
>  size_t *passphrase_len,
>  Error **errp)
>  {
> @@ -341,7 +341,7 @@ static int qemu_rbd_convert_luks_options(
>  static int qemu_rbd_convert_luks_create_options(
>  RbdEncryptionCreateOptionsLUKSBase *luks_opts,
>  rbd_encryption_algorithm_t *alg,
> -char **passphrase,
> +const char **passphrase,
>  size_t *passphrase_len,
>  Error **errp)
>  {
> @@ -384,8 +384,7 @@ static int qemu_rbd_encryption_format(rbd_image_t image,
>Error **errp)
>  {
>  int r = 0;
> -g_autofree char *passphrase = NULL;
> -size_t passphrase_len;
> +g_autofree const char *passphrase = NULL;

This looks wierd.  If it is as const string, why are
we free'ing it ?  Either want g_autofree, or const,
but not both.

>  rbd_encryption_format_t format;
>  rbd_encryption_options_t opts;
>  rbd_encryption_luks1_format_options_t luks_opts;
> @@ -407,12 +406,12 @@ static int qemu_rbd_encryption_format(rbd_image_t image,
>  opts_size = sizeof(luks_opts);
>  r = qemu_rbd_convert_luks_create_options(
>  
> qapi_RbdEncryptionCreateOptionsLUKS_base(&encrypt->u.luks),
> -&luks_opts.alg, &passphrase, &passphrase_len, errp);
> +&luks_opts.alg, &passphrase, &luks_opts.passphrase_size,
> +errp);
>  if (r < 0) {
>  return r;
>  }
>  luks_opts.passphrase = passphrase;
> -luks_opts.passphrase_size = passphrase_len;
>  break;
>  }
>  case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS2: {
> @@ -423,12 +422,12 @@ static int qemu_rbd_encryption_format(rbd_image_t image,
>  r = qemu_rbd_convert_luks_create_options(
>  qapi_RbdEncryptionCreateOptionsLUKS2_base(
>  &encrypt->u.luks2),
> -&luks2_opts.alg, &passphrase, &passphrase_len, errp);
> +&luks2_opts.alg, &passphrase, 
> &luks2_opts.passphrase_size,
> +errp);
>  if (r < 0) {
>  return r;
>  }
>  luks2_opts.passphrase = passphrase;
> -luks2_opts.passphrase_size = passphrase_len;
>  break;
>  }
>  default: {
> @@ -466,8 +465,7 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  Error **errp)
>  {
>  int r = 0;
> -g_autofree char *passphrase = NULL;
> -size_t passphrase_len;
> +g_autofree const char *passphrase = NULL;
>  rbd_encryption_luks1_format_options_t luks_opts;
>  rbd_encryption_luks2_format_options_t luks2_opts;
>  rbd_encryption_format_t format;
> @@ -482,12 +480,11 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  opts_size = sizeof(luks_opts);
>  r = qemu_rbd_convert_luks_options(
>  qapi_RbdEncryptionOptionsLUKS_base(&encrypt->u.luks),
> -&passphrase, &passphrase_len, errp);
> +&passphrase, &luks_opts.passphrase_size, errp);
>  if (r < 0) {
>  return r;
>  }
>  luks_opts.passphrase = passphrase;
> -luks_opts.passphrase_size = passphrase_len;
>  break;
>  }
>  case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS2: {
> @@ -497,12 +494,11 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  opts_size = sizeof(luks2_opts);
>  r = qemu_rbd_convert_luks_options(
>  qapi_RbdEncryptionOptionsLUKS2_base(&encrypt->u.luks2),
> -&passphrase, &passphrase_len, errp);
> +&passphrase, &luks2_opts.passphrase_size, errp);
>  if (r < 0) {
>  return r;
>  }
>  luks2_opts.passphrase = passphrase;
> -luks2_opts.passphrase_size = passphrase_len;
>  break;
>  }
>  default: {
> -- 
> 2.25.1
> 
> 

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://ent

Re: [PATCH v4 3/3] block/rbd: Add support for layered encryption

2023-01-12 Thread Ilya Dryomov
On Sun, Nov 20, 2022 at 11:28 AM Or Ozeri  wrote:
>
> Starting from ceph Reef, RBD has built-in support for layered encryption,
> where each ancestor image (in a cloned image setting) can be possibly
> encrypted using a unique passphrase.
>
> A new function, rbd_encryption_load2, was added to librbd API.
> This new function supports an array of passphrases (via "spec" structs).
>
> This commit extends the qemu rbd driver API to use this new librbd API,
> in order to support this new layered encryption feature.
>
> Signed-off-by: Or Ozeri 
> ---
>  block/rbd.c  | 161 ++-
>  qapi/block-core.json |  17 -
>  2 files changed, 175 insertions(+), 3 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 7feae45e82..157922e23a 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -71,6 +71,16 @@ static const char rbd_luks2_header_verification[
>  'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 2
>  };
>
> +static const char rbd_layered_luks_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 1
> +};
> +
> +static const char rbd_layered_luks2_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 2
> +};
> +
>  typedef enum {
>  RBD_AIO_READ,
>  RBD_AIO_WRITE,
> @@ -537,6 +547,136 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>
>  return 0;
>  }
> +
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +static int qemu_rbd_encryption_load2(rbd_image_t image,
> + RbdEncryptionOptions *encrypt,
> + Error **errp)
> +{
> +int r = 0;
> +int encrypt_count = 1;
> +int i;
> +RbdEncryptionOptions *curr_encrypt;
> +rbd_encryption_spec_t *specs;
> +rbd_encryption_luks1_format_options_t* luks_opts;
> +rbd_encryption_luks2_format_options_t* luks2_opts;
> +rbd_encryption_luks_format_options_t* luks_any_opts;

Hi Or,

Stick to the pointer alignment style used in this file:

rbd_encryption_luks1_format_options_t *luks_opts;
rbd_encryption_luks2_format_options_t *luks2_opts;
rbd_encryption_luks_format_options_t *luks_any_opts;

> +
> +/* count encryption options */
> +for (curr_encrypt = encrypt; curr_encrypt->has_parent;

I think this needs to be rebased on top of 54fde4ff0621 ("qapi block:
Elide redundant has_FOO in generated C").  has_parent is probably not
a thing anymore.

> + curr_encrypt = curr_encrypt->parent) {
> +++encrypt_count;
> +}
> +
> +specs = g_new0(rbd_encryption_spec_t, encrypt_count);
> +
> +curr_encrypt = encrypt;
> +for (i = 0; i < encrypt_count; ++i) {
> +switch (curr_encrypt->format) {
> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS: {
> +specs[i].format = RBD_ENCRYPTION_FORMAT_LUKS1;
> +specs[i].opts_size =
> +sizeof(rbd_encryption_luks1_format_options_t);
> +
> +luks_opts = g_new0(rbd_encryption_luks1_format_options_t, 1);
> +specs[i].opts = luks_opts;

I would move opts_size assignment here and avoid repeating the type (and
similar for LUKS2 and LUKS cases):

specs[i].opts_size = sizeof(*luks_opts);

> +
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKS_base(
> +&curr_encrypt->u.luks),
> +&luks_opts->passphrase,
> +&luks_opts->passphrase_size,
> +errp);
> +break;
> +}
> +

No need to leave a blank line between case statements.

> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS2: {
> +specs[i].format = RBD_ENCRYPTION_FORMAT_LUKS2;
> +specs[i].opts_size =
> +sizeof(rbd_encryption_luks2_format_options_t);
> +
> +luks2_opts = g_new0(rbd_encryption_luks2_format_options_t, 
> 1);
> +specs[i].opts = luks2_opts;
> +
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKS2_base(
> +&curr_encrypt->u.luks2),
> +&luks2_opts->passphrase,
> +&luks2_opts->passphrase_size,
> +errp);
> +break;
> +}
> +
> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS_ANY: {
> +specs[i].format = RBD_ENCRYPTION_FORMAT_LUKS;
> +specs[i].opts_size =
> +sizeof(rbd_encryption_luks_format_options_t);
> +
> +luks_any_opts = g_new0(rbd_encryption_luks_format_options_t, 
> 1);
> +specs[i].opts = luks_any_opts;
> +
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKSAny_base(
> +  

Re: [PATCH-for-8.0] block/nbd: Add missing include

2023-01-12 Thread Philippe Mathieu-Daudé

Hi, can this reviewed patch get merged via a block tree?

On 25/11/22 18:53, Philippe Mathieu-Daudé wrote:

The inlined nbd_readXX() functions call beXX_to_cpu(), themselves
declared in . This fixes when refactoring:

   In file included from ../../block/nbd.c:44:
   include/block/nbd.h: In function 'nbd_read16':
   include/block/nbd.h:383:12: error: implicit declaration of function 
'be16_to_cpu' [-Werror=implicit-function-declaration]
 383 | *val = be##bits##_to_cpu(*val);  
   \
 |^~
   include/block/nbd.h:387:1: note: in expansion of macro 'DEF_NBD_READ_N'
 387 | DEF_NBD_READ_N(16) /* Defines nbd_read16(). */
 | ^~

Signed-off-by: Philippe Mathieu-Daudé 
---
  include/block/nbd.h | 1 +
  1 file changed, 1 insertion(+)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 4ede3b2bd0..a4c98169c3 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -24,6 +24,7 @@
  #include "io/channel-socket.h"
  #include "crypto/tlscreds.h"
  #include "qapi/error.h"
+#include "qemu/bswap.h"
  
  extern const BlockExportDriver blk_exp_nbd;
  





Re: Questions about how block devices use snapshots

2023-01-12 Thread Kevin Wolf
Am 11.01.2023 um 17:21 hat Zhiyong Ye geschrieben:
> Hi Kevin,
> 
> Can I ask again how base.img + diff.qcow2 can be re-merged into one image
> via qemu-img or hmp command when modified.img is discarded?

You can either use 'qemu-img commit' to copy all of the data from
diff.qcow2 back into base.img (this is probably what you want), or
'qemu-img rebase' to copy all of the data from base.img into diff.qcow2.

Kevin




Re: [PATCH] bulk: Rename TARGET_FMT_plx -> HWADDR_FMT_plx

2023-01-12 Thread Peter Maydell
On Tue, 10 Jan 2023 at 22:04, BALATON Zoltan  wrote:
>
> On Tue, 10 Jan 2023, Philippe Mathieu-Daudé wrote:
> > The 'hwaddr' type is defined in "exec/hwaddr.h" as:
> >
> >hwaddr is the type of a physical address
> >   (its size can be different from 'target_ulong').
> >
> > All definitions use the 'HWADDR_' prefix, except TARGET_FMT_plx:
> >
> > $ fgrep define include/exec/hwaddr.h
> > #define HWADDR_H
> > #define HWADDR_BITS 64
> > #define HWADDR_MAX UINT64_MAX
> > #define TARGET_FMT_plx "%016" PRIx64
> > ^^
> > #define HWADDR_PRId PRId64
> > #define HWADDR_PRIi PRIi64
> > #define HWADDR_PRIo PRIo64
> > #define HWADDR_PRIu PRIu64
> > #define HWADDR_PRIx PRIx64
>
> Why are there both TARGET_FMT_plx and HWADDR_PRIx? Why not just use
> HWADDR_PRIx instead?

TARGET_FMT_plx is part of a family of defines for printing
target_foo types; the rest are in cpu-defs.h. These all include the
'%' character. This is more convenient to use, but it's also
out-of-line with the C standard format macros like PRIx64.
The HWADDR_* macros take the approach of aligning with how you
use the C standard format macros.

As usual in QEMU, where there are two different ways of doing
things, it's probably because one of them is a lot older than
the other and written by a different person. In theory it would
be nice to apply some consistency here but it rarely seems
worth the effort of the bulk code edit.

thanks
-- PMM



Re: [PATCH v2] hw/misc/sifive_u_otp: Remove the deprecated OTP config with '-drive if=none'

2023-01-12 Thread Alistair Francis
On Thu, Jan 12, 2023 at 6:40 PM Thomas Huth  wrote:
>
> '-drive if=none' is meant for configuring back-end devices only, so this
> got marked as deprecated in QEMU 6.2. Users should now only use the new
> way with '-drive if=pflash' instead.
>
> Signed-off-by: Thomas Huth 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  docs/about/deprecated.rst   | 6 --
>  docs/about/removed-features.rst | 7 +++
>  hw/misc/sifive_u_otp.c  | 7 ---
>  3 files changed, 7 insertions(+), 13 deletions(-)
>
> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> index 68d29642d7..bfe8148490 100644
> --- a/docs/about/deprecated.rst
> +++ b/docs/about/deprecated.rst
> @@ -87,12 +87,6 @@ as short-form boolean values, and passed to plugins as 
> ``arg_name=on``.
>  However, short-form booleans are deprecated and full explicit ``arg_name=on``
>  form is preferred.
>
> -``-drive if=none`` for the sifive_u OTP device (since 6.2)
> -''
> -
> -Using ``-drive if=none`` to configure the OTP device of the sifive_u
> -RISC-V machine is deprecated. Use ``-drive if=pflash`` instead.
> -
>  ``-no-hpet`` (since 8.0)
>  
>
> diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
> index c918cabd1a..6bd0a2b4e4 100644
> --- a/docs/about/removed-features.rst
> +++ b/docs/about/removed-features.rst
> @@ -422,6 +422,13 @@ the value is hexadecimal.  That is, '0x20M' should be 
> written either as
>  ``tty`` and ``parport`` used to be aliases for ``serial`` and ``parallel``
>  respectively. The actual backend names should be used instead.
>
> +``-drive if=none`` for the sifive_u OTP device (removed in 8.0)
> +'''
> +
> +Use ``-drive if=pflash`` to configure the OTP device of the sifive_u
> +RISC-V machine instead.
> +
> +
>  QEMU Machine Protocol (QMP) commands
>  
>
> diff --git a/hw/misc/sifive_u_otp.c b/hw/misc/sifive_u_otp.c
> index 6d7fdb040a..8965f5c22a 100644
> --- a/hw/misc/sifive_u_otp.c
> +++ b/hw/misc/sifive_u_otp.c
> @@ -210,13 +210,6 @@ static void sifive_u_otp_realize(DeviceState *dev, Error 
> **errp)
>  sysbus_init_mmio(SYS_BUS_DEVICE(dev), &s->mmio);
>
>  dinfo = drive_get(IF_PFLASH, 0, 0);
> -if (!dinfo) {
> -dinfo = drive_get(IF_NONE, 0, 0);
> -if (dinfo) {
> -warn_report("using \"-drive if=none\" for the OTP is deprecated, 
> "
> -"use \"-drive if=pflash\" instead.");
> -}
> -}
>  if (dinfo) {
>  int ret;
>  uint64_t perm;
> --
> 2.31.1
>
>



Re: [PATCH] hw/misc/sifive_u_otp: Remove the deprecated OTP config with '-drive if=none'

2023-01-12 Thread Philippe Mathieu-Daudé

On 12/1/23 10:15, Philippe Mathieu-Daudé wrote:

On 12/1/23 09:29, Thomas Huth wrote:

'-drive if=none' is meant for configuring back-end devices only, so this
got marked as deprecated in QEMU 6.2. Users should now only use the new
way with '-drive if=pflash' instead.

Signed-off-by: Thomas Huth 
---
  docs/about/deprecated.rst   | 6 --
  docs/about/removed-features.rst | 7 +++
  hw/misc/sifive_u_otp.c  | 7 ---
  3 files changed, 7 insertions(+), 13 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 


-ENOTENOUGHCOFFEEYET I read v2 and meant to reply there :P



Re: [PATCH v2] hw/misc/sifive_u_otp: Remove the deprecated OTP config with '-drive if=none'

2023-01-12 Thread Philippe Mathieu-Daudé

On 12/1/23 09:39, Thomas Huth wrote:

'-drive if=none' is meant for configuring back-end devices only, so this
got marked as deprecated in QEMU 6.2. Users should now only use the new
way with '-drive if=pflash' instead.

Signed-off-by: Thomas Huth 
---
  docs/about/deprecated.rst   | 6 --
  docs/about/removed-features.rst | 7 +++
  hw/misc/sifive_u_otp.c  | 7 ---
  3 files changed, 7 insertions(+), 13 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH] hw/misc/sifive_u_otp: Remove the deprecated OTP config with '-drive if=none'

2023-01-12 Thread Philippe Mathieu-Daudé

On 12/1/23 09:29, Thomas Huth wrote:

'-drive if=none' is meant for configuring back-end devices only, so this
got marked as deprecated in QEMU 6.2. Users should now only use the new
way with '-drive if=pflash' instead.

Signed-off-by: Thomas Huth 
---
  docs/about/deprecated.rst   | 6 --
  docs/about/removed-features.rst | 7 +++
  hw/misc/sifive_u_otp.c  | 7 ---
  3 files changed, 7 insertions(+), 13 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




[PATCH v6 06/13] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one

2023-01-12 Thread Avihai Horon
vfio_devices_all_running_and_saving() is used to check if migration is
in pre-copy phase. This is done by checking if migration is in setup or
active states and if all VFIO devices are in pre-copy state, i.e.
_SAVING | _RUNNING.

In VFIO migration protocol v2 pre-copy support is made optional. Hence,
a matching v2 protocol pre-copy state can't be used here.

As preparation for adding v2 protocol, change
vfio_devices_all_running_and_saving() logic such that it doesn't use the
VFIO pre-copy state.

The new equivalent logic checks if migration is in active state and if
all VFIO devices are in running state [1]. No functional changes
intended.

[1] Note that checking if migration is in setup or active states and if
all VFIO devices are in running state doesn't guarantee that we are in
pre-copy phase, thus we check if migration is only in active state.

Signed-off-by: Avihai Horon 
---
 hw/vfio/common.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f6dd571549..3a35f4afad 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
+#include "migration/misc.h"
 #include "sysemu/tpm.h"
 
 VFIOGroupList vfio_group_list =
@@ -363,13 +364,16 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer 
*container)
 return true;
 }
 
-static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+/*
+ * Check if all VFIO devices are running and migration is active, which is
+ * essentially equivalent to the migration being in pre-copy phase.
+ */
+static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
 {
 VFIOGroup *group;
 VFIODevice *vbasedev;
-MigrationState *ms = migrate_get_current();
 
-if (!migration_is_setup_or_active(ms->state)) {
+if (!migration_is_active(migrate_get_current())) {
 return false;
 }
 
@@ -381,8 +385,7 @@ static bool 
vfio_devices_all_running_and_saving(VFIOContainer *container)
 return false;
 }
 
-if ((migration->device_state & VFIO_DEVICE_STATE_V1_SAVING) &&
-(migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
+if (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING) {
 continue;
 } else {
 return false;
@@ -461,7 +464,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
 };
 
 if (iotlb && container->dirty_pages_supported &&
-vfio_devices_all_running_and_saving(container)) {
+vfio_devices_all_running_and_mig_active(container)) {
 return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
 }
 
@@ -488,7 +491,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
 return -errno;
 }
 
-if (iotlb && vfio_devices_all_running_and_saving(container)) {
+if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
 cpu_physical_memory_set_dirty_range(iotlb->translated_addr, size,
 tcg_enabled() ? DIRTY_CLIENTS_ALL :
 DIRTY_CLIENTS_NOCODE);
-- 
2.26.3




[PATCH v6 03/13] vfio/migration: Fix NULL pointer dereference bug

2023-01-12 Thread Avihai Horon
As part of its error flow, vfio_vmstate_change() accesses
MigrationState->to_dst_file without any checks. This can cause a NULL
pointer dereference if the error flow is taken and
MigrationState->to_dst_file is not set.

For example, this can happen if VM is started or stopped not during
migration and vfio_vmstate_change() error flow is taken, as
MigrationState->to_dst_file is not set at that time.

Fix it by checking that MigrationState->to_dst_file is set before using
it.

Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of VM")
Signed-off-by: Avihai Horon 
Reviewed-by: Juan Quintela 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 hw/vfio/migration.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e1413ac90c..09fe7c1de2 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -743,7 +743,9 @@ static void vfio_vmstate_change(void *opaque, bool running, 
RunState state)
  */
 error_report("%s: Failed to set device state 0x%x", vbasedev->name,
  (migration->device_state & mask) | value);
-qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+if (migrate_get_current()->to_dst_file) {
+qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+}
 }
 vbasedev->migration->vm_running = running;
 trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
-- 
2.26.3




[PATCH v6 08/13] vfio/migration: Rename functions/structs related to v1 protocol

2023-01-12 Thread Avihai Horon
To avoid name collisions, rename functions and structs related to VFIO
migration protocol v1. This will allow the two protocols to co-exist
when v2 protocol is added, until v1 is removed. No functional changes
intended.

Signed-off-by: Avihai Horon 
Reviewed-by: Cédric Le Goater 
---
 include/hw/vfio/vfio-common.h |   2 +-
 hw/vfio/common.c  |   6 +-
 hw/vfio/migration.c   | 106 +-
 hw/vfio/trace-events  |  12 ++--
 4 files changed, 63 insertions(+), 63 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e573f5a9f1..bbaf72ba00 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -62,7 +62,7 @@ typedef struct VFIOMigration {
 struct VFIODevice *vbasedev;
 VMChangeStateEntry *vm_state;
 VFIORegion region;
-uint32_t device_state;
+uint32_t device_state_v1;
 int vm_running;
 Notifier migration_state;
 uint64_t pending_bytes;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3a35f4afad..550b2d7ded 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -355,8 +355,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer 
*container)
 return false;
 }
 
-if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
-&& (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
+if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+(migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
 return false;
 }
 }
@@ -385,7 +385,7 @@ static bool 
vfio_devices_all_running_and_mig_active(VFIOContainer *container)
 return false;
 }
 
-if (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING) {
+if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
 continue;
 } else {
 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 977da64411..9df859f4d3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -107,8 +107,8 @@ static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, 
size_t count,
  * an error is returned.
  */
 
-static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
-uint32_t value)
+static int vfio_migration_v1_set_state(VFIODevice *vbasedev, uint32_t mask,
+   uint32_t value)
 {
 VFIOMigration *migration = vbasedev->migration;
 VFIORegion *region = &migration->region;
@@ -145,8 +145,8 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, 
uint32_t mask,
 return ret;
 }
 
-migration->device_state = device_state;
-trace_vfio_migration_set_state(vbasedev->name, device_state);
+migration->device_state_v1 = device_state;
+trace_vfio_migration_v1_set_state(vbasedev->name, device_state);
 return 0;
 }
 
@@ -260,8 +260,8 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice 
*vbasedev, uint64_t *size)
 return ret;
 }
 
-static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
-uint64_t data_size)
+static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
+   uint64_t data_size)
 {
 VFIORegion *region = &vbasedev->migration->region;
 uint64_t data_offset = 0, size, report_size;
@@ -288,7 +288,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice 
*vbasedev,
 data_size = 0;
 }
 
-trace_vfio_load_state_device_data(vbasedev->name, data_offset, size);
+trace_vfio_v1_load_state_device_data(vbasedev->name, data_offset, 
size);
 
 while (size) {
 void *buf;
@@ -394,7 +394,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void 
*opaque)
 return qemu_file_get_error(f);
 }
 
-static void vfio_migration_cleanup(VFIODevice *vbasedev)
+static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
 {
 VFIOMigration *migration = vbasedev->migration;
 
@@ -405,13 +405,13 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
 
 /* -- */
 
-static int vfio_save_setup(QEMUFile *f, void *opaque)
+static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
 {
 VFIODevice *vbasedev = opaque;
 VFIOMigration *migration = vbasedev->migration;
 int ret;
 
-trace_vfio_save_setup(vbasedev->name);
+trace_vfio_v1_save_setup(vbasedev->name);
 
 qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
 
@@ -431,8 +431,8 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
 }
 }
 
-ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
-   VFIO_DEVICE_STATE_V1_SAVING);
+ret = vfio_migration_v1_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
+ 

[PATCH v6 12/13] vfio: Alphabetize migration section of VFIO trace-events file

2023-01-12 Thread Avihai Horon
Sort the migration section of VFIO trace events file alphabetically
and move two misplaced traces to common.c section.

Signed-off-by: Avihai Horon 
---
 hw/vfio/trace-events | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 60c49b2ecf..db9cb94952 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -119,6 +119,8 @@ vfio_region_sparse_mmap_header(const char *name, int index, 
int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) 
"sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t 
subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t 
bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 
0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
+vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu 
dirty @ 0x%"PRIx64" - 0x%"PRIx64
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group 
#%d"
@@ -148,20 +150,18 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) 
"%ux%u"
 vfio_display_edid_write_error(void) ""
 
 # migration.c
+vfio_load_cleanup(const char *name) " (%s)"
+vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " 
(%s) size 0x%"PRIx64" ret %d"
+vfio_migration_data_notifier(const char *name, uint64_t stopcopy_size) " (%s) 
stopcopy size 0x%"PRIx64
 vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
-vfio_vmstate_change(const char *name, int running, const char *reason, const 
char *dev_state) " (%s) running %d reason %s device state %s"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) 
state %s"
-vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data 
buffer size 0x%"PRIx64
+vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
+vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, 
uint64_t compatible, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" 
postcopy 0x%"PRIx64" compatible 0x%"PRIx64" stopcopy size 0x%"PRIx64
-vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
-vfio_load_device_config_state(const char *name) " (%s)"
-vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
-vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " 
(%s) size 0x%"PRIx64" ret %d"
-vfio_load_cleanup(const char *name) " (%s)"
-vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t 
bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 
0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
-vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu 
dirty @ 0x%"PRIx64" - 0x%"PRIx64
-vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
-vfio_migration_data_notifier(const char *name, uint64_t stopcopy_size) " (%s) 
stopcopy size 0x%"PRIx64
+vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data 
buffer size 0x%"PRIx64
+vfio_vmstate_change(const char *name, int running, const char *reason, const 
char *dev_state) " (%s) running %d reason %s device state %s"
-- 
2.26.3




[PATCH v6 05/13] migration/qemu-file: Add qemu_file_get_to_fd()

2023-01-12 Thread Avihai Horon
Add new function qemu_file_get_to_fd() that allows reading data from
QEMUFile and writing it straight into a given fd.

This will be used later in VFIO migration code.

Signed-off-by: Avihai Horon 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 migration/qemu-file.h |  1 +
 migration/qemu-file.c | 34 ++
 2 files changed, 35 insertions(+)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index fa13d04d78..9d0155a2a1 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -148,6 +148,7 @@ int qemu_file_shutdown(QEMUFile *f);
 QEMUFile *qemu_file_get_return_path(QEMUFile *f);
 void qemu_fflush(QEMUFile *f);
 void qemu_file_set_blocking(QEMUFile *f, bool block);
+int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
 
 void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
 void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 2d5f74ffc2..102ab3b439 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -940,3 +940,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
 {
 return file->ioc;
 }
+
+/*
+ * Read size bytes from QEMUFile f and write them to fd.
+ */
+int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
+{
+while (size) {
+size_t pending = f->buf_size - f->buf_index;
+ssize_t rc;
+
+if (!pending) {
+rc = qemu_fill_buffer(f);
+if (rc < 0) {
+return rc;
+}
+if (rc == 0) {
+return -EIO;
+}
+continue;
+}
+
+rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
+if (rc < 0) {
+return -errno;
+}
+if (rc == 0) {
+return -EIO;
+}
+f->buf_index += rc;
+size -= rc;
+}
+
+return 0;
+}
-- 
2.26.3




[PATCH v6 09/13] vfio/migration: Implement VFIO migration protocol v2

2023-01-12 Thread Avihai Horon
Implement the basic mandatory part of VFIO migration protocol v2.
This includes all functionality that is necessary to support
VFIO_MIGRATION_STOP_COPY part of the v2 protocol.

The two protocols, v1 and v2, will co-exist and in the following patches
v1 protocol code will be removed.

There are several main differences between v1 and v2 protocols:
- VFIO device state is now represented as a finite state machine instead
  of a bitmap.

- Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
  ioctl and normal read() and write() instead of the migration region.

- Pre-copy is made optional in v2 protocol. Support for pre-copy will be
  added later on.

Detailed information about VFIO migration protocol v2 and its difference
compared to v1 protocol can be found here [1].

[1]
https://lore.kernel.org/all/20220224142024.147653-10-yish...@nvidia.com/

Signed-off-by: Avihai Horon 
---
 include/hw/vfio/vfio-common.h |   5 +
 hw/vfio/common.c  |  19 +-
 hw/vfio/migration.c   | 455 +++---
 hw/vfio/trace-events  |   7 +
 4 files changed, 447 insertions(+), 39 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index bbaf72ba00..2ec3346fea 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,11 @@ typedef struct VFIOMigration {
 int vm_running;
 Notifier migration_state;
 uint64_t pending_bytes;
+enum vfio_device_mig_state device_state;
+int data_fd;
+void *data_buffer;
+size_t data_buffer_size;
+bool v2;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 550b2d7ded..dcaa77d2a8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -355,10 +355,18 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer 
*container)
 return false;
 }
 
-if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+if (!migration->v2 &&
+(vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
 (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
 return false;
 }
+
+if (migration->v2 &&
+(vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+ migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
+return false;
+}
 }
 }
 return true;
@@ -385,7 +393,14 @@ static bool 
vfio_devices_all_running_and_mig_active(VFIOContainer *container)
 return false;
 }
 
-if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
+if (!migration->v2 &&
+migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
+continue;
+}
+
+if (migration->v2 &&
+(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+ migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
 continue;
 } else {
 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 9df859f4d3..08f53189fa 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include "qemu/main-loop.h"
 #include "qemu/cutils.h"
+#include "qemu/units.h"
 #include 
 #include 
 
@@ -44,8 +45,103 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL)
 
+/*
+ * This is an arbitrary size based on migration of mlx5 devices, where 
typically
+ * total device migration size is on the order of 100s of MB. Testing with
+ * larger values, e.g. 128MB and 1GB, did not show a performance improvement.
+ */
+#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
+
 static int64_t bytes_transferred;
 
+static const char *mig_state_to_str(enum vfio_device_mig_state state)
+{
+switch (state) {
+case VFIO_DEVICE_STATE_ERROR:
+return "ERROR";
+case VFIO_DEVICE_STATE_STOP:
+return "STOP";
+case VFIO_DEVICE_STATE_RUNNING:
+return "RUNNING";
+case VFIO_DEVICE_STATE_STOP_COPY:
+return "STOP_COPY";
+case VFIO_DEVICE_STATE_RESUMING:
+return "RESUMING";
+case VFIO_DEVICE_STATE_RUNNING_P2P:
+return "RUNNING_P2P";
+default:
+return "UNKNOWN STATE";
+}
+}
+
+static int vfio_migration_set_state(VFIODevice *vbasedev,
+enum vfio_device_mig_state new_state,
+enum vfio_device_mig_state recover_state)
+{
+VFIOMigration *migration = vbasedev->migration;
+uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+  sizeof(struct vfio_device_feature_mig_state),
+  

[PATCH v6 04/13] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support

2023-01-12 Thread Avihai Horon
Currently, if IOMMU of a VFIO container doesn't support dirty page
tracking, migration is blocked. This is because a DMA-able VFIO device
can dirty RAM pages without updating QEMU about it, thus breaking the
migration.

However, this doesn't mean that migration can't be done at all.
In such case, allow migration and let QEMU VFIO code mark all pages
dirty.

This guarantees that all pages that might have gotten dirty are reported
back, and thus guarantees a valid migration even without VFIO IOMMU
dirty tracking support.

The motivation for this patch is the introduction of iommufd [1].
iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into its internal ops, allowing the usage of these IOCTLs
over iommufd. However, VFIO IOMMU dirty tracking is not supported by
this VFIO compatibility API.

This patch will allow migration by hosts that use the VFIO compatibility
API and prevent migration regressions caused by the lack of VFIO IOMMU
dirty tracking support.

[1]
https://lore.kernel.org/kvm/0-v6-a196d26f289e+11787-iommufd_...@nvidia.com/

Signed-off-by: Avihai Horon 
---
 hw/vfio/common.c| 20 ++--
 hw/vfio/migration.c |  3 +--
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 130e5d1dc7..f6dd571549 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -488,6 +488,12 @@ static int vfio_dma_unmap(VFIOContainer *container,
 return -errno;
 }
 
+if (iotlb && vfio_devices_all_running_and_saving(container)) {
+cpu_physical_memory_set_dirty_range(iotlb->translated_addr, size,
+tcg_enabled() ? DIRTY_CLIENTS_ALL :
+DIRTY_CLIENTS_NOCODE);
+}
+
 return 0;
 }
 
@@ -1201,6 +1207,10 @@ static void vfio_set_dirty_page_tracking(VFIOContainer 
*container, bool start)
 .argsz = sizeof(dirty),
 };
 
+if (!container->dirty_pages_supported) {
+return;
+}
+
 if (start) {
 dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
 } else {
@@ -1236,6 +1246,13 @@ static int vfio_get_dirty_bitmap(VFIOContainer 
*container, uint64_t iova,
 uint64_t pages;
 int ret;
 
+if (!container->dirty_pages_supported) {
+cpu_physical_memory_set_dirty_range(ram_addr, size,
+tcg_enabled() ? DIRTY_CLIENTS_ALL :
+DIRTY_CLIENTS_NOCODE);
+return 0;
+}
+
 dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
 dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
@@ -1409,8 +1426,7 @@ static void vfio_listener_log_sync(MemoryListener 
*listener,
 {
 VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
-if (vfio_listener_skipped_section(section) ||
-!container->dirty_pages_supported) {
+if (vfio_listener_skipped_section(section)) {
 return;
 }
 
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 09fe7c1de2..552c2313b2 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -860,11 +860,10 @@ int64_t vfio_mig_bytes_transferred(void)
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
-VFIOContainer *container = vbasedev->group->container;
 struct vfio_region_info *info = NULL;
 int ret = -ENOTSUP;
 
-if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
+if (!vbasedev->enable_migration) {
 goto add_blocker;
 }
 
-- 
2.26.3




[PATCH v6 07/13] vfio/migration: Move migration v1 logic to vfio_migration_init()

2023-01-12 Thread Avihai Horon
Move vfio_dev_get_region_info() logic from vfio_migration_probe() to
vfio_migration_init(). This logic is specific to v1 protocol and moving
it will make it easier to add the v2 protocol implementation later.
No functional changes intended.

Signed-off-by: Avihai Horon 
Reviewed-by: Cédric Le Goater 
---
 hw/vfio/migration.c  | 30 +++---
 hw/vfio/trace-events |  2 +-
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 552c2313b2..977da64411 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -788,14 +788,14 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
 vbasedev->migration = NULL;
 }
 
-static int vfio_migration_init(VFIODevice *vbasedev,
-   struct vfio_region_info *info)
+static int vfio_migration_init(VFIODevice *vbasedev)
 {
 int ret;
 Object *obj;
 VFIOMigration *migration;
 char id[256] = "";
 g_autofree char *path = NULL, *oid = NULL;
+struct vfio_region_info *info;
 
 if (!vbasedev->ops->vfio_get_object) {
 return -EINVAL;
@@ -806,6 +806,14 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 return -EINVAL;
 }
 
+ret = vfio_get_dev_region_info(vbasedev,
+   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
+   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
+   &info);
+if (ret) {
+return ret;
+}
+
 vbasedev->migration = g_new0(VFIOMigration, 1);
 vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
 vbasedev->migration->vm_running = runstate_is_running();
@@ -825,6 +833,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 goto err;
 }
 
+g_free(info);
+
 migration = vbasedev->migration;
 migration->vbasedev = vbasedev;
 
@@ -847,6 +857,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 return 0;
 
 err:
+g_free(info);
 vfio_migration_exit(vbasedev);
 return ret;
 }
@@ -860,34 +871,23 @@ int64_t vfio_mig_bytes_transferred(void)
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
-struct vfio_region_info *info = NULL;
 int ret = -ENOTSUP;
 
 if (!vbasedev->enable_migration) {
 goto add_blocker;
 }
 
-ret = vfio_get_dev_region_info(vbasedev,
-   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
-   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
-   &info);
-if (ret) {
-goto add_blocker;
-}
-
-ret = vfio_migration_init(vbasedev, info);
+ret = vfio_migration_init(vbasedev);
 if (ret) {
 goto add_blocker;
 }
 
-trace_vfio_migration_probe(vbasedev->name, info->index);
-g_free(info);
+trace_vfio_migration_probe(vbasedev->name);
 return 0;
 
 add_blocker:
 error_setg(&vbasedev->migration_blocker,
"VFIO device doesn't support migration");
-g_free(info);
 
 ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
 if (ret < 0) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 73dffe9e00..b259dcc644 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -148,7 +148,7 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) 
"%ux%u"
 vfio_display_edid_write_error(void) ""
 
 # migration.c
-vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
+vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, 
uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) 
state %s"
-- 
2.26.3




[PATCH v6 13/13] docs/devel: Align VFIO migration docs to v2 protocol

2023-01-12 Thread Avihai Horon
Now that VFIO migration protocol v2 has been implemented and v1 protocol
has been removed, update the documentation according to v2 protocol.

Signed-off-by: Avihai Horon 
---
 docs/devel/vfio-migration.rst | 68 ---
 1 file changed, 30 insertions(+), 38 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 9ff6163c88..1d50c2fe5f 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,46 +7,39 @@ the guest is running on source host and restoring this saved 
state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices consists of two phases: the optional pre-copy phase,
-and the stop-and-copy phase. The pre-copy phase is iterative and allows to
-accommodate VFIO devices that have a large amount of data that needs to be
-transferred. The iterative pre-copy phase of migration allows for the guest to
-continue whilst the VFIO device state is transferred to the destination, this
-helps to reduce the total downtime of the VM. VFIO devices can choose to skip
-the pre-copy phase of migration by returning pending_bytes as zero during the
-pre-copy phase.
+Migration of VFIO devices currently consists of a single stop-and-copy phase.
+During the stop-and-copy phase the guest is stopped and the entire VFIO device
+data is transferred to the destination.
+
+The pre-copy phase of migration is currently not supported for VFIO devices.
+Support for VFIO pre-copy will be added later on.
 
 A detailed description of the UAPI for VFIO device migration can be found in
-the comment for the ``vfio_device_migration_info`` structure in the header
-file linux-headers/linux/vfio.h.
+the comment for the ``vfio_device_mig_state`` structure in the header file
+linux-headers/linux/vfio.h.
 
 VFIO implements the device hooks for the iterative approach as follows:
 
-* A ``save_setup`` function that sets up the migration region and sets _SAVING
-  flag in the VFIO device state.
+* A ``save_setup`` function that sets up migration on the source.
 
-* A ``load_setup`` function that sets up the migration region on the
-  destination and sets _RESUMING flag in the VFIO device state.
+* A ``load_setup`` function that sets the VFIO device on the destination in
+  _RESUMING state.
 
 * A ``save_live_pending`` function that reads pending_bytes from the vendor
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
-* A ``save_live_iterate`` function that reads the VFIO device's data from the
-  vendor driver through the migration region during iterative phase.
-
 * A ``save_state`` function to save the device config space if it is present.
 
-* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
-  VFIO device state and iteratively copies the remaining data for the VFIO
-  device until the vendor driver indicates that no data remains (pending bytes
-  is zero).
+* A ``save_live_complete_precopy`` function that sets the VFIO device in
+  _STOP_COPY state and iteratively copies the data for the VFIO device until
+  the vendor driver indicates that no data remains.
 
 * A ``load_state`` function that loads the config section and the data
-  sections that are generated by the save functions above
+  sections that are generated by the save functions above.
 
 * ``cleanup`` functions for both save and load that perform any migration
-  related cleanup, including unmapping the migration region
+  related cleanup.
 
 
 The VFIO migration code uses a VM state change handler to change the VFIO
@@ -71,13 +64,13 @@ tracking can identify dirtied pages, but any page pinned by 
the vendor driver
 can also be written by the device. There is currently no device or IOMMU
 support for dirty page tracking in hardware.
 
-By default, dirty pages are tracked when the device is in pre-copy as well as
-stop-and-copy phase. So, a page pinned by the vendor driver will be copied to
-the destination in both phases. Copying dirty pages in pre-copy phase helps
-QEMU to predict if it can achieve its downtime tolerances. If QEMU during
-pre-copy phase keeps finding dirty pages continuously, then it understands
-that even in stop-and-copy phase, it is likely to find dirty pages and can
-predict the downtime accordingly.
+By default, dirty pages are tracked during pre-copy as well as stop-and-copy
+phase. So, a page pinned by the vendor driver will be copied to the destination
+in both phases. Copying dirty pages in pre-copy phase helps QEMU to predict if
+it can achieve its downtime tolerances. If QEMU during pre-copy phase keeps
+finding dirty pages continuously, then it understands that even in 
stop-and-copy
+phase, it is likely to find dirty pages and can predict the downtime
+accordingly.
 
 QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
 which disables querying the dirty bitmap durin

[PATCH v6 10/13] vfio/migration: Optimize vfio_save_pending()

2023-01-12 Thread Avihai Horon
During pre-copy phase of migration vfio_save_pending() is called
repeatedly and queries the VFIO device for its pending data size.

As long as pending RAM size is over the threshold, migration can't
converge and be completed. Therefore, during this time there is no
point in querying the VFIO device pending data size.

Avoid these unnecessary queries by issuing them in a RAM pre-copy
notifier instead of vfio_save_pending().

This way the VFIO device is queried only when RAM pending data is
below the threshold, when there is an actual chance for migration to
converge.

Signed-off-by: Avihai Horon 
---
 include/hw/vfio/vfio-common.h |  2 ++
 hw/vfio/migration.c   | 56 +++
 hw/vfio/trace-events  |  1 +
 3 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 2ec3346fea..113f8d9208 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -65,11 +65,13 @@ typedef struct VFIOMigration {
 uint32_t device_state_v1;
 int vm_running;
 Notifier migration_state;
+NotifierWithReturn migration_data;
 uint64_t pending_bytes;
 enum vfio_device_mig_state device_state;
 int data_fd;
 void *data_buffer;
 size_t data_buffer_size;
+uint64_t stop_copy_size;
 bool v2;
 } VFIOMigration;
 
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 08f53189fa..04f4397212 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -655,29 +655,19 @@ static void vfio_v1_save_cleanup(void *opaque)
 trace_vfio_save_cleanup(vbasedev->name);
 }
 
-/*
- * Migration size of VFIO devices can be as little as a few KBs or as big as
- * many GBs. This value should be big enough to cover the worst case.
- */
-#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
 static void vfio_save_pending(void *opaque, uint64_t threshold_size,
   uint64_t *res_precopy_only,
   uint64_t *res_compatible,
   uint64_t *res_postcopy_only)
 {
 VFIODevice *vbasedev = opaque;
-uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
+VFIOMigration *migration = vbasedev->migration;
 
-/*
- * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
- * reported so downtime limit won't be violated.
- */
-vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
-*res_precopy_only += stop_copy_size;
+*res_precopy_only += migration->stop_copy_size;
 
 trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
 *res_postcopy_only, *res_compatible,
-stop_copy_size);
+migration->stop_copy_size);
 }
 
 static void vfio_v1_save_pending(void *opaque, uint64_t threshold_size,
@@ -1104,6 +1094,40 @@ static void vfio_migration_state_notifier(Notifier 
*notifier, void *data)
 }
 }
 
+/*
+ * Migration size of VFIO devices can be as little as a few KBs or as big as
+ * many GBs. This value should be big enough to cover the worst case.
+ */
+#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
+static int vfio_migration_data_notifier(NotifierWithReturn *n, void *data)
+{
+VFIOMigration *migration = container_of(n, VFIOMigration, migration_data);
+VFIODevice *vbasedev = migration->vbasedev;
+PrecopyNotifyData *pnd = data;
+
+if (pnd->reason != PRECOPY_NOTIFY_AFTER_BITMAP_SYNC) {
+return 0;
+}
+
+/* No need to get pending size when finishing migration */
+if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
+return 0;
+}
+
+if (vfio_query_stop_copy_size(vbasedev, &migration->stop_copy_size)) {
+/*
+ * Failed to get pending migration size. Report big pending size so
+ * downtime limit won't be violated.
+ */
+migration->stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
+}
+
+trace_vfio_migration_data_notifier(vbasedev->name,
+   migration->stop_copy_size);
+
+return 0;
+}
+
 static void vfio_migration_exit(VFIODevice *vbasedev)
 {
 VFIOMigration *migration = vbasedev->migration;
@@ -1225,6 +1249,9 @@ static int vfio_migration_init(VFIODevice *vbasedev)
 
 migration->vm_state = qdev_add_vm_change_state_handler(
 vbasedev->dev, vfio_vmstate_change, vbasedev);
+
+migration->migration_data.notify = vfio_migration_data_notifier;
+precopy_add_notifier(&migration->migration_data);
 } else {
 register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
  &savevm_vfio_v1_handlers, vbasedev);
@@ -1283,6 +1310,9 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
 if (vbasedev->migration) {
 VFIOMigration *migration = vbasedev->migration;
 
+if (migration->v2) {
+precopy_remove_notifier(&migration->migration_data);
+}
 remove_migration_state_change_notifier(&mi

[PATCH v6 11/13] vfio/migration: Remove VFIO migration protocol v1

2023-01-12 Thread Avihai Horon
Now that v2 protocol implementation has been added, remove the
deprecated v1 implementation.

Signed-off-by: Avihai Horon 
---
 include/hw/vfio/vfio-common.h |   5 -
 hw/vfio/common.c  |  19 +-
 hw/vfio/migration.c   | 703 +-
 hw/vfio/trace-events  |   9 -
 4 files changed, 24 insertions(+), 712 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 113f8d9208..2aba45887c 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -61,18 +61,13 @@ typedef struct VFIORegion {
 typedef struct VFIOMigration {
 struct VFIODevice *vbasedev;
 VMChangeStateEntry *vm_state;
-VFIORegion region;
-uint32_t device_state_v1;
-int vm_running;
 Notifier migration_state;
 NotifierWithReturn migration_data;
-uint64_t pending_bytes;
 enum vfio_device_mig_state device_state;
 int data_fd;
 void *data_buffer;
 size_t data_buffer_size;
 uint64_t stop_copy_size;
-bool v2;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index dcaa77d2a8..9a0dbee6b4 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -355,14 +355,7 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer 
*container)
 return false;
 }
 
-if (!migration->v2 &&
-(vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
-(migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
-return false;
-}
-
-if (migration->v2 &&
-(vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
 (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
  migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
 return false;
@@ -393,14 +386,8 @@ static bool 
vfio_devices_all_running_and_mig_active(VFIOContainer *container)
 return false;
 }
 
-if (!migration->v2 &&
-migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
-continue;
-}
-
-if (migration->v2 &&
-(migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
- migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
+if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
 continue;
 } else {
 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 04f4397212..7688c83127 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -142,220 +142,6 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
 return 0;
 }
 
-static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
-  off_t off, bool iswrite)
-{
-int ret;
-
-ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
-pread(vbasedev->fd, val, count, off);
-if (ret < count) {
-error_report("vfio_mig_%s %d byte %s: failed at offset 0x%"
- HWADDR_PRIx", err: %s", iswrite ? "write" : "read", count,
- vbasedev->name, off, strerror(errno));
-return (ret < 0) ? ret : -EINVAL;
-}
-return 0;
-}
-
-static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
-   off_t off, bool iswrite)
-{
-int ret, done = 0;
-__u8 *tbuf = buf;
-
-while (count) {
-int bytes = 0;
-
-if (count >= 8 && !(off % 8)) {
-bytes = 8;
-} else if (count >= 4 && !(off % 4)) {
-bytes = 4;
-} else if (count >= 2 && !(off % 2)) {
-bytes = 2;
-} else {
-bytes = 1;
-}
-
-ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
-if (ret) {
-return ret;
-}
-
-count -= bytes;
-done += bytes;
-off += bytes;
-tbuf += bytes;
-}
-return done;
-}
-
-#define vfio_mig_read(f, v, c, o)   vfio_mig_rw(f, (__u8 *)v, c, o, false)
-#define vfio_mig_write(f, v, c, o)  vfio_mig_rw(f, (__u8 *)v, c, o, true)
-
-#define VFIO_MIG_STRUCT_OFFSET(f)   \
- offsetof(struct vfio_device_migration_info, f)
-/*
- * Change the device_state register for device @vbasedev. Bits set in @mask
- * are preserved, bits set in @value are set, and bits not set in either @mask
- * or @value are cleared in device_state. If the register cannot be accessed,
- * the resulting state would be invalid, or the device enters an error state,
- * an error is returned.
- */
-
-static int vfio_migration_v1_set_state(VFIODevice *vbasedev, uint32_t mask,
-

[PATCH v6 00/13] vfio/migration: Implement VFIO migration protocol v2

2023-01-12 Thread Avihai Horon
Hello,

Here is v6 of the series.
Thanks for reviewing!

Following VFIO migration protocol v2 acceptance in kernel, this series
implements VFIO migration according to the new v2 protocol and replaces
the now deprecated v1 implementation.

The main differences between v1 and v2 migration protocols are:
1. VFIO device state is represented as a finite state machine instead of
   a bitmap.

2. The migration interface with kernel is done using VFIO_DEVICE_FEATURE
   ioctl and normal read() and write() instead of the migration region
   used in v1.

3. Pre-copy is made optional in v2 protocol. Support for pre-copy will
   be added later on.

Full description of the v2 protocol and the differences from v1 can be
found here [1].



Patch list:

Patch 1 updates linux headers so we will have the MIG_DATA_SIZE ioctl.

Patches 2-8 are prep patches fixing bugs, adding QEMUFile function
that will be used later and refactoring v1 protocol code to make it
easier to add v2 protocol.

Patches 9-13 implement v2 protocol and remove v1 protocol.

Thanks.



Changes from v5 [2]:
- Dropped patch #3.
- Simplified patch #5 as per Alex's suggestion.
- Changed qemu_file_get_to_fd() to return -EIO instead of -1, as
  suggested by Cedric.
  Also changed it so now write returns -errno instead of -1 on error.
- Fixed compilation error reported by Cedric.
- Changed vfio_migration_query_flags() to print error message and return
  -errno in error case as suggested by Cedric.
- Added Reviewed-by tags.



Changes from v4 [3]:
- Rebased on latest master branch.
- Added linux header update to kernel v6.2-rc1.
- Merged preview patches (#13-14) into this series.



Changes from v3 [4]:
- Rebased on latest master branch.

- Dropped patch #1 "migration: Remove res_compatible parameter" as
  it's not mandatory to this series and needs some further discussion.

- Dropped patch #3 "migration: Block migration comment or code is
  wrong" as it has been merged already.

- Addressed overlooked corner case reported by Vladimir in patch #4
  "migration: Simplify migration_iteration_run()".

- Dropped patch #5 "vfio/migration: Fix wrong enum usage" as it has
  been merged already.

- In patch #12 "vfio/migration: Implement VFIO migration protocol v2":
  1. Changed vfio_save_pending() to update res_precopy_only instead of
 res_postcopy_only (as VFIO migration doesn’t support postcopy).
  2. Moved VFIOMigration->data_buffer allocation to vfio_save_setup()
 and its de-allocation to vfio_save_cleanup(), so now it's
 allocated when actually used (during migration and only on source
 side).

- Addressed Alex's comments:
  1. Eliminated code duplication in patch #7 "vfio/migration: Allow
 migration without VFIO IOMMU dirty tracking support".
  2. Removed redundant initialization of vfio_region_info in patch #10
 "vfio/migration: Move migration v1 logic to vfio_migration_init()".
  3. Added comment about VFIO_MIG_DATA_BUFFER_SIZE heuristic (and
 renamed to VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE).
  4. Cast migration structs to their actual types instead of void *.
  5. Return -errno and -EBADF instead of -1 in vfio_migration_set_state().
  6. Set migration->device_state to new_state even in case of data_fd
 out of sync. Although migration will be aborted, setting device
 state succeeded so we should reflect that.
  7. Renamed VFIO_MIG_PENDING_SIZE to VFIO_MIG_STOP_COPY_SIZE, set it
 to 100G and added a comment about the size choice.
  8. Changed vfio_save_block() to return -errno on error.
  9. Squashed Patch #14 to patch #12.
  10. Adjusted migration data buffer size according to MIG_DATA_SIZE
  ioctl.

- In preview patch #17 "vfio/migration: Query device data size in
  vfio_save_pending()" - changed vfio_save_pending() to report
  VFIO_MIG_STOP_COPY_SIZE on any error.
   
- Added another preview patch "vfio/migration: Optimize
  vfio_save_pending()".

- Added ret value on some traces as suggested by David.

- Added Reviewed-By tags.



Changes from v2 [5]:
- Rebased on top of latest master branch.

- Added relevant patches from Juan's RFC [6] with minor changes:
  1. Added Reviewed-by tag to patch #3 in the RFC.
  2. Adjusted patch #6 to work without patch #4 in the RFC.

- Added a new patch "vfio/migration: Fix wrong enum usage" that fixes a
  small bug in v1 code. This patch has been sent a few weeks ago [7] but
  wasn't taken yet.

- Patch #2 (vfio/migration: Skip pre-copy if dirty page tracking is not
  supported):
  1. Dropped this patch and replaced it with
 "vfio/migration: Allow migration without VFIO IOMMU dirty tracking
 support".
 The new patch uses a different approach – instead of skipping
 pre-copy phase completely, QEMU VFIO code will mark RAM dirty
 (instead of kernel). This ensures that current migration behavior
 is not changed and SLA is taken into account.

- Patch #4 (vfio/common: Change vfio_devices_all_running_and_saving()
  logic to equivalent one):
  1. Improved commit message

[PATCH v6 02/13] migration: No save_live_pending() method uses the QEMUFile parameter

2023-01-12 Thread Avihai Horon
From: Juan Quintela 

So remove it everywhere.

Signed-off-by: Juan Quintela 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Dr. David Alan Gilbert 
---
 include/migration/register.h   | 3 +--
 migration/savevm.h | 3 +--
 hw/s390x/s390-stattrib.c   | 2 +-
 hw/vfio/migration.c| 3 +--
 migration/block-dirty-bitmap.c | 3 +--
 migration/block.c  | 2 +-
 migration/migration.c  | 4 ++--
 migration/ram.c| 2 +-
 migration/savevm.c | 7 +++
 9 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/include/migration/register.h b/include/migration/register.h
index c1dcff0f90..eb6266a877 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -46,8 +46,7 @@ typedef struct SaveVMHandlers {
 
 /* This runs outside the iothread lock!  */
 int (*save_setup)(QEMUFile *f, void *opaque);
-void (*save_live_pending)(QEMUFile *f, void *opaque,
-  uint64_t threshold_size,
+void (*save_live_pending)(void *opaque, uint64_t threshold_size,
   uint64_t *res_precopy_only,
   uint64_t *res_compatible,
   uint64_t *res_postcopy_only);
diff --git a/migration/savevm.h b/migration/savevm.h
index 6461342cb4..6dec468cc3 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -40,8 +40,7 @@ void qemu_savevm_state_cleanup(void);
 void qemu_savevm_state_complete_postcopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
bool inactivate_disks);
-void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
-   uint64_t *res_precopy_only,
+void qemu_savevm_state_pending(uint64_t max_size, uint64_t *res_precopy_only,
uint64_t *res_compatible,
uint64_t *res_postcopy_only);
 void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
index 9eda1c3b2a..a553a1e850 100644
--- a/hw/s390x/s390-stattrib.c
+++ b/hw/s390x/s390-stattrib.c
@@ -182,7 +182,7 @@ static int cmma_save_setup(QEMUFile *f, void *opaque)
 return 0;
 }
 
-static void cmma_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
+static void cmma_save_pending(void *opaque, uint64_t max_size,
   uint64_t *res_precopy_only,
   uint64_t *res_compatible,
   uint64_t *res_postcopy_only)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index c74453e0b5..e1413ac90c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -456,8 +456,7 @@ static void vfio_save_cleanup(void *opaque)
 trace_vfio_save_cleanup(vbasedev->name);
 }
 
-static void vfio_save_pending(QEMUFile *f, void *opaque,
-  uint64_t threshold_size,
+static void vfio_save_pending(void *opaque, uint64_t threshold_size,
   uint64_t *res_precopy_only,
   uint64_t *res_compatible,
   uint64_t *res_postcopy_only)
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 283017d7d3..ffc433cd11 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -761,8 +761,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void 
*opaque)
 return 0;
 }
 
-static void dirty_bitmap_save_pending(QEMUFile *f, void *opaque,
-  uint64_t max_size,
+static void dirty_bitmap_save_pending(void *opaque, uint64_t max_size,
   uint64_t *res_precopy_only,
   uint64_t *res_compatible,
   uint64_t *res_postcopy_only)
diff --git a/migration/block.c b/migration/block.c
index 4347da1526..b6a98caf78 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -862,7 +862,7 @@ static int block_save_complete(QEMUFile *f, void *opaque)
 return 0;
 }
 
-static void block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
+static void block_save_pending(void *opaque, uint64_t max_size,
uint64_t *res_precopy_only,
uint64_t *res_compatible,
uint64_t *res_postcopy_only)
diff --git a/migration/migration.c b/migration/migration.c
index 52b5d39244..9795d0ec5c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3751,8 +3751,8 @@ static MigIterateState 
migration_iteration_run(MigrationState *s)
 uint64_t pending_size, pend_pre, pend_compat, pend_post;
 bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
 
-qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
-  &pend_compat, &pend_post);
+qemu_savevm

[PATCH v6 01/13] linux-headers: Update to v6.2-rc1

2023-01-12 Thread Avihai Horon
Update to commit 1b929c02afd3 ("Linux 6.2-rc1").

Signed-off-by: Avihai Horon 
---
 include/standard-headers/drm/drm_fourcc.h |  63 +++-
 include/standard-headers/linux/ethtool.h  |  81 -
 include/standard-headers/linux/fuse.h |  20 +-
 .../linux/input-event-codes.h |   4 +
 include/standard-headers/linux/pci_regs.h |   2 +
 include/standard-headers/linux/virtio_blk.h   |  19 ++
 include/standard-headers/linux/virtio_bt.h|   8 +
 include/standard-headers/linux/virtio_net.h   |   4 +
 linux-headers/asm-arm64/kvm.h |   1 +
 linux-headers/asm-generic/hugetlb_encode.h|  26 +-
 linux-headers/asm-generic/mman-common.h   |   2 +
 linux-headers/asm-mips/mman.h |   2 +
 linux-headers/asm-riscv/kvm.h |   7 +
 linux-headers/asm-x86/kvm.h   |  11 +-
 linux-headers/linux/kvm.h |  32 +-
 linux-headers/linux/psci.h|  14 +
 linux-headers/linux/userfaultfd.h |   4 +
 linux-headers/linux/vfio.h| 278 +-
 18 files changed, 522 insertions(+), 56 deletions(-)

diff --git a/include/standard-headers/drm/drm_fourcc.h 
b/include/standard-headers/drm/drm_fourcc.h
index 48b620cbef..69cab17b38 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -98,18 +98,42 @@ extern "C" {
 #define DRM_FORMAT_INVALID 0
 
 /* color index */
+#define DRM_FORMAT_C1  fourcc_code('C', '1', ' ', ' ') /* [7:0] 
C0:C1:C2:C3:C4:C5:C6:C7 1:1:1:1:1:1:1:1 eight pixels/byte */
+#define DRM_FORMAT_C2  fourcc_code('C', '2', ' ', ' ') /* [7:0] 
C0:C1:C2:C3 2:2:2:2 four pixels/byte */
+#define DRM_FORMAT_C4  fourcc_code('C', '4', ' ', ' ') /* [7:0] C0:C1 
4:4 two pixels/byte */
 #define DRM_FORMAT_C8  fourcc_code('C', '8', ' ', ' ') /* [7:0] C */
 
-/* 8 bpp Red */
+/* 1 bpp Darkness (inverse relationship between channel value and brightness) 
*/
+#define DRM_FORMAT_D1  fourcc_code('D', '1', ' ', ' ') /* [7:0] 
D0:D1:D2:D3:D4:D5:D6:D7 1:1:1:1:1:1:1:1 eight pixels/byte */
+
+/* 2 bpp Darkness (inverse relationship between channel value and brightness) 
*/
+#define DRM_FORMAT_D2  fourcc_code('D', '2', ' ', ' ') /* [7:0] 
D0:D1:D2:D3 2:2:2:2 four pixels/byte */
+
+/* 4 bpp Darkness (inverse relationship between channel value and brightness) 
*/
+#define DRM_FORMAT_D4  fourcc_code('D', '4', ' ', ' ') /* [7:0] D0:D1 
4:4 two pixels/byte */
+
+/* 8 bpp Darkness (inverse relationship between channel value and brightness) 
*/
+#define DRM_FORMAT_D8  fourcc_code('D', '8', ' ', ' ') /* [7:0] D */
+
+/* 1 bpp Red (direct relationship between channel value and brightness) */
+#define DRM_FORMAT_R1  fourcc_code('R', '1', ' ', ' ') /* [7:0] 
R0:R1:R2:R3:R4:R5:R6:R7 1:1:1:1:1:1:1:1 eight pixels/byte */
+
+/* 2 bpp Red (direct relationship between channel value and brightness) */
+#define DRM_FORMAT_R2  fourcc_code('R', '2', ' ', ' ') /* [7:0] 
R0:R1:R2:R3 2:2:2:2 four pixels/byte */
+
+/* 4 bpp Red (direct relationship between channel value and brightness) */
+#define DRM_FORMAT_R4  fourcc_code('R', '4', ' ', ' ') /* [7:0] R0:R1 
4:4 two pixels/byte */
+
+/* 8 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R8  fourcc_code('R', '8', ' ', ' ') /* [7:0] R */
 
-/* 10 bpp Red */
+/* 10 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R10 fourcc_code('R', '1', '0', ' ') /* [15:0] x:R 
6:10 little endian */
 
-/* 12 bpp Red */
+/* 12 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R12 fourcc_code('R', '1', '2', ' ') /* [15:0] x:R 
4:12 little endian */
 
-/* 16 bpp Red */
+/* 16 bpp Red (direct relationship between channel value and brightness) */
 #define DRM_FORMAT_R16 fourcc_code('R', '1', '6', ' ') /* [15:0] R 
little endian */
 
 /* 16 bpp RG */
@@ -204,7 +228,9 @@ extern "C" {
 #define DRM_FORMAT_VYUYfourcc_code('V', 'Y', 'U', 'Y') /* 
[31:0] Y1:Cb0:Y0:Cr0 8:8:8:8 little endian */
 
 #define DRM_FORMAT_AYUVfourcc_code('A', 'Y', 'U', 'V') /* 
[31:0] A:Y:Cb:Cr 8:8:8:8 little endian */
+#define DRM_FORMAT_AVUYfourcc_code('A', 'V', 'U', 'Y') /* [31:0] 
A:Cr:Cb:Y 8:8:8:8 little endian */
 #define DRM_FORMAT_XYUVfourcc_code('X', 'Y', 'U', 'V') /* [31:0] 
X:Y:Cb:Cr 8:8:8:8 little endian */
+#define DRM_FORMAT_XVUYfourcc_code('X', 'V', 'U', 'Y') /* [31:0] 
X:Cr:Cb:Y 8:8:8:8 little endian */
 #define DRM_FORMAT_VUY888  fourcc_code('V', 'U', '2', '4') /* [23:0] 
Cr:Cb:Y 8:8:8 little endian */
 #define DRM_FORMAT_VUY101010   fourcc_code('V', 'U', '3', '0') /* Y followed 
by U then V, 10:10:10. Non-linear modifier only */
 
@@ -717,6 +743,35 @@ extern "C" {
  */
 #define DRM_FORMAT_MOD_VIVANTE_SPLIT_SUPER_TILED fourcc_mod_code(VIVANTE, 

[PATCH v2] hw/misc/sifive_u_otp: Remove the deprecated OTP config with '-drive if=none'

2023-01-12 Thread Thomas Huth
'-drive if=none' is meant for configuring back-end devices only, so this
got marked as deprecated in QEMU 6.2. Users should now only use the new
way with '-drive if=pflash' instead.

Signed-off-by: Thomas Huth 
---
 docs/about/deprecated.rst   | 6 --
 docs/about/removed-features.rst | 7 +++
 hw/misc/sifive_u_otp.c  | 7 ---
 3 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 68d29642d7..bfe8148490 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -87,12 +87,6 @@ as short-form boolean values, and passed to plugins as 
``arg_name=on``.
 However, short-form booleans are deprecated and full explicit ``arg_name=on``
 form is preferred.
 
-``-drive if=none`` for the sifive_u OTP device (since 6.2)
-''
-
-Using ``-drive if=none`` to configure the OTP device of the sifive_u
-RISC-V machine is deprecated. Use ``-drive if=pflash`` instead.
-
 ``-no-hpet`` (since 8.0)
 
 
diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
index c918cabd1a..6bd0a2b4e4 100644
--- a/docs/about/removed-features.rst
+++ b/docs/about/removed-features.rst
@@ -422,6 +422,13 @@ the value is hexadecimal.  That is, '0x20M' should be 
written either as
 ``tty`` and ``parport`` used to be aliases for ``serial`` and ``parallel``
 respectively. The actual backend names should be used instead.
 
+``-drive if=none`` for the sifive_u OTP device (removed in 8.0)
+'''
+
+Use ``-drive if=pflash`` to configure the OTP device of the sifive_u
+RISC-V machine instead.
+
+
 QEMU Machine Protocol (QMP) commands
 
 
diff --git a/hw/misc/sifive_u_otp.c b/hw/misc/sifive_u_otp.c
index 6d7fdb040a..8965f5c22a 100644
--- a/hw/misc/sifive_u_otp.c
+++ b/hw/misc/sifive_u_otp.c
@@ -210,13 +210,6 @@ static void sifive_u_otp_realize(DeviceState *dev, Error 
**errp)
 sysbus_init_mmio(SYS_BUS_DEVICE(dev), &s->mmio);
 
 dinfo = drive_get(IF_PFLASH, 0, 0);
-if (!dinfo) {
-dinfo = drive_get(IF_NONE, 0, 0);
-if (dinfo) {
-warn_report("using \"-drive if=none\" for the OTP is deprecated, "
-"use \"-drive if=pflash\" instead.");
-}
-}
 if (dinfo) {
 int ret;
 uint64_t perm;
-- 
2.31.1




Re: [PATCH] hw/misc/sifive_u_otp: Remove the deprecated OTP config with '-drive if=none'

2023-01-12 Thread Thomas Huth

On 12/01/2023 09.29, Thomas Huth wrote:

'-drive if=none' is meant for configuring back-end devices only, so this
got marked as deprecated in QEMU 6.2. Users should now only use the new
way with '-drive if=pflash' instead.

Signed-off-by: Thomas Huth 
---
  docs/about/deprecated.rst   | 6 --
  docs/about/removed-features.rst | 7 +++
  hw/misc/sifive_u_otp.c  | 7 ---
  3 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 68d29642d7..bfe8148490 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -87,12 +87,6 @@ as short-form boolean values, and passed to plugins as 
``arg_name=on``.
  However, short-form booleans are deprecated and full explicit ``arg_name=on``
  form is preferred.
  
-``-drive if=none`` for the sifive_u OTP device (since 6.2)

-''
-
-Using ``-drive if=none`` to configure the OTP device of the sifive_u
-RISC-V machine is deprecated. Use ``-drive if=pflash`` instead.
-
  ``-no-hpet`` (since 8.0)
  
  
diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst

index c918cabd1a..b1cb15f3d9 100644
--- a/docs/about/removed-features.rst
+++ b/docs/about/removed-features.rst
@@ -422,6 +422,13 @@ the value is hexadecimal.  That is, '0x20M' should be 
written either as
  ``tty`` and ``parport`` used to be aliases for ``serial`` and ``parallel``
  respectively. The actual backend names should be used instead.
  
+``-drive if=none`` for the sifive_u OTP device (removed in 8.0)

+'''
+
+Using ``-drive if=none`` to configure the OTP device of the sifive_u
+RISC-V machine is deprecated. Use ``-drive if=pflash`` instead.


-ENOTENOUGHCOFFEEYET

I think I should adjust that description a little bit instead of blindly 
copy-n-pasting it... Sorry. I'll send a v2.


 Thomas





[PATCH] hw/misc/sifive_u_otp: Remove the deprecated OTP config with '-drive if=none'

2023-01-12 Thread Thomas Huth
'-drive if=none' is meant for configuring back-end devices only, so this
got marked as deprecated in QEMU 6.2. Users should now only use the new
way with '-drive if=pflash' instead.

Signed-off-by: Thomas Huth 
---
 docs/about/deprecated.rst   | 6 --
 docs/about/removed-features.rst | 7 +++
 hw/misc/sifive_u_otp.c  | 7 ---
 3 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 68d29642d7..bfe8148490 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -87,12 +87,6 @@ as short-form boolean values, and passed to plugins as 
``arg_name=on``.
 However, short-form booleans are deprecated and full explicit ``arg_name=on``
 form is preferred.
 
-``-drive if=none`` for the sifive_u OTP device (since 6.2)
-''
-
-Using ``-drive if=none`` to configure the OTP device of the sifive_u
-RISC-V machine is deprecated. Use ``-drive if=pflash`` instead.
-
 ``-no-hpet`` (since 8.0)
 
 
diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
index c918cabd1a..b1cb15f3d9 100644
--- a/docs/about/removed-features.rst
+++ b/docs/about/removed-features.rst
@@ -422,6 +422,13 @@ the value is hexadecimal.  That is, '0x20M' should be 
written either as
 ``tty`` and ``parport`` used to be aliases for ``serial`` and ``parallel``
 respectively. The actual backend names should be used instead.
 
+``-drive if=none`` for the sifive_u OTP device (removed in 8.0)
+'''
+
+Using ``-drive if=none`` to configure the OTP device of the sifive_u
+RISC-V machine is deprecated. Use ``-drive if=pflash`` instead.
+
+
 QEMU Machine Protocol (QMP) commands
 
 
diff --git a/hw/misc/sifive_u_otp.c b/hw/misc/sifive_u_otp.c
index 6d7fdb040a..8965f5c22a 100644
--- a/hw/misc/sifive_u_otp.c
+++ b/hw/misc/sifive_u_otp.c
@@ -210,13 +210,6 @@ static void sifive_u_otp_realize(DeviceState *dev, Error 
**errp)
 sysbus_init_mmio(SYS_BUS_DEVICE(dev), &s->mmio);
 
 dinfo = drive_get(IF_PFLASH, 0, 0);
-if (!dinfo) {
-dinfo = drive_get(IF_NONE, 0, 0);
-if (dinfo) {
-warn_report("using \"-drive if=none\" for the OTP is deprecated, "
-"use \"-drive if=pflash\" instead.");
-}
-}
 if (dinfo) {
 int ret;
 uint64_t perm;
-- 
2.31.1