Re: [PATCH v2] m25p80: Improve error when the backend file size does not match the device

2022-11-15 Thread Cédric Le Goater

On 11/16/22 07:56, Markus Armbruster wrote:

Cédric Le Goater  writes:


Currently, when a block backend is attached to a m25p80 device and the
associated file size does not match the flash model, QEMU complains
with the error message "failed to read the initial flash content".
This is confusing for the user.

Use blk_check_size_and_read_all() instead of blk_pread() to improve
the reported error.

Signed-off-by: Cédric Le Goater 
---
  hw/block/m25p80.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 02adc87527..68a757abf3 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -24,6 +24,7 @@
  #include "qemu/osdep.h"
  #include "qemu/units.h"
  #include "sysemu/block-backend.h"
+#include "hw/block/block.h"
  #include "hw/qdev-properties.h"
  #include "hw/qdev-properties-system.h"
  #include "hw/ssi/ssi.h"
@@ -1614,8 +1615,7 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
**errp)
  trace_m25p80_binding(s);
  s->storage = blk_blockalign(s->blk, s->size);
  
-if (blk_pread(s->blk, 0, s->size, s->storage, 0) < 0) {
-error_setg(errp, "failed to read the initial flash content");
+if (!blk_check_size_and_read_all(s->blk, s->storage, s->size, errp)) {
  return;
  }
  } else {


Ignorant question: what does blk_pread() do on a short read?  Does it fail?


An underlying call to blk_check_byte_request() makes it fail.


Or does it succeed, returning how much it read?  I tried to find an
answer in function comments, no luck.

Are there more instances of "we fill some fixed-size memory (such as a
ROM or flash) from a block backend"?


Yes. There are other similar devices: nand, nvram, pnv_pnor, etc.

C.
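
As background for the exchange above, here is a minimal sketch of the
size-check-then-read pattern that blk_check_size_and_read_all() provides.
The helper name matches the QEMU API the patch switches to, but the body
below is illustrative only, not the actual hw/block/block.c code:

#include "qemu/osdep.h"
#include "exec/hwaddr.h"
#include "sysemu/block-backend.h"
#include "qapi/error.h"

/* Illustrative sketch: verify the backend length matches the device
 * size before reading, so a mismatch yields a precise error instead
 * of a generic read failure. */
static bool check_size_and_read_all(BlockBackend *blk, void *buf,
                                    hwaddr size, Error **errp)
{
    int64_t blk_len = blk_getlength(blk);

    if (blk_len < 0) {
        error_setg_errno(errp, -blk_len, "can't get size of block backend");
        return false;
    }
    if (blk_len != size) {
        error_setg(errp, "device requires %" HWADDR_PRIu " bytes, "
                   "block backend provides %" PRId64 " bytes",
                   size, blk_len);
        return false;
    }
    if (blk_pread(blk, 0, size, buf, 0) < 0) {
        error_setg(errp, "can't read block backend");
        return false;
    }
    return true;
}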




Re: [PATCH v2] m25p80: Improve error when the backend file size does not match the device

2022-11-15 Thread Markus Armbruster
Cédric Le Goater  writes:

> Currently, when a block backend is attached to a m25p80 device and the
> associated file size does not match the flash model, QEMU complains
> with the error message "failed to read the initial flash content".
> This is confusing for the user.
>
> Use blk_check_size_and_read_all() instead of blk_pread() to improve
> the reported error.
>
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/block/m25p80.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
> index 02adc87527..68a757abf3 100644
> --- a/hw/block/m25p80.c
> +++ b/hw/block/m25p80.c
> @@ -24,6 +24,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "sysemu/block-backend.h"
> +#include "hw/block/block.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/qdev-properties-system.h"
>  #include "hw/ssi/ssi.h"
> @@ -1614,8 +1615,7 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
> **errp)
>  trace_m25p80_binding(s);
>  s->storage = blk_blockalign(s->blk, s->size);
>  
> -if (blk_pread(s->blk, 0, s->size, s->storage, 0) < 0) {
> -error_setg(errp, "failed to read the initial flash content");
> +if (!blk_check_size_and_read_all(s->blk, s->storage, s->size, errp)) 
> {
>  return;
>  }
>  } else {

Ignorant question: what does blk_pread() do on a short read?  Does it fail?
Or does it succeed, returning how much it read?  I tried to find an
answer in function comments, no luck.

Are there more instances of "we fill some fixed-size memory (such as a
ROM or flash) from a block backend"?




Re: [PATCH v2] m25p80: Improve error when the backend file size does not match the device

2022-11-15 Thread Alistair Francis
On Wed, Nov 16, 2022 at 1:13 AM Cédric Le Goater  wrote:
>
> Currently, when a block backend is attached to a m25p80 device and the
> associated file size does not match the flash model, QEMU complains
> with the error message "failed to read the initial flash content".
> This is confusing for the user.
>
> Use blk_check_size_and_read_all() instead of blk_pread() to improve
> the reported error.
>
> Signed-off-by: Cédric Le Goater 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  hw/block/m25p80.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
> index 02adc87527..68a757abf3 100644
> --- a/hw/block/m25p80.c
> +++ b/hw/block/m25p80.c
> @@ -24,6 +24,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "sysemu/block-backend.h"
> +#include "hw/block/block.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/qdev-properties-system.h"
>  #include "hw/ssi/ssi.h"
> @@ -1614,8 +1615,7 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
> **errp)
>  trace_m25p80_binding(s);
>  s->storage = blk_blockalign(s->blk, s->size);
>
> -if (blk_pread(s->blk, 0, s->size, s->storage, 0) < 0) {
> -error_setg(errp, "failed to read the initial flash content");
> +if (!blk_check_size_and_read_all(s->blk, s->storage, s->size, errp)) 
> {
>  return;
>  }
>  } else {
> --
> 2.38.1
>
>



Re: [PATCH v3 10/17] vfio/migration: Move migration v1 logic to vfio_migration_init()

2022-11-15 Thread Alex Williamson
On Thu, 3 Nov 2022 18:16:13 +0200
Avihai Horon  wrote:

> Move vfio_dev_get_region_info() logic from vfio_migration_probe() to
> vfio_migration_init(). This logic is specific to v1 protocol and moving
> it will make it easier to add the v2 protocol implementation later.
> No functional changes intended.
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/migration.c  | 30 +++---
>  hw/vfio/trace-events |  2 +-
>  2 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 99ffb75782..0e3a950746 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -785,14 +785,14 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
>  vbasedev->migration = NULL;
>  }
>  
> -static int vfio_migration_init(VFIODevice *vbasedev,
> -   struct vfio_region_info *info)
> +static int vfio_migration_init(VFIODevice *vbasedev)
>  {
>  int ret;
>  Object *obj;
>  VFIOMigration *migration;
>  char id[256] = "";
>  g_autofree char *path = NULL, *oid = NULL;
> +struct vfio_region_info *info = NULL;

Nit, I'm not spotting any cases where we need this initialization.  The
same is not true in the code the info handling was extracted from.
Thanks,

Alex
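
To make the nit concrete: the NULL initialization only matters when a
g_free() on an error path can run before the pointer is assigned. A
condensed, hypothetical illustration of the two shapes:

#include <glib.h>

/* Hypothetical condensation of the pattern under discussion: the
 * NULL initialization matters only if 'goto err' can be reached
 * before 'info' is assigned. */
static int probe(gboolean early_failure)
{
    char *info = NULL;

    if (early_failure) {
        goto err;               /* info is still NULL here */
    }

    info = g_strdup("region info");
    /* ... use info ... */
    g_free(info);
    return 0;

err:
    g_free(info);               /* g_free(NULL) is a safe no-op */
    return -1;
}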

>  
>  if (!vbasedev->ops->vfio_get_object) {
>  return -EINVAL;
> @@ -803,6 +803,14 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  return -EINVAL;
>  }
>  
> +ret = vfio_get_dev_region_info(vbasedev,
> +   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> +   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> +   &info);
> +if (ret) {
> +return ret;
> +}
> +
>  vbasedev->migration = g_new0(VFIOMigration, 1);
>  vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
>  vbasedev->migration->vm_running = runstate_is_running();
> @@ -822,6 +830,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  goto err;
>  }
>  
> +g_free(info);
> +
>  migration = vbasedev->migration;
>  migration->vbasedev = vbasedev;
>  
> @@ -844,6 +854,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  return 0;
>  
>  err:
> +g_free(info);
>  vfio_migration_exit(vbasedev);
>  return ret;
>  }
> @@ -857,34 +868,23 @@ int64_t vfio_mig_bytes_transferred(void)
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
> -struct vfio_region_info *info = NULL;
>  int ret = -ENOTSUP;
>  
>  if (!vbasedev->enable_migration) {
>  goto add_blocker;
>  }
>  
> -ret = vfio_get_dev_region_info(vbasedev,
> -   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> -   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> -   &info);
> +ret = vfio_migration_init(vbasedev);
>  if (ret) {
>  goto add_blocker;
>  }
>  
> -ret = vfio_migration_init(vbasedev, info);
> -if (ret) {
> -goto add_blocker;
> -}
> -
> -trace_vfio_migration_probe(vbasedev->name, info->index);
> -g_free(info);
> +trace_vfio_migration_probe(vbasedev->name);
>  return 0;
>  
>  add_blocker:
>  error_setg(&vbasedev->migration_blocker,
> "VFIO device doesn't support migration");
> -g_free(info);
>  
>  ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
>  if (ret < 0) {
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index a21cbd2a56..27c059f96e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -148,7 +148,7 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) 
> "%ux%u"
>  vfio_display_edid_write_error(void) ""
>  
>  # migration.c
> -vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> +vfio_migration_probe(const char *name) " (%s)"
>  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(const char *name, int running, const char *reason, 
> uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) 
> state %s"




Re: [PATCH v3 07/17] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support

2022-11-15 Thread Alex Williamson
On Thu, 3 Nov 2022 18:16:10 +0200
Avihai Horon  wrote:

> Currently, if IOMMU of a VFIO container doesn't support dirty page
> tracking, migration is blocked. This is because a DMA-able VFIO device
> can dirty RAM pages without updating QEMU about it, thus breaking the
> migration.
> 
> However, this doesn't mean that migration can't be done at all.
> In such case, allow migration and let QEMU VFIO code mark the entire
> bitmap dirty.
> 
> This guarantees that all pages that might have gotten dirty are reported
> back, and thus guarantees a valid migration even without VFIO IOMMU
> dirty tracking support.
> 
> The motivation for this patch is the future introduction of iommufd [1].
> iommufd will directly implement the /dev/vfio/vfio container IOCTLs by
> mapping them into its internal ops, allowing the usage of these IOCTLs
> over iommufd. However, VFIO IOMMU dirty tracking will not be supported
> by this VFIO compatibility API.
> 
> This patch will allow migration by hosts that use the VFIO compatibility
> API and prevent migration regressions caused by the lack of VFIO IOMMU
> dirty tracking support.
> 
> [1] https://lore.kernel.org/kvm/0-v2-f9436d0bde78+4bb-iommufd_...@nvidia.com/
> 
> Signed-off-by: Avihai Horon 
> ---
>  hw/vfio/common.c| 84 +
>  hw/vfio/migration.c |  3 +-
>  2 files changed, 70 insertions(+), 17 deletions(-)
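
The essence of the fallback described in the commit message is simply
pre-setting every bit before reporting, along these lines (an
illustrative sketch using QEMU's bitmap helpers; the function and its
parameters are hypothetical):

#include "qemu/osdep.h"
#include "qemu/bitmap.h"

/* Illustrative sketch: when the IOMMU cannot report dirty pages,
 * report all of them, trading precision for correctness. */
static unsigned long *get_dirty_bitmap(bool dirty_tracking_supported,
                                       int64_t pages)
{
    unsigned long *bitmap = bitmap_new(pages);   /* zero-initialized */

    if (!dirty_tracking_supported) {
        bitmap_set(bitmap, 0, pages);            /* mark every page dirty */
        return bitmap;
    }

    /* Otherwise, query the IOMMU for the real dirty bitmap here. */
    return bitmap;
}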

This duplicates quite a bit of code; I think we can integrate this into
a common flow quite a bit more.  See below, only compile tested. Thanks,

Alex

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6b5d8c0bf694..4117b40fd9b0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -397,17 +397,33 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
  IOMMUTLBEntry *iotlb)
 {
 struct vfio_iommu_type1_dma_unmap *unmap;
-struct vfio_bitmap *bitmap;
+struct vfio_bitmap *vbitmap;
+unsigned long *bitmap;
+uint64_t bitmap_size;
 uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
 int ret;
 
-unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+unmap = g_malloc0(sizeof(*unmap) + sizeof(*vbitmap));
 
-unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+unmap->argsz = sizeof(*unmap);
 unmap->iova = iova;
 unmap->size = size;
-unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
-bitmap = (struct vfio_bitmap *)&unmap->data;
+
+bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+  BITS_PER_BYTE;
+bitmap = g_try_malloc0(bitmap_size);
+if (!bitmap) {
+ret = -ENOMEM;
+goto unmap_exit;
+}
+
+if (!container->dirty_pages_supported) {
+bitmap_set(bitmap, 0, pages);
+goto do_unmap;
+}
+
+unmap->argsz += sizeof(*vbitmap);
+unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
 
 /*
  * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
@@ -415,33 +431,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
  * to qemu_real_host_page_size.
  */
 
-bitmap->pgsize = qemu_real_host_page_size();
-bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-   BITS_PER_BYTE;
+vbitmap = (struct vfio_bitmap *)&unmap->data;
+vbitmap->data = (__u64 *)bitmap;
+vbitmap->pgsize = qemu_real_host_page_size();
+vbitmap->size = bitmap_size;
 
-if (bitmap->size > container->max_dirty_bitmap_size) {
-error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
- (uint64_t)bitmap->size);
+if (bitmap_size > container->max_dirty_bitmap_size) {
+error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, bitmap_size);
 ret = -E2BIG;
 goto unmap_exit;
 }
 
-bitmap->data = g_try_malloc0(bitmap->size);
-if (!bitmap->data) {
-ret = -ENOMEM;
-goto unmap_exit;
-}
-
+do_unmap:
 ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
 if (!ret) {
-cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
-iotlb->translated_addr, pages);
+cpu_physical_memory_set_dirty_lebitmap(bitmap, iotlb->translated_addr,
+   pages);
 } else {
 error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
 }
 
-g_free(bitmap->data);
 unmap_exit:
+g_free(bitmap);
 g_free(unmap);
 return ret;
 }
@@ -460,8 +471,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
 .size = size,
 };
 
-if (iotlb && container->dirty_pages_supported &&
-vfio_devices_all_running_and_saving(container)) {
+if (iotlb && vfio_devices_all_running_and_saving(container)) {
 return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
 }
 
@@ -1257,6 +1267,10 @@ static void vfio_set_dirty_page_tracking(VFIOContainer 
*container, bool start)
 .argsz = sizeof(dirty),
 };
 
+if 

Re: [PATCH v2] m25p80: Improve error when the backend file size does not match the device

2022-11-15 Thread Peter Delevoryas
On Tue, Nov 15, 2022 at 04:50:11PM +0100, Philippe Mathieu-Daudé wrote:
> On 15/11/22 16:10, Cédric Le Goater wrote:
> > Currently, when a block backend is attached to a m25p80 device and the
> > associated file size does not match the flash model, QEMU complains
> > with the error message "failed to read the initial flash content".
> > This is confusing for the user.
> > 
> > Use blk_check_size_and_read_all() instead of blk_pread() to improve
> > the reported error.
> > 
> > Signed-off-by: Cédric Le Goater 
> > ---
> >   hw/block/m25p80.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Reviewed-by: Philippe Mathieu-Daudé 
> 

Thanks Cedric!

Reviewed-by: Peter Delevoryas 



Re: [PULL 00/30] Next patches

2022-11-15 Thread Stefan Hajnoczi
Please only include bug fixes for 7.2 in pull requests during QEMU
hard freeze. The AVX2 support has issues (see my other email) and
anything else that isn't a bug fix should be dropped too.

Stefan



Re: [PULL 00/30] Next patches

2022-11-15 Thread Stefan Hajnoczi
On Tue, 15 Nov 2022 at 10:40, Juan Quintela  wrote:
>
> The following changes since commit 98f10f0e2613ba1ac2ad3f57a5174014f6dcb03d:
>
>   Merge tag 'pull-target-arm-20221114' of 
> https://git.linaro.org/people/pmaydell/qemu-arm into staging (2022-11-14 
> 13:31:17 -0500)
>
> are available in the Git repository at:
>
>   https://gitlab.com/juan.quintela/qemu.git tags/next-pull-request
>
> for you to fetch changes up to d896a7a40db13fc2d05828c94ddda2747530089c:
>
>   migration: Block migration comment or code is wrong (2022-11-15 10:31:06 
> +0100)
>
> 
> Migration PULL request (take 2)
>
> Hi
>
> This time properly signed.
>
> [take 1]
> It includes:
> - Leonardo fix for zero_copy flush
> - Fiona fix for return value of readv/writev
> - Peter Xu cleanups
> - Peter Xu preempt patches
> - Patches ready from zero page (me)
> - AVX2 support (ling)
> - fix for slow networking and reordering of first packets (manish)
>
> Please, apply.
>
> 
>
> Fiona Ebner (1):
>   migration/channel-block: fix return value for
> qio_channel_block_{readv,writev}
>
> Juan Quintela (5):
>   multifd: Create page_size fields into both MultiFD{Recv,Send}Params
>   multifd: Create page_count fields into both MultiFD{Recv,Send}Params
>   migration: Export ram_transferred_ram()
>   migration: Export ram_release_page()
>   migration: Block migration comment or code is wrong
>
> Leonardo Bras (1):
>   migration/multifd/zero-copy: Create helper function for flushing
>
> Peter Xu (20):
>   migration: Fix possible infinite loop of ram save process
>   migration: Fix race on qemu_file_shutdown()
>   migration: Disallow postcopy preempt to be used with compress
>   migration: Use non-atomic ops for clear log bitmap
>   migration: Disable multifd explicitly with compression
>   migration: Take bitmap mutex when completing ram migration
>   migration: Add postcopy_preempt_active()
>   migration: Cleanup xbzrle zero page cache update logic
>   migration: Trivial cleanup save_page_header() on same block check
>   migration: Remove RAMState.f references in compression code
>   migration: Yield bitmap_mutex properly when sending/sleeping
>   migration: Use atomic ops properly for page accountings
>   migration: Teach PSS about host page
>   migration: Introduce pss_channel
>   migration: Add pss_init()
>   migration: Make PageSearchStatus part of RAMState
>   migration: Move last_sent_block into PageSearchStatus
>   migration: Send requested page directly in rp-return thread
>   migration: Remove old preempt code around state maintainance
>   migration: Drop rs->f
>
> ling xu (2):
>   Update AVX512 support for xbzrle_encode_buffer
>   Unit test code and benchmark code

This commit causes the following CI failure:

cc -m64 -mcx16 -Ilibauthz.fa.p -I. -I.. -Iqapi -Itrace -Iui/shader
-I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include
-fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g
-isystem /builds/qemu-project/qemu/linux-headers -isystem
linux-headers -iquote . -iquote /builds/qemu-project/qemu -iquote
/builds/qemu-project/qemu/include -iquote
/builds/qemu-project/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE
-D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64
-D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wundef
-Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common
-fwrapv -Wold-style-declaration -Wold-style-definition -Wtype-limits
-Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers
-Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined
-Wimplicit-fallthrough=2 -Wno-missing-include-dirs
-Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE
-MD -MQ libauthz.fa.p/authz_simple.c.o -MF
libauthz.fa.p/authz_simple.c.o.d -o libauthz.fa.p/authz_simple.c.o -c
../authz/simple.c
In file included from ../authz/simple.c:23:
../authz/trace.h:1:10: fatal error: trace/trace-authz.h: No such file
or directory
1 | #include "trace/trace-authz.h"
| ^

https://gitlab.com/qemu-project/qemu/-/jobs/3326576115

I think the issue is that the test links against objects that aren't
present when a qemu-user-only build is performed. That's my first guess,
I might be wrong but it is definitely this commit that causes the
failure (I bisected it).

There is a second CI failure here:

clang -m64 -mcx16 -Itests/bench/xbzrle-bench.p -Itests/bench
-I../tests/bench -I. -Iqapi -Itrace -Iui -Iui/shader
-I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include
-I/usr/include/sysprof-4 -flto -fcolor-diagnostics -Wall -Winvalid-pch
-Werror -std=gnu11 -O2 -g -isystem
/builds/qemu-project/qemu/linux-headers -isystem linux-headers -iquote
. -iquote /builds/qemu-project/qemu -iquote
/builds/qemu-project/qemu/include -iquote
/builds/qemu-project/qemu/tcg/i386 -pthread -D_GNU_SOURCE
-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes

Re: [PULL 00/30] Next patches

2022-11-15 Thread Daniel P . Berrangé
Please don't merge this PULL request,

It contains changes to the "io" subsystem in patch 3 that I
have not reviewed nor acked yet, and which should be been
split as a separate patch from the migration changes too.

With regards,
Daniel

On Tue, Nov 15, 2022 at 04:34:44PM +0100, Juan Quintela wrote:
> The following changes since commit 98f10f0e2613ba1ac2ad3f57a5174014f6dcb03d:
> 
>   Merge tag 'pull-target-arm-20221114' of 
> https://git.linaro.org/people/pmaydell/qemu-arm into staging (2022-11-14 
> 13:31:17 -0500)
> 
> are available in the Git repository at:
> 
>   https://gitlab.com/juan.quintela/qemu.git tags/next-pull-request
> 
> for you to fetch changes up to d896a7a40db13fc2d05828c94ddda2747530089c:
> 
>   migration: Block migration comment or code is wrong (2022-11-15 10:31:06 
> +0100)
> 
> 
> Migration PULL request (take 2)
> 
> Hi
> 
> This time properly signed.
> 
> [take 1]
> It includes:
> - Leonardo fix for zero_copy flush
> - Fiona fix for return value of readv/writev
> - Peter Xu cleanups
> - Peter Xu preempt patches
> - Patches ready from zero page (me)
> - AVX2 support (ling)
> - fix for slow networking and reordering of first packets (manish)
> 
> Please, apply.
> 
> 
> 
> Fiona Ebner (1):
>   migration/channel-block: fix return value for
> qio_channel_block_{readv,writev}
> 
> Juan Quintela (5):
>   multifd: Create page_size fields into both MultiFD{Recv,Send}Params
>   multifd: Create page_count fields into both MultiFD{Recv,Send}Params
>   migration: Export ram_transferred_ram()
>   migration: Export ram_release_page()
>   migration: Block migration comment or code is wrong
> 
> Leonardo Bras (1):
>   migration/multifd/zero-copy: Create helper function for flushing
> 
> Peter Xu (20):
>   migration: Fix possible infinite loop of ram save process
>   migration: Fix race on qemu_file_shutdown()
>   migration: Disallow postcopy preempt to be used with compress
>   migration: Use non-atomic ops for clear log bitmap
>   migration: Disable multifd explicitly with compression
>   migration: Take bitmap mutex when completing ram migration
>   migration: Add postcopy_preempt_active()
>   migration: Cleanup xbzrle zero page cache update logic
>   migration: Trivial cleanup save_page_header() on same block check
>   migration: Remove RAMState.f references in compression code
>   migration: Yield bitmap_mutex properly when sending/sleeping
>   migration: Use atomic ops properly for page accountings
>   migration: Teach PSS about host page
>   migration: Introduce pss_channel
>   migration: Add pss_init()
>   migration: Make PageSearchStatus part of RAMState
>   migration: Move last_sent_block into PageSearchStatus
>   migration: Send requested page directly in rp-return thread
>   migration: Remove old preempt code around state maintainance
>   migration: Drop rs->f
> 
> ling xu (2):
>   Update AVX512 support for xbzrle_encode_buffer
>   Unit test code and benchmark code
> 
> manish.mishra (1):
>   migration: check magic value for deciding the mapping of channels
> 
>  meson.build   |  16 +
>  include/exec/ram_addr.h   |  11 +-
>  include/exec/ramblock.h   |   3 +
>  include/io/channel.h  |  25 ++
>  include/qemu/bitmap.h |   1 +
>  migration/migration.h |   7 -
>  migration/multifd.h   |  10 +-
>  migration/postcopy-ram.h  |   2 +-
>  migration/ram.h   |  23 +
>  migration/xbzrle.h|   4 +
>  io/channel-socket.c   |  27 ++
>  io/channel.c  |  39 ++
>  migration/block.c |   4 +-
>  migration/channel-block.c |   6 +-
>  migration/migration.c | 109 +++--
>  migration/multifd-zlib.c  |  14 +-
>  migration/multifd-zstd.c  |  12 +-
>  migration/multifd.c   |  69 +--
>  migration/postcopy-ram.c  |   5 +-
>  migration/qemu-file.c |  27 +-
>  migration/ram.c   | 794 +-
>  migration/xbzrle.c| 124 ++
>  tests/bench/xbzrle-bench.c| 465 
>  tests/unit/test-xbzrle.c  |  39 +-
>  util/bitmap.c |  45 ++
>  meson_options.txt |   2 +
>  scripts/meson-buildoptions.sh |  14 +-
>  tests/bench/meson.build   |   4 +
>  28 files changed, 1379 insertions(+), 522 deletions(-)
>  create mode 100644 tests/bench/xbzrle-bench.c
> 
> -- 
> 2.38.1
> 

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|




Re: [PATCH v3] block/rbd: Add support for layered encryption

2022-11-15 Thread Daniel P . Berrangé
On Tue, Nov 15, 2022 at 06:25:27AM -0600, Or Ozeri wrote:
> Starting from ceph Reef, RBD has built-in support for layered encryption,
> where each ancestor image (in a cloned image setting) can be possibly
> encrypted using a unique passphrase.
> 
> A new function, rbd_encryption_load2, was added to librbd API.
> This new function supports an array of passphrases (via "spec" structs).
> 
> This commit extends the qemu rbd driver API to use this new librbd API,
> in order to support this new layered encryption feature.
> 
> Signed-off-by: Or Ozeri 
> ---
> v3: further nit fixes suggested by @idryomov
> v2: nit fixes suggested by @idryomov
> ---
>  block/rbd.c  | 119 ++-
>  qapi/block-core.json |  35 +++--
>  2 files changed, 150 insertions(+), 4 deletions(-)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index f826410f40..ce017c29b5 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -71,6 +71,16 @@ static const char rbd_luks2_header_verification[
>  'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 2
>  };
>  
> +static const char rbd_layered_luks_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 1
> +};
> +
> +static const char rbd_layered_luks2_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 2
> +};
> +
>  typedef enum {
>  RBD_AIO_READ,
>  RBD_AIO_WRITE,
> @@ -470,6 +480,9 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  size_t passphrase_len;
>  rbd_encryption_luks1_format_options_t luks_opts;
>  rbd_encryption_luks2_format_options_t luks2_opts;
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +rbd_encryption_luks_format_options_t luks_any_opts;
> +#endif
>  rbd_encryption_format_t format;
>  rbd_encryption_options_t opts;
>  size_t opts_size;
> @@ -505,6 +518,23 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  luks2_opts.passphrase_size = passphrase_len;
>  break;
>  }
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS_ANY: {
> +memset(_any_opts, 0, sizeof(luks_any_opts));
> +format = RBD_ENCRYPTION_FORMAT_LUKS;
> +opts = &luks_any_opts;
> +opts_size = sizeof(luks_any_opts);
> +r = qemu_rbd_convert_luks_options(
> +
> qapi_RbdEncryptionOptionsLUKSAny_base(&encrypt->u.luks_any),
> +&passphrase, &passphrase_len, errp);
> +if (r < 0) {
> +return r;
> +}
> +luks_any_opts.passphrase = passphrase;
> +luks_any_opts.passphrase_size = passphrase_len;
> +break;
> +}
> +#endif

This looks unrelated to support of multiple layers, unless I'm missing
something.

>  default: {
>  r = -ENOTSUP;
>  error_setg_errno(
> @@ -522,6 +552,74 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  
>  return 0;
>  }
> +
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +static int qemu_rbd_encryption_load2(rbd_image_t image,
> + RbdEncryptionOptions *encrypt,
> + Error **errp)
> +{
> +int r = 0;
> +int encrypt_count = 1;
> +int i;
> +RbdEncryptionOptions *curr_encrypt;
> +rbd_encryption_spec_t *specs;
> +rbd_encryption_luks_format_options_t* luks_any_opts;
> +
> +/* count encryption options */
> +for (curr_encrypt = encrypt; curr_encrypt->has_parent;
> + curr_encrypt = curr_encrypt->parent) {
> +++encrypt_count;
> +}
> +
> +specs = g_new0(rbd_encryption_spec_t, encrypt_count);
> +
> +curr_encrypt = encrypt;
> +for (i = 0; i < encrypt_count; ++i) {
> +if (curr_encrypt->format != RBD_IMAGE_ENCRYPTION_FORMAT_LUKS_ANY) {
> +r = -ENOTSUP;
> +error_setg_errno(
> +errp, -r, "unknown image encryption format: %u",
> +curr_encrypt->format);
> +goto exit;
> +}
> +
> +specs[i].format = RBD_ENCRYPTION_FORMAT_LUKS;
> +specs[i].opts_size = sizeof(rbd_encryption_luks_format_options_t);
> +
> +luks_any_opts = g_new0(rbd_encryption_luks_format_options_t, 1);
> +specs[i].opts = luks_any_opts;
> +
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKSAny_base(
> +&curr_encrypt->u.luks_any),
> +(char**)&luks_any_opts->passphrase,
> +&luks_any_opts->passphrase_size,
> +errp);
> +if (r < 0) {
> +goto exit;
> +}
> +
> +curr_encrypt = curr_encrypt->parent;
> +}
> +
> +r = rbd_encryption_load2(image, specs, encrypt_count);
> +if (r < 0) {
> +error_setg_errno(errp, -r, "layered encryption load fail");
> +goto exit;
> +}
> +
> +exit:
> +   

Re: [PATCH v3] block/rbd: Add support for layered encryption

2022-11-15 Thread Ilya Dryomov
On Tue, Nov 15, 2022 at 1:25 PM Or Ozeri  wrote:
>
> Starting from ceph Reef, RBD has built-in support for layered encryption,
> where each ancestor image (in a cloned image setting) can be possibly
> encrypted using a unique passphrase.
>
> A new function, rbd_encryption_load2, was added to librbd API.
> This new function supports an array of passphrases (via "spec" structs).
>
> This commit extends the qemu rbd driver API to use this new librbd API,
> in order to support this new layered encryption feature.
>
> Signed-off-by: Or Ozeri 
> ---
> v3: further nit fixes suggested by @idryomov
> v2: nit fixes suggested by @idryomov
> ---
>  block/rbd.c  | 119 ++-
>  qapi/block-core.json |  35 +++--
>  2 files changed, 150 insertions(+), 4 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index f826410f40..ce017c29b5 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -71,6 +71,16 @@ static const char rbd_luks2_header_verification[
>  'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 2
>  };
>
> +static const char rbd_layered_luks_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 1
> +};
> +
> +static const char rbd_layered_luks2_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 2
> +};
> +
>  typedef enum {
>  RBD_AIO_READ,
>  RBD_AIO_WRITE,
> @@ -470,6 +480,9 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  size_t passphrase_len;
>  rbd_encryption_luks1_format_options_t luks_opts;
>  rbd_encryption_luks2_format_options_t luks2_opts;
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +rbd_encryption_luks_format_options_t luks_any_opts;
> +#endif
>  rbd_encryption_format_t format;
>  rbd_encryption_options_t opts;
>  size_t opts_size;
> @@ -505,6 +518,23 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  luks2_opts.passphrase_size = passphrase_len;
>  break;
>  }
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS_ANY: {
> +memset(_any_opts, 0, sizeof(luks_any_opts));
> +format = RBD_ENCRYPTION_FORMAT_LUKS;
> +opts = &luks_any_opts;
> +opts_size = sizeof(luks_any_opts);
> +r = qemu_rbd_convert_luks_options(
> +
> qapi_RbdEncryptionOptionsLUKSAny_base(&encrypt->u.luks_any),
> +&passphrase, &passphrase_len, errp);
> +if (r < 0) {
> +return r;
> +}
> +luks_any_opts.passphrase = passphrase;
> +luks_any_opts.passphrase_size = passphrase_len;
> +break;
> +}
> +#endif
>  default: {
>  r = -ENOTSUP;
>  error_setg_errno(
> @@ -522,6 +552,74 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>
>  return 0;
>  }
> +
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +static int qemu_rbd_encryption_load2(rbd_image_t image,
> + RbdEncryptionOptions *encrypt,
> + Error **errp)
> +{
> +int r = 0;
> +int encrypt_count = 1;
> +int i;
> +RbdEncryptionOptions *curr_encrypt;
> +rbd_encryption_spec_t *specs;
> +rbd_encryption_luks_format_options_t* luks_any_opts;
> +
> +/* count encryption options */
> +for (curr_encrypt = encrypt; curr_encrypt->has_parent;
> + curr_encrypt = curr_encrypt->parent) {
> +++encrypt_count;
> +}
> +
> +specs = g_new0(rbd_encryption_spec_t, encrypt_count);
> +
> +curr_encrypt = encrypt;
> +for (i = 0; i < encrypt_count; ++i) {
> +if (curr_encrypt->format != RBD_IMAGE_ENCRYPTION_FORMAT_LUKS_ANY) {
> +r = -ENOTSUP;
> +error_setg_errno(
> +errp, -r, "unknown image encryption format: %u",
> +curr_encrypt->format);
> +goto exit;
> +}
> +
> +specs[i].format = RBD_ENCRYPTION_FORMAT_LUKS;
> +specs[i].opts_size = sizeof(rbd_encryption_luks_format_options_t);
> +
> +luks_any_opts = g_new0(rbd_encryption_luks_format_options_t, 1);
> +specs[i].opts = luks_any_opts;
> +
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKSAny_base(
> +&curr_encrypt->u.luks_any),
> +(char**)&luks_any_opts->passphrase,

Nit: I would change qemu_rbd_convert_luks_options() to take
const char **passphrase and eliminate this cast.  It's a trivial
fixup so it can be folded into this patch with no explanation.
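
The suggested fixup would look roughly like this (a hypothetical sketch
of the changed prototype and the resulting call site inside block/rbd.c):

/* Hypothetical sketch: if the converter takes const char **, the
 * (char **) cast at the call site disappears, since the librbd
 * options struct declares 'const char *passphrase'. */
static int qemu_rbd_convert_luks_options(
        RbdEncryptionOptionsLUKSBase *luks_opts,
        const char **passphrase,            /* was: char **passphrase */
        size_t *passphrase_len,
        Error **errp);

static void example_call(RbdEncryptionOptions *curr_encrypt,
                         rbd_encryption_luks_format_options_t *luks_any_opts,
                         Error **errp)
{
    int r = qemu_rbd_convert_luks_options(
            qapi_RbdEncryptionOptionsLUKSAny_base(&curr_encrypt->u.luks_any),
            &luks_any_opts->passphrase,     /* const char **, no cast */
            &luks_any_opts->passphrase_size,
            errp);
    (void)r;
}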

> +&luks_any_opts->passphrase_size,
> +errp);
> +if (r < 0) {
> +goto exit;
> +}
> +
> +curr_encrypt = curr_encrypt->parent;
> +}
> +
> +r = rbd_encryption_load2(image, specs, encrypt_count);
> +if (r < 0) {
> +

Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-15 Thread Christian Borntraeger




On 15.11.22 at 17:40, Christian Borntraeger wrote:



On 15.11.22 at 17:05, Alex Bennée wrote:


Christian Borntraeger  writes:


On 15.11.22 at 15:31, Alex Bennée wrote:

"Michael S. Tsirkin"  writes:


On Mon, Nov 14, 2022 at 06:15:30PM +0100, Christian Borntraeger wrote:



On 14.11.22 at 18:10, Michael S. Tsirkin wrote:

On Mon, Nov 14, 2022 at 05:55:09PM +0100, Christian Borntraeger wrote:



On 14.11.22 at 17:37, Michael S. Tsirkin wrote:

On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger wrote:

On 08.11.22 at 10:23, Alex Bennée wrote:

The previous fix to virtio_device_started revealed a problem in its
use by both the core and the device code. The core code should be able
to handle the device "starting" while the VM isn't running to handle
the restoration of migration state. To solve this dual use introduce a
new helper for use by the vhost-user backends who all use it to feed a
should_start variable.

We can also pick up a change to vhost_user_blk_set_status while we are at
it, which follows the same pattern.

Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
Signed-off-by: Alex Bennée 
Cc: "Michael S. Tsirkin" 


Hmmm, is this
commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
Author: Alex Bennée 
AuthorDate: Mon Nov 7 12:14:07 2022 +
Commit: Michael S. Tsirkin 
CommitDate: Mon Nov 7 14:08:18 2022 -0500

    hw/virtio: introduce virtio_device_should_start

an older version?


This is what got merged:
https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
This patch was sent after I merged the RFC.
I think the only difference is the commit log but I might be missing
something.


This does not seem to fix the regression that I have reported.


This was applied on top of 9f6bcfd99f which IIUC does, right?




QEMU master still fails for me for suspend/resume to disk:

#0  0x03ff8e3980a6 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x03ff8e348580 in raise () at /lib64/libc.so.6
#2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
#3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
#4  0x03ff8e340a4e in  () at /lib64/libc.so.6
#5 0x02aa1ffa8966 in vhost_vsock_common_pre_save
(opaque=<optimized out>) at
../hw/virtio/vhost-vsock-common.c:203
#6  0x02aa1fe5e0ee in vmstate_save_state_v
   (f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0
, opaque=0x2aa21bac9f8,
vmdesc=vmdesc@entry=0x3fddc08eb30,
version_id=version_id@entry=0) at ../migration/vmstate.c:329
#7 0x02aa1fe5ebf8 in vmstate_save_state
(f=f@entry=0x2aa21bdc170, vmsd=<optimized out>,
opaque=<optimized out>, vmdesc_id=vmdesc_id@entry=0x3fddc08eb30)
at ../migration/vmstate.c:317
#8 0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170,
se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at
../migration/savevm.c:908
#9 0x02aa1fe79584 in
qemu_savevm_state_complete_precopy_non_iterable
(f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false,
inactivate_disks=inactivate_disks@entry=true)
   at ../migration/savevm.c:1393
#10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy
(f=0x2aa21bdc170, iterable_only=iterable_only@entry=false,
inactivate_disks=inactivate_disks@entry=true) at
../migration/savevm.c:1459
#11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
../migration/migration.c:3314
#12 migration_iteration_run (s=0x2aa218ef600) at ../migration/migration.c:3761
#13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
../migration/migration.c:3989
#14 0x02aa201f0b8c in qemu_thread_start (args=<optimized out>) at 
../util/qemu-thread-posix.c:505
#15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
#16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6

Michael, your previous branch did work if I recall correctly.


That one was failing under github CI though (for reasons we didn't
really address, such as disconnect during stop causing a recursive
call to stop, but there you are).

Even the double revert of everything?


I don't remember at this point.


So how do we proceed now?


I'm hopeful Alex will come up with a fix.

I need to replicate the failing test for that. Which test is
failing?



Pretty much the same as before. guest with vsock, managedsave and
restore.


If this isn't in our test suite I'm going to need exact steps.


Just get any libvirt guest, add

    <vsock model='virtio'>
      <cid auto='yes'/>
    </vsock>

to your libvirt xml. Start the guest (with the new xml).
Run virsh managedsave - qemu crashes. On x86 and s390.



the libvirt log:

/home/cborntra/REPOS/qemu/build/x86_64-softmmu/qemu-system-x86_64 \
-name guest=f36,debug-threads=on \
-S \
-object 
'{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-f36/master-key.aes"}'
 \
-machine pc-i440fx-7.2,usb=off,dump-guest-core=off,memory-backend=pc.ram \
-accel kvm \
-cpu 

Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-15 Thread Christian Borntraeger




On 15.11.22 at 17:05, Alex Bennée wrote:


Christian Borntraeger  writes:


On 15.11.22 at 15:31, Alex Bennée wrote:

"Michael S. Tsirkin"  writes:


On Mon, Nov 14, 2022 at 06:15:30PM +0100, Christian Borntraeger wrote:



On 14.11.22 at 18:10, Michael S. Tsirkin wrote:

On Mon, Nov 14, 2022 at 05:55:09PM +0100, Christian Borntraeger wrote:



On 14.11.22 at 17:37, Michael S. Tsirkin wrote:

On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger wrote:

On 08.11.22 at 10:23, Alex Bennée wrote:

The previous fix to virtio_device_started revealed a problem in its
use by both the core and the device code. The core code should be able
to handle the device "starting" while the VM isn't running to handle
the restoration of migration state. To solve this dual use introduce a
new helper for use by the vhost-user backends who all use it to feed a
should_start variable.

We can also pick up a change to vhost_user_blk_set_status while we are at
it, which follows the same pattern.

Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
Signed-off-by: Alex Bennée 
Cc: "Michael S. Tsirkin" 


Hmmm, is this
commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
Author: Alex Bennée 
AuthorDate: Mon Nov 7 12:14:07 2022 +
Commit: Michael S. Tsirkin 
CommitDate: Mon Nov 7 14:08:18 2022 -0500

hw/virtio: introduce virtio_device_should_start

an older version?


This is what got merged:
https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
This patch was sent after I merged the RFC.
I think the only difference is the commit log but I might be missing
something.


This does not seem to fix the regression that I have reported.


This was applied on top of 9f6bcfd99f which IIUC does, right?




QEMU master still fails for me for suspend/resume to disk:

#0  0x03ff8e3980a6 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x03ff8e348580 in raise () at /lib64/libc.so.6
#2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
#3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
#4  0x03ff8e340a4e in  () at /lib64/libc.so.6
#5 0x02aa1ffa8966 in vhost_vsock_common_pre_save
(opaque=<optimized out>) at
../hw/virtio/vhost-vsock-common.c:203
#6  0x02aa1fe5e0ee in vmstate_save_state_v
   (f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0
, opaque=0x2aa21bac9f8,
vmdesc=vmdesc@entry=0x3fddc08eb30,
version_id=version_id@entry=0) at ../migration/vmstate.c:329
#7 0x02aa1fe5ebf8 in vmstate_save_state
(f=f@entry=0x2aa21bdc170, vmsd=<optimized out>,
opaque=<optimized out>, vmdesc_id=vmdesc_id@entry=0x3fddc08eb30)
at ../migration/vmstate.c:317
#8 0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170,
se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at
../migration/savevm.c:908
#9 0x02aa1fe79584 in
qemu_savevm_state_complete_precopy_non_iterable
(f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false,
inactivate_disks=inactivate_disks@entry=true)
   at ../migration/savevm.c:1393
#10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy
(f=0x2aa21bdc170, iterable_only=iterable_only@entry=false,
inactivate_disks=inactivate_disks@entry=true) at
../migration/savevm.c:1459
#11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
../migration/migration.c:3314
#12 migration_iteration_run (s=0x2aa218ef600) at ../migration/migration.c:3761
#13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
../migration/migration.c:3989
#14 0x02aa201f0b8c in qemu_thread_start (args=<optimized out>) at 
../util/qemu-thread-posix.c:505
#15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
#16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6

Michael, your previous branch did work if I recall correctly.


That one was failing under github CI though (for reasons we didn't
really address, such as disconnect during stop causing a recursive
call to stop, but there you are).

Even the double revert of everything?


I don't remember at this point.


So how do we proceed now?


I'm hopeful Alex will come up with a fix.

I need to replicate the failing test for that. Which test is
failing?



Pretty much the same as before. guest with vsock, managedsave and
restore.


If this isn't in our test suite I'm going to need exact steps.


Just get any libvirt guest, add

    <vsock model='virtio'>
      <cid auto='yes'/>
    </vsock>

to your libvirt xml. Start the guest (with the new xml).
Run virsh managedsave - qemu crashes. On x86 and s390.



Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-15 Thread Alex Bennée


Christian Borntraeger  writes:

> Am 15.11.22 um 15:31 schrieb Alex Bennée:
>> "Michael S. Tsirkin"  writes:
>> 
>>> On Mon, Nov 14, 2022 at 06:15:30PM +0100, Christian Borntraeger wrote:


 On 14.11.22 at 18:10, Michael S. Tsirkin wrote:
> On Mon, Nov 14, 2022 at 05:55:09PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 14.11.22 at 17:37, Michael S. Tsirkin wrote:
>>> On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger wrote:
 On 08.11.22 at 10:23, Alex Bennée wrote:
> The previous fix to virtio_device_started revealed a problem in its
> use by both the core and the device code. The core code should be able
> to handle the device "starting" while the VM isn't running to handle
> the restoration of migration state. To solve this dual use introduce a
> new helper for use by the vhost-user backends who all use it to feed a
> should_start variable.
>
> We can also pick up a change to vhost_user_blk_set_status while we are at
> it, which follows the same pattern.
>
> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to 
> virtio_device_started)
> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio 
> device)
> Signed-off-by: Alex Bennée 
> Cc: "Michael S. Tsirkin" 

 Hmmm, is this
 commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
 Author: Alex Bennée 
 AuthorDate: Mon Nov 7 12:14:07 2022 +
 Commit: Michael S. Tsirkin 
 CommitDate: Mon Nov 7 14:08:18 2022 -0500

hw/virtio: introduce virtio_device_should_start

 an older version?
>>>
>>> This is what got merged:
>>> https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
>>> This patch was sent after I merged the RFC.
>>> I think the only difference is the commit log but I might be missing
>>> something.
>>>
 This does not seem to fix the regression that I have reported.
>>>
>>> This was applied on top of 9f6bcfd99f which IIUC does, right?
>>>
>>>
>>
>> QEMU master still fails for me for suspend/resume to disk:
>>
>> #0  0x03ff8e3980a6 in __pthread_kill_implementation () at 
>> /lib64/libc.so.6
>> #1  0x03ff8e348580 in raise () at /lib64/libc.so.6
>> #2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
>> #3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
>> #4  0x03ff8e340a4e in  () at /lib64/libc.so.6
>> #5 0x02aa1ffa8966 in vhost_vsock_common_pre_save
>> (opaque=<optimized out>) at
>> ../hw/virtio/vhost-vsock-common.c:203
>> #6  0x02aa1fe5e0ee in vmstate_save_state_v
>>   (f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0
>> , opaque=0x2aa21bac9f8,
>> vmdesc=vmdesc@entry=0x3fddc08eb30,
>> version_id=version_id@entry=0) at ../migration/vmstate.c:329
>> #7 0x02aa1fe5ebf8 in vmstate_save_state
>> (f=f@entry=0x2aa21bdc170, vmsd=<optimized out>,
>> opaque=<optimized out>, vmdesc_id=vmdesc_id@entry=0x3fddc08eb30)
>> at ../migration/vmstate.c:317
>> #8 0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170,
>> se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at
>> ../migration/savevm.c:908
>> #9 0x02aa1fe79584 in
>> qemu_savevm_state_complete_precopy_non_iterable
>> (f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false,
>> inactivate_disks=inactivate_disks@entry=true)
>>   at ../migration/savevm.c:1393
>> #10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy
>> (f=0x2aa21bdc170, iterable_only=iterable_only@entry=false,
>> inactivate_disks=inactivate_disks@entry=true) at
>> ../migration/savevm.c:1459
>> #11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
>> ../migration/migration.c:3314
>> #12 migration_iteration_run (s=0x2aa218ef600) at 
>> ../migration/migration.c:3761
>> #13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
>> ../migration/migration.c:3989
>> #14 0x02aa201f0b8c in qemu_thread_start (args=<optimized out>) at 
>> ../util/qemu-thread-posix.c:505
>> #15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
>> #16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6
>>
>> Michael, your previous branch did work if I recall correctly.
>
> That one was failing under github CI though (for reasons we didn't
> really address, such as disconnect during stop causing a recursive
> call to stop, but there you are).
 Even the double revert of everything?
>>>
>>> I don't remember at this point.
>>>
 So how do we proceed now?
>>>
>>> I'm hopeful Alex will come up with a fix.
>> I need to replicate the failing test for that. Which test is
>> failing?
>
>
> Pretty much the same as before. guest with vsock, managedsave and
> restore.

If 

[PULL 05/30] multifd: Create page_count fields into both MultiFD{Recv, Send}Params

2022-11-15 Thread Juan Quintela
We were recalculating it left and right.  We plan to change those
values in the next patches.

Signed-off-by: Juan Quintela 
Reviewed-by: Leonardo Bras 
---
 migration/multifd.h | 4 
 migration/multifd.c | 7 ---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 941563c232..ff3aa2e2e9 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -82,6 +82,8 @@ typedef struct {
 uint32_t packet_len;
 /* guest page size */
 uint32_t page_size;
+/* number of pages in a full packet */
+uint32_t page_count;
 /* multifd flags for sending ram */
 int write_flags;
 
@@ -147,6 +149,8 @@ typedef struct {
 uint32_t packet_len;
 /* guest page size */
 uint32_t page_size;
+/* number of pages in a full packet */
+uint32_t page_count;
 
 /* syncs main thread and channels */
 QemuSemaphore sem_sync;
diff --git a/migration/multifd.c b/migration/multifd.c
index b32fe7edaf..c40d98ad5c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -279,7 +279,6 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
 static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
 {
 MultiFDPacket_t *packet = p->packet;
-uint32_t page_count = MULTIFD_PACKET_SIZE / p->page_size;
 RAMBlock *block;
 int i;
 
@@ -306,10 +305,10 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams 
*p, Error **errp)
  * If we received a packet that is 100 times bigger than expected
  * just stop migration.  It is a magic number.
  */
-if (packet->pages_alloc > page_count) {
+if (packet->pages_alloc > p->page_count) {
 error_setg(errp, "multifd: received packet "
"with size %u and expected a size of %u",
-   packet->pages_alloc, page_count) ;
+   packet->pages_alloc, p->page_count) ;
 return -1;
 }
 
@@ -944,6 +943,7 @@ int multifd_save_setup(Error **errp)
 p->iov = g_new0(struct iovec, page_count + 1);
 p->normal = g_new0(ram_addr_t, page_count);
 p->page_size = qemu_target_page_size();
+p->page_count = page_count;
 
 if (migrate_use_zero_copy_send()) {
 p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
@@ -1191,6 +1191,7 @@ int multifd_load_setup(Error **errp)
 p->name = g_strdup_printf("multifdrecv_%d", i);
 p->iov = g_new0(struct iovec, page_count);
 p->normal = g_new0(ram_addr_t, page_count);
+p->page_count = page_count;
 p->page_size = qemu_target_page_size();
 }
 
-- 
2.38.1




[PULL 18/30] migration: Trivial cleanup save_page_header() on same block check

2022-11-15 Thread Juan Quintela
From: Peter Xu 

The 2nd check on RAM_SAVE_FLAG_CONTINUE is a bit redundant.  Use a boolean
to be clearer.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 9ded381e0a..42b6a543bd 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -689,14 +689,15 @@ static size_t save_page_header(RAMState *rs, QEMUFile *f, 
 RAMBlock *block,
ram_addr_t offset)
 {
 size_t size, len;
+bool same_block = (block == rs->last_sent_block);
 
-if (block == rs->last_sent_block) {
+if (same_block) {
 offset |= RAM_SAVE_FLAG_CONTINUE;
 }
 qemu_put_be64(f, offset);
 size = 8;
 
-if (!(offset & RAM_SAVE_FLAG_CONTINUE)) {
+if (!same_block) {
 len = strlen(block->idstr);
 qemu_put_byte(f, len);
 qemu_put_buffer(f, (uint8_t *)block->idstr, len);
-- 
2.38.1




[PULL 20/30] migration: Yield bitmap_mutex properly when sending/sleeping

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Don't take the bitmap mutex when sending pages, or when being throttled by
migration_rate_limit() (which is a bit tricky to call here in ram code,
but still seems helpful).

It prepares for the possibility of concurrently sending pages in >1 threads
using the function ram_save_host_page() because all threads may need the
bitmap_mutex to operate on bitmaps, so that either sendmsg() or any kind of
qemu_sem_wait() blocking for one thread will not block the other from
progressing.

Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 46 +++---
 1 file changed, 35 insertions(+), 11 deletions(-)
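
In outline, the locking shape the hunk below introduces is the following
(an illustrative sketch; the helper names are hypothetical stand-ins for
the real bitmap and send logic):

#include "qemu/osdep.h"
#include "qemu/thread.h"

static bool more_dirty_pages(void);       /* hypothetical helpers */
static void clear_dirty_bit_locked(void);
static void send_one_page(void);          /* may block in sendmsg() */

/* Yield the mutex around the blocking send so other threads can
 * operate on the shared bitmaps in the meantime. */
static void send_pages_yielding(QemuMutex *bitmap_mutex, bool preempt_active)
{
    qemu_mutex_lock(bitmap_mutex);

    while (more_dirty_pages()) {
        clear_dirty_bit_locked();               /* bitmap ops under the lock */

        if (preempt_active) {
            qemu_mutex_unlock(bitmap_mutex);    /* don't hold across I/O */
        }
        send_one_page();
        if (preempt_active) {
            qemu_mutex_lock(bitmap_mutex);
        }
    }

    qemu_mutex_unlock(bitmap_mutex);
}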

diff --git a/migration/ram.c b/migration/ram.c
index ebc5664dcc..6428138194 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2480,9 +2480,14 @@ static void postcopy_preempt_reset_channel(RAMState *rs)
  * a host page in which case the remainder of the hostpage is sent.
  * Only dirty target pages are sent. Note that the host page size may
  * be a huge page for this block.
+ *
  * The saving stops at the boundary of the used_length of the block
  * if the RAMBlock isn't a multiple of the host page size.
  *
+ * The caller must hold ram_state.bitmap_mutex when calling this
+ * function.  Note that this function can temporarily release the lock, but
+ * when the function returns it will make sure the lock is still held.
+ *
  * Returns the number of pages written or negative on error
  *
  * @rs: current RAM state
@@ -2490,6 +2495,7 @@ static void postcopy_preempt_reset_channel(RAMState *rs)
  */
 static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
 {
+bool page_dirty, preempt_active = postcopy_preempt_active();
 int tmppages, pages = 0;
 size_t pagesize_bits =
 qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
@@ -2513,22 +2519,40 @@ static int ram_save_host_page(RAMState *rs, 
PageSearchStatus *pss)
 break;
 }
 
+page_dirty = migration_bitmap_clear_dirty(rs, pss->block, pss->page);
+
 /* Check the pages is dirty and if it is send it */
-if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
+if (page_dirty) {
+/*
+ * Properly yield the lock only in postcopy preempt mode
+ * because both migration thread and rp-return thread can
+ * operate on the bitmaps.
+ */
+if (preempt_active) {
+qemu_mutex_unlock(&rs->bitmap_mutex);
+}
 tmppages = ram_save_target_page(rs, pss);
-if (tmppages < 0) {
-return tmppages;
+if (tmppages >= 0) {
+pages += tmppages;
+/*
+ * Allow rate limiting to happen in the middle of huge pages if
+ * something is sent in the current iteration.
+ */
+if (pagesize_bits > 1 && tmppages > 0) {
+migration_rate_limit();
+}
 }
-
-pages += tmppages;
-/*
- * Allow rate limiting to happen in the middle of huge pages if
- * something is sent in the current iteration.
- */
-if (pagesize_bits > 1 && tmppages > 0) {
-migration_rate_limit();
+if (preempt_active) {
+qemu_mutex_lock(&rs->bitmap_mutex);
 }
+} else {
+tmppages = 0;
+}
+
+if (tmppages < 0) {
+return tmppages;
 }
+
 pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
 } while ((pss->page < hostpage_boundary) &&
  offset_in_ramblock(pss->block,
-- 
2.38.1




[PULL 04/30] multifd: Create page_size fields into both MultiFD{Recv, Send}Params

2022-11-15 Thread Juan Quintela
We were calling qemu_target_page_size() left and right.

Signed-off-by: Juan Quintela 
Reviewed-by: Leonardo Bras 
---
 migration/multifd.h  |  4 
 migration/multifd-zlib.c | 14 ++
 migration/multifd-zstd.c | 12 +---
 migration/multifd.c  | 18 --
 4 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 913e4ba274..941563c232 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -80,6 +80,8 @@ typedef struct {
 bool registered_yank;
 /* packet allocated len */
 uint32_t packet_len;
+/* guest page size */
+uint32_t page_size;
 /* multifd flags for sending ram */
 int write_flags;
 
@@ -143,6 +145,8 @@ typedef struct {
 QIOChannel *c;
 /* packet allocated len */
 uint32_t packet_len;
+/* guest page size */
+uint32_t page_size;
 
 /* syncs main thread and channels */
 QemuSemaphore sem_sync;
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 18213a9513..37770248e1 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -116,7 +116,6 @@ static void zlib_send_cleanup(MultiFDSendParams *p, Error 
**errp)
 static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
 {
 struct zlib_data *z = p->data;
-size_t page_size = qemu_target_page_size();
z_stream *zs = &z->zs;
 uint32_t out_size = 0;
 int ret;
@@ -135,8 +134,8 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error 
**errp)
  * with compression. zlib does not guarantee that this is safe,
  * therefore copy the page before calling deflate().
  */
-memcpy(z->buf, p->pages->block->host + p->normal[i], page_size);
-zs->avail_in = page_size;
+memcpy(z->buf, p->pages->block->host + p->normal[i], p->page_size);
+zs->avail_in = p->page_size;
 zs->next_in = z->buf;
 
 zs->avail_out = available;
@@ -242,12 +241,11 @@ static void zlib_recv_cleanup(MultiFDRecvParams *p)
 static int zlib_recv_pages(MultiFDRecvParams *p, Error **errp)
 {
 struct zlib_data *z = p->data;
-size_t page_size = qemu_target_page_size();
z_stream *zs = &z->zs;
 uint32_t in_size = p->next_packet_size;
 /* we measure the change of total_out */
 uint32_t out_size = zs->total_out;
-uint32_t expected_size = p->normal_num * page_size;
+uint32_t expected_size = p->normal_num * p->page_size;
 uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
 int ret;
 int i;
@@ -274,7 +272,7 @@ static int zlib_recv_pages(MultiFDRecvParams *p, Error 
**errp)
 flush = Z_SYNC_FLUSH;
 }
 
-zs->avail_out = page_size;
+zs->avail_out = p->page_size;
 zs->next_out = p->host + p->normal[i];
 
 /*
@@ -288,8 +286,8 @@ static int zlib_recv_pages(MultiFDRecvParams *p, Error 
**errp)
 do {
 ret = inflate(zs, flush);
 } while (ret == Z_OK && zs->avail_in
- && (zs->total_out - start) < page_size);
-if (ret == Z_OK && (zs->total_out - start) < page_size) {
+ && (zs->total_out - start) < p->page_size);
+if (ret == Z_OK && (zs->total_out - start) < p->page_size) {
 error_setg(errp, "multifd %u: inflate generated too few output",
p->id);
 return -1;
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index d788d309f2..f4a8e1ed1f 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -113,7 +113,6 @@ static void zstd_send_cleanup(MultiFDSendParams *p, Error 
**errp)
 static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
 {
 struct zstd_data *z = p->data;
-size_t page_size = qemu_target_page_size();
 int ret;
 uint32_t i;
 
@@ -128,7 +127,7 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
 flush = ZSTD_e_flush;
 }
 z->in.src = p->pages->block->host + p->normal[i];
-z->in.size = page_size;
+z->in.size = p->page_size;
 z->in.pos = 0;
 
 /*
@@ -241,8 +240,7 @@ static int zstd_recv_pages(MultiFDRecvParams *p, Error **errp)
 {
 uint32_t in_size = p->next_packet_size;
 uint32_t out_size = 0;
-size_t page_size = qemu_target_page_size();
-uint32_t expected_size = p->normal_num * page_size;
+uint32_t expected_size = p->normal_num * p->page_size;
 uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
 struct zstd_data *z = p->data;
 int ret;
@@ -265,7 +263,7 @@ static int zstd_recv_pages(MultiFDRecvParams *p, Error **errp)
 
 for (i = 0; i < p->normal_num; i++) {
 z->out.dst = p->host + p->normal[i];
-z->out.size = page_size;
+z->out.size = p->page_size;
 z->out.pos = 0;
 
 /*
@@ -279,8 +277,8 @@ static int zstd_recv_pages(MultiFDRecvParams *p, Error **errp)
 do {
 ret 

[PULL 21/30] migration: Use atomic ops properly for page accountings

2022-11-15 Thread Juan Quintela
From: Peter Xu 

To prepare for thread safety on page accounting, at least the counters
below need to be accessed atomically; they are:

ram_counters.transferred
ram_counters.duplicate
ram_counters.normal
ram_counters.postcopy_bytes

There are a lot of other counters, but they won't be accessed outside the
migration thread, so they are still safe to be accessed without atomic ops.
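
As a standalone illustration (plain C11, not QEMU code) of the kind of counter this introduces: a plain "counter += n" is a non-atomic read-modify-write and can lose updates once two threads hit it concurrently, while the atomic version below cannot.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Analogous to Stat64: any thread may add, any thread may read. */
    typedef struct { _Atomic uint64_t value; } CounterStat;

    static inline void counter_add(CounterStat *s, uint64_t n)
    {
        /* one atomic read-modify-write; "s->value += n" on a plain
         * uint64_t could lose increments under concurrency */
        atomic_fetch_add_explicit(&s->value, n, memory_order_relaxed);
    }

    static inline uint64_t counter_get(CounterStat *s)
    {
        return atomic_load_explicit(&s->value, memory_order_relaxed);
    }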

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.h   | 20 
 migration/migration.c | 10 +-
 migration/multifd.c   |  4 ++--
 migration/ram.c   | 40 
 4 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/migration/ram.h b/migration/ram.h
index 038d52f49f..81cbb0947c 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -32,7 +32,27 @@
 #include "qapi/qapi-types-migration.h"
 #include "exec/cpu-common.h"
 #include "io/channel.h"
+#include "qemu/stats64.h"
 
+/*
+ * These are the migration statistic counters that need to be updated using
+ * atomic ops (can be accessed by more than one thread).  Here since we
+ * cannot modify MigrationStats directly to use Stat64 as it was defined in
+ * the QAPI schema, we define an internal structure to hold them, and we
+ * propagate the real values when QMP queries happen.
+ *
+ * IOW, the corresponding fields within ram_counters for these specific
+ * counters will always be zero and not used at all; they're just
+ * placeholders to make it QAPI-compatible.
+ */
+typedef struct {
+Stat64 transferred;
+Stat64 duplicate;
+Stat64 normal;
+Stat64 postcopy_bytes;
+} MigrationAtomicStats;
+
+extern MigrationAtomicStats ram_atomic_counters;
 extern MigrationStats ram_counters;
 extern XBZRLECacheStats xbzrle_counters;
 extern CompressionStats compression_counters;
diff --git a/migration/migration.c b/migration/migration.c
index 9fbed8819a..1f95877fb4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1069,13 +1069,13 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
 
 info->has_ram = true;
 info->ram = g_malloc0(sizeof(*info->ram));
-info->ram->transferred = ram_counters.transferred;
+info->ram->transferred = stat64_get(&ram_atomic_counters.transferred);
 info->ram->total = ram_bytes_total();
-info->ram->duplicate = ram_counters.duplicate;
+info->ram->duplicate = stat64_get(&ram_atomic_counters.duplicate);
 /* legacy value.  It is not used anymore */
 info->ram->skipped = 0;
-info->ram->normal = ram_counters.normal;
-info->ram->normal_bytes = ram_counters.normal * page_size;
+info->ram->normal = stat64_get(&ram_atomic_counters.normal);
+info->ram->normal_bytes = info->ram->normal * page_size;
 info->ram->mbps = s->mbps;
 info->ram->dirty_sync_count = ram_counters.dirty_sync_count;
 info->ram->dirty_sync_missed_zero_copy =
@@ -1086,7 +1086,7 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
 info->ram->pages_per_second = s->pages_per_second;
 info->ram->precopy_bytes = ram_counters.precopy_bytes;
 info->ram->downtime_bytes = ram_counters.downtime_bytes;
-info->ram->postcopy_bytes = ram_counters.postcopy_bytes;
+info->ram->postcopy_bytes =
+stat64_get(&ram_atomic_counters.postcopy_bytes);
 
 if (migrate_use_xbzrle()) {
 info->has_xbzrle_cache = true;
diff --git a/migration/multifd.c b/migration/multifd.c
index c40d98ad5c..7d3aec9a52 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -432,7 +432,7 @@ static int multifd_send_pages(QEMUFile *f)
 transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len;
 qemu_file_acct_rate_limit(f, transferred);
 ram_counters.multifd_bytes += transferred;
-ram_counters.transferred += transferred;
+stat64_add(&ram_atomic_counters.transferred, transferred);
 qemu_mutex_unlock(&p->mutex);
 qemu_sem_post(&p->sem);
 
@@ -624,7 +624,7 @@ int multifd_send_sync_main(QEMUFile *f)
 p->pending_job++;
 qemu_file_acct_rate_limit(f, p->packet_len);
 ram_counters.multifd_bytes += p->packet_len;
-ram_counters.transferred += p->packet_len;
+stat64_add(&ram_atomic_counters.transferred, p->packet_len);
 qemu_mutex_unlock(&p->mutex);
 qemu_sem_post(&p->sem);
 
diff --git a/migration/ram.c b/migration/ram.c
index 6428138194..25fd3cf7dc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -453,18 +453,25 @@ uint64_t ram_bytes_remaining(void)
0;
 }
 
+/*
+ * NOTE: not all stats in ram_counters are used in reality.  See comments
+ * for struct MigrationAtomicStats.  The ultimate result of ram migration
+ * counters will be a merged version with both ram_counters and the atomic
+ * fields in ram_atomic_counters.
+ */
 MigrationStats ram_counters;
+MigrationAtomicStats ram_atomic_counters;
 
 void 

[PULL 16/30] migration: Add postcopy_preempt_active()

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Add a helper to show that postcopy preempt is enabled and active.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 96fa521813..52c851eb56 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -190,6 +190,11 @@ out:
 return ret;
 }
 
+static bool postcopy_preempt_active(void)
+{
+return migrate_postcopy_preempt() && migration_in_postcopy();
+}
+
 bool ramblock_is_ignored(RAMBlock *block)
 {
 return !qemu_ram_is_migratable(block) ||
@@ -2461,7 +2466,7 @@ static void postcopy_preempt_choose_channel(RAMState *rs, PageSearchStatus *pss)
 /* We need to make sure rs->f always points to the default channel elsewhere */
 static void postcopy_preempt_reset_channel(RAMState *rs)
 {
-if (migrate_postcopy_preempt() && migration_in_postcopy()) {
+if (postcopy_preempt_active()) {
 rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
 rs->f = migrate_get_current()->to_dst_file;
 trace_postcopy_preempt_reset_channel();
@@ -2499,7 +2504,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
 return 0;
 }
 
-if (migrate_postcopy_preempt() && migration_in_postcopy()) {
+if (postcopy_preempt_active()) {
 postcopy_preempt_choose_channel(rs, pss);
 }
 
-- 
2.38.1




[PULL 26/30] migration: Move last_sent_block into PageSearchStatus

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Since we use PageSearchStatus to represent a channel, it makes perfect
sense to make last_sent_block (aka, what leverages RAM_SAVE_FLAG_CONTINUE)
per-channel rather than global, because each channel can be sending
different pages on different ramblocks.

Hence move it from RAMState into PageSearchStatus.
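
To see why a global field would be wrong: RAM_SAVE_FLAG_CONTINUE means "same ramblock as the previous page on this stream", so the "previous block" state has to follow the stream. A sketch of the hazard with two channels sharing one last_sent_block:

    /*
     * Hypothetical interleaving with a single, global last_sent_block:
     *
     *   precopy channel:  header(A) page(A)         page(A)?
     *   preempt channel:            header(B) page(B)
     *
     * After the preempt channel sends from block B, the global field is
     * B, so the next precopy page from A loses the CONTINUE optimization
     * at best, or mis-tags its stream at worst.  Per-channel state keeps
     * the flag consistent with what each stream actually announced:
     */
    bool same_block = (block == pss->last_sent_block);  /* per channel */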

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 71 -
 1 file changed, 41 insertions(+), 30 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index bdb29ac4d9..dbdde5a6a5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -117,6 +117,8 @@ XBZRLECacheStats xbzrle_counters;
 struct PageSearchStatus {
 /* The migration channel used for a specific host page */
 QEMUFile*pss_channel;
+/* Last block from where we have sent data */
+RAMBlock *last_sent_block;
 /* Current block being searched */
 RAMBlock*block;
 /* Current page to search from */
@@ -396,8 +398,6 @@ struct RAMState {
 int uffdio_fd;
 /* Last block that we have visited searching for dirty pages */
 RAMBlock *last_seen_block;
-/* Last block from where we have sent data */
-RAMBlock *last_sent_block;
 /* Last dirty target page we have sent */
 ram_addr_t last_page;
 /* last ram version we have seen */
@@ -712,16 +712,17 @@ exit:
  *
  * Returns the number of bytes written
  *
- * @f: QEMUFile where to send the data
+ * @pss: current PSS channel status
  * @block: block that contains the page we want to send
  * @offset: offset inside the block for the page
  *  in the lower bits, it contains flags
  */
-static size_t save_page_header(RAMState *rs, QEMUFile *f,  RAMBlock *block,
+static size_t save_page_header(PageSearchStatus *pss, RAMBlock *block,
ram_addr_t offset)
 {
 size_t size, len;
-bool same_block = (block == rs->last_sent_block);
+bool same_block = (block == pss->last_sent_block);
+QEMUFile *f = pss->pss_channel;
 
 if (same_block) {
 offset |= RAM_SAVE_FLAG_CONTINUE;
@@ -734,7 +735,7 @@ static size_t save_page_header(RAMState *rs, QEMUFile *f,  RAMBlock *block,
 qemu_put_byte(f, len);
 qemu_put_buffer(f, (uint8_t *)block->idstr, len);
 size += 1 + len;
-rs->last_sent_block = block;
+pss->last_sent_block = block;
 }
 return size;
 }
@@ -818,17 +819,19 @@ static void xbzrle_cache_zero_page(RAMState *rs, ram_addr_t current_addr)
  *  -1 means that xbzrle would be longer than normal
  *
  * @rs: current RAM state
+ * @pss: current PSS channel
  * @current_data: pointer to the address of the page contents
  * @current_addr: addr of the page
  * @block: block that contains the page we want to send
  * @offset: offset inside the block for the page
  */
-static int save_xbzrle_page(RAMState *rs, QEMUFile *file,
+static int save_xbzrle_page(RAMState *rs, PageSearchStatus *pss,
 uint8_t **current_data, ram_addr_t current_addr,
 RAMBlock *block, ram_addr_t offset)
 {
 int encoded_len = 0, bytes_xbzrle;
 uint8_t *prev_cached_page;
+QEMUFile *file = pss->pss_channel;
 
 if (!cache_is_cached(XBZRLE.cache, current_addr,
  ram_counters.dirty_sync_count)) {
@@ -893,7 +896,7 @@ static int save_xbzrle_page(RAMState *rs, QEMUFile *file,
 }
 
 /* Send XBZRLE based compressed page */
-bytes_xbzrle = save_page_header(rs, file, block,
+bytes_xbzrle = save_page_header(pss, block,
 offset | RAM_SAVE_FLAG_XBZRLE);
 qemu_put_byte(file, ENCODING_FLAG_XBZRLE);
 qemu_put_be16(file, encoded_len);
@@ -1324,19 +1327,19 @@ void ram_release_page(const char *rbname, uint64_t offset)
  * Returns the size of data written to the file, 0 means the page is not
  * a zero page
  *
- * @rs: current RAM state
- * @file: the file where the data is saved
+ * @pss: current PSS channel
  * @block: block that contains the page we want to send
  * @offset: offset inside the block for the page
  */
-static int save_zero_page_to_file(RAMState *rs, QEMUFile *file,
+static int save_zero_page_to_file(PageSearchStatus *pss,
   RAMBlock *block, ram_addr_t offset)
 {
 uint8_t *p = block->host + offset;
+QEMUFile *file = pss->pss_channel;
 int len = 0;
 
 if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
-len += save_page_header(rs, file, block, offset | RAM_SAVE_FLAG_ZERO);
+len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
 qemu_put_byte(file, 0);
 len += 1;
 ram_release_page(block->idstr, offset);
@@ -1349,14 +1352,14 @@ static int save_zero_page_to_file(RAMState *rs, QEMUFile *file,
  *
  * Returns the number of pages written.
  *
- * @rs: current RAM state
+ * @pss: current PSS channel
  * 

[PULL 19/30] migration: Remove RAMState.f references in compression code

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Remove references to RAMState.f in compress_page_with_multi_thread() and
flush_compressed_data().

The compression code by default isn't compatible with more than one channel
(it won't currently know which channel to flush the compressed data to), so
to keep it simple we always flush on the default to_dst_file port until
someone wants to add multi-port support, as rs->f right now can really
change (after postcopy preempt is introduced).

There should be no functional change at all after the patch is applied,
since as long as rs->f is referenced in the compression code, it must be
to_dst_file.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 42b6a543bd..ebc5664dcc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1489,6 +1489,7 @@ static bool save_page_use_compression(RAMState *rs);
 
 static void flush_compressed_data(RAMState *rs)
 {
+MigrationState *ms = migrate_get_current();
 int idx, len, thread_count;
 
 if (!save_page_use_compression(rs)) {
@@ -1507,7 +1508,7 @@ static void flush_compressed_data(RAMState *rs)
 for (idx = 0; idx < thread_count; idx++) {
 qemu_mutex_lock(&comp_param[idx].mutex);
 if (!comp_param[idx].quit) {
-len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+len = qemu_put_qemu_file(ms->to_dst_file, comp_param[idx].file);
 /*
  * it's safe to fetch zero_page without holding comp_done_lock
  * as there is no further request submitted to the thread,
@@ -1526,11 +1527,11 @@ static inline void set_compress_params(CompressParam 
*param, RAMBlock *block,
 param->offset = offset;
 }
 
-static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
-   ram_addr_t offset)
+static int compress_page_with_multi_thread(RAMBlock *block, ram_addr_t offset)
 {
 int idx, thread_count, bytes_xmit = -1, pages = -1;
 bool wait = migrate_compress_wait_thread();
+MigrationState *ms = migrate_get_current();
 
 thread_count = migrate_compress_threads();
 qemu_mutex_lock(&comp_done_lock);
@@ -1538,7 +1539,8 @@ retry:
 for (idx = 0; idx < thread_count; idx++) {
 if (comp_param[idx].done) {
 comp_param[idx].done = false;
-bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+bytes_xmit = qemu_put_qemu_file(ms->to_dst_file,
+comp_param[idx].file);
 qemu_mutex_lock(&comp_param[idx].mutex);
 set_compress_params(&comp_param[idx], block, offset);
 qemu_cond_signal(&comp_param[idx].cond);
@@ -2291,7 +2293,7 @@ static bool save_compress_page(RAMState *rs, RAMBlock *block, ram_addr_t offset)
 return false;
 }
 
-if (compress_page_with_multi_thread(rs, block, offset) > 0) {
+if (compress_page_with_multi_thread(block, offset) > 0) {
 return true;
 }
 
-- 
2.38.1




[PULL 10/30] migration: Fix possible infinite loop of ram save process

2022-11-15 Thread Juan Quintela
From: Peter Xu 

When starting ram saving procedure (especially at the completion phase),
always set last_seen_block to non-NULL to make sure we can always correctly
detect the case where "we've migrated all the dirty pages".

Then we'll guarantee that both last_seen_block and pss.block will always
be valid before the loop starts.

See the comment in the code for some details.
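
For reference, the completion test in find_dirty_block() that depends on these values reads roughly as follows (paraphrased from migration/ram.c):

    /* paraphrase of find_dirty_block()'s completion test */
    if (pss->complete_round && pss->block == rs->last_seen_block &&
        pss->page >= rs->last_page) {
        /*
         * We've searched the whole ramblock list once; with a NULL
         * last_seen_block this equality could never hold, which is the
         * infinite loop this patch prevents.
         */
        *again = false;
        return false;
    }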

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index bb4f08bfed..c0f5d6d287 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2574,14 +2574,22 @@ static int ram_find_and_save_block(RAMState *rs)
 return pages;
 }
 
+/*
+ * Always keep last_seen_block/last_page valid during this procedure,
+ * because find_dirty_block() relies on these values (e.g., we compare
+ * last_seen_block with pss.block to see whether we searched all the
+ * ramblocks) to detect the completion of migration.  Having NULL value
+ * of last_seen_block can conditionally cause below loop to run forever.
+ */
+if (!rs->last_seen_block) {
+rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
+rs->last_page = 0;
+}
+
 pss.block = rs->last_seen_block;
 pss.page = rs->last_page;
 pss.complete_round = false;
 
-if (!pss.block) {
-pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
-}
-
 do {
 again = true;
 found = get_queued_page(rs, &pss);
-- 
2.38.1




[PULL 09/30] Unit test code and benchmark code

2022-11-15 Thread Juan Quintela
From: ling xu 

Unit test code is in test-xbzrle.c, and benchmark code is in xbzrle-bench.c
for performance benchmarking.

Signed-off-by: ling xu 
Co-authored-by: Zhou Zhao 
Co-authored-by: Jun Jin 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 tests/bench/xbzrle-bench.c | 465 +
 tests/unit/test-xbzrle.c   |  39 +++-
 tests/bench/meson.build|   4 +
 3 files changed, 503 insertions(+), 5 deletions(-)
 create mode 100644 tests/bench/xbzrle-bench.c

diff --git a/tests/bench/xbzrle-bench.c b/tests/bench/xbzrle-bench.c
new file mode 100644
index 00..d71397e6f4
--- /dev/null
+++ b/tests/bench/xbzrle-bench.c
@@ -0,0 +1,465 @@
+/*
+ * Xor Based Zero Run Length Encoding unit tests.
+ *
+ * Copyright 2013 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *  Orit Wasserman  
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "../migration/xbzrle.h"
+
+#define XBZRLE_PAGE_SIZE 4096
+
+#if defined(CONFIG_AVX512BW_OPT)
+static bool is_cpu_support_avx512bw;
+#include "qemu/cpuid.h"
+static void __attribute__((constructor)) init_cpu_flag(void)
+{
+unsigned max = __get_cpuid_max(0, NULL);
+int a, b, c, d;
+is_cpu_support_avx512bw = false;
+if (max >= 1) {
+__cpuid(1, a, b, c, d);
+ /* We must check that AVX is not just available, but usable.  */
+if ((c & bit_OSXSAVE) && (c & bit_AVX) && max >= 7) {
+int bv;
+__asm("xgetbv" : "=a"(bv), "=d"(d) : "c"(0));
+__cpuid_count(7, 0, a, b, c, d);
+   /* 0xe6:
+*  XCR0[7:5] = 111b (OPMASK state, upper 256-bit of ZMM0-ZMM15
+*and ZMM16-ZMM31 state are enabled by OS)
+*  XCR0[2:1] = 11b (XMM state and YMM state are enabled by OS)
+*/
+if ((bv & 0xe6) == 0xe6 && (b & bit_AVX512BW)) {
+is_cpu_support_avx512bw = true;
+}
+}
+}
+return ;
+}
+#endif
+
+struct ResTime {
+float t_raw;
+float t_512;
+};
+
+static void encode_decode_zero(struct ResTime *res)
+{
+uint8_t *buffer = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *compressed = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *buffer512 = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *compressed512 = g_malloc0(XBZRLE_PAGE_SIZE);
+int i = 0;
+int dlen = 0, dlen512 = 0;
+int diff_len = g_test_rand_int_range(0, XBZRLE_PAGE_SIZE - 1006);
+
+for (i = diff_len; i > 0; i--) {
+buffer[1000 + i] = i;
+buffer512[1000 + i] = i;
+}
+
+buffer[1000 + diff_len + 3] = 103;
+buffer[1000 + diff_len + 5] = 105;
+
+buffer512[1000 + diff_len + 3] = 103;
+buffer512[1000 + diff_len + 5] = 105;
+
+/* encode zero page */
+time_t t_start, t_end, t_start512, t_end512;
+t_start = clock();
+dlen = xbzrle_encode_buffer(buffer, buffer, XBZRLE_PAGE_SIZE, compressed,
+   XBZRLE_PAGE_SIZE);
+t_end = clock();
+float time_val = difftime(t_end, t_start);
+g_assert(dlen == 0);
+
+t_start512 = clock();
+dlen512 = xbzrle_encode_buffer_avx512(buffer512, buffer512, XBZRLE_PAGE_SIZE,
+   compressed512, XBZRLE_PAGE_SIZE);
+t_end512 = clock();
+float time_val512 = difftime(t_end512, t_start512);
+g_assert(dlen512 == 0);
+
+res->t_raw = time_val;
+res->t_512 = time_val512;
+
+g_free(buffer);
+g_free(compressed);
+g_free(buffer512);
+g_free(compressed512);
+
+}
+
+static void test_encode_decode_zero_avx512(void)
+{
+int i;
+float time_raw = 0.0, time_512 = 0.0;
+struct ResTime res;
+for (i = 0; i < 1; i++) {
+encode_decode_zero(&res);
+time_raw += res.t_raw;
+time_512 += res.t_512;
+}
+printf("Zero test:\n");
+printf("Raw xbzrle_encode time is %f ms\n", time_raw);
+printf("512 xbzrle_encode time is %f ms\n", time_512);
+}
+
+static void encode_decode_unchanged(struct ResTime *res)
+{
+uint8_t *compressed = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *test = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *compressed512 = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *test512 = g_malloc0(XBZRLE_PAGE_SIZE);
+int i = 0;
+int dlen = 0, dlen512 = 0;
+int diff_len = g_test_rand_int_range(0, XBZRLE_PAGE_SIZE - 1006);
+
+for (i = diff_len; i > 0; i--) {
+test[1000 + i] = i + 4;
+test512[1000 + i] = i + 4;
+}
+
+test[1000 + diff_len + 3] = 107;
+test[1000 + diff_len + 5] = 109;
+
+test512[1000 + diff_len + 3] = 107;
+test512[1000 + diff_len + 5] = 109;
+
+/* test unchanged buffer */
+time_t t_start, t_end, t_start512, t_end512;
+t_start = clock();
+dlen = xbzrle_encode_buffer(test, test, XBZRLE_PAGE_SIZE, compressed,
+XBZRLE_PAGE_SIZE);
+

[PULL 13/30] migration: Use non-atomic ops for clear log bitmap

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Since we already have bitmap_mutex to protect either the dirty bitmap or
the clear log bitmap, we don't need atomic operations to set/clear/test on
the clear log bitmap.  Switch all ops from atomic to non-atomic
versions, and meanwhile touch up the comments to show which lock is in charge.

Introduce a non-atomic version of bitmap_test_and_clear_atomic(); it is
mostly the same as the atomic version but simplified in a few places, e.g. it
drops the "old_bits" variable and the explicit memory barriers.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 include/exec/ram_addr.h | 11 +-
 include/exec/ramblock.h |  3 +++
 include/qemu/bitmap.h   |  1 +
 util/bitmap.c   | 45 +
 4 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 1500680458..f4fb6a2111 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -42,7 +42,8 @@ static inline long clear_bmap_size(uint64_t pages, uint8_t shift)
 }
 
 /**
- * clear_bmap_set: set clear bitmap for the page range
+ * clear_bmap_set: set clear bitmap for the page range.  Must be with
+ * bitmap_mutex held.
  *
  * @rb: the ramblock to operate on
  * @start: the start page number
@@ -55,12 +56,12 @@ static inline void clear_bmap_set(RAMBlock *rb, uint64_t start,
 {
 uint8_t shift = rb->clear_bmap_shift;
 
-bitmap_set_atomic(rb->clear_bmap, start >> shift,
-  clear_bmap_size(npages, shift));
+bitmap_set(rb->clear_bmap, start >> shift, clear_bmap_size(npages, shift));
 }
 
 /**
- * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set
+ * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set.
+ * Must be with bitmap_mutex held.
  *
  * @rb: the ramblock to operate on
  * @page: the page number to check
@@ -71,7 +72,7 @@ static inline bool clear_bmap_test_and_clear(RAMBlock *rb, uint64_t page)
 {
 uint8_t shift = rb->clear_bmap_shift;
 
-return bitmap_test_and_clear_atomic(rb->clear_bmap, page >> shift, 1);
+return bitmap_test_and_clear(rb->clear_bmap, page >> shift, 1);
 }
 
 static inline bool offset_in_ramblock(RAMBlock *b, ram_addr_t offset)
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 6cbedf9e0c..adc03df59c 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -53,6 +53,9 @@ struct RAMBlock {
  * and split clearing of dirty bitmap on the remote node (e.g.,
  * KVM).  The bitmap will be set only when doing global sync.
  *
+ * It is only used during src side of ram migration, and it is
+ * protected by the global ram_state.bitmap_mutex.
+ *
  * NOTE: this bitmap is different comparing to the other bitmaps
  * in that one bit can represent multiple guest pages (which is
  * decided by the `clear_bmap_shift' variable below).  On
diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
index 82a1d2f41f..3ccb00865f 100644
--- a/include/qemu/bitmap.h
+++ b/include/qemu/bitmap.h
@@ -253,6 +253,7 @@ void bitmap_set(unsigned long *map, long i, long len);
 void bitmap_set_atomic(unsigned long *map, long i, long len);
 void bitmap_clear(unsigned long *map, long start, long nr);
 bool bitmap_test_and_clear_atomic(unsigned long *map, long start, long nr);
+bool bitmap_test_and_clear(unsigned long *map, long start, long nr);
 void bitmap_copy_and_clear_atomic(unsigned long *dst, unsigned long *src,
   long nr);
 unsigned long bitmap_find_next_zero_area(unsigned long *map,
diff --git a/util/bitmap.c b/util/bitmap.c
index f81d8057a7..8d12e90a5a 100644
--- a/util/bitmap.c
+++ b/util/bitmap.c
@@ -240,6 +240,51 @@ void bitmap_clear(unsigned long *map, long start, long nr)
 }
 }
 
+bool bitmap_test_and_clear(unsigned long *map, long start, long nr)
+{
+unsigned long *p = map + BIT_WORD(start);
+const long size = start + nr;
+int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+bool dirty = false;
+
+assert(start >= 0 && nr >= 0);
+
+/* First word */
+if (nr - bits_to_clear > 0) {
+if ((*p) & mask_to_clear) {
+dirty = true;
+}
+*p &= ~mask_to_clear;
+nr -= bits_to_clear;
+bits_to_clear = BITS_PER_LONG;
+p++;
+}
+
+/* Full words */
+if (bits_to_clear == BITS_PER_LONG) {
+while (nr >= BITS_PER_LONG) {
+if (*p) {
+dirty = true;
+*p = 0;
+}
+nr -= BITS_PER_LONG;
+p++;
+}
+}
+
+/* Last word */
+if (nr) {
+mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+if ((*p) & mask_to_clear) {
+dirty = true;
+}
+*p &= ~mask_to_clear;
+}
+
+return dirty;
+}
+
 bool 

[PULL 06/30] migration: Export ram_transferred_add()

2022-11-15 Thread Juan Quintela
Signed-off-by: Juan Quintela 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: David Edmondson 
Reviewed-by: Leonardo Bras 
---
 migration/ram.h | 2 ++
 migration/ram.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/migration/ram.h b/migration/ram.h
index c7af65ac74..e844966f69 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -65,6 +65,8 @@ int ram_load_postcopy(QEMUFile *f, int channel);
 
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 
+void ram_transferred_add(uint64_t bytes);
+
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
diff --git a/migration/ram.c b/migration/ram.c
index dc1de9ddbc..00a06b2c16 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -422,7 +422,7 @@ uint64_t ram_bytes_remaining(void)
 
 MigrationStats ram_counters;
 
-static void ram_transferred_add(uint64_t bytes)
+void ram_transferred_add(uint64_t bytes)
 {
 if (runstate_is_running()) {
 ram_counters.precopy_bytes += bytes;
-- 
2.38.1




Re: [PATCH v2] m25p80: Improve error when the backend file size does not match the device

2022-11-15 Thread Philippe Mathieu-Daudé

On 15/11/22 16:10, Cédric Le Goater wrote:

Currently, when a block backend is attached to a m25p80 device and the
associated file size does not match the flash model, QEMU complains
with the error message "failed to read the initial flash content".
This is confusing for the user.

Use blk_check_size_and_read_all() instead of blk_pread() to improve
the reported error.

Signed-off-by: Cédric Le Goater 
---
  hw/block/m25p80.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v2 2/9] block-copy: add missing coroutine_fn annotations

2022-11-15 Thread Emanuele Giuseppe Esposito
To sum up what was discussed in this series, I don't really see any
strong objection against these patches, so I will soon send v3, which is
pretty much the same except for patch 1, which will be removed.

I think these patches are useful and will be even more meaningful to the
reviewer when I send all the rwlock patches in the next few days.

What has been discussed so far (using QEMU_IN_COROUTINE, using some sort
of tool to automate everything, etc.) has been noted and as I understand
will be researched by Alberto.

Thank you,
Emanuele

Am 10/11/2022 um 11:52 schrieb Paolo Bonzini:
> On Wed, Nov 9, 2022 at 1:24 PM Emanuele Giuseppe Esposito
>  wrote:
 What I do know is that it's extremely confusing to understand if a
 function that is *not* marked as coroutine_fn is actually being called
 also from coroutines or not.
> 
> Agreed. This is a huge point in favor of pushing coroutine wrappers as
> far up in the call stack as possible, because it means more
> coroutine_fns and fewer mixed functions.
> 
>>> This is a lot better than our "coroutine_fn" sign, which actually do no
>>> check (and can't do). Don't you plan to swap a "coroutine_fn" noop
>>> marker with more meaningful IN_COROUTINE(); (or something like this,
>>> which just do assert(qemu_in_coroutine())) at start of the function? It
>>> would be a lot safer.
>>
>> CCing also Alberto and Paolo
>>
>> So basically I think what we need is something that scans the whole
>> block layer code and puts the right coroutine_fn annotations (or
>> assertions, if you want) in the right places.
> 
> coroutine_fn markers are done by Alberto's static analyzer, which I
> used to add coroutine_fn pretty much everywhere in the code base where
> they are *needed*. My rules are simple:
> 
> * there MUST be no calls from non-coroutine_fn to coroutine_fn, this is 
> obvious
> 
> * there MUST be no blocking in coroutine_fn
> 
> * there SHOULD be no calls from coroutine_fn to generated_co_wrapper;
> use the wrapped *_co_* function directly instead.
> 
> To catch the last one, or possibly the last two, Alberto added
> no_coroutine_fn. In a perfect world non-marked functions would be
> "valid either in coroutine or non-coroutine function": they would call
> neither coroutine_fns nor no_coroutine_fns.
> 
> This is unfortunately easier said than done, but in order to move
> towards that case, I think we can look again at vrc and extend it with
> new commands. Alberto's work covers *local* tests, looking at one
> caller and one callee at a time. With vrc's knowledge of the global
> call graph, we can find *all* paths from a coroutine_fn to a
> generated_co_wrapper, including those that go through unmarked
> functions. Then there are two cases:
> 
> * if the unmarked function is never called from outside a coroutine,
> call the wrapped function and change it to coroutine_fn
> 
> * if the unmarked function can be called from outside a coroutine,
> change it to a coroutine_fn (renaming it) and add a
> generated_co_wrapper. Rinse and repeat.
> 
>> However, it would be nice to assign this to someone and do this
>> automatically, not doing it by hand. I am not sure if Alberto static
>> analyzer is currently capable of doing that.
> 
> I think the first step has to be done by hand, because it entails
> creating new generated_co_wrappers. Checking for regressions can then
> be done automatically though.
> 
> Paolo
> 
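
A compact sketch of Paolo's three rules above, using real block-layer entry points (bdrv_co_flush() is a coroutine_fn and bdrv_flush() its generated wrapper); the my_* functions are hypothetical:

    /* hypothetical callers illustrating the rules */
    static int coroutine_fn my_co_job(BlockDriverState *bs)
    {
        /* ok: coroutine_fn calling coroutine_fn directly */
        return bdrv_co_flush(bs);
    }

    static int my_job(BlockDriverState *bs)  /* unmarked, may block */
    {
        /* ok outside coroutine context: the generated wrapper spawns
         * and runs the coroutine.  Calling my_co_job() here directly
         * would break rule 1; calling bdrv_flush() from a coroutine_fn
         * would break rule 3. */
        return bdrv_flush(bs);
    }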




[PULL 27/30] migration: Send requested page directly in rp-return thread

2022-11-15 Thread Juan Quintela
From: Peter Xu 

With all the facilities ready, send the requested page directly in the
rp-return thread rather than queuing it in the request queue, if and only
if postcopy preempt is enabled.  It can do so because it uses a separate
channel for sending urgent pages.  The only shared data is the bitmap, and
it's protected by the bitmap_mutex.

Note that since we're moving the ownership of the urgent channel from the
migration thread to the rp thread, it also means the rp thread is
responsible for managing the qemufile, e.g. properly closing it when
pausing migration happens.  For this, let migration_release_from_dst_file()
cover shutdown of the urgent channel too, renaming it to
migration_release_dst_files() to better show what it does.
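
The queueing path (its new branch is visible, truncated, at the end of the diff) then short-circuits along these lines; a sketch, assuming the ram_save_host_page_urgent() declared in the ram.c hunk:

    /* sketch of the fast path in ram_save_queue_pages() */
    if (postcopy_preempt_active()) {
        /* the rp-return thread sends the page itself on the urgent
         * channel, under bitmap_mutex, instead of queueing it for
         * the migration thread */
        qemu_mutex_lock(&rs->bitmap_mutex);
        ret = ram_save_host_page_urgent(pss);
        qemu_mutex_unlock(&rs->bitmap_mutex);
        return ret;
    }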

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.c |  35 +++--
 migration/ram.c   | 112 ++
 2 files changed, 131 insertions(+), 16 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 1f95877fb4..42f36c1e2c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2868,8 +2868,11 @@ static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
 return 0;
 }
 
-/* Release ms->rp_state.from_dst_file in a safe way */
-static void migration_release_from_dst_file(MigrationState *ms)
+/*
+ * Release ms->rp_state.from_dst_file (and postcopy_qemufile_src if
+ * existed) in a safe way.
+ */
+static void migration_release_dst_files(MigrationState *ms)
 {
 QEMUFile *file;
 
@@ -2882,6 +2885,18 @@ static void migration_release_from_dst_file(MigrationState *ms)
 ms->rp_state.from_dst_file = NULL;
 }
 
+/*
+ * Do the same to postcopy fast path socket too if there is.  No
+ * locking needed because this qemufile should only be managed by
+ * return path thread.
+ */
+if (ms->postcopy_qemufile_src) {
+migration_ioc_unregister_yank_from_file(ms->postcopy_qemufile_src);
+qemu_file_shutdown(ms->postcopy_qemufile_src);
+qemu_fclose(ms->postcopy_qemufile_src);
+ms->postcopy_qemufile_src = NULL;
+}
+
 qemu_fclose(file);
 }
 
@@ -3026,7 +3041,7 @@ out:
  * Maybe there is something we can do: it looks like a
  * network down issue, and we pause for a recovery.
  */
-migration_release_from_dst_file(ms);
+migration_release_dst_files(ms);
+migration_release_dst_files(ms);
 rp = NULL;
 if (postcopy_pause_return_path_thread(ms)) {
 /*
@@ -3044,7 +3059,7 @@ out:
 }
 
 trace_source_return_path_thread_end();
-migration_release_from_dst_file(ms);
+migration_release_dst_files(ms);
 rcu_unregister_thread();
 return NULL;
 }
@@ -3567,18 +3582,6 @@ static MigThrError postcopy_pause(MigrationState *s)
 qemu_file_shutdown(file);
 qemu_fclose(file);
 
-/*
- * Do the same to postcopy fast path socket too if there is.  No
- * locking needed because no racer as long as we do this before setting
- * status to paused.
- */
-if (s->postcopy_qemufile_src) {
-migration_ioc_unregister_yank_from_file(s->postcopy_qemufile_src);
-qemu_file_shutdown(s->postcopy_qemufile_src);
-qemu_fclose(s->postcopy_qemufile_src);
-s->postcopy_qemufile_src = NULL;
-}
-
 migrate_set_state(&s->state, s->state,
   MIGRATION_STATUS_POSTCOPY_PAUSED);
 
diff --git a/migration/ram.c b/migration/ram.c
index dbdde5a6a5..5dc221a2fc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -574,6 +574,8 @@ static QemuThread *decompress_threads;
 static QemuMutex decomp_done_lock;
 static QemuCond decomp_done_cond;
 
+static int ram_save_host_page_urgent(PageSearchStatus *pss);
+
static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
  ram_addr_t offset, uint8_t *source_buf);
 
@@ -588,6 +590,16 @@ static void pss_init(PageSearchStatus *pss, RAMBlock *rb, ram_addr_t page)
 pss->complete_round = false;
 }
 
+/*
+ * Check whether two PSSs are actively sending the same page.  Return true
+ * if it is, false otherwise.
+ */
+static bool pss_overlap(PageSearchStatus *pss1, PageSearchStatus *pss2)
+{
+return pss1->host_page_sending && pss2->host_page_sending &&
+(pss1->host_page_start == pss2->host_page_start);
+}
+
 static void *do_data_compress(void *opaque)
 {
 CompressParam *param = opaque;
@@ -2288,6 +2300,57 @@ int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
 return -1;
 }
 
+/*
+ * When with postcopy preempt, we send back the page directly in the
+ * rp-return thread.
+ */
+if (postcopy_preempt_active()) {
+ram_addr_t page_start = start >> TARGET_PAGE_BITS;
+size_t page_size = 

[PULL 30/30] migration: Block migration comment or code is wrong

2022-11-15 Thread Juan Quintela
And it appears that what is wrong is the code. During the bulk stage we
need to make sure that some block is dirty, but there should be no games
with max_size at all.

Signed-off-by: Juan Quintela 
Reviewed-by: Stefan Hajnoczi 
---
 migration/block.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index 3577c815a9..4347da1526 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -880,8 +880,8 @@ static void block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
 blk_mig_unlock();
 
 /* Report at least one block pending during bulk phase */
-if (pending <= max_size && !block_mig_state.bulk_completed) {
-pending = max_size + BLK_MIG_BLOCK_SIZE;
+if (!pending && !block_mig_state.bulk_completed) {
+pending = BLK_MIG_BLOCK_SIZE;
 }
 
 trace_migration_block_save_pending(pending);
-- 
2.38.1




[PULL 24/30] migration: Add pss_init()

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Helper to init PSS structures.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index fedd61b3da..a2e86623d3 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -570,6 +570,14 @@ static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
 static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss,
  bool postcopy_requested);
 
+/* NOTE: page is the PFN not real ram_addr_t. */
+static void pss_init(PageSearchStatus *pss, RAMBlock *rb, ram_addr_t page)
+{
+pss->block = rb;
+pss->page = page;
+pss->complete_round = false;
+}
+
 static void *do_data_compress(void *opaque)
 {
 CompressParam *param = opaque;
@@ -2678,9 +2686,7 @@ static int ram_find_and_save_block(RAMState *rs)
 rs->last_page = 0;
 }
 
-pss.block = rs->last_seen_block;
-pss.page = rs->last_page;
-pss.complete_round = false;
+pss_init(&pss, rs->last_seen_block, rs->last_page);
 
 do {
 again = true;
-- 
2.38.1




[PULL 22/30] migration: Teach PSS about host page

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Migration code has a lot to do with host pages.  Teaching PSS core about
the idea of host page helps a lot and makes the code clean.  Meanwhile,
this prepares for the future changes that can leverage the new PSS helpers
that this patch introduces to send host page in another thread.

Three more fields are introduced for this:

  (1) host_page_sending: this is set to true when QEMU is sending a host
  page, false otherwise.

  (2) host_page_{start|end}: these point to the start/end of host page
  we're sending, and it's only valid when host_page_sending==true.

For example, when we look up the next dirty page on the ramblock, with
host_page_sending==true, we'll not try to look for anything beyond the
current host page boundary.  This can be slightly more efficient than the
current code because currently we'll set pss->page to the next dirty bit
(which can be over the current host page boundary) and reset it to the host
page boundary if we find it goes beyond that.

With the above, we can easily make migration_bitmap_find_dirty() self-contained
by updating pss->page properly.  The rs* parameter is removed because it's
not even used in the old code.

When sending a host page, we should use the pss helpers like this:

  - pss_host_page_prepare(pss): called before sending host page
  - pss_within_range(pss): whether we're still working on the current host page?
  - pss_host_page_finish(pss): called after sending a host page

Then we can use ram_save_target_page() to save one small page.

Currently ram_save_host_page() is still the only user. If there'll be
another function to send host page (e.g. in return path thread) in the
future, it should follow the same style.
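
In schematic form, a host-page sender following that style looks like this (a sketch of the pattern, not the exact ram_save_host_page() body):

    /* sketch: send one host page, small page by small page */
    pss_host_page_prepare(pss);
    do {
        if (ram_save_target_page(rs, pss) < 0) {   /* one small page */
            break;
        }
        pss_find_next_dirty(pss);   /* clamped to the host page range */
    } while (pss_within_range(pss));
    pss_host_page_finish(pss);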

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 95 +++--
 1 file changed, 76 insertions(+), 19 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 25fd3cf7dc..b71edf1f26 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -509,6 +509,11 @@ struct PageSearchStatus {
  * postcopy pages via postcopy preempt channel.
  */
 bool postcopy_target_channel;
+/* Whether we're sending a host page */
+bool  host_page_sending;
+/* The start/end of current host page.  Only valid if host_page_sending==true */
+unsigned long host_page_start;
+unsigned long host_page_end;
 };
 typedef struct PageSearchStatus PageSearchStatus;
 
@@ -886,26 +891,38 @@ static int save_xbzrle_page(RAMState *rs, uint8_t **current_data,
 }
 
 /**
- * migration_bitmap_find_dirty: find the next dirty page from start
+ * pss_find_next_dirty: find the next dirty page of current ramblock
  *
- * Returns the page offset within memory region of the start of a dirty page
+ * This function updates pss->page to point to the next dirty page index
+ * within the ramblock to migrate, or the end of ramblock when nothing
+ * found.  Note that when pss->host_page_sending==true it means we're
+ * during sending a host page, so we won't look for dirty page that is
+ * outside the host page boundary.
  *
- * @rs: current RAM state
- * @rb: RAMBlock where to search for dirty pages
- * @start: page where we start the search
+ * @pss: the current page search status
  */
-static inline
-unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
-  unsigned long start)
+static void pss_find_next_dirty(PageSearchStatus *pss)
 {
+RAMBlock *rb = pss->block;
 unsigned long size = rb->used_length >> TARGET_PAGE_BITS;
 unsigned long *bitmap = rb->bmap;
 
 if (ramblock_is_ignored(rb)) {
-return size;
+/* Points directly to the end, so we know no dirty page */
+pss->page = size;
+return;
 }
 
-return find_next_bit(bitmap, size, start);
+/*
+ * If during sending a host page, only look for dirty pages within the
+ * current host page being send.
+ */
+if (pss->host_page_sending) {
+assert(pss->host_page_end);
+size = MIN(size, pss->host_page_end);
+}
+
+pss->page = find_next_bit(bitmap, size, pss->page);
 }
 
 static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb,
@@ -1591,7 +1608,9 @@ static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
 pss->postcopy_requested = false;
 pss->postcopy_target_channel = RAM_CHANNEL_PRECOPY;
 
-pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
+/* Update pss->page for the next dirty bit in ramblock */
+pss_find_next_dirty(pss);
+
 if (pss->complete_round && pss->block == rs->last_seen_block &&
 pss->page >= rs->last_page) {
 /*
@@ -2480,6 +2499,44 @@ static void postcopy_preempt_reset_channel(RAMState *rs)
 }
 }
 
+/* Should be called before sending a host page */
+static void pss_host_page_prepare(PageSearchStatus *pss)
+{
+/* How many guest 

[PULL 07/30] migration: Export ram_release_page()

2022-11-15 Thread Juan Quintela
Signed-off-by: Juan Quintela 
Reviewed-by: Leonardo Bras 
---
 migration/ram.h | 1 +
 migration/ram.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/ram.h b/migration/ram.h
index e844966f69..038d52f49f 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -66,6 +66,7 @@ int ram_load_postcopy(QEMUFile *f, int channel);
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 
 void ram_transferred_add(uint64_t bytes);
+void ram_release_page(const char *rbname, uint64_t offset);
 
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
diff --git a/migration/ram.c b/migration/ram.c
index 00a06b2c16..67e41dd2c0 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1234,7 +1234,7 @@ static void migration_bitmap_sync_precopy(RAMState *rs)
 }
 }
 
-static void ram_release_page(const char *rbname, uint64_t offset)
+void ram_release_page(const char *rbname, uint64_t offset)
 {
 if (!migrate_release_ram() || !migration_in_postcopy()) {
 return;
-- 
2.38.1




[PULL 08/30] Update AVX512 support for xbzrle_encode_buffer

2022-11-15 Thread Juan Quintela
From: ling xu 

This commit updates code of avx512 support for xbzrle_encode_buffer
function to accelerate xbzrle encoding speed. Runtime check of avx512
support and benchmark for this feature are added. Compared with C
version of xbzrle_encode_buffer function, avx512 version can achieve
50%-70% performance improvement on benchmarking. In addition, if dirty
data is randomly located in 4K page, the avx512 version can achieve
almost 140% performance gain.

Signed-off-by: ling xu 
Co-authored-by: Zhou Zhao 
Co-authored-by: Jun Jin 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 meson.build   |  16 +
 migration/xbzrle.h|   4 ++
 migration/ram.c   |  34 +-
 migration/xbzrle.c| 124 ++
 meson_options.txt |   2 +
 scripts/meson-buildoptions.sh |  14 ++--
 6 files changed, 186 insertions(+), 8 deletions(-)

diff --git a/meson.build b/meson.build
index cf3e517e56..d0d28f5c9e 100644
--- a/meson.build
+++ b/meson.build
@@ -2344,6 +2344,22 @@ config_host_data.set('CONFIG_AVX512F_OPT', get_option('avx512f') \
 int main(int argc, char *argv[]) { return bar(argv[argc - 1]); }
   '''), error_message: 'AVX512F not available').allowed())
 
+config_host_data.set('CONFIG_AVX512BW_OPT', get_option('avx512bw') \
+  .require(have_cpuid_h, error_message: 'cpuid.h not available, cannot enable AVX512BW') \
+  .require(cc.links('''
+#pragma GCC push_options
+#pragma GCC target("avx512bw")
+#include <cpuid.h>
+#include <immintrin.h>
+static int bar(void *a) {
+
+  __m512i *x = a;
+  __m512i res= _mm512_abs_epi8(*x);
+  return res[1];
+}
+int main(int argc, char *argv[]) { return bar(argv[0]); }
+  '''), error_message: 'AVX512BW not available').allowed())
+
 have_pvrdma = get_option('pvrdma') \
   .require(rdma.found(), error_message: 'PVRDMA requires OpenFabrics 
libraries') \
   .require(cc.compiles(gnu_source_prefix + '''
diff --git a/migration/xbzrle.h b/migration/xbzrle.h
index a0db507b9c..6feb49160a 100644
--- a/migration/xbzrle.h
+++ b/migration/xbzrle.h
@@ -18,4 +18,8 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, 
int slen,
  uint8_t *dst, int dlen);
 
 int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
+#if defined(CONFIG_AVX512BW_OPT)
+int xbzrle_encode_buffer_avx512(uint8_t *old_buf, uint8_t *new_buf, int slen,
+uint8_t *dst, int dlen);
+#endif
 #endif
diff --git a/migration/ram.c b/migration/ram.c
index 67e41dd2c0..bb4f08bfed 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -83,6 +83,34 @@
 /* 0x80 is reserved in migration.h start with 0x100 next */
 #define RAM_SAVE_FLAG_COMPRESS_PAGE0x100
 
+int (*xbzrle_encode_buffer_func)(uint8_t *, uint8_t *, int,
+ uint8_t *, int) = xbzrle_encode_buffer;
+#if defined(CONFIG_AVX512BW_OPT)
+#include "qemu/cpuid.h"
+static void __attribute__((constructor)) init_cpu_flag(void)
+{
+unsigned max = __get_cpuid_max(0, NULL);
+int a, b, c, d;
+if (max >= 1) {
+__cpuid(1, a, b, c, d);
+ /* We must check that AVX is not just available, but usable.  */
+if ((c & bit_OSXSAVE) && (c & bit_AVX) && max >= 7) {
+int bv;
+__asm("xgetbv" : "=a"(bv), "=d"(d) : "c"(0));
+__cpuid_count(7, 0, a, b, c, d);
+   /* 0xe6:
+*  XCR0[7:5] = 111b (OPMASK state, upper 256-bit of ZMM0-ZMM15
+*and ZMM16-ZMM31 state are enabled by OS)
+*  XCR0[2:1] = 11b (XMM state and YMM state are enabled by OS)
+*/
+if ((bv & 0xe6) == 0xe6 && (b & bit_AVX512BW)) {
+xbzrle_encode_buffer_func = xbzrle_encode_buffer_avx512;
+}
+}
+}
+}
+#endif
+
 XBZRLECacheStats xbzrle_counters;
 
 /* struct contains XBZRLE cache and a static page
@@ -802,9 +830,9 @@ static int save_xbzrle_page(RAMState *rs, uint8_t **current_data,
 memcpy(XBZRLE.current_buf, *current_data, TARGET_PAGE_SIZE);
 
 /* XBZRLE encoding (if there is no overflow) */
-encoded_len = xbzrle_encode_buffer(prev_cached_page, XBZRLE.current_buf,
-   TARGET_PAGE_SIZE, XBZRLE.encoded_buf,
-   TARGET_PAGE_SIZE);
+encoded_len = xbzrle_encode_buffer_func(prev_cached_page, XBZRLE.current_buf,
+TARGET_PAGE_SIZE, XBZRLE.encoded_buf,
+TARGET_PAGE_SIZE);
 
 /*
  * Update the cache contents, so that it corresponds to the data
diff --git a/migration/xbzrle.c b/migration/xbzrle.c
index 1ba482ded9..05366e86c0 100644
--- a/migration/xbzrle.c
+++ b/migration/xbzrle.c
@@ -174,3 +174,127 @@ int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
 
 return d;
 }
+
+#if defined(CONFIG_AVX512BW_OPT)
+#pragma GCC push_options
+#pragma GCC 

[PULL 23/30] migration: Introduce pss_channel

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Introduce pss_channel for PageSearchStatus, define it as "the migration
channel to be used to transfer this host page".

We used to have rs->f, which is a mirror to MigrationState.to_dst_file.

After postcopy preempt initial version, rs->f can be dynamically changed
depending on which channel we want to use.

But that later work still doesn't grant full concurrency of sending pages
in e.g. different threads, because rs->f can either be the PRECOPY channel
or POSTCOPY channel.  This needs to be per-thread too.

PageSearchStatus is actually a good piece of struct which we can leverage
if we want to have multiple threads sending pages.  Sending a single guest
page may not make sense, so we make the granule to be "host page", and in
the PSS structure we allow specify a QEMUFile* to migrate a specific host
page.  Then we open the possibility to specify different channels in
different threads with different PSS structures.

The PSS prefix can be slightly misleading here because e.g. for the
upcoming usage of postcopy channel/thread it's not "searching" (or,
scanning) at all but sending the explicit page that was requested.  However
since PSS existed for some years keep it as-is until someone complains.

This patch mostly (simply) replaces rs->f with pss->pss_channel only. No
functional change intended for this patch yet.  But it does prepare to
finally drop rs->f, and make ram_save_guest_page() thread safe.
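
Once each PSS carries its own channel, the wiring can be as simple as the following; this is a hypothetical sketch, since the actual assignment sites land later in the series:

    /* hypothetical: bind each per-channel PSS to its QEMUFile */
    rs->pss[RAM_CHANNEL_PRECOPY].pss_channel = ms->to_dst_file;
    rs->pss[RAM_CHANNEL_POSTCOPY].pss_channel = ms->postcopy_qemufile_src;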

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 70 +++--
 1 file changed, 38 insertions(+), 32 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index b71edf1f26..fedd61b3da 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -481,6 +481,8 @@ void dirty_sync_missed_zero_copy(void)
 
 /* used by the search for pages to send */
 struct PageSearchStatus {
+/* The migration channel used for a specific host page */
+QEMUFile*pss_channel;
 /* Current block being searched */
 RAMBlock*block;
 /* Current page to search from */
@@ -803,9 +805,9 @@ static void xbzrle_cache_zero_page(RAMState *rs, ram_addr_t current_addr)
  * @block: block that contains the page we want to send
  * @offset: offset inside the block for the page
  */
-static int save_xbzrle_page(RAMState *rs, uint8_t **current_data,
-ram_addr_t current_addr, RAMBlock *block,
-ram_addr_t offset)
+static int save_xbzrle_page(RAMState *rs, QEMUFile *file,
+uint8_t **current_data, ram_addr_t current_addr,
+RAMBlock *block, ram_addr_t offset)
 {
 int encoded_len = 0, bytes_xbzrle;
 uint8_t *prev_cached_page;
@@ -873,11 +875,11 @@ static int save_xbzrle_page(RAMState *rs, uint8_t **current_data,
 }
 
 /* Send XBZRLE based compressed page */
-bytes_xbzrle = save_page_header(rs, rs->f, block,
+bytes_xbzrle = save_page_header(rs, file, block,
 offset | RAM_SAVE_FLAG_XBZRLE);
-qemu_put_byte(rs->f, ENCODING_FLAG_XBZRLE);
-qemu_put_be16(rs->f, encoded_len);
-qemu_put_buffer(rs->f, XBZRLE.encoded_buf, encoded_len);
+qemu_put_byte(file, ENCODING_FLAG_XBZRLE);
+qemu_put_be16(file, encoded_len);
+qemu_put_buffer(file, XBZRLE.encoded_buf, encoded_len);
 bytes_xbzrle += encoded_len + 1 + 2;
 /*
  * Like compressed_size (please see update_compress_thread_counts),
@@ -1333,9 +1335,10 @@ static int save_zero_page_to_file(RAMState *rs, QEMUFile *file,
  * @block: block that contains the page we want to send
  * @offset: offset inside the block for the page
  */
-static int save_zero_page(RAMState *rs, RAMBlock *block, ram_addr_t offset)
+static int save_zero_page(RAMState *rs, QEMUFile *file, RAMBlock *block,
+  ram_addr_t offset)
 {
-int len = save_zero_page_to_file(rs, rs->f, block, offset);
+int len = save_zero_page_to_file(rs, file, block, offset);
 
 if (len) {
 stat64_add(&ram_atomic_counters.duplicate, 1);
@@ -1352,15 +1355,15 @@ static int save_zero_page(RAMState *rs, RAMBlock *block, ram_addr_t offset)
  *
  * Return true if the pages has been saved, otherwise false is returned.
  */
-static bool control_save_page(RAMState *rs, RAMBlock *block, ram_addr_t offset,
-  int *pages)
+static bool control_save_page(PageSearchStatus *pss, RAMBlock *block,
+  ram_addr_t offset, int *pages)
 {
 uint64_t bytes_xmit = 0;
 int ret;
 
 *pages = -1;
-ret = ram_control_save_page(rs->f, block->offset, offset, TARGET_PAGE_SIZE,
-&bytes_xmit);
+ret = ram_control_save_page(pss->pss_channel, block->offset, offset,
+TARGET_PAGE_SIZE, &bytes_xmit);
 if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
 return false;
 }

[PULL 14/30] migration: Disable multifd explicitly with compression

2022-11-15 Thread Juan Quintela
From: Peter Xu 

The multifd thread model does not work for compression; explicitly disable it.

Note that previously even if we enabled both of them, nothing would go
wrong, because the compression code has higher priority so the multifd
feature would just be ignored.  Now we'll fail even earlier, at config time,
so the user is better aware of the consequence.

Note that there can be a slight chance of breaking existing users, but
let's assume they're not the majority and not serious users, or they should
have already found that multifd is not working.

With that, we can safely drop the check in ram_save_target_page() for using
multifd, because when multifd=on then compression=off, so the removed check
on save_page_use_compression() would always return false anyway.
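
With the check in place, trying to enable both capabilities now fails up front; an illustrative monitor session (the error text is from the new check, the exact monitor framing is assumed):

    (qemu) migrate_set_capability compress on
    (qemu) migrate_set_capability multifd on
    Error: Multifd is not compatible with compress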

Signed-off-by: Peter Xu 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.c |  7 +++
 migration/ram.c   | 11 +--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 0bc3fce4b7..9fbed8819a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1370,6 +1370,13 @@ static bool migrate_caps_check(bool *cap_list,
 }
 }
 
+if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
+if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+error_setg(errp, "Multifd is not compatible with compress");
+return false;
+}
+}
+
 return true;
 }
 
diff --git a/migration/ram.c b/migration/ram.c
index c0f5d6d287..2fcce796d0 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2333,13 +2333,12 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
 }
 
 /*
- * Do not use multifd for:
- * 1. Compression as the first page in the new block should be posted out
- *before sending the compressed page
- * 2. In postcopy as one whole host page should be placed
+ * Do not use multifd in postcopy as one whole host page should be
+ * placed.  Meanwhile postcopy requires atomic update of pages, so even
+ * if host page size == guest page size the dest guest during run may
+ * still see partially copied pages which is data corruption.
  */
-if (!save_page_use_compression(rs) && migrate_use_multifd()
-&& !migration_in_postcopy()) {
+if (migrate_use_multifd() && !migration_in_postcopy()) {
 return ram_save_multifd_page(rs, block, offset);
 }
 
-- 
2.38.1




[PULL 25/30] migration: Make PageSearchStatus part of RAMState

2022-11-15 Thread Juan Quintela
From: Peter Xu 

We used to allocate the PSS structure on the stack for precopy when sending
pages.  Make it static, so as to describe per-channel ram migration status.

Here we declare RAM_CHANNEL_MAX instances, preparing for postcopy to use
them, even though this patch has not yet started using the 2nd instance.

This should not have any functional change per se, but it already starts to
export PSS information via the RAMState, so that e.g. one PSS channel can
start to reference the other PSS channel.

Always protect PSS access using the same RAMState.bitmap_mutex.  We already
do so, so no code change is needed, just some comment updates.  Maybe we
should consider renaming bitmap_mutex some day, as it's going to become a
more commonly used and bigger mutex for ram states, but leave that for later.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 112 ++--
 1 file changed, 61 insertions(+), 51 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index a2e86623d3..bdb29ac4d9 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -113,6 +113,46 @@ static void __attribute__((constructor)) init_cpu_flag(void)
 
 XBZRLECacheStats xbzrle_counters;
 
+/* used by the search for pages to send */
+struct PageSearchStatus {
+/* The migration channel used for a specific host page */
+QEMUFile*pss_channel;
+/* Current block being searched */
+RAMBlock*block;
+/* Current page to search from */
+unsigned long page;
+/* Set once we wrap around */
+bool complete_round;
+/*
+ * [POSTCOPY-ONLY] Whether current page is explicitly requested by
+ * postcopy.  When set, the request is "urgent" because the dest QEMU
+ * threads are waiting for us.
+ */
+bool postcopy_requested;
+/*
+ * [POSTCOPY-ONLY] The target channel to use to send current page.
+ *
+ * Note: This may _not_ match with the value in postcopy_requested
+ * above. Let's imagine the case where the postcopy request is exactly
+ * the page that we're sending in progress during precopy. In this case
+ * we'll have postcopy_requested set to true but the target channel
+ * will be the precopy channel (so that we don't split brain on that
+ * specific page since the precopy channel already contains partial of
+ * that page data).
+ *
+ * Besides that specific use case, postcopy_target_channel should
+ * always be equal to postcopy_requested, because by default we send
+ * postcopy pages via postcopy preempt channel.
+ */
+bool postcopy_target_channel;
+/* Whether we're sending a host page */
+bool  host_page_sending;
+/* The start/end of current host page.  Invalid if 
host_page_sending==false */
+unsigned long host_page_start;
+unsigned long host_page_end;
+};
+typedef struct PageSearchStatus PageSearchStatus;
+
 /* struct contains XBZRLE cache and a static page
used by the compression */
 static struct {
@@ -347,6 +387,11 @@ typedef struct {
 struct RAMState {
 /* QEMUFile used for this migration */
 QEMUFile *f;
+/*
+ * PageSearchStatus structures for the channels when send pages.
+ * Protected by the bitmap_mutex.
+ */
+PageSearchStatus pss[RAM_CHANNEL_MAX];
 /* UFFD file descriptor, used in 'write-tracking' migration */
 int uffdio_fd;
 /* Last block that we have visited searching for dirty pages */
@@ -390,7 +435,12 @@ struct RAMState {
 uint64_t target_page_count;
 /* number of dirty bits in the bitmap */
 uint64_t migration_dirty_pages;
-/* Protects modification of the bitmap and migration dirty pages */
+/*
+ * Protects:
+ * - dirty/clear bitmap
+ * - migration_dirty_pages
+ * - pss structures
+ */
 QemuMutex bitmap_mutex;
 /* The RAMBlock used in the last src_page_requests */
 RAMBlock *last_req_rb;
@@ -479,46 +529,6 @@ void dirty_sync_missed_zero_copy(void)
 ram_counters.dirty_sync_missed_zero_copy++;
 }
 
-/* used by the search for pages to send */
-struct PageSearchStatus {
-/* The migration channel used for a specific host page */
-QEMUFile*pss_channel;
-/* Current block being searched */
-RAMBlock*block;
-/* Current page to search from */
-unsigned long page;
-/* Set once we wrap around */
-bool complete_round;
-/*
- * [POSTCOPY-ONLY] Whether current page is explicitly requested by
- * postcopy.  When set, the request is "urgent" because the dest QEMU
- * threads are waiting for us.
- */
-bool postcopy_requested;
-/*
- * [POSTCOPY-ONLY] The target channel to use to send current page.
- *
- * Note: This may _not_ match with the value in postcopy_requested
- * above. Let's imagine the case where the postcopy request is exactly
- * the page that 

[PULL 28/30] migration: Remove old preempt code around state maintenance

2022-11-15 Thread Juan Quintela
From: Peter Xu 

With the new code that sends pages in the rp-return thread, there is
little point in keeping the old code that maintained the preempt state in
the migration thread, because the new way should always be faster.

If we will always send pages in the rp-return thread anyway, we no longer
need that logic to maintain the preempt state, because we now serialize
things directly with the mutex instead of with those fields.

It's unfortunate to have carried that code for only a short period, but it
was the intermediate step at which we noticed the next bottleneck in the
migration thread.  The best we can do now is to drop the unnecessary code,
since the new code is stable, to reduce the maintenance burden.  It's
actually a good thing, because the new "send page in rp-return thread"
model is (IMHO) both cleaner and faster.

Remove the old code that was responsible for maintaining the preempt
state; in the meantime, also remove the x-postcopy-preempt-break-huge
parameter, because with concurrent sender threads we no longer need to
break huge pages.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.h |   7 -
 migration/migration.c |   2 -
 migration/ram.c   | 291 +-
 3 files changed, 3 insertions(+), 297 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index cdad8aceaa..ae4ffd3454 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -340,13 +340,6 @@ struct MigrationState {
 bool send_configuration;
 /* Whether we send section footer during migration */
 bool send_section_footer;
-/*
- * Whether we allow break sending huge pages when postcopy preempt is
- * enabled.  When disabled, we won't interrupt precopy within sending a
- * host huge page, which is the old behavior of vanilla postcopy.
- * NOTE: this parameter is ignored if postcopy preempt is not enabled.
- */
-bool postcopy_preempt_break_huge;
 
 /* Needed by postcopy-pause state */
 QemuSemaphore postcopy_pause_sem;
diff --git a/migration/migration.c b/migration/migration.c
index 42f36c1e2c..22fc863c67 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -4422,8 +4422,6 @@ static Property migration_properties[] = {
 DEFINE_PROP_SIZE("announce-step", MigrationState,
   parameters.announce_step,
   DEFAULT_MIGRATE_ANNOUNCE_STEP),
-DEFINE_PROP_BOOL("x-postcopy-preempt-break-huge", MigrationState,
-  postcopy_preempt_break_huge, true),
 DEFINE_PROP_STRING("tls-creds", MigrationState, parameters.tls_creds),
 DEFINE_PROP_STRING("tls-hostname", MigrationState, 
parameters.tls_hostname),
 DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
diff --git a/migration/ram.c b/migration/ram.c
index 5dc221a2fc..88e61b0aeb 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -125,28 +125,6 @@ struct PageSearchStatus {
 unsigned long page;
 /* Set once we wrap around */
 bool complete_round;
-/*
- * [POSTCOPY-ONLY] Whether current page is explicitly requested by
- * postcopy.  When set, the request is "urgent" because the dest QEMU
- * threads are waiting for us.
- */
-bool postcopy_requested;
-/*
- * [POSTCOPY-ONLY] The target channel to use to send current page.
- *
- * Note: This may _not_ match with the value in postcopy_requested
- * above. Let's imagine the case where the postcopy request is exactly
- * the page that we're sending in progress during precopy. In this case
- * we'll have postcopy_requested set to true but the target channel
- * will be the precopy channel (so that we don't split brain on that
- * specific page since the precopy channel already contains partial of
- * that page data).
- *
- * Besides that specific use case, postcopy_target_channel should
- * always be equal to postcopy_requested, because by default we send
- * postcopy pages via postcopy preempt channel.
- */
-bool postcopy_target_channel;
 /* Whether we're sending a host page */
 bool  host_page_sending;
 /* The start/end of current host page.  Invalid if 
host_page_sending==false */
@@ -371,20 +349,6 @@ struct RAMSrcPageRequest {
 QSIMPLEQ_ENTRY(RAMSrcPageRequest) next_req;
 };
 
-typedef struct {
-/*
- * Cached ramblock/offset values if preempted.  They're only meaningful if
- * preempted==true below.
- */
-RAMBlock *ram_block;
-unsigned long ram_page;
-/*
- * Whether a postcopy preemption just happened.  Will be reset after
- * precopy recovered to background migration.
- */
-bool preempted;
-} PostcopyPreemptState;
-
 /* State of RAM for migration */
 struct RAMState {
 /* QEMUFile used for this migration */
@@ -447,14 +411,6 @@ struct RAMState {
 

[PULL 29/30] migration: Drop rs->f

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Now with rs->pss we can already cache channels in pss->pss_channel.  That
pss_channel contains more information than rs->f because it's per-channel.
So rs->f can be replaced by rs->pss[RAM_CHANNEL_PRECOPY].pss_channel,
while rs->f itself is a bit vague now.

Note that vanilla postcopy still sends pages via pss[RAM_CHANNEL_PRECOPY];
that's slightly confusing, but it reflects reality.

Then, after the replacement, we can safely drop rs->f.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 88e61b0aeb..29e413b97b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -351,8 +351,6 @@ struct RAMSrcPageRequest {
 
 /* State of RAM for migration */
 struct RAMState {
-/* QEMUFile used for this migration */
-QEMUFile *f;
 /*
  * PageSearchStatus structures for the channels when send pages.
  * Protected by the bitmap_mutex.
@@ -2560,8 +2558,6 @@ static int ram_find_and_save_block(RAMState *rs)
 }
 
 if (found) {
-/* Cache rs->f in pss_channel (TODO: remove rs->f) */
-pss->pss_channel = rs->f;
 pages = ram_save_host_page(rs, pss);
 }
 } while (!pages && again);
@@ -3117,7 +3113,7 @@ static void ram_state_resume_prepare(RAMState *rs, 
QEMUFile *out)
 ram_state_reset(rs);
 
 /* Update RAMState cache of output QEMUFile */
-rs->f = out;
+rs->pss[RAM_CHANNEL_PRECOPY].pss_channel = out;
 
 trace_ram_state_resume_prepare(pages);
 }
@@ -3208,7 +3204,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 return -1;
 }
 }
-(*rsp)->f = f;
+(*rsp)->pss[RAM_CHANNEL_PRECOPY].pss_channel = f;
 
 WITH_RCU_READ_LOCK_GUARD() {
 qemu_put_be64(f, ram_bytes_total_common(true) | 
RAM_SAVE_FLAG_MEM_SIZE);
@@ -3343,7 +3339,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 out:
 if (ret >= 0
 && migration_is_setup_or_active(migrate_get_current()->state)) {
-ret = multifd_send_sync_main(rs->f);
+ret = multifd_send_sync_main(rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
 if (ret < 0) {
 return ret;
 }
@@ -3413,7 +3409,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 return ret;
 }
 
-ret = multifd_send_sync_main(rs->f);
+ret = multifd_send_sync_main(rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
 if (ret < 0) {
 return ret;
 }
-- 
2.38.1




[PULL 00/30] Next patches

2022-11-15 Thread Juan Quintela
The following changes since commit 98f10f0e2613ba1ac2ad3f57a5174014f6dcb03d:

  Merge tag 'pull-target-arm-20221114' of 
https://git.linaro.org/people/pmaydell/qemu-arm into staging (2022-11-14 
13:31:17 -0500)

are available in the Git repository at:

  https://gitlab.com/juan.quintela/qemu.git tags/next-pull-request

for you to fetch changes up to d896a7a40db13fc2d05828c94ddda2747530089c:

  migration: Block migration comment or code is wrong (2022-11-15 10:31:06 
+0100)


Migration PULL request (take 2)

Hi

This time properly signed.

[take 1]
It includes:
- Leonardo fix for zero_copy flush
- Fiona fix for return value of readv/writev
- Peter Xu cleanups
- Peter Xu preempt patches
- Patches ready from zero page (me)
- AVX2 support (ling)
- fix for slow networking and reordering of first packets (manish)

Please, apply.



Fiona Ebner (1):
  migration/channel-block: fix return value for
qio_channel_block_{readv,writev}

Juan Quintela (5):
  multifd: Create page_size fields into both MultiFD{Recv,Send}Params
  multifd: Create page_count fields into both MultiFD{Recv,Send}Params
  migration: Export ram_transferred_ram()
  migration: Export ram_release_page()
  migration: Block migration comment or code is wrong

Leonardo Bras (1):
  migration/multifd/zero-copy: Create helper function for flushing

Peter Xu (20):
  migration: Fix possible infinite loop of ram save process
  migration: Fix race on qemu_file_shutdown()
  migration: Disallow postcopy preempt to be used with compress
  migration: Use non-atomic ops for clear log bitmap
  migration: Disable multifd explicitly with compression
  migration: Take bitmap mutex when completing ram migration
  migration: Add postcopy_preempt_active()
  migration: Cleanup xbzrle zero page cache update logic
  migration: Trivial cleanup save_page_header() on same block check
  migration: Remove RAMState.f references in compression code
  migration: Yield bitmap_mutex properly when sending/sleeping
  migration: Use atomic ops properly for page accountings
  migration: Teach PSS about host page
  migration: Introduce pss_channel
  migration: Add pss_init()
  migration: Make PageSearchStatus part of RAMState
  migration: Move last_sent_block into PageSearchStatus
  migration: Send requested page directly in rp-return thread
  migration: Remove old preempt code around state maintenance
  migration: Drop rs->f

ling xu (2):
  Update AVX512 support for xbzrle_encode_buffer
  Unit test code and benchmark code

manish.mishra (1):
  migration: check magic value for deciding the mapping of channels

 meson.build   |  16 +
 include/exec/ram_addr.h   |  11 +-
 include/exec/ramblock.h   |   3 +
 include/io/channel.h  |  25 ++
 include/qemu/bitmap.h |   1 +
 migration/migration.h |   7 -
 migration/multifd.h   |  10 +-
 migration/postcopy-ram.h  |   2 +-
 migration/ram.h   |  23 +
 migration/xbzrle.h|   4 +
 io/channel-socket.c   |  27 ++
 io/channel.c  |  39 ++
 migration/block.c |   4 +-
 migration/channel-block.c |   6 +-
 migration/migration.c | 109 +++--
 migration/multifd-zlib.c  |  14 +-
 migration/multifd-zstd.c  |  12 +-
 migration/multifd.c   |  69 +--
 migration/postcopy-ram.c  |   5 +-
 migration/qemu-file.c |  27 +-
 migration/ram.c   | 794 +-
 migration/xbzrle.c| 124 ++
 tests/bench/xbzrle-bench.c| 465 
 tests/unit/test-xbzrle.c  |  39 +-
 util/bitmap.c |  45 ++
 meson_options.txt |   2 +
 scripts/meson-buildoptions.sh |  14 +-
 tests/bench/meson.build   |   4 +
 28 files changed, 1379 insertions(+), 522 deletions(-)
 create mode 100644 tests/bench/xbzrle-bench.c

-- 
2.38.1




[PULL 01/30] migration/channel-block: fix return value for qio_channel_block_{readv, writev}

2022-11-15 Thread Juan Quintela
From: Fiona Ebner 

in the error case. The documentation in include/io/channel.h states
that -1 or QIO_CHANNEL_ERR_BLOCK should be returned upon error. Simply
passing along the return value from the bdrv functions has the
potential to confuse the call sites. Non-blocking mode is not
currently implemented, so -1 it is.
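
For illustration, a minimal caller-side sketch of the documented contract
(hand-written; only the -1 / QIO_CHANNEL_ERR_BLOCK convention comes from
include/io/channel.h):

    Error *local_err = NULL;
    ssize_t ret = qio_channel_readv(ioc, iov, niov, &local_err);
    if (ret == QIO_CHANNEL_ERR_BLOCK) {
        /* would-block: retry later (non-blocking channels only) */
    } else if (ret < 0) {
        /* ret is -1: report local_err.  A raw -errno (e.g. -EIO) leaked
         * from the bdrv layer would land here too, but would break call
         * sites that compare against -1 or ERR_BLOCK explicitly. */
        error_report_err(local_err);
    }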

Signed-off-by: Fiona Ebner 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/channel-block.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/migration/channel-block.c b/migration/channel-block.c
index c55c8c93ce..f4ab53acdb 100644
--- a/migration/channel-block.c
+++ b/migration/channel-block.c
@@ -62,7 +62,8 @@ qio_channel_block_readv(QIOChannel *ioc,
 qemu_iovec_init_external(&qiov, (struct iovec *)iov, niov);
 ret = bdrv_readv_vmstate(bioc->bs, &qiov, bioc->offset);
 if (ret < 0) {
-return ret;
+error_setg_errno(errp, -ret, "bdrv_readv_vmstate failed");
+return -1;
 }
 
 bioc->offset += qiov.size;
@@ -86,7 +87,8 @@ qio_channel_block_writev(QIOChannel *ioc,
 qemu_iovec_init_external(&qiov, (struct iovec *)iov, niov);
 ret = bdrv_writev_vmstate(bioc->bs, &qiov, bioc->offset);
 if (ret < 0) {
-return ret;
+error_setg_errno(errp, -ret, "bdrv_writev_vmstate failed");
+return -1;
 }
 
 bioc->offset += qiov.size;
-- 
2.38.1




[PULL 15/30] migration: Take bitmap mutex when completing ram migration

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Any call to ram_find_and_save_block() needs to take the bitmap mutex.  We
used to not take it for most of ram_save_complete() because we thought we
were the only ones left using the bitmap, but that's no longer true after
the preempt full patchset was applied, since the return path can take it
too.

Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 2fcce796d0..96fa521813 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3434,6 +3434,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 /* try transferring iterative blocks of memory */
 
 /* flush all remaining blocks regardless of rate limiting */
+qemu_mutex_lock(&rs->bitmap_mutex);
 while (true) {
 int pages;
 
@@ -3447,6 +3448,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 break;
 }
 }
+qemu_mutex_unlock(&rs->bitmap_mutex);
 
 flush_compressed_data(rs);
 ram_control_after_iterate(f, RAM_CONTROL_FINISH);
-- 
2.38.1




[PULL 12/30] migration: Disallow postcopy preempt to be used with compress

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Preempt mode requires the capability to assign a channel to each page,
while the compression logic currently assigns pages to different compress
threads / local channels, so the two are potentially incompatible.
Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 406a9e2f72..0bc3fce4b7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1357,6 +1357,17 @@ static bool migrate_caps_check(bool *cap_list,
 error_setg(errp, "Postcopy preempt requires postcopy-ram");
 return false;
 }
+
+/*
+ * Preempt mode requires urgent pages to be sent in separate
+ * channel, OTOH compression logic will disorder all pages into
+ * different compression channels, which is not compatible with the
+ * preempt assumptions on channel assignments.
+ */
+if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+error_setg(errp, "Postcopy preempt not compatible with compress");
+return false;
+}
 }
 
 return true;
-- 
2.38.1




[PULL 11/30] migration: Fix race on qemu_file_shutdown()

2022-11-15 Thread Juan Quintela
From: Peter Xu 

In qemu_file_shutdown(), there's a possible race with the current order of
operations.  There are two major things to do:

  (1) Do real shutdown() (e.g. shutdown() syscall on socket)
  (2) Update qemufile's last_error

We must do (2) before (1) otherwise there can be a race condition like:

  page receiver                     other thread
  -------------                     ------------
  qemu_get_buffer()
                                    do shutdown()
    returns 0 (buffer all zero)
    (meanwhile we didn't check this retcode)
  try to detect IO error
    last_error==NULL, IO okay
  install ALL-ZERO page
                                    set last_error
  --> guest crash!

To fix this, we could also check the retval of qemu_get_buffer(), but not
all APIs can be properly checked and ultimately we still need to go back
to qemu_file_get_error().  E.g. qemu_get_byte() doesn't return an error.

Maybe some day a rework of the qemufile API is really needed, but for now
keep using qemu_file_get_error() and fix it by not allowing that race
condition to happen.  Here shutdown() is indeed special because the
last_error is emulated.  For real -EIO errors it'll always be set when
e.g. a sendmsg() error triggers, so we won't miss those; only shutdown()
is a bit tricky here.
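
For illustration, a minimal sketch of the reader-side pattern this
ordering protects (hand-written, not from the patch):

    uint8_t buf[256];
    /* May return 0 with an all-zero buffer right after shutdown(). */
    qemu_get_buffer(f, buf, sizeof(buf));
    if (qemu_file_get_error(f)) {   /* now guaranteed to see the -EIO */
        return -EIO;                /* ...instead of consuming buf    */
    }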

Cc: Daniel P. Berrange 
Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/qemu-file.c | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 4f400c2e52..2d5f74ffc2 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -79,6 +79,30 @@ int qemu_file_shutdown(QEMUFile *f)
 int ret = 0;
 
 f->shutdown = true;
+
+/*
+ * We must set qemufile error before the real shutdown(), otherwise
+ * there can be a race window where we thought IO all went though
+ * (because last_error==NULL) but actually IO has already stopped.
+ *
+ * If without correct ordering, the race can happen like this:
+ *
+ *  page receiver                     other thread
+ *  -------------                     ------------
+ *  qemu_get_buffer()
+ *                                    do shutdown()
+ *    returns 0 (buffer all zero)
+ *    (we didn't check this retcode)
+ *  try to detect IO error
+ *    last_error==NULL, IO okay
+ *  install ALL-ZERO page
+ *                                    set last_error
+ *  --> guest crash!
+ */
+if (!f->last_error) {
+qemu_file_set_error(f, -EIO);
+}
+
 if (!qio_channel_has_feature(f->ioc,
  QIO_CHANNEL_FEATURE_SHUTDOWN)) {
 return -ENOSYS;
@@ -88,9 +112,6 @@ int qemu_file_shutdown(QEMUFile *f)
 ret = -EIO;
 }
 
-if (!f->last_error) {
-qemu_file_set_error(f, -EIO);
-}
 return ret;
 }
 
-- 
2.38.1




[PULL 03/30] migration: check magic value for deciding the mapping of channels

2022-11-15 Thread Juan Quintela
From: "manish.mishra" 

The current logic assumes that channel connections on the destination side
are always established in the same order as on the source, and that the
first one will always be the main channel, followed by the multifd or
post-copy preemption channel. This may not always be true: even if a
channel has a connection established on the source side, it can be in the
pending state on the destination side while a newer connection is
established first, basically causing an out-of-order mapping of channels
on the destination side.
Currently, all channels except post-copy preempt send a magic number; this
patch uses that magic number to decide the type of channel. This logic is
applicable only to precopy (multifd) live migration because, as mentioned,
the post-copy preempt channel does not send any magic number. Also, tls
live migration already does a tls handshake before creating other
channels, so this issue is not possible with tls; hence this logic is
avoided for tls live migrations. This patch uses MSG_PEEK to check the
magic number of channels, so that the current data/control stream
management remains unaffected.

v2: TLS does not support MSG_PEEK, so V1 was broken for tls live
  migrations. For tls live migration, a tls handshake is done while
  initializing the main channel, before other channels can be created,
  so this issue is not possible for tls live migrations. V2 adds a
  check to skip the magic-number check for tls live migration and
  falls back to the older method of deciding the mapping of channels
  on the destination side.
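
For illustration, a hand-written sketch of the MSG_PEEK idea using plain
POSIX recv() (MULTIFD_MAGIC is the existing constant from QEMU's multifd
code; the fd and the routing comment are hypothetical):

    /* Peek at the first 4 bytes without consuming them, so the real
     * reader still sees the complete stream afterwards. */
    uint32_t magic;
    ssize_t ret = recv(fd, &magic, sizeof(magic), MSG_PEEK);
    if (ret == sizeof(magic) && be32_to_cpu(magic) == MULTIFD_MAGIC) {
        /* route this connection to a multifd channel */
    }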

Suggested-by: Daniel P. Berrangé 
Signed-off-by: manish.mishra 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 include/io/channel.h | 25 +++
 migration/multifd.h  |  2 +-
 migration/postcopy-ram.h |  2 +-
 io/channel-socket.c  | 27 
 io/channel.c | 39 +++
 migration/migration.c| 44 +---
 migration/multifd.c  | 12 ---
 migration/postcopy-ram.c |  5 +
 8 files changed, 130 insertions(+), 26 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index c680ee7480..74177aeeea 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -115,6 +115,10 @@ struct QIOChannelClass {
 int **fds,
 size_t *nfds,
 Error **errp);
+ssize_t (*io_read_peek)(QIOChannel *ioc,
+void *buf,
+size_t nbytes,
+Error **errp);
 int (*io_close)(QIOChannel *ioc,
 Error **errp);
 GSource * (*io_create_watch)(QIOChannel *ioc,
@@ -475,6 +479,27 @@ int qio_channel_write_all(QIOChannel *ioc,
   size_t buflen,
   Error **errp);
 
+/**
+ * qio_channel_read_peek_all:
+ * @ioc: the channel object
+ * @buf: the memory region to read in data
+ * @nbytes: the number of bytes to read
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Peek @nbytes of data from the channel into memory
+ * region @buf, without consuming it.
+ *
+ * The function blocks until the read size is equal
+ * to the requested size.
+ *
+ * Returns: 1 if all bytes were read, 0 if end-of-file
+ *  occurs without data, or -1 on error
+ */
+int qio_channel_read_peek_all(QIOChannel *ioc,
+  void* buf,
+  size_t nbytes,
+  Error **errp);
+
 /**
  * qio_channel_set_blocking:
  * @ioc: the channel object
diff --git a/migration/multifd.h b/migration/multifd.h
index 519f498643..913e4ba274 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -18,7 +18,7 @@ void multifd_save_cleanup(void);
 int multifd_load_setup(Error **errp);
 int multifd_load_cleanup(Error **errp);
 bool multifd_recv_all_channels_created(void);
-bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
+void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
 int multifd_send_sync_main(QEMUFile *f);
 int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 6147bf7d1d..25881c4127 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -190,7 +190,7 @@ enum PostcopyChannels {
 RAM_CHANNEL_MAX,
 };
 
-bool postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
+void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
 int postcopy_preempt_setup(MigrationState *s, Error **errp);
 int postcopy_preempt_wait_channel(MigrationState *s);
 
diff --git a/io/channel-socket.c b/io/channel-socket.c
index b76dca9cc1..b99f5dfda6 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -705,6 +705,32 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 }
 #endif /* WIN32 */
 
+static ssize_t 

[PULL 17/30] migration: Cleanup xbzrle zero page cache update logic

2022-11-15 Thread Juan Quintela
From: Peter Xu 

The major change is to replace "!save_page_use_compression()" with
"xbzrle_enabled" to make it clearer.

Reasoning:

(1) When compression is enabled, "!save_page_use_compression()" is exactly
the same as checking "xbzrle_enabled".

(2) When compression is disabled, "!save_page_use_compression()" always
returns true.  We used to try calling the xbzrle code, but after this
change we won't, and we shouldn't need to.

While at it, drop the xbzrle_enabled check in xbzrle_cache_zero_page()
because with this change it's not needed anymore.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 52c851eb56..9ded381e0a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -769,10 +769,6 @@ void mig_throttle_counter_reset(void)
  */
 static void xbzrle_cache_zero_page(RAMState *rs, ram_addr_t current_addr)
 {
-if (!rs->xbzrle_enabled) {
-return;
-}
-
 /* We don't care if this fails to allocate a new cache page
  * as long as it updated an old one */
 cache_insert(XBZRLE.cache, current_addr, XBZRLE.zero_target_page,
@@ -2329,7 +2325,7 @@ static int ram_save_target_page(RAMState *rs, 
PageSearchStatus *pss)
 /* Must let xbzrle know, otherwise a previous (now 0'd) cached
  * page would be stale
  */
-if (!save_page_use_compression(rs)) {
+if (rs->xbzrle_enabled) {
 XBZRLE_cache_lock();
 xbzrle_cache_zero_page(rs, block->offset + offset);
 XBZRLE_cache_unlock();
-- 
2.38.1




[PULL 02/30] migration/multifd/zero-copy: Create helper function for flushing

2022-11-15 Thread Juan Quintela
From: Leonardo Bras 

Move flushing code from multifd_send_sync_main() to a new helper, and call
it in multifd_send_sync_main().

Signed-off-by: Leonardo Bras 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/multifd.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 586ddc9d65..509bbbe3bf 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -566,6 +566,23 @@ void multifd_save_cleanup(void)
 multifd_send_state = NULL;
 }
 
+static int multifd_zero_copy_flush(QIOChannel *c)
+{
+int ret;
+Error *err = NULL;
+
+ret = qio_channel_flush(c, &err);
+if (ret < 0) {
+error_report_err(err);
+return -1;
+}
+if (ret == 1) {
+dirty_sync_missed_zero_copy();
+}
+
+return ret;
+}
+
 int multifd_send_sync_main(QEMUFile *f)
 {
 int i;
@@ -616,17 +633,8 @@ int multifd_send_sync_main(QEMUFile *f)
qemu_mutex_unlock(&p->mutex);
qemu_sem_post(&p->sem);
 
-if (flush_zero_copy && p->c) {
-int ret;
-Error *err = NULL;
-
-ret = qio_channel_flush(p->c, &err);
-if (ret < 0) {
-error_report_err(err);
-return -1;
-} else if (ret == 1) {
-dirty_sync_missed_zero_copy();
-}
+if (flush_zero_copy && p->c && (multifd_zero_copy_flush(p->c) < 0)) {
+return -1;
 }
 }
 for (i = 0; i < migrate_multifd_channels(); i++) {
-- 
2.38.1




Re: [PULL 00/11] Block layer patches

2022-11-15 Thread Kevin Wolf
Am 15.11.2022 um 11:21 hat Hanna Reitz geschrieben:
> On 15.11.22 11:14, Kevin Wolf wrote:
> > Am 15.11.2022 um 00:58 hat John Snow geschrieben:
> > > On Mon, Nov 14, 2022 at 5:56 AM Kevin Wolf  wrote:
> > > > Am 11.11.2022 um 20:20 hat Stefan Hajnoczi geschrieben:
> > > > > > Hanna Reitz (9):
> > > > > >block/mirror: Do not wait for active writes
> > > > > >block/mirror: Drop mirror_wait_for_any_operation()
> > > > > >block/mirror: Fix NULL s->job in active writes
> > > > > >iotests/151: Test that active mirror progresses
> > > > > >iotests/151: Test active requests on mirror start
> > > > > >block: Make bdrv_child_get_parent_aio_context I/O
> > > > > >block-backend: Update ctx immediately after root
> > > > > >block: Start/end drain on correct AioContext
> > > > > >tests/stream-under-throttle: New test
> > > > > Hi Hanna,
> > > > > This test is broken, probably due to the minimum Python version:
> > > > > https://gitlab.com/qemu-project/qemu/-/jobs/3311521303
> > > > This is exactly the problem I saw with running linters in a gating CI,
> > > > but not during 'make check'. And of course, we're hitting it during the
> > > > -rc phase now. :-(
> > > I mean. I'd love to have it run in make check too. The alternative was
> > > never seeing this *anywhere* ...
> > What is the problem with running it in 'make check'? The additional
> > dependencies? If so, can we run it automatically if the dependencies
> > happen to be fulfilled and just skip it otherwise?
> > 
> > If I have to run 'make -C python check-pipenv' manually, I can guarantee
> > you that I'll forget it more often than I'll run it.
> > 
> > > ...but I'm sorry it's taken me so long to figure out how to get this
> > > stuff to work in "make check" and also from manual iotests runs
> > > without adding any kind of setup that you have to manage. It's just
> > > fiddly, sorry :(
> > > 
> > > > But yes, it seems that asyncio.TimeoutError should be used instead of
> > > > asyncio.exceptions.TimeoutError, and Python 3.6 has only the former.
> > > > I'll fix this up and send a v2 if it fixes check-python-pipenv.
> > > Hopefully this goes away when we drop 3.6. I want to, but I recall
> > > there was some question about some platforms that don't support 3.7+
> > > "by default" and how annoying that was or wasn't. We're almost a year
> > > out from 3.6 being EOL, so maybe after this release it's worth a crack
> > > to see how painful it is to move on.
> > If I understand the documentation right, asyncio.TimeoutError is what
> > you should be using either way. That it happens to be a re-export from
> > the internal module asyncio.exceptions seems to be more of an
> > implementation detail, not the official interface.
> 
> Oh, so I understood
> https://docs.python.org/3/library/asyncio-exceptions.html wrong.  I took
> that to mean that as of 3.11, `asyncio.TimeoutError` is a deprecated alias
> for `asyncio.exceptions.TimeoutError`, but it’s actually become an alias for
> the now-built-in `TimeoutError` exception.  I think.

Not just "now-built-in", it has been built in before (starting from
3.3). But AIUI, asyncio used to use its own separate exception class
(asyncio.TimeoutError, in some versions re-exported from the exceptions
submodule) instead of the built-in one, and in 3.11 it now reuses the
built-in one instead of defining a separate custom one.

Kevin




Re: [PATCH v2] m25p80: Improve error when the backend file size does not match the device

2022-11-15 Thread Peter Maydell
On Tue, 15 Nov 2022 at 15:10, Cédric Le Goater  wrote:
>
> Currently, when a block backend is attached to a m25p80 device and the
> associated file size does not match the flash model, QEMU complains
> with the error message "failed to read the initial flash content".
> This is confusing for the user.
>
> Use blk_check_size_and_read_all() instead of blk_pread() to improve
> the reported error.
>
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/block/m25p80.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
> index 02adc87527..68a757abf3 100644
> --- a/hw/block/m25p80.c
> +++ b/hw/block/m25p80.c
> @@ -24,6 +24,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "sysemu/block-backend.h"
> +#include "hw/block/block.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/qdev-properties-system.h"
>  #include "hw/ssi/ssi.h"
> @@ -1614,8 +1615,7 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
> **errp)
>  trace_m25p80_binding(s);
>  s->storage = blk_blockalign(s->blk, s->size);
>
> -if (blk_pread(s->blk, 0, s->size, s->storage, 0) < 0) {
> -error_setg(errp, "failed to read the initial flash content");
> +if (!blk_check_size_and_read_all(s->blk, s->storage, s->size, errp)) 
> {
>  return;
>  }
>  } else {
> --
> 2.38.1

Reviewed-by: Peter Maydell 

thanks
-- PMM



[PATCH v2] m25p80: Improve error when the backend file size does not match the device

2022-11-15 Thread Cédric Le Goater
Currently, when a block backend is attached to a m25p80 device and the
associated file size does not match the flash model, QEMU complains
with the error message "failed to read the initial flash content".
This is confusing for the user.

Use blk_check_size_and_read_all() instead of blk_pread() to improve
the reported error.

Signed-off-by: Cédric Le Goater 
---
 hw/block/m25p80.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 02adc87527..68a757abf3 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -24,6 +24,7 @@
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "sysemu/block-backend.h"
+#include "hw/block/block.h"
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
 #include "hw/ssi/ssi.h"
@@ -1614,8 +1615,7 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
**errp)
 trace_m25p80_binding(s);
 s->storage = blk_blockalign(s->blk, s->size);
 
-if (blk_pread(s->blk, 0, s->size, s->storage, 0) < 0) {
-error_setg(errp, "failed to read the initial flash content");
+if (!blk_check_size_and_read_all(s->blk, s->storage, s->size, errp)) {
 return;
 }
 } else {
-- 
2.38.1




Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-15 Thread Christian Borntraeger




Am 15.11.22 um 15:31 schrieb Alex Bennée:


"Michael S. Tsirkin"  writes:


On Mon, Nov 14, 2022 at 06:15:30PM +0100, Christian Borntraeger wrote:



Am 14.11.22 um 18:10 schrieb Michael S. Tsirkin:

On Mon, Nov 14, 2022 at 05:55:09PM +0100, Christian Borntraeger wrote:



Am 14.11.22 um 17:37 schrieb Michael S. Tsirkin:

On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger wrote:

Am 08.11.22 um 10:23 schrieb Alex Bennée:

The previous fix to virtio_device_started revealed a problem in its
use by both the core and the device code. The core code should be able
to handle the device "starting" while the VM isn't running to handle
the restoration of migration state. To solve this dual use introduce a
new helper for use by the vhost-user backends who all use it to feed a
should_start variable.

We can also pick up a change vhost_user_blk_set_status while we are at
it which follows the same pattern.

Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
Signed-off-by: Alex Bennée 
Cc: "Michael S. Tsirkin" 


Hmmm, is this
commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
Author: Alex Bennée 
AuthorDate: Mon Nov 7 12:14:07 2022 +
Commit: Michael S. Tsirkin 
CommitDate: Mon Nov 7 14:08:18 2022 -0500

   hw/virtio: introduce virtio_device_should_start

an older version?


This is what got merged:
https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
This patch was sent after I merged the RFC.
I think the only difference is the commit log but I might be missing
something.


This does not seem to fix the regression that I have reported.


This was applied on top of 9f6bcfd99f which IIUC does, right?




QEMU master still fails for me for suspend/resume to disk:

#0  0x03ff8e3980a6 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x03ff8e348580 in raise () at /lib64/libc.so.6
#2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
#3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
#4  0x03ff8e340a4e in  () at /lib64/libc.so.6
#5 0x02aa1ffa8966 in vhost_vsock_common_pre_save
(opaque=<optimized out>) at
../hw/virtio/vhost-vsock-common.c:203
#6  0x02aa1fe5e0ee in vmstate_save_state_v
  (f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0
, opaque=0x2aa21bac9f8,
vmdesc=vmdesc@entry=0x3fddc08eb30,
version_id=version_id@entry=0) at ../migration/vmstate.c:329
#7 0x02aa1fe5ebf8 in vmstate_save_state
(f=f@entry=0x2aa21bdc170, vmsd=<optimized out>,
opaque=<optimized out>, vmdesc_id=vmdesc_id@entry=0x3fddc08eb30)
at ../migration/vmstate.c:317
#8 0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170,
se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at
../migration/savevm.c:908
#9 0x02aa1fe79584 in
qemu_savevm_state_complete_precopy_non_iterable
(f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false,
inactivate_disks=inactivate_disks@entry=true)
  at ../migration/savevm.c:1393
#10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy
(f=0x2aa21bdc170, iterable_only=iterable_only@entry=false,
inactivate_disks=inactivate_disks@entry=true) at
../migration/savevm.c:1459
#11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
../migration/migration.c:3314
#12 migration_iteration_run (s=0x2aa218ef600) at ../migration/migration.c:3761
#13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
../migration/migration.c:3989
#14 0x02aa201f0b8c in qemu_thread_start (args=<optimized out>) at 
../util/qemu-thread-posix.c:505
#15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
#16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6

Michael, your previous branch did work if I recall correctly.


That one was failing under github CI though (for reasons we didn't
really address, such as disconnect during stop causing a recursive
call to stop, but there you are).

Even the double revert of everything?


I don't remember at this point.


So how do we proceed now?


I'm hopeful Alex will come up with a fix.


I need to replicate the failing test for that. Which test is failing?



Pretty much the same as before: guest with vsock, managedsave and restore.



Re: [PATCH] m25p80: Warn the user when the backend file is too small for the device

2022-11-15 Thread Cédric Le Goater

On 11/15/22 15:55, Peter Maydell wrote:

On Tue, 15 Nov 2022 at 14:51, Cédric Le Goater  wrote:


On 11/15/22 15:34, Peter Maydell wrote:

On Tue, 15 Nov 2022 at 14:22, Cédric Le Goater  wrote:


Currently, when a block backend is attached to a m25p80 device and the
associated file size does not match the flash model, QEMU complains
with the error message "failed to read the initial flash content".
This is confusing for the user.


The commit message says we get an unhelpful error if the
file size "does not match"...


Improve the reported error with a new message regarding the file size.

Signed-off-by: Cédric Le Goater 
---
   hw/block/m25p80.c | 8 
   1 file changed, 8 insertions(+)

diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 02adc87527..e0515e2a1e 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -1606,6 +1606,14 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
**errp)
   if (s->blk) {
   uint64_t perm = BLK_PERM_CONSISTENT_READ |
   (blk_supports_write_perm(s->blk) ? BLK_PERM_WRITE : 
0);
+
+if (blk_getlength(s->blk) < s->size) {


...but the code change is only checking for "too small".

What happens if the user provides a backing file that is too large ?


That's ok because the blk_pread() call following, which loads in RAM
the initial data, won't fail.

It might be better to enforce a strict check on the size to avoid
further confusion ? and change the error message to be clear.


Can we use blk_check_size_and_read_all() here rather than
a manual "check size, and then pread" ? That will take care
of the error message for you and make this device behave
the same way as other flash devices which use block backends.


ok. I wasn't aware of this routine. I will check.

Thanks,
C.



Re: [PATCH 00/30] Migration PULL request

2022-11-15 Thread Stefan Hajnoczi
Please resend as a GPG-signed pull request instead of as a patch series.

Thanks,
Stefan



Re: [PATCH] m25p80: Warn the user when the backend file is too small for the device

2022-11-15 Thread Peter Maydell
On Tue, 15 Nov 2022 at 14:51, Cédric Le Goater  wrote:
>
> On 11/15/22 15:34, Peter Maydell wrote:
> > On Tue, 15 Nov 2022 at 14:22, Cédric Le Goater  wrote:
> >>
> >> Currently, when a block backend is attached to a m25p80 device and the
> >> associated file size does not match the flash model, QEMU complains
> >> with the error message "failed to read the initial flash content".
> >> This is confusing for the user.
> >
> > The commit message says we get an unhelpful error if the
> > file size "does not match"...
> >
> >> Improve the reported error with a new message regarding the file size.
> >>
> >> Signed-off-by: Cédric Le Goater 
> >> ---
> >>   hw/block/m25p80.c | 8 
> >>   1 file changed, 8 insertions(+)
> >>
> >> diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
> >> index 02adc87527..e0515e2a1e 100644
> >> --- a/hw/block/m25p80.c
> >> +++ b/hw/block/m25p80.c
> >> @@ -1606,6 +1606,14 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
> >> **errp)
> >>   if (s->blk) {
> >>   uint64_t perm = BLK_PERM_CONSISTENT_READ |
> >>   (blk_supports_write_perm(s->blk) ? 
> >> BLK_PERM_WRITE : 0);
> >> +
> >> +if (blk_getlength(s->blk) < s->size) {
> >
> > ...but the code change is only checking for "too small".
> >
> > What happens if the user provides a backing file that is too large ?
>
> That's ok because the blk_pread() call following, which loads in RAM
> the initial data, won't fail.
>
> It might be better to enforce a strict check on the size to avoid
> further confusion ? and change the error message to be clear.

Can we use blk_check_size_and_read_all() here rather than
a manual "check size, and then pread" ? That will take care
of the error message for you and make this device behave
the same way as other flash devices which use block backends.

thanks
-- PMM



Re: [PATCH] m25p80: Warn the user when the backend file is too small for the device

2022-11-15 Thread Cédric Le Goater

On 11/15/22 15:34, Peter Maydell wrote:

On Tue, 15 Nov 2022 at 14:22, Cédric Le Goater  wrote:


Currently, when a block backend is attached to a m25p80 device and the
associated file size does not match the flash model, QEMU complains
with the error message "failed to read the initial flash content".
This is confusing for the user.


The commit message says we get an unhelpful error if the
file size "does not match"...


Improve the reported error with a new message regarding the file size.

Signed-off-by: Cédric Le Goater 
---
  hw/block/m25p80.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 02adc87527..e0515e2a1e 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -1606,6 +1606,14 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
**errp)
  if (s->blk) {
  uint64_t perm = BLK_PERM_CONSISTENT_READ |
  (blk_supports_write_perm(s->blk) ? BLK_PERM_WRITE : 
0);
+
+if (blk_getlength(s->blk) < s->size) {


...but the code change is only checking for "too small".

What happens if the user provides a backing file that is too large ?


That's ok because the blk_pread() call following, which loads in RAM
the initial data, won't fail.

It might be better to enforce a strict check on the size to avoid
further confusion ? and change the error message to be clear.
 



+error_setg(errp,
+   "backend file is too small for flash device %s (%d MB)",
+   object_class_get_name(OBJECT_CLASS(mc)), s->size >> 20);


This potentially reports to the user a size which isn't the
right one for them to use to set the size of the backing file,
if that required size isn't an exact number of MB.


True. We have a few devices whose size is below 1 MB. Using a KB unit
should be fine.

Thanks,

C.




+return;
+}
+
  ret = blk_set_perm(s->blk, perm, BLK_PERM_ALL, errp);
  if (ret < 0) {
  return;
--
2.38.1


thanks
-- PMM





Re: [PATCH] m25p80: Warn the user when the backend file is too small for the device

2022-11-15 Thread Peter Maydell
On Tue, 15 Nov 2022 at 14:22, Cédric Le Goater  wrote:
>
> Currently, when a block backend is attached to a m25p80 device and the
> associated file size does not match the flash model, QEMU complains
> with the error message "failed to read the initial flash content".
> This is confusing for the user.

The commit message says we get an unhelpful error if the
file size "does not match"...

> Improve the reported error with a new message regarding the file size.
>
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/block/m25p80.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
> index 02adc87527..e0515e2a1e 100644
> --- a/hw/block/m25p80.c
> +++ b/hw/block/m25p80.c
> @@ -1606,6 +1606,14 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
> **errp)
>  if (s->blk) {
>  uint64_t perm = BLK_PERM_CONSISTENT_READ |
>  (blk_supports_write_perm(s->blk) ? BLK_PERM_WRITE : 
> 0);
> +
> +if (blk_getlength(s->blk) < s->size) {

...but the code change is only checking for "too small".

What happens if the user provides a backing file that is too large ?

> +error_setg(errp,
> +   "backend file is too small for flash device %s (%d 
> MB)",
> +   object_class_get_name(OBJECT_CLASS(mc)), s->size >> 
> 20);

This potentially reports to the user a size which isn't the
right one for them to use to set the size of the backing file,
if that required size isn't an exact number of MB.

> +return;
> +}
> +
>  ret = blk_set_perm(s->blk, perm, BLK_PERM_ALL, errp);
>  if (ret < 0) {
>  return;
> --
> 2.38.1

thanks
-- PMM



Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-15 Thread Alex Bennée


"Michael S. Tsirkin"  writes:

> On Mon, Nov 14, 2022 at 06:15:30PM +0100, Christian Borntraeger wrote:
>> 
>> 
>> Am 14.11.22 um 18:10 schrieb Michael S. Tsirkin:
>> > On Mon, Nov 14, 2022 at 05:55:09PM +0100, Christian Borntraeger wrote:
>> > > 
>> > > 
>> > > Am 14.11.22 um 17:37 schrieb Michael S. Tsirkin:
>> > > > On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger wrote:
>> > > > > Am 08.11.22 um 10:23 schrieb Alex Bennée:
>> > > > > > The previous fix to virtio_device_started revealed a problem in its
>> > > > > > use by both the core and the device code. The core code should be 
>> > > > > > able
>> > > > > > to handle the device "starting" while the VM isn't running to 
>> > > > > > handle
>> > > > > > the restoration of migration state. To solve this dual use 
>> > > > > > introduce a
>> > > > > > new helper for use by the vhost-user backends who all use it to 
>> > > > > > feed a
>> > > > > > should_start variable.
>> > > > > > 
>> > > > > > We can also pick up a change vhost_user_blk_set_status while we 
>> > > > > > are at
>> > > > > > it which follows the same pattern.
>> > > > > > 
>> > > > > > Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to 
>> > > > > > virtio_device_started)
>> > > > > > Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio 
>> > > > > > device)
>> > > > > > Signed-off-by: Alex Bennée 
>> > > > > > Cc: "Michael S. Tsirkin" 
>> > > > > 
>> > > > > Hmmm, is this
>> > > > > commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
>> > > > > Author: Alex Bennée 
>> > > > > AuthorDate: Mon Nov 7 12:14:07 2022 +
>> > > > > Commit: Michael S. Tsirkin 
>> > > > > CommitDate: Mon Nov 7 14:08:18 2022 -0500
>> > > > > 
>> > > > >   hw/virtio: introduce virtio_device_should_start
>> > > > > 
>> > > > > an older version?
>> > > > 
>> > > > This is what got merged:
>> > > > https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
>> > > > This patch was sent after I merged the RFC.
>> > > > I think the only difference is the commit log but I might be missing
>> > > > something.
>> > > > 
>> > > > > This does not seem to fix the regression that I have reported.
>> > > > 
>> > > > This was applied on top of 9f6bcfd99f which IIUC does, right?
>> > > > 
>> > > > 
>> > > 
>> > > QEMU master still fails for me for suspend/resume to disk:
>> > > 
>> > > #0  0x03ff8e3980a6 in __pthread_kill_implementation () at 
>> > > /lib64/libc.so.6
>> > > #1  0x03ff8e348580 in raise () at /lib64/libc.so.6
>> > > #2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
>> > > #3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
>> > > #4  0x03ff8e340a4e in  () at /lib64/libc.so.6
>> > > #5 0x02aa1ffa8966 in vhost_vsock_common_pre_save
>> > > (opaque=<optimized out>) at
>> > > ../hw/virtio/vhost-vsock-common.c:203
>> > > #6  0x02aa1fe5e0ee in vmstate_save_state_v
>> > >  (f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0
>> > > , opaque=0x2aa21bac9f8,
>> > > vmdesc=vmdesc@entry=0x3fddc08eb30,
>> > > version_id=version_id@entry=0) at ../migration/vmstate.c:329
>> > > #7 0x02aa1fe5ebf8 in vmstate_save_state
>> > > (f=f@entry=0x2aa21bdc170, vmsd=<optimized out>,
>> > > opaque=<optimized out>, vmdesc_id=vmdesc_id@entry=0x3fddc08eb30)
>> > > at ../migration/vmstate.c:317
>> > > #8 0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170,
>> > > se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at
>> > > ../migration/savevm.c:908
>> > > #9 0x02aa1fe79584 in
>> > > qemu_savevm_state_complete_precopy_non_iterable
>> > > (f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false,
>> > > inactivate_disks=inactivate_disks@entry=true)
>> > >  at ../migration/savevm.c:1393
>> > > #10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy
>> > > (f=0x2aa21bdc170, iterable_only=iterable_only@entry=false,
>> > > inactivate_disks=inactivate_disks@entry=true) at
>> > > ../migration/savevm.c:1459
>> > > #11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
>> > > ../migration/migration.c:3314
>> > > #12 migration_iteration_run (s=0x2aa218ef600) at 
>> > > ../migration/migration.c:3761
>> > > #13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
>> > > ../migration/migration.c:3989
>> > > #14 0x02aa201f0b8c in qemu_thread_start (args=<optimized out>) at 
>> > > ../util/qemu-thread-posix.c:505
>> > > #15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
>> > > #16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6
>> > > 
>> > > Michael, your previous branch did work if I recall correctly.
>> > 
>> > That one was failing under github CI though (for reasons we didn't
>> > really address, such as disconnect during stop causing a recursive
>> > call to stop, but there you are).
>> Even the double revert of everything?
>
> I don't remember at this point.
>
>> So how do we proceed now?
>
> I'm hopeful Alex will come up with a fix.

I need to replicate the failing test for that. Which test is failing?

-- 
Alex Bennée



[PATCH] m25p80: Warn the user when the backend file is too small for the device

2022-11-15 Thread Cédric Le Goater
Currently, when a block backend is attached to a m25p80 device and the
associated file size does not match the flash model, QEMU complains
with the error message "failed to read the initial flash content".
This is confusing for the user.

Improve the reported error with a new message regarding the file size.

Signed-off-by: Cédric Le Goater 
---
 hw/block/m25p80.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 02adc87527..e0515e2a1e 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -1606,6 +1606,14 @@ static void m25p80_realize(SSIPeripheral *ss, Error 
**errp)
 if (s->blk) {
 uint64_t perm = BLK_PERM_CONSISTENT_READ |
 (blk_supports_write_perm(s->blk) ? BLK_PERM_WRITE : 0);
+
+if (blk_getlength(s->blk) < s->size) {
+error_setg(errp,
+   "backend file is too small for flash device %s (%d MB)",
+   object_class_get_name(OBJECT_CLASS(mc)), s->size >> 20);
+return;
+}
+
 ret = blk_set_perm(s->blk, perm, BLK_PERM_ALL, errp);
 if (ret < 0) {
 return;
-- 
2.38.1




Re: [PULL v2 00/11] Block layer patches

2022-11-15 Thread Stefan Hajnoczi
Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/7.2 for any 
user-visible changes.




Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-15 Thread Christian Borntraeger




Am 15.11.22 um 12:25 schrieb Michael S. Tsirkin:

On Tue, Nov 15, 2022 at 09:18:27AM +0100, Christian Borntraeger wrote:


Am 14.11.22 um 18:20 schrieb Michael S. Tsirkin:

On Mon, Nov 14, 2022 at 06:15:30PM +0100, Christian Borntraeger wrote:



Am 14.11.22 um 18:10 schrieb Michael S. Tsirkin:

On Mon, Nov 14, 2022 at 05:55:09PM +0100, Christian Borntraeger wrote:



Am 14.11.22 um 17:37 schrieb Michael S. Tsirkin:

On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger wrote:

Am 08.11.22 um 10:23 schrieb Alex Bennée:

The previous fix to virtio_device_started revealed a problem in its
use by both the core and the device code. The core code should be able
to handle the device "starting" while the VM isn't running to handle
the restoration of migration state. To solve this dual use introduce a
new helper for use by the vhost-user backends who all use it to feed a
should_start variable.

We can also pick up a change vhost_user_blk_set_status while we are at
it which follows the same pattern.

Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
Signed-off-by: Alex Bennée 
Cc: "Michael S. Tsirkin" 


Hmmm, is this
commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
Author: Alex Bennée 
AuthorDate: Mon Nov 7 12:14:07 2022 +
Commit: Michael S. Tsirkin 
CommitDate: Mon Nov 7 14:08:18 2022 -0500

hw/virtio: introduce virtio_device_should_start

an older version?


This is what got merged:
https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
This patch was sent after I merged the RFC.
I think the only difference is the commit log but I might be missing
something.


This does not seem to fix the regression that I have reported.


This was applied on top of 9f6bcfd99f which IIUC does, right?




QEMU master still fails for me for suspend/resume to disk:

#0  0x03ff8e3980a6 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x03ff8e348580 in raise () at /lib64/libc.so.6
#2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
#3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
#4  0x03ff8e340a4e in  () at /lib64/libc.so.6
#5  0x02aa1ffa8966 in vhost_vsock_common_pre_save (opaque=) 
at ../hw/virtio/vhost-vsock-common.c:203
#6  0x02aa1fe5e0ee in vmstate_save_state_v
   (f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0 
, opaque=0x2aa21bac9f8, 
vmdesc=vmdesc@entry=0x3fddc08eb30, version_id=version_id@entry=0) at 
../migration/vmstate.c:329
#7  0x02aa1fe5ebf8 in vmstate_save_state (f=f@entry=0x2aa21bdc170, vmsd=<optimized out>, opaque=<optimized out>, vmdesc_id=vmdesc_id@entry=0x3fddc08eb30) at 
../migration/vmstate.c:317
#8  0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170, 
se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at 
../migration/savevm.c:908
#9  0x02aa1fe79584 in qemu_savevm_state_complete_precopy_non_iterable 
(f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false, 
inactivate_disks=inactivate_disks@entry=true)
   at ../migration/savevm.c:1393
#10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy (f=0x2aa21bdc170, 
iterable_only=iterable_only@entry=false, 
inactivate_disks=inactivate_disks@entry=true) at ../migration/savevm.c:1459
#11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
../migration/migration.c:3314
#12 migration_iteration_run (s=0x2aa218ef600) at ../migration/migration.c:3761
#13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
../migration/migration.c:3989
#14 0x02aa201f0b8c in qemu_thread_start (args=<optimized out>) at 
../util/qemu-thread-posix.c:505
#15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
#16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6

Michael, your previous branch did work if I recall correctly.


That one was failing under github CI though (for reasons we didn't
really address, such as disconnect during stop causing a recursive
call to stop, but there you are).

Even the double revert of everything?


I don't remember at this point.


So how do we proceed now?


I'm hopeful Alex will come up with a fix.



The initial fix, adapted to current qemu/master, does still work for me

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index a973811cbfc6..fb3072838119 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -411,14 +411,14 @@ static inline bool virtio_device_started(VirtIODevice 
*vdev, uint8_t status)
   */
  static inline bool virtio_device_should_start(VirtIODevice *vdev, uint8_t 
status)
  {
-if (vdev->use_started) {
-return vdev->started;
-}
-
  if (!vdev->vm_running) {
  return false;
  }
+if (vdev->use_started) {
+return vdev->started;
+}
+
  return status & VIRTIO_CONFIG_S_DRIVER_OK;
  }


Triggers failure on gitlab unfortunately:

https://gitlab.com/mstredhat/qemu/-/jobs/3323768122


So maybe we should go forward and revert the 

RE: [PATCH v1] block/rbd: Add support for layered encryption

2022-11-15 Thread Or Ozeri
I tried casting to non-const and it seems to work. Changed in v3 now.
I did not know that a const modifier could simply be cast out :)
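
For reference, the resulting v3 call looks roughly like this (a sketch; field and variable names as in the patch later in this digest):

    /* passphrase is declared 'const char *' in the librbd options struct,
     * while the helper wants a 'char **' out-parameter, so the const is
     * cast away at the call site: */
    r = qemu_rbd_convert_luks_options(
            qapi_RbdEncryptionOptionsLUKSAny_base(&curr_encrypt->u.luks_any),
            (char **)&luks_any_opts->passphrase,
            &luks_any_opts->passphrase_size, errp);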

> -Original Message-
> From: Ilya Dryomov 
> Sent: 15 November 2022 14:00
> To: Or Ozeri 
> Cc: qemu-de...@nongnu.org; qemu-block@nongnu.org; Danny Harnik
> 
> Subject: [EXTERNAL] Re: [PATCH v1] block/rbd: Add support for layered
> encryption
> 
> On Sun, Nov 13, 2022 at 11:16 AM Or Ozeri  wrote:
> >
> >
> >
> > > -Original Message-
> > > From: Ilya Dryomov 
> > > Sent: 11 November 2022 15:01
> > > To: Or Ozeri 
> > > Cc: qemu-de...@nongnu.org; qemu-block@nongnu.org; Danny Harnik
> > > 
> > > Subject: [EXTERNAL] Re: [PATCH v1] block/rbd: Add support for
> > > layered encryption
> > >
> > > I don't understand the need for this char* array.  Is there a
> > > problem with putting the blob directly into
> > > luks_all_opts->passphrase just like the size is put into
> > > luks_all_opts->passphrase_size?
> > >
> >
> > luks_all_opts->passphrase has a const modifier.
> 
> Hi Or,
> 
> That's really not a reason to make a dynamic memory allocation.  You can just
> cast that const away but I suspect that the underlying issue is that a const is
> missing somewhere else.  At the end of the day, QEMU allocates a buffer for
> the passphrase when it's fetched via the secret API -- that pointer should
> assign to const char* just fine.
> 
> Thanks,
> 
> Ilya


[PATCH 18/30] migration: Trivial cleanup save_page_header() on same block check

2022-11-15 Thread Juan Quintela
From: Peter Xu 

The 2nd check on RAM_SAVE_FLAG_CONTINUE is a bit redundant.  Use a boolean
to be clearer.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 9ded381e0a..42b6a543bd 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -689,14 +689,15 @@ static size_t save_page_header(RAMState *rs, QEMUFile *f, 
 RAMBlock *block,
ram_addr_t offset)
 {
 size_t size, len;
+bool same_block = (block == rs->last_sent_block);
 
-if (block == rs->last_sent_block) {
+if (same_block) {
 offset |= RAM_SAVE_FLAG_CONTINUE;
 }
 qemu_put_be64(f, offset);
 size = 8;
 
-if (!(offset & RAM_SAVE_FLAG_CONTINUE)) {
+if (!same_block) {
 len = strlen(block->idstr);
 qemu_put_byte(f, len);
 qemu_put_buffer(f, (uint8_t *)block->idstr, len);
-- 
2.38.1




[PATCH v3] block/rbd: Add support for layered encryption

2022-11-15 Thread Or Ozeri
Starting from ceph Reef, RBD has built-in support for layered encryption,
where each ancestor image (in a cloned image setting) can possibly be
encrypted using a unique passphrase.

A new function, rbd_encryption_load2, was added to librbd API.
This new function supports an array of passphrases (via "spec" structs).

This commit extends the qemu rbd driver API to use this new librbd API,
in order to support this new layered encryption feature.
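
As an illustration, the new 'parent' member should make configurations roughly like the following possible (hypothetical secret names; the exact schema comes from the qapi/block-core.json changes below):

    -blockdev '{ "driver": "rbd", "pool": "rbd", "image": "cloned-image",
      "node-name": "rbd0",
      "encrypt": { "format": "luks-any", "key-secret": "sec-child",
        "parent": { "format": "luks-any", "key-secret": "sec-parent" } } }'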

Signed-off-by: Or Ozeri 
---
v3: further nit fixes suggested by @idryomov
v2: nit fixes suggested by @idryomov
---
 block/rbd.c  | 119 ++-
 qapi/block-core.json |  35 +++--
 2 files changed, 150 insertions(+), 4 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index f826410f40..ce017c29b5 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -71,6 +71,16 @@ static const char rbd_luks2_header_verification[
 'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 2
 };
 
+static const char rbd_layered_luks_header_verification[
+RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
+'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 1
+};
+
+static const char rbd_layered_luks2_header_verification[
+RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
+'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 2
+};
+
 typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
@@ -470,6 +480,9 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
 size_t passphrase_len;
 rbd_encryption_luks1_format_options_t luks_opts;
 rbd_encryption_luks2_format_options_t luks2_opts;
+#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
+rbd_encryption_luks_format_options_t luks_any_opts;
+#endif
 rbd_encryption_format_t format;
 rbd_encryption_options_t opts;
 size_t opts_size;
@@ -505,6 +518,23 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
 luks2_opts.passphrase_size = passphrase_len;
 break;
 }
+#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
+case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS_ANY: {
+memset(&luks_any_opts, 0, sizeof(luks_any_opts));
+format = RBD_ENCRYPTION_FORMAT_LUKS;
+opts = &luks_any_opts;
+opts_size = sizeof(luks_any_opts);
+r = qemu_rbd_convert_luks_options(
+qapi_RbdEncryptionOptionsLUKSAny_base(&encrypt->u.luks_any),
+&passphrase, &passphrase_len, errp);
+if (r < 0) {
+return r;
+}
+luks_any_opts.passphrase = passphrase;
+luks_any_opts.passphrase_size = passphrase_len;
+break;
+}
+#endif
 default: {
 r = -ENOTSUP;
 error_setg_errno(
@@ -522,6 +552,74 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
 
 return 0;
 }
+
+#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
+static int qemu_rbd_encryption_load2(rbd_image_t image,
+ RbdEncryptionOptions *encrypt,
+ Error **errp)
+{
+int r = 0;
+int encrypt_count = 1;
+int i;
+RbdEncryptionOptions *curr_encrypt;
+rbd_encryption_spec_t *specs;
+rbd_encryption_luks_format_options_t* luks_any_opts;
+
+/* count encryption options */
+for (curr_encrypt = encrypt; curr_encrypt->has_parent;
+ curr_encrypt = curr_encrypt->parent) {
+++encrypt_count;
+}
+
+specs = g_new0(rbd_encryption_spec_t, encrypt_count);
+
+curr_encrypt = encrypt;
+for (i = 0; i < encrypt_count; ++i) {
+if (curr_encrypt->format != RBD_IMAGE_ENCRYPTION_FORMAT_LUKS_ANY) {
+r = -ENOTSUP;
+error_setg_errno(
+errp, -r, "unknown image encryption format: %u",
+curr_encrypt->format);
+goto exit;
+}
+
+specs[i].format = RBD_ENCRYPTION_FORMAT_LUKS;
+specs[i].opts_size = sizeof(rbd_encryption_luks_format_options_t);
+
+luks_any_opts = g_new0(rbd_encryption_luks_format_options_t, 1);
+specs[i].opts = luks_any_opts;
+
+r = qemu_rbd_convert_luks_options(
+qapi_RbdEncryptionOptionsLUKSAny_base(
+&curr_encrypt->u.luks_any),
+(char**)&luks_any_opts->passphrase,
+&luks_any_opts->passphrase_size,
+errp);
+if (r < 0) {
+goto exit;
+}
+
+curr_encrypt = curr_encrypt->parent;
+}
+
+r = rbd_encryption_load2(image, specs, encrypt_count);
+if (r < 0) {
+error_setg_errno(errp, -r, "layered encryption load fail");
+goto exit;
+}
+
+exit:
+for (i = 0; i < encrypt_count; ++i) {
+luks_any_opts = specs[i].opts;
+if (luks_any_opts) {
+g_free((char*)luks_any_opts->passphrase);
+g_free(luks_any_opts);
+}
+}
+g_free(specs);
+return r;
+}
+#endif
 #endif
 
 /* FIXME Deprecate and remove keypairs or make it available in QMP. */
@@ -993,7 +1091,16 @@ static int qemu_rbd_open(BlockDriverState *bs, 

[PATCH 23/30] migration: Introduce pss_channel

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Introduce pss_channel for PageSearchStatus, define it as "the migration
channel to be used to transfer this host page".

We used to have rs->f, which is a mirror to MigrationState.to_dst_file.

After postcopy preempt initial version, rs->f can be dynamically changed
depending on which channel we want to use.

But that later work still doesn't grant full concurrency of sending pages
in e.g. different threads, because rs->f can either be the PRECOPY channel
or POSTCOPY channel.  This needs to be per-thread too.

PageSearchStatus is actually a good piece of struct which we can leverage
if we want to have multiple threads sending pages.  Sending a single guest
page may not make sense, so we make the granule the "host page", and in
the PSS structure we allow specifying a QEMUFile* to migrate a specific host
page.  Then we open the possibility to specify different channels in
different threads with different PSS structures.

The PSS prefix can be slightly misleading here because e.g. for the
upcoming usage of postcopy channel/thread it's not "searching" (or,
scanning) at all but sending the explicit page that was requested.  However,
since PSS has existed for some years, keep it as-is until someone complains.

This patch mostly (simply) replaces rs->f with pss->pss_channel. No
functional change intended for this patch yet.  But it does prepare to
finally drop rs->f, and make ram_save_guest_page() thread safe.
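
In other words, each sender thread eventually owns a PSS bound to its own channel, roughly (a sketch; the rs->pss array itself arrives later in the series):

    PageSearchStatus *pss = &rs->pss[RAM_CHANNEL_PRECOPY];
    pss->pss_channel = migrate_get_current()->to_dst_file;  /* this thread's own channel */
    ram_save_host_page(rs, pss);   /* sends via pss->pss_channel, not rs->f */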

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 70 +++--
 1 file changed, 38 insertions(+), 32 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index b71edf1f26..fedd61b3da 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -481,6 +481,8 @@ void dirty_sync_missed_zero_copy(void)
 
 /* used by the search for pages to send */
 struct PageSearchStatus {
+/* The migration channel used for a specific host page */
+QEMUFile*pss_channel;
 /* Current block being searched */
 RAMBlock*block;
 /* Current page to search from */
@@ -803,9 +805,9 @@ static void xbzrle_cache_zero_page(RAMState *rs, ram_addr_t 
current_addr)
  * @block: block that contains the page we want to send
  * @offset: offset inside the block for the page
  */
-static int save_xbzrle_page(RAMState *rs, uint8_t **current_data,
-ram_addr_t current_addr, RAMBlock *block,
-ram_addr_t offset)
+static int save_xbzrle_page(RAMState *rs, QEMUFile *file,
+uint8_t **current_data, ram_addr_t current_addr,
+RAMBlock *block, ram_addr_t offset)
 {
 int encoded_len = 0, bytes_xbzrle;
 uint8_t *prev_cached_page;
@@ -873,11 +875,11 @@ static int save_xbzrle_page(RAMState *rs, uint8_t 
**current_data,
 }
 
 /* Send XBZRLE based compressed page */
-bytes_xbzrle = save_page_header(rs, rs->f, block,
+bytes_xbzrle = save_page_header(rs, file, block,
 offset | RAM_SAVE_FLAG_XBZRLE);
-qemu_put_byte(rs->f, ENCODING_FLAG_XBZRLE);
-qemu_put_be16(rs->f, encoded_len);
-qemu_put_buffer(rs->f, XBZRLE.encoded_buf, encoded_len);
+qemu_put_byte(file, ENCODING_FLAG_XBZRLE);
+qemu_put_be16(file, encoded_len);
+qemu_put_buffer(file, XBZRLE.encoded_buf, encoded_len);
 bytes_xbzrle += encoded_len + 1 + 2;
 /*
  * Like compressed_size (please see update_compress_thread_counts),
@@ -1333,9 +1335,10 @@ static int save_zero_page_to_file(RAMState *rs, QEMUFile 
*file,
  * @block: block that contains the page we want to send
  * @offset: offset inside the block for the page
  */
-static int save_zero_page(RAMState *rs, RAMBlock *block, ram_addr_t offset)
+static int save_zero_page(RAMState *rs, QEMUFile *file, RAMBlock *block,
+  ram_addr_t offset)
 {
-int len = save_zero_page_to_file(rs, rs->f, block, offset);
+int len = save_zero_page_to_file(rs, file, block, offset);
 
 if (len) {
stat64_add(&ram_atomic_counters.duplicate, 1);
@@ -1352,15 +1355,15 @@ static int save_zero_page(RAMState *rs, RAMBlock 
*block, ram_addr_t offset)
  *
  * Return true if the pages has been saved, otherwise false is returned.
  */
-static bool control_save_page(RAMState *rs, RAMBlock *block, ram_addr_t offset,
-  int *pages)
+static bool control_save_page(PageSearchStatus *pss, RAMBlock *block,
+  ram_addr_t offset, int *pages)
 {
 uint64_t bytes_xmit = 0;
 int ret;
 
 *pages = -1;
-ret = ram_control_save_page(rs->f, block->offset, offset, TARGET_PAGE_SIZE,
-&bytes_xmit);
+ret = ram_control_save_page(pss->pss_channel, block->offset, offset,
+TARGET_PAGE_SIZE, &bytes_xmit);
 if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
 return false;
 }

[PATCH 29/30] migration: Drop rs->f

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Now with rs->pss we can already cache channels in pss->pss_channel.  That
pss_channel contains more information than rs->f because it's per-channel.
So rs->f can be replaced by rs->pss[RAM_CHANNEL_PRECOPY].pss_channel,
while rs->f itself is a bit vague now.

Note that vanilla postcopy still sends pages via pss[RAM_CHANNEL_PRECOPY];
that's slightly confusing but it reflects reality.

Then, after the replacement we can safely drop rs->f.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 88e61b0aeb..29e413b97b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -351,8 +351,6 @@ struct RAMSrcPageRequest {
 
 /* State of RAM for migration */
 struct RAMState {
-/* QEMUFile used for this migration */
-QEMUFile *f;
 /*
  * PageSearchStatus structures for the channels when send pages.
  * Protected by the bitmap_mutex.
@@ -2560,8 +2558,6 @@ static int ram_find_and_save_block(RAMState *rs)
 }
 
 if (found) {
-/* Cache rs->f in pss_channel (TODO: remove rs->f) */
-pss->pss_channel = rs->f;
 pages = ram_save_host_page(rs, pss);
 }
 } while (!pages && again);
@@ -3117,7 +3113,7 @@ static void ram_state_resume_prepare(RAMState *rs, 
QEMUFile *out)
 ram_state_reset(rs);
 
 /* Update RAMState cache of output QEMUFile */
-rs->f = out;
+rs->pss[RAM_CHANNEL_PRECOPY].pss_channel = out;
 
 trace_ram_state_resume_prepare(pages);
 }
@@ -3208,7 +3204,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 return -1;
 }
 }
-(*rsp)->f = f;
+(*rsp)->pss[RAM_CHANNEL_PRECOPY].pss_channel = f;
 
 WITH_RCU_READ_LOCK_GUARD() {
 qemu_put_be64(f, ram_bytes_total_common(true) | 
RAM_SAVE_FLAG_MEM_SIZE);
@@ -3343,7 +3339,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 out:
 if (ret >= 0
 && migration_is_setup_or_active(migrate_get_current()->state)) {
-ret = multifd_send_sync_main(rs->f);
+ret = multifd_send_sync_main(rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
 if (ret < 0) {
 return ret;
 }
@@ -3413,7 +3409,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 return ret;
 }
 
-ret = multifd_send_sync_main(rs->f);
+ret = multifd_send_sync_main(rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
 if (ret < 0) {
 return ret;
 }
-- 
2.38.1




[PATCH 27/30] migration: Send requested page directly in rp-return thread

2022-11-15 Thread Juan Quintela
From: Peter Xu 

With all the facilities ready, send the requested page directly in the
rp-return thread rather than queuing it in the request queue, if and only
if postcopy preempt is enabled.  It can achieve so because it uses separate
channel for sending urgent pages.  The only shared data is bitmap and it's
protected by the bitmap_mutex.

Note that since we're moving the ownership of the urgent channel from the
migration thread to rp thread it also means the rp thread is responsible
for managing the qemufile, e.g. properly close it when pausing migration
happens.  For this, extend migration_release_from_dst_file to cover shutdown
of the urgent channel too, renaming it to migration_release_dst_files() to
better show what it does.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.c |  35 +++--
 migration/ram.c   | 112 ++
 2 files changed, 131 insertions(+), 16 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 1f95877fb4..42f36c1e2c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2868,8 +2868,11 @@ static int migrate_handle_rp_resume_ack(MigrationState 
*s, uint32_t value)
 return 0;
 }
 
-/* Release ms->rp_state.from_dst_file in a safe way */
-static void migration_release_from_dst_file(MigrationState *ms)
+/*
+ * Release ms->rp_state.from_dst_file (and postcopy_qemufile_src if
+ * existed) in a safe way.
+ */
+static void migration_release_dst_files(MigrationState *ms)
 {
 QEMUFile *file;
 
@@ -2882,6 +2885,18 @@ static void 
migration_release_from_dst_file(MigrationState *ms)
 ms->rp_state.from_dst_file = NULL;
 }
 
+/*
+ * Do the same to postcopy fast path socket too if there is.  No
+ * locking needed because this qemufile should only be managed by
+ * return path thread.
+ */
+if (ms->postcopy_qemufile_src) {
+migration_ioc_unregister_yank_from_file(ms->postcopy_qemufile_src);
+qemu_file_shutdown(ms->postcopy_qemufile_src);
+qemu_fclose(ms->postcopy_qemufile_src);
+ms->postcopy_qemufile_src = NULL;
+}
+
 qemu_fclose(file);
 }
 
@@ -3026,7 +3041,7 @@ out:
  * Maybe there is something we can do: it looks like a
  * network down issue, and we pause for a recovery.
  */
-migration_release_from_dst_file(ms);
+migration_release_dst_files(ms);
 rp = NULL;
 if (postcopy_pause_return_path_thread(ms)) {
 /*
@@ -3044,7 +3059,7 @@ out:
 }
 
 trace_source_return_path_thread_end();
-migration_release_from_dst_file(ms);
+migration_release_dst_files(ms);
 rcu_unregister_thread();
 return NULL;
 }
@@ -3567,18 +3582,6 @@ static MigThrError postcopy_pause(MigrationState *s)
 qemu_file_shutdown(file);
 qemu_fclose(file);
 
-/*
- * Do the same to postcopy fast path socket too if there is.  No
- * locking needed because no racer as long as we do this before setting
- * status to paused.
- */
-if (s->postcopy_qemufile_src) {
-migration_ioc_unregister_yank_from_file(s->postcopy_qemufile_src);
-qemu_file_shutdown(s->postcopy_qemufile_src);
-qemu_fclose(s->postcopy_qemufile_src);
-s->postcopy_qemufile_src = NULL;
-}
-
migrate_set_state(&s->state, s->state,
   MIGRATION_STATUS_POSTCOPY_PAUSED);
 
diff --git a/migration/ram.c b/migration/ram.c
index dbdde5a6a5..5dc221a2fc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -574,6 +574,8 @@ static QemuThread *decompress_threads;
 static QemuMutex decomp_done_lock;
 static QemuCond decomp_done_cond;
 
+static int ram_save_host_page_urgent(PageSearchStatus *pss);
+
 static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock 
*block,
  ram_addr_t offset, uint8_t *source_buf);
 
@@ -588,6 +590,16 @@ static void pss_init(PageSearchStatus *pss, RAMBlock *rb, 
ram_addr_t page)
 pss->complete_round = false;
 }
 
+/*
+ * Check whether two PSSs are actively sending the same page.  Return true
+ * if it is, false otherwise.
+ */
+static bool pss_overlap(PageSearchStatus *pss1, PageSearchStatus *pss2)
+{
+return pss1->host_page_sending && pss2->host_page_sending &&
+(pss1->host_page_start == pss2->host_page_start);
+}
+
 static void *do_data_compress(void *opaque)
 {
 CompressParam *param = opaque;
@@ -2288,6 +2300,57 @@ int ram_save_queue_pages(const char *rbname, ram_addr_t 
start, ram_addr_t len)
 return -1;
 }
 
+/*
+ * When with postcopy preempt, we send back the page directly in the
+ * rp-return thread.
+ */
+if (postcopy_preempt_active()) {
+ram_addr_t page_start = start >> TARGET_PAGE_BITS;
+size_t page_size = 

[PATCH 14/30] migration: Disable multifd explicitly with compression

2022-11-15 Thread Juan Quintela
From: Peter Xu 

The multifd thread model does not work with compression, so explicitly disable it.

Note that previously, even if both were enabled, nothing would go
wrong, because the compression code has higher priority so the multifd
feature would just be ignored.  Now we fail even earlier, at config time,
so the user is better aware of the consequence.

Note that there is a slight chance of breaking existing users, but
let's assume they're not the majority and not serious users, or they would
have already found that multifd is not working.

With that, we can safely drop the check in ram_save_target_page() for using
multifd, because when multifd=on then compression=off, so the removed
check on save_page_use_compression() will also always return false too.
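
For example, with the check in place, an HMP session is rejected up front, roughly like this (sketch; exact output may differ):

    (qemu) migrate_set_capability compress on
    (qemu) migrate_set_capability multifd on
    Error: Multifd is not compatible with compress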

Signed-off-by: Peter Xu 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.c |  7 +++
 migration/ram.c   | 11 +--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 0bc3fce4b7..9fbed8819a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1370,6 +1370,13 @@ static bool migrate_caps_check(bool *cap_list,
 }
 }
 
+if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
+if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+error_setg(errp, "Multifd is not compatible with compress");
+return false;
+}
+}
+
 return true;
 }
 
diff --git a/migration/ram.c b/migration/ram.c
index c0f5d6d287..2fcce796d0 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2333,13 +2333,12 @@ static int ram_save_target_page(RAMState *rs, 
PageSearchStatus *pss)
 }
 
 /*
- * Do not use multifd for:
- * 1. Compression as the first page in the new block should be posted out
- *before sending the compressed page
- * 2. In postcopy as one whole host page should be placed
+ * Do not use multifd in postcopy as one whole host page should be
+ * placed.  Meanwhile postcopy requires atomic update of pages, so even
+ * if host page size == guest page size the dest guest during run may
+ * still see partially copied pages which is data corruption.
  */
-if (!save_page_use_compression(rs) && migrate_use_multifd()
-&& !migration_in_postcopy()) {
+if (migrate_use_multifd() && !migration_in_postcopy()) {
 return ram_save_multifd_page(rs, block, offset);
 }
 
-- 
2.38.1




[PATCH 28/30] migration: Remove old preempt code around state maintainance

2022-11-15 Thread Juan Quintela
From: Peter Xu 

With the new code to send pages in the rp-return thread, there's little
point in keeping lots of the old code that maintained the preempt state in
the migration thread, because the new way should always be faster.

Then, if we'll always send pages in the rp-return thread anyway, we don't
need that logic to maintain preempt state anymore, because now we serialize
things using the mutex directly instead of using those fields.

It's very unfortunate to have carried that code for such a short period, but
it was still the intermediate step at which we noticed the next bottleneck on
the migration thread.  Now the best we can do is drop the unnecessary code,
as long as the new code is stable, to reduce the burden.  It's actually a good
thing because the new "sending page in rp-return thread" model is (IMHO)
even cleaner and with better performance.

Remove the old code that was responsible for maintaining preempt states; at
the same time also remove the x-postcopy-preempt-break-huge parameter because
with concurrent sender threads we don't really need to break huge pages anymore.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.h |   7 -
 migration/migration.c |   2 -
 migration/ram.c   | 291 +-
 3 files changed, 3 insertions(+), 297 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index cdad8aceaa..ae4ffd3454 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -340,13 +340,6 @@ struct MigrationState {
 bool send_configuration;
 /* Whether we send section footer during migration */
 bool send_section_footer;
-/*
- * Whether we allow break sending huge pages when postcopy preempt is
- * enabled.  When disabled, we won't interrupt precopy within sending a
- * host huge page, which is the old behavior of vanilla postcopy.
- * NOTE: this parameter is ignored if postcopy preempt is not enabled.
- */
-bool postcopy_preempt_break_huge;
 
 /* Needed by postcopy-pause state */
 QemuSemaphore postcopy_pause_sem;
diff --git a/migration/migration.c b/migration/migration.c
index 42f36c1e2c..22fc863c67 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -4422,8 +4422,6 @@ static Property migration_properties[] = {
 DEFINE_PROP_SIZE("announce-step", MigrationState,
   parameters.announce_step,
   DEFAULT_MIGRATE_ANNOUNCE_STEP),
-DEFINE_PROP_BOOL("x-postcopy-preempt-break-huge", MigrationState,
-  postcopy_preempt_break_huge, true),
 DEFINE_PROP_STRING("tls-creds", MigrationState, parameters.tls_creds),
 DEFINE_PROP_STRING("tls-hostname", MigrationState, 
parameters.tls_hostname),
 DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
diff --git a/migration/ram.c b/migration/ram.c
index 5dc221a2fc..88e61b0aeb 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -125,28 +125,6 @@ struct PageSearchStatus {
 unsigned long page;
 /* Set once we wrap around */
 bool complete_round;
-/*
- * [POSTCOPY-ONLY] Whether current page is explicitly requested by
- * postcopy.  When set, the request is "urgent" because the dest QEMU
- * threads are waiting for us.
- */
-bool postcopy_requested;
-/*
- * [POSTCOPY-ONLY] The target channel to use to send current page.
- *
- * Note: This may _not_ match with the value in postcopy_requested
- * above. Let's imagine the case where the postcopy request is exactly
- * the page that we're sending in progress during precopy. In this case
- * we'll have postcopy_requested set to true but the target channel
- * will be the precopy channel (so that we don't split brain on that
- * specific page since the precopy channel already contains partial of
- * that page data).
- *
- * Besides that specific use case, postcopy_target_channel should
- * always be equal to postcopy_requested, because by default we send
- * postcopy pages via postcopy preempt channel.
- */
-bool postcopy_target_channel;
 /* Whether we're sending a host page */
 bool  host_page_sending;
 /* The start/end of current host page.  Invalid if 
host_page_sending==false */
@@ -371,20 +349,6 @@ struct RAMSrcPageRequest {
 QSIMPLEQ_ENTRY(RAMSrcPageRequest) next_req;
 };
 
-typedef struct {
-/*
- * Cached ramblock/offset values if preempted.  They're only meaningful if
- * preempted==true below.
- */
-RAMBlock *ram_block;
-unsigned long ram_page;
-/*
- * Whether a postcopy preemption just happened.  Will be reset after
- * precopy recovered to background migration.
- */
-bool preempted;
-} PostcopyPreemptState;
-
 /* State of RAM for migration */
 struct RAMState {
 /* QEMUFile used for this migration */
@@ -447,14 +411,6 @@ struct RAMState {
 

[PATCH 19/30] migration: Remove RAMState.f references in compression code

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Remove references to RAMState.f in compress_page_with_multi_thread() and
flush_compressed_data().

The compression code currently isn't compatible with having more than one
channel (it wouldn't know which channel to flush the compressed data to), so
to keep it simple we always flush to the default to_dst_file port until
someone wants to add multi-port support, as rs->f right now can really
change (after postcopy preempt is introduced).

There should be no functional change at all after the patch is applied: as
long as rs->f is referenced in the compression code, it must be to_dst_file.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 42b6a543bd..ebc5664dcc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1489,6 +1489,7 @@ static bool save_page_use_compression(RAMState *rs);
 
 static void flush_compressed_data(RAMState *rs)
 {
+MigrationState *ms = migrate_get_current();
 int idx, len, thread_count;
 
 if (!save_page_use_compression(rs)) {
@@ -1507,7 +1508,7 @@ static void flush_compressed_data(RAMState *rs)
 for (idx = 0; idx < thread_count; idx++) {
qemu_mutex_lock(&comp_param[idx].mutex);
 if (!comp_param[idx].quit) {
-len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+len = qemu_put_qemu_file(ms->to_dst_file, comp_param[idx].file);
 /*
  * it's safe to fetch zero_page without holding comp_done_lock
  * as there is no further request submitted to the thread,
@@ -1526,11 +1527,11 @@ static inline void set_compress_params(CompressParam 
*param, RAMBlock *block,
 param->offset = offset;
 }
 
-static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
-   ram_addr_t offset)
+static int compress_page_with_multi_thread(RAMBlock *block, ram_addr_t offset)
 {
 int idx, thread_count, bytes_xmit = -1, pages = -1;
 bool wait = migrate_compress_wait_thread();
+MigrationState *ms = migrate_get_current();
 
 thread_count = migrate_compress_threads();
qemu_mutex_lock(&comp_done_lock);
@@ -1538,7 +1539,8 @@ retry:
 for (idx = 0; idx < thread_count; idx++) {
 if (comp_param[idx].done) {
 comp_param[idx].done = false;
-bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+bytes_xmit = qemu_put_qemu_file(ms->to_dst_file,
+comp_param[idx].file);
qemu_mutex_lock(&comp_param[idx].mutex);
set_compress_params(&comp_param[idx], block, offset);
qemu_cond_signal(&comp_param[idx].cond);
@@ -2291,7 +2293,7 @@ static bool save_compress_page(RAMState *rs, RAMBlock 
*block, ram_addr_t offset)
 return false;
 }
 
-if (compress_page_with_multi_thread(rs, block, offset) > 0) {
+if (compress_page_with_multi_thread(block, offset) > 0) {
 return true;
 }
 
-- 
2.38.1




[PATCH 09/30] Unit test code and benchmark code

2022-11-15 Thread Juan Quintela
From: ling xu 

Unit test code is in test-xbzrle.c, and benchmark code is in xbzrle-bench.c
for performance benchmarking.

Signed-off-by: ling xu 
Co-authored-by: Zhou Zhao 
Co-authored-by: Jun Jin 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 tests/bench/xbzrle-bench.c | 465 +
 tests/unit/test-xbzrle.c   |  39 +++-
 tests/bench/meson.build|   4 +
 3 files changed, 503 insertions(+), 5 deletions(-)
 create mode 100644 tests/bench/xbzrle-bench.c

diff --git a/tests/bench/xbzrle-bench.c b/tests/bench/xbzrle-bench.c
new file mode 100644
index 00..d71397e6f4
--- /dev/null
+++ b/tests/bench/xbzrle-bench.c
@@ -0,0 +1,465 @@
+/*
+ * Xor Based Zero Run Length Encoding unit tests.
+ *
+ * Copyright 2013 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *  Orit Wasserman  
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "../migration/xbzrle.h"
+
+#define XBZRLE_PAGE_SIZE 4096
+
+#if defined(CONFIG_AVX512BW_OPT)
+static bool is_cpu_support_avx512bw;
+#include "qemu/cpuid.h"
+static void __attribute__((constructor)) init_cpu_flag(void)
+{
+unsigned max = __get_cpuid_max(0, NULL);
+int a, b, c, d;
+is_cpu_support_avx512bw = false;
+if (max >= 1) {
+__cpuid(1, a, b, c, d);
+ /* We must check that AVX is not just available, but usable.  */
+if ((c & bit_OSXSAVE) && (c & bit_AVX) && max >= 7) {
+int bv;
+__asm("xgetbv" : "=a"(bv), "=d"(d) : "c"(0));
+__cpuid_count(7, 0, a, b, c, d);
+   /* 0xe6:
+*  XCR0[7:5] = 111b (OPMASK state, upper 256-bit of ZMM0-ZMM15
+*and ZMM16-ZMM31 state are enabled by OS)
+*  XCR0[2:1] = 11b (XMM state and YMM state are enabled by OS)
+*/
+if ((bv & 0xe6) == 0xe6 && (b & bit_AVX512BW)) {
+is_cpu_support_avx512bw = true;
+}
+}
+}
+return ;
+}
+#endif
+
+struct ResTime {
+float t_raw;
+float t_512;
+};
+
+static void encode_decode_zero(struct ResTime *res)
+{
+uint8_t *buffer = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *compressed = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *buffer512 = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *compressed512 = g_malloc0(XBZRLE_PAGE_SIZE);
+int i = 0;
+int dlen = 0, dlen512 = 0;
+int diff_len = g_test_rand_int_range(0, XBZRLE_PAGE_SIZE - 1006);
+
+for (i = diff_len; i > 0; i--) {
+buffer[1000 + i] = i;
+buffer512[1000 + i] = i;
+}
+
+buffer[1000 + diff_len + 3] = 103;
+buffer[1000 + diff_len + 5] = 105;
+
+buffer512[1000 + diff_len + 3] = 103;
+buffer512[1000 + diff_len + 5] = 105;
+
+/* encode zero page */
+time_t t_start, t_end, t_start512, t_end512;
+t_start = clock();
+dlen = xbzrle_encode_buffer(buffer, buffer, XBZRLE_PAGE_SIZE, compressed,
+   XBZRLE_PAGE_SIZE);
+t_end = clock();
+float time_val = difftime(t_end, t_start);
+g_assert(dlen == 0);
+
+t_start512 = clock();
+dlen512 = xbzrle_encode_buffer_avx512(buffer512, buffer512, 
XBZRLE_PAGE_SIZE,
+   compressed512, XBZRLE_PAGE_SIZE);
+t_end512 = clock();
+float time_val512 = difftime(t_end512, t_start512);
+g_assert(dlen512 == 0);
+
+res->t_raw = time_val;
+res->t_512 = time_val512;
+
+g_free(buffer);
+g_free(compressed);
+g_free(buffer512);
+g_free(compressed512);
+
+}
+
+static void test_encode_decode_zero_avx512(void)
+{
+int i;
+float time_raw = 0.0, time_512 = 0.0;
+struct ResTime res;
+for (i = 0; i < 1; i++) {
+encode_decode_zero(&res);
+time_raw += res.t_raw;
+time_512 += res.t_512;
+}
+printf("Zero test:\n");
+printf("Raw xbzrle_encode time is %f ms\n", time_raw);
+printf("512 xbzrle_encode time is %f ms\n", time_512);
+}
+
+static void encode_decode_unchanged(struct ResTime *res)
+{
+uint8_t *compressed = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *test = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *compressed512 = g_malloc0(XBZRLE_PAGE_SIZE);
+uint8_t *test512 = g_malloc0(XBZRLE_PAGE_SIZE);
+int i = 0;
+int dlen = 0, dlen512 = 0;
+int diff_len = g_test_rand_int_range(0, XBZRLE_PAGE_SIZE - 1006);
+
+for (i = diff_len; i > 0; i--) {
+test[1000 + i] = i + 4;
+test512[1000 + i] = i + 4;
+}
+
+test[1000 + diff_len + 3] = 107;
+test[1000 + diff_len + 5] = 109;
+
+test512[1000 + diff_len + 3] = 107;
+test512[1000 + diff_len + 5] = 109;
+
+/* test unchanged buffer */
+time_t t_start, t_end, t_start512, t_end512;
+t_start = clock();
+dlen = xbzrle_encode_buffer(test, test, XBZRLE_PAGE_SIZE, compressed,
+XBZRLE_PAGE_SIZE);
+

[PATCH 24/30] migration: Add pss_init()

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Helper to init PSS structures.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index fedd61b3da..a2e86623d3 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -570,6 +570,14 @@ static bool do_compress_ram_page(QEMUFile *f, z_stream 
*stream, RAMBlock *block,
 static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss,
  bool postcopy_requested);
 
+/* NOTE: page is the PFN not real ram_addr_t. */
+static void pss_init(PageSearchStatus *pss, RAMBlock *rb, ram_addr_t page)
+{
+pss->block = rb;
+pss->page = page;
+pss->complete_round = false;
+}
+
 static void *do_data_compress(void *opaque)
 {
 CompressParam *param = opaque;
@@ -2678,9 +2686,7 @@ static int ram_find_and_save_block(RAMState *rs)
 rs->last_page = 0;
 }
 
-pss.block = rs->last_seen_block;
-pss.page = rs->last_page;
-pss.complete_round = false;
+pss_init(&pss, rs->last_seen_block, rs->last_page);
 
 do {
 again = true;
-- 
2.38.1




[PATCH 11/30] migration: Fix race on qemu_file_shutdown()

2022-11-15 Thread Juan Quintela
From: Peter Xu 

In qemu_file_shutdown(), there's a possible race with the current order of
operations.  There are two major things to do:

  (1) Do real shutdown() (e.g. shutdown() syscall on socket)
  (2) Update qemufile's last_error

We must do (2) before (1) otherwise there can be a race condition like:

  page receiver other thread
  - 
  qemu_get_buffer()
do shutdown()
returns 0 (buffer all zero)
(meanwhile we didn't check this retcode)
  try to detect IO error
last_error==NULL, IO okay
  install ALL-ZERO page
set last_error
  --> guest crash!

To fix this, we could also check the retval of qemu_get_buffer(), but not all
APIs can be properly checked and ultimately we still need to go back to
qemu_file_get_error().  E.g. qemu_get_byte() doesn't return an error.

Maybe some day a rework of the qemufile API is really needed, but for now
keep using qemu_file_get_error() and fix it by not allowing that race
condition to happen.  Here shutdown() is indeed special because its
last_error is emulated.  For real -EIO errors it'll always be set when e.g.
a sendmsg() error triggers, so we won't miss those; only shutdown() is a bit
tricky here.

Cc: Daniel P. Berrange 
Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/qemu-file.c | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 4f400c2e52..2d5f74ffc2 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -79,6 +79,30 @@ int qemu_file_shutdown(QEMUFile *f)
 int ret = 0;
 
 f->shutdown = true;
+
+/*
+ * We must set qemufile error before the real shutdown(), otherwise
+ * there can be a race window where we thought IO all went though
+ * (because last_error==NULL) but actually IO has already stopped.
+ *
+ * If without correct ordering, the race can happen like this:
+ *
+ *  page receiver other thread
+ *  - 
+ *  qemu_get_buffer()
+ *do shutdown()
+ *returns 0 (buffer all zero)
+ *(we didn't check this retcode)
+ *  try to detect IO error
+ *last_error==NULL, IO okay
+ *  install ALL-ZERO page
+ *set last_error
+ *  --> guest crash!
+ */
+if (!f->last_error) {
+qemu_file_set_error(f, -EIO);
+}
+
 if (!qio_channel_has_feature(f->ioc,
  QIO_CHANNEL_FEATURE_SHUTDOWN)) {
 return -ENOSYS;
@@ -88,9 +112,6 @@ int qemu_file_shutdown(QEMUFile *f)
 ret = -EIO;
 }
 
-if (!f->last_error) {
-qemu_file_set_error(f, -EIO);
-}
 return ret;
 }
 
-- 
2.38.1




[PATCH 30/30] migration: Block migration comment or code is wrong

2022-11-15 Thread Juan Quintela
And it appears that what is wrong is the code. During the bulk stage we
need to make sure that some block is reported as pending, with no games
around max_size at all.

Signed-off-by: Juan Quintela 
Reviewed-by: Stefan Hajnoczi 
---
 migration/block.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index 3577c815a9..4347da1526 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -880,8 +880,8 @@ static void block_save_pending(QEMUFile *f, void *opaque, 
uint64_t max_size,
 blk_mig_unlock();
 
 /* Report at least one block pending during bulk phase */
-if (pending <= max_size && !block_mig_state.bulk_completed) {
-pending = max_size + BLK_MIG_BLOCK_SIZE;
+if (!pending && !block_mig_state.bulk_completed) {
+pending = BLK_MIG_BLOCK_SIZE;
 }
 
 trace_migration_block_save_pending(pending);
-- 
2.38.1




[PATCH 22/30] migration: Teach PSS about host page

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Migration code has a lot to do with host pages.  Teaching the PSS core about
the idea of the host page helps a lot and makes the code clean.  Meanwhile,
this prepares for future changes that can leverage the new PSS helpers
that this patch introduces to send host pages in another thread.

Three more fields are introduced for this:

  (1) host_page_sending: this is set to true when QEMU is sending a host
  page, false otherwise.

  (2) host_page_{start|end}: these point to the start/end of host page
  we're sending, and it's only valid when host_page_sending==true.

For example, when we look up the next dirty page on the ramblock, with
host_page_sending==true, we'll not try to look for anything beyond the
current host page boundary.  This can be slightly more efficient than the
current code, because currently we set pss->page to the next dirty bit
(which can be past the current host page boundary) and reset it to the host
page boundary if we find it goes beyond that.

With the above, we can easily make migration_bitmap_find_dirty() self-contained
by updating pss->page properly.  The rs* parameter is removed because it's not
even used in the old code.

When sending a host page, we should use the pss helpers like this:

  - pss_host_page_prepare(pss): called before sending host page
  - pss_within_range(pss): whether we're still working on the current host page?
  - pss_host_page_finish(pss): called after sending a host page

Then we can use ram_save_target_page() to save one small page.

Currently ram_save_host_page() is still the only user. If there'll be
another function to send host pages (e.g. in the return path thread) in the
future, it should follow the same style.
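
Roughly, the resulting calling pattern is (a sketch modeled on ram_save_host_page(); dirty-bit checks and error handling elided):

    pss_host_page_prepare(pss);         /* record host page start/end, mark sending */
    while (pss_within_range(pss)) {     /* stay within the current host page */
        ram_save_target_page(rs, pss);  /* send one target (small) page */
        pss_find_next_dirty(pss);       /* advance pss->page, bounded by the host page */
    }
    pss_host_page_finish(pss);          /* clear the host-page-sending state */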

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 95 +++--
 1 file changed, 76 insertions(+), 19 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 25fd3cf7dc..b71edf1f26 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -509,6 +509,11 @@ struct PageSearchStatus {
  * postcopy pages via postcopy preempt channel.
  */
 bool postcopy_target_channel;
+/* Whether we're sending a host page */
+bool  host_page_sending;
+/* The start/end of current host page.  Only valid if 
host_page_sending==true */
+unsigned long host_page_start;
+unsigned long host_page_end;
 };
 typedef struct PageSearchStatus PageSearchStatus;
 
@@ -886,26 +891,38 @@ static int save_xbzrle_page(RAMState *rs, uint8_t 
**current_data,
 }
 
 /**
- * migration_bitmap_find_dirty: find the next dirty page from start
+ * pss_find_next_dirty: find the next dirty page of current ramblock
  *
- * Returns the page offset within memory region of the start of a dirty page
+ * This function updates pss->page to point to the next dirty page index
+ * within the ramblock to migrate, or the end of ramblock when nothing
+ * found.  Note that when pss->host_page_sending==true it means we're
+ * during sending a host page, so we won't look for dirty page that is
+ * outside the host page boundary.
  *
- * @rs: current RAM state
- * @rb: RAMBlock where to search for dirty pages
- * @start: page where we start the search
+ * @pss: the current page search status
  */
-static inline
-unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
-  unsigned long start)
+static void pss_find_next_dirty(PageSearchStatus *pss)
 {
+RAMBlock *rb = pss->block;
 unsigned long size = rb->used_length >> TARGET_PAGE_BITS;
 unsigned long *bitmap = rb->bmap;
 
 if (ramblock_is_ignored(rb)) {
-return size;
+/* Points directly to the end, so we know no dirty page */
+pss->page = size;
+return;
 }
 
-return find_next_bit(bitmap, size, start);
+/*
+ * If during sending a host page, only look for dirty pages within the
+ * current host page being send.
+ */
+if (pss->host_page_sending) {
+assert(pss->host_page_end);
+size = MIN(size, pss->host_page_end);
+}
+
+pss->page = find_next_bit(bitmap, size, pss->page);
 }
 
 static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb,
@@ -1591,7 +1608,9 @@ static bool find_dirty_block(RAMState *rs, 
PageSearchStatus *pss, bool *again)
 pss->postcopy_requested = false;
 pss->postcopy_target_channel = RAM_CHANNEL_PRECOPY;
 
-pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
+/* Update pss->page for the next dirty bit in ramblock */
+pss_find_next_dirty(pss);
+
 if (pss->complete_round && pss->block == rs->last_seen_block &&
 pss->page >= rs->last_page) {
 /*
@@ -2480,6 +2499,44 @@ static void postcopy_preempt_reset_channel(RAMState *rs)
 }
 }
 
+/* Should be called before sending a host page */
+static void pss_host_page_prepare(PageSearchStatus *pss)
+{
+/* How many guest 

[PATCH 05/30] multifd: Create page_count fields into both MultiFD{Recv, Send}Params

2022-11-15 Thread Juan Quintela
We were recalculating it left and right.  We plan to change those
values in later patches.

Signed-off-by: Juan Quintela 
Reviewed-by: Leonardo Bras 
---
 migration/multifd.h | 4 
 migration/multifd.c | 7 ---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 941563c232..ff3aa2e2e9 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -82,6 +82,8 @@ typedef struct {
 uint32_t packet_len;
 /* guest page size */
 uint32_t page_size;
+/* number of pages in a full packet */
+uint32_t page_count;
 /* multifd flags for sending ram */
 int write_flags;
 
@@ -147,6 +149,8 @@ typedef struct {
 uint32_t packet_len;
 /* guest page size */
 uint32_t page_size;
+/* number of pages in a full packet */
+uint32_t page_count;
 
 /* syncs main thread and channels */
 QemuSemaphore sem_sync;
diff --git a/migration/multifd.c b/migration/multifd.c
index b32fe7edaf..c40d98ad5c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -279,7 +279,6 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
 static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
 {
 MultiFDPacket_t *packet = p->packet;
-uint32_t page_count = MULTIFD_PACKET_SIZE / p->page_size;
 RAMBlock *block;
 int i;
 
@@ -306,10 +305,10 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams 
*p, Error **errp)
  * If we received a packet that is 100 times bigger than expected
  * just stop migration.  It is a magic number.
  */
-if (packet->pages_alloc > page_count) {
+if (packet->pages_alloc > p->page_count) {
 error_setg(errp, "multifd: received packet "
"with size %u and expected a size of %u",
-   packet->pages_alloc, page_count) ;
+   packet->pages_alloc, p->page_count) ;
 return -1;
 }
 
@@ -944,6 +943,7 @@ int multifd_save_setup(Error **errp)
 p->iov = g_new0(struct iovec, page_count + 1);
 p->normal = g_new0(ram_addr_t, page_count);
 p->page_size = qemu_target_page_size();
+p->page_count = page_count;
 
 if (migrate_use_zero_copy_send()) {
 p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
@@ -1191,6 +1191,7 @@ int multifd_load_setup(Error **errp)
 p->name = g_strdup_printf("multifdrecv_%d", i);
 p->iov = g_new0(struct iovec, page_count);
 p->normal = g_new0(ram_addr_t, page_count);
+p->page_count = page_count;
 p->page_size = qemu_target_page_size();
 }
 
-- 
2.38.1




[PATCH 15/30] migration: Take bitmap mutex when completing ram migration

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Any call to ram_find_and_save_block() needs to take the bitmap mutex.  We
used to not take it for most of ram_save_complete() because we thought we
were the only one left using the bitmap, but that's not true after the
preempt full patchset was applied, since the return path can take it too.

Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 2fcce796d0..96fa521813 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3434,6 +3434,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 /* try transferring iterative blocks of memory */
 
 /* flush all remaining blocks regardless of rate limiting */
+qemu_mutex_lock(&rs->bitmap_mutex);
 while (true) {
 int pages;
 
@@ -3447,6 +3448,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 break;
 }
 }
+qemu_mutex_unlock(&rs->bitmap_mutex);
 
 flush_compressed_data(rs);
 ram_control_after_iterate(f, RAM_CONTROL_FINISH);
-- 
2.38.1




[PATCH 01/30] migration/channel-block: fix return value for qio_channel_block_{readv, writev}

2022-11-15 Thread Juan Quintela
From: Fiona Ebner 

in the error case. The documentation in include/io/channel.h states
that -1 or QIO_CHANNEL_ERR_BLOCK should be returned upon error. Simply
passing along the return value from the bdrv-functions has the
potential to confuse the call sites. Non-blocking mode is not
implemented currently, so -1 it is.
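
For reference, the contract a caller then relies on looks roughly like this (sketch):

    ssize_t n = qio_channel_readv(ioc, iov, niov, &err);
    if (n == QIO_CHANNEL_ERR_BLOCK) {
        /* would block: retry later (non-blocking channels only) */
    } else if (n < 0) {
        /* always -1 now; details are in err */
    }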

Signed-off-by: Fiona Ebner 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/channel-block.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/migration/channel-block.c b/migration/channel-block.c
index c55c8c93ce..f4ab53acdb 100644
--- a/migration/channel-block.c
+++ b/migration/channel-block.c
@@ -62,7 +62,8 @@ qio_channel_block_readv(QIOChannel *ioc,
qemu_iovec_init_external(&qiov, (struct iovec *)iov, niov);
ret = bdrv_readv_vmstate(bioc->bs, &qiov, bioc->offset);
 if (ret < 0) {
-return ret;
+error_setg_errno(errp, -ret, "bdrv_readv_vmstate failed");
+return -1;
 }
 
 bioc->offset += qiov.size;
@@ -86,7 +87,8 @@ qio_channel_block_writev(QIOChannel *ioc,
qemu_iovec_init_external(&qiov, (struct iovec *)iov, niov);
ret = bdrv_writev_vmstate(bioc->bs, &qiov, bioc->offset);
 if (ret < 0) {
-return ret;
+error_setg_errno(errp, -ret, "bdrv_writev_vmstate failed");
+return -1;
 }
 
 bioc->offset += qiov.size;
-- 
2.38.1




[PATCH 20/30] migration: Yield bitmap_mutex properly when sending/sleeping

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Don't take the bitmap mutex when sending pages, or when being throttled by
migration_rate_limit() (which is a bit tricky to call here in the ram code,
but still seems helpful).

It prepares for the possibility of concurrently sending pages in more than
one thread using the function ram_save_host_page(), because all threads may need the
bitmap_mutex to operate on bitmaps, so that either sendmsg() or any kind of
qemu_sem_wait() blocking for one thread will not block the other from
progressing.

Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 46 +++---
 1 file changed, 35 insertions(+), 11 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index ebc5664dcc..6428138194 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2480,9 +2480,14 @@ static void postcopy_preempt_reset_channel(RAMState *rs)
  * a host page in which case the remainder of the hostpage is sent.
  * Only dirty target pages are sent. Note that the host page size may
  * be a huge page for this block.
+ *
  * The saving stops at the boundary of the used_length of the block
  * if the RAMBlock isn't a multiple of the host page size.
  *
+ * The caller must be with ram_state.bitmap_mutex held to call this
+ * function.  Note that this function can temporarily release the lock, but
+ * when the function is returned it'll make sure the lock is still held.
+ *
  * Returns the number of pages written or negative on error
  *
  * @rs: current RAM state
@@ -2490,6 +2495,7 @@ static void postcopy_preempt_reset_channel(RAMState *rs)
  */
 static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
 {
+bool page_dirty, preempt_active = postcopy_preempt_active();
 int tmppages, pages = 0;
 size_t pagesize_bits =
 qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
@@ -2513,22 +2519,40 @@ static int ram_save_host_page(RAMState *rs, 
PageSearchStatus *pss)
 break;
 }
 
+page_dirty = migration_bitmap_clear_dirty(rs, pss->block, pss->page);
+
 /* Check the pages is dirty and if it is send it */
-if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
+if (page_dirty) {
+/*
+ * Properly yield the lock only in postcopy preempt mode
+ * because both migration thread and rp-return thread can
+ * operate on the bitmaps.
+ */
+if (preempt_active) {
+qemu_mutex_unlock(&rs->bitmap_mutex);
+}
 tmppages = ram_save_target_page(rs, pss);
-if (tmppages < 0) {
-return tmppages;
+if (tmppages >= 0) {
+pages += tmppages;
+/*
+ * Allow rate limiting to happen in the middle of huge pages if
+ * something is sent in the current iteration.
+ */
+if (pagesize_bits > 1 && tmppages > 0) {
+migration_rate_limit();
+}
 }
-
-pages += tmppages;
-/*
- * Allow rate limiting to happen in the middle of huge pages if
- * something is sent in the current iteration.
- */
-if (pagesize_bits > 1 && tmppages > 0) {
-migration_rate_limit();
+if (preempt_active) {
+qemu_mutex_lock(&rs->bitmap_mutex);
 }
+} else {
+tmppages = 0;
+}
+
+if (tmppages < 0) {
+return tmppages;
 }
+
 pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
 } while ((pss->page < hostpage_boundary) &&
  offset_in_ramblock(pss->block,
-- 
2.38.1




[PATCH 26/30] migration: Move last_sent_block into PageSearchStatus

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Since we use PageSearchStatus to represent a channel, it makes perfect
sense to keep last_sent_block (and thereby leverage RAM_SAVE_FLAG_CONTINUE)
per-channel rather than global, because each channel can be sending
different pages on different ramblocks.

Hence move it from RAMState into PageSearchStatus.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 71 -
 1 file changed, 41 insertions(+), 30 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index bdb29ac4d9..dbdde5a6a5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -117,6 +117,8 @@ XBZRLECacheStats xbzrle_counters;
 struct PageSearchStatus {
 /* The migration channel used for a specific host page */
 QEMUFile*pss_channel;
+/* Last block from where we have sent data */
+RAMBlock *last_sent_block;
 /* Current block being searched */
 RAMBlock*block;
 /* Current page to search from */
@@ -396,8 +398,6 @@ struct RAMState {
 int uffdio_fd;
 /* Last block that we have visited searching for dirty pages */
 RAMBlock *last_seen_block;
-/* Last block from where we have sent data */
-RAMBlock *last_sent_block;
 /* Last dirty target page we have sent */
 ram_addr_t last_page;
 /* last ram version we have seen */
@@ -712,16 +712,17 @@ exit:
  *
  * Returns the number of bytes written
  *
- * @f: QEMUFile where to send the data
+ * @pss: current PSS channel status
  * @block: block that contains the page we want to send
  * @offset: offset inside the block for the page
  *  in the lower bits, it contains flags
  */
-static size_t save_page_header(RAMState *rs, QEMUFile *f,  RAMBlock *block,
+static size_t save_page_header(PageSearchStatus *pss, RAMBlock *block,
ram_addr_t offset)
 {
 size_t size, len;
-bool same_block = (block == rs->last_sent_block);
+bool same_block = (block == pss->last_sent_block);
+QEMUFile *f = pss->pss_channel;
 
 if (same_block) {
 offset |= RAM_SAVE_FLAG_CONTINUE;
@@ -734,7 +735,7 @@ static size_t save_page_header(RAMState *rs, QEMUFile *f,  
RAMBlock *block,
 qemu_put_byte(f, len);
 qemu_put_buffer(f, (uint8_t *)block->idstr, len);
 size += 1 + len;
-rs->last_sent_block = block;
+pss->last_sent_block = block;
 }
 return size;
 }
@@ -818,17 +819,19 @@ static void xbzrle_cache_zero_page(RAMState *rs, 
ram_addr_t current_addr)
  *  -1 means that xbzrle would be longer than normal
  *
  * @rs: current RAM state
+ * @pss: current PSS channel
  * @current_data: pointer to the address of the page contents
  * @current_addr: addr of the page
  * @block: block that contains the page we want to send
  * @offset: offset inside the block for the page
  */
-static int save_xbzrle_page(RAMState *rs, QEMUFile *file,
+static int save_xbzrle_page(RAMState *rs, PageSearchStatus *pss,
 uint8_t **current_data, ram_addr_t current_addr,
 RAMBlock *block, ram_addr_t offset)
 {
 int encoded_len = 0, bytes_xbzrle;
 uint8_t *prev_cached_page;
+QEMUFile *file = pss->pss_channel;
 
 if (!cache_is_cached(XBZRLE.cache, current_addr,
  ram_counters.dirty_sync_count)) {
@@ -893,7 +896,7 @@ static int save_xbzrle_page(RAMState *rs, QEMUFile *file,
 }
 
 /* Send XBZRLE based compressed page */
-bytes_xbzrle = save_page_header(rs, file, block,
+bytes_xbzrle = save_page_header(pss, block,
 offset | RAM_SAVE_FLAG_XBZRLE);
 qemu_put_byte(file, ENCODING_FLAG_XBZRLE);
 qemu_put_be16(file, encoded_len);
@@ -1324,19 +1327,19 @@ void ram_release_page(const char *rbname, uint64_t 
offset)
  * Returns the size of data written to the file, 0 means the page is not
  * a zero page
  *
- * @rs: current RAM state
- * @file: the file where the data is saved
+ * @pss: current PSS channel
  * @block: block that contains the page we want to send
  * @offset: offset inside the block for the page
  */
-static int save_zero_page_to_file(RAMState *rs, QEMUFile *file,
+static int save_zero_page_to_file(PageSearchStatus *pss,
   RAMBlock *block, ram_addr_t offset)
 {
 uint8_t *p = block->host + offset;
+QEMUFile *file = pss->pss_channel;
 int len = 0;
 
 if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
-len += save_page_header(rs, file, block, offset | RAM_SAVE_FLAG_ZERO);
+len += save_page_header(pss, block, offset | RAM_SAVE_FLAG_ZERO);
 qemu_put_byte(file, 0);
 len += 1;
 ram_release_page(block->idstr, offset);
@@ -1349,14 +1352,14 @@ static int save_zero_page_to_file(RAMState *rs, 
QEMUFile *file,
  *
  * Returns the number of pages written.
  *
- * @rs: current RAM state
+ * @pss: current PSS channel
  * 

[PATCH 10/30] migration: Fix possible infinite loop of ram save process

2022-11-15 Thread Juan Quintela
From: Peter Xu 

When starting the ram saving procedure (especially at the completion phase),
always set last_seen_block to non-NULL to make sure we can always correctly
detect the case where "we've migrated all the dirty pages".

Then we'll guarantee both last_seen_block and pss.block will be valid
always before the loop starts.

See the comment in the code for some details.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index bb4f08bfed..c0f5d6d287 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2574,14 +2574,22 @@ static int ram_find_and_save_block(RAMState *rs)
 return pages;
 }
 
+/*
+ * Always keep last_seen_block/last_page valid during this procedure,
+ * because find_dirty_block() relies on these values (e.g., we compare
+ * last_seen_block with pss.block to see whether we searched all the
+ * ramblocks) to detect the completion of migration.  Having NULL value
+ * of last_seen_block can conditionally cause below loop to run forever.
+ */
+if (!rs->last_seen_block) {
+rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
+rs->last_page = 0;
+}
+
 pss.block = rs->last_seen_block;
 pss.page = rs->last_page;
 pss.complete_round = false;
 
-if (!pss.block) {
-pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
-}
-
 do {
 again = true;
found = get_queued_page(rs, &pss);
-- 
2.38.1




[PATCH 21/30] migration: Use atomic ops properly for page accountings

2022-11-15 Thread Juan Quintela
From: Peter Xu 

To prepare for thread safety on page accounting, at least the counters
below need to be accessed only atomically; they are:

ram_counters.transferred
ram_counters.duplicate
ram_counters.normal
ram_counters.postcopy_bytes

There are a lot of other counters, but they won't be accessed outside
the migration thread, so they are still safe to be accessed without
atomic ops.
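
For readers not familiar with qemu/stats64.h, the access pattern after
this patch is roughly the following (a sketch, not part of the patch):

    #include "qemu/stats64.h"

    /* writer side -- safe from any thread */
    static void account_transferred(uint64_t len)
    {
        stat64_add(&ram_atomic_counters.transferred, len);
    }

    /* reader side, e.g. when filling in a QMP reply */
    static uint64_t query_transferred(void)
    {
        return stat64_get(&ram_atomic_counters.transferred);
    }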

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.h   | 20 
 migration/migration.c | 10 +-
 migration/multifd.c   |  4 ++--
 migration/ram.c   | 40 
 4 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/migration/ram.h b/migration/ram.h
index 038d52f49f..81cbb0947c 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -32,7 +32,27 @@
 #include "qapi/qapi-types-migration.h"
 #include "exec/cpu-common.h"
 #include "io/channel.h"
+#include "qemu/stats64.h"
 
+/*
+ * These are the migration statistic counters that need to be updated using
+ * atomic ops (can be accessed by more than one thread).  Here since we
+ * cannot modify MigrationStats directly to use Stat64 as it was defined in
+ * the QAPI scheme, we define an internal structure to hold them, and we
+ * propagate the real values when QMP queries happen.
+ *
+ * IOW, the corresponding fields within ram_counters on these specific
+ * fields will be always zero and not being used at all; they're just
+ * placeholders to make it QAPI-compatible.
+ */
+typedef struct {
+Stat64 transferred;
+Stat64 duplicate;
+Stat64 normal;
+Stat64 postcopy_bytes;
+} MigrationAtomicStats;
+
+extern MigrationAtomicStats ram_atomic_counters;
 extern MigrationStats ram_counters;
 extern XBZRLECacheStats xbzrle_counters;
 extern CompressionStats compression_counters;
diff --git a/migration/migration.c b/migration/migration.c
index 9fbed8819a..1f95877fb4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1069,13 +1069,13 @@ static void populate_ram_info(MigrationInfo *info, 
MigrationState *s)
 
 info->has_ram = true;
 info->ram = g_malloc0(sizeof(*info->ram));
-info->ram->transferred = ram_counters.transferred;
+info->ram->transferred = stat64_get(&ram_atomic_counters.transferred);
 info->ram->total = ram_bytes_total();
-info->ram->duplicate = ram_counters.duplicate;
+info->ram->duplicate = stat64_get(&ram_atomic_counters.duplicate);
 /* legacy value.  It is not used anymore */
 info->ram->skipped = 0;
-info->ram->normal = ram_counters.normal;
-info->ram->normal_bytes = ram_counters.normal * page_size;
+info->ram->normal = stat64_get(&ram_atomic_counters.normal);
+info->ram->normal_bytes = info->ram->normal * page_size;
 info->ram->mbps = s->mbps;
 info->ram->dirty_sync_count = ram_counters.dirty_sync_count;
 info->ram->dirty_sync_missed_zero_copy =
@@ -1086,7 +1086,7 @@ static void populate_ram_info(MigrationInfo *info, 
MigrationState *s)
 info->ram->pages_per_second = s->pages_per_second;
 info->ram->precopy_bytes = ram_counters.precopy_bytes;
 info->ram->downtime_bytes = ram_counters.downtime_bytes;
-info->ram->postcopy_bytes = ram_counters.postcopy_bytes;
+info->ram->postcopy_bytes = stat64_get(&ram_atomic_counters.postcopy_bytes);
 
 if (migrate_use_xbzrle()) {
 info->has_xbzrle_cache = true;
diff --git a/migration/multifd.c b/migration/multifd.c
index c40d98ad5c..7d3aec9a52 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -432,7 +432,7 @@ static int multifd_send_pages(QEMUFile *f)
 transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len;
 qemu_file_acct_rate_limit(f, transferred);
 ram_counters.multifd_bytes += transferred;
-ram_counters.transferred += transferred;
+stat64_add(&ram_atomic_counters.transferred, transferred);
qemu_mutex_unlock(&p->mutex);
qemu_sem_post(&p->sem);
 
@@ -624,7 +624,7 @@ int multifd_send_sync_main(QEMUFile *f)
 p->pending_job++;
 qemu_file_acct_rate_limit(f, p->packet_len);
 ram_counters.multifd_bytes += p->packet_len;
-ram_counters.transferred += p->packet_len;
+stat64_add(&ram_atomic_counters.transferred, p->packet_len);
qemu_mutex_unlock(&p->mutex);
qemu_sem_post(&p->sem);
 
diff --git a/migration/ram.c b/migration/ram.c
index 6428138194..25fd3cf7dc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -453,18 +453,25 @@ uint64_t ram_bytes_remaining(void)
0;
 }
 
+/*
+ * NOTE: not all stats in ram_counters are used in reality.  See comments
+ * for struct MigrationAtomicStats.  The ultimate result of ram migration
+ * counters will be a merged version with both ram_counters and the atomic
+ * fields in ram_atomic_counters.
+ */
 MigrationStats ram_counters;
+MigrationAtomicStats ram_atomic_counters;
 
 void 

[PATCH 02/30] migration/multifd/zero-copy: Create helper function for flushing

2022-11-15 Thread Juan Quintela
From: Leonardo Bras 

Move flushing code from multifd_send_sync_main() to a new helper, and call
it in multifd_send_sync_main().

Signed-off-by: Leonardo Bras 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/multifd.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 586ddc9d65..509bbbe3bf 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -566,6 +566,23 @@ void multifd_save_cleanup(void)
 multifd_send_state = NULL;
 }
 
+static int multifd_zero_copy_flush(QIOChannel *c)
+{
+int ret;
+Error *err = NULL;
+
+ret = qio_channel_flush(c, &err);
+if (ret < 0) {
+error_report_err(err);
+return -1;
+}
+if (ret == 1) {
+dirty_sync_missed_zero_copy();
+}
+
+return ret;
+}
+
 int multifd_send_sync_main(QEMUFile *f)
 {
 int i;
@@ -616,17 +633,8 @@ int multifd_send_sync_main(QEMUFile *f)
qemu_mutex_unlock(&p->mutex);
qemu_sem_post(&p->sem);
 
-if (flush_zero_copy && p->c) {
-int ret;
-Error *err = NULL;
-
-ret = qio_channel_flush(p->c, &err);
-if (ret < 0) {
-error_report_err(err);
-return -1;
-} else if (ret == 1) {
-dirty_sync_missed_zero_copy();
-}
+if (flush_zero_copy && p->c && (multifd_zero_copy_flush(p->c) < 0)) {
+return -1;
 }
 }
 for (i = 0; i < migrate_multifd_channels(); i++) {
-- 
2.38.1




[PATCH 25/30] migration: Make PageSearchStatus part of RAMState

2022-11-15 Thread Juan Quintela
From: Peter Xu 

We used to allocate the PSS structure on the stack for precopy when
sending pages.  Make it static, so as to describe per-channel ram
migration status.

Here we declare RAM_CHANNEL_MAX instances, preparing for postcopy to use
them, even though this patch does not yet start using the 2nd instance.

This should not have any functional change per se, but it already starts
to export PSS information via the RAMState, so that e.g. one PSS channel
can start to reference the other PSS channel.

Always protect PSS access using the same RAMState.bitmap_mutex.  We
already do so, so no code change is needed, just some comment updates.
Maybe we should consider renaming bitmap_mutex some day, as it's going to
be a more commonly used and bigger mutex for the ram states, but let's
leave that for later.
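
In other words, after this patch a sender reaches its per-channel state
through RAMState while holding the bitmap mutex, roughly like this
(a sketch, not the literal code):

    static void pss_example(RAMState *rs)
    {
        qemu_mutex_lock(&rs->bitmap_mutex);
        PageSearchStatus *pss = &rs->pss[RAM_CHANNEL_PRECOPY];
        /* ... search/update pss->block, pss->page ... */
        qemu_mutex_unlock(&rs->bitmap_mutex);
    }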

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 112 ++--
 1 file changed, 61 insertions(+), 51 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index a2e86623d3..bdb29ac4d9 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -113,6 +113,46 @@ static void __attribute__((constructor)) 
init_cpu_flag(void)
 
 XBZRLECacheStats xbzrle_counters;
 
+/* used by the search for pages to send */
+struct PageSearchStatus {
+/* The migration channel used for a specific host page */
+QEMUFile*pss_channel;
+/* Current block being searched */
+RAMBlock*block;
+/* Current page to search from */
+unsigned long page;
+/* Set once we wrap around */
+bool complete_round;
+/*
+ * [POSTCOPY-ONLY] Whether current page is explicitly requested by
+ * postcopy.  When set, the request is "urgent" because the dest QEMU
+ * threads are waiting for us.
+ */
+bool postcopy_requested;
+/*
+ * [POSTCOPY-ONLY] The target channel to use to send current page.
+ *
+ * Note: This may _not_ match with the value in postcopy_requested
+ * above. Let's imagine the case where the postcopy request is exactly
+ * the page that we're sending in progress during precopy. In this case
+ * we'll have postcopy_requested set to true but the target channel
+ * will be the precopy channel (so that we don't split brain on that
+ * specific page since the precopy channel already contains partial of
+ * that page data).
+ *
+ * Besides that specific use case, postcopy_target_channel should
+ * always be equal to postcopy_requested, because by default we send
+ * postcopy pages via postcopy preempt channel.
+ */
+bool postcopy_target_channel;
+/* Whether we're sending a host page */
+bool  host_page_sending;
+/* The start/end of current host page.  Invalid if 
host_page_sending==false */
+unsigned long host_page_start;
+unsigned long host_page_end;
+};
+typedef struct PageSearchStatus PageSearchStatus;
+
 /* struct contains XBZRLE cache and a static page
used by the compression */
 static struct {
@@ -347,6 +387,11 @@ typedef struct {
 struct RAMState {
 /* QEMUFile used for this migration */
 QEMUFile *f;
+/*
+ * PageSearchStatus structures for the channels when send pages.
+ * Protected by the bitmap_mutex.
+ */
+PageSearchStatus pss[RAM_CHANNEL_MAX];
 /* UFFD file descriptor, used in 'write-tracking' migration */
 int uffdio_fd;
 /* Last block that we have visited searching for dirty pages */
@@ -390,7 +435,12 @@ struct RAMState {
 uint64_t target_page_count;
 /* number of dirty bits in the bitmap */
 uint64_t migration_dirty_pages;
-/* Protects modification of the bitmap and migration dirty pages */
+/*
+ * Protects:
+ * - dirty/clear bitmap
+ * - migration_dirty_pages
+ * - pss structures
+ */
 QemuMutex bitmap_mutex;
 /* The RAMBlock used in the last src_page_requests */
 RAMBlock *last_req_rb;
@@ -479,46 +529,6 @@ void dirty_sync_missed_zero_copy(void)
 ram_counters.dirty_sync_missed_zero_copy++;
 }
 
-/* used by the search for pages to send */
-struct PageSearchStatus {
-/* The migration channel used for a specific host page */
-QEMUFile*pss_channel;
-/* Current block being searched */
-RAMBlock*block;
-/* Current page to search from */
-unsigned long page;
-/* Set once we wrap around */
-bool complete_round;
-/*
- * [POSTCOPY-ONLY] Whether current page is explicitly requested by
- * postcopy.  When set, the request is "urgent" because the dest QEMU
- * threads are waiting for us.
- */
-bool postcopy_requested;
-/*
- * [POSTCOPY-ONLY] The target channel to use to send current page.
- *
- * Note: This may _not_ match with the value in postcopy_requested
- * above. Let's imagine the case where the postcopy request is exactly
- * the page that 

[PATCH 16/30] migration: Add postcopy_preempt_active()

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Add a helper to check that postcopy preempt is enabled and currently active.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 96fa521813..52c851eb56 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -190,6 +190,11 @@ out:
 return ret;
 }
 
+static bool postcopy_preempt_active(void)
+{
+return migrate_postcopy_preempt() && migration_in_postcopy();
+}
+
 bool ramblock_is_ignored(RAMBlock *block)
 {
 return !qemu_ram_is_migratable(block) ||
@@ -2461,7 +2466,7 @@ static void postcopy_preempt_choose_channel(RAMState *rs, 
PageSearchStatus *pss)
 /* We need to make sure rs->f always points to the default channel elsewhere */
 static void postcopy_preempt_reset_channel(RAMState *rs)
 {
-if (migrate_postcopy_preempt() && migration_in_postcopy()) {
+if (postcopy_preempt_active()) {
 rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
 rs->f = migrate_get_current()->to_dst_file;
 trace_postcopy_preempt_reset_channel();
@@ -2499,7 +2504,7 @@ static int ram_save_host_page(RAMState *rs, 
PageSearchStatus *pss)
 return 0;
 }
 
-if (migrate_postcopy_preempt() && migration_in_postcopy()) {
+if (postcopy_preempt_active()) {
 postcopy_preempt_choose_channel(rs, pss);
 }
 
-- 
2.38.1




[PATCH 06/30] migration: Export ram_transferred_ram()

2022-11-15 Thread Juan Quintela
Signed-off-by: Juan Quintela 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: David Edmondson 
Reviewed-by: Leonardo Bras 
---
 migration/ram.h | 2 ++
 migration/ram.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/migration/ram.h b/migration/ram.h
index c7af65ac74..e844966f69 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -65,6 +65,8 @@ int ram_load_postcopy(QEMUFile *f, int channel);
 
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 
+void ram_transferred_add(uint64_t bytes);
+
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
diff --git a/migration/ram.c b/migration/ram.c
index dc1de9ddbc..00a06b2c16 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -422,7 +422,7 @@ uint64_t ram_bytes_remaining(void)
 
 MigrationStats ram_counters;
 
-static void ram_transferred_add(uint64_t bytes)
+void ram_transferred_add(uint64_t bytes)
 {
 if (runstate_is_running()) {
 ram_counters.precopy_bytes += bytes;
-- 
2.38.1




[PATCH 12/30] migration: Disallow postcopy preempt to be used with compress

2022-11-15 Thread Juan Quintela
From: Peter Xu 

The preempt mode requires the capability to assign a channel for each of
the pages, while the compression logic will currently assign pages to
different compress threads/local channels, so potentially they're
incompatible.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 406a9e2f72..0bc3fce4b7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1357,6 +1357,17 @@ static bool migrate_caps_check(bool *cap_list,
 error_setg(errp, "Postcopy preempt requires postcopy-ram");
 return false;
 }
+
+/*
+ * Preempt mode requires urgent pages to be sent in separate
+ * channel, OTOH compression logic will disorder all pages into
+ * different compression channels, which is not compatible with the
+ * preempt assumptions on channel assignments.
+ */
+if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+error_setg(errp, "Postcopy preempt not compatible with compress");
+return false;
+}
 }
 
 return true;
-- 
2.38.1




[PATCH 13/30] migration: Use non-atomic ops for clear log bitmap

2022-11-15 Thread Juan Quintela
From: Peter Xu 

Since we already have bitmap_mutex to protect either the dirty bitmap or
the clear log bitmap, we don't need atomic operations to set/clear/test
on the clear log bitmap.  Switch all ops from atomic to non-atomic
versions, and meanwhile touch up the comments to show which lock is in
charge.

Introduce a non-atomic version of bitmap_test_and_clear_atomic().  It is
mostly the same as the atomic version, but simplified in a few places,
e.g. it drops the "old_bits" variable and the explicit memory barriers.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 include/exec/ram_addr.h | 11 +-
 include/exec/ramblock.h |  3 +++
 include/qemu/bitmap.h   |  1 +
 util/bitmap.c   | 45 +
 4 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 1500680458..f4fb6a2111 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -42,7 +42,8 @@ static inline long clear_bmap_size(uint64_t pages, uint8_t 
shift)
 }
 
 /**
- * clear_bmap_set: set clear bitmap for the page range
+ * clear_bmap_set: set clear bitmap for the page range.  Must be with
+ * bitmap_mutex held.
  *
  * @rb: the ramblock to operate on
  * @start: the start page number
@@ -55,12 +56,12 @@ static inline void clear_bmap_set(RAMBlock *rb, uint64_t 
start,
 {
 uint8_t shift = rb->clear_bmap_shift;
 
-bitmap_set_atomic(rb->clear_bmap, start >> shift,
-  clear_bmap_size(npages, shift));
+bitmap_set(rb->clear_bmap, start >> shift, clear_bmap_size(npages, shift));
 }
 
 /**
- * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set
+ * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set.
+ * Must be with bitmap_mutex held.
  *
  * @rb: the ramblock to operate on
  * @page: the page number to check
@@ -71,7 +72,7 @@ static inline bool clear_bmap_test_and_clear(RAMBlock *rb, 
uint64_t page)
 {
 uint8_t shift = rb->clear_bmap_shift;
 
-return bitmap_test_and_clear_atomic(rb->clear_bmap, page >> shift, 1);
+return bitmap_test_and_clear(rb->clear_bmap, page >> shift, 1);
 }
 
 static inline bool offset_in_ramblock(RAMBlock *b, ram_addr_t offset)
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 6cbedf9e0c..adc03df59c 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -53,6 +53,9 @@ struct RAMBlock {
  * and split clearing of dirty bitmap on the remote node (e.g.,
  * KVM).  The bitmap will be set only when doing global sync.
  *
+ * It is only used during src side of ram migration, and it is
+ * protected by the global ram_state.bitmap_mutex.
+ *
  * NOTE: this bitmap is different comparing to the other bitmaps
  * in that one bit can represent multiple guest pages (which is
  * decided by the `clear_bmap_shift' variable below).  On
diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
index 82a1d2f41f..3ccb00865f 100644
--- a/include/qemu/bitmap.h
+++ b/include/qemu/bitmap.h
@@ -253,6 +253,7 @@ void bitmap_set(unsigned long *map, long i, long len);
 void bitmap_set_atomic(unsigned long *map, long i, long len);
 void bitmap_clear(unsigned long *map, long start, long nr);
 bool bitmap_test_and_clear_atomic(unsigned long *map, long start, long nr);
+bool bitmap_test_and_clear(unsigned long *map, long start, long nr);
 void bitmap_copy_and_clear_atomic(unsigned long *dst, unsigned long *src,
   long nr);
 unsigned long bitmap_find_next_zero_area(unsigned long *map,
diff --git a/util/bitmap.c b/util/bitmap.c
index f81d8057a7..8d12e90a5a 100644
--- a/util/bitmap.c
+++ b/util/bitmap.c
@@ -240,6 +240,51 @@ void bitmap_clear(unsigned long *map, long start, long nr)
 }
 }
 
+bool bitmap_test_and_clear(unsigned long *map, long start, long nr)
+{
+unsigned long *p = map + BIT_WORD(start);
+const long size = start + nr;
+int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+bool dirty = false;
+
+assert(start >= 0 && nr >= 0);
+
+/* First word */
+if (nr - bits_to_clear > 0) {
+if ((*p) & mask_to_clear) {
+dirty = true;
+}
+*p &= ~mask_to_clear;
+nr -= bits_to_clear;
+bits_to_clear = BITS_PER_LONG;
+p++;
+}
+
+/* Full words */
+if (bits_to_clear == BITS_PER_LONG) {
+while (nr >= BITS_PER_LONG) {
+if (*p) {
+dirty = true;
+*p = 0;
+}
+nr -= BITS_PER_LONG;
+p++;
+}
+}
+
+/* Last word */
+if (nr) {
+mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+if ((*p) & mask_to_clear) {
+dirty = true;
+}
+*p &= ~mask_to_clear;
+}
+
+return dirty;
+}
+
 bool 

[PATCH 07/30] migration: Export ram_release_page()

2022-11-15 Thread Juan Quintela
Signed-off-by: Juan Quintela 
Reviewed-by: Leonardo Bras 
---
 migration/ram.h | 1 +
 migration/ram.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/migration/ram.h b/migration/ram.h
index e844966f69..038d52f49f 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -66,6 +66,7 @@ int ram_load_postcopy(QEMUFile *f, int channel);
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 
 void ram_transferred_add(uint64_t bytes);
+void ram_release_page(const char *rbname, uint64_t offset);
 
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 bool ramblock_recv_bitmap_test_byte_offset(RAMBlock *rb, uint64_t byte_offset);
diff --git a/migration/ram.c b/migration/ram.c
index 00a06b2c16..67e41dd2c0 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1234,7 +1234,7 @@ static void migration_bitmap_sync_precopy(RAMState *rs)
 }
 }
 
-static void ram_release_page(const char *rbname, uint64_t offset)
+void ram_release_page(const char *rbname, uint64_t offset)
 {
 if (!migrate_release_ram() || !migration_in_postcopy()) {
 return;
-- 
2.38.1




[PATCH 17/30] migration: Cleanup xbzrle zero page cache update logic

2022-11-15 Thread Juan Quintela
From: Peter Xu 

The major change is to replace "!save_page_use_compression()" with
"xbzrle_enabled" to make it clearer.

Reasoning:

(1) When compression is enabled, "!save_page_use_compression()" is
exactly the same as checking "xbzrle_enabled".

(2) When compression is disabled, "!save_page_use_compression()" always
returns true.  We used to try calling the xbzrle code, but after this
change we won't, and we shouldn't need to.

While at it, drop the xbzrle_enabled check in xbzrle_cache_zero_page(),
because with this change it's not needed anymore.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 52c851eb56..9ded381e0a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -769,10 +769,6 @@ void mig_throttle_counter_reset(void)
  */
 static void xbzrle_cache_zero_page(RAMState *rs, ram_addr_t current_addr)
 {
-if (!rs->xbzrle_enabled) {
-return;
-}
-
 /* We don't care if this fails to allocate a new cache page
  * as long as it updated an old one */
 cache_insert(XBZRLE.cache, current_addr, XBZRLE.zero_target_page,
@@ -2329,7 +2325,7 @@ static int ram_save_target_page(RAMState *rs, 
PageSearchStatus *pss)
 /* Must let xbzrle know, otherwise a previous (now 0'd) cached
  * page would be stale
  */
-if (!save_page_use_compression(rs)) {
+if (rs->xbzrle_enabled) {
 XBZRLE_cache_lock();
 xbzrle_cache_zero_page(rs, block->offset + offset);
 XBZRLE_cache_unlock();
-- 
2.38.1




[PATCH 08/30] Update AVX512 support for xbzrle_encode_buffer

2022-11-15 Thread Juan Quintela
From: ling xu 

This commit updates the AVX512 support code for the xbzrle_encode_buffer
function to accelerate XBZRLE encoding. A runtime check for AVX512
support and a benchmark for this feature are added. Compared with the C
version of xbzrle_encode_buffer, the AVX512 version achieves a 50%-70%
performance improvement in benchmarks. In addition, if dirty data is
randomly located in a 4K page, the AVX512 version can achieve an almost
140% performance gain.

Signed-off-by: ling xu 
Co-authored-by: Zhou Zhao 
Co-authored-by: Jun Jin 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 meson.build   |  16 +
 migration/xbzrle.h|   4 ++
 migration/ram.c   |  34 +-
 migration/xbzrle.c| 124 ++
 meson_options.txt |   2 +
 scripts/meson-buildoptions.sh |  14 ++--
 6 files changed, 186 insertions(+), 8 deletions(-)

diff --git a/meson.build b/meson.build
index cf3e517e56..d0d28f5c9e 100644
--- a/meson.build
+++ b/meson.build
@@ -2344,6 +2344,22 @@ config_host_data.set('CONFIG_AVX512F_OPT', 
get_option('avx512f') \
 int main(int argc, char *argv[]) { return bar(argv[argc - 1]); }
   '''), error_message: 'AVX512F not available').allowed())
 
+config_host_data.set('CONFIG_AVX512BW_OPT', get_option('avx512bw') \
+  .require(have_cpuid_h, error_message: 'cpuid.h not available, cannot enable 
AVX512BW') \
+  .require(cc.links('''
+#pragma GCC push_options
+#pragma GCC target("avx512bw")
+#include <cpuid.h>
+#include <immintrin.h>
+static int bar(void *a) {
+
+  __m512i *x = a;
+  __m512i res= _mm512_abs_epi8(*x);
+  return res[1];
+}
+int main(int argc, char *argv[]) { return bar(argv[0]); }
+  '''), error_message: 'AVX512BW not available').allowed())
+
 have_pvrdma = get_option('pvrdma') \
   .require(rdma.found(), error_message: 'PVRDMA requires OpenFabrics 
libraries') \
   .require(cc.compiles(gnu_source_prefix + '''
diff --git a/migration/xbzrle.h b/migration/xbzrle.h
index a0db507b9c..6feb49160a 100644
--- a/migration/xbzrle.h
+++ b/migration/xbzrle.h
@@ -18,4 +18,8 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, 
int slen,
  uint8_t *dst, int dlen);
 
 int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
+#if defined(CONFIG_AVX512BW_OPT)
+int xbzrle_encode_buffer_avx512(uint8_t *old_buf, uint8_t *new_buf, int slen,
+uint8_t *dst, int dlen);
+#endif
 #endif
diff --git a/migration/ram.c b/migration/ram.c
index 67e41dd2c0..bb4f08bfed 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -83,6 +83,34 @@
 /* 0x80 is reserved in migration.h start with 0x100 next */
 #define RAM_SAVE_FLAG_COMPRESS_PAGE0x100
 
+int (*xbzrle_encode_buffer_func)(uint8_t *, uint8_t *, int,
+ uint8_t *, int) = xbzrle_encode_buffer;
+#if defined(CONFIG_AVX512BW_OPT)
+#include "qemu/cpuid.h"
+static void __attribute__((constructor)) init_cpu_flag(void)
+{
+unsigned max = __get_cpuid_max(0, NULL);
+int a, b, c, d;
+if (max >= 1) {
+__cpuid(1, a, b, c, d);
+ /* We must check that AVX is not just available, but usable.  */
+if ((c & bit_OSXSAVE) && (c & bit_AVX) && max >= 7) {
+int bv;
+__asm("xgetbv" : "=a"(bv), "=d"(d) : "c"(0));
+__cpuid_count(7, 0, a, b, c, d);
+   /* 0xe6:
+*  XCR0[7:5] = 111b (OPMASK state, upper 256-bit of ZMM0-ZMM15
+*and ZMM16-ZMM31 state are enabled by OS)
+*  XCR0[2:1] = 11b (XMM state and YMM state are enabled by OS)
+*/
+if ((bv & 0xe6) == 0xe6 && (b & bit_AVX512BW)) {
+xbzrle_encode_buffer_func = xbzrle_encode_buffer_avx512;
+}
+}
+}
+}
+#endif
+
 XBZRLECacheStats xbzrle_counters;
 
 /* struct contains XBZRLE cache and a static page
@@ -802,9 +830,9 @@ static int save_xbzrle_page(RAMState *rs, uint8_t 
**current_data,
 memcpy(XBZRLE.current_buf, *current_data, TARGET_PAGE_SIZE);
 
 /* XBZRLE encoding (if there is no overflow) */
-encoded_len = xbzrle_encode_buffer(prev_cached_page, XBZRLE.current_buf,
-   TARGET_PAGE_SIZE, XBZRLE.encoded_buf,
-   TARGET_PAGE_SIZE);
+encoded_len = xbzrle_encode_buffer_func(prev_cached_page, 
XBZRLE.current_buf,
+TARGET_PAGE_SIZE, 
XBZRLE.encoded_buf,
+TARGET_PAGE_SIZE);
 
 /*
  * Update the cache contents, so that it corresponds to the data
diff --git a/migration/xbzrle.c b/migration/xbzrle.c
index 1ba482ded9..05366e86c0 100644
--- a/migration/xbzrle.c
+++ b/migration/xbzrle.c
@@ -174,3 +174,127 @@ int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t 
*dst, int dlen)
 
 return d;
 }
+
+#if defined(CONFIG_AVX512BW_OPT)
+#pragma GCC push_options
+#pragma GCC 

[PATCH 04/30] multifd: Create page_size fields into both MultiFD{Recv, Send}Params

2022-11-15 Thread Juan Quintela
We were calling qemu_target_page_size() left and right.

Signed-off-by: Juan Quintela 
Reviewed-by: Leonardo Bras 
---
 migration/multifd.h  |  4 
 migration/multifd-zlib.c | 14 ++
 migration/multifd-zstd.c | 12 +---
 migration/multifd.c  | 18 --
 4 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 913e4ba274..941563c232 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -80,6 +80,8 @@ typedef struct {
 bool registered_yank;
 /* packet allocated len */
 uint32_t packet_len;
+/* guest page size */
+uint32_t page_size;
 /* multifd flags for sending ram */
 int write_flags;
 
@@ -143,6 +145,8 @@ typedef struct {
 QIOChannel *c;
 /* packet allocated len */
 uint32_t packet_len;
+/* guest page size */
+uint32_t page_size;
 
 /* syncs main thread and channels */
 QemuSemaphore sem_sync;
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 18213a9513..37770248e1 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -116,7 +116,6 @@ static void zlib_send_cleanup(MultiFDSendParams *p, Error 
**errp)
 static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
 {
 struct zlib_data *z = p->data;
-size_t page_size = qemu_target_page_size();
z_stream *zs = &z->zs;
 uint32_t out_size = 0;
 int ret;
@@ -135,8 +134,8 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error 
**errp)
  * with compression. zlib does not guarantee that this is safe,
  * therefore copy the page before calling deflate().
  */
-memcpy(z->buf, p->pages->block->host + p->normal[i], page_size);
-zs->avail_in = page_size;
+memcpy(z->buf, p->pages->block->host + p->normal[i], p->page_size);
+zs->avail_in = p->page_size;
 zs->next_in = z->buf;
 
 zs->avail_out = available;
@@ -242,12 +241,11 @@ static void zlib_recv_cleanup(MultiFDRecvParams *p)
 static int zlib_recv_pages(MultiFDRecvParams *p, Error **errp)
 {
 struct zlib_data *z = p->data;
-size_t page_size = qemu_target_page_size();
z_stream *zs = &z->zs;
 uint32_t in_size = p->next_packet_size;
 /* we measure the change of total_out */
 uint32_t out_size = zs->total_out;
-uint32_t expected_size = p->normal_num * page_size;
+uint32_t expected_size = p->normal_num * p->page_size;
 uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
 int ret;
 int i;
@@ -274,7 +272,7 @@ static int zlib_recv_pages(MultiFDRecvParams *p, Error 
**errp)
 flush = Z_SYNC_FLUSH;
 }
 
-zs->avail_out = page_size;
+zs->avail_out = p->page_size;
 zs->next_out = p->host + p->normal[i];
 
 /*
@@ -288,8 +286,8 @@ static int zlib_recv_pages(MultiFDRecvParams *p, Error 
**errp)
 do {
 ret = inflate(zs, flush);
 } while (ret == Z_OK && zs->avail_in
- && (zs->total_out - start) < page_size);
-if (ret == Z_OK && (zs->total_out - start) < page_size) {
+ && (zs->total_out - start) < p->page_size);
+if (ret == Z_OK && (zs->total_out - start) < p->page_size) {
 error_setg(errp, "multifd %u: inflate generated too few output",
p->id);
 return -1;
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index d788d309f2..f4a8e1ed1f 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -113,7 +113,6 @@ static void zstd_send_cleanup(MultiFDSendParams *p, Error 
**errp)
 static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
 {
 struct zstd_data *z = p->data;
-size_t page_size = qemu_target_page_size();
 int ret;
 uint32_t i;
 
@@ -128,7 +127,7 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error 
**errp)
 flush = ZSTD_e_flush;
 }
 z->in.src = p->pages->block->host + p->normal[i];
-z->in.size = page_size;
+z->in.size = p->page_size;
 z->in.pos = 0;
 
 /*
@@ -241,8 +240,7 @@ static int zstd_recv_pages(MultiFDRecvParams *p, Error 
**errp)
 {
 uint32_t in_size = p->next_packet_size;
 uint32_t out_size = 0;
-size_t page_size = qemu_target_page_size();
-uint32_t expected_size = p->normal_num * page_size;
+uint32_t expected_size = p->normal_num * p->page_size;
 uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
 struct zstd_data *z = p->data;
 int ret;
@@ -265,7 +263,7 @@ static int zstd_recv_pages(MultiFDRecvParams *p, Error 
**errp)
 
 for (i = 0; i < p->normal_num; i++) {
 z->out.dst = p->host + p->normal[i];
-z->out.size = page_size;
+z->out.size = p->page_size;
 z->out.pos = 0;
 
 /*
@@ -279,8 +277,8 @@ static int zstd_recv_pages(MultiFDRecvParams *p, Error 
**errp)
 do {
 ret 

[PATCH 03/30] migration: check magic value for deciding the mapping of channels

2022-11-15 Thread Juan Quintela
From: "manish.mishra" 

Current logic assumes that channel connections on the destination side
are always established in the same order as on the source, and that the
first one will always be the main channel, followed by the multifd or
post-copy preemption channel. This may not always be true, as even if a
channel has a connection established on the source side it can be in the
pending state on the destination side, and a newer connection can be
established first. Basically, this causes out-of-order mapping of
channels on the destination side. Currently, all channels except
post-copy preempt send a magic number; this patch uses that magic number
to decide the type of channel. This logic is applicable only for precopy
(multifd) live migration since, as mentioned, the post-copy preempt
channel does not send any magic number. Also, tls live migrations
already do a tls handshake before creating other channels, so this issue
is not possible with tls, hence this logic is avoided for tls live
migrations. This patch uses MSG_PEEK to check the magic number of
channels so that the current data/control stream management remains
unaffected.
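
At the socket level the peek is just the standard MSG_PEEK recv(2) flag;
for illustration (this is not the patch code, which goes through the
QIOChannel API):

    #include <stdint.h>
    #include <sys/socket.h>

    /* read the 4-byte magic without consuming it from the stream */
    static ssize_t peek_magic(int fd, uint32_t *magic)
    {
        return recv(fd, magic, sizeof(*magic), MSG_PEEK);
    }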

v2: TLS does not support MSG_PEEK, so V1 was broken for tls live
  migrations. For tls live migrations, the tls handshake is done while
  initializing the main channel, before other channels can be created,
  so this issue is not possible for tls. In V2, a check was added to
  avoid checking the magic number for tls live migrations and to fall
  back to the older method of deciding the mapping of channels on the
  destination side.

Suggested-by: Daniel P. Berrangé 
Signed-off-by: manish.mishra 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 include/io/channel.h | 25 +++
 migration/multifd.h  |  2 +-
 migration/postcopy-ram.h |  2 +-
 io/channel-socket.c  | 27 
 io/channel.c | 39 +++
 migration/migration.c| 44 +---
 migration/multifd.c  | 12 ---
 migration/postcopy-ram.c |  5 +
 8 files changed, 130 insertions(+), 26 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index c680ee7480..74177aeeea 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -115,6 +115,10 @@ struct QIOChannelClass {
 int **fds,
 size_t *nfds,
 Error **errp);
+ssize_t (*io_read_peek)(QIOChannel *ioc,
+void *buf,
+size_t nbytes,
+Error **errp);
 int (*io_close)(QIOChannel *ioc,
 Error **errp);
 GSource * (*io_create_watch)(QIOChannel *ioc,
@@ -475,6 +479,27 @@ int qio_channel_write_all(QIOChannel *ioc,
   size_t buflen,
   Error **errp);
 
+/**
+ * qio_channel_read_peek_all:
+ * @ioc: the channel object
+ * @buf: the memory region to read in data
+ * @nbytes: the number of bytes to read
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Read given @nbytes data from peek of channel into
+ * memory region @buf.
+ *
+ * The function will be blocked until read size is
+ * equal to requested size.
+ *
+ * Returns: 1 if all bytes were read, 0 if end-of-file
+ *  occurs without data, or -1 on error
+ */
+int qio_channel_read_peek_all(QIOChannel *ioc,
+  void* buf,
+  size_t nbytes,
+  Error **errp);
+
 /**
  * qio_channel_set_blocking:
  * @ioc: the channel object
diff --git a/migration/multifd.h b/migration/multifd.h
index 519f498643..913e4ba274 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -18,7 +18,7 @@ void multifd_save_cleanup(void);
 int multifd_load_setup(Error **errp);
 int multifd_load_cleanup(Error **errp);
 bool multifd_recv_all_channels_created(void);
-bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
+void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
 int multifd_send_sync_main(QEMUFile *f);
 int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index 6147bf7d1d..25881c4127 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -190,7 +190,7 @@ enum PostcopyChannels {
 RAM_CHANNEL_MAX,
 };
 
-bool postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
+void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
 int postcopy_preempt_setup(MigrationState *s, Error **errp);
 int postcopy_preempt_wait_channel(MigrationState *s);
 
diff --git a/io/channel-socket.c b/io/channel-socket.c
index b76dca9cc1..b99f5dfda6 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -705,6 +705,32 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 }
 #endif /* WIN32 */
 
+static ssize_t 

[PATCH 00/30] Migration PULL request

2022-11-15 Thread Juan Quintela
Hi

It includes:
- Leonardo fix for zero_copy flush
- Fiona fix for return value of readv/writev
- Peter Xu cleanups
- Peter Xu preempt patches
- Patches ready from zero page (me)
- AVX512 support (ling)
- fix for slow networking and reordering of first packets (manish)

Please, apply.

Fiona Ebner (1):
  migration/channel-block: fix return value for
qio_channel_block_{readv,writev}

Juan Quintela (5):
  multifd: Create page_size fields into both MultiFD{Recv,Send}Params
  multifd: Create page_count fields into both MultiFD{Recv,Send}Params
  migration: Export ram_transferred_ram()
  migration: Export ram_release_page()
  migration: Block migration comment or code is wrong

Leonardo Bras (1):
  migration/multifd/zero-copy: Create helper function for flushing

Peter Xu (20):
  migration: Fix possible infinite loop of ram save process
  migration: Fix race on qemu_file_shutdown()
  migration: Disallow postcopy preempt to be used with compress
  migration: Use non-atomic ops for clear log bitmap
  migration: Disable multifd explicitly with compression
  migration: Take bitmap mutex when completing ram migration
  migration: Add postcopy_preempt_active()
  migration: Cleanup xbzrle zero page cache update logic
  migration: Trivial cleanup save_page_header() on same block check
  migration: Remove RAMState.f references in compression code
  migration: Yield bitmap_mutex properly when sending/sleeping
  migration: Use atomic ops properly for page accountings
  migration: Teach PSS about host page
  migration: Introduce pss_channel
  migration: Add pss_init()
  migration: Make PageSearchStatus part of RAMState
  migration: Move last_sent_block into PageSearchStatus
  migration: Send requested page directly in rp-return thread
  migration: Remove old preempt code around state maintainance
  migration: Drop rs->f

ling xu (2):
  Update AVX512 support for xbzrle_encode_buffer
  Unit test code and benchmark code

manish.mishra (1):
  migration: check magic value for deciding the mapping of channels

 meson.build   |  16 +
 include/exec/ram_addr.h   |  11 +-
 include/exec/ramblock.h   |   3 +
 include/io/channel.h  |  25 ++
 include/qemu/bitmap.h |   1 +
 migration/migration.h |   7 -
 migration/multifd.h   |  10 +-
 migration/postcopy-ram.h  |   2 +-
 migration/ram.h   |  23 +
 migration/xbzrle.h|   4 +
 io/channel-socket.c   |  27 ++
 io/channel.c  |  39 ++
 migration/block.c |   4 +-
 migration/channel-block.c |   6 +-
 migration/migration.c | 109 +++--
 migration/multifd-zlib.c  |  14 +-
 migration/multifd-zstd.c  |  12 +-
 migration/multifd.c   |  69 +--
 migration/postcopy-ram.c  |   5 +-
 migration/qemu-file.c |  27 +-
 migration/ram.c   | 794 +-
 migration/xbzrle.c| 124 ++
 tests/bench/xbzrle-bench.c| 465 
 tests/unit/test-xbzrle.c  |  39 +-
 util/bitmap.c |  45 ++
 meson_options.txt |   2 +
 scripts/meson-buildoptions.sh |  14 +-
 tests/bench/meson.build   |   4 +
 28 files changed, 1379 insertions(+), 522 deletions(-)
 create mode 100644 tests/bench/xbzrle-bench.c

-- 
2.38.1




Re: [PATCH v2] block/rbd: Add support for layered encryption

2022-11-15 Thread Ilya Dryomov
On Sun, Nov 13, 2022 at 11:17 AM Or Ozeri  wrote:
>
> Starting from ceph Reef, RBD has built-in support for layered encryption,
> where each ancestor image (in a cloned image setting) can be possibly
> encrypted using a unique passphrase.
>
> A new function, rbd_encryption_load2, was added to librbd API.
> This new function supports an array of passphrases (via "spec" structs).
>
> This commit extends the qemu rbd driver API to use this new librbd API,
> in order to support this new layered encryption feature.
>
> Signed-off-by: Or Ozeri 
> ---
> v2: nit fixes suggested by @idryomov
> ---
>  block/rbd.c  | 122 ++-
>  qapi/block-core.json |  33 ++--
>  2 files changed, 151 insertions(+), 4 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index f826410f40..bde0326bfd 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -71,6 +71,16 @@ static const char rbd_luks2_header_verification[
>  'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 2
>  };
>
> +static const char rbd_layered_luks_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 1
> +};
> +
> +static const char rbd_layered_luks2_header_verification[
> +RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
> +'R', 'B', 'D', 'L', 0xBA, 0xBE, 0, 2
> +};
> +
>  typedef enum {
>  RBD_AIO_READ,
>  RBD_AIO_WRITE,
> @@ -470,6 +480,9 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  size_t passphrase_len;
>  rbd_encryption_luks1_format_options_t luks_opts;
>  rbd_encryption_luks2_format_options_t luks2_opts;
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +rbd_encryption_luks_format_options_t luks_any_opts;
> +#endif
>  rbd_encryption_format_t format;
>  rbd_encryption_options_t opts;
>  size_t opts_size;
> @@ -505,6 +518,23 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>  luks2_opts.passphrase_size = passphrase_len;
>  break;
>  }
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS_ANY: {
> +memset(&luks_any_opts, 0, sizeof(luks_any_opts));
> +format = RBD_ENCRYPTION_FORMAT_LUKS;
> +opts = &luks_any_opts;
> +opts_size = sizeof(luks_any_opts);
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKSAny_base(&encrypt->u.luks_any),
> +&passphrase, &passphrase_len, errp);
> +if (r < 0) {
> +return r;
> +}
> +luks_any_opts.passphrase = passphrase;
> +luks_any_opts.passphrase_size = passphrase_len;
> +break;
> +}
> +#endif
>  default: {
>  r = -ENOTSUP;
>  error_setg_errno(
> @@ -522,6 +552,76 @@ static int qemu_rbd_encryption_load(rbd_image_t image,
>
>  return 0;
>  }
> +
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION_LOAD2
> +static int qemu_rbd_encryption_load2(rbd_image_t image,
> + RbdEncryptionOptions *encrypt,
> + Error **errp)
> +{
> +int r = 0;
> +int encrypt_count = 1;
> +int i;
> +RbdEncryptionOptions *curr_encrypt;
> +rbd_encryption_spec_t *specs;
> +rbd_encryption_luks_format_options_t* luks_any_opts;
> +char **passphrases;
> +
> +/* count encryption options */
> +for (curr_encrypt = encrypt; curr_encrypt->has_parent;
> + curr_encrypt = curr_encrypt->parent) {
> +++encrypt_count;
> +}
> +
> +specs = g_new0(rbd_encryption_spec_t, encrypt_count);
> +passphrases = g_new0(char*, encrypt_count);
> +
> +curr_encrypt = encrypt;
> +for (i = 0; i < encrypt_count; ++i) {
> +if (curr_encrypt->format != RBD_IMAGE_ENCRYPTION_FORMAT_LUKS_ANY) {
> +r = -ENOTSUP;
> +error_setg_errno(
> +errp, -r, "unknown image encryption format: %u",
> +curr_encrypt->format);
> +goto exit;
> +}
> +
> +specs[i].format = RBD_ENCRYPTION_FORMAT_LUKS;
> +specs[i].opts_size = sizeof(rbd_encryption_luks_format_options_t);
> +
> +luks_any_opts = g_new0(rbd_encryption_luks_format_options_t, 1);
> +specs[i].opts = luks_any_opts;
> +
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKSAny_base(
> +&curr_encrypt->u.luks_any),
> +&passphrases[i], &luks_any_opts->passphrase_size,
> +errp);
> +if (r < 0) {
> +goto exit;
> +}
> +
> +luks_any_opts->passphrase = passphrases[i];

I think qemu_rbd_convert_luks_options() is where the const is missing
(see my earlier reply).  If you make passphrase parameter const char**
there, passphrases array can just go away.
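
Sketch of what I mean, assuming the current prototype takes char **
(only the passphrase parameter changes):

    static int qemu_rbd_convert_luks_options(
            RbdEncryptionOptionsLUKSBase *luks_opts,
            const char **passphrase, size_t *passphrase_len,
            Error **errp);

qemu_rbd_encryption_load2() can then pass &luks_any_opts->passphrase
directly and drop the array.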

> +
> +curr_encrypt = curr_encrypt->parent;
> +}
> +
> +r = rbd_encryption_load2(image, specs, encrypt_count);
> +

Re: [PATCH v1] block/rbd: Add support for layered encryption

2022-11-15 Thread Ilya Dryomov
On Sun, Nov 13, 2022 at 11:16 AM Or Ozeri  wrote:
>
>
>
> > -Original Message-
> > From: Ilya Dryomov 
> > Sent: 11 November 2022 15:01
> > To: Or Ozeri 
> > Cc: qemu-de...@nongnu.org; qemu-block@nongnu.org; Danny Harnik
> > 
> > Subject: [EXTERNAL] Re: [PATCH v1] block/rbd: Add support for layered
> > encryption
> >
> > I don't understand the need for this char* array.  Is there a problem with
> > putting the blob directly into luks_all_opts->passphrase just like the size 
> > is
> > put into luks_all_opts->passphrase_size?
> >
>
> luks_all_opts->passphrase has a const modifier.

Hi Or,

That's really not a reason to make a dynamic memory allocation.  You
can just cast that const away but I suspect that the underlying issue
is that a const is missing somewhere else.  At the end of the day, QEMU
allocates a buffer for the passphrase when it's fetched via the secret
API -- that pointer should assign to const char* just fine.

Thanks,

Ilya



Re: [PATCH v2 3/3] nvme: Add physical writes/reads from OCP log

2022-11-15 Thread Klaus Jensen
On Nov 14 14:50, Joel Granados wrote:
> In order to evaluate write amplification factor (WAF) within the storage
> stack it is important to know the number of bytes written to the
> controller. The existing SMART log value of Data Units Written is too
> coarse (given in units of 500 KB) and so we add the SMART health
> information extended from the OCP specification (given in units of bytes).
> 
> To accommodate different vendor-specific specifications like OCP, we add a
> multiplexing function (nvme_vendor_specific_log) which will route to the
> different log functions based on arguments and log ids. We only return the
> OCP extended smart log when the command is 0xC0 and ocp has been turned on
> in the args.
> 
> Though we add the whole nvme smart log extended structure, we only populate
> the physical_media_units_{read,written}, log_page_version and
> log_page_uuid.
> 
> Signed-off-by: Joel Granados 
> 
> squash with main
> 
> Signed-off-by: Joel Granados 

Looks like you slightly messed up the squash ;)

Also, squash the previous patch (adding the ocp parameter) into this.
Please add a note in the documentation (docs/system/devices/nvme.rst)
about this parameter.

> ---
>  hw/nvme/ctrl.c   | 56 
>  include/block/nvme.h | 36 
>  2 files changed, 92 insertions(+)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 220683201a..5e6a8150a2 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -4455,6 +4455,42 @@ static void nvme_set_blk_stats(NvmeNamespace *ns, 
> struct nvme_stats *stats)
>  stats->write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
>  }
>  
> +static uint16_t nvme_ocp_extended_smart_info(NvmeCtrl *n, uint8_t rae,
> + uint32_t buf_len, uint64_t off,
> + NvmeRequest *req)
> +{
> +NvmeNamespace *ns = NULL;
> +NvmeSmartLogExtended smart_ext = { 0 };
> +struct nvme_stats stats = { 0 };
> +uint32_t trans_len;
> +
> +if (off >= sizeof(smart_ext)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +// Accumulate all stats from all namespaces

Use /* lower-case and no period */ for one-sentence, one-line comments.

I think scripts/checkpatch.pl picks this up.
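
I.e., something like:

    /* accumulate all stats from all namespaces */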

> +for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
> +ns = nvme_ns(n, i);
> +if (ns)
> +{

Parentheses go on the same line as the `if`.
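
That is:

    if (ns) {
        nvme_set_blk_stats(ns, &stats);
    }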

> +nvme_set_blk_stats(ns, &stats);
> +}
> +}
> +
> +smart_ext.physical_media_units_written[0] = 
> cpu_to_le32(stats.units_written);
> +smart_ext.physical_media_units_read[0] = cpu_to_le32(stats.units_read);
> +smart_ext.log_page_version = 0x0003;
> +smart_ext.log_page_uuid[0] = 0xA4F2BFEA2810AFC5;
> +smart_ext.log_page_uuid[1] = 0xAFD514C97C6F4F9C;
> +
> +if (!rae) {
> +nvme_clear_events(n, NVME_AER_TYPE_SMART);
> +}
> +
> +trans_len = MIN(sizeof(smart_ext) - off, buf_len);
> +return nvme_c2h(n, (uint8_t *) &smart_ext + off, trans_len, req);
> +}
> +
>  static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
>  uint64_t off, NvmeRequest *req)
>  {
> @@ -4642,6 +4678,24 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t 
> csi, uint32_t buf_len,
>  return nvme_c2h(n, ((uint8_t *)&log) + off, trans_len, req);
>  }
>  
> +static uint16_t nvme_vendor_specific_log(uint8_t lid, NvmeCtrl *n, uint8_t 
> rae,
> + uint32_t buf_len, uint64_t off,
> + NvmeRequest *req)

`NvmeCtrl *n` must be first parameter.

> +{
> +NvmeSubsystem *subsys = n->subsys;
> +switch (lid) {
> +case NVME_LOG_VENDOR_START:

In this particular case, I think it is clearer if you simply use the
hex value directly. The "meaning" of the log page id depends on whether
or not this is a controller implementing the OCP spec.
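
Something like (a sketch, keeping the rest of the logic as is):

    switch (lid) {
    case 0xc0:
        if (subsys->params.ocp) {
            return nvme_ocp_extended_smart_info(n, rae, buf_len, off, req);
        }
        break;
    }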

> +if (subsys->params.ocp) {
> +return nvme_ocp_extended_smart_info(n, rae, buf_len, off, 
> req);
> +}
> +break;
> +/* Add a case for each additional vendor specific log id */

Lower-case the comment.

> +}
> +
> +trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
>  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
>  {
>  NvmeCmd *cmd = &req->cmd;
> @@ -4683,6 +4737,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest 
> *req)
>  return nvme_error_info(n, rae, len, off, req);
>  case NVME_LOG_SMART_INFO:
>  return nvme_smart_info(n, rae, len, off, req);
> +case NVME_LOG_VENDOR_START...NVME_LOG_VENDOR_END:
> +return nvme_vendor_specific_log(lid, n, rae, len, off, req);
>  case NVME_LOG_FW_SLOT_INFO:
>  return nvme_fw_log_info(n, len, off, req);
>  case NVME_LOG_CHANGED_NSLIST:
> diff --git 

Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-15 Thread Michael S. Tsirkin
On Tue, Nov 15, 2022 at 09:18:27AM +0100, Christian Borntraeger wrote:
> 
> Am 14.11.22 um 18:20 schrieb Michael S. Tsirkin:
> > On Mon, Nov 14, 2022 at 06:15:30PM +0100, Christian Borntraeger wrote:
> > > 
> > > 
> > > Am 14.11.22 um 18:10 schrieb Michael S. Tsirkin:
> > > > On Mon, Nov 14, 2022 at 05:55:09PM +0100, Christian Borntraeger wrote:
> > > > > 
> > > > > 
> > > > > Am 14.11.22 um 17:37 schrieb Michael S. Tsirkin:
> > > > > > On Mon, Nov 14, 2022 at 05:18:53PM +0100, Christian Borntraeger 
> > > > > > wrote:
> > > > > > > Am 08.11.22 um 10:23 schrieb Alex Bennée:
> > > > > > > > The previous fix to virtio_device_started revealed a problem in 
> > > > > > > > its
> > > > > > > > use by both the core and the device code. The core code should 
> > > > > > > > be able
> > > > > > > > to handle the device "starting" while the VM isn't running to 
> > > > > > > > handle
> > > > > > > > the restoration of migration state. To solve this dual use 
> > > > > > > > introduce a
> > > > > > > > new helper for use by the vhost-user backends who all use it to 
> > > > > > > > feed a
> > > > > > > > should_start variable.
> > > > > > > > 
> > > > > > > > We can also pick up a change vhost_user_blk_set_status while we 
> > > > > > > > are at
> > > > > > > > it which follows the same pattern.
> > > > > > > > 
> > > > > > > > Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to 
> > > > > > > > virtio_device_started)
> > > > > > > > Fixes: 27ba7b027f (hw/virtio: add boilerplate for 
> > > > > > > > vhost-user-gpio device)
> > > > > > > > Signed-off-by: Alex Bennée 
> > > > > > > > Cc: "Michael S. Tsirkin" 
> > > > > > > 
> > > > > > > Hmmm, is this
> > > > > > > commit 259d69c00b67c02a67f3bdbeeea71c2c0af76c35
> > > > > > > Author: Alex Bennée 
> > > > > > > AuthorDate: Mon Nov 7 12:14:07 2022 +
> > > > > > > Commit: Michael S. Tsirkin 
> > > > > > > CommitDate: Mon Nov 7 14:08:18 2022 -0500
> > > > > > > 
> > > > > > >hw/virtio: introduce virtio_device_should_start
> > > > > > > 
> > > > > > > and older version?
> > > > > > 
> > > > > > This is what got merged:
> > > > > > https://lore.kernel.org/r/20221107121407.1010913-1-alex.bennee%40linaro.org
> > > > > > This patch was sent after I merged the RFC.
> > > > > > I think the only difference is the commit log but I might be missing
> > > > > > something.
> > > > > > 
> > > > > > > This does not seem to fix the regression that I have reported.
> > > > > > 
> > > > > > This was applied on top of 9f6bcfd99f which IIUC does, right?
> > > > > > 
> > > > > > 
> > > > > 
> > > > > QEMU master still fails for me for suspend/resume to disk:
> > > > > 
> > > > > #0  0x03ff8e3980a6 in __pthread_kill_implementation () at 
> > > > > /lib64/libc.so.6
> > > > > #1  0x03ff8e348580 in raise () at /lib64/libc.so.6
> > > > > #2  0x03ff8e32b5c0 in abort () at /lib64/libc.so.6
> > > > > #3  0x03ff8e3409da in __assert_fail_base () at /lib64/libc.so.6
> > > > > #4  0x03ff8e340a4e in  () at /lib64/libc.so.6
> > > > > #5  0x02aa1ffa8966 in vhost_vsock_common_pre_save 
> > > > > (opaque=) at ../hw/virtio/vhost-vsock-common.c:203
> > > > > #6  0x02aa1fe5e0ee in vmstate_save_state_v
> > > > >   (f=f@entry=0x2aa21bdc170, vmsd=0x2aa204ac5f0 
> > > > > , opaque=0x2aa21bac9f8, 
> > > > > vmdesc=vmdesc@entry=0x3fddc08eb30, version_id=version_id@entry=0) at 
> > > > > ../migration/vmstate.c:329
> > > > > #7  0x02aa1fe5ebf8 in vmstate_save_state 
> > > > > (f=f@entry=0x2aa21bdc170, vmsd=<optimized out>, opaque=<optimized out>, vmdesc_id=vmdesc_id@entry=0x3fddc08eb30) at 
> > > > > ../migration/vmstate.c:317
> > > > > #8  0x02aa1fe75bd0 in vmstate_save (f=f@entry=0x2aa21bdc170, 
> > > > > se=se@entry=0x2aa21bdbe90, vmdesc=vmdesc@entry=0x3fddc08eb30) at 
> > > > > ../migration/savevm.c:908
> > > > > #9  0x02aa1fe79584 in 
> > > > > qemu_savevm_state_complete_precopy_non_iterable 
> > > > > (f=f@entry=0x2aa21bdc170, in_postcopy=in_postcopy@entry=false, 
> > > > > inactivate_disks=inactivate_disks@entry=true)
> > > > >   at ../migration/savevm.c:1393
> > > > > #10 0x02aa1fe79a96 in qemu_savevm_state_complete_precopy 
> > > > > (f=0x2aa21bdc170, iterable_only=iterable_only@entry=false, 
> > > > > inactivate_disks=inactivate_disks@entry=true) at 
> > > > > ../migration/savevm.c:1459
> > > > > #11 0x02aa1fe6d6ee in migration_completion (s=0x2aa218ef600) at 
> > > > > ../migration/migration.c:3314
> > > > > #12 migration_iteration_run (s=0x2aa218ef600) at 
> > > > > ../migration/migration.c:3761
> > > > > #13 migration_thread (opaque=opaque@entry=0x2aa218ef600) at 
> > > > > ../migration/migration.c:3989
> > > > > #14 0x02aa201f0b8c in qemu_thread_start (args=<optimized out>) at 
> > > > > ../util/qemu-thread-posix.c:505
> > > > > #15 0x03ff8e396248 in start_thread () at /lib64/libc.so.6
> > > > > #16 0x03ff8e41183e in thread_start () at /lib64/libc.so.6
> > > > > 
> > > > > Michael, your previous branch did work if I recall correctly.
> > > > 
> > 

Re: [PATCH v2 2/3] nvme: Add ocp to the subsys

2022-11-15 Thread Klaus Jensen
On Nov 14 14:50, Joel Granados wrote:
> The Open Compute Project defines a Datacenter NVMe SSD Spec that sits on
> top of the NVMe spec. It defines additional commands and NVMe behaviors
> specific to the datacenter. This is a preparation patch that introduces an argument to
> activate OCP in nvme.
> 
> Signed-off-by: Joel Granados 
> ---
>  hw/nvme/nvme.h   | 1 +
>  hw/nvme/subsys.c | 4 ++--
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> index 79f5c281c2..aa99c0c57c 100644
> --- a/hw/nvme/nvme.h
> +++ b/hw/nvme/nvme.h
> @@ -56,6 +56,7 @@ typedef struct NvmeSubsystem {
>  
>  struct {
>  char *nqn;
> +bool ocp;
>  } params;
>  } NvmeSubsystem;
>  
> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
> index 9d2643678b..ecca28449c 100644
> --- a/hw/nvme/subsys.c
> +++ b/hw/nvme/subsys.c
> @@ -129,8 +129,8 @@ static void nvme_subsys_realize(DeviceState *dev, Error 
> **errp)
>  
>  static Property nvme_subsystem_props[] = {
>  DEFINE_PROP_STRING("nqn", NvmeSubsystem, params.nqn),
> -DEFINE_PROP_END_OF_LIST(),
> -};
> +DEFINE_PROP_BOOL("ocp", NvmeSubsystem, params.ocp, false),

It is the controller that implements the OCP specification, not the
namespace or the subsystem. The parameter should be on the controller
device.

We discussed that the Get Log Page was subsystem scoped and not
namespace scoped, but that is unrelated to this.
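
I.e., something like this on the controller instead (a sketch; it
assumes the controller keeps its parameters in the usual NvmeParams
struct):

    DEFINE_PROP_BOOL("ocp", NvmeCtrl, params.ocp, false),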

> +DEFINE_PROP_END_OF_LIST(), };
>  
>  static void nvme_subsys_class_init(ObjectClass *oc, void *data)
>  {
> -- 
> 2.30.2
> 
> 


signature.asc
Description: PGP signature


Re: [PATCH v2 1/3] nvme: Move adjustment of data_units{read,written}

2022-11-15 Thread Klaus Jensen
On Nov 14 14:50, Joel Granados wrote:
> In order to return the units_{read/written} required by the SMART log we
> need to shift the byte count by BDRV_SECTOR_BITS and divide
> by 1000. This is a prep patch that moves this adjustment to where the SMART
> log is calculated in order to use the stats struct for calculating OCP
> extended smart log values.
> 
> Signed-off-by: Joel Granados 
> ---
>  hw/nvme/ctrl.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 87aeba0564..220683201a 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -4449,8 +4449,8 @@ static void nvme_set_blk_stats(NvmeNamespace *ns, struct nvme_stats *stats)
>  {
>      BlockAcctStats *s = blk_get_stats(ns->blkconf.blk);
>  
> -    stats->units_read += s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> -    stats->units_written += s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> +    stats->units_read += s->nr_bytes[BLOCK_ACCT_READ];
> +    stats->units_written += s->nr_bytes[BLOCK_ACCT_WRITE];
>      stats->read_commands += s->nr_ops[BLOCK_ACCT_READ];
>      stats->write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
>  }
> @@ -4490,10 +4490,12 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
>      trans_len = MIN(sizeof(smart) - off, buf_len);
>      smart.critical_warning = n->smart_critical_warning;
>  
> -    smart.data_units_read[0] = cpu_to_le64(DIV_ROUND_UP(stats.units_read,
> -                                                        1000));
> -    smart.data_units_written[0] = cpu_to_le64(DIV_ROUND_UP(stats.units_written,
> -                                                           1000));
> +    smart.data_units_read[0] = cpu_to_le64(DIV_ROUND_UP(
> +                                   stats.units_read >> BDRV_SECTOR_BITS,
> +                                   1000));
> +    smart.data_units_written[0] = cpu_to_le64(DIV_ROUND_UP(
> +                                      stats.units_written >> BDRV_SECTOR_BITS,
> +                                      1000));
>      smart.host_read_commands[0] = cpu_to_le64(stats.read_commands);
>      smart.host_write_commands[0] = cpu_to_le64(stats.write_commands);
>  
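
As a sanity check, here is the arithmetic the new hunk performs, as a
standalone sketch (not QEMU code; it assumes BDRV_SECTOR_BITS is 9, i.e.
512-byte sectors — the SMART log counts data units in thousands of
512-byte units, rounded up):

#include <stdint.h>
#include <stdio.h>

#define SECTOR_BITS 9   /* assumption: mirrors QEMU's BDRV_SECTOR_BITS */

/* bytes -> SMART data units: thousands of 512-byte units, rounded up */
static uint64_t bytes_to_data_units(uint64_t nr_bytes)
{
    uint64_t sectors = nr_bytes >> SECTOR_BITS;
    return (sectors + 999) / 1000;   /* same as DIV_ROUND_UP(sectors, 1000) */
}

int main(void)
{
    /* 4 GiB of reads -> 8388608 sectors -> 8389 data units */
    printf("%llu\n", (unsigned long long)bytes_to_data_units(4ULL << 30));
    return 0;
}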

Reviewed-by: Klaus Jensen 




Re: [PULL 00/11] Block layer patches

2022-11-15 Thread Hanna Reitz

On 15.11.22 11:14, Kevin Wolf wrote:

On 15.11.2022 at 00:58, John Snow wrote:

On Mon, Nov 14, 2022 at 5:56 AM Kevin Wolf  wrote:

On 11.11.2022 at 20:20, Stefan Hajnoczi wrote:

Hanna Reitz (9):
   block/mirror: Do not wait for active writes
   block/mirror: Drop mirror_wait_for_any_operation()
   block/mirror: Fix NULL s->job in active writes
   iotests/151: Test that active mirror progresses
   iotests/151: Test active requests on mirror start
   block: Make bdrv_child_get_parent_aio_context I/O
   block-backend: Update ctx immediately after root
   block: Start/end drain on correct AioContext
   tests/stream-under-throttle: New test

Hi Hanna,
This test is broken, probably due to the minimum Python version:
https://gitlab.com/qemu-project/qemu/-/jobs/3311521303

This is exactly the problem I saw with running linters in a gating CI,
but not during 'make check'. And of course, we're hitting it during the
-rc phase now. :-(

I mean. I'd love to have it run in make check too. The alternative was
never seeing this *anywhere* ...

What is the problem with running it in 'make check'? The additional
dependencies? If so, can we run it automatically if the dependencies
happen to be fulfilled and just skip it otherwise?

If I have to run 'make -C python check-pipenv' manually, I can guarantee
you that I'll forget it more often than I'll run it.


...but I'm sorry it's taken me so long to figure out how to get this
stuff to work in "make check" and also from manual iotests runs
without adding any kind of setup that you have to manage. It's just
fiddly, sorry :(


But yes, it seems that asyncio.TimeoutError should be used instead of
asyncio.exceptions.TimeoutError, and Python 3.6 has only the former.
I'll fix this up and send a v2 if it fixes check-python-pipenv.

Hopefully this goes away when we drop 3.6. I want to, but I recall
there was some question about some platforms that don't support 3.7+
"by default" and how annoying that was or wasn't. We're almost a year
out from 3.6 being EOL, so maybe after this release it's worth a crack
to see how painful it is to move on.

If I understand the documentation right, asyncio.TimeoutError is what
you should be using either way. That it happens to be a re-export from
the internal module asyncio.exceptions seems to be more of an
implementation detail, not the official interface.


Oh, so I understood 
https://docs.python.org/3/library/asyncio-exceptions.html wrong.  I took 
that to mean that as of 3.11, `asyncio.TimeoutError` is a deprecated 
alias for `asyncio.exceptions.TimeoutError`, but it’s actually become an 
alias for the now-built-in `TimeoutError` exception.  I think.


Hanna



