Re: [PULL 00/30] Next patches

2023-06-21 Thread Richard Henderson

On 6/22/23 04:12, Juan Quintela wrote:

The following changes since commit 67fe6ae41da64368bc4936b196fee2bf61f8c720:

   Merge tag 'pull-tricore-20230621-1' of https://github.com/bkoppelmann/qemu
into staging (2023-06-21 20:08:48 +0200)

are available in the Git repository at:

   https://gitlab.com/juan.quintela/qemu.git  tags/next-pull-request

for you to fetch changes up to c53dc569d0a0fb76eaa83f353253a897914948f9:

   migration/rdma: Split qemu_fopen_rdma() into input/output functions 
(2023-06-22 02:45:30 +0200)


Migration Pull request (20230621)

In this pull request:

- fix for multifd thread creation (fabiano)
- dirtylimit (hyman)
   * migration-test will go on next PULL request, as it has failures.
- Improve error description (tejus)
- improve -incoming and set parameters before calling incoming (wei)
- migration atomic counters reviewed patches (quintela)
- migration-test refactoring reviewed (quintela)

Please apply.


You really need to test at least one 32-bit host regularly.
It should be trivial for you to do an i686 build somewhere.

https://gitlab.com/qemu-project/qemu/-/jobs/4518975360#L4817
https://gitlab.com/qemu-project/qemu/-/jobs/4518975263#L3486
https://gitlab.com/qemu-project/qemu/-/jobs/4518975261#L3145
https://gitlab.com/qemu-project/qemu/-/jobs/4518975298#L3372
https://gitlab.com/qemu-project/qemu/-/jobs/4518975301#L3221

../softmmu/dirtylimit.c:558:58: error: format specifies type 'long' but the
argument has type 'int64_t' (aka 'long long') [-Werror,-Wformat]

    error_setg(&err, "invalid dirty page limit %ld", dirty_rate);
                                               ~~~   ^~~~~~~~~~
                                               %lld
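
The portable fix for this class of warning is to format an int64_t with the
PRId64 macro from <inttypes.h> instead of a plain %ld (whether the actual
patch ends up using PRId64, %lld, or a different variable type is up to the
author); a minimal standalone illustration:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int64_t dirty_rate = 128;

        /* "%ld" only matches int64_t on LP64 hosts; PRId64 also works on i686 */
        printf("invalid dirty page limit %" PRId64 "\n", dirty_rate);
        return 0;
    }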


r~



Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-21 Thread Juan Quintela
Thomas Huth  wrote:
> On 12/06/2023 21.33, Juan Quintela wrote:
>> Only "defer" is recommended.  After setting all migration parameters,
>> start incoming migration with "migrate-incoming uri" command.
>> Signed-off-by: Juan Quintela 
>> ---
>>   docs/about/deprecated.rst | 7 +++
>>   softmmu/vl.c  | 2 ++
>>   2 files changed, 9 insertions(+)
>> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
>> index 47e98dc95e..518672722d 100644
>> --- a/docs/about/deprecated.rst
>> +++ b/docs/about/deprecated.rst
>> @@ -447,3 +447,10 @@ The new way to modify migration is using migration 
>> parameters.
>>   ``blk`` functionality can be achieved using
>>   ``migrate_set_parameter block-incremental true``.
>>   +``-incoming uri`` (since 8.1)
>> +'''''''''''''''''''''''''''''
>> +
>> +Everything except ``-incoming defer`` is deprecated.  This allows
>> +setting up parameters before launching the proper migration with
>> +``migrate-incoming uri``.
>> +
>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>> index b0b96f67fa..7fe865ab59 100644
>> --- a/softmmu/vl.c
>> +++ b/softmmu/vl.c
>> @@ -2651,6 +2651,8 @@ void qmp_x_exit_preconfig(Error **errp)
>>   if (incoming) {
>>   Error *local_err = NULL;
>>   if (strcmp(incoming, "defer") != 0) {
>> +warn_report("-incoming %s is deprecated, use -incoming defer 
>> and "
>> +" set the uri with migrate-incoming.", incoming);
>> qmp_migrate_incoming(incoming, &local_err);
>>   if (local_err) {
>>   error_reportf_err(local_err, "-incoming %s: ", incoming);
>
> Could we maybe keep at least the smallest set of necessary parameters
> around? I'm often doing a "-incoming tcp:0:1234" for doing quick
> sanity checks with migration, not caring about other migration
> parameters, so if that could continue to work, that would be very
> appreciated.

I will try to explain myself here.

I think that everything except tcp works.
But when we have tcp, we have two cases where this is a trap:
- multifd channels:
  * if we default to a big number, we underuse resources in a normal
case
  * if we default to a small number, we have the problem that if the
user later sets multifd-channels to a bigger number, things can
break.
- postcopy+preempt:
  this case is also problematic, but easily fixable.  Put a default
  of 2 instead of 1.

The only other solution that I can think of is to just fail if multifd
is set without "-incoming defer".  But sooner rather than later we are
going to have to default to multifd, so ...
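
For reference, a sketch of the flow the deprecation points at, in the same
qtest style the migration tests already use (the channel count and the URI
are only illustrative, and the destination is assumed to have been started
with "-incoming defer"):

    /* set capabilities/parameters first, so listen() can size its backlog */
    qtest_qmp_assert_success(to,
        "{ 'execute': 'migrate-set-capabilities', 'arguments': "
        "{ 'capabilities': [ { 'capability': 'multifd', 'state': true } ] } }");
    qtest_qmp_assert_success(to,
        "{ 'execute': 'migrate-set-parameters', 'arguments': "
        "{ 'multifd-channels': 8 } }");
    qtest_qmp_assert_success(to,
        "{ 'execute': 'migrate-incoming', 'arguments': "
        "{ 'uri': 'tcp:127.0.0.1:0' } }");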

Later, Juan.




[PULL 11/30] migration-test: Be consistent for ppc

2023-06-21 Thread Juan Quintela
It makes no sense that we don't have the same configuration on both sides.

Reviewed-by: Laurent Vivier 
Message-ID: <20230608224943.3877-2-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 tests/qtest/migration-test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index b0c355bbd9..c5e0c69c6b 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -646,7 +646,7 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
   "'nvramrc=hex .\" _\" begin %x %x "
   "do i c@ 1 + i c! 1000 +loop .\" B\" 0 "
   "until'", end_address, start_address);
-arch_target = g_strdup("");
+arch_target = g_strdup("-nodefaults");
 } else if (strcmp(arch, "aarch64") == 0) {
 init_bootfile(bootpath, aarch64_kernel, sizeof(aarch64_kernel));
 machine_opts = "virt,gic-version=max";
-- 
2.40.1




[PULL 22/30] migration: enforce multifd and postcopy preempt to be set before incoming

2023-06-21 Thread Juan Quintela
From: Wei Wang 

qemu_start_incoming_migration needs to check the number of multifd
channels or postcopy ram channels to configure the backlog parameter (i.e.
the maximum length to which the queue of pending connections for sockfd
may grow) of listen(). So enforce the usage of postcopy-preempt and
multifd as below:
- need to use "-incoming defer" on the destination; and
- set_capability and set_parameter need to be done before migrate_incoming

Otherwise, disable the use of the features and report error messages to
remind users to adjust the commands.
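
An illustrative sketch (not the actual QEMU code) of why the ordering
matters: the backlog has to cover every connection the source will open,
so those numbers must be known before the listening socket is set up.
The helper name and the exact accounting below are assumptions for
illustration only.

    #include <stdbool.h>
    #include <sys/socket.h>

    /* hypothetical helper: size the listen() backlog from the migration setup */
    int start_listening(int sockfd, int multifd_channels, bool postcopy_preempt)
    {
        int backlog = 1;                 /* the main migration channel */

        backlog += multifd_channels;     /* one connection per multifd channel */
        if (postcopy_preempt) {
            backlog += 1;                /* the postcopy preempt channel */
        }
        return listen(sockfd, backlog);
    }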

Signed-off-by: Wei Wang 
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Message-ID: <20230606101910.20456-2-wei.w.w...@intel.com>
Signed-off-by: Juan Quintela 
Acked-by: Juan Quintela 
---
 migration/options.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/migration/options.c b/migration/options.c
index ba1010e08b..c072c2fab7 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -433,6 +433,11 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
 MIGRATION_CAPABILITY_VALIDATE_UUID,
 MIGRATION_CAPABILITY_ZERO_COPY_SEND);
 
+static bool migrate_incoming_started(void)
+{
+return !!migration_incoming_get_current()->transport_data;
+}
+
 /**
  * @migration_caps_check - check capability compatibility
  *
@@ -556,6 +561,12 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, 
Error **errp)
 error_setg(errp, "Postcopy preempt not compatible with compress");
 return false;
 }
+
+if (migrate_incoming_started()) {
+error_setg(errp,
+   "Postcopy preempt must be set before incoming starts");
+return false;
+}
 }
 
 if (new_caps[MIGRATION_CAPABILITY_MULTIFD]) {
@@ -563,6 +574,10 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, 
Error **errp)
 error_setg(errp, "Multifd is not compatible with compress");
 return false;
 }
+if (migrate_incoming_started()) {
+error_setg(errp, "Multifd must be set before incoming starts");
+return false;
+}
 }
 
 if (new_caps[MIGRATION_CAPABILITY_DIRTY_LIMIT]) {
-- 
2.40.1




[PULL 12/30] migration-test: Make machine_opts regular with other options

2023-06-21 Thread Juan Quintela
Reviewed-by: Peter Xu 
Signed-off-by: Juan Quintela 
Message-ID: <20230608224943.3877-5-quint...@redhat.com>
---
 tests/qtest/migration-test.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index c5e0c69c6b..79157d600b 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -637,7 +637,7 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
 start_address = S390_TEST_MEM_START;
 end_address = S390_TEST_MEM_END;
 } else if (strcmp(arch, "ppc64") == 0) {
-machine_opts = "vsmt=8";
+machine_opts = "-machine vsmt=8";
 memory_size = "256M";
 start_address = PPC_TEST_MEM_START;
 end_address = PPC_TEST_MEM_END;
@@ -649,7 +649,7 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
 arch_target = g_strdup("-nodefaults");
 } else if (strcmp(arch, "aarch64") == 0) {
 init_bootfile(bootpath, aarch64_kernel, sizeof(aarch64_kernel));
-machine_opts = "virt,gic-version=max";
+machine_opts = "-machine virt,gic-version=max";
 memory_size = "150M";
 arch_source = g_strdup_printf("-cpu max "
   "-kernel %s",
@@ -689,14 +689,13 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
 shmem_opts = g_strdup("");
 }
 
-cmd_source = g_strdup_printf("-accel kvm%s -accel tcg%s%s "
+cmd_source = g_strdup_printf("-accel kvm%s -accel tcg %s "
  "-name source,debug-threads=on "
  "-m %s "
  "-serial file:%s/src_serial "
  "%s %s %s %s",
  args->use_dirty_ring ?
  ",dirty-ring-size=4096" : "",
- machine_opts ? " -machine " : "",
  machine_opts ? machine_opts : "",
  memory_size, tmpfs,
  arch_source, shmem_opts,
@@ -709,7 +708,7 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
  &got_src_stop);
 }
 
-cmd_target = g_strdup_printf("-accel kvm%s -accel tcg%s%s "
+cmd_target = g_strdup_printf("-accel kvm%s -accel tcg %s "
  "-name target,debug-threads=on "
  "-m %s "
  "-serial file:%s/dest_serial "
@@ -717,7 +716,6 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
  "%s %s %s %s",
  args->use_dirty_ring ?
  ",dirty-ring-size=4096" : "",
- machine_opts ? " -machine " : "",
  machine_opts ? machine_opts : "",
  memory_size, tmpfs, uri,
  arch_target, shmem_opts,
-- 
2.40.1




[PULL 10/30] migration: Extend query-migrate to provide dirty page limit info

2023-06-21 Thread Juan Quintela
From: Hyman Huang(黄勇) 

Extend query-migrate to provide throttle time and estimated
ring full time with the dirty-limit capability enabled, through which
we can observe whether dirty-limit takes effect during live migration.

Signed-off-by: Hyman Huang(黄勇) 
Reviewed-by: Markus Armbruster 
Reviewed-by: Juan Quintela 
Message-ID: <168733225273.5845.1587182678887974167...@git.sr.ht>
Signed-off-by: Juan Quintela 
---
 qapi/migration.json| 16 +-
 include/sysemu/dirtylimit.h|  2 ++
 migration/migration-hmp-cmds.c | 10 +
 migration/migration.c  | 10 +
 softmmu/dirtylimit.c   | 39 ++
 5 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 621e6604c6..e9b24fc410 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -250,6 +250,18 @@
 # blocked.  Present and non-empty when migration is blocked.
 # (since 6.0)
 #
+# @dirty-limit-throttle-time-per-round: Maximum throttle time (in 
microseconds) of virtual
+#   CPUs each dirty ring full round, which 
shows how
+#   MigrationCapability dirty-limit 
affects the guest
+#   during live migration. (since 8.1)
+#
+# @dirty-limit-ring-full-time: Estimated average dirty ring full time (in 
microseconds)
+#  each dirty ring full round, note that the value 
equals
+#  dirty ring memory size divided by average dirty 
page rate
+#  of virtual CPU, which can be used to observe 
the average
+#  memory load of virtual CPU indirectly. Note 
that zero
+#  means guest doesn't dirty memory (since 8.1)
+#
 # Since: 0.14
 ##
 { 'struct': 'MigrationInfo',
@@ -267,7 +279,9 @@
'*postcopy-blocktime' : 'uint32',
'*postcopy-vcpu-blocktime': ['uint32'],
'*compression': 'CompressionStats',
-   '*socket-address': ['SocketAddress'] } }
+   '*socket-address': ['SocketAddress'],
+   '*dirty-limit-throttle-time-per-round': 'uint64',
+   '*dirty-limit-ring-full-time': 'uint64'} }
 
 ##
 # @query-migrate:
diff --git a/include/sysemu/dirtylimit.h b/include/sysemu/dirtylimit.h
index 8d2c1f3a6b..d11edb 100644
--- a/include/sysemu/dirtylimit.h
+++ b/include/sysemu/dirtylimit.h
@@ -34,4 +34,6 @@ void dirtylimit_set_vcpu(int cpu_index,
 void dirtylimit_set_all(uint64_t quota,
 bool enable);
 void dirtylimit_vcpu_execute(CPUState *cpu);
+uint64_t dirtylimit_throttle_time_per_round(void);
+uint64_t dirtylimit_ring_full_time(void);
 #endif
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 35e8020bbf..c115ef2d23 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -190,6 +190,16 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
info->cpu_throttle_percentage);
 }
 
+if (info->has_dirty_limit_throttle_time_per_round) {
+monitor_printf(mon, "dirty-limit throttle time: %" PRIu64 " us\n",
+   info->dirty_limit_throttle_time_per_round);
+}
+
+if (info->has_dirty_limit_ring_full_time) {
+monitor_printf(mon, "dirty-limit ring full time: %" PRIu64 " us\n",
+   info->dirty_limit_ring_full_time);
+}
+
 if (info->has_postcopy_blocktime) {
 monitor_printf(mon, "postcopy blocktime: %u\n",
info->postcopy_blocktime);
diff --git a/migration/migration.c b/migration/migration.c
index c101784dfa..719f91573f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -64,6 +64,7 @@
 #include "yank_functions.h"
 #include "sysemu/qtest.h"
 #include "options.h"
+#include "sysemu/dirtylimit.h"
 
 static NotifierList migration_state_notifiers =
 NOTIFIER_LIST_INITIALIZER(migration_state_notifiers);
@@ -968,6 +969,15 @@ static void populate_ram_info(MigrationInfo *info, 
MigrationState *s)
 info->ram->dirty_pages_rate =
 stat64_get(&mig_stats.dirty_pages_rate);
 }
+
+if (migrate_dirty_limit() && dirtylimit_in_service()) {
+info->has_dirty_limit_throttle_time_per_round = true;
+info->dirty_limit_throttle_time_per_round =
+dirtylimit_throttle_time_per_round();
+
+info->has_dirty_limit_ring_full_time = true;
+info->dirty_limit_ring_full_time = dirtylimit_ring_full_time();
+}
 }
 
 static void populate_disk_info(MigrationInfo *info)
diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
index 6e218bb249..af27f0d022 100644
--- a/softmmu/dirtylimit.c
+++ b/softmmu/dirtylimit.c
@@ -565,6 +565,45 @@ out:
 hmp_handle_error(mon, err);
 }
 
+/* Return the max throttle time of each virtual CPU */
+uint64_t dirtylimit_throttle_time_per_round(void)
+{
+
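With the dirty-limit capability in use, the two new values also show up in
HMP "info migrate" through the migration-hmp-cmds.c hunk above; an
illustrative excerpt (the numbers are made up):

    (qemu) info migrate
    ...
    dirty-limit throttle time: 98765 us
    dirty-limit ring full time: 345 us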

[PULL 17/30] migration-test: Add bootfile_create/delete() functions

2023-06-21 Thread Juan Quintela
The bootsector code is read only from the guest (otherwise we are
going to have problems with it being read from both source and
destination).

Create a single copy for all the tests.

Reviewed-by: Peter Xu 
Message-ID: <20230608224943.3877-10-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 tests/qtest/migration-test.c | 50 ++--
 1 file changed, 36 insertions(+), 14 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 0f80dbfe80..eb6a11e758 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -111,14 +111,47 @@ static char *bootpath;
 #include "tests/migration/aarch64/a-b-kernel.h"
 #include "tests/migration/s390x/a-b-bios.h"
 
-static void init_bootfile(void *content, size_t len)
+static void bootfile_create(char *dir)
 {
+const char *arch = qtest_get_arch();
+unsigned char *content;
+size_t len;
+
+bootpath = g_strdup_printf("%s/bootsect", dir);
+if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
+/* the assembled x86 boot sector should be exactly one sector large */
+g_assert(sizeof(x86_bootsect) == 512);
+content = x86_bootsect;
+len = sizeof(x86_bootsect);
+} else if (g_str_equal(arch, "s390x")) {
+content = s390x_elf;
+len = sizeof(s390x_elf);
+} else if (strcmp(arch, "ppc64") == 0) {
+/*
+ * sane architectures can be programmed at the boot prompt
+ */
+return;
+} else if (strcmp(arch, "aarch64") == 0) {
+content = aarch64_kernel;
+len = sizeof(aarch64_kernel);
+g_assert(sizeof(aarch64_kernel) <= ARM_TEST_MAX_KERNEL_SIZE);
+} else {
+g_assert_not_reached();
+}
+
 FILE *bootfile = fopen(bootpath, "wb");
 
 g_assert_cmpint(fwrite(content, len, 1, bootfile), ==, 1);
 fclose(bootfile);
 }
 
+static void bootfile_delete(void)
+{
+unlink(bootpath);
+g_free(bootpath);
+bootpath = NULL;
+}
+
 /*
  * Wait for some output in the serial output file,
  * we get an 'A' followed by an endless string of 'B's
@@ -622,15 +655,11 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
 got_src_stop = false;
 got_dst_resume = false;
 if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
-/* the assembled x86 boot sector should be exactly one sector large */
-assert(sizeof(x86_bootsect) == 512);
-init_bootfile(x86_bootsect, sizeof(x86_bootsect));
 memory_size = "150M";
 arch_opts = g_strdup_printf("-drive file=%s,format=raw", bootpath);
 start_address = X86_TEST_MEM_START;
 end_address = X86_TEST_MEM_END;
 } else if (g_str_equal(arch, "s390x")) {
-init_bootfile(s390x_elf, sizeof(s390x_elf));
 memory_size = "128M";
 arch_opts = g_strdup_printf("-bios %s", bootpath);
 start_address = S390_TEST_MEM_START;
@@ -645,14 +674,11 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
   "until'", end_address, start_address);
 arch_opts = g_strdup("-nodefaults -machine vsmt=8");
 } else if (strcmp(arch, "aarch64") == 0) {
-init_bootfile(aarch64_kernel, sizeof(aarch64_kernel));
 memory_size = "150M";
 arch_opts = g_strdup_printf("-machine virt,gic-version=max -cpu max "
 "-kernel %s", bootpath);
 start_address = ARM_TEST_MEM_START;
 end_address = ARM_TEST_MEM_END;
-
-g_assert(sizeof(aarch64_kernel) <= ARM_TEST_MAX_KERNEL_SIZE);
 } else {
 g_assert_not_reached();
 }
@@ -2493,9 +2519,6 @@ static QTestState *dirtylimit_start_vm(void)
 const char *arch = qtest_get_arch();
 
 assert((strcmp(arch, "x86_64") == 0));
-assert(sizeof(x86_bootsect) == 512);
-init_bootfile(x86_bootsect, sizeof(x86_bootsect));
-
 cmd = g_strdup_printf("-accel kvm,dirty-ring-size=4096 "
   "-name dirtylimit-test,debug-threads=on "
   "-m 150M -smp 1 "
@@ -2671,7 +2694,7 @@ int main(int argc, char **argv)
g_get_tmp_dir(), err->message);
 }
 g_assert(tmpfs);
-bootpath = g_strdup_printf("%s/bootsect", tmpfs);
+bootfile_create(tmpfs);
 
 module_call_init(MODULE_INIT_QOM);
 
@@ -2815,8 +2838,7 @@ int main(int argc, char **argv)
 
 g_assert_cmpint(ret, ==, 0);
 
-cleanup("bootsect");
-g_free(bootpath);
+bootfile_delete();
 ret = rmdir(tmpfs);
 if (ret != 0) {
 g_test_message("unable to rmdir: path (%s): %s",
-- 
2.40.1




[PULL 18/30] migration-test: dirtylimit checks for x86_64 arch before

2023-06-21 Thread Juan Quintela
So no need to assert we are in x86_64.
Once there, refactor the function to remove useless variables.

Reviewed-by: Peter Xu 
Message-ID: <20230608224943.3877-11-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 tests/qtest/migration-test.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index eb6a11e758..fbe9db23cf 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2515,10 +2515,7 @@ static int64_t get_limit_rate(QTestState *who)
 static QTestState *dirtylimit_start_vm(void)
 {
 QTestState *vm = NULL;
-g_autofree gchar *cmd = NULL;
-const char *arch = qtest_get_arch();
-
-assert((strcmp(arch, "x86_64") == 0));
+g_autofree gchar *
 cmd = g_strdup_printf("-accel kvm,dirty-ring-size=4096 "
   "-name dirtylimit-test,debug-threads=on "
   "-m 150M -smp 1 "
-- 
2.40.1




[PULL 14/30] migration-test: machine_opts is really arch specific

2023-06-21 Thread Juan Quintela
And it needs to be in both source and target, so put it on arch_opts.

Reviewed-by: Peter Xu 
Message-ID: <20230608224943.3877-7-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 tests/qtest/migration-test.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 4d8542f5c7..fc3337b7bb 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -609,7 +609,6 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
 g_autofree char *shmem_opts = NULL;
 g_autofree char *shmem_path = NULL;
 const char *arch = qtest_get_arch();
-const char *machine_opts = NULL;
 const char *memory_size;
 
 if (args->use_shmem) {
@@ -637,7 +636,6 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
 start_address = S390_TEST_MEM_START;
 end_address = S390_TEST_MEM_END;
 } else if (strcmp(arch, "ppc64") == 0) {
-machine_opts = "-machine vsmt=8";
 memory_size = "256M";
 start_address = PPC_TEST_MEM_START;
 end_address = PPC_TEST_MEM_END;
@@ -645,12 +643,12 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
   "'nvramrc=hex .\" _\" begin %x %x "
   "do i c@ 1 + i c! 1000 +loop .\" B\" 0 "
   "until'", end_address, start_address);
-arch_opts = g_strdup("-nodefaults");
+arch_opts = g_strdup("-nodefaults -machine vsmt=8");
 } else if (strcmp(arch, "aarch64") == 0) {
 init_bootfile(bootpath, aarch64_kernel, sizeof(aarch64_kernel));
-machine_opts = "-machine virt,gic-version=max";
 memory_size = "150M";
-arch_opts = g_strdup_printf("-cpu max -kernel %s", bootpath);
+arch_opts = g_strdup_printf("-machine virt,gic-version=max -cpu max "
+"-kernel %s", bootpath);
 start_address = ARM_TEST_MEM_START;
 end_address = ARM_TEST_MEM_END;
 
@@ -685,14 +683,13 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
 shmem_opts = g_strdup("");
 }
 
-cmd_source = g_strdup_printf("-accel kvm%s -accel tcg %s "
+cmd_source = g_strdup_printf("-accel kvm%s -accel tcg "
  "-name source,debug-threads=on "
  "-m %s "
  "-serial file:%s/src_serial "
  "%s %s %s %s %s",
  args->use_dirty_ring ?
  ",dirty-ring-size=4096" : "",
- machine_opts ? machine_opts : "",
  memory_size, tmpfs,
  arch_opts ? arch_opts : "",
  arch_source ? arch_source : "",
@@ -706,7 +703,7 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
  &got_src_stop);
 }
 
-cmd_target = g_strdup_printf("-accel kvm%s -accel tcg %s "
+cmd_target = g_strdup_printf("-accel kvm%s -accel tcg "
  "-name target,debug-threads=on "
  "-m %s "
  "-serial file:%s/dest_serial "
@@ -714,7 +711,6 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
  "%s %s %s %s %s",
  args->use_dirty_ring ?
  ",dirty-ring-size=4096" : "",
- machine_opts ? machine_opts : "",
  memory_size, tmpfs, uri,
  arch_opts ? arch_opts : "",
  arch_target ? arch_target : "",
-- 
2.40.1




[PULL 19/30] migration-test: simplify shmem_opts handling

2023-06-21 Thread Juan Quintela
Reviewed-by: Peter Xu 
Message-ID: <20230608224943.3877-4-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 tests/qtest/migration-test.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index fbe9db23cf..e3e7d54216 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -704,9 +704,6 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
 "-object memory-backend-file,id=mem0,size=%s"
 ",mem-path=%s,share=on -numa node,memdev=mem0",
 memory_size, shmem_path);
-} else {
-shmem_path = NULL;
-shmem_opts = g_strdup("");
 }
 
 if (args->use_dirty_ring) {
@@ -722,7 +719,7 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
  memory_size, tmpfs,
  arch_opts ? arch_opts : "",
  arch_source ? arch_source : "",
- shmem_opts,
+ shmem_opts ? shmem_opts : "",
  args->opts_source ? args->opts_source : "",
  ignore_stderr);
 if (!args->only_target) {
@@ -742,7 +739,7 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
  memory_size, tmpfs, uri,
  arch_opts ? arch_opts : "",
  arch_target ? arch_target : "",
- shmem_opts,
+ shmem_opts ? shmem_opts : "",
  args->opts_target ? args->opts_target : "",
  ignore_stderr);
 *to = qtest_init(cmd_target);
-- 
2.40.1




[PULL 30/30] migration/rdma: Split qemu_fopen_rdma() into input/output functions

2023-06-21 Thread Juan Quintela
This is how everything else in QEMUFile is structured.
As a bonus, it is three fewer lines of code.

Reviewed-by: Peter Xu 
Message-ID: <20230530183941.7223-17-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 migration/qemu-file.h |  1 -
 migration/qemu-file.c | 12 
 migration/rdma.c  | 39 +++
 3 files changed, 19 insertions(+), 33 deletions(-)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 8b8b7d27fe..47015f5201 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -102,7 +102,6 @@ uint64_t qemu_file_transferred_noflush(QEMUFile *f);
  */
 void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size,
bool may_free);
-bool qemu_file_mode_is_not_valid(const char *mode);
 
 #include "migration/qemu-file-types.h"
 
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index d30bf3c377..19c33c9985 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -100,18 +100,6 @@ int qemu_file_shutdown(QEMUFile *f)
 return 0;
 }
 
-bool qemu_file_mode_is_not_valid(const char *mode)
-{
-if (mode == NULL ||
-(mode[0] != 'r' && mode[0] != 'w') ||
-mode[1] != 'b' || mode[2] != 0) {
-fprintf(stderr, "qemu_fopen: Argument validity check failed\n");
-return true;
-}
-
-return false;
-}
-
 static QEMUFile *qemu_file_new_impl(QIOChannel *ioc, bool is_writable)
 {
 QEMUFile *f;
diff --git a/migration/rdma.c b/migration/rdma.c
index dd1c039e6c..ca430d319d 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -4053,27 +4053,26 @@ static void qio_channel_rdma_register_types(void)
 
 type_init(qio_channel_rdma_register_types);
 
-static QEMUFile *qemu_fopen_rdma(RDMAContext *rdma, const char *mode)
+static QEMUFile *rdma_new_input(RDMAContext *rdma)
 {
-QIOChannelRDMA *rioc;
+QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
 
-if (qemu_file_mode_is_not_valid(mode)) {
-return NULL;
-}
+rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc));
+rioc->rdmain = rdma;
+rioc->rdmaout = rdma->return_path;
+qemu_file_set_hooks(rioc->file, &rdma_read_hooks);
 
-rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
+return rioc->file;
+}
 
-if (mode[0] == 'w') {
-rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc));
-rioc->rdmaout = rdma;
-rioc->rdmain = rdma->return_path;
-qemu_file_set_hooks(rioc->file, &rdma_write_hooks);
-} else {
-rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc));
-rioc->rdmain = rdma;
-rioc->rdmaout = rdma->return_path;
-qemu_file_set_hooks(rioc->file, &rdma_read_hooks);
-}
+static QEMUFile *rdma_new_output(RDMAContext *rdma)
+{
+QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
+
+rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc));
+rioc->rdmaout = rdma;
+rioc->rdmain = rdma->return_path;
+qemu_file_set_hooks(rioc->file, &rdma_write_hooks);
 
 return rioc->file;
 }
@@ -4099,9 +4098,9 @@ static void rdma_accept_incoming_migration(void *opaque)
 return;
 }
 
-f = qemu_fopen_rdma(rdma, "rb");
+f = rdma_new_input(rdma);
 if (f == NULL) {
-fprintf(stderr, "RDMA ERROR: could not qemu_fopen_rdma\n");
+fprintf(stderr, "RDMA ERROR: could not open RDMA for input\n");
 qemu_rdma_cleanup(rdma);
 return;
 }
@@ -4224,7 +4223,7 @@ void rdma_start_outgoing_migration(void *opaque,
 
 trace_rdma_start_outgoing_migration_after_rdma_connect();
 
-s->to_dst_file = qemu_fopen_rdma(rdma, "wb");
+s->to_dst_file = rdma_new_output(rdma);
 migrate_fd_connect(s, NULL);
 return;
 return_path_err:
-- 
2.40.1




[PULL 29/30] qemu-file: Make qemu_file_get_error_obj() static

2023-06-21 Thread Juan Quintela
It was not used outside of qemu-file.c anyway.

Reviewed-by: Peter Xu 
Message-ID: <20230530183941.7223-21-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 migration/qemu-file.h | 1 -
 migration/qemu-file.c | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index a081ef6c3f..8b8b7d27fe 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -128,7 +128,6 @@ void qemu_file_skip(QEMUFile *f, int size);
  * accounting information tracks the total migration traffic.
  */
 void qemu_file_credit_transfer(QEMUFile *f, size_t size);
-int qemu_file_get_error_obj(QEMUFile *f, Error **errp);
 int qemu_file_get_error_obj_any(QEMUFile *f1, QEMUFile *f2, Error **errp);
 void qemu_file_set_error_obj(QEMUFile *f, int ret, Error *err);
 void qemu_file_set_error(QEMUFile *f, int ret);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 4c577bdff8..d30bf3c377 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -158,7 +158,7 @@ void qemu_file_set_hooks(QEMUFile *f, const QEMUFileHooks 
*hooks)
  * is not 0.
  *
  */
-int qemu_file_get_error_obj(QEMUFile *f, Error **errp)
+static int qemu_file_get_error_obj(QEMUFile *f, Error **errp)
 {
 if (errp) {
 *errp = f->last_error_obj ? error_copy(f->last_error_obj) : NULL;
-- 
2.40.1




[PULL 25/30] migration: Change qemu_file_transferred to noflush

2023-06-21 Thread Juan Quintela
We do a qemu_fclose() just after that, which also does a qemu_fflush(),
so remove the redundant qemu_fflush().

Reviewed-by: Philippe Mathieu-Daudé 
Message-ID: <20230530183941.7223-3-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 migration/savevm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index f26b455764..b2199d1039 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2952,7 +2952,7 @@ bool save_snapshot(const char *name, bool overwrite, 
const char *vmstate,
 goto the_end;
 }
 ret = qemu_savevm_state(f, errp);
-vm_state_size = qemu_file_transferred(f);
+vm_state_size = qemu_file_transferred_noflush(f);
 ret2 = qemu_fclose(f);
 if (ret < 0) {
 goto the_end;
-- 
2.40.1




[PULL 26/30] migration: Use qemu_file_transferred_noflush() for block migration.

2023-06-21 Thread Juan Quintela
We only care about the number of bytes transferred.  Flushing is done
by the system somewhere else.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Juan Quintela 
Message-ID: <20230530183941.7223-4-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 migration/block.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index b9580a6c7e..b29e80bdc4 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -748,7 +748,7 @@ static int block_save_setup(QEMUFile *f, void *opaque)
 static int block_save_iterate(QEMUFile *f, void *opaque)
 {
 int ret;
-uint64_t last_bytes = qemu_file_transferred(f);
+uint64_t last_bytes = qemu_file_transferred_noflush(f);
 
 trace_migration_block_save("iterate", block_mig_state.submitted,
block_mig_state.transferred);
@@ -800,7 +800,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
 }
 
 qemu_put_be64(f, BLK_MIG_FLAG_EOS);
-uint64_t delta_bytes = qemu_file_transferred(f) - last_bytes;
+uint64_t delta_bytes = qemu_file_transferred_noflush(f) - last_bytes;
 return (delta_bytes > 0);
 }
 
-- 
2.40.1




[PULL 21/30] migration: Refactor repeated call of yank_unregister_instance

2023-06-21 Thread Juan Quintela
From: Tejus GK 

In the function qmp_migrate(), yank_unregister_instance() gets called
twice, which isn't required. Hence, refactor it so that it gets called
during the local_err cleanup.

Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Juan Quintela 
Acked-by: Peter Xu 
Signed-off-by: Tejus GK 
Message-ID: <20230621130940.178659-3-tejus...@nutanix.com>
Signed-off-by: Juan Quintela 
---
 migration/migration.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index e6bff2e848..7a4ba2e846 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1676,15 +1676,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool 
blk,
 } else if (strstart(uri, "fd:", &p)) {
 fd_start_outgoing_migration(s, p, &local_err);
 } else {
-if (!(has_resume && resume)) {
-yank_unregister_instance(MIGRATION_YANK_INSTANCE);
-}
 error_setg(&local_err, QERR_INVALID_PARAMETER_VALUE, "uri",
"a valid migration protocol");
 migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
   MIGRATION_STATUS_FAILED);
 block_cleanup_parameters();
-return;
 }
 
 if (local_err) {
-- 
2.40.1




[PULL 16/30] migration-test: bootpath is the same for all tests and for all archs

2023-06-21 Thread Juan Quintela
So just make it a global variable.

Reviewed-by: Peter Xu 
Message-ID: <20230608224943.3877-9-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 tests/qtest/migration-test.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 40967fdffc..0f80dbfe80 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -102,6 +102,7 @@ static bool ufd_version_check(void)
 #endif
 
 static char *tmpfs;
+static char *bootpath;
 
 /* The boot file modifies memory area in [start_address, end_address)
  * repeatedly. It outputs a 'B' at a fixed rate while it's still running.
@@ -110,7 +111,7 @@ static char *tmpfs;
 #include "tests/migration/aarch64/a-b-kernel.h"
 #include "tests/migration/s390x/a-b-bios.h"
 
-static void init_bootfile(const char *bootpath, void *content, size_t len)
+static void init_bootfile(void *content, size_t len)
 {
 FILE *bootfile = fopen(bootpath, "wb");
 
@@ -605,7 +606,6 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
 g_autofree gchar *cmd_source = NULL;
 g_autofree gchar *cmd_target = NULL;
 const gchar *ignore_stderr;
-g_autofree char *bootpath = NULL;
 g_autofree char *shmem_opts = NULL;
 g_autofree char *shmem_path = NULL;
 const char *kvm_opts = NULL;
@@ -621,17 +621,16 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
 
 got_src_stop = false;
 got_dst_resume = false;
-bootpath = g_strdup_printf("%s/bootsect", tmpfs);
 if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
 /* the assembled x86 boot sector should be exactly one sector large */
 assert(sizeof(x86_bootsect) == 512);
-init_bootfile(bootpath, x86_bootsect, sizeof(x86_bootsect));
+init_bootfile(x86_bootsect, sizeof(x86_bootsect));
 memory_size = "150M";
 arch_opts = g_strdup_printf("-drive file=%s,format=raw", bootpath);
 start_address = X86_TEST_MEM_START;
 end_address = X86_TEST_MEM_END;
 } else if (g_str_equal(arch, "s390x")) {
-init_bootfile(bootpath, s390x_elf, sizeof(s390x_elf));
+init_bootfile(s390x_elf, sizeof(s390x_elf));
 memory_size = "128M";
 arch_opts = g_strdup_printf("-bios %s", bootpath);
 start_address = S390_TEST_MEM_START;
@@ -646,7 +645,7 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
   "until'", end_address, start_address);
 arch_opts = g_strdup("-nodefaults -machine vsmt=8");
 } else if (strcmp(arch, "aarch64") == 0) {
-init_bootfile(bootpath, aarch64_kernel, sizeof(aarch64_kernel));
+init_bootfile(aarch64_kernel, sizeof(aarch64_kernel));
 memory_size = "150M";
 arch_opts = g_strdup_printf("-machine virt,gic-version=max -cpu max "
 "-kernel %s", bootpath);
@@ -764,7 +763,6 @@ static void test_migrate_end(QTestState *from, QTestState 
*to, bool test_dest)
 
 qtest_quit(to);
 
-cleanup("bootsect");
 cleanup("migsocket");
 cleanup("src_serial");
 cleanup("dest_serial");
@@ -2493,12 +2491,10 @@ static QTestState *dirtylimit_start_vm(void)
 QTestState *vm = NULL;
 g_autofree gchar *cmd = NULL;
 const char *arch = qtest_get_arch();
-g_autofree char *bootpath = NULL;
 
 assert((strcmp(arch, "x86_64") == 0));
-bootpath = g_strdup_printf("%s/bootsect", tmpfs);
 assert(sizeof(x86_bootsect) == 512);
-init_bootfile(bootpath, x86_bootsect, sizeof(x86_bootsect));
+init_bootfile(x86_bootsect, sizeof(x86_bootsect));
 
 cmd = g_strdup_printf("-accel kvm,dirty-ring-size=4096 "
   "-name dirtylimit-test,debug-threads=on "
@@ -2514,7 +2510,6 @@ static QTestState *dirtylimit_start_vm(void)
 static void dirtylimit_stop_vm(QTestState *vm)
 {
 qtest_quit(vm);
-cleanup("bootsect");
 cleanup("vm_serial");
 }
 
@@ -2676,6 +2671,7 @@ int main(int argc, char **argv)
g_get_tmp_dir(), err->message);
 }
 g_assert(tmpfs);
+bootpath = g_strdup_printf("%s/bootsect", tmpfs);
 
 module_call_init(MODULE_INIT_QOM);
 
@@ -2819,6 +2815,8 @@ int main(int argc, char **argv)
 
 g_assert_cmpint(ret, ==, 0);
 
+cleanup("bootsect");
+g_free(bootpath);
 ret = rmdir(tmpfs);
 if (ret != 0) {
 g_test_message("unable to rmdir: path (%s): %s",
-- 
2.40.1




[PULL 28/30] qemu-file: Simplify qemu_file_shutdown()

2023-06-21 Thread Juan Quintela
Reviewed-by: Peter Xu 
Message-ID: <20230530183941.7223-20-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 migration/qemu-file.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 9a89e17924..4c577bdff8 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -65,8 +65,6 @@ struct QEMUFile {
  */
 int qemu_file_shutdown(QEMUFile *f)
 {
-int ret = 0;
-
 /*
  * We must set qemufile error before the real shutdown(), otherwise
  * there can be a race window where we thought IO all went though
@@ -96,10 +94,10 @@ int qemu_file_shutdown(QEMUFile *f)
 }
 
 if (qio_channel_shutdown(f->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL) < 0) {
-ret = -EIO;
+return -EIO;
 }
 
-return ret;
+return 0;
 }
 
 bool qemu_file_mode_is_not_valid(const char *mode)
-- 
2.40.1




[PULL 20/30] migration: Update error description whenever migration fails

2023-06-21 Thread Juan Quintela
From: Tejus GK 

There are places in migration.c where the migration is marked failed with
MIGRATION_STATUS_FAILED, but the failure reason is never updated. Hence
libvirt doesn't know why the migration failed when it queries for it.
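
For illustration, once the error is recorded with migrate_set_error(), a
management application can retrieve it through query-migrate's error-desc
field; the error text below is only an example:

    -> { "execute": "query-migrate" }
    <- { "return": { "status": "failed",
                     "error-desc": "postcopy_start: Migration stream errored" } }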

Reviewed-by: Daniel P. Berrangé 
Acked-by: Peter Xu 
Signed-off-by: Tejus GK 
Message-ID: <20230621130940.178659-2-tejus...@nutanix.com>
Signed-off-by: Juan Quintela 
---
 migration/migration.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 719f91573f..e6bff2e848 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1679,7 +1679,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
 if (!(has_resume && resume)) {
 yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
-error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "uri",
+error_setg(&local_err, QERR_INVALID_PARAMETER_VALUE, "uri",
"a valid migration protocol");
 migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
   MIGRATION_STATUS_FAILED);
@@ -2066,7 +2066,7 @@ migration_wait_main_channel(MigrationState *ms)
  * Switch from normal iteration to postcopy
  * Returns non-0 on error
  */
-static int postcopy_start(MigrationState *ms)
+static int postcopy_start(MigrationState *ms, Error **errp)
 {
 int ret;
 QIOChannelBuffer *bioc;
@@ -2176,7 +2176,7 @@ static int postcopy_start(MigrationState *ms)
  */
 ret = qemu_file_get_error(ms->to_dst_file);
 if (ret) {
-error_report("postcopy_start: Migration stream errored (pre package)");
+error_setg(errp, "postcopy_start: Migration stream errored (pre 
package)");
 goto fail_closefb;
 }
 
@@ -2213,7 +2213,7 @@ static int postcopy_start(MigrationState *ms)
 
 ret = qemu_file_get_error(ms->to_dst_file);
 if (ret) {
-error_report("postcopy_start: Migration stream errored");
+error_setg(errp, "postcopy_start: Migration stream errored");
 migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
   MIGRATION_STATUS_FAILED);
 }
@@ -2720,6 +2720,7 @@ typedef enum {
 static MigIterateState migration_iteration_run(MigrationState *s)
 {
 uint64_t must_precopy, can_postcopy;
+Error *local_err = NULL;
 bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
 
 qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
@@ -2742,8 +2743,9 @@ static MigIterateState 
migration_iteration_run(MigrationState *s)
 /* Still a significant amount to transfer */
 if (!in_postcopy && must_precopy <= s->threshold_size &&
 qatomic_read(&s->start_postcopy)) {
-if (postcopy_start(s)) {
-error_report("%s: postcopy failed to start", __func__);
+if (postcopy_start(s, &local_err)) {
+migrate_set_error(s, local_err);
+error_report_err(local_err);
 }
 return MIG_ITERATE_SKIP;
 }
@@ -3234,8 +3236,10 @@ void migrate_fd_connect(MigrationState *s, Error 
*error_in)
  */
 if (migrate_postcopy_ram() || migrate_return_path()) {
 if (open_return_path_on_source(s, !resume)) {
-error_report("Unable to open return-path for postcopy");
+error_setg(&local_err, "Unable to open return-path for postcopy");
 migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+migrate_set_error(s, local_err);
+error_report_err(local_err);
 migrate_fd_cleanup(s);
 return;
 }
@@ -3259,6 +3263,7 @@ void migrate_fd_connect(MigrationState *s, Error 
*error_in)
 }
 
 if (multifd_save_setup(&local_err) != 0) {
+migrate_set_error(s, local_err);
 error_report_err(local_err);
 migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
   MIGRATION_STATUS_FAILED);
-- 
2.40.1




[PULL 24/30] qemu-file: Rename qemu_file_transferred_ fast -> noflush

2023-06-21 Thread Juan Quintela
"Fast" doesn't say much.  "Noflush" indicates more clearly that it is like
qemu_file_transferred but without the flush.

Reviewed-by: Philippe Mathieu-Daudé 
Message-ID: <20230530183941.7223-2-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 migration/qemu-file.h | 11 +--
 migration/qemu-file.c |  2 +-
 migration/savevm.c|  4 ++--
 migration/vmstate.c   |  4 ++--
 4 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index e649718492..aa6eee66da 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -86,16 +86,15 @@ int qemu_fclose(QEMUFile *f);
 uint64_t qemu_file_transferred(QEMUFile *f);
 
 /*
- * qemu_file_transferred_fast:
+ * qemu_file_transferred_noflush:
  *
- * As qemu_file_transferred except for writable
- * files, where no flush is performed and the reported
- * amount will include the size of any queued buffers,
- * on top of the amount actually transferred.
+ * As qemu_file_transferred except for writable files, where no flush
+ * is performed and the reported amount will include the size of any
+ * queued buffers, on top of the amount actually transferred.
  *
  * Returns: the total bytes transferred and queued
  */
-uint64_t qemu_file_transferred_fast(QEMUFile *f);
+uint64_t qemu_file_transferred_noflush(QEMUFile *f);
 
 /*
  * put_buffer without copying the buffer.
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index acc282654a..fdf115b5da 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -694,7 +694,7 @@ int coroutine_mixed_fn qemu_get_byte(QEMUFile *f)
 return result;
 }
 
-uint64_t qemu_file_transferred_fast(QEMUFile *f)
+uint64_t qemu_file_transferred_noflush(QEMUFile *f)
 {
 uint64_t ret = f->total_transferred;
 int i;
diff --git a/migration/savevm.c b/migration/savevm.c
index bc284087f9..f26b455764 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -927,9 +927,9 @@ static int vmstate_load(QEMUFile *f, SaveStateEntry *se)
 static void vmstate_save_old_style(QEMUFile *f, SaveStateEntry *se,
JSONWriter *vmdesc)
 {
-uint64_t old_offset = qemu_file_transferred_fast(f);
+uint64_t old_offset = qemu_file_transferred_noflush(f);
 se->ops->save_state(f, se->opaque);
-uint64_t size = qemu_file_transferred_fast(f) - old_offset;
+uint64_t size = qemu_file_transferred_noflush(f) - old_offset;
 
 if (vmdesc) {
 json_writer_int64(vmdesc, "size", size);
diff --git a/migration/vmstate.c b/migration/vmstate.c
index af01d54b6f..31842c3afb 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -361,7 +361,7 @@ int vmstate_save_state_v(QEMUFile *f, const 
VMStateDescription *vmsd,
 void *curr_elem = first_elem + size * i;
 
 vmsd_desc_field_start(vmsd, vmdesc_loop, field, i, n_elems);
-old_offset = qemu_file_transferred_fast(f);
+old_offset = qemu_file_transferred_noflush(f);
 if (field->flags & VMS_ARRAY_OF_POINTER) {
 assert(curr_elem);
 curr_elem = *(void **)curr_elem;
@@ -391,7 +391,7 @@ int vmstate_save_state_v(QEMUFile *f, const 
VMStateDescription *vmsd,
 return ret;
 }
 
-written_bytes = qemu_file_transferred_fast(f) - old_offset;
+written_bytes = qemu_file_transferred_noflush(f) - old_offset;
 vmsd_desc_field_end(vmsd, vmdesc_loop, field, written_bytes, 
i);
 
 /* Compressed arrays only care about the first element */
-- 
2.40.1




[PULL 13/30] migration-test: Create arch_opts

2023-06-21 Thread Juan Quintela
This will contain the options needed for both source and target.

Reviewed-by: Peter Xu 
Message-ID: <20230608224943.3877-6-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 tests/qtest/migration-test.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 79157d600b..4d8542f5c7 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -600,6 +600,8 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
 {
 g_autofree gchar *arch_source = NULL;
 g_autofree gchar *arch_target = NULL;
+/* options for source and target */
+g_autofree gchar *arch_opts = NULL;
 g_autofree gchar *cmd_source = NULL;
 g_autofree gchar *cmd_target = NULL;
 const gchar *ignore_stderr;
@@ -625,15 +627,13 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
 assert(sizeof(x86_bootsect) == 512);
 init_bootfile(bootpath, x86_bootsect, sizeof(x86_bootsect));
 memory_size = "150M";
-arch_source = g_strdup_printf("-drive file=%s,format=raw", bootpath);
-arch_target = g_strdup(arch_source);
+arch_opts = g_strdup_printf("-drive file=%s,format=raw", bootpath);
 start_address = X86_TEST_MEM_START;
 end_address = X86_TEST_MEM_END;
 } else if (g_str_equal(arch, "s390x")) {
 init_bootfile(bootpath, s390x_elf, sizeof(s390x_elf));
 memory_size = "128M";
-arch_source = g_strdup_printf("-bios %s", bootpath);
-arch_target = g_strdup(arch_source);
+arch_opts = g_strdup_printf("-bios %s", bootpath);
 start_address = S390_TEST_MEM_START;
 end_address = S390_TEST_MEM_END;
 } else if (strcmp(arch, "ppc64") == 0) {
@@ -641,20 +641,16 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
 memory_size = "256M";
 start_address = PPC_TEST_MEM_START;
 end_address = PPC_TEST_MEM_END;
-arch_source = g_strdup_printf("-nodefaults "
-  "-prom-env 'use-nvramrc?=true' -prom-env 
"
+arch_source = g_strdup_printf("-prom-env 'use-nvramrc?=true' -prom-env 
"
   "'nvramrc=hex .\" _\" begin %x %x "
   "do i c@ 1 + i c! 1000 +loop .\" B\" 0 "
   "until'", end_address, start_address);
-arch_target = g_strdup("-nodefaults");
+arch_opts = g_strdup("-nodefaults");
 } else if (strcmp(arch, "aarch64") == 0) {
 init_bootfile(bootpath, aarch64_kernel, sizeof(aarch64_kernel));
 machine_opts = "-machine virt,gic-version=max";
 memory_size = "150M";
-arch_source = g_strdup_printf("-cpu max "
-  "-kernel %s",
-  bootpath);
-arch_target = g_strdup(arch_source);
+arch_opts = g_strdup_printf("-cpu max -kernel %s", bootpath);
 start_address = ARM_TEST_MEM_START;
 end_address = ARM_TEST_MEM_END;
 
@@ -693,12 +689,14 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
  "-name source,debug-threads=on "
  "-m %s "
  "-serial file:%s/src_serial "
- "%s %s %s %s",
+ "%s %s %s %s %s",
  args->use_dirty_ring ?
  ",dirty-ring-size=4096" : "",
  machine_opts ? machine_opts : "",
  memory_size, tmpfs,
- arch_source, shmem_opts,
+ arch_opts ? arch_opts : "",
+ arch_source ? arch_source : "",
+ shmem_opts,
  args->opts_source ? args->opts_source : "",
  ignore_stderr);
 if (!args->only_target) {
@@ -713,12 +711,14 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
  "-m %s "
  "-serial file:%s/dest_serial "
  "-incoming %s "
- "%s %s %s %s",
+ "%s %s %s %s %s",
  args->use_dirty_ring ?
  ",dirty-ring-size=4096" : "",
  machine_opts ? machine_opts : "",
  memory_size, tmpfs, uri,
- arch_target, shmem_opts,
+ arch_opts ? arch_opts : "",
+ arch_target ? arch_target : "",
+ shmem_opts,
   

[PULL 27/30] qemu_file: Make qemu_file_is_writable() static

2023-06-21 Thread Juan Quintela
It is not used outside of qemu-file.c, and it shouldn't be.

Signed-off-by: Juan Quintela 
Message-ID: <20230530183941.7223-19-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 migration/qemu-file.h | 1 -
 migration/qemu-file.c | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index aa6eee66da..a081ef6c3f 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -103,7 +103,6 @@ uint64_t qemu_file_transferred_noflush(QEMUFile *f);
 void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size,
bool may_free);
 bool qemu_file_mode_is_not_valid(const char *mode);
-bool qemu_file_is_writable(QEMUFile *f);
 
 #include "migration/qemu-file-types.h"
 
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index fdf115b5da..9a89e17924 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -228,7 +228,7 @@ void qemu_file_set_error(QEMUFile *f, int ret)
 qemu_file_set_error_obj(f, ret, NULL);
 }
 
-bool qemu_file_is_writable(QEMUFile *f)
+static bool qemu_file_is_writable(QEMUFile *f)
 {
 return f->is_writable;
 }
-- 
2.40.1




[PULL 08/30] migration: Put the detection logic before auto-converge checking

2023-06-21 Thread Juan Quintela
From: Hyman Huang(黄勇) 

This commit prepares for the implementation of the dirty-limit
convergence algorithm.

The detection logic for the throttling condition applies to both the
auto-converge and dirty-limit algorithms, so place it before the
checking logic for the auto-converge feature.

Signed-off-by: Hyman Huang(黄勇) 
Reviewed-by: Juan Quintela 
Message-ID: <168733225273.5845.1587182678887974167...@git.sr.ht>
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 78746849b5..b6559f9312 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -999,17 +999,18 @@ static void migration_trigger_throttle(RAMState *rs)
 return;
 }
 
-if (migrate_auto_converge()) {
-/* The following detection logic can be refined later. For now:
-   Check to see if the ratio between dirtied bytes and the approx.
-   amount of bytes that just got transferred since the last time
-   we were in this routine reaches the threshold. If that happens
-   twice, start or increase throttling. */
-
-if ((bytes_dirty_period > bytes_dirty_threshold) &&
-(++rs->dirty_rate_high_cnt >= 2)) {
+/*
+ * The following detection logic can be refined later. For now:
+ * Check to see if the ratio between dirtied bytes and the approx.
+ * amount of bytes that just got transferred since the last time
+ * we were in this routine reaches the threshold. If that happens
+ * twice, start or increase throttling.
+ */
+if ((bytes_dirty_period > bytes_dirty_threshold) &&
+(++rs->dirty_rate_high_cnt >= 2)) {
+rs->dirty_rate_high_cnt = 0;
+if (migrate_auto_converge()) {
 trace_migration_throttle();
-rs->dirty_rate_high_cnt = 0;
 mig_throttle_guest_down(bytes_dirty_period,
 bytes_dirty_threshold);
 }
-- 
2.40.1




[PULL 23/30] qtest/migration-tests.c: use "-incoming defer" for postcopy tests

2023-06-21 Thread Juan Quintela
From: Wei Wang 

The Postcopy preempt capability is expected to be set before incoming
starts, so change the postcopy tests to start with deferred incoming and
call migrate-incoming after the cap has been set.

Why didn't the existing tests (without this patch) fail?
There could be two reasons:
1) "backlog" specifies the number of pending connections. As long as the
   server accepts connections faster than the client side connects, the
   connection will succeed. For the preempt test, it uses only 2 channels,
   so it is very likely not to have pending connections.
2) per my tests (on kernel 6.2), the number of pending connections allowed
   is actually "backlog + 1", which is 2 in this case.
That said, the implementation of socket_start_incoming_migration_internal
expects "migrate defer" to be used, and for safety, change the test to
work with the expected usage.

Signed-off-by: Wei Wang 
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Message-ID: <20230606101910.20456-3-wei.w.w...@intel.com>
Signed-off-by: Juan Quintela 
---
 tests/qtest/migration-test.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index e3e7d54216..c694685923 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1161,10 +1161,10 @@ static int migrate_postcopy_prepare(QTestState 
**from_ptr,
 QTestState **to_ptr,
 MigrateCommon *args)
 {
-g_autofree char *uri = g_strdup_printf("unix:%s/migsocket", tmpfs);
+g_autofree char *uri = NULL;
 QTestState *from, *to;
 
-if (test_migrate_start(&from, &to, uri, &args->start)) {
+if (test_migrate_start(&from, &to, "defer", &args->start)) {
 return -1;
 }
 
@@ -1183,9 +1183,13 @@ static int migrate_postcopy_prepare(QTestState 
**from_ptr,
 
 migrate_ensure_non_converge(from);
 
+qtest_qmp_assert_success(to, "{ 'execute': 'migrate-incoming',"
+ "  'arguments': { 'uri': 'tcp:127.0.0.1:0' }}");
+
 /* Wait for the first serial output from the source */
 wait_for_serial("src_serial");
 
+uri = migrate_get_socket_address(to, "socket-address");
 migrate_qmp(from, uri, "{}");
 
 wait_for_migration_pass(from);
-- 
2.40.1




[PULL 15/30] migration-test: Create kvm_opts

2023-06-21 Thread Juan Quintela
So the dirty-ring option becomes one option like the others.

Reviewed-by: Peter Xu 
Message-ID: <20230608224943.3877-8-quint...@redhat.com>
Signed-off-by: Juan Quintela 
---
 tests/qtest/migration-test.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index fc3337b7bb..40967fdffc 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -608,6 +608,7 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
 g_autofree char *bootpath = NULL;
 g_autofree char *shmem_opts = NULL;
 g_autofree char *shmem_path = NULL;
+const char *kvm_opts = NULL;
 const char *arch = qtest_get_arch();
 const char *memory_size;
 
@@ -683,13 +684,16 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
 shmem_opts = g_strdup("");
 }
 
+if (args->use_dirty_ring) {
+kvm_opts = ",dirty-ring-size=4096";
+}
+
 cmd_source = g_strdup_printf("-accel kvm%s -accel tcg "
  "-name source,debug-threads=on "
  "-m %s "
  "-serial file:%s/src_serial "
  "%s %s %s %s %s",
- args->use_dirty_ring ?
- ",dirty-ring-size=4096" : "",
+ kvm_opts ? kvm_opts : "",
  memory_size, tmpfs,
  arch_opts ? arch_opts : "",
  arch_source ? arch_source : "",
@@ -709,8 +713,7 @@ static int test_migrate_start(QTestState **from, QTestState 
**to,
  "-serial file:%s/dest_serial "
  "-incoming %s "
  "%s %s %s %s %s",
- args->use_dirty_ring ?
- ",dirty-ring-size=4096" : "",
+ kvm_opts ? kvm_opts : "",
  memory_size, tmpfs, uri,
  arch_opts ? arch_opts : "",
  arch_target ? arch_target : "",
-- 
2.40.1




[PULL 07/30] migration: Refactor auto-converge capability logic

2023-06-21 Thread Juan Quintela
From: Hyman Huang(黄勇) 

Check if block migration is running before throttling the
guest down via auto-converge.

Note that this modification is mostly a code cleanup, because
block migration does not depend on the auto-converge
capability, so the order of the checks can be adjusted.

Signed-off-by: Hyman Huang(黄勇) 
Acked-by: Peter Xu 
Reviewed-by: Juan Quintela 
Message-Id: <168618975839.6361.1740763387474768865...@git.sr.ht>
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index 5283a75f02..78746849b5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -995,7 +995,11 @@ static void migration_trigger_throttle(RAMState *rs)
 /* During block migration the auto-converge logic incorrectly detects
  * that ram migration makes no progress. Avoid this by disabling the
  * throttling logic during the bulk phase of block migration. */
-if (migrate_auto_converge() && !blk_mig_bulk_active()) {
+if (blk_mig_bulk_active()) {
+return;
+}
+
+if (migrate_auto_converge()) {
 /* The following detection logic can be refined later. For now:
Check to see if the ratio between dirtied bytes and the approx.
amount of bytes that just got transferred since the last time
-- 
2.40.1




[PULL 05/30] qapi/migration: Introduce vcpu-dirty-limit parameters

2023-06-21 Thread Juan Quintela
From: Hyman Huang(黄勇) 

Introduce "vcpu-dirty-limit" migration parameter used
to limit dirty page rate during live migration.

"vcpu-dirty-limit" and "x-vcpu-dirty-limit-period" are
two dirty-limit-related migration parameters, which can
be set before and during live migration by qmp
migrate-set-parameters.

These two parameters help implement the dirty page
rate limit algorithm of migration.
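
For illustration, both parameters can be set from the HMP monitor before
(or during) migration; the values below are only examples:

    (qemu) migrate_set_parameter x-vcpu-dirty-limit-period 500
    (qemu) migrate_set_parameter vcpu-dirty-limit 50
    (qemu) info migrate_parameters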

Signed-off-by: Hyman Huang(黄勇) 
Acked-by: Peter Xu 
Reviewed-by: Juan Quintela 
Message-Id: <168618975839.6361.1740763387474768865...@git.sr.ht>
Signed-off-by: Juan Quintela 
---
 qapi/migration.json| 18 +++---
 migration/migration-hmp-cmds.c |  8 
 migration/options.c| 21 +
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 67c26d9dea..e7243c0c0d 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -783,6 +783,9 @@
 # live migration. Should be in the range 1 to 
1000ms,
 # defaults to 1000ms. (Since 8.1)
 #
+# @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
+#Defaults to 1. (Since 8.1)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
@@ -806,7 +809,8 @@
'max-cpu-throttle', 'multifd-compression',
'multifd-zlib-level', 'multifd-zstd-level',
'block-bitmap-mapping',
-   { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] } ] 
}
+   { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
+   'vcpu-dirty-limit'] }
 
 ##
 # @MigrateSetParameters:
@@ -945,6 +949,9 @@
 # live migration. Should be in the range 1 to 
1000ms,
 # defaults to 1000ms. (Since 8.1)
 #
+# @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
+#Defaults to 1. (Since 8.1)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
@@ -985,7 +992,8 @@
 '*multifd-zstd-level': 'uint8',
 '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
 '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
-'features': [ 'unstable' ] } } }
+'features': [ 'unstable' ] },
+'*vcpu-dirty-limit': 'uint64'} }
 
 ##
 # @migrate-set-parameters:
@@ -1144,6 +1152,9 @@
 # live migration. Should be in the range 1 to 
1000ms,
 # defaults to 1000ms. (Since 8.1)
 #
+# @vcpu-dirty-limit: Dirtyrate limit (MB/s) during live migration.
+#Defaults to 1. (Since 8.1)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
@@ -1181,7 +1192,8 @@
 '*multifd-zstd-level': 'uint8',
 '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
 '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
-'features': [ 'unstable' ] } } }
+'features': [ 'unstable' ] },
+'*vcpu-dirty-limit': 'uint64'} }
 
 ##
 # @query-migrate-parameters:
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 352e9ec716..35e8020bbf 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -368,6 +368,10 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict 
*qdict)
 monitor_printf(mon, "%s: %" PRIu64 " ms\n",
 MigrationParameter_str(MIGRATION_PARAMETER_X_VCPU_DIRTY_LIMIT_PERIOD),
 params->x_vcpu_dirty_limit_period);
+
+monitor_printf(mon, "%s: %" PRIu64 " MB/s\n",
+MigrationParameter_str(MIGRATION_PARAMETER_VCPU_DIRTY_LIMIT),
+params->vcpu_dirty_limit);
 }
 
 qapi_free_MigrationParameters(params);
@@ -628,6 +632,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict 
*qdict)
 p->has_x_vcpu_dirty_limit_period = true;
visit_type_size(v, param, &p->x_vcpu_dirty_limit_period, &err);
 break;
+case MIGRATION_PARAMETER_VCPU_DIRTY_LIMIT:
+p->has_vcpu_dirty_limit = true;
+visit_type_size(v, param, &p->vcpu_dirty_limit, &err);
+break;
 default:
 assert(0);
 }
diff --git a/migration/options.c b/migration/options.c
index 9743dea3ab..8acf5f1d2c 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -81,6 +81,7 @@
 DEFINE_PROP_BOOL(name, MigrationState, capabilities[x], false)
 
 #define DEFAULT_MIGRATE_VCPU_DIRTY_LIMIT_PERIOD 1000/* milliseconds */
+#define DEFAULT_MIGRATE_VCPU_DIRTY_LIMIT1   /* MB/s */
 
 Property migration_properties[] = {
 DEFINE_PROP_BOOL("store-global-state", MigrationState,
@@ -168,6 +169,9 @@ Property migration_properties[] = {
 

[PULL 09/30] migration: Implement dirty-limit convergence algo

2023-06-21 Thread Juan Quintela
From: Hyman Huang(黄勇) 

Implement the dirty-limit convergence algorithm for live migration,
which is similar to the auto-converge algorithm but uses dirty-limit
instead of cpu throttling to make migration converge.

Enable the dirty page limit if dirty_rate_high_cnt is greater than 2
while the dirty-limit capability is enabled; disable dirty-limit if
the migration is cancelled.

Note that the "set_vcpu_dirty_limit" and "cancel_vcpu_dirty_limit"
commands are not allowed during a dirty-limit live migration.

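As an illustrative sketch of that restriction, using the existing
set-vcpu-dirty-limit QMP command (the value is arbitrary; the error text
matches the error_setg added below, the exact reply shape may differ):

  { "execute": "set-vcpu-dirty-limit",
    "arguments": { "dirty-rate": 200 } }
  { "error": { "class": "GenericError",
               "desc": "can't set dirty page rate limit while migration is running" } }
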
Signed-off-by: Hyman Huang(黄勇) 
Reviewed-by: Markus Armbruster 
Message-ID: <168733225273.5845.1587182678887974167...@git.sr.ht>
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.c  |  3 +++
 migration/ram.c| 36 
 softmmu/dirtylimit.c   | 29 +
 migration/trace-events |  1 +
 4 files changed, 69 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 3a001dd042..c101784dfa 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -165,6 +165,9 @@ void migration_cancel(const Error *error)
 if (error) {
 migrate_set_error(current_migration, error);
 }
+if (migrate_dirty_limit()) {
+qmp_cancel_vcpu_dirty_limit(false, -1, NULL);
+}
 migrate_fd_cancel(current_migration);
 }
 
diff --git a/migration/ram.c b/migration/ram.c
index b6559f9312..8a86363216 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -46,6 +46,7 @@
 #include "qapi/error.h"
 #include "qapi/qapi-types-migration.h"
 #include "qapi/qapi-events-migration.h"
+#include "qapi/qapi-commands-migration.h"
 #include "qapi/qmp/qerror.h"
 #include "trace.h"
 #include "exec/ram_addr.h"
@@ -59,6 +60,8 @@
 #include "multifd.h"
 #include "sysemu/runstate.h"
 #include "options.h"
+#include "sysemu/dirtylimit.h"
+#include "sysemu/kvm.h"
 
 #include "hw/boards.h" /* for machine_dump_guest_core() */
 
@@ -984,6 +987,37 @@ static void migration_update_rates(RAMState *rs, int64_t 
end_time)
 }
 }
 
+/*
+ * Enable dirty-limit to throttle down the guest
+ */
+static void migration_dirty_limit_guest(void)
+{
+/*
+ * dirty page rate quota for all vCPUs fetched from
+ * migration parameter 'vcpu_dirty_limit'
+ */
+static int64_t quota_dirtyrate;
+MigrationState *s = migrate_get_current();
+
+/*
+ * If dirty limit already enabled and migration parameter
+ * vcpu-dirty-limit untouched.
+ */
+if (dirtylimit_in_service() &&
+quota_dirtyrate == s->parameters.vcpu_dirty_limit) {
+return;
+}
+
+quota_dirtyrate = s->parameters.vcpu_dirty_limit;
+
+/*
+ * Set all vCPU a quota dirtyrate, note that the second
+ * parameter will be ignored if setting all vCPU for the vm
+ */
+qmp_set_vcpu_dirty_limit(false, -1, quota_dirtyrate, NULL);
+trace_migration_dirty_limit_guest(quota_dirtyrate);
+}
+
 static void migration_trigger_throttle(RAMState *rs)
 {
 uint64_t threshold = migrate_throttle_trigger_threshold();
@@ -1013,6 +1047,8 @@ static void migration_trigger_throttle(RAMState *rs)
 trace_migration_throttle();
 mig_throttle_guest_down(bytes_dirty_period,
 bytes_dirty_threshold);
+} else if (migrate_dirty_limit()) {
+migration_dirty_limit_guest();
 }
 }
 }
diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
index 3f1103b04b..6e218bb249 100644
--- a/softmmu/dirtylimit.c
+++ b/softmmu/dirtylimit.c
@@ -436,6 +436,23 @@ static void dirtylimit_cleanup(void)
 dirtylimit_state_finalize();
 }
 
+/*
+ * dirty page rate limit is not allowed to set if migration
+ * is running with dirty-limit capability enabled.
+ */
+static bool dirtylimit_is_allowed(void)
+{
+MigrationState *ms = migrate_get_current();
+
+if (migration_is_running(ms->state) &&
(!qemu_thread_is_self(&ms->thread)) &&
+migrate_dirty_limit() &&
+dirtylimit_in_service()) {
+return false;
+}
+return true;
+}
+
 void qmp_cancel_vcpu_dirty_limit(bool has_cpu_index,
  int64_t cpu_index,
  Error **errp)
@@ -449,6 +466,12 @@ void qmp_cancel_vcpu_dirty_limit(bool has_cpu_index,
 return;
 }
 
+if (!dirtylimit_is_allowed()) {
+error_setg(errp, "can't cancel dirty page rate limit while"
+   " migration is running");
+return;
+}
+
 if (!dirtylimit_in_service()) {
 return;
 }
@@ -499,6 +522,12 @@ void qmp_set_vcpu_dirty_limit(bool has_cpu_index,
 return;
 }
 
+if (!dirtylimit_is_allowed()) {
+error_setg(errp, "can't set dirty page rate limit while"
+   " migration is running");
+return;
+}
+
 if (!dirty_rate) {
 qmp_cancel_vcpu_dirty_limit(has_cpu_index, cpu_index, errp);
 return;
diff --git a/migration/trace-events b/migration/trace-events
index cdaef7a1ea..c5cb280d95 

[PULL 03/30] softmmu/dirtylimit: Add parameter check for hmp "set_vcpu_dirty_limit"

2023-06-21 Thread Juan Quintela
From: Hyman Huang(黄勇) 

The dirty_rate parameter of the hmp command "set_vcpu_dirty_limit" is
invalid if less than 0, so add a parameter check for it.

Note that this patch also deletes the unsolicited help message and
cleans up the code.

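With the check in place, a negative value should be rejected at the
monitor roughly as follows (illustrative; the message text comes from the
error_setg in the hunk below):

  (qemu) set_vcpu_dirty_limit -1
  Error: invalid dirty page limit -1
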
Signed-off-by: Hyman Huang(黄勇) 
Reviewed-by: Markus Armbruster 
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Message-Id: <168618975839.6361.1740763387474768865...@git.sr.ht>
Signed-off-by: Juan Quintela 
---
 softmmu/dirtylimit.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
index 015a9038d1..5c12d26d49 100644
--- a/softmmu/dirtylimit.c
+++ b/softmmu/dirtylimit.c
@@ -515,14 +515,15 @@ void hmp_set_vcpu_dirty_limit(Monitor *mon, const QDict 
*qdict)
 int64_t cpu_index = qdict_get_try_int(qdict, "cpu_index", -1);
 Error *err = NULL;
 
+if (dirty_rate < 0) {
+error_setg(, "invalid dirty page limit %ld", dirty_rate);
+goto out;
+}
+
qmp_set_vcpu_dirty_limit(!!(cpu_index != -1), cpu_index, dirty_rate, &err);
-if (err) {
-hmp_handle_error(mon, err);
-return;
-}
 
-monitor_printf(mon, "[Please use 'info vcpu_dirty_limit' to query "
-   "dirty limit for virtual CPU]\n");
+out:
+hmp_handle_error(mon, err);
 }
 
 static struct DirtyLimitInfo *dirtylimit_query_vcpu(int cpu_index)
-- 
2.40.1




[PULL 06/30] migration: Introduce dirty-limit capability

2023-06-21 Thread Juan Quintela
From: Hyman Huang(黄勇) 

Introduce the migration dirty-limit capability, which can
be turned on before live migration and limits the dirty
page rate during live migration.

Introduce the migrate_dirty_limit function to help check
whether the dirty-limit capability is enabled during live migration.

Meanwhile, refactor vcpu_dirty_rate_stat_collect
so that the period can be configured instead of hardcoded.

The dirty-limit capability is similar to auto-converge
but uses the dirty limit instead of the traditional cpu-throttle
to throttle the guest down. To enable this feature, turn on
the dirty-limit capability before live migration using
migrate-set-capabilities, and set the parameters
"x-vcpu-dirty-limit-period" and "vcpu-dirty-limit" suitably
to speed up convergence.

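A minimal QMP sketch of that flow (the URI and the numbers are
placeholders, and the VM is assumed to have been started with something
like "-accel kvm,dirty-ring-size=4096", as required by the capability
check added below):

  { "execute": "migrate-set-capabilities",
    "arguments": { "capabilities": [
        { "capability": "dirty-limit", "state": true } ] } }
  { "execute": "migrate-set-parameters",
    "arguments": { "x-vcpu-dirty-limit-period": 500,
                   "vcpu-dirty-limit": 50 } }
  { "execute": "migrate", "arguments": { "uri": "tcp:DEST:4444" } }
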
Signed-off-by: Hyman Huang(黄勇) 
Acked-by: Peter Xu 
Reviewed-by: Juan Quintela 
Message-Id: <168618975839.6361.1740763387474768865...@git.sr.ht>
Signed-off-by: Juan Quintela 
---
 qapi/migration.json  | 12 +++-
 migration/options.h  |  1 +
 migration/options.c  | 23 +++
 softmmu/dirtylimit.c | 18 ++
 4 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index e7243c0c0d..621e6604c6 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -487,6 +487,16 @@
 # and should not affect the correctness of postcopy migration.
 # (since 7.1)
 #
+# @dirty-limit: If enabled, migration will use the dirty-limit algo to
+#   throttle down guest instead of auto-converge algo.
+#   Throttle algo only works when vCPU's dirtyrate greater
+#   than 'vcpu-dirty-limit', read processes in guest os
+#   aren't penalized any more, so this algo can improve
+#   performance of vCPU during live migration. This is an
+#   optional performance feature and should not affect the
+#   correctness of the existing auto-converge algo.
+#   (since 8.1)
+#
 # Features:
 #
 # @unstable: Members @x-colo and @x-ignore-shared are experimental.
@@ -502,7 +512,7 @@
'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
{ 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
'validate-uuid', 'background-snapshot',
-   'zero-copy-send', 'postcopy-preempt'] }
+   'zero-copy-send', 'postcopy-preempt', 'dirty-limit'] }
 
 ##
 # @MigrationCapabilityStatus:
diff --git a/migration/options.h b/migration/options.h
index 45991af3c2..51964eff29 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -29,6 +29,7 @@ bool migrate_block(void);
 bool migrate_colo(void);
 bool migrate_compress(void);
 bool migrate_dirty_bitmaps(void);
+bool migrate_dirty_limit(void);
 bool migrate_events(void);
 bool migrate_ignore_shared(void);
 bool migrate_late_block_activate(void);
diff --git a/migration/options.c b/migration/options.c
index 8acf5f1d2c..ba1010e08b 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -27,6 +27,7 @@
 #include "qemu-file.h"
 #include "ram.h"
 #include "options.h"
+#include "sysemu/kvm.h"
 
 /* Maximum migrate downtime set to 2000 seconds */
 #define MAX_MIGRATE_DOWNTIME_SECONDS 2000
@@ -194,6 +195,7 @@ Property migration_properties[] = {
 DEFINE_PROP_MIG_CAP("x-zero-copy-send",
 MIGRATION_CAPABILITY_ZERO_COPY_SEND),
 #endif
+DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
 
 DEFINE_PROP_END_OF_LIST(),
 };
@@ -240,6 +242,13 @@ bool migrate_dirty_bitmaps(void)
 return s->capabilities[MIGRATION_CAPABILITY_DIRTY_BITMAPS];
 }
 
+bool migrate_dirty_limit(void)
+{
+MigrationState *s = migrate_get_current();
+
+return s->capabilities[MIGRATION_CAPABILITY_DIRTY_LIMIT];
+}
+
 bool migrate_events(void)
 {
 MigrationState *s = migrate_get_current();
@@ -556,6 +565,20 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, 
Error **errp)
 }
 }
 
+if (new_caps[MIGRATION_CAPABILITY_DIRTY_LIMIT]) {
+if (new_caps[MIGRATION_CAPABILITY_AUTO_CONVERGE]) {
+error_setg(errp, "dirty-limit conflicts with auto-converge"
+   " either of then available currently");
+return false;
+}
+
+if (!kvm_enabled() || !kvm_dirty_ring_enabled()) {
+error_setg(errp, "dirty-limit requires KVM with accelerator"
+   " property 'dirty-ring-size' set");
+return false;
+}
+}
+
 return true;
 }
 
diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
index 5c12d26d49..3f1103b04b 100644
--- a/softmmu/dirtylimit.c
+++ b/softmmu/dirtylimit.c
@@ -24,6 +24,9 @@
 #include "hw/boards.h"
 #include "sysemu/kvm.h"
 #include "trace.h"
+#include "migration/misc.h"
+#include "migration/migration.h"
+#include "migration/options.h"
 
 /*
  * Dirtylimit stop working if dirty page rate error
@@ -75,14 +78,21 @@ static bool dirtylimit_quit;
 
 static void vcpu_dirty_rate_stat_collect(void)
 {
+MigrationState 

[PULL 04/30] qapi/migration: Introduce x-vcpu-dirty-limit-period parameter

2023-06-21 Thread Juan Quintela
From: Hyman Huang(黄勇) 

Introduce "x-vcpu-dirty-limit-period" migration experimental
parameter, which is in the range of 1 to 1000ms and used to
make dirtyrate calculation period configurable.

Currently with the "x-vcpu-dirty-limit-period" varies, the
total time of live migration changes, test results show the
optimal value of "x-vcpu-dirty-limit-period" ranges from
500ms to 1000 ms. "x-vcpu-dirty-limit-period" should be made
stable once it proves best value can not be determined with
developer's experiments.

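For example (500 ms is picked from the range above, purely illustrative):

  { "execute": "migrate-set-parameters",
    "arguments": { "x-vcpu-dirty-limit-period": 500 } }
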
Signed-off-by: Hyman Huang(黄勇) 
Reviewed-by: Markus Armbruster 
Reviewed-by: Juan Quintela 
Message-Id: <168618975839.6361.1740763387474768865...@git.sr.ht>
Signed-off-by: Juan Quintela 
---
 qapi/migration.json| 34 +++---
 migration/migration-hmp-cmds.c |  8 
 migration/options.c| 28 
 3 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 5bb5ab82a0..67c26d9dea 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -779,9 +779,14 @@
 # Nodes are mapped to their block device name if there is one, and
 # to their node name otherwise.  (Since 5.2)
 #
+# @x-vcpu-dirty-limit-period: Periodic time (in milliseconds) of dirty limit 
during
+# live migration. Should be in the range 1 to 
1000ms,
+# defaults to 1000ms. (Since 8.1)
+#
 # Features:
 #
-# @unstable: Member @x-checkpoint-delay is experimental.
+# @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
+#are experimental.
 #
 # Since: 2.4
 ##
@@ -799,8 +804,9 @@
'multifd-channels',
'xbzrle-cache-size', 'max-postcopy-bandwidth',
'max-cpu-throttle', 'multifd-compression',
-   'multifd-zlib-level' ,'multifd-zstd-level',
-   'block-bitmap-mapping' ] }
+   'multifd-zlib-level', 'multifd-zstd-level',
+   'block-bitmap-mapping',
+   { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] } ] 
}
 
 ##
 # @MigrateSetParameters:
@@ -935,9 +941,14 @@
 # Nodes are mapped to their block device name if there is one, and
 # to their node name otherwise.  (Since 5.2)
 #
+# @x-vcpu-dirty-limit-period: Periodic time (in milliseconds) of dirty limit 
during
+# live migration. Should be in the range 1 to 
1000ms,
+# defaults to 1000ms. (Since 8.1)
+#
 # Features:
 #
-# @unstable: Member @x-checkpoint-delay is experimental.
+# @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
+#are experimental.
 #
 # TODO: either fuse back into MigrationParameters, or make
 # MigrationParameters members mandatory
@@ -972,7 +983,9 @@
 '*multifd-compression': 'MultiFDCompression',
 '*multifd-zlib-level': 'uint8',
 '*multifd-zstd-level': 'uint8',
-'*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
+'*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
+'*x-vcpu-dirty-limit-period': { 'type': 'uint64',
+'features': [ 'unstable' ] } } }
 
 ##
 # @migrate-set-parameters:
@@ -1127,9 +1140,14 @@
 # Nodes are mapped to their block device name if there is one, and
 # to their node name otherwise.  (Since 5.2)
 #
+# @x-vcpu-dirty-limit-period: Periodic time (in milliseconds) of dirty limit 
during
+# live migration. Should be in the range 1 to 
1000ms,
+# defaults to 1000ms. (Since 8.1)
+#
 # Features:
 #
-# @unstable: Member @x-checkpoint-delay is experimental.
+# @unstable: Members @x-checkpoint-delay and @x-vcpu-dirty-limit-period
+#are experimental.
 #
 # Since: 2.4
 ##
@@ -1161,7 +1179,9 @@
 '*multifd-compression': 'MultiFDCompression',
 '*multifd-zlib-level': 'uint8',
 '*multifd-zstd-level': 'uint8',
-'*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
+'*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
+'*x-vcpu-dirty-limit-period': { 'type': 'uint64',
+'features': [ 'unstable' ] } } }
 
 ##
 # @query-migrate-parameters:
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 9885d7c9f7..352e9ec716 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -364,6 +364,10 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict 
*qdict)
 }
 }
 }
+
+monitor_printf(mon, "%s: %" PRIu64 " ms\n",
+MigrationParameter_str(MIGRATION_PARAMETER_X_VCPU_DIRTY_LIMIT_PERIOD),
+params->x_vcpu_dirty_limit_period);
 }
 
 qapi_free_MigrationParameters(params);
@@ -620,6 +624,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const 

[PULL 02/30] migration/multifd: Protect accesses to migration_threads

2023-06-21 Thread Juan Quintela
From: Fabiano Rosas 

This doubly linked list is common for all the multifd and migration
threads so we need to avoid concurrent access.

Add a mutex to protect the data from concurrent access. This fixes a
crash when removing two MigrationThread objects from the list at the
same time during cleanup of multifd threads.

Fixes: 671326201d ("migration: Introduce interface query-migrationthreads")
Signed-off-by: Fabiano Rosas 
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Message-Id: <20230607161306.31425-3-faro...@suse.de>
Signed-off-by: Juan Quintela 
---
 migration/threadinfo.h |  2 --
 migration/threadinfo.c | 15 ++-
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/migration/threadinfo.h b/migration/threadinfo.h
index 8aa6999d58..2f356ff312 100644
--- a/migration/threadinfo.h
+++ b/migration/threadinfo.h
@@ -10,8 +10,6 @@
  *  See the COPYING file in the top-level directory.
  */
 
-#include "qemu/queue.h"
-#include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-migration.h"
 
diff --git a/migration/threadinfo.c b/migration/threadinfo.c
index 3dd9b14ae6..262990dd75 100644
--- a/migration/threadinfo.c
+++ b/migration/threadinfo.c
@@ -10,23 +10,35 @@
  *  See the COPYING file in the top-level directory.
  */
 
+#include "qemu/osdep.h"
+#include "qemu/queue.h"
+#include "qemu/lockable.h"
 #include "threadinfo.h"
 
+QemuMutex migration_threads_lock;
 static QLIST_HEAD(, MigrationThread) migration_threads;
 
+static void __attribute__((constructor)) migration_threads_init(void)
+{
+qemu_mutex_init(&migration_threads_lock);
+}
+
 MigrationThread *migration_threads_add(const char *name, int thread_id)
 {
 MigrationThread *thread =  g_new0(MigrationThread, 1);
 thread->name = name;
 thread->thread_id = thread_id;
 
-QLIST_INSERT_HEAD(&migration_threads, thread, node);
+WITH_QEMU_LOCK_GUARD(&migration_threads_lock) {
+QLIST_INSERT_HEAD(&migration_threads, thread, node);
+}
 
 return thread;
 }
 
 void migration_threads_remove(MigrationThread *thread)
 {
+QEMU_LOCK_GUARD(&migration_threads_lock);
 if (thread) {
 QLIST_REMOVE(thread, node);
 g_free(thread);
@@ -39,6 +51,7 @@ MigrationThreadInfoList *qmp_query_migrationthreads(Error 
**errp)
MigrationThreadInfoList **tail = &head;
 MigrationThread *thread = NULL;
 
+QEMU_LOCK_GUARD(&migration_threads_lock);
 QLIST_FOREACH(thread, &migration_threads, node) {
 MigrationThreadInfo *info = g_new0(MigrationThreadInfo, 1);
 info->name = g_strdup(thread->name);
-- 
2.40.1




[PULL 00/30] Next patches

2023-06-21 Thread Juan Quintela
The following changes since commit 67fe6ae41da64368bc4936b196fee2bf61f8c720:

  Merge tag 'pull-tricore-20230621-1' of https://github.com/bkoppelmann/qemu 
into staging (2023-06-21 20:08:48 +0200)

are available in the Git repository at:

  https://gitlab.com/juan.quintela/qemu.git tags/next-pull-request

for you to fetch changes up to c53dc569d0a0fb76eaa83f353253a897914948f9:

  migration/rdma: Split qemu_fopen_rdma() into input/output functions 
(2023-06-22 02:45:30 +0200)


Migration Pull request (20230621)

In this pull request:

- fix for multifd thread creation (fabiano)
- dirtylimit (hyman)
  * migration-test will go on next PULL request, as it has failures.
- Improve error description (tejus)
- improve -incoming and set parameters before calling incoming (wei)
- migration atomic counters reviewed patches (quintela)
- migration-test refactoring reviewed (quintela)

Please apply.



Fabiano Rosas (2):
  migration/multifd: Rename threadinfo.c functions
  migration/multifd: Protect accesses to migration_threads

Hyman Huang(黄勇) (8):
  softmmu/dirtylimit: Add parameter check for hmp "set_vcpu_dirty_limit"
  qapi/migration: Introduce x-vcpu-dirty-limit-period parameter
  qapi/migration: Introduce vcpu-dirty-limit parameters
  migration: Introduce dirty-limit capability
  migration: Refactor auto-converge capability logic
  migration: Put the detection logic before auto-converge checking
  migration: Implement dirty-limit convergence algo
  migration: Extend query-migrate to provide dirty page limit info

Juan Quintela (16):
  migration-test: Be consistent for ppc
  migration-test: Make machine_opts regular with other options
  migration-test: Create arch_opts
  migration-test: machine_opts is really arch specific
  migration-test: Create kvm_opts
  migration-test: bootpath is the same for all tests and for all archs
  migration-test: Add bootfile_create/delete() functions
  migration-test: dirtylimit checks for x86_64 arch before
  migration-test: simplify shmem_opts handling
  qemu-file: Rename qemu_file_transferred_ fast -> noflush
  migration: Change qemu_file_transferred to noflush
  migration: Use qemu_file_transferred_noflush() for block migration.
  qemu_file: Make qemu_file_is_writable() static
  qemu-file: Simplify qemu_file_shutdown()
  qemu-file: Make qemu_file_get_error_obj() static
  migration/rdma: Split qemu_fopen_rdma() into input/output functions

Tejus GK (2):
  migration: Update error description whenever migration fails
  migration: Refactor repeated call of yank_unregister_instance

Wei Wang (2):
  migration: enforce multifd and postcopy preempt to be set before
incoming
  qtest/migration-tests.c: use "-incoming defer" for postcopy tests

 qapi/migration.json|  74 +---
 include/sysemu/dirtylimit.h|   2 +
 migration/options.h|   1 +
 migration/qemu-file.h  |  14 ++--
 migration/threadinfo.h |   7 +-
 migration/block.c  |   4 +-
 migration/migration-hmp-cmds.c |  26 +++
 migration/migration.c  |  40 +++
 migration/multifd.c|   4 +-
 migration/options.c|  87 +++
 migration/qemu-file.c  |  24 ++-
 migration/ram.c|  59 +---
 migration/rdma.c   |  39 +--
 migration/savevm.c |   6 +-
 migration/threadinfo.c |  19 -
 migration/vmstate.c|   4 +-
 softmmu/dirtylimit.c   |  97 +++---
 tests/qtest/migration-test.c   | 123 ++---
 migration/trace-events |   1 +
 19 files changed, 472 insertions(+), 159 deletions(-)


base-commit: 5f9dd6a8ce3961db4ce47411ed2097ad88bdf5fc
prerequisite-patch-id: 99c8bffa9428838925e330eb2881bab476122579
prerequisite-patch-id: 77ba427fd916aeb395e95aa0e7190f84e98e96ab
prerequisite-patch-id: 9983d46fa438d7075a37be883529e37ae41e4228
prerequisite-patch-id: 207f7529924b12dcb57f6557d6db6f79ceb2d682
prerequisite-patch-id: 5ad1799a13845dbf893a28a202b51a6b50d95d90
prerequisite-patch-id: c51959aacd6d65ee84fcd4f1b2aed3dd6f6af879
prerequisite-patch-id: da9dbb6799b2da002c0896574334920097e4c50a
prerequisite-patch-id: c1110ffafbaf5465fb277a20db809372291f7846
prerequisite-patch-id: 8307c92bedd07446214b35b40206eb6793a7384d
prerequisite-patch-id: 0a6106cd4a508d5e700a7ff6c25edfdd03c8ca3d
prerequisite-patch-id: 83205051de22382e75bf4acdf69e59315801fa0d
prerequisite-patch-id: 8c9b3cba89d555c071a410041e6da41806106a7e
prerequisite-patch-id: 0ff62a33b9a242226ccc1f5424a516de803c9fe5
prerequisite-patch-id: 25b8ae1ebe09ace14457c454cfcb23077c37346c
prerequisite-patch-id: 466ea91d5be41fe345dacd4d17bbbe5ce13118c2
prerequisite-patch-id: d1045858f9729ac62eccf2e83ebf95cfebae2cb5
prerequisite-patch-id: 0276ec02073bda5426de39e2f2e81eef080b4f5

[PULL 01/30] migration/multifd: Rename threadinfo.c functions

2023-06-21 Thread Juan Quintela
From: Fabiano Rosas 

We're about to add more functions to this file so make it use the same
coding style as the rest of the code.

Signed-off-by: Fabiano Rosas 
Reviewed-by: Juan Quintela 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Peter Xu 
Message-Id: <20230607161306.31425-2-faro...@suse.de>
Signed-off-by: Juan Quintela 
---
 migration/threadinfo.h | 5 ++---
 migration/migration.c  | 4 ++--
 migration/multifd.c| 4 ++--
 migration/threadinfo.c | 4 ++--
 4 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/migration/threadinfo.h b/migration/threadinfo.h
index 4d69423c0a..8aa6999d58 100644
--- a/migration/threadinfo.h
+++ b/migration/threadinfo.h
@@ -23,6 +23,5 @@ struct MigrationThread {
 QLIST_ENTRY(MigrationThread) node;
 };
 
-MigrationThread *MigrationThreadAdd(const char *name, int thread_id);
-
-void MigrationThreadDel(MigrationThread *info);
+MigrationThread *migration_threads_add(const char *name, int thread_id);
+void migration_threads_remove(MigrationThread *info);
diff --git a/migration/migration.c b/migration/migration.c
index dc05c6f6ea..3a001dd042 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2922,7 +2922,7 @@ static void *migration_thread(void *opaque)
 MigThrError thr_error;
 bool urgent = false;
 
-thread = MigrationThreadAdd("live_migration", qemu_get_thread_id());
+thread = migration_threads_add("live_migration", qemu_get_thread_id());
 
 rcu_register_thread();
 
@@ -3000,7 +3000,7 @@ static void *migration_thread(void *opaque)
 migration_iteration_finish(s);
 object_unref(OBJECT(s));
 rcu_unregister_thread();
-MigrationThreadDel(thread);
+migration_threads_remove(thread);
 return NULL;
 }
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 3387d8277f..4c6cee6547 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -651,7 +651,7 @@ static void *multifd_send_thread(void *opaque)
 int ret = 0;
 bool use_zero_copy_send = migrate_zero_copy_send();
 
-thread = MigrationThreadAdd(p->name, qemu_get_thread_id());
+thread = migration_threads_add(p->name, qemu_get_thread_id());
 
 trace_multifd_send_thread_start(p->id);
 rcu_register_thread();
@@ -767,7 +767,7 @@ out:
qemu_mutex_unlock(&p->mutex);
 
 rcu_unregister_thread();
-MigrationThreadDel(thread);
+migration_threads_remove(thread);
 trace_multifd_send_thread_end(p->id, p->num_packets, 
p->total_normal_pages);
 
 return NULL;
diff --git a/migration/threadinfo.c b/migration/threadinfo.c
index 1de8b31855..3dd9b14ae6 100644
--- a/migration/threadinfo.c
+++ b/migration/threadinfo.c
@@ -14,7 +14,7 @@
 
 static QLIST_HEAD(, MigrationThread) migration_threads;
 
-MigrationThread *MigrationThreadAdd(const char *name, int thread_id)
+MigrationThread *migration_threads_add(const char *name, int thread_id)
 {
 MigrationThread *thread =  g_new0(MigrationThread, 1);
 thread->name = name;
@@ -25,7 +25,7 @@ MigrationThread *MigrationThreadAdd(const char *name, int 
thread_id)
 return thread;
 }
 
-void MigrationThreadDel(MigrationThread *thread)
+void migration_threads_remove(MigrationThread *thread)
 {
 if (thread) {
 QLIST_REMOVE(thread, node);
-- 
2.40.1




Re: [PATCH v2 06/20] qemu_file: total_transferred is not used anymore

2023-06-21 Thread Juan Quintela
Peter Xu  wrote:
> On Tue, May 30, 2023 at 08:39:27PM +0200, Juan Quintela wrote:
>> Signed-off-by: Juan Quintela 
>> ---
>>  migration/qemu-file.c | 4 
>>  1 file changed, 4 deletions(-)
>> 
>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>> index eb0497e532..6b6deea19b 100644
>> --- a/migration/qemu-file.c
>> +++ b/migration/qemu-file.c
>> @@ -41,9 +41,6 @@ struct QEMUFile {
>>  QIOChannel *ioc;
>>  bool is_writable;
>>  
>> -/* The sum of bytes transferred on the wire */
>> -uint64_t total_transferred;
>> -
>>  int buf_index;
>>  int buf_size; /* 0 when writing */
>>  uint8_t buf[IO_BUF_SIZE];
>> @@ -287,7 +284,6 @@ void qemu_fflush(QEMUFile *f)
>>  qemu_file_set_error_obj(f, -EIO, local_error);
>>  } else {
>>  uint64_t size = iov_size(f->iov, f->iovcnt);
>> -f->total_transferred += size;
>
> I think this patch is another example why I think sometimes the way patch
> is split are pretty much adding more complexity on review...

It depends on taste.

You are doing one thing in way1.
Then you find a better way to do it, let's call it way2.

Now we have two options to see how we arrived there.

a- You add any declarations/definitions/initializations for way2
b- You write way2 alongside way1
c- You test that both ways give the same result, and you see that they
   give the same result.
d- You remove way1.

Or you squash the four patches into a single patch.  But then the reviewer
loses the place where one can see why it is the same as the old one.

Sometimes the longer way is better, sometimes the short one is.

Clearly we don't agree about what is the best way in this case.

> Here we removed a variable operation but it seems all fine if it's not used
> anywhere.  But it also means current code base (before this patch applied)
> doesn't make sense already because it contains this useless addition.  So
> IMHO it means some previous patch does it just wrong.

No.  It is how it is developed.  And it is about being respectful to the
reviewer, giving them enough information to do a proper review.

During the development of this series, there were lots of:

if (old_counter != new_counter)
   printf("");

traces were several thousand lines long.  If I have to review
that change, I would love any help the writer can give me.  That is why
it is done this way.

> I think it means it's caused by a wrong split of patches, then each patch
> stops to make much sense as a standalone one.

It stops making sense if you want each feature to be a single patch.
Before the patch no feature.  After the patch full feature.  That brings
us to very long patches.

What is easier to review (to do the same)?

a - 1 x 1000 lines patch
b - 10 x 100 lines patch

I will go with b any time.  Except if the split is arbitrary.

> I can go back and try to find whatever patch on the list that will explain
> this.  But it'll also go into git log.  Anyone reads this later will be
> confused once again.  Even harder for them to figure out what
> happened.

As said before, I completely disagree here.  And what is worse: if it
goes wrong, with your approach git bisect will not help as much as
with my approach.

> Do you think we could reorganize the patches so each of a single patch
> explains itself?

No.  See before.  We go from very spaghetti code to much less
spaghetti code.

> The other thing is about priority of patches - I still have ~80 patches
> pending reviews on migration only.. Would you think it makes sense we pick
> up important ones first and merge them with higher priority?

Ok, let's make this clear.
This whole atomic migration counters work started because the zero_page
detection in multifd had the counters so wrong that measuring speed
became impossible.

I haven't yet sent the multifd zero pages.  And why was it so
complicated?  Just off the top of my memory:

- how much data have we transferred.  Historically we stored that
  information on qemu-file.  But qemu-file can only be read/written from
  the migration thread.  So we had to jump through hoops to be able to
  update those values.

  Current upstream code for compressed multifd assumes that it transfers
  as much data as the non-compressed one.  Why?  Because we don't have an
  easy way to get that value back.  Contortions that we were trying to
  do:

  https://lore.kernel.org/all/20220802063907.18882-5-quint...@redhat.com/

  To summarize, the way that we had to do it was something like:

  - we send a bunch of pages to the multifd thread
  - the multifd thread sends the data and returns in the buffer how much it
    has written
  - the migration thread, when it reuses a buffer, adds the amount written
    the previous time the struct was used.

  This was not just problematic for multifd zero page detection.
  * compression was lying about it
  * zero_copy is doing it wrong (accounting at the time that it does the
write, not when it knows that it was written).

- rdma: this is even funnier
  * It accounted for 

Re: [PATCH v2 04/20] qemu-file: We only call qemu_file_transferred_* on the sending side

2023-06-21 Thread Juan Quintela
Peter Xu  wrote:
> On Tue, Jun 13, 2023 at 06:02:05PM +0200, Juan Quintela wrote:
>> Peter Xu  wrote:
>> > On Tue, May 30, 2023 at 08:39:25PM +0200, Juan Quintela wrote:
>> >> Remove the increase in qemu_file_fill_buffer() and add asserts to
>> >> qemu_file_transferred* functions.
>> >> 
>> >> Signed-off-by: Juan Quintela 
>> >
>> > The read side accounting does look a bit weird and never caught my notice..
>> >
>> > Maybe worth also touching the document of QEMUFile::total_transferred to
>> > clarify what it accounts?
>> >
>> > Reviewed-by: Peter Xu 
>> >
>> > Though when I'm looking at the counters (didn't follow every single recent
>> > patch on this..), I found that now reading transferred value is actually
>> > more expensive - qemu_file_transferred() needs flushing, even if for the
>> > fast version, qemu_file_transferred_fast() loops over all possible iovs,
>> > which can be as large as MAX_IOV_SIZE==64.
>> >
>> > To be explicit, I _think_ for each guest page we now need to flush...
>> >
>> >   ram_save_iterate
>> > migration_rate_exceeded
>> >   migration_transferred_bytes
>> > qemu_file_transferred
>> >
>> > I hope I'm wrong..
>> 
>> See patch 7:
>> 
>> diff --git a/migration/migration-stats.c b/migration/migration-stats.c
>> index 79eea8d865..1696185694 100644
>> --- a/migration/migration-stats.c
>> +++ b/migration/migration-stats.c
>> @@ -62,7 +62,7 @@ uint64_t migration_transferred_bytes(QEMUFile *f)
>>  {
>>  uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
>>  uint64_t rdma = stat64_get(&mig_stats.rdma_bytes);
>> -uint64_t qemu_file = qemu_file_transferred(f);
>> +uint64_t qemu_file = stat64_get(&mig_stats.qemu_file_transferred);
>>  
>>  trace_migration_transferred_bytes(qemu_file, multifd, rdma);
>>  return qemu_file + multifd + rdma;
>
> If this is a known regression, should we make a first patchset fix it and
> make it higher priority to merge?

This is the simplest way that I have found to get from A to B.
The reason why I didn't want to get into the atomic vars (and everybody
before me) was because it had so many changing bits.

And here we are, moving to single counters instead of several of them
took something like 200 patches.

> It seems this is even not mentioned in the cover letter.. while IMHO this
> is the most important bit to have in it..

My fault.

I cc'd Fiona on v1, and she confirmed that this fixed her problem.

This is the commit that introduces the slowdown.

commit 813cd61669e45ee6d5db09a83d03df8f0c6eb5d2
Author: Juan Quintela 
Date:   Mon May 15 21:57:01 2023 +0200

migration: Use migration_transferred_bytes() to calculate rate_limit

Signed-off-by: Juan Quintela 
Reviewed-by: Cédric Le Goater 
Message-Id: <20230515195709.63843-9-quint...@redhat.com>

Important bits:

-uint64_t rate_limit_used = stat64_get(&mig_stats.rate_limit_used);
+uint64_t rate_limit_start = stat64_get(&mig_stats.rate_limit_start);
+uint64_t rate_limit_current = migration_transferred_bytes(f);
+uint64_t rate_limit_used = rate_limit_current - rate_limit_start;
 uint64_t rate_limit_max = stat64_get(&mig_stats.rate_limit_max);

We moved from reading an atomic to calling qemu_file_transferred(), which
does the iovec dance.

This commit (on this series):

commit 524072cb5f5ce5605f1171f86ba0879405e4b9b3
Author: Juan Quintela 
Date:   Mon May 8 12:16:47 2023 +0200

migration: Use the number of transferred bytes directly

We only use migration_transferred_bytes() to calculate the rate_limit,
for that we don't need to flush whatever is on the qemu_file buffer.
Remember that the buffer is really small (normal case is 32K if we use
iov's can be 64 * TARGET_PAGE_SIZE), so this is not relevant to
calculations.

Signed-off-by: Juan Quintela 

diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index 79eea8d865..1696185694 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -62,7 +62,7 @@ uint64_t migration_transferred_bytes(QEMUFile *f)
 {
 uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
 uint64_t rdma = stat64_get(&mig_stats.rdma_bytes);
-uint64_t qemu_file = qemu_file_transferred(f);
+uint64_t qemu_file = stat64_get(&mig_stats.qemu_file_transferred);
 
 trace_migration_transferred_bytes(qemu_file, multifd, rdma);
 return qemu_file + multifd + rdma;

Undoes the damage.

And yes, before you ask, I got this wrong lots of times, and had to rebase
and change the order of patches several times O:-)

>> 
>> > Does it mean that perhaps we simply need "sent and put into send buffer"
>> > more than "what really got transferred"?  So I start to wonder what's the
>> > origianl purpose of this change, and which one is better..
>> 
>> That is basically what patch 5 and 6 do O:-)
>> 
>> Problem is arriving to something that is bisectable (for correctness)
>> and is easy to review.
>> 
>> And yes, my choices can be different from the ones tat you do.
>> 
>> The other reason 

Re: [PATCH 6/6] iotests: add test 314 for "qemu-img rebase" with compression

2023-06-21 Thread Denis V. Lunev

On 6/1/23 21:28, Andrey Drobyshev wrote:

The test cases considered so far:

1. Check that compression mode isn't compatible with "-f raw" (raw
format doesn't support compression).
2. Check that rebasing an image onto no backing file preserves the data
and writes the copied clusters actually compressed.
3. Same as 2, but with a raw backing file (i.e. the clusters copied from the
backing are originally uncompressed -- we check they end up compressed
after being merged).
4. Remove a single delta from a backing chain, perform the same checks
as in 2.
5. Check that even when backing and overlay are initially uncompressed,
copied clusters end up compressed when rebase with compression is
performed.

Signed-off-by: Andrey Drobyshev 
---
  tests/qemu-iotests/314 | 165 +
  tests/qemu-iotests/314.out |  75 +
  2 files changed, 240 insertions(+)
  create mode 100755 tests/qemu-iotests/314
  create mode 100644 tests/qemu-iotests/314.out

diff --git a/tests/qemu-iotests/314 b/tests/qemu-iotests/314
new file mode 100755
index 00..96d7b4d258
--- /dev/null
+++ b/tests/qemu-iotests/314
@@ -0,0 +1,165 @@
+#!/usr/bin/env bash
+# group: rw backing auto quick
+#
+# Test qemu-img rebase with compression
+#
+# Copyright (c) 2023 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+# creator
+owner=andrey.drobys...@virtuozzo.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+_cleanup()
+{
+_cleanup_test_img
+_rm_test_img "$TEST_IMG.base"
+_rm_test_img "$TEST_IMG.itmd"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+# Want the size divisible by 2 and 3
+size=$(( 48 * 1024 * 1024 ))
+half_size=$(( size / 2 ))
+third_size=$(( size / 3 ))
+
+# 1. "qemu-img rebase -c" should refuse working with any format which doesn't
+# support compression.  We only check "-f raw" here.
+echo
+echo "=== Testing compressed rebase format compatibility ==="
+echo
+
+$QEMU_IMG create -f raw "$TEST_IMG" "$size" | _filter_img_create
+$QEMU_IMG rebase -c -f raw -b "" "$TEST_IMG"
+
+# 2. Write the 1st half of $size to backing file (compressed), 2nd half -- to
+# the top image (also compressed).  Rebase the top image onto no backing file,
+# with compression (i.e. "qemu-img -c -b ''").  Check that the resulting image
+# has the written data preserved, and "qemu-img check" reports 100% clusters
+# as compressed.
+echo
+echo "=== Testing rebase with compression onto no backing file ==="
+echo
+
+TEST_IMG="$TEST_IMG.base" _make_test_img $size
+_make_test_img -b "$TEST_IMG.base" -F $IMGFMT $size
+
+$QEMU_IO -c "write -c -P 0xaa 0 $half_size" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c "write -c -P 0xbb $half_size $half_size" "$TEST_IMG" \
+| _filter_qemu_io
+
+$QEMU_IMG rebase -c -f $IMGFMT -b "" "$TEST_IMG"
+
+$QEMU_IO -c "read -P 0xaa 0 $half_size" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0xbb $half_size $half_size" "$TEST_IMG" | _filter_qemu_io
+
+$QEMU_IMG check "$TEST_IMG" | _filter_testdir
+
+# 3. Same as the previous one, but with raw backing file (hence write to
+# the backing is uncompressed).
+echo
+echo "=== Testing rebase with compression with raw backing file ==="
+echo
+
+$QEMU_IMG create -f raw "$TEST_IMG.base" "$half_size" | _filter_img_create
+_make_test_img -b "$TEST_IMG.base" -F raw $size
+
+$QEMU_IO -f raw -c "write -P 0xaa 0 $half_size" "$TEST_IMG.base" \
+| _filter_qemu_io
+$QEMU_IO -c "write -c -P 0xbb $half_size $half_size" \
+"$TEST_IMG" | _filter_qemu_io
+
+$QEMU_IMG rebase -c -f $IMGFMT -b "" "$TEST_IMG"
+
+$QEMU_IO -c "read -P 0xaa 0 $half_size" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0xbb $half_size $half_size" "$TEST_IMG" | _filter_qemu_io
+
+$QEMU_IMG check "$TEST_IMG" | _filter_testdir
+
+# 4. Create a backing chain base<--itmd<--img, filling 1st, 2nd and 3rd
+# thirds of them, respectively (with compression).  Rebase img onto base,
+# effectively deleting itmd from the chain, and check that written data is
+# preserved in the resulting image.  Also check that "qemu-img check" reports
+# 100% clusters as compressed.
+echo
+echo "=== Testing compressed rebase 

Re: [PATCH 4/6] qemu-img: rebase: avoid unnecessary COW operations

2023-06-21 Thread Denis V. Lunev

On 6/1/23 21:28, Andrey Drobyshev wrote:

When rebasing an image from one backing file to another, we need to
compare data from old and new backings.  If the diff between that data
happens to be unaligned to the target cluster size, we might end up
doing partial writes, which would lead to copy-on-write and additional IO.

Consider the following simple case (virtual_size == cluster_size == 64K):

base <-- inc1 <-- inc2

qemu-io -c "write -P 0xaa 0 32K" base.qcow2
qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2

While doing the rebase, we'll write half of the cluster to inc2, and the
block layer will have to read the 2nd half of the same cluster from the
backing image inc1 while doing this write operation, although the whole
cluster was already read earlier to perform the data comparison.

In order to avoid these unnecessary IO cycles, let's make sure every
write request is aligned to the overlay cluster size.

Signed-off-by: Andrey Drobyshev 
---
  qemu-img.c | 72 +++---
  1 file changed, 52 insertions(+), 20 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 60f4c06487..9a469cd609 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3513,6 +3513,7 @@ static int img_rebase(int argc, char **argv)
  uint8_t *buf_new = NULL;
  BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
  BlockDriverState *unfiltered_bs;
+BlockDriverInfo bdi = {0};
  char *filename;
  const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
  int c, flags, src_flags, ret;
@@ -3646,6 +3647,15 @@ static int img_rebase(int argc, char **argv)
  }
  }
  
+/* We need overlay cluster size to make sure write requests are aligned */

+ret = bdrv_get_info(unfiltered_bs, &bdi);
+if (ret < 0) {
+error_report("could not get block driver info");
+goto out;
+} else if (bdi.cluster_size == 0) {
+bdi.cluster_size = 1;
+}
+
  /* For safe rebasing we need to compare old and new backing file */
  if (!unsafe) {
  QDict *options = NULL;
@@ -3744,6 +3754,7 @@ static int img_rebase(int argc, char **argv)
  int64_t new_backing_size = 0;
  uint64_t offset;
  int64_t n;
+int64_t n_old = 0, n_new = 0;
  float local_progress = 0;
  
  buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);

@@ -3784,7 +3795,7 @@ static int img_rebase(int argc, char **argv)
  }
  
  for (offset = 0; offset < size; offset += n) {

-bool buf_old_is_zero = false;
+bool old_backing_eof = false;
  
  /* How many bytes can we handle with the next read? */

  n = MIN(IO_BUF_SIZE, size - offset);
@@ -3829,33 +3840,38 @@ static int img_rebase(int argc, char **argv)
  }
  }
  
+/* At this point n must be aligned to the target cluster size. */

Minor: except the last non-aligned cluster, as stated by the 'if' :)

+if (offset + n < size) {
+assert(n % bdi.cluster_size == 0);
+}
+
+/*
+ * Much like the with the target image, we'll try to read as much
+ * of the old and new backings as we can.
+ */
+n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
+if (blk_new_backing) {
+n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset));
+}
+
  /*
   * Read old and new backing file and take into consideration that
   * backing files may be smaller than the COW image.
   */
-if (offset >= old_backing_size) {
-memset(buf_old, 0, n);
-buf_old_is_zero = true;
+memset(buf_old + n_old, 0, n - n_old);
+if (!n_old) {
+old_backing_eof = true;
  } else {
-if (offset + n > old_backing_size) {
-n = old_backing_size - offset;
-}
-
-ret = blk_pread(blk_old_backing, offset, n, buf_old, 0);
+ret = blk_pread(blk_old_backing, offset, n_old, buf_old, 0);
  if (ret < 0) {
  error_report("error while reading from old backing file");
  goto out;
  }
  }
  
-if (offset >= new_backing_size || !blk_new_backing) {

-memset(buf_new, 0, n);
-} else {
-if (offset + n > new_backing_size) {
-n = new_backing_size - offset;
-}
-
-ret = blk_pread(blk_new_backing, offset, n, buf_new, 0);
+memset(buf_new + n_new, 0, n - n_new);
+if (blk_new_backing && n_new) {
+ret = blk_pread(blk_new_backing, offset, n_new, 

Re: [PATCH 2/6] qemu-iotests: 024: add rebasing test case for overlay_size > backing_size

2023-06-21 Thread Denis V. Lunev

On 6/1/23 21:28, Andrey Drobyshev wrote:

Before the previous commit, rebase was getting infinitely stuck when
rebasing within the same backing chain and overlay_size > backing_size.
Let's add this case to rebase test 024 to make sure it doesn't
break again.

Signed-off-by: Andrey Drobyshev 
---
  tests/qemu-iotests/024 | 57 ++
  tests/qemu-iotests/024.out | 30 
  2 files changed, 87 insertions(+)

diff --git a/tests/qemu-iotests/024 b/tests/qemu-iotests/024
index 25a564a150..98a7c8fd65 100755
--- a/tests/qemu-iotests/024
+++ b/tests/qemu-iotests/024
@@ -199,6 +199,63 @@ echo
  # $BASE_OLD and $BASE_NEW)
  $QEMU_IMG map "$OVERLAY" | _filter_qemu_img_map
  
+# Check that rebase within the chain is working when

+# overlay_size > old_backing_size
+#
+# base_new <-- base_old <-- overlay
+#
+# Backing (new): 11 11 11 11 11
+# Backing (old): 22 22 22 22
+# Overlay:   -- -- -- -- --
+#
+# As a result, overlay should contain data identical to base_old, with the
+# last cluster remaining unallocated.
+
+echo
+echo "=== Test rebase within one backing chain ==="
+echo
+
+echo "Creating backing chain"
+echo
+
+TEST_IMG=$BASE_NEW _make_test_img $(( CLUSTER_SIZE * 5 ))
+TEST_IMG=$BASE_OLD _make_test_img -b "$BASE_NEW" -F $IMGFMT \
+$(( CLUSTER_SIZE * 4 ))
+TEST_IMG=$OVERLAY _make_test_img -b "$BASE_OLD" -F $IMGFMT \
+$(( CLUSTER_SIZE * 5 ))
+
+echo
+echo "Fill backing files with data"
+echo
+
+$QEMU_IO "$BASE_NEW" -c "write -P 0x11 0 $(( CLUSTER_SIZE * 5 ))" \
+| _filter_qemu_io
+$QEMU_IO "$BASE_OLD" -c "write -P 0x22 0 $(( CLUSTER_SIZE * 4 ))" \
+| _filter_qemu_io
+
+echo
+echo "Check the last cluster is zeroed in overlay before the rebase"
+echo
+$QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
+| _filter_qemu_io
+
+echo
+echo "Rebase onto another image in the same chain"
+echo
+
+$QEMU_IMG rebase -b "$BASE_NEW" -F $IMGFMT "$OVERLAY"
+
+echo "Verify that data is read the same before and after rebase"
+echo
+
+# Verify the first 4 clusters are still read the same as in the old base
+$QEMU_IO "$OVERLAY" -c "read -P 0x22 0 $(( CLUSTER_SIZE * 4 ))" \
+| _filter_qemu_io
+# Verify the last cluster still reads as zeroes
+$QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
+| _filter_qemu_io
+
+echo
  
  # success, all done

  echo "*** done"
diff --git a/tests/qemu-iotests/024.out b/tests/qemu-iotests/024.out
index 973a5a3711..245fe8b1d1 100644
--- a/tests/qemu-iotests/024.out
+++ b/tests/qemu-iotests/024.out
@@ -171,4 +171,34 @@ read 65536/65536 bytes at offset 196608
  Offset  Length  File
  0   0x3 TEST_DIR/subdir/t.IMGFMT
  0x3 0x1 TEST_DIR/subdir/t.IMGFMT.base_new
+
+=== Test rebase within one backing chain ===
+
+Creating backing chain
+
+Formatting 'TEST_DIR/subdir/t.IMGFMT.base_new', fmt=IMGFMT size=327680
+Formatting 'TEST_DIR/subdir/t.IMGFMT.base_old', fmt=IMGFMT size=262144 
backing_file=TEST_DIR/subdir/t.IMGFMT.base_new backing_fmt=IMGFMT
+Formatting 'TEST_DIR/subdir/t.IMGFMT', fmt=IMGFMT size=327680 
backing_file=TEST_DIR/subdir/t.IMGFMT.base_old backing_fmt=IMGFMT
+
+Fill backing files with data
+
+wrote 327680/327680 bytes at offset 0
+320 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 262144/262144 bytes at offset 0
+256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Check the last cluster is zeroed in overlay before the rebase
+
+read 65536/65536 bytes at offset 262144
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Rebase onto another image in the same chain
+
+Verify that data is read the same before and after rebase
+
+read 262144/262144 bytes at offset 0
+256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 262144
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
  *** done

For the clarity:
Reviewed-by: Denis V. Lunev 



Re: [PATCH 5/6] qemu-img: add compression option to rebase subcommand

2023-06-21 Thread Denis V. Lunev

On 6/1/23 21:28, Andrey Drobyshev wrote:

If we rebase an image whose backing file has compressed clusters, we
might end up wasting disk space since the copied clusters are now
uncompressed.  In order to have better control over this, let's add
"--compress" option to the "qemu-img rebase" command.

Note that this option affects only the clusters which are actually being
copied from the original backing file.  The clusters which were
uncompressed in the target image will remain so.

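A usage sketch (the image names are placeholders):

  # merge the clusters that differ from the old backing into top.qcow2,
  # writing the copied clusters compressed
  qemu-img rebase -c -f qcow2 -b new_base.qcow2 -F qcow2 top.qcow2
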
Signed-off-by: Andrey Drobyshev 
---
  docs/tools/qemu-img.rst |  6 --
  qemu-img-cmds.hx|  4 ++--
  qemu-img.c  | 19 +--
  3 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index 15aeddc6d8..973a912dec 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -663,7 +663,7 @@ Command description:
  
List, apply, create or delete snapshots in image *FILENAME*.
  
-.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-p] [-u] -b BACKING_FILE [-F BACKING_FMT] FILENAME

+.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] [-c] -b BACKING_FILE [-F BACKING_FMT] FILENAME
  
Changes the backing file of an image. Only the formats ``qcow2`` and

``qed`` support changing the backing file.
@@ -690,7 +690,9 @@ Command description:
  
  In order to achieve this, any clusters that differ between

  *BACKING_FILE* and the old backing file of *FILENAME* are merged
-into *FILENAME* before actually changing the backing file.
+into *FILENAME* before actually changing the backing file. With ``-c``
+option specified, the clusters which are being merged (but not the
+entire *FILENAME* image) are written in the compressed mode.
  
  Note that the safe mode is an expensive operation, comparable to

  converting an image. It only works if the old backing file still
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 1b1dab5b17..068692d13e 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -88,9 +88,9 @@ SRST
  ERST
  
  DEF("rebase", img_rebase,

-"rebase [--object objectdef] [--image-opts] [-U] [-q] [-f fmt] [-t cache] [-T 
src_cache] [-p] [-u] -b backing_file [-F backing_fmt] filename")
+"rebase [--object objectdef] [--image-opts] [-U] [-q] [-f fmt] [-t cache] [-T 
src_cache] [-p] [-u] [-c] -b backing_file [-F backing_fmt] filename")
  SRST
-.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] -b BACKING_FILE [-F BACKING_FMT] FILENAME
+.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] [-c] -b BACKING_FILE [-F BACKING_FMT] FILENAME
  ERST
  
  DEF("resize", img_resize,

diff --git a/qemu-img.c b/qemu-img.c
index 9a469cd609..108da27b23 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3517,11 +3517,13 @@ static int img_rebase(int argc, char **argv)
  char *filename;
  const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
  int c, flags, src_flags, ret;
+BdrvRequestFlags write_flags = 0;
  bool writethrough, src_writethrough;
  int unsafe = 0;
  bool force_share = false;
  int progress = 0;
  bool quiet = false;
+bool compress = false;
  Error *local_err = NULL;
  bool image_opts = false;
  
@@ -3537,9 +3539,10 @@ static int img_rebase(int argc, char **argv)

  {"object", required_argument, 0, OPTION_OBJECT},
  {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
  {"force-share", no_argument, 0, 'U'},
+{"compress", no_argument, 0, 'c'},
  {0, 0, 0, 0}
  };
-c = getopt_long(argc, argv, ":hf:F:b:upt:T:qU",
+c = getopt_long(argc, argv, ":hf:F:b:upt:T:qUc",
  long_options, NULL);
  if (c == -1) {
  break;
@@ -3587,6 +3590,9 @@ static int img_rebase(int argc, char **argv)
  case 'U':
  force_share = true;
  break;
+case 'c':
+compress = true;
+break;
  }
  }
  
@@ -3639,6 +3645,14 @@ static int img_rebase(int argc, char **argv)
  
  unfiltered_bs = bdrv_skip_filters(bs);
  
+if (compress && !block_driver_can_compress(unfiltered_bs->drv)) {

+error_report("Compression not supported for this file format");
+ret = -1;
+goto out;
+} else if (compress) {
+write_flags |= BDRV_REQ_WRITE_COMPRESSED;
+}
+


minor nit-picking. Should we get a global
if (compress) {
  if (!block_driver_can_compress(unfiltered_bs->drv)) {
 report_error
 goto out;
  }
  write_flags |= BDRV_REQ_WRITE_COMPRESSED;
}


  if (out_basefmt != NULL) {
  if (bdrv_find_format(out_basefmt) == NULL) {
  error_report("Invalid format name: '%s'", out_basefmt);
@@ -3903,7 

Re: [PATCH 1/6] qemu-img: rebase: stop when reaching EOF of old backing file

2023-06-21 Thread Denis V. Lunev

On 6/1/23 21:28, Andrey Drobyshev wrote:

In case when we're rebasing within one backing chain, and when target image
is larger than old backing file, bdrv_is_allocated_above() ends up setting
*pnum = 0.  As a result, target offset isn't getting incremented, and we
get stuck in an infinite for loop.  Let's detect this case and proceed
further down the loop body, as the offsets beyond the old backing size need
to be explicitly zeroed.

Signed-off-by: Andrey Drobyshev 
---
  qemu-img.c | 13 -
  1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index 27f48051b0..78433f3746 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3801,6 +3801,8 @@ static int img_rebase(int argc, char **argv)
  }
  
  if (prefix_chain_bs) {

+uint64_t bytes = n;
+
  /*
   * If cluster wasn't changed since prefix_chain, we don't need
   * to take action
@@ -3813,9 +3815,18 @@ static int img_rebase(int argc, char **argv)
   strerror(-ret));
  goto out;
  }
-if (!ret) {
+if (!ret && n) {
  continue;
  }
+if (!n) {
+/*
+ * If we've reached EOF of the old backing, it means that
+ * offsets beyond the old backing size were read as zeroes.
+ * Now we will need to explicitly zero the cluster in
+ * order to preserve that state after the rebase.
+ */
+n = bytes;
+}
  }
  
  /*

For clarity:
Reviewed-by: Denis V. Lunev 



Re: [PATCH 3/6] qemu-img: rebase: use backing files' BlockBackend for buffer alignment

2023-06-21 Thread Denis V. Lunev

On 6/1/23 21:28, Andrey Drobyshev wrote:

Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for
the data read from the old and new backing files are aligned using
BlockDriverState (or BlockBackend later on) referring to the target image.
However, this isn't quite right, because target image is only being
written to and has nothing to do with those buffers.  Let's fix that.

Signed-off-by: Andrey Drobyshev 
---
  qemu-img.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 78433f3746..60f4c06487 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3746,8 +3746,8 @@ static int img_rebase(int argc, char **argv)
  int64_t n;
  float local_progress = 0;
  
-buf_old = blk_blockalign(blk, IO_BUF_SIZE);

-buf_new = blk_blockalign(blk, IO_BUF_SIZE);
+buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
+buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE);
  
  size = blk_getlength(blk);

  if (size < 0) {

Reviewed-by: Denis V. Lunev 



Re: [PATCH 2/3] qemu-img: map: report compressed data blocks

2023-06-21 Thread Denis V. Lunev

On 6/7/23 17:26, Andrey Drobyshev wrote:

Right now "qemu-img map" reports compressed blocks as containing data
but having no host offset.  This is not very informative.  Instead,
let's add another boolean field named "compressed" in case JSON output
mode is specified.  This is achieved by utilizing new allocation status
flag BDRV_BLOCK_COMPRESSED for bdrv_block_status().

Signed-off-by: Andrey Drobyshev 
---
  qapi/block-core.json |  7 +--
  qemu-img.c   | 16 +---
  2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 5dd5f7e4b0..bc6653e5d6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -409,6 +409,9 @@
  #
  # @zero: whether the virtual blocks read as zeroes
  #
+# @compressed: true indicates that data is stored compressed (the target
+# format must support compression)
+#
  # @depth: number of layers (0 = top image, 1 = top image's backing
  # file, ..., n - 1 = bottom image (where n is the number of images
  # in the chain)) before reaching one for which the range is
@@ -426,8 +429,8 @@
  ##
  { 'struct': 'MapEntry',
'data': {'start': 'int', 'length': 'int', 'data': 'bool',
-   'zero': 'bool', 'depth': 'int', 'present': 'bool',
-   '*offset': 'int', '*filename': 'str' } }
+   'zero': 'bool', 'compressed': 'bool', 'depth': 'int',
+   'present': 'bool', '*offset': 'int', '*filename': 'str' } }

After some thought I would say that, for compatibility reasons, it would be
beneficial to make the compressed field optional.
  
  ##

  # @BlockdevCacheInfo:
diff --git a/qemu-img.c b/qemu-img.c
index 27f48051b0..9bb69f58f6 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3083,7 +3083,7 @@ static int img_info(int argc, char **argv)
  }
  
  static int dump_map_entry(OutputFormat output_format, MapEntry *e,

-  MapEntry *next)
+  MapEntry *next, bool can_compress)
  {
  switch (output_format) {
  case OFORMAT_HUMAN:
@@ -3112,6 +3112,9 @@ static int dump_map_entry(OutputFormat output_format, 
MapEntry *e,
 e->present ? "true" : "false",
 e->zero ? "true" : "false",
 e->data ? "true" : "false");
+if (can_compress) {
+printf(", \"compressed\": %s", e->compressed ? "true" : "false");

If the compressed field is optional, then it would be reasonable to skip
filling this field for non-compressed clusters. In that case we would not
need the 'can_compress' parameter of the call.

Ha! More importantly: the field is mandatory according to the schema, while
it is reported conditionally, i.e. the field is optional in reality. There
is a problem one way or the other.
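
To make the suggestion concrete, a minimal sketch of the fill path with an
optional member (assuming the schema is changed to '*compressed': 'bool',
so that QAPI generates a has_compressed flag; untested):

    /* in get_block_status(): only report the field where it applies */
    if (ret & BDRV_BLOCK_COMPRESSED) {
        e->has_compressed = true;
        e->compressed = true;
    }

With that, dump_map_entry() can simply test e->has_compressed and the
can_compress parameter becomes unnecessary.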


+}
  if (e->has_offset) {
  printf(", \"offset\": %"PRId64"", e->offset);
  }
@@ -3172,6 +3175,7 @@ static int get_block_status(BlockDriverState *bs, int64_t 
offset,
  .length = bytes,
  .data = !!(ret & BDRV_BLOCK_DATA),
  .zero = !!(ret & BDRV_BLOCK_ZERO),
+.compressed = !!(ret & BDRV_BLOCK_COMPRESSED),
  .offset = map,
  .has_offset = has_offset,
  .depth = depth,
@@ -3189,6 +3193,7 @@ static inline bool entry_mergeable(const MapEntry *curr, 
const MapEntry *next)
  }
  if (curr->zero != next->zero ||
  curr->data != next->data ||
+curr->compressed != next->compressed ||
  curr->depth != next->depth ||
  curr->present != next->present ||
  !curr->filename != !next->filename ||
@@ -3218,6 +3223,7 @@ static int img_map(int argc, char **argv)
  bool force_share = false;
  int64_t start_offset = 0;
  int64_t max_length = -1;
+bool can_compress = false;
  
  fmt = NULL;

  output = NULL;
@@ -3313,6 +3319,10 @@ static int img_map(int argc, char **argv)
  length = MIN(start_offset + max_length, length);
  }
  
+if (output_format == OFORMAT_JSON) {

+can_compress = block_driver_can_compress(bs->drv);
+}
+
  curr.start = start_offset;
  while (curr.start + curr.length < length) {
  int64_t offset = curr.start + curr.length;
@@ -3330,7 +3340,7 @@ static int img_map(int argc, char **argv)
  }
  
  if (curr.length > 0) {

-ret = dump_map_entry(output_format, , );
+ret = dump_map_entry(output_format, , , can_compress);
  if (ret < 0) {
  goto out;
  }
@@ -3338,7 +3348,7 @@ static int img_map(int argc, char **argv)
  curr = next;
  }
  
-ret = dump_map_entry(output_format, , NULL);

+ret = dump_map_entry(output_format, , NULL, can_compress);
  if (output_format == OFORMAT_JSON) {
  puts("]");
  }
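
For illustration only (hand-composed, not actual tool output, and the exact
field order may differ), a JSON map entry with the new field could then look
like:

    { "start": 0, "length": 65536, "depth": 0, "present": true,
      "zero": false, "data": true, "compressed": true, "offset": 327680}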





Re: [PATCH 1/3] block: add BDRV_BLOCK_COMPRESSED flag for bdrv_block_status()

2023-06-21 Thread Andrey Drobyshev
On 6/21/23 20:46, Denis V. Lunev wrote:
> On 6/21/23 19:08, Denis V. Lunev wrote:
>> On 6/7/23 17:26, Andrey Drobyshev wrote:
>>> Functions qcow2_get_host_offset(), get_cluster_offset() explicitly
>>> report compressed cluster types when data is compressed. However, this
>>> information is never passed further.  Let's make use of it by adding new
>>> BDRV_BLOCK_COMPRESSED flag for bdrv_block_status(), so that caller may
>>> know that the data range is compressed.  In particular, we're going to
>>> use this flag to tweak "qemu-img map" output.
>>>
>>> This new flag is only being utilized by qcow and qcow2 formats, as only
>>> these two support compression.
>>>
>>> Signed-off-by: Andrey Drobyshev 
>>> ---
>>>   block/qcow.c | 5 -
>>>   block/qcow2.c    | 3 +++
>>>   include/block/block-common.h | 3 +++
>>>   3 files changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/qcow.c b/block/qcow.c
>>> index 3644bbf5cb..8416bcc2c3 100644
>>> --- a/block/qcow.c
>>> +++ b/block/qcow.c
>>> @@ -549,7 +549,10 @@ qcow_co_block_status(BlockDriverState *bs, bool
>>> want_zero,
>>>   if (!cluster_offset) {
>>>   return 0;
>>>   }
>>> -    if ((cluster_offset & QCOW_OFLAG_COMPRESSED) || s->crypto) {
>>> +    if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
>>> +    return BDRV_BLOCK_DATA | BDRV_BLOCK_COMPRESSED;
>>> +    }
>>> +    if (s->crypto) {
>>>   return BDRV_BLOCK_DATA;
>>>   }
>>>   *map = cluster_offset | index_in_cluster;
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index e23edd48c2..8e01adc610 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -2162,6 +2162,9 @@ qcow2_co_block_status(BlockDriverState *bs,
>>> bool want_zero, int64_t offset,
>>>   {
>>>   status |= BDRV_BLOCK_RECURSE;
>>>   }
>>> +    if (type == QCOW2_SUBCLUSTER_COMPRESSED) {
>>> +    status |= BDRV_BLOCK_COMPRESSED;
>>> +    }
>>>   return status;
>>>   }
>>>   diff --git a/include/block/block-common.h
>>> b/include/block/block-common.h
>>> index e15395f2cb..f7a4e7d4db 100644
>>> --- a/include/block/block-common.h
>>> +++ b/include/block/block-common.h
>>> @@ -282,6 +282,8 @@ typedef enum {
>>>    *   layer rather than any backing, set by
>>> block layer
>>>    * BDRV_BLOCK_EOF: the returned pnum covers through end of file for
>>> this
>>>    * layer, set by block layer
>>> + * BDRV_BLOCK_COMPRESSED: the underlying data is compressed; only
>>> valid for
>>> + *    the formats supporting compression: qcow,
>>> qcow2
>>>    *
>>>    * Internal flags:
>>>    * BDRV_BLOCK_RAW: for use by passthrough drivers, such as raw, to
>>> request
>>> @@ -317,6 +319,7 @@ typedef enum {
>>>   #define BDRV_BLOCK_ALLOCATED    0x10
>>>   #define BDRV_BLOCK_EOF  0x20
>>>   #define BDRV_BLOCK_RECURSE  0x40
>>> +#define BDRV_BLOCK_COMPRESSED   0x80
>>>     typedef QTAILQ_HEAD(BlockReopenQueue, BlockReopenQueueEntry)
>>> BlockReopenQueue;
>> Reviewed-by: Denis V. Lunev 
> Looking into the second patch I have found that I was a bit too fast here :)
> 
> The comment is misleading and the patch is incomplete.
> 
> static inline bool TSA_NO_TSA block_driver_can_compress(BlockDriver *drv)
> {
>     return drv->bdrv_co_pwritev_compressed ||
>    drv->bdrv_co_pwritev_compressed_part;
> }
> 
> which means that
> 
>    1    257  block/copy-on-read.c <>
>  .bdrv_co_pwritev_compressed = cor_co_pwritev_compressed,
>    2   1199  block/qcow.c <>
>  .bdrv_co_pwritev_compressed = qcow_co_pwritev_compressed,
>    3    255  block/throttle.c <>
>  .bdrv_co_pwritev_compressed = throttle_co_pwritev_compressed,
>    4   3108  block/vmdk.c <>
>  .bdrv_co_pwritev_compressed = vmdk_co_pwritev_compressed,
>   1   6121  block/qcow2.c <>
> .bdrv_co_pwritev_compressed_part...cow2_co_pwritev_compressed_part,
> 
> We have missed at least VMDK images.

Thanks, my bad, I didn't check that more thoroughly; I was mainly looking
at the docs when drawing this conclusion.

man qemu-img:
> Only the formats qcow and qcow2 support  compression.  The  com‐
> pression  is  read-only. It means that if a compressed sector is
> rewritten, then it is rewritten as uncompressed data.

Apparently the man pages have become a bit outdated on this point.



Re: [PATCH 1/3] block: add BDRV_BLOCK_COMPRESSED flag for bdrv_block_status()

2023-06-21 Thread Denis V. Lunev

On 6/21/23 19:08, Denis V. Lunev wrote:

On 6/7/23 17:26, Andrey Drobyshev wrote:

Functions qcow2_get_host_offset(), get_cluster_offset() explicitly
report compressed cluster types when data is compressed. However, this
information is never passed further.  Let's make use of it by adding new
BDRV_BLOCK_COMPRESSED flag for bdrv_block_status(), so that caller may
know that the data range is compressed.  In particular, we're going to
use this flag to tweak "qemu-img map" output.

This new flag is only being utilized by qcow and qcow2 formats, as only
these two support compression.

Signed-off-by: Andrey Drobyshev 
---
  block/qcow.c | 5 -
  block/qcow2.c    | 3 +++
  include/block/block-common.h | 3 +++
  3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/qcow.c b/block/qcow.c
index 3644bbf5cb..8416bcc2c3 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -549,7 +549,10 @@ qcow_co_block_status(BlockDriverState *bs, bool 
want_zero,

  if (!cluster_offset) {
  return 0;
  }
-    if ((cluster_offset & QCOW_OFLAG_COMPRESSED) || s->crypto) {
+    if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
+    return BDRV_BLOCK_DATA | BDRV_BLOCK_COMPRESSED;
+    }
+    if (s->crypto) {
  return BDRV_BLOCK_DATA;
  }
  *map = cluster_offset | index_in_cluster;
diff --git a/block/qcow2.c b/block/qcow2.c
index e23edd48c2..8e01adc610 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2162,6 +2162,9 @@ qcow2_co_block_status(BlockDriverState *bs, 
bool want_zero, int64_t offset,

  {
  status |= BDRV_BLOCK_RECURSE;
  }
+    if (type == QCOW2_SUBCLUSTER_COMPRESSED) {
+    status |= BDRV_BLOCK_COMPRESSED;
+    }
  return status;
  }
  diff --git a/include/block/block-common.h 
b/include/block/block-common.h

index e15395f2cb..f7a4e7d4db 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -282,6 +282,8 @@ typedef enum {
   *   layer rather than any backing, set by 
block layer
   * BDRV_BLOCK_EOF: the returned pnum covers through end of file for 
this

   * layer, set by block layer
+ * BDRV_BLOCK_COMPRESSED: the underlying data is compressed; only 
valid for
+ *    the formats supporting compression: qcow, 
qcow2

   *
   * Internal flags:
   * BDRV_BLOCK_RAW: for use by passthrough drivers, such as raw, to 
request

@@ -317,6 +319,7 @@ typedef enum {
  #define BDRV_BLOCK_ALLOCATED    0x10
  #define BDRV_BLOCK_EOF  0x20
  #define BDRV_BLOCK_RECURSE  0x40
+#define BDRV_BLOCK_COMPRESSED   0x80
    typedef QTAILQ_HEAD(BlockReopenQueue, BlockReopenQueueEntry) 
BlockReopenQueue;

Reviewed-by: Denis V. Lunev 

Looking into the second patch I have found that I was a bit too fast here :)

The comment is misleading and the patch is incomplete.

static inline bool TSA_NO_TSA block_driver_can_compress(BlockDriver *drv)
{
    return drv->bdrv_co_pwritev_compressed ||
   drv->bdrv_co_pwritev_compressed_part;
}

which means that

   1    257  block/copy-on-read.c <>
 .bdrv_co_pwritev_compressed = cor_co_pwritev_compressed,
   2   1199  block/qcow.c <>
 .bdrv_co_pwritev_compressed = qcow_co_pwritev_compressed,
   3    255  block/throttle.c <>
 .bdrv_co_pwritev_compressed = throttle_co_pwritev_compressed,
   4   3108  block/vmdk.c <>
 .bdrv_co_pwritev_compressed = vmdk_co_pwritev_compressed,
  1   6121  block/qcow2.c <>
.bdrv_co_pwritev_compressed_part...cow2_co_pwritev_compressed_part,

We have missed at least VMDK images.
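
One possible way to reword the flag comment so that it does not hard-code the
format list (only a sketch of the idea, not proposed final wording):

 * BDRV_BLOCK_COMPRESSED: the underlying data is compressed; only valid for
 *                        formats whose drivers implement compressed writes
 *                        (see block_driver_can_compress()), e.g. qcow,
 *                        qcow2, vmdk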



Re: [PATCH 1/3] block: add BDRV_BLOCK_COMPRESSED flag for bdrv_block_status()

2023-06-21 Thread Denis V. Lunev

On 6/7/23 17:26, Andrey Drobyshev wrote:

Functions qcow2_get_host_offset(), get_cluster_offset() explicitly
report compressed cluster types when data is compressed.  However, this
information is never passed further.  Let's make use of it by adding new
BDRV_BLOCK_COMPRESSED flag for bdrv_block_status(), so that caller may
know that the data range is compressed.  In particular, we're going to
use this flag to tweak "qemu-img map" output.

This new flag is only being utilized by qcow and qcow2 formats, as only
these two support compression.

Signed-off-by: Andrey Drobyshev 
---
  block/qcow.c | 5 -
  block/qcow2.c| 3 +++
  include/block/block-common.h | 3 +++
  3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/qcow.c b/block/qcow.c
index 3644bbf5cb..8416bcc2c3 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -549,7 +549,10 @@ qcow_co_block_status(BlockDriverState *bs, bool want_zero,
  if (!cluster_offset) {
  return 0;
  }
-if ((cluster_offset & QCOW_OFLAG_COMPRESSED) || s->crypto) {
+if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
+return BDRV_BLOCK_DATA | BDRV_BLOCK_COMPRESSED;
+}
+if (s->crypto) {
  return BDRV_BLOCK_DATA;
  }
  *map = cluster_offset | index_in_cluster;
diff --git a/block/qcow2.c b/block/qcow2.c
index e23edd48c2..8e01adc610 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2162,6 +2162,9 @@ qcow2_co_block_status(BlockDriverState *bs, bool 
want_zero, int64_t offset,
  {
  status |= BDRV_BLOCK_RECURSE;
  }
+if (type == QCOW2_SUBCLUSTER_COMPRESSED) {
+status |= BDRV_BLOCK_COMPRESSED;
+}
  return status;
  }
  
diff --git a/include/block/block-common.h b/include/block/block-common.h

index e15395f2cb..f7a4e7d4db 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -282,6 +282,8 @@ typedef enum {
   *   layer rather than any backing, set by block layer
   * BDRV_BLOCK_EOF: the returned pnum covers through end of file for this
   * layer, set by block layer
+ * BDRV_BLOCK_COMPRESSED: the underlying data is compressed; only valid for
+ *the formats supporting compression: qcow, qcow2
   *
   * Internal flags:
   * BDRV_BLOCK_RAW: for use by passthrough drivers, such as raw, to request
@@ -317,6 +319,7 @@ typedef enum {
  #define BDRV_BLOCK_ALLOCATED0x10
  #define BDRV_BLOCK_EOF  0x20
  #define BDRV_BLOCK_RECURSE  0x40
+#define BDRV_BLOCK_COMPRESSED   0x80
  
  typedef QTAILQ_HEAD(BlockReopenQueue, BlockReopenQueueEntry) BlockReopenQueue;
  

Reviewed-by: Denis V. Lunev 



Re: [PATCH v6 0/5] parallels: Add duplication check, repair at open, fix bugs

2023-06-21 Thread Denis V. Lunev

On 6/21/23 10:20, Alexander Ivanov wrote:

Fix incorrect data end calculation in parallels_open().

Check if data_end greater than the file size.

Add change_info argument to parallels_check_leak().

Add checking and repairing duplicate offsets in BAT

Image repairing in parallels_open().

v6:
2: Different patch. Refused to split image leak handling. Instead there is a
patch with a data_end check.
3: Different patch. There is a patch with change_info argument.
4: Removed changing fprintf by qemu_log from this patchset. Previously 3rd
patch became 4th. Replaced qemu_memalign() by qemu_blockalign(). Got
rid of iovecs, replaced bdrv_co_pwritev() by bdrv_co_pwrite(). Added
assert(cluster_index < bitmap_size). Now BAT changes are reverted if
there was an error in the cluster copying process. Simplified a sector
calculation.
5: Moved header magic check to the appropriate place. Added a
migrate_del_blocker() call and s->bat_dirty_bmap freeing on error.

v5:
3: Fixed a byteorder bug, fixed zero-length image handling and fixed uint32
truncation.

v4:
2,5: Rebased.

v3:
2: Added (size >= res->image_end_offset) assert and changed the comment in
parallels_get_leak_size(). Changed error printing and leaks fixing order.
3: Removed highest_offset() helper, instead image_end_offset field is used.
5: Moved highest_offset() code to parallels_open() - now it is used only in
this function. Fixed data_end update condition. Fixed a leak of
s->migration_blocker.

v2:
2: Moved outsude parallels_check_leak() 2 helpers:
parallels_get_leak_size() and parallels_fix_leak().

3: Used highest_offset() helper in parallels_check_leak(). Fixed a typo.

Added comments. Replaced g_malloc() call by qemu_memalign(). Replaced
bdrv_pread() call by bdrv_co_pread(). Got rid of keeping bytes and
sectors in the same variable. Added setting the bitmap of the used
clusters for a new allocated cluster if it isn't out of the bitmap.
Moved the leak fix to the end of all the checks. Removed a dependence
on image format for the duplicate check.

4 (old): Merged this patch to the previous.

4 (former 5): Fixed formatting.
5 (former 6): Fixed comments. Added O_INACTIVE check in the condition.
   Replaced inuse detection by header_unclean checking.
   Replaced playing with corutines by bdrv_check() usage.

Alexander Ivanov (5):
   parallels: Incorrect data end calculation in parallels_open()
   parallels: Check if data_end greater than the file size
   parallels: Add change_info argument to parallels_check_leak()
   parallels: Add checking and repairing duplicate offsets in BAT
   parallels: Image repairing in parallels_open()

  block/parallels.c | 228 +++---
  1 file changed, 195 insertions(+), 33 deletions(-)

Could you please also include tests for this functionality in the next
iteration?


We had such a series some time ago, but it was lost.



Re: [PATCH v6 5/5] parallels: Image repairing in parallels_open()

2023-06-21 Thread Denis V. Lunev

On 6/21/23 10:20, Alexander Ivanov wrote:

Repair an image at opening if the image is unclean or out-of-image
corruption was detected.

Signed-off-by: Alexander Ivanov 
---
  block/parallels.c | 65 +++
  1 file changed, 38 insertions(+), 27 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index d507fe7d90..dc7eb6375e 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -944,7 +944,7 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
  BDRVParallelsState *s = bs->opaque;
  ParallelsHeader ph;
  int ret, size, i;
-int64_t file_nb_sectors;
+int64_t file_nb_sectors, sector;
  QemuOpts *opts = NULL;
  Error *local_err = NULL;
  char *buf;
@@ -1026,33 +1026,8 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
  }
  s->bat_bitmap = (uint32_t *)(s->header + 1);
  
-for (i = 0; i < s->bat_size; i++) {

-int64_t off = bat2sect(s, i);
-if (off >= file_nb_sectors) {
-if (flags & BDRV_O_CHECK) {
-continue;
-}
-error_setg(errp, "parallels: Offset %" PRIi64 " in BAT[%d] entry "
-   "is larger than file size (%" PRIi64 ")",
-   off << BDRV_SECTOR_BITS, i,
-   file_nb_sectors << BDRV_SECTOR_BITS);
-ret = -EINVAL;
-goto fail;
-}
-if (off >= s->data_end) {
-s->data_end = off + s->tracks;
-}
-}
-
  if (le32_to_cpu(ph.inuse) == HEADER_INUSE_MAGIC) {
-/* Image was not closed correctly. The check is mandatory */
  s->header_unclean = true;
-if ((flags & BDRV_O_RDWR) && !(flags & BDRV_O_CHECK)) {
-error_setg(errp, "parallels: Image was not closed correctly; "
-   "cannot be opened read/write");
-ret = -EACCES;
-goto fail;
-}
  }
  
  opts = qemu_opts_create(_runtime_opts, NULL, 0, errp);

@@ -1113,10 +1088,40 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 bdrv_get_device_or_node_name(bs));
  ret = migrate_add_blocker(s->migration_blocker, errp);
  if (ret < 0) {
-error_free(s->migration_blocker);
+error_setg(errp, "Migration blocker error");
  goto fail;
  }
  qemu_co_mutex_init(>lock);
+
+for (i = 0; i < s->bat_size; i++) {
+sector = bat2sect(s, i);
+if (sector + s->tracks > s->data_end) {
+s->data_end = sector + s->tracks;
+}
+}
+
+/*
+ * We don't repair the image here if it's opened for checks. Also we don't
+ * want to change inactive images and can't change readonly images.
+ */
+if ((flags & (BDRV_O_CHECK | BDRV_O_INACTIVE)) || !(flags & BDRV_O_RDWR)) {
+return 0;
+}
+
+/*
+ * Repair the image if it's dirty or
+ * out-of-image corruption was detected.
+ */
+if (s->data_end > file_nb_sectors || s->header_unclean) {
+BdrvCheckResult res;
+ret = bdrv_check(bs, , BDRV_FIX_ERRORS | BDRV_FIX_LEAKS);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "Could not repair corrupted image");
+migrate_del_blocker(s->migration_blocker);
+goto fail;
+}
+}
+
  return 0;
  
  fail_format:

@@ -1124,6 +1129,12 @@ fail_format:
  fail_options:
  ret = -EINVAL;
  fail:
+/*
+ * "s" object was allocated by g_malloc0 so we can safely
+ * try to free its fields even they were not allocated.
+ */
+error_free(s->migration_blocker);
+g_free(s->bat_dirty_bmap);
  qemu_vfree(s->header);
  return ret;
  }

Reviewed-by: Denis V. Lunev 



Re: [PATCH v6 4/5] parallels: Add checking and repairing duplicate offsets in BAT

2023-06-21 Thread Denis V. Lunev

On 6/21/23 10:20, Alexander Ivanov wrote:

Cluster offsets must be unique among all the BAT entries. Find duplicate
offsets in the BAT and fix it by copying the content of the relevant
cluster to a newly allocated cluster and set the new cluster offset to the
duplicated entry.

Add host_cluster_index() helper to deduplicate the code.

When new clusters are allocated, the file size increases by 128 Mb. Call
parallels_check_leak() to fix this leak.

Signed-off-by: Alexander Ivanov 
---
  block/parallels.c | 142 ++
  1 file changed, 142 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index 78a34bd667..d507fe7d90 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -136,6 +136,12 @@ static int cluster_remainder(BDRVParallelsState *s, 
int64_t sector_num,
  return MIN(nb_sectors, ret);
  }
  
+static uint32_t host_cluster_index(BDRVParallelsState *s, int64_t off)

+{
+off -= le32_to_cpu(s->header->data_off) << BDRV_SECTOR_BITS;

According to parallels_open()
    s->data_end = le32_to_cpu(ph.data_off);
    if (s->data_end == 0) {
    s->data_end = ROUND_UP(bat_entry_off(s->bat_size), 
BDRV_SECTOR_SIZE);

    }
    if (s->data_end < s->header_size) {
    /* there is not enough unused space to fit to block align 
between BAT

   and actual data. We can't avoid read-modify-write... */
    s->header_size = size;
    }
data_off can be 0. In that case the logic of conversion of
host offset is incorrect.

This does not break THIS code, but given the name of this helper, further
code reuse could lead to mistakes in some cases.

Side note: this function could produce a bitmap which is shorter by 1 if
the file length is not cluster aligned. There are no checks for that.

Side note 2: it would be nice to validate data_off in advance, as follows
from the specification. In all cases data_off, if non-zero, should be
greater than bat_entry_off(s->bat_size).

  48 - 51:    data_off
  An offset, in sectors, from the start of the file to the 
start of

  the data area.

  For "WithoutFreeSpace" images:
  - If data_off is zero, the offset is calculated as the 
end of BAT

    table plus some padding to ensure sector size alignment.
  - If data_off is non-zero, the offset should be aligned 
to sector
    size. However it is recommended to align it to cluster 
size for

    newly created images.

  For "WithouFreSpacExt" images:
  data_off must be non-zero and aligned to cluster size.


+return off / s->cluster_size;
+}
+
  static int64_t block_status(BDRVParallelsState *s, int64_t sector_num,
  int nb_sectors, int *pnum)
  {
@@ -528,6 +534,137 @@ parallels_check_leak(BlockDriverState *bs, 
BdrvCheckResult *res,
  return 0;
  }
  
+static int parallels_check_duplicate(BlockDriverState *bs,

+ BdrvCheckResult *res,
+ BdrvCheckMode fix)
+{
+BDRVParallelsState *s = bs->opaque;
+int64_t off, sector;
+unsigned long *bitmap;
+uint32_t i, bitmap_size, cluster_index, old_offset;
+int n, ret = 0;
+uint64_t *buf = NULL;
+bool fixed = false;
+
+/*
+ * Create a bitmap of used clusters.
+ * If a bit is set, there is a BAT entry pointing to this cluster.
+ * Loop through the BAT entries, check bits relevant to an entry offset.
+ * If bit is set, this entry is duplicated. Otherwise set the bit.
+ *
+ * We shouldn't worry about newly allocated clusters outside the image
+ * because they are created higher then any existing cluster pointed by
+ * a BAT entry.
+ */
+bitmap_size = host_cluster_index(s, res->image_end_offset);
+if (bitmap_size == 0) {
+return 0;
+}
+
+bitmap = bitmap_new(bitmap_size);
+
+buf = qemu_blockalign(bs, s->cluster_size);
+
+for (i = 0; i < s->bat_size; i++) {
+off = bat2sect(s, i) << BDRV_SECTOR_BITS;
+if (off == 0) {
+continue;
+}
+
+cluster_index = host_cluster_index(s, off);
+assert(cluster_index < bitmap_size);
+if (test_bit(cluster_index, bitmap)) {

I would prefer to invert this if and the next one and have code like

if (!test_bit(cluster_index, bitmap)) {
    bitmap_set(bitmap, cluster_index, 1);
    continue;
}

if (!(fix & BDRV_FIX_ERRORS)) {
    continue;
}


+/* this cluster duplicates another one */
+fprintf(stderr,
+"%s duplicate offset in BAT entry %u\n",
+fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
+
+res->corruptions++;
+
+if (fix & BDRV_FIX_ERRORS) {
+/*
+ * Reset the entry and allocate a new cluster
+ * for the relevant guest offset. In this way we let
+ * the lower layer to place 

Re: [PATCH v6 4/5] parallels: Add checking and repairing duplicate offsets in BAT

2023-06-21 Thread Denis V. Lunev

On 6/21/23 10:20, Alexander Ivanov wrote:

Cluster offsets must be unique among all the BAT entries. Find duplicate
offsets in the BAT and fix it by copying the content of the relevant
cluster to a newly allocated cluster and set the new cluster offset to the
duplicated entry.

Add host_cluster_index() helper to deduplicate the code.

When new clusters are allocated, the file size increases by 128 Mb. Call
parallels_check_leak() to fix this leak.

Signed-off-by: Alexander Ivanov 
---
  block/parallels.c | 142 ++
  1 file changed, 142 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index 78a34bd667..d507fe7d90 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -136,6 +136,12 @@ static int cluster_remainder(BDRVParallelsState *s, 
int64_t sector_num,
  return MIN(nb_sectors, ret);
  }
  
+static uint32_t host_cluster_index(BDRVParallelsState *s, int64_t off)

+{
+off -= le32_to_cpu(s->header->data_off) << BDRV_SECTOR_BITS;

According to



+return off / s->cluster_size;
+}
+
  static int64_t block_status(BDRVParallelsState *s, int64_t sector_num,
  int nb_sectors, int *pnum)
  {
@@ -528,6 +534,137 @@ parallels_check_leak(BlockDriverState *bs, 
BdrvCheckResult *res,
  return 0;
  }
  
+static int parallels_check_duplicate(BlockDriverState *bs,

+ BdrvCheckResult *res,
+ BdrvCheckMode fix)
+{
+BDRVParallelsState *s = bs->opaque;
+int64_t off, sector;
+unsigned long *bitmap;
+uint32_t i, bitmap_size, cluster_index, old_offset;
+int n, ret = 0;
+uint64_t *buf = NULL;
+bool fixed = false;
+
+/*
+ * Create a bitmap of used clusters.
+ * If a bit is set, there is a BAT entry pointing to this cluster.
+ * Loop through the BAT entries, check bits relevant to an entry offset.
+ * If bit is set, this entry is duplicated. Otherwise set the bit.
+ *
+ * We shouldn't worry about newly allocated clusters outside the image
+ * because they are created higher then any existing cluster pointed by
+ * a BAT entry.
+ */
+bitmap_size = host_cluster_index(s, res->image_end_offset);
+if (bitmap_size == 0) {
+return 0;
+}
+
+bitmap = bitmap_new(bitmap_size);
+
+buf = qemu_blockalign(bs, s->cluster_size);
+
+for (i = 0; i < s->bat_size; i++) {
+off = bat2sect(s, i) << BDRV_SECTOR_BITS;
+if (off == 0) {
+continue;
+}
+
+cluster_index = host_cluster_index(s, off);
+assert(cluster_index < bitmap_size);
+if (test_bit(cluster_index, bitmap)) {
+/* this cluster duplicates another one */
+fprintf(stderr,
+"%s duplicate offset in BAT entry %u\n",
+fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
+
+res->corruptions++;
+
+if (fix & BDRV_FIX_ERRORS) {
+/*
+ * Reset the entry and allocate a new cluster
+ * for the relevant guest offset. In this way we let
+ * the lower layer to place the new cluster properly.
+ * Copy the original cluster to the allocated one.
+ * But before save the old offset value for repairing
+ * if we have an error.
+ */
+old_offset = le32_to_cpu(s->bat_bitmap[i]);
+parallels_set_bat_entry(s, i, 0);
+
+ret = bdrv_co_pread(bs->file, off, s->cluster_size, buf, 0);
+if (ret < 0) {
+res->check_errors++;
+goto out_repare_bat;
+}
+
+sector = i * (int64_t)s->tracks;
+sector = allocate_clusters(bs, sector, s->tracks, );
+if (sector < 0) {
+res->check_errors++;
+ret = sector;
+goto out_repare_bat;
+}
+off = sector << BDRV_SECTOR_BITS;
+
+ret = bdrv_co_pwrite(bs->file, off, s->cluster_size, buf, 0);
+if (ret < 0) {
+res->check_errors++;
+goto out_repare_bat;
+}
+
+if (off + s->cluster_size > res->image_end_offset) {
+res->image_end_offset = off + s->cluster_size;
+}
+
+/*
+ * In the future allocate_cluster() will reuse holed offsets
+ * inside the image. Keep the used clusters bitmap content
+ * consistent for the new allocated clusters too.
+ *
+ * Note, clusters allocated outside the current image are not
+ * considered, and the bitmap size doesn't change.
+ */
+cluster_index = host_cluster_index(s, off);
+if (cluster_index < 

Re: [PATCH v6 2/5] parallels: Check if data_end greater than the file size

2023-06-21 Thread Denis V. Lunev

On 6/21/23 10:20, Alexander Ivanov wrote:

Initially data_end is set to the data_off image header field and must not be
greater than the file size.

Signed-off-by: Alexander Ivanov 
---
  block/parallels.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index 1ec98c722b..4b7eafba1e 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -868,6 +868,11 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 and actual data. We can't avoid read-modify-write... */
  s->header_size = size;
  }
+if (s->data_end > file_nb_sectors) {
+error_setg(errp, "Invalid image: incorrect data_off field");
+ret = -EINVAL;
+goto fail;
+}
  
  ret = bdrv_pread(bs->file, 0, s->header_size, s->header, 0);

  if (ret < 0) {

For me this seems incorrect. In this case we would never be able to open
such an image, even for check purposes.

This place should have a clause like

    if (!(flags & BDRV_O_CHECK)) {
        goto fail;
    }

This error is not fatal from the check point of view.

Den
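
To combine the two points above, the hunk could look roughly like this
(my reading of the comment, not a tested patch):

    if (s->data_end > file_nb_sectors) {
        if (!(flags & BDRV_O_CHECK)) {
            error_setg(errp, "Invalid image: incorrect data_off field");
            ret = -EINVAL;
            goto fail;
        }
        /* opened for check/repair: let bdrv_check() deal with it */
    }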



Re: [PATCH 00/12] Introduce new vmapple machine type

2023-06-21 Thread Alexander Graf

Hi Mads,


On 20.06.23 13:17, Mads Ynddal wrote:




On 15 Jun 2023, at 00.40, Alexander Graf  wrote:

This patch set introduces a new ARM and HVF specific machine type
called "vmapple". It mimicks the device model that Apple's proprietary
Virtualization.Framework exposes, but implements it in QEMU.

With this new machine type, you can run macOS guests on Apple Silicon
systems via HVF. To do so, you need to first install macOS using
Virtualization.Framework onto a virtual disk image using a tool like
macosvm (https://github.com/s-u/macosvm)

  $ macosvm --disk disk.img,size=32g --aux aux.img \
--restore UniversalMac_12.0.1_21A559_Restore.ipsw vm.json

Then, extract the ECID from the installed VM:

  $ cat "$DIR/macosvm.json" | python3 -c \
  'import json,sys;obj=json.load(sys.stdin);print(obj["machineId"]) |\
  base64 -d | plutil -extract ECID raw -

Beware that the file will be called 'vm.json', while DIR is undefined in the
line that follows. Also, it's missing a single quote at the end of
`["machineId"])`.



Thanks :)





In addition, cut off the first 16kb of the aux.img:

  $ dd if=aux.img of=aux.img.trimmed bs=$(( 0x4000 )) skip=1

Now, you can just launch QEMU with the bits generated above:

  $ qemu-system-aarch64 -serial mon:stdio\
  -m 4G  \
  -M vmapple,uuid=6240349656165161789\
  -bios /Sys*/Lib*/Fra*/Virtualization.f*/R*/AVPBooter.vmapple2.bin  \
  -pflash aux.img.trimmed\
  -pflash disk.img   \
  -drive file=disk.img,if=none,id=root   \
  -device virtio-blk-pci,drive=root,x-apple-type=1   \
  -drive file=aux.img.trimmed,if=none,id=aux \
  -device virtio-blk-pci,drive=aux,x-apple-type=2\
  -accel hvf -no-reboot

Just for clarity, I'd add that 'vmapple,uuid=...' has to be set to the
ECID from the previous step.

You haven't defined a display, but I'm not sure if that is on purpose to
show a minimal setup. I had to add '-display sdl' for it to fully work.



Weird, I do get a normal cocoa output screen by default.





There are a few limitations with this implementation:

  - Only runs on macOS because it relies on
ParavirtualizesGraphics.Framework
  - Something is not fully correct on interrupt delivery or
similar - the keyboard does not work
  - No Rosetta in the guest because we lack the private
entitlement to enable TSO

Would it be possible to mitigate the keyboard issue using an emulated USB
keyboard? I tried poking around with it, but with no success.



Unfortunately I was not able to get USB stable inside the guest. This 
may be an issue with interrupt propagation: with usb-kbd I see that macOS 
does not pick up key-up or key-down events in time.



Alex





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




Re: [PATCH 03/12] hvf: Increase number of possible memory slots

2023-06-21 Thread Alexander Graf

Hi Philippe,


On 16.06.23 12:28, Philippe Mathieu-Daudé wrote:



On 15/6/23 00:40, Alexander Graf wrote:

For PVG we will need more than the current 32 possible memory slots.
Bump the limit to 512 instead.

Signed-off-by: Alexander Graf 
---
  accel/hvf/hvf-accel-ops.c | 2 +-
  include/sysemu/hvf_int.h  | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/accel/hvf/hvf-accel-ops.c b/accel/hvf/hvf-accel-ops.c
index 9c3da03c94..bf0caaa852 100644
--- a/accel/hvf/hvf-accel-ops.c
+++ b/accel/hvf/hvf-accel-ops.c
@@ -88,7 +88,7 @@ struct mac_slot {
  uint64_t gva;
  };

-struct mac_slot mac_slots[32];
+struct mac_slot mac_slots[512];

  static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
  {
diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
index 6ab119e49f..c7623a2c09 100644
--- a/include/sysemu/hvf_int.h
+++ b/include/sysemu/hvf_int.h
@@ -40,7 +40,7 @@ typedef struct hvf_vcpu_caps {

  struct HVFState {
  AccelState parent;
-    hvf_slot slots[32];
+    hvf_slot slots[512];
  int num_slots;

  hvf_vcpu_caps *hvf_caps;


Please add a definition in this header (using in ops.c).



Happy to :)
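
For illustration, the shared definition could look something like this
(HVF_MAX_SLOTS is a placeholder name, not taken from the patch):

/* include/sysemu/hvf_int.h */
#define HVF_MAX_SLOTS 512

struct HVFState {
    AccelState parent;
    hvf_slot slots[HVF_MAX_SLOTS];
    int num_slots;

    hvf_vcpu_caps *hvf_caps;
};

/* accel/hvf/hvf-accel-ops.c */
struct mac_slot mac_slots[HVF_MAX_SLOTS];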




In order to save memory and woods, what about keeping
32 on x86 and only raising to 512 on arm?



I am hoping that someone takes the apple-gfx driver and enables it for 
x86 as well, so I'd rather keep them consistent.


Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




Re: Open qcow2 on multiple hosts simultaneously.

2023-06-21 Thread kvaps
> QCOW2 caches two forms of data, cluster metadata (L1/L2 data, refcount
> table, etc) and mutable header information (file size, snapshot
> entries, etc). This data is discarded after the last piece of incoming
> migration data is received but before the guest starts, hence QCOW2
> images are safe to use with migration.

> QEMU's qcow2 implementation only supports one writer. Live migration is
> a slight exception, it takes care that only one host writes at any given
> time and also drops caches to avoid stale metadata during migration.

I just found that using 'cache.direct=true' makes active devices
synchronized on both hosts.
However it reduces the performance of the disk subsystem by half.

QEMU Storage Daemon QMP Reference Manual says:

> direct: boolean (optional)
> enables use of O_DIRECT (bypass the host page cache; default: false)

According to this description and the description for another option:

> drop-cache: boolean (optional) (If: CONFIG_LINUX)
> invalidate page cache during live migration. This prevents stale data on 
> the migration destination with cache.direct=off.

My assumption is that when cache.direct is enabled, it should
be safe to use the device in ReadWriteMany mode from both nodes.
If so, then I just need to enable this option only for the time when the
device needs to be attached on more than one host (e.g. while
live-migrating).
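
For concreteness, a hypothetical qemu-storage-daemon invocation with the
option enabled might look roughly like this (node names, the LV path and the
vduse export parameters are made up for illustration, so please double-check
them against the QSD manual):

qemu-storage-daemon \
  --blockdev driver=host_device,node-name=lun0,filename=/dev/vg0/pvc-data,cache.direct=on \
  --blockdev driver=qcow2,node-name=disk0,file=lun0,cache.direct=on \
  --export type=vduse-blk,id=exp0,node-name=disk0,name=vduse0,writable=on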

Can anyone confirm my assumption?

Best Regards,
Andrei Kvapil



Re: [RFC 5/6] migration: Deprecate block migration

2023-06-21 Thread Stefan Hajnoczi
On Mon, Jun 12, 2023 at 09:33:43PM +0200, Juan Quintela wrote:
> It is obsolete.  It is better to use driver_mirror+NBD instead.
> 
> CC: Kevin Wolf 
> CC: Eric Blake 
> CC: Stefan Hajnoczi 
> CC: Hanna Czenczek 
> 
> Signed-off-by: Juan Quintela 
> 
> ---
> 
> Can any of you give one example of how to use driver_mirror+NBD for
> deprecated.rst?

Please see "QMP invocation for live storage migration with
``drive-mirror`` + NBD" in docs/interop/live-block-operations.rst for a
detailed explanation.

> 
> Thanks, Juan.
> ---
>  docs/about/deprecated.rst |  6 ++
>  qapi/migration.json   | 29 +
>  migration/block.c |  2 ++
>  migration/options.c   |  7 +++
>  4 files changed, 40 insertions(+), 4 deletions(-)
> 
> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> index 518672722d..173c5ba5cb 100644
> --- a/docs/about/deprecated.rst
> +++ b/docs/about/deprecated.rst
> @@ -454,3 +454,9 @@ Everything except ``-incoming defer`` are deprecated.  
> This allows to
>  setup parameters before launching the proper migration with
>  ``migrate-incoming uri``.
>  
> +block migration (since 8.1)
> +'''
> +
> +Block migration is too inflexible.  It needs to migrate all block
> +devices or none.  Use driver_mirror+NBD instead.

blockdev-mirror with NBD

> +
> diff --git a/qapi/migration.json b/qapi/migration.json
> index b71e00737e..a8497de48d 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -258,11 +258,16 @@
>  # blocked.  Present and non-empty when migration is blocked.
>  # (since 6.0)
>  #
> +# Features:
> +#
> +# @deprecated: @disk migration is deprecated.  Use driver_mirror+NBD

blockdev-mirror with NBD

> +# instead.
> +#
>  # Since: 0.14
>  ##
>  { 'struct': 'MigrationInfo',
>'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
> -   '*disk': 'MigrationStats',
> +   '*disk': { 'type': 'MigrationStats', 'features': ['deprecated'] },
> '*vfio': 'VfioStats',
> '*xbzrle-cache': 'XBZRLECacheStats',
> '*total-time': 'int',
> @@ -497,6 +502,9 @@
>  #
>  # Features:
>  #
> +# @deprecated: @block migration is deprecated.  Use driver_mirror+NBD

blockdev-mirror with NBD

> +# instead.
> +#
>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>  #
>  # Since: 1.2
> @@ -506,7 +514,8 @@
> 'compress', 'events', 'postcopy-ram',
> { 'name': 'x-colo', 'features': [ 'unstable' ] },
> 'release-ram',
> -   'block', 'return-path', 'pause-before-switchover', 'multifd',
> +   { 'name': 'block', 'features': [ 'deprecated' ] },
> +   'return-path', 'pause-before-switchover', 'multifd',
> 'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
> { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
> 'validate-uuid', 'background-snapshot',
> @@ -789,6 +798,9 @@
>  #
>  # Features:
>  #
> +# @deprecated: Member @block-incremental is obsolete. Use
> +# driver_mirror+NBD instead.

blockdev-mirror with NBD

> +#
>  # @unstable: Member @x-checkpoint-delay is experimental.
>  #
>  # Since: 2.4
> @@ -803,7 +815,7 @@
> 'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
> 'downtime-limit',
> { 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
> -   'block-incremental',
> +   { 'name': 'block-incremental', 'features': [ 'deprecated' ] },
> 'multifd-channels',
> 'xbzrle-cache-size', 'max-postcopy-bandwidth',
> 'max-cpu-throttle', 'multifd-compression',
> @@ -945,6 +957,9 @@
>  #
>  # Features:
>  #
> +# @deprecated: Member @block-incremental is obsolete. Use
> +# driver_mirror+NBD instead.

blockdev-mirror with NBD

> +#
>  # @unstable: Member @x-checkpoint-delay is experimental.
>  #
>  # TODO: either fuse back into MigrationParameters, or make
> @@ -972,7 +987,8 @@
>  '*downtime-limit': 'uint64',
>  '*x-checkpoint-delay': { 'type': 'uint32',
>   'features': [ 'unstable' ] },
> -'*block-incremental': 'bool',
> +'*block-incremental': { 'type': 'bool',
> +'features': [ 'deprecated' ] },
>  '*multifd-channels': 'uint8',
>  '*xbzrle-cache-size': 'size',
>  '*max-postcopy-bandwidth': 'size',
> @@ -1137,6 +1153,9 @@
>  #
>  # Features:
>  #
> +# @deprecated: Member @block-incremental is obsolete. Use
> +# driver_mirror+NBD instead.

blockdev-mirror with NBD

> +#
>  # @unstable: Member @x-checkpoint-delay is experimental.
>  #
>  # Since: 2.4
> @@ -1161,6 +1180,8 @@
>  '*downtime-limit': 'uint64',
>  '*x-checkpoint-delay': { 'type': 'uint32',
>   'features': [ 'unstable' ] },
> +'*block-incremental': { 

Re: [RFC 6/6] migration: Deprecated old compression method

2023-06-21 Thread Daniel P . Berrangé
On Mon, Jun 12, 2023 at 09:33:44PM +0200, Juan Quintela wrote:
> Signed-off-by: Juan Quintela 
> ---
>  docs/about/deprecated.rst |  8 
>  qapi/migration.json   | 92 ---
>  migration/options.c   | 13 ++
>  3 files changed, 79 insertions(+), 34 deletions(-)
> 
> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> index 173c5ba5cb..fe7f2bbde8 100644
> --- a/docs/about/deprecated.rst
> +++ b/docs/about/deprecated.rst
> @@ -460,3 +460,11 @@ block migration (since 8.1)
>  Block migration is too inflexible.  It needs to migrate all block
>  devices or none.  Use driver_mirror+NBD instead.
>  
> +old compression method (since 8.1)
> +''
> +
> +Compression method fails too much.  Too many races.  We are going to
> +remove it if nobody fixes it.  For starters, migration-test
> +compression tests are disabled becase they hand randomly.  If you need
> +compression, use multifd compression methods.
> +
> diff --git a/qapi/migration.json b/qapi/migration.json
> index a8497de48d..40a8b5d124 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -244,6 +244,7 @@
>  #
>  # @compression: migration compression statistics, only returned if
>  # compression feature is on and status is 'active' or 'completed'
> +# It is obsolete and deprecated.  Use multifd compression methods.
>  # (Since 3.1)

This doesn't give users an indication /why/ we're saying this. Instead
I'd suggest

  This feature is unreliable and not tested. It is recommended to
  use multifd migration instead, which offers an alternative reliable
  and tested compression implementation.
  

>  #
>  # @socket-address: Only used for tcp, to know what the real port is
> @@ -261,7 +262,8 @@
>  # Features:
>  #
>  # @deprecated: @disk migration is deprecated.  Use driver_mirror+NBD
> -# instead.
> +# instead. @compression is obsolete use multifd compression
> +# methods instead.

For @deprecated, are we supposed to list multiple things at once, or
use a separate @deprecated tag for each one ?

Again I'd suggest rewording

@compression is unreliable and untested. It is recommended to
use multifd migration, which offers an alternative compression
implementation that is reliable and tested.


> diff --git a/migration/options.c b/migration/options.c
> index 374dc0767e..c04e62ba2d 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -443,6 +443,11 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, 
> Error **errp)
>  "Use driver_mirror+NBD instead.");
>  }
>  
> +if (new_caps[MIGRATION_CAPABILITY_BLOCK]) {

Surely MIGRATION_CAPABILITY_COMPRESS not BLOCK ?
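
Presumably the hunk was meant to read (sketch):

    if (new_caps[MIGRATION_CAPABILITY_COMPRESS]) {
        warn_report("Old compression method is deprecated. "
                    "Use multifd compression methods instead.");
    }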

> +warn_report("Old compression method is deprecated. "
> +"Use multifd compression methods instead.");
> +}
> +
>  #ifndef CONFIG_REPLICATION
>  if (new_caps[MIGRATION_CAPABILITY_X_COLO]) {
>  error_setg(errp, "QEMU compiled without replication module"
> @@ -1196,18 +1201,26 @@ static void migrate_params_apply(MigrateSetParameters 
> *params, Error **errp)
>  /* TODO use QAPI_CLONE() instead of duplicating it inline */
>  
>  if (params->has_compress_level) {
> +warn_report("Old compression is deprecated. "
> +"Use multifd compression methods instead.");
>  s->parameters.compress_level = params->compress_level;
>  }
>  
>  if (params->has_compress_threads) {
> +warn_report("Old compression is deprecated. "
> +"Use multifd compression methods instead.");
>  s->parameters.compress_threads = params->compress_threads;
>  }
>  
>  if (params->has_compress_wait_thread) {
> +warn_report("Old compression is deprecated. "
> +"Use multifd compression methods instead.");
>  s->parameters.compress_wait_thread = params->compress_wait_thread;
>  }
>  
>  if (params->has_decompress_threads) {
> +warn_report("Old compression is deprecated. "
> +"Use multifd compression methods instead.");
>  s->parameters.decompress_threads = params->decompress_threads;
>  }
>  
> -- 
> 2.40.1
> 

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: Open qcow2 on multiple hosts simultaneously.

2023-06-21 Thread kvaps
> QEMU's qcow2 implementation only supports one writer. Live migration is
> a slight exception, it takes care that only one host writes at any given
> time and also drops caches to avoid stale metadata during migration.

Thank you for clarifying this. Is there any command I can send
via QMP to drop the caches, or should I just reopen the file?

> QEMU's "raw" block driver has offset/size parameters, so a ReadWriteMany
> LUN can be divided into smaller block devices and each host can run its
> own qcow2 inside. The problem with this approach is fragmentation and
> ultimately there must be a component in the system that is aware of
> where each host has allocated its slice of the LUN.
>
> If you implement something from scratch, then it might be possible to
> take advantage of Kubernetes. For example, a Custom Resource Definition
> that describes extents allocated from a shared LUN. Space allocation
> could be persisted at the k8s cluster level via this CRD so that you
> don't need to reimplement clustering yourself. The CSI plugin constructs
> qemu-storage-daemon command-lines from the extent information. That way,
> each node in the cluster can run qemu-storage-daemon on the
> ReadWriteMany LUN and coordinate which extents are allocated for a given
> slice.

Yeah, I was considering the idea of using offsets on a plain LUN. Removing
clustered LVM from this scheme could simplify it a lot. But the problem is
that resize will not work. Of course you can allocate a new qcow2 in the
chain for every resize operation, but in that case user rewrites will
only allocate new space instead of rewriting already existing
data.

> So there is definitely something missing if you want to coordinate
> between hosts sharing the same ReadWriteMany LUN. I'm inclined to
> investigate existing solutions like pNFS or cLVM instead of reinventing
> this for QEMU, because I suspect it seems simple at first but actually
> involves a lot of work.

Thanks for pointing this out. In the previous mail I mentioned that I'm
going to use lvmlockd with a sanlock backend to coordinate LVM metadata
changes.

Best Regards,
Andrei Kvapil



Re: Open qcow2 on multiple hosts simultaneously.

2023-06-21 Thread kvaps
> Good to hear that. Alberto also has been working on a CSI driver
> which makes use of qemu-storage-daemon and qcwo2 files either with
> local storage or shared storage like NFS. At this point of time it
> focusses on filesystem backends as that's where it is easiest to
> manage qcow2 files. But I think that could be extended to support
> block device backends (ex. LVM) too.
>
> https://gitlab.com/subprovisioner/subprovisioner
>
> This is still work in progress. But I think there might be some overlaps
> in your work and subprovisoner project.
>

Wow that is amazing! Yeah, Alice told me about this project, thanks
for the link.
I wasn't able to reach Alberto. But now I'm happy that I can see the code.

> Hmm..., I will need to spned more time going through numbers and setup.
> This result is little surprising to me though. If you are using
> vduse, nbk, ublk kind of exports, that means all IO will go to kernel
> first, then to userspace(qsd) and then back into kernel. But with
> pure LVM based approach, I/O path is much shorter (user space to
> kernel). Given that, its little surprising that qcow2 is still
> faster as compared to LVM.

Raw LVM, of course, is faster than qcow2, but performance drops when
creating any snapshots.
This test compared three technologies: LVM, LVMThin, and QCOW2, to
determine which technology provides snapshots with the least impact on
performance.

> If you somehow managed to use vhost-user-blk export instead, then I/O
> path is shorter for qcow2 as well and that might perform well.

Yeah, but the problem is that I need a solution for both containers and
virtual machines.
You must use the kernel to provide block devices for containers. Also,
KubeVirt has no vhost-user support (yet).


> NBD will be slow. I am curious to know how do UBLK and VDUSE block
> compare. Technically there does not seem to be any reason by VDUSE
> virtio-vdpa device will be faster as compared to ublk. But I could
> be wrong.

The tests on this page represent the performance of the local block device,
not virtual machines.

UBLK is faster in 4K T1Q128 and much faster in 4K T1Q1 with fsync, but
slower in 4K T4Q128 and for 4M sequential write operations.

> What about vhost-user-blk export. Have you considered that? That
> probably will be fastest.

For VMs, yes, but I need to align the design with KubeVirt, so I
decided to do this later.

> > Despite two independent instances of qemu-storage-daemon for same
> > qcow2 disk running successfully on different hosts, I have concerns
> > about their proper functioning. Similar to live migration, I think
> > they should share the state between each other.
>
> Is it same LV on both the nodes? How are you activating same LV on
> two nodes? IIUC, LVM does not allow that.

LVM can live and work on multiple nodes even without the clustered extension.

If there is no competition for modifying LVM metadata, and you perform a
metadata refresh before every operation, you can live without locks.
The locks can be implemented on the CSI-driver side using the Kubernetes API.

Some solutions already implement this approach, e.g. OpenNebula and Proxmox.
But I'm going to use the traditional approach with lvmlockd and a sanlock backend.

> >
> > The question is how to make qemu-storage-daemon to share the state
> > between multiple nodes, or is qcow2 format inherently stateless and
> > does not requires this?
>
> That's a good question. For simplicity we could think of NFS backed
> storage and a qcow2 file providing storage. Can two QSD instances
> actually work with same qcow2 file?
>
> I am not sure this can be made to work with writable storage. Read-only
> storage, probably yes.

I already found an answer, on the qemu wiki page "Migration with
shared storage". It describes how live-migration works for various
formats:

QCOW2 caches two forms of data, cluster metadata (L1/L2 data, refcount
table, etc) and mutable header information (file size, snapshot
entries, etc). This data is discarded after the last piece of incoming
migration data is received but before the guest starts, hence QCOW2
images are safe to use with migration.

Thus, QEMU does not perform any actual state sharing; instead it just
drops the cache after live migration.
The problem is that the CSI driver can't catch the moment when the VM has
finished migrating and do the same, because it knows nothing
about the workload which uses the block device.

But Kubernetes expects that the driver provides full-featured
ReadWriteMany on both nodes.

I did some experiments:

Experiment 1:
I exported the qcow2 via NBD from the target node and attached it on the
source node, then started 'blockdev-mirror' with `sync=none` from the
local qcow2 to the target node.
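
Roughly, the QMP command for that step would look like this (the node names
are placeholders):

{ "execute": "blockdev-mirror",
  "arguments": { "job-id": "mirror0", "device": "disk0",
                 "target": "nbd-target", "sync": "none" } }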

It worked. The target node block device became aware of the
changes on the source. But since this is still an asynchronous
operation, I think it will not work for intensive workloads. Also, the
reverse 'blockdev-mirror' from the target node to the source node was
not working due to the fact that I actually made a loop (the 

Re: Open qcow2 on multiple hosts simultaneously.

2023-06-21 Thread Stefan Hajnoczi
On Tue, Jun 20, 2023 at 04:32:13PM -0400, Vivek Goyal wrote:
> On Mon, Jun 19, 2023 at 07:20:34PM +0200, kvaps wrote:
> > Hi Kevin and the community,
> 
> [ CC Alberto, Alice, Stefan ]
> > 
> > I am designing a CSI driver for Kubernetes that allows efficient
> > utilization of SAN (Storage Area Network) and supports thin
> > provisioning, snapshots, and ReadWriteMany mode for block devices.
> 
> Hi Andrei,
> 
> Good to hear that. Alberto also has been working on a CSI driver
> which makes use of qemu-storage-daemon and qcwo2 files either with
> local storage or shared storage like NFS. At this point of time it
> focusses on filesystem backends as that's where it is easiest to
> manage qcow2 files. But I think that could be extended to support
> block device backends (ex. LVM) too.
> 
> https://gitlab.com/subprovisioner/subprovisioner
> 
> This is still work in progress. But I think there might be some overlaps
> in your work and subprovisoner project.
> 
> > 
> > To implement this, I have explored several technologies such as
> > traditional LVM, LVMThin (which does not support shared mode), and
> > QCOW2 on top of block devices. This is the same approach to what oVirt
> > uses for thin provisioning over shared LUN:
> > 
> > https://github.com/oVirt/vdsm/blob/08a656c/doc/thin-provisioning.md
> > 
> > Based on benchmark results, I found that the performance degradation
> > of block-backed QCOW2 is much lower compared to LVM and LVMThin while
> > creating snapshots.
> > 
> > https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=2020746352
> > 
> > Therefore, I have decided to use the same aproach for Kubernetes.
> 
> Hmm..., I will need to spend more time going through the numbers and setup.
> This result is a little surprising to me though. If you are using
> vduse, nbd, ublk kinds of exports, that means all IO will go to the kernel
> first, then to userspace (qsd) and then back into the kernel. But with
> a pure LVM based approach, the I/O path is much shorter (user space to
> kernel). Given that, it's a little surprising that qcow2 is still
> faster compared to LVM.
> 
> If you somehow managed to use vhost-user-blk export instead, then I/O
> path is shorter for qcow2 as well and that might perform well.
> 
> > 
> > But in Kubernetes, the storage system needs to be self-sufficient and
> > not dependent on the workload that uses it. Thus, unlike oVirt, we have
> > no option to use the libvirt interface of the running VM to invoke
> > live migration. Instead, we should provide a pure block device in
> > ReadWriteMany mode, where the block device can be writable on multiple
> > hosts simultaneously.
> > 
> > To achieve this, I decided to use the qemu-storage-daemon with the
> > VDUSE backend.
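
[As an illustration, a VDUSE export from qemu-storage-daemon looks
roughly like this; the LV path, node names and the vduse device name
are placeholders:

qemu-storage-daemon \
  --blockdev driver=host_device,node-name=lv0,filename=/dev/vg0/pvc-1 \
  --blockdev driver=qcow2,node-name=qcow0,file=lv0 \
  --export type=vduse-blk,id=vduse0,node-name=qcow0,name=vduse0,writable=on

# then bind it on the host through virtio-vdpa:
vdpa dev add name vduse0 mgmtdev vduse
]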
> > 
> > Other technologies, such as NBD and UBLK, were also considered, and
> > their benchmark results can be seen in the same document on the
> > different sheet:
> > 
> > https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=416958126
> > 
> > Taking into account the performance, stability, and versatility, I
> > concluded that VDUSE is the optimal choice. To connect the device in
> > Kubernetes, the virtio-vdpa interface would be used, and the entire
> > scheme could look like this:
> 
> NBD will be slow. I am curious to know how UBLK and VDUSE block
> exports compare. Technically there does not seem to be any reason why a
> VDUSE virtio-vdpa device would be faster than ublk. But I could
> be wrong.
> 
> What about a vhost-user-blk export? Have you considered that? That
> probably will be fastest.
> 
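
[For reference, a vhost-user-blk export from qemu-storage-daemon would
look roughly like the following; the socket path and node names are
placeholders. The export is consumed by a vhost-user front-end such as
QEMU's vhost-user-blk-pci device rather than appearing as a host block
device:

qemu-storage-daemon \
  --blockdev driver=host_device,node-name=lv0,filename=/dev/vg0/pvc-1 \
  --blockdev driver=qcow2,node-name=qcow0,file=lv0 \
  --export type=vhost-user-blk,id=vub0,node-name=qcow0,addr.type=unix,addr.path=/run/qsd-vub0.sock,writable=on
]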
> > 
> > 
> > +---------------------+  +---------------------+
> > | node1               |  | node2               |
> > |                     |  |                     |
> > |    +-----------+    |  |    +-----------+    |
> > |    | /dev/vda  |    |  |    | /dev/vda  |    |
> > |    +-----+-----+    |  |    +-----+-----+    |
> > |          |          |  |          |          |
> > |     virtio-vdpa     |  |     virtio-vdpa     |
> > |          |          |  |          |          |
> > |        vduse        |  |        vduse        |
> > |          |          |  |          |          |
> > | qemu-storage-daemon |  | qemu-storage-daemon |
> > |          |          |  |          |          |
> > | +------- | -------+ |  | +------- | -------+ |
> > | | LUN    |        | |  | | LUN    |        | |
> > | |  +-----+-----+  | |  | |  +-----+-----+  | |
> > | |  | LV (qcow2)|  | |  | |  | LV (qcow2)|  | |
> > | |  +-----------+  | |  | |  +-----------+  | |
> > | +--------+--------+ |  | +--------+--------+ |
> > |          |          |  |          |          |
> > |          |          |  |          |          |
> > +--------- | ---------+  +--------- | ---------+
> >            |                        |
> >            |       +-----+          |
> >            +-------| SAN |----------+
> >                    +-----+
> > 
> > Despite two independent instances of 

[PATCH v6 4/5] parallels: Add checking and repairing duplicate offsets in BAT

2023-06-21 Thread Alexander Ivanov
Cluster offsets must be unique among all the BAT entries. Find duplicate
offsets in the BAT and fix them by copying the content of the relevant
cluster to a newly allocated cluster and setting the new cluster's offset
in the duplicated entry.

Add host_cluster_index() helper to deduplicate the code.

When new clusters are allocated, the file size increases by 128 MB. Call
parallels_check_leak() to fix this leak.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 142 ++
 1 file changed, 142 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index 78a34bd667..d507fe7d90 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -136,6 +136,12 @@ static int cluster_remainder(BDRVParallelsState *s, 
int64_t sector_num,
 return MIN(nb_sectors, ret);
 }
 
+static uint32_t host_cluster_index(BDRVParallelsState *s, int64_t off)
+{
+off -= le32_to_cpu(s->header->data_off) << BDRV_SECTOR_BITS;
+return off / s->cluster_size;
+}
+
 static int64_t block_status(BDRVParallelsState *s, int64_t sector_num,
 int nb_sectors, int *pnum)
 {
@@ -528,6 +534,137 @@ parallels_check_leak(BlockDriverState *bs, 
BdrvCheckResult *res,
 return 0;
 }
 
+static int parallels_check_duplicate(BlockDriverState *bs,
+ BdrvCheckResult *res,
+ BdrvCheckMode fix)
+{
+BDRVParallelsState *s = bs->opaque;
+int64_t off, sector;
+unsigned long *bitmap;
+uint32_t i, bitmap_size, cluster_index, old_offset;
+int n, ret = 0;
+uint64_t *buf = NULL;
+bool fixed = false;
+
+/*
+ * Create a bitmap of used clusters.
+ * If a bit is set, there is a BAT entry pointing to this cluster.
+ * Loop through the BAT entries, check bits relevant to an entry offset.
+ * If bit is set, this entry is duplicated. Otherwise set the bit.
+ *
+ * We shouldn't worry about newly allocated clusters outside the image
+ * because they are created higher than any existing cluster pointed to by
+ * a BAT entry.
+ */
+bitmap_size = host_cluster_index(s, res->image_end_offset);
+if (bitmap_size == 0) {
+return 0;
+}
+
+bitmap = bitmap_new(bitmap_size);
+
+buf = qemu_blockalign(bs, s->cluster_size);
+
+for (i = 0; i < s->bat_size; i++) {
+off = bat2sect(s, i) << BDRV_SECTOR_BITS;
+if (off == 0) {
+continue;
+}
+
+cluster_index = host_cluster_index(s, off);
+assert(cluster_index < bitmap_size);
+if (test_bit(cluster_index, bitmap)) {
+/* this cluster duplicates another one */
+fprintf(stderr,
+"%s duplicate offset in BAT entry %u\n",
+fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i);
+
+res->corruptions++;
+
+if (fix & BDRV_FIX_ERRORS) {
+/*
+ * Reset the entry and allocate a new cluster
+ * for the relevant guest offset. In this way we let
+ * the lower layer place the new cluster properly.
+ * Copy the original cluster to the allocated one.
+ * But before that, save the old offset value so we can
+ * repair the BAT if we hit an error.
+ */
+old_offset = le32_to_cpu(s->bat_bitmap[i]);
+parallels_set_bat_entry(s, i, 0);
+
+ret = bdrv_co_pread(bs->file, off, s->cluster_size, buf, 0);
+if (ret < 0) {
+res->check_errors++;
+goto out_repare_bat;
+}
+
+sector = i * (int64_t)s->tracks;
+sector = allocate_clusters(bs, sector, s->tracks, );
+if (sector < 0) {
+res->check_errors++;
+ret = sector;
+goto out_repare_bat;
+}
+off = sector << BDRV_SECTOR_BITS;
+
+ret = bdrv_co_pwrite(bs->file, off, s->cluster_size, buf, 0);
+if (ret < 0) {
+res->check_errors++;
+goto out_repare_bat;
+}
+
+if (off + s->cluster_size > res->image_end_offset) {
+res->image_end_offset = off + s->cluster_size;
+}
+
+/*
+ * In the future allocate_cluster() will reuse holed offsets
+ * inside the image. Keep the used clusters bitmap content
+ * consistent for the new allocated clusters too.
+ *
+ * Note, clusters allocated outside the current image are not
+ * considered, and the bitmap size doesn't change.
+ */
+cluster_index = host_cluster_index(s, off);
+if (cluster_index < bitmap_size) {
+bitmap_set(bitmap, cluster_index, 1);

[PATCH v6 0/5] parallels: Add duplication check, repair at open, fix bugs

2023-06-21 Thread Alexander Ivanov
Fix incorrect data end calculation in parallels_open().

Check if data_end greater than the file size.

Add change_info argument to parallels_check_leak().

Add checking and repairing duplicate offsets in BAT

Image repairing in parallels_open().

v6:
2: Different patch. Declined to split the image leak handling. Instead there
   is a patch with a data_end check.
3: Different patch. There is a patch with change_info argument.
4: Removed changing fprintf by qemu_log from this patchset. Previously 3rd
   patch became 4th. Replaced qemu_memalign() by qemu_blockalign(). Got
   rid of iovecs, replaced bdrv_co_pwritev() by bdrv_co_pwrite(). Added
   assert(cluster_index < bitmap_size). Now BAT changes are reverted if
   there was an error in the cluster copying process. Simplified a sector
   calculation.
5: Moved header magic check to the appropriate place. Added a
   migrate_del_blocker() call and s->bat_dirty_bmap freeing on error.

v5:
3: Fixed a byteorder bug, fixed zero-length image handling and fixed uint32
   truncation.

v4:
2,5: Rebased.

v3:
2: Added (size >= res->image_end_offset) assert and changed the comment in
   parallels_get_leak_size(). Changed error printing and leaks fixing order.
3: Removed highest_offset() helper, instead image_end_offset field is used.
5: Moved highest_offset() code to parallels_open() - now it is used only in
   this function. Fixed data_end update condition. Fixed a leak of
   s->migration_blocker.

v2:
2: Moved outsude parallels_check_leak() 2 helpers:
   parallels_get_leak_size() and parallels_fix_leak().
   
3: Used highest_offset() helper in parallels_check_leak(). Fixed a typo.
   Added comments. Replaced g_malloc() call by qemu_memalign(). Replaced
   bdrv_pread() call by bdrv_co_pread(). Got rid of keeping bytes and
   sectors in the same variable. Added setting the bitmap of the used
   clusters for a new allocated cluster if it isn't out of the bitmap.
   Moved the leak fix to the end of all the checks. Removed a dependence
   on image format for the duplicate check.
   
4 (old): Merged this patch to the previous.
4 (former 5): Fixed formatting.
5 (former 6): Fixed comments. Added O_INACTIVE check in the condition.
  Replaced inuse detection by header_unclean checking.
  Replaced playing with coroutines by bdrv_check() usage.

Alexander Ivanov (5):
  parallels: Incorrect data end calculation in parallels_open()
  parallels: Check if data_end greater than the file size
  parallels: Add change_info argument to parallels_check_leak()
  parallels: Add checking and repairing duplicate offsets in BAT
  parallels: Image repairing in parallels_open()

 block/parallels.c | 228 +++---
 1 file changed, 195 insertions(+), 33 deletions(-)

-- 
2.34.1




[PATCH v6 3/5] parallels: Add change_info argument to parallels_check_leak()

2023-06-21 Thread Alexander Ivanov
In the next patch we need to repair leaks without changing the leaks and
leaks_fixed info in res. Add a change_info argument to skip updating this
info when the argument is false.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 4b7eafba1e..78a34bd667 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -484,7 +484,7 @@ parallels_check_outside_image(BlockDriverState *bs, 
BdrvCheckResult *res,
 
 static int coroutine_fn GRAPH_RDLOCK
 parallels_check_leak(BlockDriverState *bs, BdrvCheckResult *res,
- BdrvCheckMode fix)
+ BdrvCheckMode fix, bool change_info)
 {
 BDRVParallelsState *s = bs->opaque;
 int64_t size;
@@ -502,7 +502,9 @@ parallels_check_leak(BlockDriverState *bs, BdrvCheckResult 
*res,
 fprintf(stderr, "%s space leaked at the end of the image %" PRId64 
"\n",
 fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR",
 size - res->image_end_offset);
-res->leaks += count;
+if (change_info) {
+res->leaks += count;
+}
 if (fix & BDRV_FIX_LEAKS) {
 Error *local_err = NULL;
 
@@ -517,7 +519,9 @@ parallels_check_leak(BlockDriverState *bs, BdrvCheckResult 
*res,
 res->check_errors++;
 return ret;
 }
-res->leaks_fixed += count;
+if (change_info) {
+res->leaks_fixed += count;
+}
 }
 }
 
@@ -570,7 +574,7 @@ parallels_co_check(BlockDriverState *bs, BdrvCheckResult 
*res,
 return ret;
 }
 
-ret = parallels_check_leak(bs, res, fix);
+ret = parallels_check_leak(bs, res, fix, true);
 if (ret < 0) {
 return ret;
 }
-- 
2.34.1




[PATCH v6 2/5] parallels: Check if data_end greater than the file size

2023-06-21 Thread Alexander Ivanov
Initially data_end is set to the data_off image header field and must not be
greater than the file size.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index 1ec98c722b..4b7eafba1e 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -868,6 +868,11 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
and actual data. We can't avoid read-modify-write... */
 s->header_size = size;
 }
+if (s->data_end > file_nb_sectors) {
+error_setg(errp, "Invalid image: incorrect data_off field");
+ret = -EINVAL;
+goto fail;
+}
 
 ret = bdrv_pread(bs->file, 0, s->header_size, s->header, 0);
 if (ret < 0) {
-- 
2.34.1




[PATCH v6 5/5] parallels: Image repairing in parallels_open()

2023-06-21 Thread Alexander Ivanov
Repair an image at open time if the image is unclean or out-of-image
corruption was detected.

Signed-off-by: Alexander Ivanov 
---
 block/parallels.c | 65 +++
 1 file changed, 38 insertions(+), 27 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index d507fe7d90..dc7eb6375e 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -944,7 +944,7 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 BDRVParallelsState *s = bs->opaque;
 ParallelsHeader ph;
 int ret, size, i;
-int64_t file_nb_sectors;
+int64_t file_nb_sectors, sector;
 QemuOpts *opts = NULL;
 Error *local_err = NULL;
 char *buf;
@@ -1026,33 +1026,8 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 s->bat_bitmap = (uint32_t *)(s->header + 1);
 
-for (i = 0; i < s->bat_size; i++) {
-int64_t off = bat2sect(s, i);
-if (off >= file_nb_sectors) {
-if (flags & BDRV_O_CHECK) {
-continue;
-}
-error_setg(errp, "parallels: Offset %" PRIi64 " in BAT[%d] entry "
-   "is larger than file size (%" PRIi64 ")",
-   off << BDRV_SECTOR_BITS, i,
-   file_nb_sectors << BDRV_SECTOR_BITS);
-ret = -EINVAL;
-goto fail;
-}
-if (off >= s->data_end) {
-s->data_end = off + s->tracks;
-}
-}
-
 if (le32_to_cpu(ph.inuse) == HEADER_INUSE_MAGIC) {
-/* Image was not closed correctly. The check is mandatory */
 s->header_unclean = true;
-if ((flags & BDRV_O_RDWR) && !(flags & BDRV_O_CHECK)) {
-error_setg(errp, "parallels: Image was not closed correctly; "
-   "cannot be opened read/write");
-ret = -EACCES;
-goto fail;
-}
 }
 
 opts = qemu_opts_create(_runtime_opts, NULL, 0, errp);
@@ -1113,10 +1088,40 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
bdrv_get_device_or_node_name(bs));
 ret = migrate_add_blocker(s->migration_blocker, errp);
 if (ret < 0) {
-error_free(s->migration_blocker);
+error_setg(errp, "Migration blocker error");
 goto fail;
 }
 qemu_co_mutex_init(>lock);
+
+for (i = 0; i < s->bat_size; i++) {
+sector = bat2sect(s, i);
+if (sector + s->tracks > s->data_end) {
+s->data_end = sector + s->tracks;
+}
+}
+
+/*
+ * We don't repair the image here if it's opened for checks. Also we don't
+ * want to change inactive images and can't change readonly images.
+ */
+if ((flags & (BDRV_O_CHECK | BDRV_O_INACTIVE)) || !(flags & BDRV_O_RDWR)) {
+return 0;
+}
+
+/*
+ * Repair the image if it's dirty or
+ * out-of-image corruption was detected.
+ */
+if (s->data_end > file_nb_sectors || s->header_unclean) {
+BdrvCheckResult res;
+ret = bdrv_check(bs, , BDRV_FIX_ERRORS | BDRV_FIX_LEAKS);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "Could not repair corrupted image");
+migrate_del_blocker(s->migration_blocker);
+goto fail;
+}
+}
+
 return 0;
 
 fail_format:
@@ -1124,6 +1129,12 @@ fail_format:
 fail_options:
 ret = -EINVAL;
 fail:
+/*
+ * "s" object was allocated by g_malloc0 so we can safely
+ * try to free its fields even if they were not allocated.
+ */
+error_free(s->migration_blocker);
+g_free(s->bat_dirty_bmap);
 qemu_vfree(s->header);
 return ret;
 }
-- 
2.34.1




[PATCH v6 1/5] parallels: Incorrect data end calculation in parallels_open()

2023-06-21 Thread Alexander Ivanov
The BDRVParallelsState structure contains data_end field that is measured
in sectors. In parallels_open() initially this field is set by data_off
field from parallels image header.

According to the parallels format documentation, data_off field contains
an offset, in sectors, from the start of the file to the start of the
data area. For "WithoutFreeSpace" images: if data_off is zero, the offset
is calculated as the end of the BAT table plus some padding to ensure
sector size alignment.

The parallels_open() function has code for handling zero value in
data_off, but in the result data_end contains the offset in bytes.

Replace the alignment to sector size with a division by sector size and fix
the comparison with s->header_size.
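
As a made-up illustration: if the header plus BAT occupy 1 MiB, the old
code set data_end = ROUND_UP(1048576, 512) = 1048576, i.e. a byte offset,
while the fixed code sets data_end = DIV_ROUND_UP(1048576, 512) = 2048,
i.e. the sector offset that the rest of the driver expects.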

Signed-off-by: Alexander Ivanov 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Hanna Czenczek 
---
 block/parallels.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 7c263d5085..1ec98c722b 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -861,9 +861,9 @@ static int parallels_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 s->data_end = le32_to_cpu(ph.data_off);
 if (s->data_end == 0) {
-s->data_end = ROUND_UP(bat_entry_off(s->bat_size), BDRV_SECTOR_SIZE);
+s->data_end = DIV_ROUND_UP(size, BDRV_SECTOR_SIZE);
 }
-if (s->data_end < s->header_size) {
+if (s->data_end < (s->header_size >> BDRV_SECTOR_BITS)) {
 /* there is not enough unused space to fit to block align between BAT
and actual data. We can't avoid read-modify-write... */
 s->header_size = size;
-- 
2.34.1




[PATCH v3 3/3] hw/ufs: Support for UFS logical unit

2023-06-21 Thread Jeuk Kim
This commit adds support for the UFS logical unit.
The LU handles processing of SCSI commands and
unit descriptor query requests.

This commit enables the UFS device to process
IO requests.

Signed-off-by: Jeuk Kim 
---
 hw/ufs/lu.c  | 1441 ++
 hw/ufs/meson.build   |2 +-
 hw/ufs/trace-events  |   25 +
 hw/ufs/ufs.c |  252 ++-
 hw/ufs/ufs.h |   43 ++
 include/scsi/constants.h |1 +
 6 files changed, 1757 insertions(+), 7 deletions(-)
 create mode 100644 hw/ufs/lu.c

diff --git a/hw/ufs/lu.c b/hw/ufs/lu.c
new file mode 100644
index 00..ef69de61a5
--- /dev/null
+++ b/hw/ufs/lu.c
@@ -0,0 +1,1441 @@
+/*
+ * QEMU UFS Logical Unit
+ *
+ * Copyright (c) 2023 Samsung Electronics Co., Ltd. All rights reserved.
+ *
+ * Written by Jeuk Kim 
+ *
+ * This code is licensed under the GNU GPL v2 or later.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "qapi/error.h"
+#include "qemu/memalign.h"
+#include "hw/scsi/scsi.h"
+#include "scsi/constants.h"
+#include "sysemu/block-backend.h"
+#include "qemu/cutils.h"
+#include "trace.h"
+#include "ufs.h"
+
+/*
+ * The code below handling SCSI commands is copied from hw/scsi/scsi-disk.c,
+ * with minor adjustments to make it work for UFS.
+ */
+
+#define SCSI_DMA_BUF_SIZE (128 * KiB)
+#define SCSI_MAX_INQUIRY_LEN 256
+#define SCSI_INQUIRY_DATA_SIZE 36
+#define SCSI_MAX_MODE_LEN 256
+
+typedef struct UfsSCSIReq {
+SCSIRequest req;
+/* Both sector and sector_count are in terms of BDRV_SECTOR_SIZE bytes.  */
+uint64_t sector;
+uint32_t sector_count;
+uint32_t buflen;
+bool started;
+bool need_fua_emulation;
+struct iovec iov;
+QEMUIOVector qiov;
+BlockAcctCookie acct;
+} UfsSCSIReq;
+
+static void ufs_scsi_free_request(SCSIRequest *req)
+{
+UfsSCSIReq *r = DO_UPCAST(UfsSCSIReq, req, req);
+
+qemu_vfree(r->iov.iov_base);
+}
+
+static void scsi_check_condition(UfsSCSIReq *r, SCSISense sense)
+{
+trace_ufs_scsi_check_condition(r->req.tag, sense.key, sense.asc,
+   sense.ascq);
+scsi_req_build_sense(>req, sense);
+scsi_req_complete(>req, CHECK_CONDITION);
+}
+
+static int ufs_scsi_emulate_vpd_page(SCSIRequest *req, uint8_t *outbuf,
+ uint32_t outbuf_len)
+{
+UfsHc *u = UFS(req->bus->qbus.parent);
+UfsLu *lu = DO_UPCAST(UfsLu, qdev, req->dev);
+uint8_t page_code = req->cmd.buf[2];
+int start, buflen = 0;
+
+if (outbuf_len < SCSI_INQUIRY_DATA_SIZE) {
+return -1;
+}
+
+outbuf[buflen++] = lu->qdev.type & 0x1f;
+outbuf[buflen++] = page_code;
+outbuf[buflen++] = 0x00;
+outbuf[buflen++] = 0x00;
+start = buflen;
+
+switch (page_code) {
+case 0x00: /* Supported page codes, mandatory */
+{
+trace_ufs_scsi_emulate_vpd_page_00(req->cmd.xfer);
+outbuf[buflen++] = 0x00; /* list of supported pages (this page) */
+if (u->params.serial) {
+outbuf[buflen++] = 0x80; /* unit serial number */
+}
+outbuf[buflen++] = 0x87; /* mode page policy */
+break;
+}
+case 0x80: /* Device serial number, optional */
+{
+int l;
+
+if (!u->params.serial) {
+trace_ufs_scsi_emulate_vpd_page_80_not_supported();
+return -1;
+}
+
+l = strlen(u->params.serial);
+if (l > SCSI_INQUIRY_DATA_SIZE) {
+l = SCSI_INQUIRY_DATA_SIZE;
+}
+
+trace_ufs_scsi_emulate_vpd_page_80(req->cmd.xfer);
+memcpy(outbuf + buflen, u->params.serial, l);
+buflen += l;
+break;
+}
+case 0x87: /* Mode Page Policy, mandatory */
+{
+trace_ufs_scsi_emulate_vpd_page_87(req->cmd.xfer);
+outbuf[buflen++] = 0x3f; /* apply to all mode pages and subpages */
+outbuf[buflen++] = 0xff;
+outbuf[buflen++] = 0; /* shared */
+outbuf[buflen++] = 0;
+break;
+}
+default:
+return -1;
+}
+/* done with EVPD */
+assert(buflen - start <= 255);
+outbuf[start - 1] = buflen - start;
+return buflen;
+}
+
+static int ufs_scsi_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf,
+uint32_t outbuf_len)
+{
+int buflen = 0;
+
+if (outbuf_len < SCSI_INQUIRY_DATA_SIZE) {
+return -1;
+}
+
+if (req->cmd.buf[1] & 0x1) {
+/* Vital product data */
+return ufs_scsi_emulate_vpd_page(req, outbuf, outbuf_len);
+}
+
+/* Standard INQUIRY data */
+if (req->cmd.buf[2] != 0) {
+return -1;
+}
+
+/* PAGE CODE == 0 */
+buflen = req->cmd.xfer;
+if (buflen > SCSI_MAX_INQUIRY_LEN) {
+buflen = SCSI_MAX_INQUIRY_LEN;
+}
+
+if (is_wlun(req->lun)) {
+outbuf[0] = TYPE_WLUN;
+} else {
+outbuf[0] = 0;
+}
+outbuf[1] = 0;
+
+strpadcpy((char *)[16], 16, "QEMU UFS", ' ');
+strpadcpy((char 

[PATCH v3 2/3] hw/ufs: Support for Query Transfer Requests

2023-06-21 Thread Jeuk Kim
This commit makes the UFS device support query
and NOP OUT transfer requests.

The next patch will add support for the UFS logical
unit and SCSI command transfer requests.

Signed-off-by: Jeuk Kim 
---
 hw/ufs/trace-events |1 +
 hw/ufs/ufs.c| 1005 ++-
 hw/ufs/ufs.h|   46 ++
 3 files changed, 1050 insertions(+), 2 deletions(-)

diff --git a/hw/ufs/trace-events b/hw/ufs/trace-events
index 17793929b1..97cf6664b9 100644
--- a/hw/ufs/trace-events
+++ b/hw/ufs/trace-events
@@ -19,6 +19,7 @@ ufs_err_dma_read_req_upiu(uint32_t slot, uint64_t addr) 
"failed to read req upiu
 ufs_err_dma_read_prdt(uint32_t slot, uint64_t addr) "failed to read prdt. 
UTRLDBR slot %"PRIu32", prdt addr %"PRIu64""
 ufs_err_dma_write_utrd(uint32_t slot, uint64_t addr) "failed to write utrd. 
UTRLDBR slot %"PRIu32", UTRD dma addr %"PRIu64""
 ufs_err_dma_write_rsp_upiu(uint32_t slot, uint64_t addr) "failed to write rsp 
upiu. UTRLDBR slot %"PRIu32", response upiu addr %"PRIu64""
+ufs_err_utrl_slot_error(uint32_t slot) "UTRLDBR slot %"PRIu32" is in error"
 ufs_err_utrl_slot_busy(uint32_t slot) "UTRLDBR slot %"PRIu32" is busy"
 ufs_err_unsupport_register_offset(uint32_t offset) "Register offset 
0x%"PRIx32" is not yet supported"
 ufs_err_invalid_register_offset(uint32_t offset) "Register offset 0x%"PRIx32" 
is invalid"
diff --git a/hw/ufs/ufs.c b/hw/ufs/ufs.c
index 99e5e55e97..2b564141c6 100644
--- a/hw/ufs/ufs.c
+++ b/hw/ufs/ufs.c
@@ -15,10 +15,234 @@
 #include "ufs.h"
 
 /* The QEMU-UFS device follows spec version 3.1 */
-#define UFS_SPEC_VER 0x0310
+#define UFS_SPEC_VER 0x0310
 #define UFS_MAX_NUTRS 32
 #define UFS_MAX_NUTMRS 8
 
+static MemTxResult ufs_addr_read(UfsHc *u, hwaddr addr, void *buf, int size)
+{
+uint32_t cap = ldl_le_p(>reg.cap);
+hwaddr hi = addr + size - 1;
+
+if (hi < addr) {
+return MEMTX_DECODE_ERROR;
+}
+
+if (!FIELD_EX32(cap, CAP, 64AS) && (hi >> 32)) {
+return MEMTX_DECODE_ERROR;
+}
+
+return pci_dma_read(PCI_DEVICE(u), addr, buf, size);
+}
+
+static MemTxResult ufs_addr_write(UfsHc *u, hwaddr addr, const void *buf,
+  int size)
+{
+uint32_t cap = ldl_le_p(>reg.cap);
+hwaddr hi = addr + size - 1;
+if (hi < addr) {
+return MEMTX_DECODE_ERROR;
+}
+
+if (!FIELD_EX32(cap, CAP, 64AS) && (hi >> 32)) {
+return MEMTX_DECODE_ERROR;
+}
+
+return pci_dma_write(PCI_DEVICE(u), addr, buf, size);
+}
+
+static void ufs_complete_req(UfsRequest *req, UfsReqResult req_result);
+
+static inline hwaddr ufs_get_utrd_addr(UfsHc *u, uint32_t slot)
+{
+uint32_t utrlba = ldl_le_p(>reg.utrlba);
+uint32_t utrlbau = ldl_le_p(>reg.utrlbau);
+hwaddr utrl_base_addr = (((hwaddr)utrlbau) << 32) + utrlba;
+hwaddr utrd_addr = utrl_base_addr + slot * sizeof(UtpTransferReqDesc);
+
+return utrd_addr;
+}
+
+static inline hwaddr ufs_get_req_upiu_base_addr(const UtpTransferReqDesc *utrd)
+{
+uint32_t cmd_desc_base_addr_lo =
+le32_to_cpu(utrd->command_desc_base_addr_lo);
+uint32_t cmd_desc_base_addr_hi =
+le32_to_cpu(utrd->command_desc_base_addr_hi);
+
+return (((hwaddr)cmd_desc_base_addr_hi) << 32) + cmd_desc_base_addr_lo;
+}
+
+static inline hwaddr ufs_get_rsp_upiu_base_addr(const UtpTransferReqDesc *utrd)
+{
+hwaddr req_upiu_base_addr = ufs_get_req_upiu_base_addr(utrd);
+uint32_t rsp_upiu_byte_off =
+le16_to_cpu(utrd->response_upiu_offset) * sizeof(uint32_t);
+return req_upiu_base_addr + rsp_upiu_byte_off;
+}
+
+static MemTxResult ufs_dma_read_utrd(UfsRequest *req)
+{
+UfsHc *u = req->hc;
+hwaddr utrd_addr = ufs_get_utrd_addr(u, req->slot);
+MemTxResult ret;
+
+ret = ufs_addr_read(u, utrd_addr, >utrd, sizeof(req->utrd));
+if (ret) {
+trace_ufs_err_dma_read_utrd(req->slot, utrd_addr);
+}
+return ret;
+}
+
+static MemTxResult ufs_dma_read_req_upiu(UfsRequest *req)
+{
+UfsHc *u = req->hc;
+hwaddr req_upiu_base_addr = ufs_get_req_upiu_base_addr(>utrd);
+UtpUpiuReq *req_upiu = >req_upiu;
+uint32_t copy_size;
+uint16_t data_segment_length;
+MemTxResult ret;
+
+/*
+ * To know the size of the req_upiu, we need to read the
+ * data_segment_length in the header first.
+ */
+ret = ufs_addr_read(u, req_upiu_base_addr, _upiu->header,
+sizeof(UtpUpiuHeader));
+if (ret) {
+trace_ufs_err_dma_read_req_upiu(req->slot, req_upiu_base_addr);
+return ret;
+}
+data_segment_length = be16_to_cpu(req_upiu->header.data_segment_length);
+
+copy_size = sizeof(UtpUpiuHeader) + UFS_TRANSACTION_SPECIFIC_FIELD_SIZE +
+data_segment_length;
+
+ret = ufs_addr_read(u, req_upiu_base_addr, >req_upiu, copy_size);
+if (ret) {
+trace_ufs_err_dma_read_req_upiu(req->slot, req_upiu_base_addr);
+}
+return ret;
+}
+
+static MemTxResult ufs_dma_read_prdt(UfsRequest *req)
+{
+UfsHc 

[PATCH v3 1/3] hw/ufs: Initial commit for emulated Universal-Flash-Storage

2023-06-21 Thread Jeuk Kim
Universal Flash Storage (UFS) is a high-performance mass storage device
with a serial interface. It is primarily used as a high-performance
data storage device for embedded applications.

This commit contains the code for the UFS device to be recognized
as a UFS PCI device.
Patches to handle the UFS logical unit and Transfer Requests will follow.

Signed-off-by: Jeuk Kim 
---
 MAINTAINERS  |6 +
 docs/specs/pci-ids.rst   |2 +
 hw/Kconfig   |1 +
 hw/meson.build   |1 +
 hw/ufs/Kconfig   |4 +
 hw/ufs/meson.build   |1 +
 hw/ufs/trace-events  |   33 ++
 hw/ufs/trace.h   |1 +
 hw/ufs/ufs.c |  304 +++
 hw/ufs/ufs.h |   42 ++
 include/block/ufs.h  | 1048 ++
 include/hw/pci/pci.h |1 +
 include/hw/pci/pci_ids.h |1 +
 meson.build  |1 +
 14 files changed, 1446 insertions(+)
 create mode 100644 hw/ufs/Kconfig
 create mode 100644 hw/ufs/meson.build
 create mode 100644 hw/ufs/trace-events
 create mode 100644 hw/ufs/trace.h
 create mode 100644 hw/ufs/ufs.c
 create mode 100644 hw/ufs/ufs.h
 create mode 100644 include/block/ufs.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 88b5a7ee0a..91c2bfbb09 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2231,6 +2231,12 @@ F: tests/qtest/nvme-test.c
 F: docs/system/devices/nvme.rst
 T: git git://git.infradead.org/qemu-nvme.git nvme-next
 
+ufs
+M: Jeuk Kim 
+S: Supported
+F: hw/ufs/*
+F: include/block/ufs.h
+
 megasas
 M: Hannes Reinecke 
 L: qemu-block@nongnu.org
diff --git a/docs/specs/pci-ids.rst b/docs/specs/pci-ids.rst
index e302bea484..d6707fa069 100644
--- a/docs/specs/pci-ids.rst
+++ b/docs/specs/pci-ids.rst
@@ -92,6 +92,8 @@ PCI devices (other than virtio):
   PCI PVPanic device (``-device pvpanic-pci``)
 1b36:0012
   PCI ACPI ERST device (``-device acpi-erst``)
+1b36:0013
+  PCI UFS device (``-device ufs``)
 
 All these devices are documented in :doc:`index`.
 
diff --git a/hw/Kconfig b/hw/Kconfig
index ba62ff6417..9ca7b38c31 100644
--- a/hw/Kconfig
+++ b/hw/Kconfig
@@ -38,6 +38,7 @@ source smbios/Kconfig
 source ssi/Kconfig
 source timer/Kconfig
 source tpm/Kconfig
+source ufs/Kconfig
 source usb/Kconfig
 source virtio/Kconfig
 source vfio/Kconfig
diff --git a/hw/meson.build b/hw/meson.build
index c7ac7d3d75..f01fac4617 100644
--- a/hw/meson.build
+++ b/hw/meson.build
@@ -37,6 +37,7 @@ subdir('smbios')
 subdir('ssi')
 subdir('timer')
 subdir('tpm')
+subdir('ufs')
 subdir('usb')
 subdir('vfio')
 subdir('virtio')
diff --git a/hw/ufs/Kconfig b/hw/ufs/Kconfig
new file mode 100644
index 00..b7b3392e85
--- /dev/null
+++ b/hw/ufs/Kconfig
@@ -0,0 +1,4 @@
+config UFS_PCI
+bool
+default y if PCI_DEVICES
+depends on PCI
diff --git a/hw/ufs/meson.build b/hw/ufs/meson.build
new file mode 100644
index 00..c1d90eeea6
--- /dev/null
+++ b/hw/ufs/meson.build
@@ -0,0 +1 @@
+softmmu_ss.add(when: 'CONFIG_UFS_PCI', if_true: files('ufs.c'))
diff --git a/hw/ufs/trace-events b/hw/ufs/trace-events
new file mode 100644
index 00..17793929b1
--- /dev/null
+++ b/hw/ufs/trace-events
@@ -0,0 +1,33 @@
+# ufs.c
+ufs_irq_raise(void) "INTx"
+ufs_irq_lower(void) "INTx"
+ufs_mmio_read(uint64_t addr, uint64_t data, unsigned size) "addr 0x%"PRIx64" 
data 0x%"PRIx64" size %d"
+ufs_mmio_write(uint64_t addr, uint64_t data, unsigned size) "addr 0x%"PRIx64" 
data 0x%"PRIx64" size %d"
+ufs_process_db(uint32_t slot) "UTRLDBR slot %"PRIu32""
+ufs_process_req(uint32_t slot) "UTRLDBR slot %"PRIu32""
+ufs_complete_req(uint32_t slot) "UTRLDBR slot %"PRIu32""
+ufs_sendback_req(uint32_t slot) "UTRLDBR slot %"PRIu32""
+ufs_exec_nop_cmd(uint32_t slot) "UTRLDBR slot %"PRIu32""
+ufs_exec_scsi_cmd(uint32_t slot, uint8_t lun, uint8_t opcode) "slot %"PRIu32", 
lun 0x%"PRIx8", opcode 0x%"PRIx8""
+ufs_exec_query_cmd(uint32_t slot, uint8_t opcode) "slot %"PRIu32", opcode 
0x%"PRIx8""
+ufs_process_uiccmd(uint32_t uiccmd, uint32_t ucmdarg1, uint32_t ucmdarg2, 
uint32_t ucmdarg3) "uiccmd 0x%"PRIx32", ucmdarg1 0x%"PRIx32", ucmdarg2 
0x%"PRIx32", ucmdarg3 0x%"PRIx32""
+
+# error condition
+ufs_err_memory_allocation(void) "failed to allocate memory"
+ufs_err_dma_read_utrd(uint32_t slot, uint64_t addr) "failed to read utrd. 
UTRLDBR slot %"PRIu32", UTRD dma addr %"PRIu64""
+ufs_err_dma_read_req_upiu(uint32_t slot, uint64_t addr) "failed to read req 
upiu. UTRLDBR slot %"PRIu32", request upiu addr %"PRIu64""
+ufs_err_dma_read_prdt(uint32_t slot, uint64_t addr) "failed to read prdt. 
UTRLDBR slot %"PRIu32", prdt addr %"PRIu64""
+ufs_err_dma_write_utrd(uint32_t slot, uint64_t addr) "failed to write utrd. 
UTRLDBR slot %"PRIu32", UTRD dma addr %"PRIu64""
+ufs_err_dma_write_rsp_upiu(uint32_t slot, uint64_t addr) "failed to write rsp 
upiu. UTRLDBR slot %"PRIu32", response upiu addr %"PRIu64""
+ufs_err_utrl_slot_busy(uint32_t slot) "UTRLDBR slot %"PRIu32" is busy"
+ufs_err_unsupport_register_offset(uint32_t offset) "Register offset 
0x%"PRIx32" 

[PATCH v3 0/3] hw/ufs: Add Universal Flash Storage (UFS) support

2023-06-21 Thread Jeuk Kim
Since v2:
Addressed review comments from Stefan Hajnoczi. The main fixes are as follows:
- Use of SPDX licence identifiers
- fixed endianness error
- removed memory leak
- fixed DMA error handling logic


This patch series adds support for a new PCI-based UFS device.

The UFS PCI device ID (PCI_DEVICE_ID_REDHAT_UFS) is not registered
in the Linux kernel yet, so it does not work right away, but I confirmed
that it works with Linux once the UFS PCI device ID is registered.

I have also verified that it works with Windows 10.
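
[A usage sketch: the ``-device ufs`` name comes from the pci-ids.rst hunk
in patch 1, while the logical-unit device name and its properties below
are assumptions, not verified against this series:

qemu-system-x86_64 ... \
  -drive file=ufs-lu0.img,format=raw,if=none,id=lu0 \
  -device ufs,serial=foo \
  -device ufs-lu,drive=lu0
]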

Jeuk Kim (3):
  hw/ufs: Initial commit for emulated Universal-Flash-Storage
  hw/ufs: Support for Query Transfer Requests
  hw/ufs: Support for UFS logical unit

 MAINTAINERS  |6 +
 docs/specs/pci-ids.rst   |2 +
 hw/Kconfig   |1 +
 hw/meson.build   |1 +
 hw/ufs/Kconfig   |4 +
 hw/ufs/lu.c  | 1441 +++
 hw/ufs/meson.build   |1 +
 hw/ufs/trace-events  |   59 ++
 hw/ufs/trace.h   |1 +
 hw/ufs/ufs.c | 1545 ++
 hw/ufs/ufs.h |  131 
 include/block/ufs.h  | 1048 ++
 include/hw/pci/pci.h |1 +
 include/hw/pci/pci_ids.h |1 +
 include/scsi/constants.h |1 +
 meson.build  |1 +
 16 files changed, 4244 insertions(+)
 create mode 100644 hw/ufs/Kconfig
 create mode 100644 hw/ufs/lu.c
 create mode 100644 hw/ufs/meson.build
 create mode 100644 hw/ufs/trace-events
 create mode 100644 hw/ufs/trace.h
 create mode 100644 hw/ufs/ufs.c
 create mode 100644 hw/ufs/ufs.h
 create mode 100644 include/block/ufs.h

-- 
2.34.1




Re: [RFC 6/6] migration: Deprecated old compression method

2023-06-21 Thread Thomas Huth

On 12/06/2023 21.33, Juan Quintela wrote:

Signed-off-by: Juan Quintela 
---
  docs/about/deprecated.rst |  8 
  qapi/migration.json   | 92 ---
  migration/options.c   | 13 ++
  3 files changed, 79 insertions(+), 34 deletions(-)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 173c5ba5cb..fe7f2bbde8 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -460,3 +460,11 @@ block migration (since 8.1)
  Block migration is too inflexible.  It needs to migrate all block
  devices or none.  Use driver_mirror+NBD instead.
  
+old compression method (since 8.1)

+''
+
+Compression method fails too much.  Too many races.  We are going to
+remove it if nobody fixes it.  For starters, migration-test
+compression tests are disabled becase they hand randomly.  If you need


"because they fail randomly" ?


+compression, use multifd compression methods.
+
diff --git a/qapi/migration.json b/qapi/migration.json
index a8497de48d..40a8b5d124 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -244,6 +244,7 @@
  #
  # @compression: migration compression statistics, only returned if
  # compression feature is on and status is 'active' or 'completed'
+# It is obsolete and deprecated.  Use multifd compression methods.
  # (Since 3.1)
  #
  # @socket-address: Only used for tcp, to know what the real port is
@@ -261,7 +262,8 @@
  # Features:
  #
  # @deprecated: @disk migration is deprecated.  Use driver_mirror+NBD
-# instead.
+# instead. @compression is obsolete use multifd compression


Use a dot or comma after "obsolete".


+# methods instead.
  #
  # Since: 0.14
  ##
@@ -279,7 +281,7 @@
 '*blocked-reasons': ['str'],
 '*postcopy-blocktime': 'uint32',
 '*postcopy-vcpu-blocktime': ['uint32'],
-   '*compression': 'CompressionStats',
+   '*compression': { 'type': 'CompressionStats', 'features': 
['deprecated'] },
 '*socket-address': ['SocketAddress'] } }
  
  ##

@@ -432,7 +434,8 @@
  # compress and xbzrle are both on, compress only takes effect in
  # the ram bulk stage, after that, it will be disabled and only
  # xbzrle takes effect, this can help to minimize migration
-# traffic.  The feature is disabled by default.  (since 2.4 )
+# traffic.  The feature is disabled by default.  Obsolete.  Use
+# multifd compression methods if needed. (since 2.4 )
  #
  # @events: generate events for each migration state change (since 2.4
  # )
@@ -503,6 +506,7 @@
  # Features:
  #
  # @deprecated: @block migration is deprecated.  Use driver_mirror+NBD
+# instead. @compress is obsolete use multifd compression methods


ditto


  # instead.
  #
  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
@@ -511,7 +515,8 @@
  ##
  { 'enum': 'MigrationCapability',
'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
-   'compress', 'events', 'postcopy-ram',
+   { 'name': 'compress', 'features': [ 'deprecated' ] },
+   'events', 'postcopy-ram',
 { 'name': 'x-colo', 'features': [ 'unstable' ] },
 'release-ram',
 { 'name': 'block', 'features': [ 'deprecated' ] },
@@ -671,22 +676,24 @@
  # migration, the compression level is an integer between 0 and 9,
  # where 0 means no compression, 1 means the best compression
  # speed, and 9 means best compression ratio which will consume
-# more CPU.
+# more CPU. Obsolete, see multifd compression if needed.
  #
  # @compress-threads: Set compression thread count to be used in live
  # migration, the compression thread count is an integer between 1
-# and 255.
+# and 255. Obsolete, see multifd compression if needed.
  #
  # @compress-wait-thread: Controls behavior when all compression
  # threads are currently busy.  If true (default), wait for a free
  # compression thread to become available; otherwise, send the page
-# uncompressed.  (Since 3.1)
+# uncompressed. Obsolete, see multifd compression if
+# needed. (Since 3.1)
  #
  # @decompress-threads: Set decompression thread count to be used in
  # live migration, the decompression thread count is an integer
  # between 1 and 255. Usually, decompression is at least 4 times as
  # fast as compression, so set the decompress-threads to the number
-# about 1/4 of compress-threads is adequate.
+# about 1/4 of compress-threads is adequate. Obsolete, see multifd
+# compression if needed.
  #
  # @throttle-trigger-threshold: The ratio of bytes_dirty_period and
  # bytes_xfer_period to trigger throttling.  It is expressed as
@@ -799,7 +806,9 @@
  # Features:
  #
  # @deprecated: Member @block-incremental is obsolete. Use
-# driver_mirror+NBD instead.
+# driver_mirror+NBD instead. Compression is obsolete, so members
+# 

Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-21 Thread Thomas Huth

On 12/06/2023 21.33, Juan Quintela wrote:

Only "defer" is recommended.  After setting all migation parameters,
start incoming migration with "migrate-incoming uri" command.

Signed-off-by: Juan Quintela 
---
  docs/about/deprecated.rst | 7 +++
  softmmu/vl.c  | 2 ++
  2 files changed, 9 insertions(+)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 47e98dc95e..518672722d 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -447,3 +447,10 @@ The new way to modify migration is using migration 
parameters.
  ``blk`` functionality can be acchieved using
  ``migrate_set_parameter block-incremental true``.
  
+``-incoming uri`` (since 8.1)

+'
+
+Everything except ``-incoming defer`` are deprecated.  This allows to
+setup parameters before launching the proper migration with
+``migrate-incoming uri``.
+
diff --git a/softmmu/vl.c b/softmmu/vl.c
index b0b96f67fa..7fe865ab59 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2651,6 +2651,8 @@ void qmp_x_exit_preconfig(Error **errp)
  if (incoming) {
  Error *local_err = NULL;
  if (strcmp(incoming, "defer") != 0) {
+warn_report("-incoming %s is deprecated, use -incoming defer and "
+" set the uri with migrate-incoming.", incoming);
  qmp_migrate_incoming(incoming, _err);
  if (local_err) {
  error_reportf_err(local_err, "-incoming %s: ", incoming);


Could we maybe keep at least the smallest set of necessary parameters 
around? I'm often doing a "-incoming tcp:0:1234" for doing quick sanity 
checks with migration, not caring about other migration parameters, so if 
that could continue to work, that would be very appreciated.
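
[For comparison, the two flows look roughly like this; the port and the
tuned parameter are just examples:

# deprecated form this patch warns about
qemu-system-x86_64 -incoming tcp:0:1234 ...

# recommended flow: start deferred, set parameters, then supply the URI
qemu-system-x86_64 -incoming defer ...
(qemu) migrate_set_parameter multifd-channels 4
(qemu) migrate_incoming tcp:0:1234
]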


 Thomas




Re: [PATCH RFC 0/6] Switch iotests to pyvenv

2023-06-21 Thread Paolo Bonzini
On Wed, Jun 21, 2023, 02:21 John Snow  wrote:

> Hi, this is ... a fairly incomplete series about trying to get iotests
> to run out of the configure-time venv. I'm looking for some feedback, so
> out to the list it goes.
>
> Primarily, I'm having doubts about these points:
>
> 1) I think I need something like "mkvenv install" in the first patch,
>but mkvenv.py is getting pretty long...
>

It's not a lot of code, but using pip install from configure might also be
good enough, I don't know.
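
A minimal sketch of that alternative, assuming configure's $python already
points at the venv interpreter and the in-tree package lives in python/
(the flags and error handling are illustrative):

# in configure, after the venv has been created
if ! $python -m pip install --disable-pip-version-check \
        -e "$source_path/python" > /dev/null 2>&1; then
    error_exit "failed to install the in-tree qemu package into pyvenv"
fi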

> 2) Is there a way to optimize the speed for patch #2? Maybe installing
>    this package can be skipped until it's needed, but that means that
>    things like iotest's ./check might get complicated to support that.
>
> 3) I cheated quite a bit in patch 4 to use the correct Python to launch
>iotests, but I'm wondering if there's a nicer way to solve this
>more *completely*.
>

Maybe patch 4 can use distlib.scripts as well to create the check script in
the build directory? (Yes, that's another mkvenv functionality...) I'm on a
phone and don't have the docs at hand, so I am not sure. If not, your
solution is good enough.

Apart from this, the only issue is the speed. IIRC having a prebuilt .whl
would fix it; I think for Meson we observed that the slow part was building
the wheel. Possibilities:

1) using --no-pep517 if that also speeds it up?

2) already removing the sources to qemu.qmp since that's the plan anyway;
and then, if you want editability you can install the package with --user
--editable, i.e. outside the venv

Paolo


> John Snow (6):
>   experiment: add mkvenv install
>   build, tests: Add qemu in-tree packages to pyvenv at configure time.
>   iotests: get rid of '..' in path environment output
>   iotests: use the correct python to run linters
>   iotests: use pyvenv/bin/python3 to launch child test processes
>   iotests: don't add qemu.git/python to PYTHONPATH
>
>  configure | 31 +++
>  python/scripts/mkvenv.py  | 40 +++
>  tests/qemu-iotests/linters.py |  2 +-
>  tests/qemu-iotests/testenv.py | 21 --
>  4 files changed, 87 insertions(+), 7 deletions(-)
>
> --
> 2.40.1
>
>
>