On 2025/09/03 8:47, Arun Menon wrote:
Hi Akihiko,
It took some time to set up the machines; apologies for the delay in response.
On Mon, Sep 01, 2025 at 02:12:54AM +0900, Akihiko Odaki wrote:
On 2025/09/01 1:38, Arun Menon wrote:
Hi,
On Mon, Sep 01, 2025 at 01:04:40AM +0900, Akihiko Odaki wrote:
On 2025/09/01 0:45, Arun Menon wrote:
Hi Akihiko,
Thanks for the review.
On Sat, Aug 30, 2025 at 02:58:05PM +0900, Akihiko Odaki wrote:
On 2025/08/30 5:01, Arun Menon wrote:
This is an incremental step in converting vmstate loading
code to report error via Error objects instead of directly
printing it to console/monitor.
It is ensured that qemu_loadvm_state() must report an error
in errp, in case of failure.
When postcopy live migration runs, the device states are loaded by
both the qemu coroutine process_incoming_migration_co() and the
postcopy_ram_listen_thread(). Therefore, it is important that the
coroutine also reports the error in case of failure, with
error_report_err(). Otherwise, the source qemu will not display
any errors before going into the postcopy pause state.
Reviewed-by: Marc-André Lureau <marcandre.lur...@redhat.com>
Reviewed-by: Fabiano Rosas <faro...@suse.de>
Signed-off-by: Arun Menon <arme...@redhat.com>
---
migration/migration.c | 9 +++++----
migration/savevm.c | 30 ++++++++++++++++++------------
migration/savevm.h | 2 +-
3 files changed, 24 insertions(+), 17 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index
10c216d25dec01f206eacad2edd24d21f00e614c..c6768d88f45c870c7fad9b9957300766ff69effc
100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -881,7 +881,7 @@ process_incoming_migration_co(void *opaque)
MIGRATION_STATUS_ACTIVE);
mis->loadvm_co = qemu_coroutine_self();
- ret = qemu_loadvm_state(mis->from_src_file);
+ ret = qemu_loadvm_state(mis->from_src_file, &local_err);
mis->loadvm_co = NULL;
trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
@@ -908,7 +908,8 @@ process_incoming_migration_co(void *opaque)
}
if (ret < 0) {
- error_setg(&local_err, "load of migration failed: %s", strerror(-ret));
+ error_prepend(&local_err, "load of migration failed: %s: ",
+ strerror(-ret));
goto fail;
}
@@ -924,13 +925,13 @@ fail:
migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
MIGRATION_STATUS_FAILED);
migrate_set_error(s, local_err);
- error_free(local_err);
+ error_report_err(local_err);
This is problematic because it results in duplicate error reports when
!mis->exit_on_error; in that case the query-migrate QMP command reports the
error and this error reporting is redundant.
If I comment this change, then all of the errors propagated up to now, using
error_setg() will not be reported. This is the place where it is finally
reported,
when qemu_loadvm_state() fails. In other words, all the error_reports() we
removed
from all the files, replacing them with error_setg(), will finally be reported
here
using error_report_err().
My understanding of the code without these two changes is:
- If the migrate-incoming QMP command is used with false as
exit-on-error, this function will not report the error but
the query-migrate QMP command will report the error.
- Otherwise, this function reports the error.
With my limited experience in testing, I have a question,
So there are 2 scenarios,
1. running the virsh migrate command on the source host. Something like the
following,
virsh -c 'qemu:///system' migrate --live --verbose --domain guest-vm
--desturi qemu+ssh://10.6.120.20/system
OR for postcopy-ram,
virsh migrate guest-vm --live qemu+ssh://10.6.120.20/system --verbose
--postcopy --timeout 10 --timeout-postcopy
2. Using QMP commands, performing a migration from source to destination.
Running something like the following on the destination:
{
"execute": "migrate-incoming",
"arguments": {
"uri": "tcp:127.0.0.1:7777",
"exit-on-error": false
}
}
{
"execute": "migrate-incoming",
"arguments": {
"uri": "tcp:127.0.0.1:7777",
"exit-on-error": false
}
}
and the somthing like the following on source:
{
"execute": "migrate",
"arguments": {
"uri": "tcp:127.0.0.1:7777"
}
}
{"execute" : "query-migrate"}
In 1, previously, the user used to get an error message on migration failure.
This was because there were error_report() calls in all of the files.
Now that they are replaced with error_setg() and the error is stored in errp,
we need to display that using error_report_err(). Hence I introduced an
error_report_err()
call in the fail section.
In 2, we have 2 QMP sessions, one for the source and another for the
destination.
The QMP command migrate will be issued on the source, and the errp will be set.
I did not understand the part where the message will be displayed because of the
error_report_err() call. I did not see such a message on failure scenario on
both
the sessions.
If the user wants to check for errors, then the destination qemu will not exit
(exit-on-error = false ) and we can retrieve it using {"execute" :
"query-migrate"}
Aren't the 2 scenarios different by nature?
In 1, doesn't libvirt query the error with query-migrate and print it?
Ideally it should find the the error, and print the whole thing. It does work
in the normal scenario. However, the postcopy scenario does not show the same
result,
which is mentioned in the commit message.
In any case, it would be nice if you describe how libvirt interacts with
QEMU in 1.
Please find below the difference in the command output at source, when we run a
live migration
with postcopy enabled.
=========
With the current changes:
[root@dell-per750-42 qemu-priv]# virsh migrate-setspeed guest-vm 1
[root@dell-per750-42 build]# virsh migrate guest-vm --live
qemu+ssh://10.6.120.9/system --verbose --postcopy --timeout 10
--timeout-postcopy
root@10.6.120.9's password:
Migration: [ 1.26 %]error: internal error: QEMU unexpectedly closed the monitor
(vm='guest-vm'): 2025-09-03T06:19:15.076547Z qemu-system-x86_64: -accel kvm:
warning: Number of SMP cpus requested (2) exceeds the recommended cpus
supported by KVM (1)
2025-09-03T06:19:15.076586Z qemu-system-x86_64: -accel kvm: warning: Number of
hotpluggable cpus requested (2) exceeds the recommended cpus supported by KVM
(1)
2025-09-03T06:19:27.776715Z qemu-system-x86_64: load of migration failed:
Input/output error: error while loading state for instance 0x0 of device
'tpm-emulator': post load hook failed for: tpm-emulator, version_id: 0,
minimum_version: 0, ret: -5: tpm-emulator: Setting the stateblob (type 1)
failed with a TPM error 0x21 decryption error
[root@dell-per750-42 build]#
=========
Without the current changes:
[root@dell-per750-42 qemu-priv]# virsh migrate-setspeed guest-vm 1
[root@dell-per750-42 qemu-priv]# virsh migrate guest-vm --live
qemu+ssh://10.6.120.9/system --verbose --postcopy --timeout 10
--timeout-postcopy
root@10.6.120.9's password:
Migration: [ 1.28 %]error: internal error: QEMU unexpectedly closed the monitor
(vm='guest-vm'): 2025-09-03T06:26:17.733786Z qemu-system-x86_64: -accel kvm:
warning: Number of SMP cpus requested (2) exceeds the recommended cpus
supported by KVM (1)
2025-09-03T06:26:17.733830Z qemu-system-x86_64: -accel kvm: warning: Number of
hotpluggable cpus requested (2) exceeds the recommended cpus supported by KVM
(1)
[root@dell-per750-42 qemu-priv]#
=========
The original behavior was to print the error to the console regardless of
whether the migration is normal or postcopy.
This was true for messages in qemu_loadvm_state(), but the message "load
of migration failed" was printed or queried with query-migrate, not
both. We should think of which behavior is more appropriate, and I think
we should avoid duplicate reports.
The source machine goes in to a paused state after this.
The output is informative. It implies the destination machine exited,
and it makes sense to print error messages as it is done for
mis->exit_on_error. I wonder if it is possible to detect the condition
and treat it identically to mis->exit_on_error.
Regards,
Akihiko Odaki