Hi,

I'm seeing VM live migration failures when the VM being migrated is running a nested VM. I'm using the latest Linux kernel (v5.3) and QEMU (v4.1.0). I also tried v5.2, but the result was the same. The kernel version in both the L1 and L2 VMs is v4.18, but I don't think that matters.
The symptom is that the L2 VM kernel crashes in different places after migration, but the call stacks are mostly related to memory management, as in [1] and [2]. The kernel crash happens almost every time. While the L2 VM gets a kernel panic, the L1 VM runs fine after the migration. Both the L1 and L2 VMs were idle during migration.

I found a few clues about this issue.

1) It happens when L1 has a relatively large amount of memory (24G), but not with a smaller size (3G).

2) Dead (non-live) migration worked: when I first ran the "stop" command in the QEMU monitor for L1 and then migrated, migration always worked. It also worked when I stopped only the L2 VM and kept L1 live during the migration. Given clues 1 and 2, my guess is that some pages dirtied by L2 are not transferred to the destination correctly, but I'm not really sure.

3) It happens on an Intel(R) Xeon(R) Silver 4114 CPU, but not on an Intel(R) Xeon(R) CPU E5-2630 v3. This confuses me because I thought migrating nested state doesn't depend on the underlying hardware. In any case, L1-only migration with the large memory size (24G) works on both CPUs without any problem.

I would appreciate any comments/suggestions to fix this problem.

Thanks,
Jintack

[1] https://paste.ubuntu.com/p/XGDKH45yt4/
[2] https://paste.ubuntu.com/p/CpbVTXJCyc/