zhanghailiang <zhang.zhanghaili...@huawei.com> wrote:
> On 2015/3/26 11:52, Li Zhijian wrote:
>> On 03/26/2015 11:12 AM, Wen Congyang wrote:
>>> On 03/25/2015 05:50 PM, Juan Quintela wrote:
>>>> zhanghailiang <zhang.zhanghaili...@huawei.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> We found that, sometimes, the content of the VM's memory is
>>>>> inconsistent between the source side and the destination side
>>>>> when we check it just after migration finishes but before the VM
>>>>> continues to run.
>>>>>
>>>>> We used a patch like the one below to find this issue (you can
>>>>> find it in the attachment). Steps to reproduce:
>>>>>
>>>>> (1) Compile QEMU:
>>>>> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
>>>>>
>>>>> (2) Command and output:
>>>>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>> qemu64,-kvmclock -netdev tap,id=hn0 -device
>>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>>> -device
>>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
>>>>> -monitor stdio
>>>> Could you try to reproduce:
>>>> - without vhost
>>>> - without virtio-net
>>>> - cache=unsafe is going to give you trouble, but trouble should only
>>>>   happen after migration of pages has finished.
>>> If I use an IDE disk, it doesn't happen.
>>> Even if I use virtio-net with vhost=on, it still doesn't happen. I guess
>>> it is because I migrate the guest while it is booting. The virtio net
>>> device is not used in this case.
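[The debug patch itself is in the attachment, but the idea it implements, hashing all of guest RAM at fixed checkpoints and comparing the digests on both sides, can be sketched in a few self-contained lines. This is a minimal Python model of the approach, not the actual patch; the real patch presumably hashes QEMU's RAMBlocks with OpenSSL's MD5, which is why `-lssl` appears in the configure line above.]

```python
import hashlib

PAGE = 4096

def ram_md5(ram: bytes) -> str:
    """Digest the whole RAM region, as the debug patch does at each checkpoint."""
    return hashlib.md5(ram).hexdigest()

ram = bytearray(PAGE * 4)          # stand-in for a tiny guest RAM block
src = ram_md5(bytes(ram))          # source: "md_host : after saving ram complete"
ram[PAGE] ^= 0xFF                  # a write that the migration bitmap never recorded
dst = ram_md5(bytes(ram))          # destination: "md_host : after loading all vmstate"
print(src != dst)                  # True: the digests diverge, as in the report
```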
>> Er~~
>> It reproduces with my IDE disk.
>> There is no virtio device at all; my command line is like below:
>>
>> x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -net none
>> -boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp 2 -machine
>> usb=off -no-user-config -nodefaults -monitor stdio -vga std
>>
>> It seems easy to reproduce this issue with the following steps in an
>> _ubuntu_ guest:
>> 1. On the source side, choose memtest in grub.
>> 2. Do live migration.
>> 3. Exit memtest (type Esc while memory is being tested).
>> 4. Wait for migration to complete.
>>
> Yes, it is a thorny problem. It is indeed easy to reproduce, just by
> following your steps above.
Thanks for the test case. I will give it a try on Monday. Now that we
have a test case, we should be able to instrument things. As the
problem shows up with memtest, it can't be the disk, clearly :p

Later, Juan.

> This is my test result (I also tested accel=tcg; it can be reproduced
> there as well):
>
> Source side:
> # x86_64-softmmu/qemu-system-x86_64 -machine
> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
> qemu64,-kvmclock -boot c -drive
> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
> cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
> (qemu) ACPI_BUILD: init ACPI tables
> ACPI_BUILD: init ACPI tables
> migrate tcp:9.61.1.8:3004
> ACPI_BUILD: init ACPI tables
> before cpu_synchronize_all_states
> 5a8f72d66732cac80d6a0d5713654c0e
> md_host : before saving ram complete
> 5a8f72d66732cac80d6a0d5713654c0e
> md_host : after saving ram complete
> 5a8f72d66732cac80d6a0d5713654c0e
> (qemu)
>
> Destination side:
> # x86_64-softmmu/qemu-system-x86_64 -machine
> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
> qemu64,-kvmclock -boot c -drive
> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
> cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
> -incoming tcp:0:3004
> (qemu) QEMU_VM_SECTION_END, after loading ram
> d7cb0d8a4bdd1557fb0e78baee50c986
> md_host : after loading all vmstate
> d7cb0d8a4bdd1557fb0e78baee50c986
> md_host : after cpu_synchronize_all_post_init
> d7cb0d8a4bdd1557fb0e78baee50c986
>
> Thanks,
> zhang
>
>>>> What kind of load did you have when reproducing this issue?
>>>> Just to confirm: you have been able to reproduce this without the COLO
>>>> patches, right?
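[A whole-RAM digest like the ones above only says that the two sides differ; hashing per page and diffing the digest lists would localize *which* pages mismatch. A minimal sketch of that next instrumentation step, in self-contained Python with hypothetical names (`page_digests` is not a QEMU function):]

```python
import hashlib

PAGE = 4096

def page_digests(ram: bytes) -> list[str]:
    """One MD5 per guest page, so a mismatch can be pinned to a page index."""
    return [hashlib.md5(ram[i:i + PAGE]).hexdigest()
            for i in range(0, len(ram), PAGE)]

src_ram = bytearray(PAGE * 4)
dst_ram = bytearray(src_ram)
dst_ram[2 * PAGE + 7] = 0x5A       # pretend page 2 arrived stale on the destination

# Page indices whose content differs between source and destination:
diff = [i for i, (s, d) in enumerate(zip(page_digests(bytes(src_ram)),
                                         page_digests(bytes(dst_ram))))
        if s != d]
print(diff)                        # [2]: only page 2 differs
```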
>>>>
>>>>> (qemu) migrate tcp:192.168.3.8:3004
>>>>> before saving ram complete
>>>>> ff703f6889ab8701e4e040872d079a28
>>>>> md_host : after saving ram complete
>>>>> ff703f6889ab8701e4e040872d079a28
>>>>>
>>>>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>> qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device
>>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>>> -device
>>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
>>>>> -monitor stdio -incoming tcp:0:3004
>>>>> (qemu) QEMU_VM_SECTION_END, after loading ram
>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>> md_host : after loading all vmstate
>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>> md_host : after cpu_synchronize_all_post_init
>>>>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>>>>
>>>>> This happens occasionally, and it is easier to reproduce when the
>>>>> migration command is issued during the VM's startup time.
>>>> OK, a couple of things. Memory doesn't have to be exactly identical.
>>>> Virtio devices in particular do funny things on "post-load". There
>>>> are no guarantees for that as far as I know; we should end up with an
>>>> equivalent device state in memory.
>>>>
>>>>> We have done further tests and found that some pages have been
>>>>> dirtied but their corresponding migration_bitmap bits are not set.
>>>>> We can't figure out which module of QEMU misses setting the
>>>>> bitmap when dirtying the VM's pages;
>>>>> it is very difficult for us to trace all the actions that dirty the
>>>>> VM's pages.
>>>> This seems to point to a bug in one of the devices.
>>>>
>>>>> Actually, the first time we found this problem was during COLO FT
>>>>> development, and it triggered some strange issues in the
>>>>> VM which all pointed to inconsistency of the VM's
>>>>> memory.
>>>>> (We have tried saving all of the VM's memory to the slave side
>>>>> every time we do a checkpoint in COLO FT, and then everything is OK.)
>>>>>
>>>>> Is it OK for some pages not to be transferred to the destination
>>>>> when doing migration? Or is it a bug?
>>>> The pages transferred should be the same; it is after device state
>>>> transmission that things could change.
>>>>
>>>>> This issue has blocked our COLO development... :(
>>>>>
>>>>> Any help will be greatly appreciated!
>>>> Later, Juan.
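[The "dirtied but bitmap bit not set" failure described above can be caught mechanically by cross-checking per-page digests against the dirty bitmap: any page whose content changed while its bit stayed clear is exactly the kind of page being hunted here. A minimal self-contained model of that cross-check, with hypothetical names (this is not QEMU's actual migration_bitmap API):]

```python
import hashlib

PAGE = 4096

def page_digests(ram: bytes) -> list[str]:
    return [hashlib.md5(ram[i:i + PAGE]).hexdigest()
            for i in range(0, len(ram), PAGE)]

ram = bytearray(PAGE * 4)
baseline = page_digests(bytes(ram))
bitmap = [False] * 4               # models the migration dirty bitmap

def guest_write(page: int, off: int, val: int, *, set_bitmap: bool = True):
    ram[page * PAGE + off] = val
    if set_bitmap:
        bitmap[page] = True        # the step a buggy device model would skip

guest_write(1, 0, 0xAB)                    # well-behaved write: bit gets set
guest_write(2, 0, 0xCD, set_bitmap=False)  # simulated bug: content changes, bit clear

# Pages whose content changed but whose dirty bit was never set:
missed = [i for i, d in enumerate(page_digests(bytes(ram)))
          if d != baseline[i] and not bitmap[i]]
print(missed)                      # [2]: the page the bitmap lost track of
```

In the real debugging scenario the same comparison would run at migration completion, flagging pages that diverge from what the bitmap claims was transferred.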