* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
> On 2015/3/27 18:18, Dr. David Alan Gilbert wrote:
> >* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
> >>On 2015/3/26 11:52, Li Zhijian wrote:
> >>>On 03/26/2015 11:12 AM, Wen Congyang wrote:
> >>>>On 03/25/2015 05:50 PM, Juan Quintela wrote:
> >>>>>zhanghailiang<zhang.zhanghaili...@huawei.com> wrote:
> >>>>>>Hi all,
> >>>>>>
> >>>>>>We found that, sometimes, the content of the VM's memory is inconsistent
> >>>>>>between the Source side and the Destination side
> >>>>>>when we check it just after migration finishes but before the VM
> >>>>>>continues to run.
> >>>>>>
> >>>>>>We used a patch like the one below to find this issue; you can find it
> >>>>>>in the attachment.
> >>>>>>Steps to reproduce:
> >>>>>>
> >>>>>>(1) Compile QEMU:
> >>>>>> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" &&
> >>>>>> make
> >>>>>>
> >>>>>>(2) Command and output:
> >>>>>>SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
> >>>>>>qemu64,-kvmclock -netdev tap,id=hn0 -device
> >>>>>>virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
> >>>>>>file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> >>>>>> -device
> >>>>>>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
> >>>>>>-monitor stdio
> >>>>>Could you try to reproduce:
> >>>>>- without vhost
> >>>>>- without virtio-net
> >>>>>- cache=unsafe is going to give you trouble, but trouble should only
> >>>>>  happen after migration of pages has finished.
> >>>>If I use an IDE disk, it doesn't happen.
> >>>>Even if I use virtio-net with vhost=on, it still doesn't happen. I guess
> >>>>it is because I migrate the guest when it is booting. The virtio net
> >>>>device is not used in this case.
> >>>Er...
> >>>it reproduces with my IDE disk.
> >>>There is no virtio device at all; my command line is as below:
> >>>
> >>>x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -net
> >>>none
> >>>-boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp 2 -machine
> >>>usb=off -no-user-config -nodefaults -monitor stdio -vga std
> >>>
> >>>It seems easy to reproduce this issue with the following steps in an
> >>>_ubuntu_ guest:
> >>>1. on the source side, choose memtest in grub
> >>>2. do live migration
> >>>3. exit memtest (press Esc while memory is being tested)
> >>>4. wait for migration to complete
> >>>
> >>
> >>Yes, it is a thorny problem. It is indeed easy to reproduce, just by
> >>following your steps above.
> >>
> >>This is my test result: (I also tested accel=tcg; it can be reproduced
> >>there too.)
> >>Source side:
> >># x86_64-softmmu/qemu-system-x86_64 -machine
> >>pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
> >>qemu64,-kvmclock -boot c -drive
> >>file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
> >>cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
> >>(qemu) ACPI_BUILD: init ACPI tables
> >>ACPI_BUILD: init ACPI tables
> >>migrate tcp:9.61.1.8:3004
> >>ACPI_BUILD: init ACPI tables
> >>before cpu_synchronize_all_states
> >>5a8f72d66732cac80d6a0d5713654c0e
> >>md_host : before saving ram complete
> >>5a8f72d66732cac80d6a0d5713654c0e
> >>md_host : after saving ram complete
> >>5a8f72d66732cac80d6a0d5713654c0e
> >>(qemu)
> >>
> >>Destination side:
> >># x86_64-softmmu/qemu-system-x86_64 -machine
> >>pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
> >>qemu64,-kvmclock -boot c -drive
> >>file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
> >>cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
> >>-incoming tcp:0:3004
> >>(qemu) QEMU_VM_SECTION_END, after loading ram
> >>d7cb0d8a4bdd1557fb0e78baee50c986
> >>md_host : after loading all vmstate
> >>d7cb0d8a4bdd1557fb0e78baee50c986
> >>md_host : after cpu_synchronize_all_post_init
> >>d7cb0d8a4bdd1557fb0e78baee50c986
> >
> >Hmm, that's not good. I suggest you md5 each of the RAMBlocks individually,
> >to see if it's main RAM that's different or something more subtle like
> >video RAM.
> >
>
> Er, all my previous tests md5 the 'pc.ram' block only.
>
> >But then maybe it's easier just to dump the whole of RAM to file
> >and byte compare it (hexdump the two dumps and diff ?)
>
> Hmm, we also used the memcmp function to compare every page, but the
> addresses seem to be random.
>
> Besides, in our previous tests, we found it seems to be easier to reproduce
> when migration occurs during the VM's start-up or reboot process.
>
> Is it possible that some devices get special treatment during VM start-up
> and may miss setting the dirty bitmap?

I don't think there should be, but the code paths used during startup are
probably much less tested with migration. I'm sure the startup code uses
different parts of the device emulation. I do know we have some bugs filed
against migration during Windows boot; I'd not considered that it might be
devices not updating the bitmap.

Dave

>
>
> Thanks,
> zhanghailiang
>
> >>>>
> >>>>>What kind of load were you having when reproducing this issue?
> >>>>>Just to confirm, you have been able to reproduce this without COLO
> >>>>>patches, right?
> >>>>>
> >>>>>>(qemu) migrate tcp:192.168.3.8:3004
> >>>>>>before saving ram complete
> >>>>>>ff703f6889ab8701e4e040872d079a28
> >>>>>>md_host : after saving ram complete
> >>>>>>ff703f6889ab8701e4e040872d079a28
> >>>>>>
> >>>>>>DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
> >>>>>>qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device
> >>>>>>virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
> >>>>>>file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> >>>>>> -device
> >>>>>>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
> >>>>>>-monitor stdio -incoming tcp:0:3004
> >>>>>>(qemu) QEMU_VM_SECTION_END, after loading ram
> >>>>>>230e1e68ece9cd4e769630e1bcb5ddfb
> >>>>>>md_host : after loading all vmstate
> >>>>>>230e1e68ece9cd4e769630e1bcb5ddfb
> >>>>>>md_host : after cpu_synchronize_all_post_init
> >>>>>>230e1e68ece9cd4e769630e1bcb5ddfb
> >>>>>>
> >>>>>>This happens occasionally, and it is easier to reproduce when the
> >>>>>>migration command is issued during the VM's startup.
> >>>>>OK, a couple of things. Memory doesn't have to be exactly identical.
> >>>>>Virtio devices in particular do funny things on "post-load". There
> >>>>>aren't guarantees for that as far as I know; we should end with an
> >>>>>equivalent device state in memory.
> >>>>>
> >>>>>>We have done further tests and found that some pages have been dirtied
> >>>>>>but their corresponding migration_bitmap bits are not set.
> >>>>>>We can't figure out which module of QEMU misses setting the bitmap
> >>>>>>when it dirties a page of the VM;
> >>>>>>it is very difficult for us to trace all the actions that dirty the
> >>>>>>VM's pages.
> >>>>>This seems to point to a bug in one of the devices.
> >>>>>
> >>>>>>Actually, the first time we found this problem was in the COLO FT
> >>>>>>development, and it triggered some strange issues in the
> >>>>>>VM which all pointed to inconsistency of the VM's memory. (We
> >>>>>>have tried saving all memory of the VM to the slave side every time
> >>>>>>we do a checkpoint in COLO FT, and then everything is OK.)
> >>>>>>
> >>>>>>Is it OK for some pages not to be transferred to the destination
> >>>>>>during migration? Or is it a bug?
> >>>>>Pages transferred should be the same; after device state transmission is
> >>>>>when things could change.
> >>>>>
> >>>>>>This issue has blocked our COLO development... :(
> >>>>>>
> >>>>>>Any help will be greatly appreciated!
> >>>>>Later, Juan.
> >>>>>
> >>>>.
> >>>>
> >>>
> >>>
> >>
> >>
> >--
> >Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> >
> >.
> >
>
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
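
For reference, a minimal sketch of the dump-and-compare check discussed in the quoted text above (dump RAM on both sides, then compare page by page). This is only an illustration, not the patch used in the thread: it assumes both sides have already written the same RAM region to flat files by some means, that pages are 4 KiB, and the function names diff_pages and md5_of are hypothetical.

```python
import hashlib

PAGE_SIZE = 4096  # assumption: 4 KiB guest pages (x86_64)


def diff_pages(src_path, dst_path, page_size=PAGE_SIZE):
    """Return the byte offsets of every page that differs between two dumps.

    Both files are assumed to be flat dumps of the same RAM region, taken
    after migration completes and before the destination VM resumes.
    """
    differing = []
    with open(src_path, "rb") as src, open(dst_path, "rb") as dst:
        offset = 0
        while True:
            a = src.read(page_size)
            b = dst.read(page_size)
            if not a and not b:
                break  # both dumps exhausted
            if a != b:
                differing.append(offset)
            offset += page_size
    return differing


def md5_of(path):
    """MD5 of a whole dump, comparable to the per-block digests in the logs."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

Calling diff_pages("src.ram", "dst.ram") (file names are placeholders) gives the list of differing page offsets, which can then be checked against the dirty bitmap to see which pages were modified without being marked dirty.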