Hi all, I just spent a fair bit of time debugging a weird error, and now that I've solved it, I wanted to share it on the list so that it is archived. With luck, it will save someone else some heartache. No replies are expected. :)
Environment:

* Anvil m2 (RHEL 6.8, cman+rgmanager+kvm+drbd+clvmd, fully updated)
* Guest VM OS - Win2012 R2 64-bit

When I tried to live-migrate the server, rgmanager failed with:

[root@an-a07n02 ~]# clusvcadm -M Windows-Server-2012-R2 -m an-a07n02.alteeve.ca
Trying to migrate service:Windows-Server-2012-R2 to an-a07n02.alteeve.ca...Failed; service running on original owner

/var/log/messages showed:

====
Oct 4 19:15:05 an-a07n01 rgmanager[4213]: Migrating vm:Windows-Server-2012-R2 to an-a07n02.alteeve.ca
Oct 4 19:15:41 an-a07n01 rgmanager[7588]: [vm] Migrate Windows-Server-2012-R2 to an-a07n02.alteeve.ca failed:
Oct 4 19:15:41 an-a07n01 rgmanager[7610]: [vm] error: Unable to read from monitor: Connection reset by peer
Oct 4 19:15:41 an-a07n01 rgmanager[4213]: migrate on vm "Windows-Server-2012-R2" returned 150 (unspecified)
Oct 4 19:15:41 an-a07n01 rgmanager[4213]: Migration of vm:Windows-Server-2012-R2 to an-a07n02.alteeve.ca failed; return code 150
====

I disabled the VM in rgmanager, manually booted it using virsh and tried to live migrate it directly. Note that I booted the server on node 2 fine, and was trying to migrate from 2 -> 1. Note also that '--unsafe' is required because nodes using 4 KiB sector disks can't use 'cache="none"' in KVM/qemu (so we set 'writethrough' instead, which is still safe).

[root@an-a07n02 ~]# virsh migrate --live Windows-Server-2012-R2 qemu+ssh://an-a07n01.alteeve.ca/system --unsafe
error: Unable to read from monitor: Connection reset by peer

In the qemu log file:

====
2016-10-05 16:11:19.948+0000: starting up
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name Windows-Server-2012-R2 -S -M rhel6.6.0 -cpu SandyBridge,+erms,+smep,+fsgsbase,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm -m 16384 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid be69b994-0f70-ccf3-2934-43eb4a4b795b -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/Windows-Server-2012-R2.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -no-reboot -no-shutdown -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/shared/files/Windows_2012_R2_64-bit_eval.iso,if=none,media=cdrom,id=drive-ide0-0-0,readonly=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=2 -drive file=/shared/files/virtio-win-0.1.102.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/dev/an-a07n01_vg0/Windows-Server-2012-R2_0,if=none,id=drive-virtio-disk0,format=raw,cache=writethrough,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/an-a07n01_vg0/Windows-Server-2012-R2_1,if=none,id=drive-virtio-disk1,format=raw,cache=writethrough,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:80:2d:e0,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -device usb-tablet,id=input0 -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -vga qxl -global qxl-vga.ram_size=67108864 -global qxl-vga.vram_size=67108864 -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
char device redirected to /dev/pts/0
Features 0x20000250 unsupported. Allowed features: 0x71000454
qemu: warning: error while loading state for instance 0x0 of device '0000:00:06.0/virtio-blk'
load of migration failed
2016-10-05 16:11:31.503+0000: shutting down
====

The key here was "qemu: warning: error while loading state for instance 0x0 of device '0000:00:06.0/virtio-blk'". There was precious little matching this on Google.

I could see no problems with the XML definition or with the backing LVs (there are two on this VM, and the LVs are passed up raw to the guest). Inside the guest OS, I could see no problems either. I could, as mentioned above, boot the server on both nodes, but I could not live migrate.

I got to the point where I started throwing things against the wall out of desperation. One of those was to try updating the virtio-block drivers on the guest. The guest was built with the 0.1.102 virtio stable drivers, and the latest stable is now 0.1.126. So I updated the drivers in Device Manager and voila! Migration started working.

We have many Win2012 R2 guests out in production, and many are using the .102 drivers. So I have a feeling that it wasn't so much the upgrade that made the difference, but instead the reinstall of the drivers.

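For anyone who lands here with a similar log, the "Features ... unsupported. Allowed features: ..." line just before the warning can be compared bit by bit. Below is a rough Python sketch (not part of the original debugging, and the function names are made up for illustration) that pulls the failing device and the two masks out of a qemu log and lists which requested bits fall outside the allowed mask. It assumes the two messages appear in the log with the same wording as in the excerpt above. For the values in this log, the only bit outside the allowed mask is bit 9; the log itself does not say which feature that bit represents.

```python
import re

# Patterns match the wording seen in the qemu log excerpt above.
FEATURES_RE = re.compile(
    r"Features (0x[0-9a-fA-F]+) unsupported\. Allowed features: (0x[0-9a-fA-F]+)"
)
DEVICE_RE = re.compile(
    r"error while loading state for instance \S+ of device '([^']+)'"
)


def unsupported_bits(requested, allowed):
    """Return the bit positions set in 'requested' but not in 'allowed'."""
    extra = requested & ~allowed
    return [bit for bit in range(extra.bit_length()) if (extra >> bit) & 1]


def scan_log(text):
    """Hypothetical helper: find the failing device and the offending bits."""
    features = FEATURES_RE.search(text)
    device = DEVICE_RE.search(text)
    result = {"device": device.group(1) if device else None, "bits": None}
    if features:
        requested = int(features.group(1), 16)
        allowed = int(features.group(2), 16)
        result["bits"] = unsupported_bits(requested, allowed)
    return result


# The two relevant lines copied from the log in this post.
sample = (
    "Features 0x20000250 unsupported. Allowed features: 0x71000454\n"
    "qemu: warning: error while loading state for instance 0x0 of device "
    "'0000:00:06.0/virtio-blk'\n"
)

print(scan_log(sample))
```

In practice you would feed it the contents of the guest's qemu log on the migration target instead of the 'sample' string.
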
I have no idea why this bug happened, but hopefully this might save someone some grief in the future if they hit the same.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org