[Copy/pasting my comment from here: https://bugs.launchpad.net/nova/+bug/1737625/comments/4]
I just talked to Dave Gilbert from upstream QEMU. Overall, as I implied in comment#2, this gnarly issue requires specialized debugging, digging deep into the bowels of QEMU, 'virtio-blk' and 'virtio'.

That said, Dave notes that we get this "guest index inconsistent" error when the migrated RAM is inconsistent with the migrated 'virtio' device state. And a common case is where a 'virtio' device does an operation after the vCPU is stopped and after RAM has been transmitted.

Dave offers a guess at a potential scenario where this can occur:

  - Guest is running
  - ... live migration starts
  - ... a "block read" request gets submitted
  - ... live migration stops the vCPUs, finishes transmitting RAM
  - ... the "block read" completes, 'virtio-blk' updates pointers
  - ... live migration "serializes" the 'virtio-blk' state

So the "guest index inconsistent" state would only happen if you got unlucky with the timing of that read.

Another possibility, Dave points out, is that the guest has screwed up the device state somehow; the migration code in 'virtio' checks the state a lot. We have ruled this possibility out because the guest is just a garden-variety CirrOS instance idling; nothing special about it.

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1761798

Title:
  live migration intermittently fails in CI with "VQ 0 size 0x80 Guest index 0x12c inconsistent with Host index 0x134: delta 0xfff8"

Status in OpenStack Compute (nova):
  Confirmed
Status in QEMU:
  New

Bug description:
  Seen here:

  http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/subnode-2/libvirt/qemu/instance-00000002.txt.gz

  2018-04-05T21:48:38.205752Z qemu-system-x86_64: -chardev pty,id=charserial0,logfile=/dev/fdset/1,logappend=on: char device redirected to /dev/pts/0 (label charserial0)
  warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
  2018-04-05T21:48:43.153268Z qemu-system-x86_64: VQ 0 size 0x80 Guest index 0x12c inconsistent with Host index 0x134: delta 0xfff8
  2018-04-05T21:48:43.153288Z qemu-system-x86_64: Failed to load virtio-blk:virtio
  2018-04-05T21:48:43.153292Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:04.0/virtio-blk'
  2018-04-05T21:48:43.153347Z qemu-system-x86_64: load of migration failed: Operation not permitted
  2018-04-05 21:48:43.198+0000: shutting down, reason=crashed

  And in the n-cpu logs on the other host:

  http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/screen-n-cpu.txt.gz#_Apr_05_21_48_43_257541

  There is a related Red Hat bug:

  https://bugzilla.redhat.com/show_bug.cgi?id=1450524

  The CI job failures are at present using the Pike UCA:

  ii  libvirt-bin      3.6.0-1ubuntu6.2~cloud0
  ii  qemu-system-x86  1:2.10+dfsg-0ubuntu3.5~cloud0

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1761798/+subscriptions
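
For anyone puzzling over the numbers in the "VQ 0 size 0x80 ..." message: the sketch below reproduces the reported delta with plain 16-bit wraparound arithmetic. It is only an illustration, not QEMU's code. The function names are made up, and it assumes (my reading of the message, not something confirmed in this bug) that the check amounts to "guest index minus host index, modulo 2^16, must not exceed the queue size".

```python
# Illustrative sketch only; index_delta() and is_consistent() are
# hypothetical helper names, not functions from QEMU.
MASK16 = 0xFFFF  # virtqueue ring indices are free-running 16-bit counters


def index_delta(guest_idx, host_idx):
    """Distance from the host index to the guest index, modulo 2**16."""
    return (guest_idx - host_idx) & MASK16


def is_consistent(guest_idx, host_idx, queue_size):
    """Assumed sanity rule: the guest can be ahead by at most one full queue."""
    return index_delta(guest_idx, host_idx) <= queue_size


# Values taken from the error message in this bug.
delta = index_delta(0x12C, 0x134)
print(hex(delta))                          # 0xfff8
print(is_consistent(0x12C, 0x134, 0x80))   # False
```

Under that reading, the host index is 8 entries ahead of the guest index found in the migrated RAM, so the subtraction wraps to 0xfff8, which is far larger than the queue size of 0x80. That would fit the scenario where the device state moved on after RAM had already been transmitted.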