Public bug reported:

Currently the libvirt driver's approach to live migration is best characterized as "launch & pray". It starts the live migration operation and then unconditionally waits for it to finish. It makes no attempt to tune the migration's behaviour (for example, by changing the maximum permitted downtime), it does not look at the data transfer statistics to check whether the migration is making progress, and it has no overall timeout.
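The kind of monitoring loop described above is straightforward to sketch with libvirt's job-statistics API. The snippet below is an illustrative sketch only, not Nova's or VDSM's actual code: it polls `virDomain.jobStats()` for the bytes-remaining counter and aborts the migration once that counter has gone too many consecutive polls without setting a new low, which is roughly the shape of the VDSM logic referenced in this report. The `stall_limit` and `poll_interval` names are invented for the example.

```python
import time


class MigrationWatchdog:
    """Track a migration's bytes-remaining counter and flag a stall.

    A migration is considered stalled once `stall_limit` consecutive
    samples fail to set a new low-water mark for data remaining.
    """

    def __init__(self, stall_limit=10):
        self.stall_limit = stall_limit
        self.lowest_remaining = None
        self.stalled_cycles = 0

    def sample(self, data_remaining):
        """Record one poll; return True when the migration should be aborted."""
        if self.lowest_remaining is None or data_remaining < self.lowest_remaining:
            self.lowest_remaining = data_remaining  # progress: reset the counter
            self.stalled_cycles = 0
        else:
            self.stalled_cycles += 1
        return self.stalled_cycles >= self.stall_limit


def watch_migration(dom, poll_interval=2, stall_limit=10):
    """Poll a migration already running on `dom` (a libvirt.virDomain)
    and abort it if it stops making progress."""
    import libvirt  # deferred so the pure stall logic above has no libvirt dependency

    watchdog = MigrationWatchdog(stall_limit)
    while True:
        stats = dom.jobStats()
        remaining = stats.get(libvirt.VIR_DOMAIN_JOB_DATA_REMAINING)
        if remaining is None:  # no data counter: job finished or not yet reporting
            return
        if watchdog.sample(remaining):
            # A fancier policy could first call dom.migrateSetMaxDowntime()
            # to trade a longer pause for convergence before giving up.
            dom.abortJob()
            return
        time.sleep(poll_interval)
```

Keeping the stall-detection policy separate from the libvirt calls also makes it easy to experiment with other policies (e.g. progressively raising max downtime) once real-world feedback arrives.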
It is not uncommon for guests to have workloads that will preclude live migration from ever completing: they can dirty guest RAM (or block devices) faster than the network can transfer it to the destination host. In such a case Nova will just leave the migration running, burning host CPU cycles and thrashing network bandwidth until the end of the universe. There are many features exposed by libvirt that Nova could use to do a better job, but the question is obviously which features, and how they should be used.

Fortunately, Nova is not the first project to come across this problem. The oVirt data center management project has exactly the same problem, so rather than trying to invent new logic for Nova, we should, as an immediate bug-fix task, simply copy the oVirt logic from VDSM:

  https://github.com/oVirt/vdsm/blob/master/vdsm/virt/migration.py#L430

If we get this out to users and then get real-world feedback on how it operates, we will have an idea of how and where to focus future ongoing efforts.

** Affects: nova
   Importance: High
   Assignee: Daniel Berrange (berrange)
   Status: In Progress

** Changed in: nova
   Importance: Undecided => High

** Changed in: nova
   Assignee: (unassigned) => Daniel Berrange (berrange)

** Changed in: nova
   Status: New => Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1429220

Title:
  libvirt does not ensure live migration will eventually complete (or abort)

Status in OpenStack Compute (Nova):
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1429220/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp