On 04/12/2016 12:05 PM, Andreas Scheuring wrote:
> Hi together,
> I wanted to start a discussion about a live migration problem that
> currently exists in the nova-neutron communication.
>
> Basics: Live migration and nova-neutron communication
> -----------------------------------------------------
> On a high level, nova live migration happens in 3 stages ("-->" marks
> what happens from the network perspective):
>
> #1 pre_live_migration
> --> libvirt driver: nova plugs the interface (for ovs hybrid, it sets
>     up the linuxbridge + veth pair and connects it to br-int)
> #2 live_migration_operation
> --> the instance is migrated (using libvirt, with the domain.xml that
>     is currently active on the migration source)
> #3 post_live_migration
> --> binding:host_id is updated for the port
> --> libvirt driver: the domain.xml is regenerated
>
> More details can be found here [1].
>
> The problem - port binding fails
> --------------------------------
> With this flow, ML2 port binding is triggered in post_live_migration.
> At this point, the instance has already been migrated and is active
> on the migration destination.
> Part of the port binding happens in the mechanism drivers, where the
> vif information for the port (vif-type, vif-details, ...) is updated.
> If this port binding fails, the port gets the binding:vif_type
> "binding_failed".
> After that, the nova libvirt driver generates the domain xml again in
> order to persist it. Part of this generation is generating the
> interface definition.
> This fails because the vif_type is "binding_failed". Nova then sets
> the instance to error state. --> There is no rollback, as it's
> already too late!
>
> Just a remark: there is no explicit check for the vif_type
> "binding_failed". I have the feeling that it (luckily) fails by
> accident when generating the xml.
>
> --> Ideally we would trigger the port binding before the migration
> starts - in pre_live_migration.
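The remark above about the missing explicit check could look roughly like
this - a minimal sketch, not nova's actual code; the port dict keys follow
the neutron port API, but the helper name and exception class are invented
for illustration:

```python
# Hypothetical fail-fast guard: check binding:vif_type explicitly
# instead of failing by accident during domain XML generation.
VIF_TYPE_BINDING_FAILED = 'binding_failed'


class PortBindingFailed(Exception):
    pass


def check_port_binding(port):
    """Raise early if the port failed to bind on its host."""
    if port.get('binding:vif_type') == VIF_TYPE_BINDING_FAILED:
        raise PortBindingFailed(
            'port %s failed to bind on host %s'
            % (port.get('id'), port.get('binding:host_id')))
    return port
```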
> Then, if binding fails, we could abort the migration before it even
> starts. The instance would still be active and fully functional on
> the source host. I have a WIP patchset out proposing this change [2].
>
>
> The impact
> ----------
> Patchset [2] proposes updating the host_id already in
> pre_live_migration. During migration, the port would already be owned
> by the migration target (although the guest is still active on the
> source).
> Technically this works fine for all the reference implementations,
> but it could be a problem for some third-party mech drivers, if they
> shut down the port on the source and activate it on the target -
> although the instance is still on the source.
>
> Any thoughts on this?
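For illustration, the flow patch [2] proposes could be sketched like this,
assuming the binding:host_id update moves into pre_live_migration. The
neutron client here is a stand-in with an update_port(port_id, body) method
shaped like python-neutronclient's; MigrationAborted is hypothetical:

```python
class MigrationAborted(Exception):
    pass


def pre_live_migration_bind(neutron, port_id, dest_host):
    """Bind the port on the target before the guest moves.

    A failed binding then aborts the migration while the instance is
    still active and fully functional on the source host.
    """
    body = {'port': {'binding:host_id': dest_host}}
    port = neutron.update_port(port_id, body)['port']
    if port.get('binding:vif_type') == 'binding_failed':
        raise MigrationAborted(
            'port %s did not bind on %s - aborting before migration'
            % (port_id, dest_host))
    return port
```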
+1 on this. Anyway, let's hear back from the third-party driver
maintainers.

>
> Additional use cases that would be enabled with this change
> -----------------------------------------------------------
> When updating the host_id in pre_live_migration, we could modify the
> domain.xml with the new vif information before live migration (see
> patch [2] and nova spec [4]).
> This enables the following use cases:
>
> #1 Live migration between nodes that run different l2 agents
> E.g. you could migrate an instance from an ovs node to an lb node and
> vice versa. This could be used as an l2 agent transition strategy!
> #2 Live migration with the macvtap agent
> It would enable the macvtap agent to live migrate instances between
> hosts that use a different physical_interface_mapping. See bug [3].
>
> --> #2 is the use case that got me thinking about this whole topic...
>
> Potential other solutions
> -------------------------
> #1 Have something like simultaneous port binding - on migration, a
> port is bound to 2 hosts (like a dvr port can be today).
> This would require some database refactoring (work has already been
> started in the DVR context [7]).
> The REST API would also need to be changed so that it returns not a
> single binding but a list of bindings - and of course create & update
> would have to handle that list as well.

I don't like this one. It would require lots of code changes and I am
not sure it would solve the problem completely. The model of having a
port bound to two hosts just because it is migrating is confusing.

> #2 Execute port binding without saving it to the db
> We could also introduce a new api (like update port, with a live
> migration flag) that would run through the port binding code and
> return the port information for the target node, but would not
> persist this information. So on port-show you would still get the
> old information.
> The update would only happen if the migration flag is not present (in
> post_live_migration, like today).
> Alternatively, the generated port binding could be stored in the port
> context and used on the final port_update, instead of running through
> all the code paths again.

Another possible solution is to apply the same strategy we use for
instance creation: nova should wait to get a confirmation from neutron
before declaring the migration successful.

cheers,
Rossella

> Other efforts in the area of nova-neutron live migration
> --------------------------------------------------------
> Just for reference, these are the other activities around nova-neutron
> live migration I'm aware of. But none of them is related to this, IMO.
>
> #1 ovs-hybrid plug: wait for the vif-plug event before doing live
> migration - see patches [5]
> --> On nova plug, nova creates the linuxbridge and the veth pair and
> plugs it into the br-int. This plug is detected by the ovs agent,
> which then reports the device as up, which in turn triggers the
> vif-plug event. This does not solve the problem, as port binding is
> not involved at all. The patch also cannot be used for lb, ovs normal
> and macvtap, as for those vif-types libvirt sets up the device the
> agent is looking for - and that happens during
> live_migration_operation.
>
> #2 Implement setup_networks_on_host for neutron networks
> Notification so that neutron sets up a DVR router attachment on the
> target node - see patch [6] + related patches
>
> #3 I also know that midonet faces some challenges during nova plug,
> but this is also a separate topic.
>
>
> Any discussion / input would be helpful, thanks a lot!
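For the record, the port-context caching variant of alternative #2
("execute port binding without saving it to the db") could be sketched
as follows. All names here are hypothetical; ML2's real PortContext and
binding logic are considerably more involved:

```python
# Sketch: compute the target host's binding without persisting it,
# cache the result, and commit it only in post_live_migration instead
# of running through all the binding code paths again.
class PortContext(object):
    def __init__(self, port):
        self.port = port
        self.cached_binding = None


def dry_run_bind(ctx, dest_host, bind_func):
    """Run the binding logic for dest_host, but do not save to the db;
    port-show would still return the old information."""
    ctx.cached_binding = bind_func(ctx.port, dest_host)
    return ctx.cached_binding


def commit_binding(ctx):
    """Called from post_live_migration: persist the cached result."""
    ctx.port.update(ctx.cached_binding)
    return ctx.port
```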
>
> [1] https://review.openstack.org/#/c/274097/6/doc/source/devref/live_migration.rst
> [2] https://review.openstack.org/297100
> [3] https://bugs.launchpad.net/neutron/+bug/1550400
> [4] https://review.openstack.org/301090
> [5] https://review.openstack.org/246898 & https://review.openstack.org/246910
> [6] https://review.openstack.org/275073
> [7] https://bugs.launchpad.net/neutron/+bug/1367391

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
