Regarding the VM resuming before the 'brctl addif': the libvirt XML we generate contains the name of the Linux bridge the tap should be added to, so the Linux bridge agent does not actually need to run the 'brctl addif' command itself. The tap should already be a member of that bridge when the VM is resumed.
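For illustration, the relevant part of the generated libvirt domain XML looks roughly like this (bridge, tap, and MAC names taken from the logs below; a sketch of the shape, not the exact XML nova generated):

```xml
<interface type='bridge'>
  <!-- libvirt creates the tap device and attaches it to this bridge
       itself when the domain starts/resumes, so no separate
       'brctl addif' by the agent is needed for the tap -->
  <source bridge='brq49b34298-a8'/>
  <target dev='tapb3526533-dc'/>
  <mac address='fa:16:3e:e1:50:ac'/>
  <model type='virtio'/>
</interface>
```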
It looks like the VM paused on the source compute node at:

  2020-05-26 14:56:05.246 7 INFO nova.compute.manager [req-fe4495ae-f1a7-4e93-871e-4d034098babd - - - - -] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] VM Paused (Lifecycle Event)

and it resumed on the dest at:

  2020-05-26 14:56:05.303 6 INFO nova.compute.manager [req-8cd4842d-4783-42b3-9a8e-c4e757b8e6f0 - - - - -] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] VM Resumed (Lifecycle Event)

Previously, at:

  2020-05-26 14:55:54.639 6 DEBUG nova.virt.libvirt.driver [req-28bba2f2-89d9-4cbb-8b6a-7a9690469c86 b114d7969c0e465fbd15c2911ca4bb23 28e6517b7d6d4064be1bc878b590c40c - default default] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] Plugging VIFs before live migration. pre_live_migration /var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:7621

we started pre-plugging the network backend on the dest node, which completed successfully at:

  2020-05-26 14:55:58.769 6 INFO os_vif [req-28bba2f2-89d9-4cbb-8b6a-7a9690469c86 b114d7969c0e465fbd15c2911ca4bb23 28e6517b7d6d4064be1bc878b590c40c - default default] Successfully plugged vif VIFBridge(active=True,address=fa:16:3e:e1:50:ac,bridge_name='brq49b34298-a8',has_traffic_filtering=True,id=b3526533-dc6a-4174-bd3b-c300e78eda62,network=Network(49b34298-a85a-42a9-b264-b3a9242fef8f),plugin='linux_bridge',port_profile=,preserve_on_delete=False,vif_name='tapb3526533-dc')

This happens before we call libvirt to migrate the instance, so at this point os-vif has ensured the Linux bridge "brq49b34298-a8" is created, so that when the VM starts the tap is created directly in the correct bridge.
At:

  2020-05-26 14:56:02.172 7 DEBUG nova.compute.manager [req-40866f62-6362-4e9a-910e-365c3452d29f 030ec97d13dd4d9698209595a7ac01c4 ef210c7d6b2146139a9c94ef790081d8 - default default] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] Received event network-changed-b3526533-dc6a-4174-bd3b-c300e78eda62 external_instance_event /var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py:8050

we received a network-changed event, and then at:

  2020-05-26 14:56:04.314 7 DEBUG nova.compute.manager [req-379d2410-5959-45ae-89ce-649dca3ed666 030ec97d13dd4d9698209595a7ac01c4 ef210c7d6b2146139a9c94ef790081d8 - default default] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] Received event network-vif-plugged-b3526533-dc6a-4174-bd3b-c300e78eda62 external_instance_event /var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py:8050

we received a network-vif-plugged event, which should ideally only be sent by the ml2 driver when the l2 agent has finished wiring up the networking on the destination node. As you pointed out, the l2 agent did not finish adding the VLAN subport to the correct bridge until 19 seconds later:

  2020-05-26 14:56:25.743 6 DEBUG neutron.agent.linux.utils [req-fcca2dcc-5578-4827-ae05-d10935d35223 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'brctl', 'addif', 'brq49b34298-a8', 'p2p2.64'] create_process /var/lib/kolla/venv/lib/python2.7/site-packages/neutron/agent/linux/utils.py:87

So I think the issue is that the Linux bridge ml2 driver is not sending plug-time network-vif-plugged events but is instead sending bind-time events.
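To make the event flow concrete, here is a simplified model of how nova waits for the external network-vif-plugged event before proceeding (illustrative only; class and method names are mine, not nova's, but the prepare/notify/wait shape and the "unexpected event" case mirror the behavior seen in the logs):

```python
import threading


class VifPlugWaiter:
    """Toy model of nova waiting for neutron's network-vif-plugged event."""

    def __init__(self):
        self._events = {}          # vif_id -> threading.Event
        self._lock = threading.Lock()

    def prepare(self, vif_id):
        # Called before kicking off the work that should emit the event
        # (e.g. before pre_live_migration plugs the VIFs on the dest).
        with self._lock:
            self._events[vif_id] = threading.Event()

    def notify(self, vif_id):
        # Called when neutron delivers network-vif-plugged for vif_id.
        with self._lock:
            ev = self._events.get(vif_id)
        if ev is None:
            # Nobody is waiting for this event: this corresponds to the
            # "Received unexpected event network-vif-plugged-..." warning.
            return False
        ev.set()
        return True

    def wait(self, vif_id, timeout):
        # Blocks until notify() fires or the timeout (vif_plugging_timeout
        # in the real config) expires; returns True if the event arrived.
        with self._lock:
            ev = self._events[vif_id]
        return ev.wait(timeout)
```

The key point is that wait() returns as soon as *any* network-vif-plugged event for that VIF arrives. If neutron emits the event at bind time, before the wiring is actually finished, nova unblocks and resumes the VM too early.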
We wait for the networking to be configured here:

https://github.com/openstack/nova/blob/stable/queens/nova/compute/manager.py#L6420-L6425

which waits for the network-vif-plugged event I showed in the log:

https://github.com/openstack/nova/blob/bea91b8d58d909852949726296149d93f2c639d5/nova/compute/manager.py#L6352-L6362

before actually starting the migration here:

https://github.com/openstack/nova/blob/stable/queens/nova/compute/manager.py#L6467-L6470

The Linux bridge l2 agent should only notify nova that the interface is plugged when the tap is fully wired up:

https://github.com/openstack/neutron/blob/4acc6843e849e98cd04a6d01861555c3e120f081/neutron/plugins/ml2/drivers/agent/_common_agent.py#L303-L306

but as the comment suggests, this behavior is racy:

https://github.com/openstack/neutron/blob/4acc6843e849e98cd04a6d01861555c3e120f081/neutron/plugins/ml2/drivers/agent/_common_agent.py#L259-L296

In this case the agent started ensuring the bridge had connectivity to the physical network at:

  2020-05-26 14:56:17.576 6 DEBUG neutron.plugins.ml2.drivers.linuxbridge.agent.linuxbridge_neutron_agent [req-fcca2dcc-5578-4827-ae05-d10935d35223 - - - - -] Creating subinterface p2p2.64 for VLAN 64 on interface p2p2 ensure_vlan /var/lib/kolla/venv/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:303

which is long after the VM resumed at 2020-05-26 14:56:05.303. With that in mind, I think this is a neutron bug: neutron is reporting the network as plugged before it actually is. I should note that the network-vif-plugged event I noted in the log was "unexpected":

  2020-05-26 14:56:04.316 7 WARNING nova.compute.manager [req-379d2410-5959-45ae-89ce-649dca3ed666 030ec97d13dd4d9698209595a7ac01c4 ef210c7d6b2146139a9c94ef790081d8 - default default] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] Received unexpected event network-vif-plugged-b3526533-dc6a-4174-bd3b-c300e78eda62 for instance with vm_state active and task_state migrating.
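The race can be summarized as a toy timeline (illustrative only; step names paraphrase the log entries above, and the helper functions are mine, not neutron's): the VM resumes as soon as nova sees network-vif-plugged, but external connectivity only exists once the physical VLAN subinterface is a member of the bridge.

```python
def racy_agent_timeline():
    """Order of operations observed in this bug's logs."""
    return [
        "tap added to brq bridge",              # done by libvirt/os-vif
        "send network-vif-plugged to nova",     # sent too early (bind time)
        "nova resumes VM on destination",       # 14:56:05 in the log
        "create VLAN subinterface p2p2.64",     # 14:56:17 in the log
        "brctl addif brq49b34298-a8 p2p2.64",   # 14:56:25 in the log
    ]


def connectivity_at_resume(timeline):
    """True only if the physical VLAN subinterface was in the bridge
    before the VM resumed.  If not, the RARP announcements the migrated
    VM sends on resume never reach the physical network, and the switch
    keeps forwarding traffic toward the source host."""
    resume = timeline.index("nova resumes VM on destination")
    wired = timeline.index("brctl addif brq49b34298-a8 p2p2.64")
    return wired < resume
```

With the observed ordering, connectivity_at_resume() is False, which matches the reporter's symptom of the ping breaking until the bridge is fully wired.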
This means nova had already received a network-vif-plugged event earlier, while it was waiting for the networking on the destination to be completed, but I think that unexpected plug event was the actual event that should have started the migration. That is still significantly before neutron actually finished wiring up the networking, which completed at:

  2020-05-26 14:56:21.061 6 DEBUG neutron.plugins.ml2.drivers.linuxbridge.agent.linuxbridge_neutron_agent [req-fcca2dcc-5578-4827-ae05-d10935d35223 - - - - -] Done creating subinterface p2p2.64 ensure_vlan /var/lib/kolla/venv/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:317

Neutron should not have sent any network-vif-plugged event until at least the physical interface's VLAN subport was added to the bridge. I don't think this is actually a nova bug; I suspect it is a neutron bug, likely due to the changes that were introduced for multiple port binding.

** Also affects: neutron
   Importance: Undecided
       Status: New

** Changed in: nova
       Status: Incomplete => New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1880389

Title:
  lost net connection when live migration

Status in neutron:
  New
Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===========
  I find the VM loses its network connection during live migration. I enabled live_migration_wait_for_vif_plug and set vif_plugging_timeout=120. My OpenStack version is Queens, and I used Linux bridge as my L2 plugin; the physical adapters of my hosts are in a bond0. When the live migration was done, I found the ping to the VM was broken, and when I used tcpdump to capture packets I found that packets from the switch still reached the destination host.

  Steps to reproduce
  ==================
  1.
enable live_migration_wait_for_vif_plug = True and vif_plugging_timeout = 120 (remember to restart the linuxbridge-agent on the source and destination hosts)
  2. create a new VM, and then ping the VM created in the previous step
  3. do the live migration (before the live migration you need to make sure the VLAN sub-interface of the network the VM is attached to is not present on the dest host)
  4. during the live migration, you will find that the ping to the VM breaks

  Expected result
  ===============
  The ping should not break during live migration.

  Actual result
  =============
  The ping broke, and the ping packets from the physical switch were still sent to the source.

  Environment
  ===========
  1. Exact version of OpenStack you are running (see http://docs.openstack.org/releases/ for all releases): Queens
  2. Which hypervisor did you use? Libvirt + KVM
  3. Which storage type did you use? Ceph
  4. Which networking type did you use? Neutron with LinuxBridge; the network type of the VM is VLAN.

  I also found that the five RARP packets from the VM were sent before the VLAN sub-interface was inserted into the Linux bridge during the live migration; maybe this will help us pin down the problem.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1880389/+subscriptions
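For reference, the two options mentioned in the reproduction steps are set in nova.conf on the compute nodes. A sketch of the reporter's configuration (section placement can vary by release; vif_plugging_timeout is a [DEFAULT] option, and live_migration_wait_for_vif_plug is a [compute] option in recent releases):

```ini
[DEFAULT]
# Seconds nova waits for neutron's network-vif-plugged event before
# timing out (0 disables the timeout)
vif_plugging_timeout = 120

[compute]
# Wait for the network-vif-plugged event from neutron on the destination
# before starting the live migration
live_migration_wait_for_vif_plug = True
```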