Public bug reported: VMs inaccessible after live migration on certain Arista VXLAN Flood and Learn fabrics

Description
===========
This is not a Nova bug per se, but rather an issue with Arista and potentially other network fabrics.

I have observed a case where VMs are inaccessible to network traffic after live migrating on certain fabrics, in this case Arista VXLAN, despite the hypervisor sending out a number of gratuitous ARP (GARP) packets following the live migration.

This was observed on an Arista VXLAN fabric when live migrating a VM between hypervisors on two different switches; a live migration between two hypervisors on the same switch is not affected. In both cases I can see GARPs on the wire triggered by the live migration, and these packets have been observed arriving at other hypervisors and even at other VMs in the same VLAN on different hypervisors. The VM becomes accessible again after a period of time, at the point the switch ARP aging timer expires and the MAC is re-learnt on the correct switch.

This occurs on any VM - even a simple c1.m1 with no active workload, backed by Ceph storage.
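For reference, the GARPs mentioned above can be confirmed on the wire with a capture along these lines (the interface name is a placeholder for your environment; note that, depending on QEMU version, the host-side self-announce may appear as RARP frames rather than gratuitous ARP, so both are captured here):

    # On a hypervisor or another host in the same VLAN, watch for the
    # announcements emitted during the live migration and check the
    # sender MAC/IP against the migrating VM (bond0.123 is a placeholder):
    tcpdump -e -n -i bond0.123 'arp or rarp'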
Steps to Reproduce
===========
To try to prevent this from happening, I have tested the "libvirt: Add announce-self post live-migration workaround" patch [0] - despite this, the issue was still observed.

Create VM: c1.m1 or similar, CentOS 7 or CentOS 8 - Ceph storage, no active or significant load on the VM

Run: `ping VM_IP | while read pong; do echo "$(date): $pong"; done`

Then: `openstack server migrate --live TARGET_HOST VM_INSTANCE`

Expected result
===============
VM live migrates and is accessible within a reasonable (<10 second) timeframe.

Actual result
=============
VM live migrates successfully; ping fails until the switch ARP timer resets (in our environment, 60-180 seconds).

Despite efforts from us and our network team, we have been unable to determine why the VM is inaccessible. What we have noticed is that sending a further number of announce_self commands to the QEMU monitor, triggering more GARPs, gets the VM into an accessible state in an acceptable time of <5 seconds.
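As a manual illustration of that observation, the extra announcements can be sent from the destination hypervisor like so (the libvirt domain name is hypothetical, and announce_self is the HMP spelling of the QMP announce-self command, available from QEMU 4.0):

    # Send a few extra self-announcements for the migrated instance via
    # the QEMU human monitor (instance-000002ab is a hypothetical
    # libvirt domain name):
    for i in 1 2 3; do
        virsh qemu-monitor-command --hmp instance-000002ab announce_self
        sleep 1
    done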
Environment
=============
Arista EOS 4.26M VXLAN fabric
OpenStack Nova Train, Ussuri, Victoria (with and without patch [0])
Ceph Nautilus
OpenStack provider networking, using VLANs

Patch/Workaround
=============
I have a follow-up workaround patch, building on the announce-self patch [0], which we have been running in our production deployment. This patch adds two configurable options and the associated code:

`enable_qemu_monitor_announce_max_retries` - this will call announce_self a further n times, triggering more GARP packets to be sent.

`enable_qemu_monitor_announce_retry_interval` - this is the delay, in seconds, between the additional announce_self calls configured in the option above.
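For illustration only, with the settings that worked best for us (see below), the options might look like this in nova.conf - assuming they sit in the [workarounds] group alongside enable_qemu_monitor_announce_self from [0]; the final names may change during review:

    [workarounds]
    # Existing option added by the announce-self patch [0]:
    enable_qemu_monitor_announce_self = True
    # Proposed follow-up options (names as described above):
    enable_qemu_monitor_announce_max_retries = 3
    enable_qemu_monitor_announce_retry_interval = 1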
My tests of nearly 5000 live migrations show that the optimal settings in our environment are 3 additional calls to qemu_announce_self with a 1 second delay - this gets our VMs accessible in 2 or 3 seconds in the vast majority of cases, and 99% within 5 seconds after they stop responding to ping (the point at which we determine they are inaccessible).

I shall be submitting this patch for review by the Nova community.

0: https://opendev.org/openstack/nova/commit/9609ae0bab30675e184d1fc63aec849c1de020d0

** Affects: nova
   Importance: Undecided
   Status: New

** Tags: live-migration