Public bug reported: VMs inaccessible after live migration on certain Arista VXLAN Flood and Learn fabrics

Description
===========
This is not a Nova bug per se, but rather an issue with Arista and potentially other network fabrics.

I have observed a case where VMs are inaccessible to network traffic after live migrating on certain fabrics, in this case Arista VXLAN, despite the hypervisor sending out a number of gratuitous ARP (GARP) packets following the live migration.

This was observed on an Arista VXLAN fabric when live migrating a VM between hypervisors on two different switches; a live migration between two hypervisors on the same switch is not affected. In both cases I can see GARPs on the wire triggered by the live migration, and these packets have been observed arriving at other hypervisors and even at other VMs in the same VLAN on different hypervisors. The VM becomes accessible again after a period of time, at the point the switch ARP aging timer expires and the MAC is re-learnt on the correct switch.

This occurs on any VM - even a simple c1.m1 with no active workload, backed by Ceph storage.
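For reference, the GARPs mentioned above can be confirmed on the wire with a capture along these lines (the interface name is a placeholder for your environment; note that, depending on QEMU version, the host-side self-announce may appear as RARP frames rather than gratuitous ARP, so both are captured here):

    # On a hypervisor or another host in the same VLAN, watch for the
    # announcements emitted during the live migration and check the
    # sender MAC/IP against the migrating VM (bond0.123 is a placeholder):
    tcpdump -e -n -i bond0.123 'arp or rarp'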
Steps to Reproduce
===========
To try to prevent this from happening, I have tested the "libvirt: Add announce-self post live-migration workaround" patch [0] - despite this, the issue was still observed.

Create VM: c1.m1 or similar, CentOS 7 or CentOS 8 - Ceph storage, no active or significant load on the VM

Run: `ping VM_IP | while read pong; do echo "$(date): $pong"; done`

Then: `openstack server migrate --live TARGET_HOST VM_INSTANCE`

Expected result
===============
VM live migrates and is accessible within a reasonable (<10 second) timeframe.

Actual result
=============
VM live migrates successfully; ping fails until the switch ARP timer resets (in our environment, 60-180 seconds).

Despite efforts from us and our network team, we have been unable to determine why the VM is inaccessible. What we have noticed is that sending a further number of announce_self commands to the QEMU monitor, triggering more GARPs, gets the VM into an accessible state in an acceptable time of <5 seconds.
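As a manual illustration of that observation, the extra announcements can be sent from the destination hypervisor like so (the libvirt domain name is hypothetical, and announce_self is the HMP spelling of the QMP announce-self command, available from QEMU 4.0):

    # Send a few extra self-announcements for the migrated instance via
    # the QEMU human monitor (instance-000002ab is a hypothetical
    # libvirt domain name):
    for i in 1 2 3; do
        virsh qemu-monitor-command --hmp instance-000002ab announce_self
        sleep 1
    done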
Environment
=============
Arista EOS 4.26M VXLAN fabric
OpenStack Nova Train, Ussuri, Victoria (with and without patch [0])
Ceph Nautilus
OpenStack provider networking, using VLANs

Patch/Workaround
=============
I have a follow-up workaround patch, building on the announce-self patch [0], which we have been running in our production deployment. This patch adds two configurable options and the associated code:

`enable_qemu_monitor_announce_max_retries` - this will call announce_self a further n times, triggering more GARP packets to be sent.

`enable_qemu_monitor_announce_retry_interval` - this is the delay, in seconds, between the additional announce_self calls configured in the option above.
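For illustration only, with the settings that worked best for us (see below), the options might look like this in nova.conf - assuming they sit in the [workarounds] group alongside enable_qemu_monitor_announce_self from [0]; the final names may change during review:

    [workarounds]
    # Existing option added by the announce-self patch [0]:
    enable_qemu_monitor_announce_self = True
    # Proposed follow-up options (names as described above):
    enable_qemu_monitor_announce_max_retries = 3
    enable_qemu_monitor_announce_retry_interval = 1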
My tests of nearly 5000 live migrations show that the optimal settings in our environment are 3 additional calls to qemu_announce_self with a 1 second delay - this gets our VMs accessible in 2 or 3 seconds in the vast majority of cases, and 99% within 5 seconds after they stop responding to ping (the point at which we determine they are inaccessible).

I shall be submitting this patch for review by the Nova community.

0: https://opendev.org/openstack/nova/commit/9609ae0bab30675e184d1fc63aec849c1de020d0

** Affects: nova
   Importance: Undecided
   Status: New

** Tags: live-migration