Hi Mark,

Thanks for the review.

On 01/06/23 12:29 am, Mark Michelson wrote:
Hi Priyankar,

The description makes the issue crystal clear, and you appear to be solving the race condition that can happen between the OVS interface table and the southbound port_binding table.

Acked-by: Mark Michelson <mmich...@redhat.com>

Just to let you know, the flapping problem you mention can be avoided altogether by using options:requested-chassis on the northbound logical switch port. When you migrate the port to a new chassis, set this option to the new chassis's name or hostname, and ovn-controller will only claim the logical switch port on that chassis. The old chassis will not try to claim the port even if the tap is still present.
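
For illustration only, the claiming rule described above can be written as a small predicate in C. This is a toy sketch, not ovn-controller code: struct toy_chassis, may_claim() and the chassis names in the usage note are hypothetical, and the only behaviour modelled is the one stated in the paragraph above (a chassis with a tap claims the port unless requested-chassis names a different chassis).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one hypervisor. */
struct toy_chassis {
    const char *name;       /* Chassis name. */
    const char *hostname;   /* Chassis hostname. */
    bool has_tap;           /* A local interface references the port. */
};

/* Returns true if 'ch' would try to claim a port whose
 * options:requested-chassis value is 'requested' (NULL if unset).
 * Assumption: the option matches on either chassis name or hostname. */
bool
may_claim(const struct toy_chassis *ch, const char *requested)
{
    assert(ch != NULL);

    if (!ch->has_tap) {
        return false;       /* Nothing local to bind. */
    }
    if (!requested) {
        return true;        /* Option unset: every chassis with a tap claims. */
    }
    return !strcmp(requested, ch->name)
           || !strcmp(requested, ch->hostname);
}
```

With both a source and a destination chassis holding a tap, may_claim() is true for both when the option is unset (the flapping case), and true only for the destination once the option is set to the destination's name or hostname.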


Thanks for the suggestion. I'll definitely try this out.
Appreciate all the help!

Regards,
Priyankar

I wouldn't be surprised if there were other ways to trigger this race condition as well. I suspect the port-flapping scenario is most likely to trigger it, though.

On 5/31/23 01:35, Priyankar Jain wrote:
Currently, during port migration, two chassis (source and destination)
can try to claim the same logical switch port simultaneously for a
short period of time, until the tap is deleted on the source hypervisor.
The ovn-controllers on these two hosts constantly receive port-binding
updates about the other chassis claiming the port, and as a result each
tries to claim the port again (because its chassis has a tap interface
referencing the LSP). This flapping ends once the CMS cleans up the tap
interface on the source chassis.

The following steps occur during a single iteration of the incremental
processing engine (inc-proc-eng) during flapping:

1. PB update received on OVN controller about other chassis owning the
    port.
2. ovn-controller tries to claim the port.
3. It installs the OVS flows for the port and updates the runtime_data
    to include this port in locally relevant ports.
4. If the runtime data changes as part of step 3, the port-groups
    containing the affected ports are recomputed. The recomputation uses
    the related_lports runtime data.

Finally, ovn-controller sends a port-binding update to SB changing the
chassis to itself.
At a later point in time, SB notifies ovn-controller that this
port-binding update has been completed.

Once CMS deletes the tap interface, ovn-controller receives the
notification and updates the runtime data accordingly.

Issue: OVS flows are (sometimes) not cleaned up upon port migration.

If the notification of the OVS interface deletion is received before SB
acks the PortBinding update, then ovn-controller does not clean up
related_lports, leading to incorrect port-group computation.

i.e., if the order of events is as follows:

1. PB update received on OVN controller about other chassis owning the
    port.
2. ovn-controller claims the port, installs OVS flows and sends the
    PortBinding update to SB.
3. OVS interface deletion notification received by ovn-controller.
4. SB ack received for step-2 PB update.

This commit fixes the issue by removing the logical port from
related_lports even when no binding is available locally.

Signed-off-by: Priyankar Jain <priyankar.j...@nutanix.com>
---
  controller/binding.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/controller/binding.c b/controller/binding.c
index 9b0647b70..9889be5c7 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -1568,6 +1568,7 @@ consider_vif_lport_(const struct sbrec_port_binding *pb,
              || is_additional_chassis(pb, b_ctx_in->chassis_rec)) {
          /* Release the lport if there is no lbinding. */
          if (!lbinding_set || !can_bind) {
+            remove_related_lport(pb, b_ctx_out);
              return release_lport(pb, b_ctx_in->chassis_rec,
                                   !b_ctx_in->ovnsb_idl_txn,
                                   b_ctx_out->tracked_dp_bindings,
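
The event ordering from the commit message can be replayed with a toy model in C. This is a sketch under loud assumptions, not the control flow of controller/binding.c: struct toy_controller, the on_*() handlers, the claim_pending flag and run_scenario() are all hypothetical, and the rule that an interface deletion skips the related_lports cleanup while a claim is unacknowledged is assumed here only to reproduce the symptom described above. The with_fix flag stands in for the added remove_related_lport() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { MAX_LPORTS = 8 };

/* Hypothetical stand-in for the parts of ovn-controller state involved. */
struct toy_controller {
    const char *related[MAX_LPORTS];  /* Toy version of related_lports. */
    int n_related;
    bool has_local_binding;           /* A local OVS interface exists. */
    bool claim_pending;               /* PB update sent, SB ack not seen. */
};

bool
is_related(const struct toy_controller *c, const char *lport)
{
    for (int i = 0; i < c->n_related; i++) {
        if (!strcmp(c->related[i], lport)) {
            return true;
        }
    }
    return false;
}

void
add_related(struct toy_controller *c, const char *lport)
{
    if (!is_related(c, lport)) {
        assert(c->n_related < MAX_LPORTS);
        c->related[c->n_related++] = lport;
    }
}

void
remove_related(struct toy_controller *c, const char *lport)
{
    for (int i = 0; i < c->n_related; i++) {
        if (!strcmp(c->related[i], lport)) {
            c->n_related--;
            c->related[i] = c->related[c->n_related];
            return;
        }
    }
}

/* Step 2: claim the port and send the PB update to SB. */
void
on_claim(struct toy_controller *c, const char *lport)
{
    add_related(c, lport);
    c->claim_pending = true;
}

/* Step 3: OVS interface deleted.  Assumption: the cleanup of related
 * ports is skipped while the claim is still unacknowledged. */
void
on_iface_deleted(struct toy_controller *c, const char *lport)
{
    c->has_local_binding = false;
    if (!c->claim_pending) {
        remove_related(c, lport);
    }
}

/* Step 4: SB ack arrives and the port binding is considered again. */
void
on_sb_ack(struct toy_controller *c, const char *lport, bool with_fix)
{
    c->claim_pending = false;
    if (!c->has_local_binding && with_fix) {
        /* Models the one-line fix: drop the port before releasing it. */
        remove_related(c, lport);
    }
}

/* Replays the events; returns true if a stale related port is left. */
bool
run_scenario(bool with_fix, bool delete_before_ack)
{
    struct toy_controller c = { .has_local_binding = true };

    on_claim(&c, "lsp1");
    if (delete_before_ack) {
        on_iface_deleted(&c, "lsp1");
        on_sb_ack(&c, "lsp1", with_fix);
    } else {
        on_sb_ack(&c, "lsp1", with_fix);
        on_iface_deleted(&c, "lsp1");
    }
    return is_related(&c, "lsp1");
}
```

In this model, run_scenario(false, true) leaves "lsp1" behind as a stale related port, while run_scenario(true, true) does not; when the ack arrives before the deletion, neither variant leaves a stale entry.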

