I would like to share some findings from failover tests in our scale test
environment, which uses a pacemaker active-standby setup for OVN.

1. With an IPaddr2 resource as the VIP, the IP moves before ovsdb-server
stops, so the ovn-controller on the HVs does not get a TCP connection reset
during failover. Because of this, the HVs won't notice the failover until
the next probe. In large-scale environments with thousands of HVs, we
usually set the probe interval to a very long time, e.g. 3 min, so that the
central SB DB is not overloaded by probe handling. (In our scale test env
the probe was disabled completely, so the HVs never noticed the change.)
This can happen in an LB setup, too, if the old active node goes down
quietly without sending a TCP reset.
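
To make the trade-off concrete, here is a rough back-of-the-envelope sketch
in Python. It assumes each HV sends one probe per interval and that a silent
failover is noticed at the next probe at the latest. The function name and
the 3000-HV figure are made up for illustration, not taken from our
environment.

```python
import math


def probe_tradeoff(num_hvs: int, probe_interval_s: float) -> dict:
    """Rough model: probe load on the SB DB vs. worst-case detection delay.

    Assumption (not measured): one probe per HV per interval, and a quiet
    failover is detected no later than the next probe. An interval of 0
    means probing is disabled, so the failover is never detected.
    """
    if probe_interval_s <= 0:
        return {"probes_per_s": 0.0, "worst_case_detect_s": math.inf}
    return {
        "probes_per_s": num_hvs / probe_interval_s,
        "worst_case_detect_s": probe_interval_s,
    }


if __name__ == "__main__":
    # Hypothetical fleet of 3000 HVs with a few candidate intervals (seconds).
    for interval in (5, 180, 0):
        print(interval, probe_tradeoff(3000, interval))
```

A longer interval cuts the probe load proportionally, but the window during
which the HVs keep a dead connection grows by the same factor.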

2. After failover, when the SB DB starts on the new active node, it is
very busy syncing data to all HVs, as discussed in this thread:
https://mail.openvswitch.org/pipermail/ovs-discuss/2018-September/047405.html
During this time, pacemaker monitoring can time out. The timeout value for
"op monitor" therefore needs to be large enough to cover this busy period.
Otherwise, the timeout triggers restart/failover over and over.
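
The restart loop can be sketched with a toy model in Python. This is only an
illustration under simplifying assumptions: the SB DB is unresponsive for a
fixed busy period after every (re)start, a monitor call fails whenever its
timeout is shorter than that period, and every failure restarts the DB and
therefore the sync. The names and numbers are hypothetical.

```python
def failover_settles(sync_busy_s: float, monitor_timeout_s: float,
                     max_restarts: int = 5) -> tuple:
    """Return (settled, restarts) for a toy monitor/restart loop.

    Assumptions (illustrative only): the DB cannot answer the monitor for
    sync_busy_s seconds after each start, and each monitor failure causes
    a restart that begins the sync from scratch.
    """
    restarts = 0
    while restarts < max_restarts:
        if monitor_timeout_s >= sync_busy_s:
            # The monitor waits long enough for the sync to finish.
            return True, restarts
        # Monitor timed out while the DB was still syncing: restart it.
        restarts += 1
    # Gave up counting; in this model the loop would never end.
    return False, restarts


if __name__ == "__main__":
    print(failover_settles(sync_busy_s=120, monitor_timeout_s=20))
    print(failover_settles(sync_busy_s=120, monitor_timeout_s=180))
```

In this model a timeout shorter than the busy period never converges, no
matter how many restarts are allowed, which matches what we saw.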

Thanks,
Han