I'd like to share some findings from failover tests in our scale test environment, which runs OVN in a pacemaker active-standby setup.

1. With an IPaddr2 resource as the VIP, the IP moves before ovsdb-server stops, so the ovn-controller on each HV does not get a TCP connection reset during failover. As a result, HVs do not notice the failover until the next probe. In large-scale environments with thousands of HVs, we usually set the probe interval to a very long time, e.g. 3 min, so that the central SB DB is not overloaded by probe handling. (In our scale test env the probe was disabled completely, so the HVs never noticed the change.) This can also happen in an LB setup if the old active node goes down quietly without sending a TCP reset.

2. After failover, when the SB DB starts on the new active node, it is very busy syncing data to all HVs, as discussed in this thread: https://mail.openvswitch.org/pipermail/ovs-discuss/2018-September/047405.html. During this time, pacemaker monitoring can time out. The timeout value for "op monitor" therefore needs to be set large enough to avoid this; otherwise the timeouts trigger restart/failover forever.

Thanks,
Han
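
P.S. To put rough numbers on the probe trade-off in point 1, here is a minimal back-of-the-envelope sketch. It is not taken from OVN itself: the function name and the model are my own assumptions. It assumes every HV sends one probe per probe interval while idle, and that a silent failover is noticed at the next probe at the latest.

```python
from __future__ import annotations


def probe_tradeoff(num_hvs: int, probe_interval_s: float) -> tuple[float, float]:
    """Return (probes per second at the SB DB, worst-case detection delay in seconds).

    Hypothetical model: one probe per HV per interval, and a silent
    failover is detected no later than the next probe.
    """
    if probe_interval_s <= 0:
        # Probing disabled: no probe load, but a silent failover is never noticed.
        return 0.0, float("inf")
    probes_per_s = num_hvs / probe_interval_s
    return probes_per_s, probe_interval_s


if __name__ == "__main__":
    # Example: 3000 HVs with a 3 min probe interval.
    load, delay = probe_tradeoff(3000, 180)
    print(f"{load:.1f} probes/s, up to {delay:.0f} s before an HV notices")
```

Under these assumptions, a longer interval lowers the probe load on the SB DB but stretches the window in which HVs keep talking to a connection that no longer exists, and disabling the probe makes that window unbounded.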