On Wed, Aug 5, 2020 at 5:23 PM Han Zhou <zhou...@gmail.com> wrote:

> On Wed, Aug 5, 2020 at 4:35 PM Girish Moodalbail <gmoodalb...@gmail.com>
> wrote:
>
>> On Wed, Aug 5, 2020 at 3:05 PM Han Zhou <hz...@ovn.org> wrote:
>>
>>> On Wed, Aug 5, 2020 at 12:51 PM Winson Wang <windson.w...@gmail.com>
>>> wrote:
>>>
>>>> Hello OVN Experts:
>>>>
>>>> With large scale ovn-k8s cluster, there are several conditions that
>>>> would make ovn-controller clients connect SB central from a balanced state
>>>> to an unbalanced state.
>>>> Is there an ongoing project to address this problem?
>>>> If not, I have one proposal not sure if it is doable.
>>>> Please share your thoughts.
>>>>
>>>> The issue:
>>>>
>>>> OVN SB RAFT 3 node cluster, at first all the ovn-controller clients
>>>> will connect all the 3 nodes in a balanced state.
>>>>
>>>> The following conditions will make the connections become unbalanced.
>>>>
>>>> - One RAFT node restart, all the ovn-controller clients to reconnect
>>>>   to the two remaining cluster nodes.
>>>>
>>>> - Ovn-k8s, after SB raft pods rolling upgrade, the last raft pod has
>>>>   no client connections.
>>>>
>>>> RAFT clients in an unbalanced state would trigger more stress to the
>>>> raft cluster, which makes the raft unstable under stress compared to a
>>>> balanced state.
>>>>
>>>> The proposal solution:
>>>>
>>>> Ovn-controller adds next unix commands “reconnect” with argument of
>>>> preferred SB node IP.
>>>>
>>>> When unbalanced state happens, the UNIX command can trigger
>>>> ovn-controller reconnect
>>>> To new SB raft node with fast sync which doesn’t trigger the whole DB
>>>> downloading process.
>>>>
>>> Thanks Winson. The proposal sounds good to me. Will you implement it?
>>>
>>
>> Han/Winson,
>>
>> The fast re-sync is for ovsdb-server restart and it will not apply for
>> ovn-controller restart, right?
>>
>
> Right, but the proposal is to provide a command just to reconnect, without
> restarting. In that case fast-resync should work.
>
>> If the ovsdb-client (ovn-controller) restarts, then it would have lost
>> all its state and when it starts again it will still need to download
>> logical_flows, port_bindings, and other tables it cares about. So, fast
>> re-sync may not apply to this case.
>>
>> Also, the ovn-controller should stash the IP address of the SB server to
>> which it is connected to in Open_vSwitch table's external_id column. It
>> updates this field whenever it re-connects to a different SB server
>> (because that ovsdb-server instance failed or restarted). When
>> ovn-controller itself restarts it could check for the value in this field
>> and try to connect to it first and on failure fallback to connect to
>> default connection approach.
>>
>
> The imbalance is usually caused by failover on server side. When one
> server is down, all clients are expected to connect to the rest of the
> servers, and when the server is back, there is no motivation for the
> clients to reconnect again (unless you purposely restart the clients, which
> would bring 1/3 of the restarted clients back to the old server). So I
> don't understand how "stash the IP address" would work in this scenario.
>
> The proposal above by Winson is to purposely trigger a reconnection
> towards the desired server without restarting the clients, which I think
> solves this problem directly.
>
Right. This is what we discussed internally. However, when I read this
email on the list I confused it with the other thread (the rolling update
of ovn-controller in a K8s cluster, which involves restarting
ovn-controller). Sorry for the noise.

Regards,
~Girish
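
For readers following the thread, the imbalance Han describes and the effect of a targeted reconnect can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: the server names, the chassis names, and the functions `balanced`, `fail_over`, `load`, and `plan_reconnects` are hypothetical and are not part of OVN. The proposed "reconnect" command is modelled as nothing more than reassigning a client to a chosen server.

```python
# Toy model of ovn-controller connections to a 3-node SB RAFT cluster.
# All names below are hypothetical; nothing here calls into OVN or OVSDB.

SERVERS = ["sb-1", "sb-2", "sb-3"]


def balanced(n_clients, servers):
    """Initial state: clients spread evenly (round-robin) over the servers."""
    return {f"chassis-{i}": servers[i % len(servers)] for i in range(n_clients)}


def load(conns, servers):
    """Number of clients connected to each server, including idle servers."""
    counts = {s: 0 for s in servers}
    for server in conns.values():
        counts[server] += 1
    return counts


def fail_over(conns, failed, servers):
    """Clients of the failed server move to the survivors; others stay put."""
    survivors = [s for s in servers if s != failed]
    out = dict(conns)
    moved = 0
    for client, server in conns.items():
        if server == failed:
            out[client] = survivors[moved % len(survivors)]
            moved += 1
    return out


def plan_reconnects(conns, servers):
    """Return {client: target_server} moves that even out the load.

    Repeatedly moves one client from the busiest server to the least
    loaded one until the two differ by at most one client. Each entry
    stands for one targeted reconnect of a running client.
    """
    by_server = {s: [] for s in servers}
    for client, server in sorted(conns.items()):
        by_server[server].append(client)
    moves = {}
    while True:
        busiest = max(servers, key=lambda s: len(by_server[s]))
        idlest = min(servers, key=lambda s: len(by_server[s]))
        if len(by_server[busiest]) - len(by_server[idlest]) <= 1:
            return moves
        client = by_server[busiest].pop()
        by_server[idlest].append(client)
        moves[client] = idlest


if __name__ == "__main__":
    conns = balanced(9, SERVERS)
    print("initial:        ", load(conns, SERVERS))
    conns = fail_over(conns, "sb-3", SERVERS)
    # sb-3 coming back changes nothing: no client has a reason to move.
    print("after failover: ", load(conns, SERVERS))
    moves = plan_reconnects(conns, SERVERS)
    conns.update(moves)
    print("after reconnect:", load(conns, SERVERS))
```

In this model only the clients named in the plan are touched, which mirrors the point made in the thread: the clients keep running and are merely pointed at a different server, so nothing is restarted.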
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss