On Tue, Jun 29, 2021 at 10:29:59AM -0700, Han Zhou wrote:
> On Tue, Jun 29, 2021 at 8:43 AM Ben Pfaff <b...@ovn.org> wrote:
> >
> > On Tue, Jun 29, 2021 at 12:56:18PM +0200, Ilya Maximets wrote:
> > > If a new database server is added to the cluster, or if one of the
> > > database servers changed its IP address or port, then you need to
> > > update the list of remotes for the client. For example, if a new
> > > OVN_Southbound database server is added, you need to update the
> > > ovn-remote for the ovn-controller.
> > >
> > > However, in the current implementation, the ovsdb-cs module always
> > > closes the current connection and creates a new one. This can lead
> > > to a storm of re-connections if all ovn-controllers are updated
> > > simultaneously. They can also start re-downloading the database
> > > content, creating even more load on the database servers.
> > >
> > > Correct this by saving an existing connection if it is still in the
> > > list of remotes after the update.
> > >
> > > 'reconnect' module will report connection state updates, but that
> > > is OK since no real re-connection happened and we only updated the
> > > state of a new 'reconnect' instance.
> > >
> > > If required, re-connection can be forced after the update of remotes
> > > with ovsdb_cs_force_reconnect().
> >
> > I think one of the goals here was to keep the load balanced as servers
> > are added. Maybe that's not a big deal, or maybe it would make sense to
> > flip a coin for each of the new servers and switch over to it with
> > probability 1/n where n is the number of servers.
>
> A similar load-balancing problem exists also when a server is down and then
> recovered. Connections will obviously move away when it is down but they
> won't automatically connect back when it is recovered. Apart from the
> flipping-a-coin approach suggested by Ben, I saw a proposal [0] [1] in the
> past that provides a CLI to reconnect to a specific server which leaves
> this burden to CMS/operators. It is not ideal but still could be an
> alternative to solve the problem.
>
> I think both approaches have their pros and cons. The smart way doesn't
> require human intervention in theory, but when operating at scale people
> usually want to be cautious and have more control over the changes. For
> example, they may want to add the server to the cluster first, and then
> gradually move 1/n connections to the new server after a grace period,
> or they could be more conservative and only let the new server take new
> connections without moving any existing connections. I'd support both
> options and let the operators decide according to their requirements.
>
> Regarding the current patch, I think it's better to add a test case to
> cover the scenario and confirm that existing connections didn't reset. With
> that:
> Acked-by: Han Zhou <hz...@ovn.org>
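
As a rough illustration of the behavior the patch description asks for (keep the existing connection when its remote is still listed after an update), here is a minimal Python sketch. It is not the ovsdb-cs code: update_remotes(), its return convention, and the addresses are hypothetical, and falling back to the first listed remote is an assumption made only to keep the example short.

```python
def update_remotes(current, new_remotes):
    """Return (remote_to_use, must_reconnect) after the remote list changes.

    `current` is the remote the client is connected to now, or None.
    Hypothetical helper for illustration; not an OVS API.
    """
    if current is not None and current in new_remotes:
        # The current remote is still listed: keep the connection as-is.
        return current, False
    if not new_remotes:
        # Nothing left to connect to.
        return None, True
    # The current remote was removed: a real re-connection is needed.
    # Picking the first entry is an assumption of this sketch.
    return new_remotes[0], True


old = "tcp:10.0.0.1:6642"
remotes = ["tcp:10.0.0.1:6642", "tcp:10.0.0.2:6642", "tcp:10.0.0.3:6642"]

# A server was added; the client's own remote is still in the list.
print(update_remotes(old, remotes))
# The client's own remote was dropped from the list.
print(update_remotes(old, remotes[1:]))
```

In this sketch only the second call reports that a re-connection is needed, which is the difference from closing the connection on every update of the list.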
This seems reasonable; to be sure, I'm not arguing against Ilya's approach,
just trying to explain my recollection of why it was done this way.
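
To make the coin-flip idea concrete, here is a minimal Python sketch of it. The names should_switch() and rebalance() are hypothetical and nothing here is OVS or OVN code. It assumes that n counts the servers including the newly added one and that every client decides independently.

```python
import random


def should_switch(n_servers, rng=random):
    """Flip a coin that says 'switch' with probability 1/n_servers."""
    if n_servers < 1:
        raise ValueError("need at least one server")
    return rng.random() < 1.0 / n_servers


def rebalance(clients, new_server, n_servers, rng=random):
    """Move each client to new_server with probability 1/n_servers.

    `clients` maps a client name to the server it is connected to.
    Clients that lose the coin flip keep their existing connection.
    """
    return {
        client: new_server if should_switch(n_servers, rng) else server
        for client, server in clients.items()
    }


# 3000 hypothetical clients spread over two servers; a third one is added.
clients = {f"c{i}": f"s{i % 2}" for i in range(3000)}
after = rebalance(clients, "s2", 3)
moved = sum(1 for server in after.values() if server == "s2")
print(f"{moved} of {len(clients)} clients moved to the new server")
```

If the old servers were evenly loaded, each keeps about (n-1)/n of its clients, so all n servers end up with roughly 1/n of the connections while most clients never reconnect.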