I am starting work on a project to improve the tolerance of Zookeeper to network failures and would like feedback on the idea.
The problem is that with environments where link bonding is forbidden (they exist, trust me), Zookeeper is sensitive to the loss of a single switch or a few network links. This applies to client and server. Upon examination of the problem, I think that this could be mitigated by changing the logic that opens connections between servers to try one of several options. This should be a small change. I think that dynamic reconfiguration should be fine with this as well. On the client side, the situation is simpler, we can simply provide, either by configuration or from the server cluster, a list of all possible addresses and the client's current connection logic should work fine. One worry I have has to do with certificates on secure connection, but it seems that multiple certs would work the trick. I have started a collaborative document to work on the design approach. Once that is judged by the community to be sufficiently mature, I will move it to a JIRA. That document is at https://docs.google.com/document/d/1iGVwxeHp57qogwfdodCh9b32P2_kOQaJZ2GDo7j36fI/edit?usp=sharing The design document is currently open to the world for commenting so that anybody can suggest changes or ask questions. I will act as a bit of a moderator so that the document can remain completely open.