I am starting work on a project to improve the tolerance of Zookeeper to
network failures and would like feedback on the idea.

The problem is that with environments where link bonding is forbidden (they
exist, trust me), Zookeeper is sensitive to the loss of a single switch or
a few network links. This applies to client and server.

Upon examination of the problem, I think that this could be mitigated by
changing the logic that opens connections between servers to try one of
several options. This should be a small change. I think that dynamic
reconfiguration should be fine with this as well.

On the client side, the situation is simpler, we can simply provide, either
by configuration or from the server cluster, a list of all possible
addresses and the client's current connection logic should work fine.

One worry I have has to do with certificates on secure connection, but it
seems that multiple certs would work the trick.

I have started a collaborative document to work on the design approach.
Once that is judged by the community to be sufficiently mature, I will move
it to a JIRA.

That document is at
https://docs.google.com/document/d/1iGVwxeHp57qogwfdodCh9b32P2_kOQaJZ2GDo7j36fI/edit?usp=sharing

The design document is currently open to the world for commenting so that
anybody can suggest changes or ask questions. I will act as a bit of a
moderator so that the document can remain completely open.

Reply via email to