Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU
On 04/30/2010 10:16 AM, Aaron Crow wrote:
> Hi Patrick, thanks for your time and detailed questions.

No worries. When we hear about an issue we're very interested to follow up and resolve it, regardless of the source. We take the project goals of high reliability/availability _very_ seriously; we know that our users are deploying to mission-critical environments.

> We're running on Java build 1.6.0_14-b08, on Ubuntu 4.2.4-1ubuntu3. Below is output from a recent stat, and a question about node count. For your other questions, I should save your time with a batch reply: I wasn't tracking nearly enough things (like logs), so it might not be fruitful to try to investigate this failure in detail.

NP. If you had those available I would be interested to take a look, but I think this node count size (below) is probably the issue.

> I've since started rolling out better settings, including the memory and GC settings recommended in the wiki, and logging to ROLLINGFILE. Also logging and archiving verbose gc. We've been fine-tuning our client app with similar settings, but my bad that we didn't roll out ZK itself with these settings as well. (Oh and I also set my browser's homepage to http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing ;-)

Great. That page is really good for challenging assumptions.

> So if we have any more serious issues, I think I should be able to provide better details at that time.

Great!

> I do have a question about node count, though. Below is output from stat on one of our servers. Node count says *1182653*. Really!? Well, from our application standpoint, we only create and use hundreds of nodes at the most. Certainly not any amount in the neighborhood of 1 million. Does this demonstrate a serious problem?

It certainly looks like this might be the issue. One million znodes is a lot, but I've tested with more (upwards of 5 million znodes with 25 million concurrent watches is the most I remember testing) successfully.
It's really just a matter of tuning the GC and starting the JVM with sufficient heap. It could be that the counter is wrong, but I doubt it (never seen that before). Is the count approximately the same on all servers in the ensemble? Given that you are only expecting hundreds of znodes, I suspect that you have a leak of znodes somewhere.

Suggestion - fire up the Java client shell (bin/zkCli.sh, or just run ZooKeeperMain directly), connect to one of your servers, and use the ls command to look around. You might be surprised at what you find. (Or you might find that the count is wrong, etc.) It's a good sanity check.

Patrick

> Thanks,
> Aaron
>
> rails_dep...@task01:~/zookeeper-3.3.0/bin$ echo stat | nc 127.0.0.1 2181
> Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT
> Clients:
>  /10.0.10.116:34648[1](queued=0,recved=86200,sent=86200)
>  /10.0.10.116:45609[1](queued=0,recved=27695,sent=27695)
>  /10.0.10.120:48432[1](queued=0,recved=26030,sent=26030)
>  /10.0.10.117:53336[1](queued=0,recved=43100,sent=43100)
>  /10.0.10.120:35087[1](queued=0,recved=4268,sent=4268)
>  /10.0.10.116:49526[1](queued=0,recved=4273,sent=4273)
>  /10.0.10.118:45614[1](queued=0,recved=43624,sent=43624)
>  /10.0.10.116:45600[1](queued=0,recved=27704,sent=27704)
>  /10.0.10.120:48440[1](queued=0,recved=7161,sent=7161)
>  /10.0.10.120:48437[1](queued=0,recved=7180,sent=7180)
>  /10.0.10.118:59310[1](queued=0,recved=63205,sent=63205)
>  /10.0.10.116:51072[1](queued=0,recved=14516,sent=14516)
>  /10.0.10.116:51071[1](queued=0,recved=14516,sent=14516)
>  /10.0.10.119:34097[1](queued=0,recved=42984,sent=42984)
>  /10.0.10.119:41868[1](queued=0,recved=18558,sent=18558)
>  /10.0.10.120:48441[1](queued=0,recved=21712,sent=21712)
>  /10.0.10.116:49528[1](queued=0,recved=12967,sent=12967)
>  /127.0.0.1:37234[1](queued=0,recved=1,sent=0)
> Latency min/avg/max: 0/0/105
> Received: 497790
> Sent: 497788
> Outstanding: 0
> Zxid: 0xa0003aa4b
> Mode: follower
> *Node count: 1182653*
>
> On Wed, Apr 28, 2010 at 10:56 PM, Patrick Hunt ph...@apache.org wrote:
>> Btw, are you monitoring the ZK server jvms? Please take a look at
>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
>> It would be interesting if you could run commands such as stat against your currently running cluster. In particular I'd be interested to know what you see for latency and node count in the stat response. Run this command against each of your servers in the ensemble. For example, high max latency might indicate that something (usually swap/gc) is causing the server to respond slowly in some cases.
>>
>> Patrick
>>
>> On 04/28/2010 10:47 PM, Patrick Hunt wrote:
>>> Hi Aaron, some questions/comments below:
>>>
>>> On 04/28/2010 06:29 PM, Aaron Crow wrote:
>>>> We were running version 3.2.2 for about a month and it was working well for us. Then late this past Saturday night, our cluster went pathological. One of the 3 ZK servers spewed many WARNs (see below), and the other 2 servers were
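Patrick's suggestion - compare the stat output across the ensemble - is easy to script. The sketch below parses a captured stat response for the two fields he calls out, max latency and node count; it assumes the 3.3.x stat format shown above, and fetching the raw text (e.g. via `echo stat | nc host 2181`) is left to the shell.

```python
import re

def parse_stat(text):
    """Extract node count and max latency from a ZooKeeper 'stat' response."""
    node_count = int(re.search(r"Node count:\s*\*?(\d+)", text).group(1))
    lat = re.search(r"Latency min/avg/max:\s*(\d+)/(\d+)/(\d+)", text)
    return {"node_count": node_count, "max_latency": int(lat.group(3))}

# Tail of the stat response quoted in the thread above.
sample = """Latency min/avg/max: 0/0/105
Received: 497790
Sent: 497788
Outstanding: 0
Zxid: 0xa0003aa4b
Mode: follower
Node count: 1182653"""

print(parse_stat(sample))  # -> {'node_count': 1182653, 'max_latency': 105}
```

Running this against the response from each server makes it easy to spot a diverging node count or an outlier max latency (the swap/gc symptom Patrick mentions).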
Re: Question on maintaining leader/membership status in zookeeper
Hi Lei,

In this case, the Leader will be disconnected from the ZK cluster and will give up its leadership. Since it is disconnected, the ZK cluster will realize that the Leader is dead! When the ZK cluster realizes that the Leader is dead (because it hasn't heard from the Leader for a certain time, configurable via the session timeout parameter), the slaves will be notified via watchers in the ZooKeeper cluster. The slaves will realize that the Leader is gone, will re-elect a new Leader, and will start working with the new Leader. Does that answer your question? You might want to look through the ZK documentation to understand its use cases and how it solves these kinds of issues.

Thanks
mahadev

On 4/30/10 2:08 PM, Lei Gao l...@linkedin.com wrote:
> Thank you all for your answers. It clarifies a lot of my confusion about the service guarantees of ZK. I am still struggling with one failure case. (I am not trying to be a pain in the neck, but I need a full understanding of what ZK can offer before I decide whether to use it in my cluster.) Assume the following topology:
>
>     Leader      ZK cluster
>         \\      //
>          \\    //
>          Slave(s)
>
> If I have an asymmetric network failure such that the connections between the Leader and Slave(s) are broken while all other connections are still alive, would my system hang after some point? Because no new leader election will be initiated by the slaves, and the leader can't get the work to the slave(s).
>
> Thanks,
> Lei
>
> On 4/30/10 1:54 PM, Ted Dunning ted.dunn...@gmail.com wrote:
>> If one of your user clients can no longer reach one member of the ZK cluster, then it will try to reach another. If it succeeds, then it will continue without any problems as long as the ZK cluster itself is OK. This applies for all the ZK recipes. You will have to be a little bit careful to handle connection loss, but that should get easier soon (and isn't all that difficult anyway).
>>
>> On Fri, Apr 30, 2010 at 1:26 PM, Lei Gao l...@linkedin.com wrote:
>>> I am not talking about the leader election within the zookeeper cluster. I guess I didn't make the discussion context clear. In my case, I run a cluster that uses zookeeper for doing the leader election. Yes, nodes in my cluster are the clients of zookeeper. Those nodes depend on zookeeper to elect a new leader and figure out what the current leader is. So if the zookeeper (think of it as a stand-alone entity) becomes unavailable in the way I've described earlier, how can I handle such a situation so my cluster can still function while a majority of nodes still connect to each other (but not to the zookeeper)?
Re: Question on maintaining leader/membership status in zookeeper
Hi Mahadev,

Why would the leader be disconnected from ZK? ZK is fine communicating with the leader in this case; we are talking about an asymmetric network failure. Yes, the Leader could consider all the slaves to be down if it tracks the status of all slaves itself. But I guess if ZK is used for membership management, neither the leader nor the slaves will be considered disconnected, because they can all connect to ZK.

Thanks,
Lei

On 4/30/10 3:47 PM, Mahadev Konar maha...@yahoo-inc.com wrote:
> Hi Lei,
>
> In this case, the Leader will be disconnected from the ZK cluster and will give up its leadership. [...]
Re: Question on maintaining leader/membership status in zookeeper
Hi Lei,

Sorry, I misinterpreted your question! The scenario you describe could be handled in this way: you could have a status node in ZooKeeper which every slave subscribes to and updates. If one of the slave nodes sees that there have been too many connections refused to the Leader by the slaves, the slave could go ahead and delete the Leader znode, forcing the Leader to give up its leadership. I am not describing a detailed way to do it, but it's not very hard to come up with a design for this.

Do you intend to have the Leader and Slaves in different network-protected zones (different ACLs, I mean)? In that case it is a legitimate concern; otherwise I do think an asymmetric network partition would be very unlikely to happen. Do you usually see network partitions in such scenarios?

Thanks
mahadev

On 4/30/10 4:05 PM, Lei Gao l...@linkedin.com wrote:
> Hi Mahadev,
>
> Why would the leader be disconnected from ZK? ZK is fine communicating with the leader in this case. We are talking about asymmetric network failure. [...]
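Mahadev's status-node idea can be sketched in a few lines. This is a toy model only: the MockZK class stands in for a real ZooKeeper session, and the paths and complaint threshold are invented for illustration, not part of any ZooKeeper API.

```python
class MockZK:
    """Minimal stand-in for a ZK namespace: just a path -> data map."""
    def __init__(self):
        self.nodes = {}
    def create(self, path, data=b""):
        self.nodes[path] = data
    def exists(self, path):
        return path in self.nodes
    def delete(self, path):
        self.nodes.pop(path, None)

COMPLAINT_QUORUM = 2  # hypothetical threshold for this sketch

def report_leader_unreachable(zk, slave_id):
    """A slave that cannot reach the leader records a complaint under a
    status node; once enough slaves have complained, the leader znode is
    deleted, forcing the leader to give up its leadership."""
    zk.create("/status/complaints/%s" % slave_id)
    complaints = [p for p in zk.nodes if p.startswith("/status/complaints/")]
    if len(complaints) >= COMPLAINT_QUORUM and zk.exists("/leader"):
        zk.delete("/leader")  # leader loses its znode -> must re-elect

zk = MockZK()
zk.create("/leader", b"node-1")
report_leader_unreachable(zk, "slave-2")
print(zk.exists("/leader"))  # True: only one complaint so far
report_leader_unreachable(zk, "slave-3")
print(zk.exists("/leader"))  # False: quorum of complaints, znode deleted
```

In a real deployment the complaints would be ephemeral znodes (so a crashed slave's complaint disappears with its session) and the leader would learn of the deletion via a watch on its own znode.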
Re: Question on maintaining leader/membership status in zookeeper
Lei,

I think that Mahadev was correct that there is some confusion here. Leader election is normally a term used for an operation that is entirely internal to ZK. It is very robust and you probably don't need to worry about it.

You can then use ZK in your application to pick a lead machine for other operations. In that case, essentially every failure scenario is handled by the standard recipe. In your example where the master and slave are cut off but both still have access to ZK, all that will happen is that the master cannot communicate with the slave. Both will still be clear about who is in which role. The case where the master is cut off from both ZK and the slave is also handled well, as is the case where the master is cut off from ZK but not from the slave. In both cases, the master will get a connection loss event and stop trying to act like a master, and the slave will be notified that the master has dropped out of its role.

On Fri, Apr 30, 2010 at 4:05 PM, Lei Gao l...@linkedin.com wrote:
> Hi Mahadev,
>
> Why would the leader be disconnected from ZK? ZK is fine communicating with the leader in this case. We are talking about asymmetric network failure. [...]
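The "standard recipe" Ted refers to is leader election via ephemeral sequential znodes: each candidate creates one under an election node, and the holder of the lowest sequence number is the leader; when its session expires, its znode vanishes and the next candidate takes over. The sketch below models just that ordering logic against a toy in-memory namespace - real code would use the ZooKeeper client, watch the next-lowest znode, and handle connection-loss events.

```python
class ElectionMock:
    """Toy model of /election with ephemeral sequential znodes."""
    def __init__(self):
        self.seq = 0
        self.znodes = {}  # path -> owning session

    def create_sequential(self, prefix, owner):
        # ZK appends a monotonically increasing, zero-padded counter.
        path = "%s%010d" % (prefix, self.seq)
        self.seq += 1
        self.znodes[path] = owner
        return path

    def leader(self):
        # The candidate holding the lowest sequence number is the leader.
        return self.znodes[min(self.znodes)]

    def session_expired(self, owner):
        # ZK deletes all ephemeral znodes owned by an expired session.
        self.znodes = {p: o for p, o in self.znodes.items() if o != owner}

zk = ElectionMock()
zk.create_sequential("/election/n-", "node-A")
zk.create_sequential("/election/n-", "node-B")
zk.create_sequential("/election/n-", "node-C")
print(zk.leader())             # node-A holds the lowest sequence number
zk.session_expired("node-A")   # node-A cut off from ZK: session times out
print(zk.leader())             # node-B takes over automatically
```

This is why the master cut off from ZK stops acting (it gets connection loss and must assume its znode is gone) while the slaves, still connected, see the next candidate become leader.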
Re: Question on maintaining leader/membership status in zookeeper
Hi Mahadev,

First of all, I'd like to thank you for being patient with me - my questions seem unclear to many of you who try to help me. I guess clients have to be smart enough to trigger a new leader election by trying to delete the znode. But in this case, ZK should not allow any single client, or multiple clients (as long as they are fewer than a quorum), to delete the znode corresponding to the master, right? A new consensus among clients (NOT among the nodes in the zk cluster) has to be reached for the znode to be deleted, right? Does zk have this capability, or do the clients have to come to this consensus outside of zk before trying to delete the znode in zk?

Thanks,
Lei

On 4/30/10 4:14 PM, Mahadev Konar maha...@yahoo-inc.com wrote:
> Hi Lei,
>
> Sorry, I misinterpreted your question! The scenario you describe could be handled in this way: you could have a status node in ZooKeeper which every slave subscribes to and updates. If one of the slave nodes sees that there have been too many connections refused to the Leader by the slaves, the slave could go ahead and delete the Leader znode, forcing the Leader to give up its leadership. [...]
Re: Question on maintaining leader/membership status in zookeeper
I believe Lei's concern is that the leader and all slaves can talk to ZK, but the slaves cannot talk to the leader. As a result no work can be done, yet nothing will happen on the ZK side since everyone is heartbeating properly.

Mahadev, I think you came up with a pretty good solution. However, since the leader can see the votes from all the slaves, it might just want to give up the lead itself and pause for a while (to give someone else the chance to be the leader). This would allow the leader to better handle the case where a single slave cannot talk to the leader but the rest of the slaves can communicate fine.

Patrick

On 04/30/2010 04:31 PM, Mahadev Konar wrote:
> Maybe I jumped the gun here, but Ted's response to your query is more appropriate -
>
>> You can then use ZK in your application to pick a lead machine for other operations. In that case, essentially every failure scenario is handled by the standard recipe. [...]
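Patrick's refinement boils down to a threshold check the leader can run itself, since it can observe the complaint votes too: step down only when a majority of slaves report it unreachable, so one slave with a bad link can't depose a healthy leader. A minimal sketch of that decision (the function name and majority rule are illustrative choices, not a ZooKeeper API):

```python
def leader_should_step_down(total_slaves, complaints):
    """The leader watches the slaves' complaint votes and voluntarily gives
    up leadership (then pauses before re-contending) only once a majority
    of slaves report it unreachable."""
    return complaints > total_slaves // 2

print(leader_should_step_down(4, 1))  # False: one slave's link is bad
print(leader_should_step_down(4, 3))  # True: majority cannot reach leader
```

Requiring a majority here mirrors ZK's own quorum reasoning: a single disconnected slave is more likely to be the partitioned party than the leader is.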
Re: Question on maintaining leader/membership status in zookeeper
Hi Lei,

ZooKeeper provides a set of primitives which allows you to do all kinds of things! You might want to take a look at the API and some examples of ZooKeeper recipes to see how it works; that will probably clear things up for you. Here is the link: http://hadoop.apache.org/zookeeper/docs/r3.3.0/recipes.html

Thanks
mahadev

On 4/30/10 4:46 PM, Lei Gao l...@linkedin.com wrote:
> Hi Mahadev,
>
> First of all, I'd like to thank you for being patient with me - my questions seem unclear to many of you who try to help me. I guess clients have to be smart enough to trigger a new leader election by trying to delete the znode. But in this case, ZK should not allow any single client, or multiple clients (as long as they are fewer than a quorum), to delete the znode corresponding to the master, right? [...]