Re: What happens when a server loses all its state?

Thomas Vinod Johnson Wed, 17 Dec 2008 08:36:26 -0800

Mahadev Konar wrote:

Hi Thomas,
 Here is what would happen in the scenario you mentioned.

Great - thanks Mahadev.

Not to drag this on more than necessary, please bear with me for one
more example of 'amnesia' that comes to mind. I have a set of ZooKeeper
servers A, B, C.
- C is currently not running, A is the leader, B is the follower.
- A proposes zxid1 to A and B, both acknowledge.
- A asks A to commit (which it persists), but before the same commit
request reaches B, all servers go down (say a power failure).

In this case, the zookeeper protocol says that zxid1 would be available only
if the client gets a success. So zxid1 may or may not get committed if A and
B come up later. ( this is a different scenario then what you mention
later).

The general scenario I was interested in was a minority of serverslosing state, and trying to understand what other correlated eventscould cause issues. Just to be clear, since A has sent the commit to B(or is it when A has got its own commit), it *could have* sent a successback to the client before everything went down, correct?

- Later, B and C come up (A is slow to reboot), but B has lost all state
due to disk failure.

This is how zookeeper would work in this scenario ---

Now since we have B and C come up and B has the most recent state but loses
it, then zookeeper is clueless about this. So C would say I have the some
zxid say zxid-n and B would say that I have zxid = 0 (since its stateless)
and C would become a leader (since it has the highest zxid).

This would lead to loss of data and loss of state in zookeeper. That's what
I meant when I mentioned that zookeeper relies heavily on the state being
persisted on disk.

OK good, my understanding was correct then.

- C becomes the new leader and perhaps continues with some more new
transactions.

Now if A comes back again, C would say that its the leader and ask A to
truncate all the transactions that A had to come to sync with C.

I wasn't aware that C would ask A to truncate even committedtransactions (the zookeeper internals doc/slides talks about proposals -I suspect I may have some terminology confusion here). Anotherpossibility is C is now at zxid2 >= zxid1, in which case A couldpossibly *not* get rid of the committed transaction?

Again, you can see that how persistence loss can trigger state loss in
zookeeper. If its just minority of servers failing then this can be taken
care of by zookeeper but in this scenario is C failing and then being
brought up with an inconsisten state with another failure of A and data loss
of B -- which zookeeper cannot handle.

I hope this helps.

Yes thanks. Not sure if this makes sense, but is it worthwhile to have a'safe' mode when a server comes up with no state (I think it should besimple to distinguish between having a clean disk 'no state'/corruptstate and 'empty state')? In this case, I think it could simply waittill it sees a successful propose/commit cycle to know that it is safefor it to take a snapshot and start participating in the ensemble.

In the scenario I previously described, when B and C comes up, B wouldnot respond to C, but just watch - C would not be able to establishquorum until A came up; at which point B has witnessed a successfulleader activation, and can join. If one is willing to sacrifice livenessfor safety in situations where 1 or more nodes have amnesia, would thisbe a viable option?

Re: What happens when a server loses all its state?

Reply via email to