On Mon, Jul 6, 2009 at 10:16 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> No. This should not cause data loss.
>
> As soon as ZK cannot replicate changes to a majority of machines, it
> refuses to take any more changes. This is old ground and is required for
> correctness in the face of network partition. It is conceivable (barely)
> that *exactly* the minority that were behind were the survivors, but
> this is almost equivalent to a complete failure of the cluster
> choreographed in such a way that a few nodes come back from the dead
> just afterwards. That could cause some "completed" transactions to
> disappear, but at this level of massive failure, we have the same issues
> with any cluster.

Effectively, EC2 does not introduce any new failure modes, but it
potentially exacerbates some existing ones. If a majority of EC2 nodes
fail (in the sense that their hard drive images cannot be recovered),
there is no way to restart the cluster, and persistence is lost. As you
say, this is highly unlikely. If, for some reason, the quorums are set
such that a single node failure can bring down the quorum (bad design,
but plausible), this failure is more likely (see the first sketch at the
end of this mail). EC2 just ups the stakes - crash failures are now
potentially more dangerous (bugs, packet corruption, rack-local hardware
failures etc. could all cause crash failures).

It is common to assume that, barring a significant physical event that
wipes a number of hard drives, writes that are written stay written. This
assumption is sometimes false given certain choices of filesystem. EC2
just gives us a few more ways for it not to be true.

I think it's more possible than one might expect to have a lagging
minority left behind - say they are partitioned from the majority by a
malfunctioning switch, which might itself be why they are all lagging.
Care must be taken not to bring up another follower on the minority side
to make it a majority, else there are split-brain issues as well as the
possibility of lost transactions (second sketch below). Again, not *too*
likely to happen in the wild, but these permanently running services have
a nasty habit of exploring the edge cases...

> To be explicit, you can cause any ZK cluster to back-track in time by
> doing the following:
> ...
> f) add new members of the cluster

Which is why care needs to be taken that the ensemble can't be expanded
without the agreement of a current quorum. Dynamic membership doesn't
save us when a majority fails - the existence of a quorum is a liveness
condition for ZK. To help with the liveness issue we can sacrifice a
little safety (see, e.g., the vector-clock-ordered timestamps in Dynamo;
third sketch below), but I think that ZK is aimed at safety first,
liveness second. Not that you were advocating changing that; I'm just
articulating why correctness is extremely important from my perspective.

Henry

> At this point, you will have lost the transactions from (b), but I
> really, really am not going to worry about this happening either by plan
> or by accident. Without steps (e) and (f), the cluster will tell you
> that it knows something is wrong and that it cannot elect a leader. If
> you don't have *exact* coincidence of the survivor set and the set of
> laggards, then you won't have any data loss at all.
>
> You have to decide if this is too much risk for you. My feeling is that
> it is an OK level of correctness for conventional weapon fire control,
> but not for nuclear weapons safeguards. Since my apps are considerably
> less sensitive than either of those, I am not much worried.
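P.S. The sketches referred to above. First, the quorum arithmetic, as a
rough Python sketch (illustrative only - this is not ZooKeeper code, just
the majority rule it enforces):

    # Illustrative only: ZK commits a change once a strict majority of
    # the ensemble acknowledges it.

    def quorum(ensemble_size):
        """Smallest number of servers that forms a majority quorum."""
        return ensemble_size // 2 + 1

    def can_accept_writes(reachable, ensemble_size):
        """A partition serves writes only if it contains a quorum."""
        return reachable >= quorum(ensemble_size)

    # Five-node ensemble split 3/2 by a faulty switch:
    assert can_accept_writes(3, 5)        # majority side stays live
    assert not can_accept_writes(2, 5)    # minority side refuses writes

    # Badly sized ensemble: with two servers the quorum is also two,
    # so a single crash failure stops the whole service.
    assert quorum(2) == 2
    assert not can_accept_writes(1, 2)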
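Second, the split-brain scenario, with hypothetical numbers: a five-node
ensemble partitioned 3/2, where an operator brings up a new follower on
the lagging minority side and reconfigures that side as a three-node
ensemble:

    # Hypothetical numbers, not a real reconfiguration procedure.
    old_ensemble = 5
    majority_side = 3        # up to date, still running
    minority_side = 2 + 1    # two laggards plus the new follower
    new_ensemble = 3         # the minority's reconfigured view

    assert majority_side >= old_ensemble // 2 + 1  # quorum of old view
    assert minority_side >= new_ensemble // 2 + 1  # quorum of new view

    # Two partitions each believing they hold a quorum means two
    # leaders accepting writes: divergent histories, and the minority's
    # copy is already missing the transactions it lagged on.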
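Third, the safety-for-liveness trade: a minimal vector-clock comparison
in the style Dynamo uses (my own sketch, not Dynamo's code). Any replica
may accept a write; conflicting histories are detected afterwards rather
than prevented up front:

    def descends(a, b):
        """True if clock a has seen every event that clock b has."""
        return all(a.get(node, 0) >= n for node, n in b.items())

    v1 = {"n1": 2, "n2": 1}
    v2 = {"n1": 2, "n2": 2}   # later write, seen by n2
    v3 = {"n1": 3, "n2": 1}   # concurrent write, seen by n1

    assert descends(v2, v1)   # v2 supersedes v1
    assert not descends(v2, v3) and not descends(v3, v2)
    # v2 and v3 are concurrent: neither write was refused (liveness),
    # but the application must now reconcile two histories - exactly
    # the safety ZK refuses to give up.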
> On Mon, Jul 6, 2009 at 12:40 PM, Henry Robinson <he...@cloudera.com>
> wrote:
>
> > It seems like there is a correctness issue: if a majority of servers
> > fail, with the remaining minority lagging the leader for some reason,
> > won't the ensemble's current state be forever lost?