Hi Mark,

Thanks. All that is clear (I think Voldemort does a good job with hinted handoff, which I think you are referring to). The part I'm not clear about is maybe not SolrCloud-specific, and it is this: what exactly prevents the two halves of a cluster that's been split from each thinking they are *the* cluster?

Let's say you have a 10-node cluster with 10 ZK instances, one instance on each Solr node. And say 5 of these 10 servers are on switch A and the other 5 are on switch B. Something happens, and switch A and the 5 nodes on it get separated from the 5 nodes on switch B. Say that both the A and B halves happen to have complete copies of the index.
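To put numbers on that scenario - and assuming I have ZooKeeper's rule right, that a quorum is a strict majority, i.e. floor(n/2) + 1 voters - the arithmetic would be roughly:

  // Back-of-the-envelope check, assuming ZooKeeper's majority-quorum rule
  // (quorum = floor(n/2) + 1); numbers match the 10-node, 5-and-5 split above.
  public class QuorumCheck {
      public static void main(String[] args) {
          int ensembleSize = 10;               // one ZK instance per Solr node
          int quorum = ensembleSize / 2 + 1;   // 6 votes needed for a majority
          int sideA = 5, sideB = 5;            // ZK instances reachable on each switch
          System.out.println("quorum needed: " + quorum);                 // 6
          System.out.println("side A has quorum: " + (sideA >= quorum));  // false
          System.out.println("side B has quorum: " + (sideB >= quorum));  // false
      }
  }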
What in Solr (or ZK) tells either the A or the B half "no, you are not *the* cluster and thou shalt not accept updates"? I'm guessing this: https://cwiki.apache.org/confluence/display/ZOOKEEPER/FailureScenarios ?

So then the question becomes: if we have 10 ZK nodes and they split 5 & 5, does that mean neither side will have quorum, because 10 was a bad number of ZK instances to run in the first place?

Thanks,
Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

----- Original Message -----
> From: Mark Miller <markrmil...@gmail.com>
> To: solr-user <solr-user@lucene.apache.org>
> Cc:
> Sent: Monday, June 18, 2012 11:05 AM
> Subject: Re: SolrCloud and split-brain
>
> On Jun 15, 2012, at 10:33 PM, Otis Gospodnetic wrote:
>
>> However, if my half brain understands what split brain is, then I think
>> that's not a completely true claim, because one can get unlucky and get a
>> SolrCloud cluster partitioned in a way that one or even all partitions
>> reject indexing (and update and deletion) requests if they do not have a
>> complete index.
>
> That's not split brain. Split brain means that multiple partitioned clusters
> think they are *the* cluster and keep accepting updates. This is a real
> problem, because when you unsplit the cluster you cannot easily reconcile
> conflicting updates. In many cases you have to ask the user to resolve the
> conflict.
>
> Yes, you must have a node serving a shard in order to index to that shard.
> You do not need the whole index - but if an update hashes to a shard that
> has no nodes hosting it, it will fail. If there is no node, the document has
> nowhere to live. Some systems do interesting things like buffering those
> updates on other nodes for a while - we don't plan on anything like that
> soon. At some point you can only survive the loss of so many nodes before
> it's time to give up accepting updates, in any system. If you need to
> survive a catastrophic loss of nodes, you have to have enough replicas to
> handle it. Whether those nodes are partitioned off from the cluster or
> simply die, it's all the same: you can only survive so many node losses, and
> replicas are your defense.
>
> The lack of split brain allows your cluster to remain consistent. If you
> allow split brain, you have to use something like vector clocks and handle
> conflict resolution when the splits rejoin, or you will just end up with a
> lot of messed-up data. You generally allow split brain when you want to
> favor write availability in the face of partitions, like Dynamo. But you
> must have a strategy for rejoining splits (like vector clocks) or you can
> never properly go back to a single, consistent cluster. We favor consistency
> in the face of partitions rather than write availability. It seemed like the
> right choice for Solr.
>
> - Mark Miller
> lucidimagination.com
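A minimal sketch of the shard-routing behavior Mark describes above - hash the doc id to a shard and fail the update if no live node hosts that shard. The class and method names here are made up for illustration; this is not SolrCloud's actual API.

  // Illustration only: route an update by hashing the doc id to a shard and
  // reject it when no live node currently hosts that shard.
  import java.util.List;
  import java.util.Map;

  class IllustrativeShardRouter {
      private final int numShards;
      private final Map<Integer, List<String>> liveNodesByShard; // shard id -> live node URLs

      IllustrativeShardRouter(int numShards, Map<Integer, List<String>> liveNodesByShard) {
          this.numShards = numShards;
          this.liveNodesByShard = liveNodesByShard;
      }

      String routeUpdate(String docId) {
          int shard = Math.floorMod(docId.hashCode(), numShards);
          List<String> nodes = liveNodesByShard.get(shard);
          if (nodes == null || nodes.isEmpty()) {
              // No node hosts this shard: the document has nowhere to live, so the update fails.
              throw new IllegalStateException("No live node for shard " + shard + "; rejecting update");
          }
          return nodes.get(0); // simplified: send the update to the first live node for the shard
      }
  }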