Thanks for all the answers. 

It appears that we will not have a data-center failure tolerant deployment of 
zookeeper without a 3rd datacenter. The alternative is to give up on running 
zookeeper across datacenters and instead run a live-warm deployment (where 
we would have to manually switch/fail over from primary to backup if we lost 
the primary side or otherwise needed to do maintenance on it).
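
A quick back-of-the-envelope check makes this concrete. This is just a
throwaway Python sketch (the datacenter names and server counts are
illustrative placeholders, not our actual hosts); it asks whether a given
zookeeper layout still has a majority after losing any one datacenter:

    # Can a ZooKeeper ensemble survive the loss of any single data center?
    def survives_any_dc_loss(servers_per_dc):
        total = sum(servers_per_dc.values())
        quorum = total // 2 + 1  # ZooKeeper needs a strict majority
        return all(total - n >= quorum for n in servers_per_dc.values())

    print(survives_any_dc_loss({"dc1": 4, "dc2": 2}))            # False (2-DC layout)
    print(survives_any_dc_loss({"dc1": 2, "dc2": 2, "dc3": 1}))  # True (needs a 3rd DC)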


-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, July 25, 2013 7:21 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.3.0 - SolrCloud lost all documents when leaders got rebuilt

Picking up on what Dominique mentioned: your ZK configuration
isn't doing you much good. Not only do you have an even number
of nodes, 6 (which is actually _less_ robust than having 5), but by
splitting them between two data centers you're effectively requiring
the data center with 4 nodes to always be up. If it goes down (or even
if the link between DCs is broken), DC2 will not be able to index
documents, since the 2 ZK nodes in DC2 can't reach the 4 nodes
needed for a quorum.
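
To make the quorum arithmetic explicit (a quick illustrative snippet,
nothing from your actual setup):

    # Majority quorum math for a ZooKeeper ensemble of n nodes.
    for n in (3, 4, 5, 6, 7):
        quorum = n // 2 + 1     # strict majority required
        tolerated = n - quorum  # node failures the ensemble can absorb
        print(f"{n} nodes -> quorum {quorum}, tolerates {tolerated} failure(s)")

A 6-node ensemble needs 4 nodes up and tolerates only 2 failures, exactly
like a 5-node ensemble, so the 6th node adds a machine that can fail
without adding any tolerance. And with a 4/2 split, losing the 4-node DC
(or the cross-DC link) always drops you below quorum.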

By taking down the ZK quorum, you are effectively "freezing" the Solr
nodes with the snapshot of the system they knew about the last
time there was a quorum. It's a sticky wicket. Let's assume what you're
trying to do was allowed, and that instead of the machines being down
you simply lost connectivity between your DCs so the ZK nodes couldn't
talk to each other. Each side would then elect its own leaders, and any
incoming indexing requests would be serviced.

Now the DCs are re-connected. How could the conflicts be resolved?
This is the "split brain" problem, something ZK is specifically designed
to prevent.

Best
Erick

On Wed, Jul 24, 2013 at 6:50 PM, Dominique Bejean
<dominique.bej...@eolya.fr> wrote:
> With 6 zookeeper instances you need at least 4 instances running at the same 
> time. How can you decide to stop 4 instances and leave only 2 running? 
> Zookeeper cannot work under those conditions.
>
> Dominique
>
> On 25 July 2013 at 00:16, "Joshi, Shital" <shital.jo...@gs.com> wrote:
>
>> We have a SolrCloud cluster (5 shards, 2 replicas each) on 10 dynamic compute 
>> boxes (cloud), where the 5 leaders are in datacenter1 and the replicas are in 
>> datacenter2. We have 6 zookeeper instances - 4 in datacenter1 and 2 in 
>> datacenter2. The zookeeper instances are on the same hosts as the Solr nodes. 
>> We're using local disk (/local/data) to store the solr index files.
>>
>> The infrastructure team wanted to rebuild the dynamic compute boxes in 
>> datacenter1, so we handed all leader hosts over to them. By doing so, we lost 
>> 4 zookeeper instances. We were expecting to see all replicas acting as 
>> leaders. To confirm that, I went to the admin console -> cloud page, but the 
>> page never returned (it kept hanging). I checked the logs and saw constant 
>> zookeeper host connection exceptions (the zkHost system property listed all 6 
>> zookeeper instances). I restarted the cloud on all replicas but got the same 
>> error again. I think this exception is due to this bug: 
>> https://issues.apache.org/jira/browse/SOLR-4899 
>> I guess zookeeper never registered the replicas as leaders.
>>
>> After the dynamic compute machines were rebuilt (losing all local data), I 
>> restarted the entire cloud (6 zookeeper instances and 10 Solr nodes). The 
>> original leaders were still the leaders (I think the zookeeper config never 
>> got updated with the replicas being leaders, even though 2 zookeeper 
>> instances were still up). Since every leader's /local/data/solr_data was 
>> empty, the empty index got replicated to all replicas and we lost all the 
>> data on the replicas - 26 million documents. This was awful.
>>
>> In our start up script (which brings up solr on all nodes one by one), the 
>> leaders are listed first.
>>
>> Is there any solution to this until the Solr 4.4 release?
>>
>> Many Thanks!
>>
