We have a SolrCloud cluster (5 shards, replication factor 2) on 10 dynamic
compute boxes (cloud), where the 5 leader machines are in datacenter1 and the
replicas are in datacenter2. We have 6 zookeeper instances - 4 in datacenter1
and 2 in datacenter2. The zookeeper instances run on the same hosts as the
Solr nodes. We're using local disk (/local/data) to store the Solr index
files.
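For reference, the ensemble is defined in zoo.cfg roughly like this (the
hostnames are made up for illustration; ports are the zookeeper defaults):

    # zoo.cfg - hypothetical hostnames
    server.1=dc1-host1:2888:3888
    server.2=dc1-host2:2888:3888
    server.3=dc1-host3:2888:3888
    server.4=dc1-host4:2888:3888
    server.5=dc2-host1:2888:3888
    server.6=dc2-host2:2888:3888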

The infrastructure team wanted to rebuild the dynamic compute boxes in
datacenter1, so we handed all the leader hosts over to them. By doing so, we
lost 4 of the 6 zookeeper instances - leaving only 2, below the majority of 4
that a 6-node ensemble needs for quorum, so zookeeper could no longer accept
writes. We were expecting to see all the replicas acting as leaders. To
confirm that, I went to admin console -> cloud page, but the page never
returned (it kept hanging). I checked the logs and saw constant zookeeper host
connection exceptions (the zkHost system property listed all 6 zookeeper
instances). I restarted the cloud on all replicas but got the same error
again. I think this exception is due to this known bug:
https://issues.apache.org/jira/browse/SOLR-4899
I guess zookeeper never registered the replicas as leaders - without quorum
it couldn't.
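For the record, whether the surviving zookeeper nodes still form a quorum can
be checked with zookeeper's four-letter 'stat' command (hostnames made up
again):

    # a node that has lost quorum answers 'stat' with
    # "This ZooKeeper instance is not currently serving requests"
    for h in dc2-host1 dc2-host2; do
      echo "== $h =="
      echo stat | nc $h 2181
    done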

After the dynamic compute machines were rebuilt (losing all local data), I
restarted the entire cloud (all 6 zookeeper instances and all 10 Solr nodes),
and the original leaders were still the leaders (I think the cluster state in
zookeeper never got updated to show the replicas as leaders, even though 2
zookeeper instances had stayed up). Since every leader's /local/data/solr_data
was now empty, the empty indexes got replicated to all the replicas and we
lost all the data on them - 26 million documents. This was awful.
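In hindsight, a simple snapshot of each replica's index before restarting
would have saved the data - something like this (the backup path is made up;
the data path is per our setup):

    # run on each replica while Solr is stopped - hypothetical backup path
    tar czf /local/backup/solr_data_$(date +%F).tar.gz /local/data/solr_data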

In our startup script (which brings up Solr on all nodes one by one), the
leaders are listed first; a sketch of it follows.
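Roughly (hostnames and paths are made up; we start Solr the standard 4.x way
with -DzkHost pointing at all 6 zookeeper instances):

    #!/bin/bash
    # hypothetical startup script - dc1 (leader) hosts are listed first
    HOSTS="dc1-host1 dc1-host2 dc1-host3 dc1-host4 dc1-host5
           dc2-host1 dc2-host2 dc2-host3 dc2-host4 dc2-host5"
    ZK="dc1-host1:2181,dc1-host2:2181,dc1-host3:2181,dc1-host4:2181,dc2-host1:2181,dc2-host2:2181"
    for h in $HOSTS; do
      ssh "$h" "cd /opt/solr/example && \
        nohup java -DzkHost=$ZK -jar start.jar > solr.log 2>&1 &"
    done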

Is there any solution or workaround for this until the Solr 4.4 release?

Many Thanks!




