We have a SolrCloud cluster (5 shards, 2 replicas each) on 10 dynamic compute boxes (cloud): the 5 leaders are in datacenter1 and the replicas are in datacenter2. We have 6 ZooKeeper instances, 4 in datacenter1 and 2 in datacenter2, running on the same hosts as the Solr nodes. We use local disk (/local/data) to store the Solr index files.
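An aside on the numbers above (a sketch I added, not a claim about this specific cluster's config): ZooKeeper needs a strict majority of the ensemble to form a quorum, so a 6-instance ensemble split 4/2 across datacenters cannot survive losing the 4-instance side. `majority` and `has_quorum` are hypothetical helper names:

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of ZooKeeper instances that forms a quorum."""
    return ensemble_size // 2 + 1

def has_quorum(nodes_up: int, ensemble_size: int) -> bool:
    """True if enough instances are up to elect/serve."""
    return nodes_up >= majority(ensemble_size)

# 6-instance ensemble: quorum needs 4 instances.
print(majority(6))       # 4
# Losing the 4 instances in datacenter1 leaves 2 up: no quorum.
print(has_quorum(2, 6))  # False
# Even 3 of 6 would not be enough (majority must be strict).
print(has_quorum(3, 6))  # False
```

This is why an even ensemble split across two datacenters is fragile: whichever side holds the majority takes the whole quorum down with it.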
The infrastructure team wanted to rebuild the dynamic compute boxes in datacenter1, so we handed all the leader hosts over to them. In doing so, we lost 4 ZooKeeper instances. We expected all replicas to take over as leaders. To confirm that, I went to the admin console -> Cloud page, but the page never returned (it kept hanging). I checked the logs and saw constant ZooKeeper host connection exceptions (the zkHost system property listed all 6 ZooKeeper instances). I restarted the cloud on all replicas but got the same error again. I think this exception is due to this bug: https://issues.apache.org/jira/browse/SOLR-4899 . I guess ZooKeeper never registered the replicas as leaders.

After the dynamic compute machines were rebuilt (losing all local data), I restarted the entire cloud (6 ZooKeeper instances and 10 nodes). The original leaders were still the leaders; I think the ZooKeeper state never got updated to make the replicas leaders, even though 2 ZooKeeper instances were still up. Since /local/data/solr_data on every leader was now empty, the empty index was replicated to all replicas and we lost all the data on them: 26 million documents. This was very awful.

In our startup script (which brings up Solr on all nodes one by one), the leaders are listed first. Is there any solution to this until the Solr 4.4 release? Many thanks!
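One mitigation sketch for a startup script like the one described above (my assumption, not the original script): refuse to start any Solr node until a quorum of the zkHost list is actually answering. It uses ZooKeeper's real `ruok` four-letter command, which a serving instance answers with `imok`; `is_zk_ok` and `zk_quorum_up` are hypothetical helper names, and host names are placeholders:

```python
import socket

def is_zk_ok(host: str, port: int, timeout: float = 2.0) -> bool:
    """Send ZooKeeper's 'ruok' command; a healthy server replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"ruok")
            return s.recv(4) == b"imok"
    except OSError:
        # Connection refused / timed out: treat the instance as down.
        return False

def zk_quorum_up(zk_hosts: list[tuple[str, int]]) -> bool:
    """True if a strict majority of the ensemble is answering."""
    up = sum(1 for host, port in zk_hosts if is_zk_ok(host, port))
    return up >= len(zk_hosts) // 2 + 1

# Usage sketch with placeholder host names matching the 4+2 layout:
# ensemble = [("dc1-zk1", 2181), ("dc1-zk2", 2181), ("dc1-zk3", 2181),
#             ("dc1-zk4", 2181), ("dc2-zk1", 2181), ("dc2-zk2", 2181)]
# if not zk_quorum_up(ensemble):
#     raise SystemExit("No ZooKeeper quorum; refusing to start Solr")
```

A check like this would not have prevented the empty-leader replication by itself, but it stops nodes from coming up against a ZooKeeper ensemble that cannot serve, which is the state the cluster was restarted into here.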