Hello,

My Solr cluster (Solr 4.3) runs on RH Linux under the Tomcat 7 servlet container: numShards=40, replicationFactor=2, 40 servers, each hosting 2 replicas.
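For reference, a collection with this layout can be created through the Collections API. Below is a minimal SolrJ sketch; the host name is illustrative and the creation call itself is only my assumption about how such a collection would be set up, not a record of what was actually run:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateRawCollection {
    public static void main(String[] args) throws Exception {
        // Any live node can accept Collections API calls; this host is illustrative.
        HttpSolrServer server = new HttpSolrServer("http://solr-prod01:8080/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "raw");          // collection name as seen in the logs below
        params.set("numShards", 40);        // 40 shards
        params.set("replicationFactor", 2); // 2 replicas per shard

        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/collections"); // route to the Collections API handler
        server.request(request);
        server.shutdown();
    }
}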
For experimental reasons I split my cluster into 2 sub-clusters, each containing a single replica of every shard. When I reconnected the sub-clusters, the sync failed (more than 100 docs had been indexed per shard, beyond PeerSync's 100-update window), so a full replication process started on sub-cluster #2. Because of the transient storage the replication process needs, I removed the entire index from sub-cluster #2 before reconnecting it, and then brought sub-cluster #2's servers back in bulks of 3-4 to avoid high disk load. The first bulks replicated fine, but after a while an internal script pkill-ed all the Solr instances, some of them mid-replication. After restarting the servlet container I discovered the disaster: on some of the replicas that were still replicating, ZooKeeper elected the wrong leader - good-state replicas (sub-cluster #1) replicated from empty replicas (sub-cluster #2), ending up with all documents in those shards removed!!

These are the logs from solr-prod32 (sub-cluster #2 - bad state). raw_shard1_replica1 is elected leader although it was not the leader before the replication process (and should not have the higher version numbers):

2013-08-13 13:39:15.838 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext Enough replicas found to continue.
2013-08-13 13:39:15.838 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext I may be the new leader - try and sync
2013-08-13 13:39:15.839 [INFO ] org.apache.solr.cloud.SyncStrategy Sync replicas to http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:15.841 [INFO ] org.apache.solr.client.solrj.impl.HttpClientUtil Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
2013-08-13 13:39:15.844 [INFO ] org.apache.solr.update.PeerSync PeerSync: core=raw_shard1_replica1 url=http://solr-prod32:8080/solr START replicas=[http://solr-prod02:8080/solr/raw_shard1_replica2/] nUpdates=100
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.update.PeerSync PeerSync: core=raw_shard1_replica1 url=http://solr-prod32:8080/solr DONE. We have no versions. sync failed.
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.SyncStrategy Leader's attempt to sync with shard failed, moving to the next candidate
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext I am the new leader: http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.common.cloud.SolrZkClient makePath: /collections/raw/leaders/shard1
2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 40)
Meanwhile on solr-prod02 (sub-cluster #1 - good state) I get:

2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController publishing core=raw_shard1_replica2 state=down
2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController numShards not found on descriptor - reading it from system property
2013-08-13 13:39:15.673 [INFO ] org.apache.solr.core.CoreContainer registering core: raw_shard1_replica2
2013-08-13 13:39:15.673 [INFO ] org.apache.solr.cloud.ZkController Register replica - core:raw_shard1_replica2 address:http://solr-prod02:8080/solr collection:raw shard:shard1
2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 40)
2013-08-13 13:39:17.480 [INFO ] org.apache.solr.cloud.ZkController We are http://solr-prod02:8080/solr/raw_shard1_replica2/ and leader is http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController No LogReplay needed for core=raw_shard1_replica2
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController Core needs to recover:raw_shard1_replica2
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.update.DefaultSolrCoreState Running recovery - first canceling any ongoing recovery
2013-08-13 13:39:17.485 [INFO ] org.apache.solr.common.cloud.ZkStateReader Updating cloud state from ZooKeeper...
2013-08-13 13:39:17.485 [INFO ] org.apache.solr.cloud.RecoveryStrategy Starting recovery process. core=raw_shard1_replica2

Why was the wrong leader elected?

Thanks
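P.S. To double-check which replica ZooKeeper currently considers the leader of each shard, a small SolrJ sketch along these lines can be used (the ZooKeeper ensemble address is illustrative - adjust it to your environment, and it assumes the collection "raw" exists):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class PrintShardLeaders {
    public static void main(String[] args) throws Exception {
        // Illustrative ZooKeeper ensemble address - replace with your own.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.connect();

        // Read the cluster state that the leader election wrote to ZooKeeper.
        ClusterState state = server.getZkStateReader().getClusterState();
        for (Slice slice : state.getSlices("raw")) {
            Replica leader = slice.getLeader();
            System.out.println(slice.getName() + " -> "
                    + (leader == null
                            ? "no leader elected"
                            : leader.getStr(ZkStateReader.BASE_URL_PROP)
                              + "/" + leader.getStr(ZkStateReader.CORE_NAME_PROP)));
        }
        server.shutdown();
    }
}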