This is due to leader-initiated recovery. Take a look at https://issues.apache.org/jira/browse/SOLR-9446

On Oct 24, 2016 1:23 PM, "jimtronic" <jimtro...@gmail.com> wrote:

> We are running into a timing issue when trying to do a scripted deployment of
> our Solr Cloud cluster.
>
> Scenario to reproduce (sometimes):
>
> 1. launch 3 clean solr nodes connected to zookeeper.
> 2. create a 1 shard collection with replicas on each node.
> 3. load data (more will make the problem worse)
> 4. launch 3 more nodes
> 5. add replicas to each new node
> 6. once entire cluster is healthy, start killing first three nodes.
>
> Depending on the timing, the second three nodes end up all in RECOVERING
> state without a leader.
>
> This appears to be happening because when the first leader dies, all the new
> nodes go into full replication recovery and if all the old boxes happen to
> die during that state, the boxes are stuck. The boxes cannot serve requests
> and they eventually (1-8 hours) go into RECOVERY_FAILED state.
>
> This state is easy to fix with a FORCELEADER call to the collections API,
> but that's only remediation, not prevention.
>
> My question is this: Why do the new nodes have to go into full replication
> recovery when they are already up to date? I just added the replica, so it
> shouldn't have to do a new full replication again.
>
> Jim
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Cloud-A-B-Deployment-Issue-tp4302810.html
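
For anyone scripting the steps in the quoted message, here is a rough Python sketch of two pieces: building the FORCELEADER request mentioned above, and checking a parsed CLUSTERSTATUS response before killing the old nodes. The base URL, collection name, and shard name are placeholders, not values from this thread, and the function names are made up for illustration. Waiting for every replica to report "active" is only a sanity check; it may not by itself avoid the recovery behaviour discussed in SOLR-9446.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def forceleader_url(base_url, collection, shard):
    """Build a Collections API FORCELEADER request URL.

    base_url is assumed to look like "http://host:8983/solr" (placeholder).
    """
    query = urlencode({
        "action": "FORCELEADER",
        "collection": collection,
        "shard": shard,
        "wt": "json",
    })
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def shard_is_healthy(cluster_status, collection, shard):
    """Return True if every replica of the shard is active and one is leader.

    cluster_status is the parsed JSON body of a CLUSTERSTATUS response.
    """
    replicas = (cluster_status["cluster"]["collections"][collection]
                ["shards"][shard]["replicas"].values())
    all_active = all(r.get("state") == "active" for r in replicas)
    has_leader = any(r.get("leader") == "true" for r in replicas)
    return all_active and has_leader


def force_leader(base_url, collection, shard, timeout=30):
    """Send the FORCELEADER request and return the raw response body."""
    url = forceleader_url(base_url, collection, shard)
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A deployment script could poll CLUSTERSTATUS until `shard_is_healthy(...)` returns True before step 6, and keep `force_leader(...)` as the fallback if the shard still ends up without a leader.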