This is due to leader-initiated recovery. Take a look at

https://issues.apache.org/jira/browse/SOLR-9446

On Oct 24, 2016 1:23 PM, "jimtronic" <jimtro...@gmail.com> wrote:

> We are running into a timing issue when trying to do a scripted deployment
> of our Solr Cloud cluster.
>
> Scenario to reproduce (sometimes):
>
> 1. launch 3 clean solr nodes connected to zookeeper.
> 2. create a 1 shard collection with replicas on each node.
> 3. load data (more will make the problem worse)
> 4. launch 3 more nodes
> 5. add replicas to each new node
> 6. once entire cluster is healthy, start killing first three nodes.
>
> Depending on the timing, the second three nodes end up all in RECOVERING
> state without a leader.
>
> This appears to be happening because when the first leader dies, all the
> new nodes go into full replication recovery. If all the old boxes happen to
> die while the new nodes are in that state, the new nodes are stuck: they
> cannot serve requests, and they eventually (after 1-8 hours) go into
> RECOVERY_FAILED state.
>
> This state is easy to fix with a FORCELEADER call to the collections API,
> but that's only remediation, not prevention.
>
> My question is this: Why do the new nodes have to go into full replication
> recovery when they are already up to date? I just added the replica, so it
> shouldn't have to do a full replication again.
>
> Jim
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Cloud-A-B-Deployment-Issue-tp4302810.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
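For anyone scripting the FORCELEADER remediation mentioned in the quoted message, here is a minimal sketch in Python. It assumes the JSON shape returned by the Collections API CLUSTERSTATUS action (cluster -> collections -> shards -> replicas, with the leader replica carrying "leader": "true"). The collection name, shard name, and base URL are placeholders, and the helper names are hypothetical. It only finds shards with no leader and builds the FORCELEADER URL; it does not send the request, and it does nothing to prevent the recovery problem itself.

```python
from urllib.parse import urlencode


def leaderless_shards(cluster_status, collection):
    """Return the names of shards in `collection` with no replica marked as leader.

    `cluster_status` is assumed to be the parsed JSON body of a
    CLUSTERSTATUS call; the layout used here is an assumption.
    """
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    stuck = []
    for shard_name, shard in shards.items():
        replicas = shard.get("replicas", {}).values()
        # The leader flag is treated as a string ("true") or a boolean.
        has_leader = any(
            str(r.get("leader", "false")).lower() == "true" for r in replicas
        )
        if not has_leader:
            stuck.append(shard_name)
    return stuck


def force_leader_url(solr_base, collection, shard):
    """Build the Collections API URL for a FORCELEADER call on one shard."""
    params = urlencode(
        {"action": "FORCELEADER", "collection": collection, "shard": shard}
    )
    return f"{solr_base}/admin/collections?{params}"


# Hypothetical status resembling the stuck cluster described above:
# every replica is recovering and none is the leader.
sample_status = {
    "cluster": {
        "collections": {
            "mycoll": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node4": {"state": "recovering"},
                            "core_node5": {"state": "recovering"},
                            "core_node6": {"state": "recovering"},
                        }
                    }
                }
            }
        }
    }
}

for shard in leaderless_shards(sample_status, "mycoll"):
    print(force_leader_url("http://localhost:8983/solr", "mycoll", shard))
```

A deployment script could run the same check before killing the old nodes in step 6, and hold off until every shard reports a leader and the new replicas are active.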
