Re: Solr Cloud A/B Deployment Issue
Great. Thanks for the work on this patch!

Jim

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-A-B-Deployment-Issue-tp4302810p4303357.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Cloud A/B Deployment Issue
Nodes will still go into recovery, but only for a short duration.

On Oct 26, 2016 1:26 PM, "jimtronic" <jimtro...@gmail.com> wrote:
> It appears this has all been resolved by the following ticket:
> https://issues.apache.org/jira/browse/SOLR-9446
>
> My scenario fails in 6.2.1 but works in 6.3 and master, where this bug has
> been fixed. In the meantime, we can use our workaround: issue a simple
> delete command that deletes a non-existent document.
>
> Jim
Re: Solr Cloud A/B Deployment Issue
This is due to leader-initiated recovery. Take a look at https://issues.apache.org/jira/browse/SOLR-9446

On Oct 24, 2016 1:23 PM, "jimtronic" <jimtro...@gmail.com> wrote:
> We are running into a timing issue when trying to do a scripted deployment
> of our Solr Cloud cluster.
>
> Scenario to reproduce (sometimes):
>
> 1. Launch 3 clean Solr nodes connected to ZooKeeper.
> 2. Create a 1-shard collection with replicas on each node.
> 3. Load data (more will make the problem worse).
> 4. Launch 3 more nodes.
> 5. Add replicas to each new node.
> 6. Once the entire cluster is healthy, start killing the first three nodes.
>
> Depending on the timing, the second three nodes all end up in RECOVERING
> state without a leader.
>
> This appears to happen because when the first leader dies, all the new
> nodes go into full replication recovery, and if all the old boxes happen to
> die during that state, the new boxes are stuck. They cannot serve requests,
> and they eventually (1-8 hours) go into RECOVERY_FAILED state.
>
> This state is easy to fix with a FORCELEADER call to the Collections API,
> but that's only remediation, not prevention.
>
> My question is this: why do the new nodes have to go into full replication
> recovery when they are already up to date? I just added the replica, so it
> shouldn't have to do a full replication again.
>
> Jim
Re: Solr Cloud A/B Deployment Issue
It appears this has all been resolved by the following ticket: https://issues.apache.org/jira/browse/SOLR-9446

My scenario fails in 6.2.1 but works in 6.3 and master, where this bug has been fixed. In the meantime, we can use our workaround: issue a simple delete command that deletes a non-existent document.

Jim
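The workaround above can be sketched as a single update request. This is a hedged illustration only: the collection name "mycoll", the host/port, and the document id "no-such-doc" are placeholders, not names from the thread.

```shell
# Sketch of the workaround: delete a document id that is assumed not to
# exist, so Solr records a transaction-log entry without changing any data.
# "mycoll" and "no-such-doc" are illustrative placeholders.
PAYLOAD='{"delete":{"id":"no-such-doc"}}'
echo "$PAYLOAD"
# Against a live cluster, one would POST it to the update handler, e.g.:
# curl "http://localhost:8983/solr/mycoll/update?commit=true" \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```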
Re: Solr Cloud A/B Deployment Issue
Also, if we issue a delete-by-query where the query is "_version_:0", it likewise creates a transaction log and then has no trouble transferring leadership between old and new nodes.

Still, it seems like when we ADDREPLICA, some sort of transaction log should be started.

Jim
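The delete-by-query variant described above can be sketched the same way. As before, the collection name and host are assumptions for illustration; "_version_:0" is the query from the message, chosen because it matches no documents.

```shell
# Sketch of the delete-by-query variant: the query matches nothing, but the
# update still generates a transaction-log entry. "mycoll" is a placeholder.
PAYLOAD='{"delete":{"query":"_version_:0"}}'
echo "$PAYLOAD"
# curl "http://localhost:8983/solr/mycoll/update?commit=true" \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```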
Re: Solr Cloud A/B Deployment Issue
Interestingly, if I simply add one document to the full cluster after all 6 nodes are active, this entire problem goes away. This appears to be because a transaction log entry is created, which in turn prevents the new nodes from going into full replication recovery upon leader change. Adding a document is a hacky solution, however. It seems like new nodes added via ADDREPLICA should know more about versions than they currently do.
Solr Cloud A/B Deployment Issue
We are running into a timing issue when trying to do a scripted deployment of our Solr Cloud cluster.

Scenario to reproduce (sometimes):

1. Launch 3 clean Solr nodes connected to ZooKeeper.
2. Create a 1-shard collection with replicas on each node.
3. Load data (more will make the problem worse).
4. Launch 3 more nodes.
5. Add replicas to each new node.
6. Once the entire cluster is healthy, start killing the first three nodes.

Depending on the timing, the second three nodes all end up in RECOVERING state without a leader.

This appears to happen because when the first leader dies, all the new nodes go into full replication recovery, and if all the old boxes happen to die during that state, the new boxes are stuck. They cannot serve requests, and they eventually (1-8 hours) go into RECOVERY_FAILED state.

This state is easy to fix with a FORCELEADER call to the Collections API, but that's only remediation, not prevention.

My question is this: why do the new nodes have to go into full replication recovery when they are already up to date? I just added the replica, so it shouldn't have to do a full replication again.

Jim
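The FORCELEADER remediation mentioned above is a Collections API call; a minimal sketch follows. The collection name "mycoll" and shard name "shard1" are placeholder assumptions for illustration.

```shell
# Sketch of the remediation: the Collections API FORCELEADER action forces
# leader election for a shard stuck without a leader. Names are placeholders.
COLLECTION=mycoll
SHARD=shard1
URL="http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=${COLLECTION}&shard=${SHARD}"
echo "$URL"
# Against a live cluster:
# curl "$URL"
```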