On the nodes that have the replica in a recovering state we now see:

19-07-2016 16:18:28 ERROR RecoveryStrategy:159 - Error while trying to
recover. core=lookups_shard1_replica8:org.apache.solr.common.SolrException:
No registered leader was found after waiting for 4000ms, collection:
lookups slice: shard1
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:607)
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:593)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
19-07-2016 16:18:28 INFO RecoveryStrategy:444 - Replay not started, or
was not successful... still buffering updates.
19-07-2016 16:18:28 ERROR RecoveryStrategy:481 - Recovery failed -
trying again... (164)
19-07-2016 16:18:28 INFO RecoveryStrategy:503 - Wait [12.0] seconds
before trying to recover again (attempt=165)

This is with the "leader that is not the leader" shut down.

Issuing a FORCELEADER via the collections API doesn't in fact force a
leader election to occur. Is there any other way to prompt Solr to
hold an election?

Cheers

Tom

On Tue, Jul 19, 2016 at 5:10 PM, Tom Evans <tevans...@googlemail.com> wrote:
> There are 11 collections, each only has one shard, and each node has
> 10 replicas (9 collections are on every node, 2 are just on one node).
> We're not seeing any OOM errors on restart.
>
> I think we're being patient waiting for the leader election to occur.
> We stopped the troublesome "leader that is not the leader" server
> about 15-20 minutes ago, but we still have not had a leader election.
>
> Cheers
>
> Tom
>
> On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> How many replicas per Solr JVM? And do you
>> see any OOM errors when you bounce a server?
>> And how patient are you being, because it can
>> take 3 minutes for a leaderless shard to decide
>> it needs to elect a leader.
>>
>> See SOLR-7280 and SOLR-7191 for the case
>> where lots of replicas are in the same JVM;
>> the tell-tale symptom is errors in the log as you
>> bring Solr up saying something like
>> "OutOfMemory error.... unable to create native thread"
>>
>> SOLR-7280 has patches for 6x and 7x, with a 5x one
>> being added momentarily.
>>
>> Best,
>> Erick
>>
>> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
>>> Hi all - problem with a SolrCloud 5.5.0: we have a node that has most
>>> of the collections on it marked as "Recovering" or "Recovery Failed".
>>> It attempts to recover from the leader, but the leader responds with:
>>>
>>> Error while trying to recover.
>>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:30000/solr: We are not the
>>> leader
>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>> at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>>> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by:
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:30000/solr: We are not the
>>> leader
>>> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
>>> at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
>>> at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
>>> ... 5 more
>>>
>>> and recovery never occurs.
>>>
>>> Each collection in this state has plenty (10+) of active replicas, but
>>> stopping the server that is marked as the leader doesn't trigger a
>>> leader election amongst these replicas.
>>>
>>> REBALANCELEADERS did nothing.
>>> FORCELEADER complains that there is already a leader.
>>> FORCELEADER with the purported leader stopped took 45 seconds,
>>> reported a status of "0" (and no other message) and kept the down node
>>> as the leader (!)
>>> Deleting the failed collection from the failed node and re-adding it
>>> produced the same "Leader said I'm not the leader" error message.
>>>
>>> Any other ideas?
>>>
>>> Cheers
>>>
>>> Tom
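
[Editor's note: for readers following along, the commands discussed in this
thread can be issued roughly as below. The host/port and collection name are
taken from the thread; the ZooKeeper address (localhost:2181) and script path
are assumptions based on a default SolrCloud 5.x layout, so adjust them for
your cluster.]

```shell
# Force a leader for shard1 of the "lookups" collection via the
# Collections API (only intended for shards that currently have no leader):
curl "http://172.31.1.171:30000/solr/admin/collections?action=FORCELEADER&collection=lookups&shard=shard1"

# Inspect the leader election queue in ZooKeeper using ZooKeeper's own CLI;
# the ephemeral node with the lowest sequence number is first in line to
# become leader. A stale entry here can explain a stuck election.
zookeeper/bin/zkCli.sh -server localhost:2181 \
  ls /collections/lookups/leader_elect/shard1/election

# Check what the cluster state currently records for the collection:
curl "http://172.31.1.171:30000/solr/admin/collections?action=CLUSTERSTATUS&collection=lookups"
```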