On the nodes that have the replica in a recovering state we now see:

19-07-2016 16:18:28 ERROR RecoveryStrategy:159 - Error while trying to
recover. core=lookups_shard1_replica8:org.apache.solr.common.SolrException:
No registered leader was found after waiting for 4000ms, collection:
lookups slice: shard1
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:607)
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:593)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
19-07-2016 16:18:28 INFO RecoveryStrategy:444 - Replay not started, or
was not successful... still buffering updates.
19-07-2016 16:18:28 ERROR RecoveryStrategy:481 - Recovery failed -
trying again... (164)
19-07-2016 16:18:28 INFO RecoveryStrategy:503 - Wait [12.0] seconds
before trying to recover again (attempt=165)

This is with the "leader that is not the leader" shut down.

Issuing a FORCELEADER via the collections API doesn't in fact force a
leader election to occur. Is there any other way to prompt Solr to
hold an election?

Cheers

Tom

On Tue, Jul 19, 2016 at 5:10 PM, Tom Evans <tevans...@googlemail.com> wrote:
> There are 11 collections, each only has one shard, and each node has
> 10 replicas (9 collections are on every node, 2 are just on one node).
> We're not seeing any OOM errors on restart.
>
> I think we're being patient waiting for the leader election to occur.
> We stopped the troublesome "leader that is not the leader" server
> about 15-20 minutes ago, but we still have not had a leader election.
>
> Cheers
>
> Tom
>
> On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> How many replicas per Solr JVM? And do you
>> see any OOM errors when you bounce a server?
>> And how patient are you being, because it can
>> take 3 minutes for a leaderless shard to decide
>> it needs to elect a leader.
>>
>> See SOLR-7280 and SOLR-7191 for the case
>> where lots of replicas are in the same JVM;
>> the tell-tale symptom is errors in the log as you
>> bring Solr up saying something like
>> "OutOfMemory error.... unable to create native thread"
>>
>> SOLR-7280 has patches for 6x and 7x, with a 5x one
>> being added momentarily.
>>
>> Best,
>> Erick
>>
>> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
>>> Hi all - problem with a SolrCloud 5.5.0: we have a node that has most
>>> of the collections on it marked as "Recovering" or "Recovery Failed".
>>> It attempts to recover from the leader, but the leader responds with:
>>>
>>> Error while trying to recover.
>>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:30000/solr: We are not the
>>> leader
>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>> at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>>> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by:
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:30000/solr: We are not the
>>> leader
>>> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
>>> at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
>>> at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
>>> ... 5 more
>>>
>>> and recovery never occurs.
>>>
>>> Each collection in this state has plenty (10+) of active replicas, but
>>> stopping the server that is marked as the leader doesn't trigger a
>>> leader election amongst these replicas.
>>>
>>> REBALANCELEADERS did nothing.
>>> FORCELEADER complains that there is already a leader.
>>> FORCELEADER with the purported leader stopped took 45 seconds,
>>> reported a status of "0" (and no other message) and kept the down node
>>> as the leader (!)
>>> Deleting the failed collection from the failed node and re-adding it
>>> produced the same "Leader said I'm not the leader" error message.
>>>
>>> Any other ideas?
>>>
>>> Cheers
>>>
>>> Tom
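
[Editor's note: for readers following along, the commands discussed in this
thread can be issued roughly as below. The host/port and collection name are
taken from the thread; the ZooKeeper address (localhost:2181) and script path
are assumptions based on a default SolrCloud 5.x layout, so adjust them for
your cluster.]

```shell
# Force a leader for shard1 of the "lookups" collection via the
# Collections API (only intended for shards that currently have no leader):
curl "http://172.31.1.171:30000/solr/admin/collections?action=FORCELEADER&collection=lookups&shard=shard1"

# Inspect the leader election queue in ZooKeeper using ZooKeeper's own CLI;
# the ephemeral node with the lowest sequence number is first in line to
# become leader. A stale entry here can explain a stuck election.
zookeeper/bin/zkCli.sh -server localhost:2181 \
  ls /collections/lookups/leader_elect/shard1/election

# Check what the cluster state currently records for the collection:
curl "http://172.31.1.171:30000/solr/admin/collections?action=CLUSTERSTATUS&collection=lookups"
```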