Hi all,

I have a cluster with 36 shards and 3 replicas per shard. I recently had to
restart the entire cluster. Most of the shards and replicas are back up, but
a few shards went without a leader for a very long time (close to 18
hours now). I tried reloading these cores and even the servlet containers
hosting them. Only now do all the shards have leaders allocated,
but a few of these leaders are still shown with Recovery Failed status in the
Solr Cloud tree view.
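
For anyone wanting to list such shards without clicking through the tree view, here is a minimal sketch in Python. It assumes the Solr 4.x clusterstate.json layout (collection -> "shards" -> shard -> "replicas", with the leader marked by "leader":"true"); the function name and the sample data are made up for illustration and are not taken from this cluster.

```python
import json

def shards_with_bad_leader(cluster_state, collection):
    """Return {shard: reason} for shards with no leader or a non-active leader."""
    problems = {}
    shards = cluster_state[collection]["shards"]
    for shard_name, shard in shards.items():
        # Assumption: the leader replica carries the string property "leader": "true".
        leaders = [
            (name, props)
            for name, props in shard.get("replicas", {}).items()
            if props.get("leader") == "true"
        ]
        if not leaders:
            problems[shard_name] = "no leader"
            continue
        name, props = leaders[0]
        if props.get("state") != "active":
            problems[shard_name] = "leader %s is %s" % (name, props.get("state"))
    return problems

# Illustrative sample only; a real clusterstate.json has more properties per replica.
SAMPLE = json.loads("""
{
  "collection1": {"shards": {
    "shard1": {"replicas": {
      "core_node1": {"state": "active", "leader": "true"},
      "core_node2": {"state": "active"}}},
    "shard2": {"replicas": {
      "core_node3": {"state": "recovery_failed", "leader": "true"},
      "core_node4": {"state": "recovering"}}},
    "shard3": {"replicas": {
      "core_node5": {"state": "down"}}}
  }}
}
""")

print(shards_with_bad_leader(SAMPLE, "collection1"))
```

On a real cluster the input would be a saved copy of clusterstate.json loaded with json.load instead of SAMPLE.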


I see the following in the logs for these shards:
INFO  - 2015-01-20 14:38:19.797;
org.apache.solr.handler.admin.CoreAdminHandler; In WaitForState(recovering):
collection=collection1, shard=shard1, thisCore=collection1_shard1_replica3,
leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true,
currentState=recovering, localState=recovery_failed,
nodeName=10.68.77.9:8983_solr, coreNodeName=core_node2,
onlyIfActiveCheckResult=true, nodeProps:
core_node2:{"state":"recovering","core":"collection1_shard1_replica1","node_name":"10.68.77.9:8983_solr","base_url":"http://10.68.77.9:8983/solr"}
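
To compare currentState and localState across many such lines, a small sketch in Python for pulling the key=value fields out of a WaitForState message. The function name is hypothetical and the pattern is only an assumption based on the line above; it is not a Solr utility.

```python
import re

def parse_wait_for_state(line):
    """Extract key=value fields from a WaitForState log message into a dict."""
    # Drop the trailing nodeProps JSON so its contents are not mistaken for fields.
    head = line.split("nodeProps:")[0]
    # "isLeader? true" uses "? " instead of "="; normalise it first.
    head = head.replace("isLeader? ", "isLeader=")
    return dict(re.findall(r"(\w+)=([^,\s]+)", head))

# The message from the log above, joined onto one line (nodeProps elided).
LINE = (
    "In WaitForState(recovering): collection=collection1, shard=shard1, "
    "thisCore=collection1_shard1_replica3, leaderDoesNotNeedRecovery=false, "
    "isLeader? true, live=true, checkLive=true, currentState=recovering, "
    "localState=recovery_failed, nodeName=10.68.77.9:8983_solr, "
    "coreNodeName=core_node2, onlyIfActiveCheckResult=true, nodeProps: ..."
)

fields = parse_wait_for_state(LINE)
print(fields["currentState"], fields["localState"])
```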


And on the other server hosting a replica for this shard:
ERROR - 2015-01-20 14:38:20.768; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: I was asked to wait on state
recovering for shard3 in collection1 on 10.68.77.9:8983_solr but I still do
not see the requested state. I see state: recovering live:true leader from
ZK: http://10.68.77.3:8983/solr/collection1_shard3_replica3/
        at
org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:999)
        at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:245)
        at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
        at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
        at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
        at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
        at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:368)
        at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
        at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
        at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
        at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
        at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)


I see that no replica catch-up is going on between any of these
servers now.
A couple of questions:
1. What is SolrCloud waiting on before it allocates leaders for
such shards?
2. Why do a few of these shards show leaders in the Recovery Failed state, and
how do I recover such shards?

Thanks,
Anand


