[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers
[ https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317027#comment-15317027 ]

James Hardwick commented on SOLR-7021:
--------------------------------------

[~forest_soup] since updating to Solr 5.5+ we haven't had such issues.
[jira] [Comment Edited] (SOLR-7940) [CollectionAPI] Frequent Cluster Status timeout
[ https://issues.apache.org/jira/browse/SOLR-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032540#comment-15032540 ]

James Hardwick edited comment on SOLR-7940 at 11/30/15 10:13 PM:
-----------------------------------------------------------------

We are seeing this as well on a 3 node cluster w/ 2 collections. Looks like others are also, across a variety of versions:

http://lucene.472066.n3.nabble.com/CLUSTERSTATUS-timeout-tp4173224.html
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201511.mbox/%3c5639dfcf.9020...@decalog.net%3E
http://grokbase.com/t/lucene/solr-user/154d0wjr7c/clusterstate-timeout

was (Author: hardwickj):
We are seeing this as well on a 3 node cluster w/ 2 collections. Looks like others are also, across a variety of versions:

http://lucene.472066.n3.nabble.com/CLUSTERSTATUS-timeout-tp4173224.html
[jira] [Commented] (SOLR-7940) [CollectionAPI] Frequent Cluster Status timeout
[ https://issues.apache.org/jira/browse/SOLR-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032540#comment-15032540 ]

James Hardwick commented on SOLR-7940:
--------------------------------------

We are seeing this as well on a 3 node cluster w/ 2 collections. Looks like others are also, across a variety of versions:

http://lucene.472066.n3.nabble.com/CLUSTERSTATUS-timeout-tp4173224.html

> [CollectionAPI] Frequent Cluster Status timeout
> -----------------------------------------------
>
>                 Key: SOLR-7940
>                 URL: https://issues.apache.org/jira/browse/SOLR-7940
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10.2
>         Environment: Ubuntu on Azure
>            Reporter: Stephan Lagraulet
>
> Very often we have a timeout when we call
> http://server2:8080/solr/admin/collections?action=CLUSTERSTATUS&wt=json
> {code}
> {"responseHeader":
>     {"status": 500,
>      "QTime": 180100},
>  "error":
>     {"msg": "CLUSTERSTATUS the collection time out:180s",
>      "trace": "org.apache.solr.common.SolrException: CLUSTERSTATUS the collection time out:180s\n\tat org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:368)\n\tat org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:320)\n\tat org.apache.solr.handler.admin.CollectionsHandler.handleClusterStatus(CollectionsHandler.java:640)\n\tat org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:220)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:267)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:350)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:630)\n\tat org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)\n\tat org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:77)\n\tat org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:606)\n\tat org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:46)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:603)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:538)\n\tat java.lang.Thread.run(Thread.java:745)\n",
>      "code": 500}}
> {code}
> The cluster has 3 Solr nodes with 6 small collections replicated on all nodes.
> We were using this api to monitor cluster state but it was failing every 10 minutes. We switched by using ZkStateReader in CloudSolrServer and it has been working for a day without problems.
> Is there a kind of deadlock as this call was being made on the three nodes concurrently?
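The workaround mentioned in the quoted description — reading cluster state through the ZkStateReader obtained from CloudSolrServer instead of polling the CLUSTERSTATUS endpoint — might look roughly like this minimal SolrJ 4.x sketch (the ZooKeeper connect string is a placeholder):

{code:java}
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class ClusterStateMonitor {
    public static void main(String[] args) throws Exception {
        // Connect via ZooKeeper so monitoring does not depend on the
        // CollectionsHandler/overseer path that is timing out here.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.connect();
        try {
            ZkStateReader reader = server.getZkStateReader();
            ClusterState state = reader.getClusterState();
            System.out.println("Live nodes: " + state.getLiveNodes());
            for (String collection : state.getCollections()) {
                for (Slice slice : state.getSlices(collection)) {
                    // Print every replica of every shard with its state.
                    System.out.println(collection + "/" + slice.getName()
                            + " -> " + slice.getReplicasMap());
                }
            }
        } finally {
            server.shutdown();
        }
    }
}
{code}

Because this reads the state straight from ZooKeeper, it sidesteps the 180s distributed-request timeout entirely, which would be consistent with the reporter's observation that the ZkStateReader approach kept working while CLUSTERSTATUS failed.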
[jira] [Commented] (SOLR-7940) [CollectionAPI] Frequent Cluster Status timeout
[ https://issues.apache.org/jira/browse/SOLR-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032552#comment-15032552 ]

James Hardwick commented on SOLR-7940:
--------------------------------------

Actually, we are consistently seeing this on any of a variety of instances we have, all of which are generally uniform in their configuration. I'd love to help if any of the Solr devs can point me in the right direction for doing any sort of diagnostics.
[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers
[ https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289032#comment-14289032 ]

James Hardwick commented on SOLR-7021:
--------------------------------------

Yep, we were looking at that one and we're wondering the same. The symptom is different but sounds like the solution might be the same. We'll give it a try!
[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers
[ https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289034#comment-14289034 ]

James Hardwick commented on SOLR-7021:
--------------------------------------

In the meantime, how do we best get around this? It still does not recover when we restart the cluster. Should manually kicking off a core reload for each node do the trick?
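For reference, a core reload like the one asked about above can be kicked off per node through the CoreAdmin API; a minimal SolrJ 4.x sketch follows (host and core name are taken from the logs in this issue; whether a reload alone clears the stuck state is exactly the open question here):

{code:java}
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;

public class ReloadCore {
    public static void main(String[] args) throws Exception {
        // Talk to the node's CoreAdmin handler directly, not to a collection.
        SolrServer node = new HttpSolrServer("http://xxx.xxx.xxx.35:8081/solr");
        try {
            // Equivalent to /solr/admin/cores?action=RELOAD&core=xyzcore
            CoreAdminResponse rsp = CoreAdminRequest.reloadCore("xyzcore", node);
            System.out.println("Reload status: " + rsp.getStatus());
        } finally {
            node.shutdown();
        }
    }
}
{code}

This would need to be run once per node hosting the affected core.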
[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers
[ https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289451#comment-14289451 ]

James Hardwick commented on SOLR-7021:
--------------------------------------

That worked, Shalin. Thank you!
[jira] [Created] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers
James Hardwick created SOLR-7021:
------------------------------------

             Summary: Leader will not publish core as active without recovering first, but never recovers
                 Key: SOLR-7021
                 URL: https://issues.apache.org/jira/browse/SOLR-7021
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.10
            Reporter: James Hardwick
            Priority: Critical

A little background: 1 core solr-cloud cluster across 3 nodes, each with its own shard and each shard with a single replica, hence each replica is itself a leader.

For reasons we won't get into, we witnessed a shard go down in our cluster. We restarted the cluster but our core/shards still did not come back up. After inspecting the logs, we found this:

{code}
2015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO cloud.ZkController - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is http://xxx.xxx.xxx.35:8081/solr/xyzcore/
2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController - I am the leader, no recovery necessary
2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController - publishing core=xyzcore state=active collection=xyzcore
2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO cloud.ZkController - publishing core=xyzcore state=down collection=xyzcore
2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer - :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' as active without recovering first!
	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
{code}

And at this point the necessary shards never recover correctly and hence our core never returns to a functional state.
[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers
[ https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288455#comment-14288455 ]

James Hardwick commented on SOLR-7021:
--------------------------------------

The key items to note being:

* cloud.ZkController - I am the leader, no recovery necessary
* core.ZkContainer - :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' as active without recovering first!
[jira] [Comment Edited] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202813#comment-14202813 ]

James Hardwick edited comment on SOLR-6707 at 11/7/14 10:19 PM:
----------------------------------------------------------------

Interesting clusterstate.json in ZK. Why would we have null range/parent properties for an implicitly routed index that has never been split?

{code:javascript}
{"gemindex":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search",
            "leader":"true"},
          "core_node3":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search"}}}},
    "router":{"name":"implicit"}},
 "text-analytics":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search"},
          "core_node3":{
            "state":"down",
            "core":"text-analytics",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search",
            "leader":"true"}}}},
    "router":{"name":"implicit"}}}
{code}
[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202813#comment-14202813 ]

James Hardwick commented on SOLR-6707:
--------------------------------------

Interesting clusterstate.json in ZK. Why would we have null range/parent properties for an implicitly routed index that has never been split?
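For anyone wanting to check their own cluster for the same null range/parent fields, the raw clusterstate.json can be pulled straight out of ZooKeeper; a minimal sketch with the plain ZooKeeper client (the connect string is a placeholder; /clusterstate.json is the Solr 4.x location, before per-collection state.json):

{code:java}
import org.apache.zookeeper.ZooKeeper;

public class DumpClusterState {
    public static void main(String[] args) throws Exception {
        // 10s session timeout, no watcher needed for a one-shot read.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, null);
        try {
            // Fetch and print the cluster state exactly as Solr stored it.
            byte[] data = zk.getData("/clusterstate.json", false, null);
            System.out.println(new String(data, "UTF-8"));
        } finally {
            zk.close();
        }
    }
}
{code}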
[jira] [Issue Comment Deleted] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Hardwick updated SOLR-6707:
---------------------------------
    Comment: was deleted

(was: Interesting clusterstate.json in ZK. Why would we have null range/parent properties for an implicitly routed index that has never been split?)
[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202822#comment-14202822 ]

James Hardwick commented on SOLR-6707:
--------------------------------------

Interesting clusterstate.json in ZK. Why would we have null range/parent properties for an implicitly routed index that has never been split?

{code:javascript}
{"appindex":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"appindex",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"},
          "core_node2":{
            "state":"active",
            "core":"appindex",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search",
            "leader":"true"},
          "core_node3":{
            "state":"active",
            "core":"appindex",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"}}}},
    "router":{"name":"implicit"}},
 "app-analytics":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"recovery_failed",
            "core":"app-analytics",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"},
          "core_node2":{
            "state":"recovery_failed",
            "core":"app-analytics",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"},
          "core_node3":{
            "state":"down",
            "core":"app-analytics",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search",
            "leader":"true"}}}},
    "router":{"name":"implicit"}}}
{code}
[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202844#comment-14202844 ]

James Hardwick commented on SOLR-6707:
--------------------------------------

This is an excerpt from the logs immediately following the original exception (shown above):

{noformat}
2014-11-03 11:13:58,488 [zkCallback-2-thread-86] INFO cloud.ElectionContext - I am going to be the leader xxx.xxx.xxx.109:8081_app-search
2014-11-03 11:13:58,489 [zkCallback-2-thread-86] INFO cloud.SolrZkClient - makePath: /overseer_elect/leader
2014-11-03 11:13:58,489 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2014-11-03 11:13:58,489 [zkCallback-2-thread-85] INFO cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2014-11-03 11:13:58,496 [zkCallback-2-thread-86] INFO cloud.Overseer - Overseer (id=92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_000188) starting
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Checking if I (core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO cloud.ShardLeaderElectionContext - Checking if I (core=appindex,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - My last published State was down, I won't be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - There may be a better leader candidate than us - going back into recovery
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO cloud.ElectionContext - canceling election /collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746105
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO cloud.ShardLeaderElectionContext - My last published State was Active, it's okay to be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
2014-11-03 11:13:58,504 [zkCallback-2-thread-88] INFO update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
2014-11-03 11:13:58,506 [RecoveryThread] INFO cloud.RecoveryStrategy - Starting recovery process. core=app-analytics recoveringAfterStartup=true
2014-11-03 11:13:58,507 [RecoveryThread] ERROR cloud.RecoveryStrategy - No UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,507 [RecoveryThread] ERROR cloud.RecoveryStrategy - Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,507 [RecoveryThread] INFO cloud.ZkController - publishing core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,508 [RecoveryThread] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
2014-11-03 11:13:58,521 [RecoveryThread] WARN cloud.RecoveryStrategy - Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,560 [zkCallback-2-thread-86] INFO cloud.OverseerAutoReplicaFailoverThread - Starting OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=1 autoReplicaFailoverWaitAfterExpiration=3 autoReplicaFailoverBadNodeExpiration=6
2014-11-03 11:13:58,575 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2014-11-03 11:13:58,580 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - Checking if I (core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - My last published State was recovery_failed, I won't be the leader.
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO cloud.ShardLeaderElectionContext - There may be a better leader candidate than us - going back into recovery
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO cloud.ElectionContext - canceling election /collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746107
2014-11-03 11:13:58,583 [zkCallback-2-thread-88] INFO update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
2014-11-03 11:13:58,584 [RecoveryThread] INFO cloud.RecoveryStrategy - Starting recovery process. core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,584 [RecoveryThread] ERROR cloud.RecoveryStrategy - No UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,584 [RecoveryThread] ERROR cloud.RecoveryStrategy - Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,584 [RecoveryThread] INFO cloud.ZkController - publishing core=app-analytics state=recovery_failed collection=app-analytics
{noformat}
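A quick way to watch for the clogged-queue behavior this issue describes is to count the children of the overseer queue znode directly, since each pending cluster-state change is one child; a minimal diagnostic sketch with the plain ZooKeeper client (connect string is a placeholder):

{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class OverseerQueueDepth {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, null);
        try {
            // A depth that keeps climbing while recovery loops like the one
            // logged above are running matches the failure mode reported here.
            List<String> pending = zk.getChildren("/overseer/queue", false);
            System.out.println("/overseer/queue depth: " + pending.size());
        } finally {
            zk.close();
        }
    }
}
{code}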
[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202853#comment-14202853 ]

James Hardwick commented on SOLR-6707:
--------------------------------------

Also FYI, the original exception may very well have been from lack of disk space, since we were also noticing Solr occasionally holding onto a tlog that was absolutely massive (250GB at one point).
[jira] [Updated] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Hardwick updated SOLR-6707:
---------------------------------
    Description: 
We experienced an issue the other day that brought a production solr server down, and this is what we found after investigating:

- Running solr instance with two separate cores, one of which is perpetually down because its configs are not yet completely updated for Solr-cloud. This was thought to be harmless since it's not currently in use.
- Solr experienced an internal server error supposedly because of "No space left on device" even though we appeared to have ~10GB free.
- Solr immediately went into recovery, and subsequent leader election for each shard of each core.
- Our primary core recovered immediately. Our additional core, which was never active in the first place, attempted to recover but of course couldn't due to the improper configs.
- Solr then began rapid-fire reattempting recovery of said node, trying maybe 20-30 times per second.
- This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
- At some point /overseer/queue becomes so backed up that normal cluster coordination can no longer play out, and Solr topples over.

I know this is a bit of an unusual circumstance due to us keeping the dead core around, and our quick solution has been to remove said core. However I can see other potential scenarios that might cause the same issue to arise.

  was:
We experienced an issue the other day that brought a production solr server down, and this is what we found after investigating:

- Running solr instance with two separate cores, one of which is perpetually down because its configs are not yet completely updated for Solr-cloud. This was thought to be harmless since it's not currently in use.
- Solr experienced an internal server error I believe due in part to a fairly new feature we are using, which seemingly caused all cores to go down.
- Solr immediately went into recovery, and subsequent leader election for each shard of each core.
- Our primary core recovered immediately. Our additional core, which was never active in the first place, attempted to recover but of course couldn't due to the improper configs.
- Solr then began rapid-fire reattempting recovery of said node, trying maybe 20-30 times per second.
- This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
- At some point /overseer/queue becomes so backed up that normal cluster coordination can no longer play out, and Solr topples over.

I know this is a bit of an unusual circumstance due to us keeping the dead core around, and our quick solution has been to remove said core. However I can see other potential scenarios that might cause the same issue to arise.
[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201049#comment-14201049 ]

James Hardwick commented on SOLR-6707:
--------------------------------------

My assumption was wrong about the feature. Here is the initial error that kicked off the sequence:

{noformat}
2014-11-03 11:13:37,734 [updateExecutor-1-thread-4] ERROR update.StreamingSolrServers - error
org.apache.solr.common.SolrException: Internal Server Error

request: http://xxx.xxx.xxx.xxx:8081/app-search/appindex/update?update.chain=updateRequestProcessorChain&update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fxxx.xxx.xxx.xxx%3A8081%2Fapp-search%2Fappindex%2F&wt=javabin&version=2
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:38,056 [http-bio-8081-exec-336] WARN processor.DistributedUpdateProcessor - Error sending update
org.apache.solr.common.SolrException: Internal Server Error

request: http://xxx.xxx.xxx.xxx:8081/app-search/appindex/update?update.chain=updateRequestProcessorChain&update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fxxx.xxx.xxx.xxx%3A8081%2Fapp-search%2Fappindex%2F&wt=javabin&version=2
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:38,364 [http-bio-8081-exec-324] INFO update.UpdateHandler - start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2014-11-03 11:13:38,364 [http-bio-8081-exec-324] INFO update.UpdateHandler - No uncommitted changes. Skipping IW.commit.
2014-11-03 11:13:38,365 [http-bio-8081-exec-324] INFO search.SolrIndexSearcher - Opening Searcher@60515a83[appindex] main
2014-11-03 11:13:38,372 [http-bio-8081-exec-324] INFO update.UpdateHandler - end_commit_flush
2014-11-03 11:13:38,373 [updateExecutor-1-thread-6] ERROR update.SolrCmdDistributor - org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space left on device
	at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:550)
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:292)
	at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:296)
	at org.apache.solr.update.SolrCmdDistributor.access$000(SolrCmdDistributor.java:53)
	at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:283)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:40,812 [http-bio-8081-exec-336] WARN processor.DistributedUpdateProcessor - Error sending update
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space left on device
	at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:550)
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:292)
	at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:296)
	at org.apache.solr.update.SolrCmdDistributor.access$000(SolrCmdDistributor.java:53)
	at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:283)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
{noformat}
[jira] [Created] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
James Hardwick created SOLR-6707:
------------------------------------

             Summary: Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
                 Key: SOLR-6707
                 URL: https://issues.apache.org/jira/browse/SOLR-6707
             Project: Solr
          Issue Type: Bug
    Affects Versions: 4.10
            Reporter: James Hardwick

We experienced an issue the other day that brought a production solr server down, and this is what we found after investigating:

- Running solr instance with two separate cores, one of which is perpetually down because its configs are not yet completely updated for Solr-cloud. This was thought to be harmless since it's not currently in use.
- Solr experienced an internal server error I believe due in part to a fairly new feature we are using, which seemingly caused all cores to go down.
- Solr immediately went into recovery, and subsequent leader election for each shard of each core.
- Our primary core recovered immediately. Our additional core, which was never active in the first place, attempted to recover but of course couldn't due to the improper configs.
- Solr then began rapid-fire reattempting recovery of said node, trying maybe 20-30 times per second.
- This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
- At some point /overseer/queue becomes so backed up that normal cluster coordination can no longer play out, and Solr topples over.

I know this is a bit of an unusual circumstance due to us keeping the dead core around, and our quick solution has been to remove said core. However I can see other potential scenarios that might cause the same issue to arise.