[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers

2016-06-06 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317027#comment-15317027
 ] 

James Hardwick commented on SOLR-7021:
--

[~forest_soup] Since updating to Solr 5.5+, we haven't had such issues. 

> Leader will not publish core as active without recovering first, but never 
> recovers
> ---
>
> Key: SOLR-7021
> URL: https://issues.apache.org/jira/browse/SOLR-7021
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.10
>Reporter: James Hardwick
>Priority: Critical
>  Labels: recovery, solrcloud, zookeeper
>
> A little background: 1 core solr-cloud cluster across 3 nodes, each with its 
> own shard and each shard with a single replica hence each replica is itself a 
> leader. 
> For reasons we won't get into, we witnessed a shard go down in our cluster. 
> We restarted the cluster but our core/shards still did not come back up. 
> After inspecting the logs, we found this:
> {code}
> 015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is 
> http://xxx.xxx.xxx.35:8081/solr/xyzcore/
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - I am the leader, no recovery necessary
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - publishing core=xyzcore state=active collection=xyzcore
> 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - publishing core=xyzcore state=down collection=xyzcore
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - 
> :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' 
> as active without recovering first!
>   at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
> {code}
> And at this point the necessary shards never recover correctly and hence our 
> core never returns to a functional state. 






[jira] [Comment Edited] (SOLR-7940) [CollectionAPI] Frequent Cluster Status timeout

2015-11-30 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032540#comment-15032540
 ] 

James Hardwick edited comment on SOLR-7940 at 11/30/15 10:13 PM:
-

We are seeing this as well on a 3 node cluster w/ 2 collections. 

Looks like others are also, across a variety of versions: 
http://lucene.472066.n3.nabble.com/CLUSTERSTATUS-timeout-tp4173224.html
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201511.mbox/%3c5639dfcf.9020...@decalog.net%3E
http://grokbase.com/t/lucene/solr-user/154d0wjr7c/clusterstate-timeout


was (Author: hardwickj):
We are seeing this as well on a 3 node cluster w/ 2 collections. 

Looks like others are also, across a variety of versions: 
http://lucene.472066.n3.nabble.com/CLUSTERSTATUS-timeout-tp4173224.html

> [CollectionAPI] Frequent Cluster Status timeout
> ---
>
> Key: SOLR-7940
> URL: https://issues.apache.org/jira/browse/SOLR-7940
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.10.2
> Environment: Ubuntu on Azure
>Reporter: Stephan Lagraulet
>
> Very often we have a timeout when we call 
> http://server2:8080/solr/admin/collections?action=CLUSTERSTATUS&wt=json
> {code}
> {"responseHeader": 
> {"status": 500,
> "QTime": 180100},
> "error": 
> {"msg": "CLUSTERSTATUS the collection time out:180s",
> "trace": "org.apache.solr.common.SolrException: CLUSTERSTATUS the collection 
> time out:180s\n\tat 
> org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:368)\n\tat
>  
> org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:320)\n\tat
>  
> org.apache.solr.handler.admin.CollectionsHandler.handleClusterStatus(CollectionsHandler.java:640)\n\tat
>  
> org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:220)\n\tat
>  
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:267)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)\n\tat
>  
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)\n\tat
>  org.eclipse.jetty.server.Server.handle(Server.java:350)\n\tat 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)\n\tat
>  
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)\n\tat
>  
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)\n\tat
>  org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:630)\n\tat 
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)\n\tat 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:77)\n\tat
>  
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:606)\n\tat
>  
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:46)\n\tat
>  
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:603)\n\tat
>  
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:538)\n\tat
>  java.lang.Thread.run(Thread.java:745)\n",
> "code": 500}}
> {code}
> The cluster has 3 Solr nodes with 6 small collections replicated on all nodes.
> We were using this API to monitor cluster state but it was failing every 10 
> minutes. We switched to using ZkStateReader in CloudSolrServer 

[jira] [Commented] (SOLR-7940) [CollectionAPI] Frequent Cluster Status timeout

2015-11-30 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032540#comment-15032540
 ] 

James Hardwick commented on SOLR-7940:
--

We are seeing this as well on a 3 node cluster w/ 2 collections. 

Looks like others are also, across a variety of versions: 
http://lucene.472066.n3.nabble.com/CLUSTERSTATUS-timeout-tp4173224.html

> [CollectionAPI] Frequent Cluster Status timeout
> ---
>
> Key: SOLR-7940
> URL: https://issues.apache.org/jira/browse/SOLR-7940
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.10.2
> Environment: Ubuntu on Azure
>Reporter: Stephan Lagraulet
>
> Very often we have a timeout when we call 
> http://server2:8080/solr/admin/collections?action=CLUSTERSTATUS&wt=json
> {code}
> {"responseHeader": 
> {"status": 500,
> "QTime": 180100},
> "error": 
> {"msg": "CLUSTERSTATUS the collection time out:180s",
> "trace": "org.apache.solr.common.SolrException: CLUSTERSTATUS the collection 
> time out:180s\n\tat 
> org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:368)\n\tat
>  
> org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:320)\n\tat
>  
> org.apache.solr.handler.admin.CollectionsHandler.handleClusterStatus(CollectionsHandler.java:640)\n\tat
>  
> org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:220)\n\tat
>  
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:267)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)\n\tat
>  
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)\n\tat
>  org.eclipse.jetty.server.Server.handle(Server.java:350)\n\tat 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)\n\tat
>  
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)\n\tat
>  
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)\n\tat
>  org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:630)\n\tat 
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)\n\tat 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:77)\n\tat
>  
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:606)\n\tat
>  
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:46)\n\tat
>  
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:603)\n\tat
>  
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:538)\n\tat
>  java.lang.Thread.run(Thread.java:745)\n",
> "code": 500}}
> {code}
> The cluster has 3 Solr nodes with 6 small collections replicated on all nodes.
> We were using this API to monitor cluster state but it was failing every 10 
> minutes. We switched to using ZkStateReader in CloudSolrServer and it has 
> been working for a day without problems.
> Is there a kind of deadlock, as this call was being made on the three nodes 
> concurrently?
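
A minimal sketch of that ZkStateReader approach, written against the 4.10-era SolrJ API (the ZooKeeper connect string and the printed fields are illustrative assumptions):

{code:java}
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class ClusterStateCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder ZK ensemble address - adjust for the real cluster.
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.connect();
    try {
      // Read the cluster state the client already caches from ZooKeeper,
      // instead of calling CLUSTERSTATUS on a node.
      ClusterState state = server.getZkStateReader().getClusterState();
      System.out.println("Live nodes: " + state.getLiveNodes());
      for (String collection : state.getCollections()) {
        for (Slice slice : state.getSlices(collection)) {
          for (Replica replica : slice.getReplicas()) {
            System.out.println(collection + "/" + slice.getName() + "/"
                + replica.getName() + " -> " + replica.getStr(ZkStateReader.STATE_PROP));
          }
        }
      }
    } finally {
      server.shutdown();
    }
  }
}
{code}

Because the state comes from the client-side ZooKeeper watch rather than a Collections API call on a node, it can keep working while CLUSTERSTATUS requests are timing out, which matches what is reported above.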






[jira] [Commented] (SOLR-7940) [CollectionAPI] Frequent Cluster Status timeout

2015-11-30 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032552#comment-15032552
 ] 

James Hardwick commented on SOLR-7940:
--

Actually, we are consistently seeing this on a variety of instances we have, 
all of which are generally uniform in their configuration. 

I'd love to help if any of the Solr devs can point me in the right direction 
for doing any sort of diagnostics. 

> [CollectionAPI] Frequent Cluster Status timeout
> ---
>
> Key: SOLR-7940
> URL: https://issues.apache.org/jira/browse/SOLR-7940
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 4.10.2
> Environment: Ubuntu on Azure
>Reporter: Stephan Lagraulet
>
> Very often we have a timeout when we call 
> http://server2:8080/solr/admin/collections?action=CLUSTERSTATUS&wt=json
> {code}
> {"responseHeader": 
> {"status": 500,
> "QTime": 180100},
> "error": 
> {"msg": "CLUSTERSTATUS the collection time out:180s",
> "trace": "org.apache.solr.common.SolrException: CLUSTERSTATUS the collection 
> time out:180s\n\tat 
> org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:368)\n\tat
>  
> org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:320)\n\tat
>  
> org.apache.solr.handler.admin.CollectionsHandler.handleClusterStatus(CollectionsHandler.java:640)\n\tat
>  
> org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:220)\n\tat
>  
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:267)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1338)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)\n\tat
>  
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)\n\tat
>  org.eclipse.jetty.server.Server.handle(Server.java:350)\n\tat 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)\n\tat
>  
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)\n\tat
>  
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)\n\tat
>  org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:630)\n\tat 
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)\n\tat 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:77)\n\tat
>  
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:606)\n\tat
>  
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:46)\n\tat
>  
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:603)\n\tat
>  
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:538)\n\tat
>  java.lang.Thread.run(Thread.java:745)\n",
> "code": 500}}
> {code}
> The cluster has 3 Solr nodes with 6 small collections replicated on all nodes.
> We were using this API to monitor cluster state but it was failing every 10 
> minutes. We switched to using ZkStateReader in CloudSolrServer and it has 
> been working for a day without problems.
> Is there a kind of deadlock, as this call was being made on the three nodes 
> concurrently?




[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers

2015-01-23 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289032#comment-14289032
 ] 

James Hardwick commented on SOLR-7021:
--

Yep, we were looking at that one and were wondering the same. The symptom is 
different, but it sounds like the solution might be the same. We'll give it a try!

 Leader will not publish core as active without recovering first, but never 
 recovers
 ---

 Key: SOLR-7021
 URL: https://issues.apache.org/jira/browse/SOLR-7021
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.10
Reporter: James Hardwick
Priority: Critical
  Labels: recovery, solrcloud, zookeeper

 A little background: 1 core solr-cloud cluster across 3 nodes, each with its 
 own shard and each shard with a single replica hence each replica is itself a 
 leader. 
 For reasons we won't get into, we witnessed a shard go down in our cluster. 
 We restarted the cluster but our core/shards still did not come back up. 
 After inspecting the logs, we found this:
 {code}
 015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is 
 http://xxx.xxx.xxx.35:8081/solr/xyzcore/
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - I am the leader, no recovery necessary
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - publishing core=xyzcore state=active collection=xyzcore
 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - numShards not found on descriptor - reading it from system property
 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - publishing core=xyzcore state=down collection=xyzcore
 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - numShards not found on descriptor - reading it from system property
 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - 
 :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' 
 as active without recovering first!
   at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
 {code}
 And at this point the necessary shards never recover correctly and hence our 
 core never returns to a functional state. 






[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers

2015-01-23 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289034#comment-14289034
 ] 

James Hardwick commented on SOLR-7021:
--

In the meantime, how do we best get around this? It still does not recover 
when we restart the cluster. Should manually kicking off a core reload for each 
node do the trick? 
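
For what it's worth, a minimal sketch of the per-node core reload being asked about, using the 4.10-era SolrJ CoreAdminRequest (node URLs and the core name are placeholders, and this only illustrates the question above, not necessarily the workaround that ultimately worked):

{code:java}
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class ReloadCoreOnEachNode {
  public static void main(String[] args) throws Exception {
    // Placeholder node URLs - one reload per node hosting the core.
    String[] nodes = {
        "http://host1:8081/solr",
        "http://host2:8081/solr",
        "http://host3:8081/solr"
    };
    for (String baseUrl : nodes) {
      HttpSolrServer server = new HttpSolrServer(baseUrl);
      try {
        // Equivalent to hitting /admin/cores?action=RELOAD&core=xyzcore on that node.
        CoreAdminRequest.reloadCore("xyzcore", server);
      } finally {
        server.shutdown();
      }
    }
  }
}
{code}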

 Leader will not publish core as active without recovering first, but never 
 recovers
 ---

 Key: SOLR-7021
 URL: https://issues.apache.org/jira/browse/SOLR-7021
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.10
Reporter: James Hardwick
Priority: Critical
  Labels: recovery, solrcloud, zookeeper

 A little background: 1 core solr-cloud cluster across 3 nodes, each with its 
 own shard and each shard with a single replica hence each replica is itself a 
 leader. 
 For reasons we won't get into, we witnessed a shard go down in our cluster. 
 We restarted the cluster but our core/shards still did not come back up. 
 After inspecting the logs, we found this:
 {code}
 015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is 
 http://xxx.xxx.xxx.35:8081/solr/xyzcore/
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - I am the leader, no recovery necessary
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - publishing core=xyzcore state=active collection=xyzcore
 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - numShards not found on descriptor - reading it from system property
 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - publishing core=xyzcore state=down collection=xyzcore
 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - numShards not found on descriptor - reading it from system property
 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - 
 :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' 
 as active without recovering first!
   at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
 {code}
 And at this point the necessary shards never recover correctly and hence our 
 core never returns to a functional state. 






[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers

2015-01-23 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289451#comment-14289451
 ] 

James Hardwick commented on SOLR-7021:
--

That worked Shalin. Thank you!

 Leader will not publish core as active without recovering first, but never 
 recovers
 ---

 Key: SOLR-7021
 URL: https://issues.apache.org/jira/browse/SOLR-7021
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.10
Reporter: James Hardwick
Priority: Critical
  Labels: recovery, solrcloud, zookeeper

 A little background: 1 core solr-cloud cluster across 3 nodes, each with its 
 own shard and each shard with a single replica hence each replica is itself a 
 leader. 
 For reasons we won't get into, we witnessed a shard go down in our cluster. 
 We restarted the cluster but our core/shards still did not come back up. 
 After inspecting the logs, we found this:
 {code}
 015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is 
 http://xxx.xxx.xxx.35:8081/solr/xyzcore/
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - I am the leader, no recovery necessary
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - publishing core=xyzcore state=active collection=xyzcore
 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - numShards not found on descriptor - reading it from system property
 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - publishing core=xyzcore state=down collection=xyzcore
 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - numShards not found on descriptor - reading it from system property
 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - 
 :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' 
 as active without recovering first!
   at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
 {code}
 And at this point the necessary shards never recover correctly and hence our 
 core never returns to a functional state. 






[jira] [Created] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers

2015-01-22 Thread James Hardwick (JIRA)
James Hardwick created SOLR-7021:


 Summary: Leader will not publish core as active without recovering 
first, but never recovers
 Key: SOLR-7021
 URL: https://issues.apache.org/jira/browse/SOLR-7021
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.10
Reporter: James Hardwick
Priority: Critical


A little background: 1 core solr-cloud cluster across 3 nodes, each with its 
own shard and each shard with a single replica hence each replica is itself a 
leader. 

For reasons we won't get into, we witnessed a shard go down in our cluster. We 
restarted the cluster but our core/shards still did not come back up. After 
inspecting the logs, we found this:

{code}
015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - 
We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is 
http://xxx.xxx.xxx.35:8081/solr/xyzcore/
2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - 
No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - 
I am the leader, no recovery necessary
2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - 
publishing core=xyzcore state=active collection=xyzcore
2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - 
numShards not found on descriptor - reading it from system property
2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - 
publishing core=xyzcore state=down collection=xyzcore
2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - 
numShards not found on descriptor - reading it from system property
2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - 
:org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' 
as active without recovering first!
at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
{code}

And at this point the necessary shards never recover correctly and hence our 
core never returns to a functional state. 






[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers

2015-01-22 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288455#comment-14288455
 ] 

James Hardwick commented on SOLR-7021:
--

The key items to note are:

* cloud.ZkController  - I am the leader, no recovery necessary
* core.ZkContainer  - :org.apache.solr.common.SolrException: Cannot publish 
state of core 'xyzcore' as active without recovering first!

 Leader will not publish core as active without recovering first, but never 
 recovers
 ---

 Key: SOLR-7021
 URL: https://issues.apache.org/jira/browse/SOLR-7021
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.10
Reporter: James Hardwick
Priority: Critical
  Labels: recovery, solrcloud, zookeeper

 A little background: 1 core solr-cloud cluster across 3 nodes, each with its 
 own shard and each shard with a single replica hence each replica is itself a 
 leader. 
 For reasons we won't get into, we witnessed a shard go down in our cluster. 
 We restarted the cluster but our core/shards still did not come back up. 
 After inspecting the logs, we found this:
 {code}
 015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is 
 http://xxx.xxx.xxx.35:8081/solr/xyzcore/
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - I am the leader, no recovery necessary
 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - publishing core=xyzcore state=active collection=xyzcore
 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - numShards not found on descriptor - reading it from system property
 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - publishing core=xyzcore state=down collection=xyzcore
 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
 - numShards not found on descriptor - reading it from system property
 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - 
 :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' 
 as active without recovering first!
   at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
 {code}
 And at this point the necessary shards never recover correctly and hence our 
 core never returns to a functional state. 






[jira] [Comment Edited] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged

2014-11-07 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202813#comment-14202813
 ] 

James Hardwick edited comment on SOLR-6707 at 11/7/14 10:19 PM:


Interesting clusterstate.json in ZK. Why would we have null range/parent 
properties for an implicitly routed index that has never been split?

{code:javascript}
{
  "gemindex":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search",
            "leader":"true"},
          "core_node3":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search"}}}},
    "router":{"name":"implicit"}},
  "text-analytics":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search"},
          "core_node3":{
            "state":"down",
            "core":"text-analytics",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search",
            "leader":"true"}}}},
    "router":{"name":"implicit"}}}
{code}


was (Author: hardwickj):
Interesting clusterstate.json in ZK. Why would we have null range/parent 
properties for an implicitly routed index that has never been split?

{code:json}
{
  "gemindex":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search",
            "leader":"true"},
          "core_node3":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search"}}}},
    "router":{"name":"implicit"}},
  "text-analytics":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search"},
          "core_node3":{
            "state":"down",
            "core":"text-analytics",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search",
            "leader":"true"}}}},
    "router":{"name":"implicit"}}}
{code}

 Recovery/election for invalid core results in rapid-fire re-attempts until 
 /overseer/queue is clogged
 -

 Key: SOLR-6707
 URL: https://issues.apache.org/jira/browse/SOLR-6707
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.10
Reporter: James Hardwick

 We experienced an issue the other day that brought a production solr server 
 down, and this is what we found after investigating:
 - Running solr instance with two separate cores, one of which is perpetually 
 down because it's configs are not yet completely updated for Solr-cloud. This 
 was thought to be harmless since it's not currently in use. 
 - Solr experienced an internal server error supposedly because of No space 
 left on device even though we appeared to have ~10GB free. 
 - Solr immediately went into recovery, and subsequent leader election for 
 each shard of each core. 
 - Our primary core recovered immediately. Our additional core which was never 
 active in the first place, attempted to recover but of course couldn't due to 
 the improper 

[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged

2014-11-07 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202813#comment-14202813
 ] 

James Hardwick commented on SOLR-6707:
--

Interesting clusterstate.json in ZK. Why would we have null range/parent 
properties for an implicitly routed index that has never been split?

{code:json}
{
  "gemindex":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search",
            "leader":"true"},
          "core_node3":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search"}}}},
    "router":{"name":"implicit"}},
  "text-analytics":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search"},
          "core_node3":{
            "state":"down",
            "core":"text-analytics",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search",
            "leader":"true"}}}},
    "router":{"name":"implicit"}}}
{code}
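
A minimal sketch of pulling the same clusterstate.json straight from ZooKeeper with the 4.10-era SolrZkClient, for anyone who wants to inspect the raw node themselves (the connect string and timeout are placeholders):

{code:java}
import org.apache.solr.common.cloud.SolrZkClient;
import org.apache.solr.common.cloud.ZkStateReader;

public class DumpClusterState {
  public static void main(String[] args) throws Exception {
    // Placeholder ZK ensemble address and a 30s client timeout.
    SolrZkClient zkClient = new SolrZkClient("zk1:2181,zk2:2181,zk3:2181", 30000);
    try {
      // ZkStateReader.CLUSTER_STATE is "/clusterstate.json" in 4.x.
      byte[] data = zkClient.getData(ZkStateReader.CLUSTER_STATE, null, null, true);
      System.out.println(new String(data, "UTF-8"));
    } finally {
      zkClient.close();
    }
  }
}
{code}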

 Recovery/election for invalid core results in rapid-fire re-attempts until 
 /overseer/queue is clogged
 -

 Key: SOLR-6707
 URL: https://issues.apache.org/jira/browse/SOLR-6707
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.10
Reporter: James Hardwick

 We experienced an issue the other day that brought a production solr server 
 down, and this is what we found after investigating:
 - Running solr instance with two separate cores, one of which is perpetually 
 down because it's configs are not yet completely updated for Solr-cloud. This 
 was thought to be harmless since it's not currently in use. 
 - Solr experienced an internal server error supposedly because of No space 
 left on device even though we appeared to have ~10GB free. 
 - Solr immediately went into recovery, and subsequent leader election for 
 each shard of each core. 
 - Our primary core recovered immediately. Our additional core which was never 
 active in the first place, attempted to recover but of course couldn't due to 
 the improper configs. 
 - Solr then began rapid-fire reattempting recovery of said node, trying maybe 
 20-30 times per second.
 - This in turn bombarded zookeepers /overseer/queue into oblivion
 - At some point /overseer/queue becomes so backed up that normal cluster 
 coordination can no longer play out, and Solr topples over. 
 I know this is a bit of an unusual circumstance due to us keeping the dead 
 core around, and our quick solution has been to remove said core. However I 
 can see other potential scenarios that might cause the same issue to arise. 






[jira] [Issue Comment Deleted] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged

2014-11-07 Thread James Hardwick (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Hardwick updated SOLR-6707:
-
Comment: was deleted

(was: Interesting clusterstate.json in ZK. Why would we have null range/parent 
properties for an implicitly routed index that has never been split?

{code:javascript}
{
  "gemindex":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search",
            "leader":"true"},
          "core_node3":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search"}}}},
    "router":{"name":"implicit"}},
  "text-analytics":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search"},
          "core_node3":{
            "state":"down",
            "core":"text-analytics",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search",
            "leader":"true"}}}},
    "router":{"name":"implicit"}}}
{code})

 Recovery/election for invalid core results in rapid-fire re-attempts until 
 /overseer/queue is clogged
 -

 Key: SOLR-6707
 URL: https://issues.apache.org/jira/browse/SOLR-6707
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.10
Reporter: James Hardwick

 We experienced an issue the other day that brought a production solr server 
 down, and this is what we found after investigating:
 - Running solr instance with two separate cores, one of which is perpetually 
 down because it's configs are not yet completely updated for Solr-cloud. This 
 was thought to be harmless since it's not currently in use. 
 - Solr experienced an internal server error supposedly because of No space 
 left on device even though we appeared to have ~10GB free. 
 - Solr immediately went into recovery, and subsequent leader election for 
 each shard of each core. 
 - Our primary core recovered immediately. Our additional core which was never 
 active in the first place, attempted to recover but of course couldn't due to 
 the improper configs. 
 - Solr then began rapid-fire reattempting recovery of said node, trying maybe 
 20-30 times per second.
 - This in turn bombarded zookeepers /overseer/queue into oblivion
 - At some point /overseer/queue becomes so backed up that normal cluster 
 coordination can no longer play out, and Solr topples over. 
 I know this is a bit of an unusual circumstance due to us keeping the dead 
 core around, and our quick solution has been to remove said core. However I 
 can see other potential scenarios that might cause the same issue to arise. 






[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged

2014-11-07 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202822#comment-14202822
 ] 

James Hardwick commented on SOLR-6707:
--

Interesting clusterstate.json in ZK. Why would we have null range/parent 
properties for an implicitly routed index that has never been split?

{code:javascript}
{
  "appindex":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"appindex",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"},
          "core_node2":{
            "state":"active",
            "core":"appindex",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search",
            "leader":"true"},
          "core_node3":{
            "state":"active",
            "core":"appindex",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"}}}},
    "router":{"name":"implicit"}},
  "app-analytics":{
    "shards":{"shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"recovery_failed",
            "core":"app-analytics",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"},
          "core_node2":{
            "state":"recovery_failed",
            "core":"app-analytics",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search"},
          "core_node3":{
            "state":"down",
            "core":"app-analytics",
            "node_name":"xxx.xxx.xxx.xxx:8081_app-search",
            "base_url":"http://xxx.xxx.xxx.xxx:8081/app-search",
            "leader":"true"}}}},
    "router":{"name":"implicit"}}}
{code}

 Recovery/election for invalid core results in rapid-fire re-attempts until 
 /overseer/queue is clogged
 -

 Key: SOLR-6707
 URL: https://issues.apache.org/jira/browse/SOLR-6707
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.10
Reporter: James Hardwick

 We experienced an issue the other day that brought a production solr server 
 down, and this is what we found after investigating:
 - Running solr instance with two separate cores, one of which is perpetually 
 down because it's configs are not yet completely updated for Solr-cloud. This 
 was thought to be harmless since it's not currently in use. 
 - Solr experienced an internal server error supposedly because of No space 
 left on device even though we appeared to have ~10GB free. 
 - Solr immediately went into recovery, and subsequent leader election for 
 each shard of each core. 
 - Our primary core recovered immediately. Our additional core which was never 
 active in the first place, attempted to recover but of course couldn't due to 
 the improper configs. 
 - Solr then began rapid-fire reattempting recovery of said node, trying maybe 
 20-30 times per second.
 - This in turn bombarded zookeepers /overseer/queue into oblivion
 - At some point /overseer/queue becomes so backed up that normal cluster 
 coordination can no longer play out, and Solr topples over. 
 I know this is a bit of an unusual circumstance due to us keeping the dead 
 core around, and our quick solution has been to remove said core. However I 
 can see other potential scenarios that might cause the same issue to arise. 
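
A minimal sketch of one way to watch for the backed-up queue described above, simply counting the children of /overseer/queue with the 4.10-era SolrZkClient (connect string and timeout are placeholders):

{code:java}
import java.util.List;

import org.apache.solr.common.cloud.SolrZkClient;

public class OverseerQueueDepth {
  public static void main(String[] args) throws Exception {
    // Placeholder ZK ensemble address and a 30s client timeout.
    SolrZkClient zkClient = new SolrZkClient("zk1:2181,zk2:2181,zk3:2181", 30000);
    try {
      // A child count that keeps climbing is the "clogged queue" symptom.
      List<String> items = zkClient.getChildren("/overseer/queue", null, true);
      System.out.println("/overseer/queue size: " + items.size());
    } finally {
      zkClient.close();
    }
  }
}
{code}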






[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged

2014-11-07 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202844#comment-14202844
 ] 

James Hardwick commented on SOLR-6707:
--

This is an excerpt from the logs immediately following the original 
exception (shown above):

{noformat}
2014-11-03 11:13:58,488 [zkCallback-2-thread-86] INFO  cloud.ElectionContext  - 
I am going to be the leader xxx.xxx.xxx.109:8081_app-search
2014-11-03 11:13:58,489 [zkCallback-2-thread-86] INFO  cloud.SolrZkClient  - 
makePath: /overseer_elect/leader
2014-11-03 11:13:58,489 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Running the leader process for shard shard1
2014-11-03 11:13:58,489 [zkCallback-2-thread-85] INFO  
cloud.ShardLeaderElectionContext  - Running the leader process for shard shard1
2014-11-03 11:13:58,496 [zkCallback-2-thread-86] INFO  cloud.Overseer  - 
Overseer (id=92718232187174914-xxx.xxx.xxx.109:8081_app-search-n_000188) 
starting
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Checking if I 
(core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO  
cloud.ShardLeaderElectionContext  - Checking if I 
(core=appindex,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - My last published State was down, I won't 
be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - There may be a better leader candidate than 
us - going back into recovery
2014-11-03 11:13:58,499 [zkCallback-2-thread-88] INFO  cloud.ElectionContext  - 
canceling election 
/collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746105
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO  
cloud.ShardLeaderElectionContext  - My last published State was Active, it's 
okay to be the leader.
2014-11-03 11:13:58,499 [zkCallback-2-thread-85] INFO  
cloud.ShardLeaderElectionContext  - I may be the new leader - try and sync
2014-11-03 11:13:58,504 [zkCallback-2-thread-88] INFO  
update.DefaultSolrCoreState  - Running recovery - first canceling any ongoing 
recovery
2014-11-03 11:13:58,506 [RecoveryThread] INFO  cloud.RecoveryStrategy  - 
Starting recovery process.  core=app-analytics recoveringAfterStartup=true
2014-11-03 11:13:58,507 [RecoveryThread] ERROR cloud.RecoveryStrategy  - No 
UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,507 [RecoveryThread] ERROR cloud.RecoveryStrategy  - 
Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,507 [RecoveryThread] INFO  cloud.ZkController  - publishing 
core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 11:13:58,508 [RecoveryThread] INFO  cloud.ZkController  - numShards 
not found on descriptor - reading it from system property
2014-11-03 11:13:58,521 [RecoveryThread] WARN  cloud.RecoveryStrategy  - 
Stopping recovery for core=app-analytics coreNodeName=core_node1
2014-11-03 11:13:58,560 [zkCallback-2-thread-86] INFO  
cloud.OverseerAutoReplicaFailoverThread  - Starting 
OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=1 
autoReplicaFailoverWaitAfterExpiration=3 
autoReplicaFailoverBadNodeExpiration=6
2014-11-03 11:13:58,575 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Running the leader process for shard shard1
2014-11-03 11:13:58,580 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - Checking if I 
(core=app-analytics,coreNodeName=core_node1) should try and be the leader.
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - My last published State was 
recovery_failed, I won't be the leader.
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO  
cloud.ShardLeaderElectionContext  - There may be a better leader candidate than 
us - going back into recovery
2014-11-03 11:13:58,581 [zkCallback-2-thread-88] INFO  cloud.ElectionContext  - 
canceling election 
/collections/app-analytics/leader_elect/shard1/election/92718232187174914-core_node1-n_0001746107
2014-11-03 11:13:58,583 [zkCallback-2-thread-88] INFO  
update.DefaultSolrCoreState  - Running recovery - first canceling any ongoing 
recovery
2014-11-03 11:13:58,584 [RecoveryThread] INFO  cloud.RecoveryStrategy  - 
Starting recovery process.  core=app-analytics recoveringAfterStartup=false
2014-11-03 11:13:58,584 [RecoveryThread] ERROR cloud.RecoveryStrategy  - No 
UpdateLog found - cannot recover. core=app-analytics
2014-11-03 11:13:58,584 [RecoveryThread] ERROR cloud.RecoveryStrategy  - 
Recovery failed - I give up. core=app-analytics
2014-11-03 11:13:58,584 [RecoveryThread] INFO  cloud.ZkController  - publishing 
core=app-analytics state=recovery_failed collection=app-analytics
2014-11-03 

[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged

2014-11-07 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202853#comment-14202853
 ] 

James Hardwick commented on SOLR-6707:
--

Also FYI, the original exception may very well have been from lack of disk 
space, since we were also noticing Solr occasionally holding onto a Tlog that 
was absolutely massive (250GB at one point).
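
A minimal sketch for keeping an eye on the on-disk tlog growth mentioned above (the tlog directory path is a placeholder):

{code:java}
import java.io.File;

public class TlogSize {
  public static void main(String[] args) {
    // Placeholder path - point this at the core's tlog directory.
    File tlogDir = new File("/path/to/solr/core/data/tlog");
    File[] files = tlogDir.listFiles();
    long totalBytes = 0L;
    if (files != null) {
      for (File f : files) {
        totalBytes += f.length();
      }
    }
    System.out.println(tlogDir + ": " + (files == null ? 0 : files.length)
        + " tlog files, " + (totalBytes / (1024 * 1024)) + " MB");
  }
}
{code}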

 Recovery/election for invalid core results in rapid-fire re-attempts until 
 /overseer/queue is clogged
 -

 Key: SOLR-6707
 URL: https://issues.apache.org/jira/browse/SOLR-6707
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.10
Reporter: James Hardwick

 We experienced an issue the other day that brought a production solr server 
 down, and this is what we found after investigating:
 - Running solr instance with two separate cores, one of which is perpetually 
 down because it's configs are not yet completely updated for Solr-cloud. This 
 was thought to be harmless since it's not currently in use. 
 - Solr experienced an internal server error supposedly because of No space 
 left on device even though we appeared to have ~10GB free. 
 - Solr immediately went into recovery, and subsequent leader election for 
 each shard of each core. 
 - Our primary core recovered immediately. Our additional core which was never 
 active in the first place, attempted to recover but of course couldn't due to 
 the improper configs. 
 - Solr then began rapid-fire reattempting recovery of said node, trying maybe 
 20-30 times per second.
 - This in turn bombarded zookeepers /overseer/queue into oblivion
 - At some point /overseer/queue becomes so backed up that normal cluster 
 coordination can no longer play out, and Solr topples over. 
 I know this is a bit of an unusual circumstance due to us keeping the dead 
 core around, and our quick solution has been to remove said core. However I 
 can see other potential scenarios that might cause the same issue to arise. 






[jira] [Updated] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged

2014-11-06 Thread James Hardwick (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Hardwick updated SOLR-6707:
-
Description: 
We experienced an issue the other day that brought a production solr server 
down, and this is what we found after investigating:

- Running Solr instance with two separate cores, one of which is perpetually 
down because its configs are not yet completely updated for SolrCloud. This 
was thought to be harmless since it's not currently in use. 
- Solr experienced an internal server error, supposedly because of "No space 
left on device", even though we appeared to have ~10GB free. 
- Solr immediately went into recovery, and subsequent leader election for each 
shard of each core. 
- Our primary core recovered immediately. Our additional core, which was never 
active in the first place, attempted to recover but of course couldn't due to 
the improper configs. 
- Solr then began rapid-fire reattempting recovery of said node, trying maybe 
20-30 times per second.
- This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
- At some point /overseer/queue becomes so backed up that normal cluster 
coordination can no longer play out, and Solr topples over. 

I know this is a bit of an unusual circumstance due to us keeping the dead core 
around, and our quick solution has been to remove said core. However I can see 
other potential scenarios that might cause the same issue to arise. 

  was:
We experienced an issue the other day that brought a production solr server 
down, and this is what we found after investigating:

- Running solr instance with two separate cores, one of which is perpetually 
down because it's configs are not yet completely updated for Solr-cloud. This 
was thought to be harmless since it's not currently in use. 
- Solr experienced an internal server error I believe due in part to a fairly 
new feature we are using, which seemingly caused all cores to go down. 
- Solr immediately went into recovery, and subsequent leader election for each 
shard of each core. 
- Our primary core recovered immediately. Our additional core which was never 
active in the first place, attempted to recover but of course couldn't due to 
the improper configs. 
- Solr then began rapid-fire reattempting recovery of said node, trying maybe 
20-30 times per second.
- This in turn bombarded zookeepers /overseer/queue into oblivion
- At some point /overseer/queue becomes so backed up that normal cluster 
coordination can no longer play out, and Solr topples over. 

I know this is a bit of an unusual circumstance due to us keeping the dead core 
around, and our quick solution has been to remove said core. However I can see 
other potential scenarios that might cause the same issue to arise. 


 Recovery/election for invalid core results in rapid-fire re-attempts until 
 /overseer/queue is clogged
 -

 Key: SOLR-6707
 URL: https://issues.apache.org/jira/browse/SOLR-6707
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.10
Reporter: James Hardwick

 We experienced an issue the other day that brought a production Solr server 
 down, and this is what we found after investigating:
 - Running a Solr instance with two separate cores, one of which is perpetually 
 down because its configs are not yet completely updated for SolrCloud. This 
 was thought to be harmless since it's not currently in use. 
 - Solr experienced an internal server error, supposedly because of "No space 
 left on device", even though we appeared to have ~10GB free. 
 - Solr immediately went into recovery, with subsequent leader elections for 
 each shard of each core. 
 - Our primary core recovered immediately. Our additional core, which was never 
 active in the first place, attempted to recover but of course couldn't due to 
 the improper configs. 
 - Solr then began rapid-fire recovery re-attempts for said core, maybe 
 20-30 times per second.
 - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
 - At some point /overseer/queue became so backed up that normal cluster 
 coordination could no longer play out, and Solr toppled over (a queue-depth 
 check is sketched just below). 
 I know this is a bit of an unusual circumstance, because we kept the dead 
 core around, and our quick solution has been to remove said core. However, I 
 can see other potential scenarios that might cause the same issue to arise. 
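
For anyone hitting the same symptom, a minimal sketch of how the queue backlog 
described above can be watched from outside Solr. This is an illustration, not 
something from the issue itself: it assumes the stock ZooKeeper Java client and 
the default /overseer/queue path, and the connection string is made up.

{code:java}
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

/** Counts the children of /overseer/queue so a runaway recovery loop is visible early. */
public class OverseerQueueDepth {
    public static void main(String[] args) throws Exception {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181"; // illustrative connection string
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(zkHost, 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        List<String> pending = zk.getChildren("/overseer/queue", false);
        System.out.println("/overseer/queue depth: " + pending.size());
        // A depth that keeps climbing into the thousands matches the rapid-fire
        // re-attempt behaviour described in this issue.
        zk.close();
    }
}
{code}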



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged

2014-11-06 Thread James Hardwick (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201049#comment-14201049
 ] 

James Hardwick commented on SOLR-6707:
--

My assumption was wrong about the feature. Here is the initial error that 
kicked off the sequence:

{noformat}
2014-11-03 11:13:37,734 [updateExecutor-1-thread-4] ERROR update.StreamingSolrServers  - error
org.apache.solr.common.SolrException: Internal Server Error



request: http://xxx.xxx.xxx.xxx:8081/app-search/appindex/update?update.chain=updateRequestProcessorChain&update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fxxx.xxx.xxx.xxx%3A8081%2Fapp-search%2Fappindex%2F&wt=javabin&version=2
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:38,056 [http-bio-8081-exec-336] WARN  processor.DistributedUpdateProcessor  - Error sending update
org.apache.solr.common.SolrException: Internal Server Error



request: http://xxx.xxx.xxx.xxx:8081/app-search/appindex/update?update.chain=updateRequestProcessorChain&update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fxxx.xxx.xxx.xxx%3A8081%2Fapp-search%2Fappindex%2F&wt=javabin&version=2
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:38,364 [http-bio-8081-exec-324] INFO  update.UpdateHandler  - start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2014-11-03 11:13:38,364 [http-bio-8081-exec-324] INFO  update.UpdateHandler  - No uncommitted changes. Skipping IW.commit.
2014-11-03 11:13:38,365 [http-bio-8081-exec-324] INFO  search.SolrIndexSearcher  - Opening Searcher@60515a83[appindex] main
2014-11-03 11:13:38,372 [http-bio-8081-exec-324] INFO  update.UpdateHandler  - end_commit_flush
2014-11-03 11:13:38,373 [updateExecutor-1-thread-6] ERROR update.SolrCmdDistributor  - 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space left on device
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:550)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:292)
    at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:296)
    at org.apache.solr.update.SolrCmdDistributor.access$000(SolrCmdDistributor.java:53)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:283)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

2014-11-03 11:13:40,812 [http-bio-8081-exec-336] WARN  processor.DistributedUpdateProcessor  - Error sending update
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space left on device
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:550)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:292)
    at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:296)
    at org.apache.solr.update.SolrCmdDistributor.access$000(SolrCmdDistributor.java:53)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:283)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at

[jira] [Created] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged

2014-11-05 Thread James Hardwick (JIRA)
James Hardwick created SOLR-6707:


 Summary: Recovery/election for invalid core results in rapid-fire 
re-attempts until /overseer/queue is clogged
 Key: SOLR-6707
 URL: https://issues.apache.org/jira/browse/SOLR-6707
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.10
Reporter: James Hardwick


We experienced an issue the other day that brought a production Solr server 
down, and this is what we found after investigating:

- Running a Solr instance with two separate cores, one of which is perpetually 
down because its configs are not yet completely updated for SolrCloud. This 
was thought to be harmless since it's not currently in use. 
- Solr experienced an internal server error, I believe due in part to a fairly 
new feature we are using, which seemingly caused all cores to go down. 
- Solr immediately went into recovery, with subsequent leader elections for 
each shard of each core. 
- Our primary core recovered immediately. Our additional core, which was never 
active in the first place, attempted to recover but of course couldn't due to 
the improper configs. 
- Solr then began rapid-fire recovery re-attempts for said core, maybe 
20-30 times per second.
- This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
- At some point /overseer/queue became so backed up that normal cluster 
coordination could no longer play out, and Solr toppled over. 

I know this is a bit of an unusual circumstance, because we kept the dead core 
around, and our quick solution has been to remove said core. However, I can see 
other potential scenarios that might cause the same issue to arise (one way to 
recover once the queue is clogged is sketched below). 
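
Once /overseer/queue is clogged, one possible emergency workaround (an 
assumption based on the failure mode described here, not a fix prescribed in 
this issue) is to stop every Solr node, delete the stale queue entries, and 
then restart. A rough sketch with the plain ZooKeeper client follows; the 
connection string is illustrative, and this should only be attempted while 
Solr is fully stopped.

{code:java}
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

/**
 * Drains stale entries from /overseer/queue. Only run this while every Solr
 * node is stopped; deleting queue entries under a running Overseer is unsafe.
 */
public class DrainOverseerQueue {
    public static void main(String[] args) throws Exception {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181"; // illustrative connection string
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(zkHost, 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        List<String> entries = zk.getChildren("/overseer/queue", false);
        System.out.println("Deleting " + entries.size() + " stale queue entries");
        for (String entry : entries) {
            // -1 matches any znode version; queue entries are plain child znodes
            zk.delete("/overseer/queue/" + entry, -1);
        }
        zk.close();
    }
}
{code}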



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org