Re: Dead node, but clusterstate.json says active, won't sync on restart
> If you removed the tlog and index and restart it should resync, or
> something is really crazy.

It doesn't, or at least if it tries, it's somehow failing. I'd be ok with
the sync failing for some reason if the node wasn't also serving queries.

-Greg

On Tue, Jan 28, 2014 at 11:10 AM, Mark Miller markrmil...@gmail.com wrote:
> Sounds like a bug. 4.6.1 is out any minute - you might try that. There
> was a replication bug that may be involved.
>
> If you removed the tlog and index and restart it should resync, or
> something is really crazy.
>
> The clusterstate.json is a red herring. You have to merge the live nodes
> info with the state to know the real state.
>
> - Mark
>
> http://www.about.me/markrmiller
>
> On Jan 28, 2014, at 12:31 PM, Greg Preston gpres...@marinsoftware.com wrote:
>
>> ** Using solrcloud 4.4.0 **
>> [...]
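For reference, "removing the tlog and index" means deleting those directories under the core's data dir while the Solr process is stopped, so the restarted replica has no local data and must pull a full copy from another replica. A rough Python sketch of that cleanup; the path in the comment is hypothetical, so adjust it to your own core layout:

```python
import shutil
from pathlib import Path

def clear_core_data(data_dir):
    # Delete the index and tlog directories so a restarted replica has no
    # local data and must do a full recovery from another replica.
    # Only run this while the Solr process is stopped.
    data = Path(data_dir)
    for sub in ("index", "tlog"):
        target = data / sub
        if target.is_dir():
            shutil.rmtree(target)

# Hypothetical path -- adjust to your core's data directory:
# clear_core_data("/home/solr/example/solr/collection1/data")
```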
Re: Dead node, but clusterstate.json says active, won't sync on restart
What's in the logs of the node that won't recover on restart after
clearing the index and tlog?

- Mark

On Jan 29, 2014, at 11:41 AM, Greg Preston gpres...@marinsoftware.com wrote:
>> If you removed the tlog and index and restart it should resync, or
>> something is really crazy.
>
> It doesn't, or at least if it tries, it's somehow failing. I'd be ok
> with the sync failing for some reason if the node wasn't also serving
> queries.
>
> -Greg
>
> On Tue, Jan 28, 2014 at 11:10 AM, Mark Miller markrmil...@gmail.com wrote:
>> Sounds like a bug. 4.6.1 is out any minute - you might try that. There
>> was a replication bug that may be involved.
>> [...]
Re: Dead node, but clusterstate.json says active, won't sync on restart
I've attached the log of the downed node (truffle-solr-4). This is the
relevant log entry from the node it should replicate from (truffle-solr-5):

[29 Jan 2014 19:31:29] [qtp1614415528-74] ERROR
(org.apache.solr.common.SolrException) -
org.apache.solr.common.SolrException: I was asked to wait on state
recovering for truffle-solr-4:8983_solr but I still do not see the
requested state. I see state: active live:true
        at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:966)
        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:191)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:368)

You can see that 4 is serving queries. It appears that 4 tries to recover
from 5, but 5 is confused about the state of 4? 4 had an empty index and
tlog when it was started.

We will eventually upgrade to 4.6.x or 4.7.x, but we've got a pretty
extensive regression testing cycle, so there is some delay in upgrading
versions.

-Greg

On Wed, Jan 29, 2014 at 9:08 AM, Mark Miller markrmil...@gmail.com wrote:
> What's in the logs of the node that won't recover on restart after
> clearing the index and tlog?
>
> - Mark
> [...]

[29 Jan 2014 19:28:57] [main] INFO (org.eclipse.jetty.server.Server) - jetty-8.1.10.v20130312
[29 Jan 2014 19:28:57] [main] INFO (org.eclipse.jetty.deploy.providers.ScanningAppProvider) - Deployment monitor /home/solr/solr/solr-4.4.0/example/contexts at interval 0
[29 Jan 2014 19:28:57] [main] INFO (org.eclipse.jetty.deploy.DeploymentManager) - Deployable added: /home/solr/solr/solr-4.4.0/example/contexts/solr-jetty-context.xml
[29 Jan 2014 19:28:58] [main] INFO
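The error on truffle-solr-5 comes from a wait-for-state check: before serving a replication request, the leader waits for the recovering replica to publish state "recovering" in ZooKeeper, and gives up if it keeps seeing "active". A rough Python simulation of that polling logic, loosely modeled on the handleWaitForStateAction frame in the trace above; the function names, timeouts, and callback shape here are invented for illustration, not Solr's actual API:

```python
import time

def wait_for_state(get_replica_status, desired_state="recovering",
                   timeout_s=5.0, poll_s=0.5):
    # Poll the observed replica status until it matches the desired state
    # and the node is live, or give up and raise -- echoing the error
    # message seen in the logs above.
    state, live = None, None
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state, live = get_replica_status()
        if state == desired_state and live:
            return
        time.sleep(poll_s)
    raise RuntimeError(
        "I was asked to wait on state %s but I still do not see the "
        "requested state. I see state: %s live:%s"
        % (desired_state, state, live))
```

If the restarted node never publishes "recovering" because it rejoined as "active" (the situation in this thread), the loop times out and raises, which is the failure mode the leader logs.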
Dead node, but clusterstate.json says active, won't sync on restart
** Using solrcloud 4.4.0 **

I had to kill a running solrcloud node. There is still a replica for
that shard, so everything is functional. We've done some indexing while
the node was killed. I'd like to bring back up the downed node and have
it resync from the other replica.

But when I restart the downed node, it joins back up as active
immediately, and doesn't resync. I even wiped the data directory on the
downed node, hoping that would force it to sync on restart, but it
doesn't.

I'm assuming this is related to the state still being listed as active
in clusterstate.json for the downed node? Since it comes back as active,
it's serving queries and giving old results.

How can I force this node to do a recovery on restart?

Thanks.

-Greg
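One way to ask a core to recover without relying on restart behavior is the CoreAdmin REQUESTRECOVERY action, which exists in the Solr 4.x CoreAdmin API. A minimal Python sketch that just builds the request URL to fetch with urllib or curl; the host and core name ("truffle-solr-4", "marin") are taken from logs elsewhere in this thread and will differ in other setups:

```python
from urllib.parse import urlencode

def request_recovery_url(host, port, core):
    # Build the CoreAdmin URL that asks `core` to go into recovery.
    # Fetch this URL (urllib, curl, ...) while the node is running.
    params = urlencode({"action": "REQUESTRECOVERY", "core": core})
    return "http://%s:%d/solr/admin/cores?%s" % (host, port, params)

# Hypothetical host and core names, matching the logs in this thread:
print(request_recovery_url("truffle-solr-4", 8983, "marin"))
# prints: http://truffle-solr-4:8983/solr/admin/cores?action=REQUESTRECOVERY&core=marin
```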
Re: Dead node, but clusterstate.json says active, won't sync on restart
On 1/28/2014 10:31 AM, Greg Preston wrote:
> ** Using solrcloud 4.4.0 **
>
> I had to kill a running solrcloud node. There is still a replica for
> that shard, so everything is functional. We've done some indexing while
> the node was killed. I'd like to bring back up the downed node and have
> it resync from the other replica.
> [...]
> How can I force this node to do a recovery on restart?

This might be completely wrong, but hopefully it will help you: Perhaps
a graceful stop of that node will result in the proper clusterstate so
it will work the next time it's started?

That may already be what you've done, so this may not help at all ...
but you did say "kill", which might mean that it wasn't a clean shutdown
of Solr.

Thanks,
Shawn
Re: Dead node, but clusterstate.json says active, won't sync on restart
Thanks for the idea. I tried it, and the state for the bad node, even
after an orderly shutdown, is still active in clusterstate.json. I see
this in the logs on restart:

[28 Jan 2014 18:25:29] [RecoveryThread] ERROR
(org.apache.solr.common.SolrException) - Error while trying to recover.
core=marin:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
I was asked to wait on state recovering for truffle-solr-4:8983_solr but
I still do not see the requested state. I see state: active live:true
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
        at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)

-Greg

On Tue, Jan 28, 2014 at 9:53 AM, Shawn Heisey s...@elyograg.org wrote:
> On 1/28/2014 10:31 AM, Greg Preston wrote:
>> ** Using solrcloud 4.4.0 **
>> [...]
>
> This might be completely wrong, but hopefully it will help you: Perhaps
> a graceful stop of that node will result in the proper clusterstate so
> it will work the next time it's started?
>
> That may already be what you've done, so this may not help at all ...
> but you did say "kill", which might mean that it wasn't a clean shutdown
> of Solr.
>
> Thanks,
> Shawn
Re: Dead node, but clusterstate.json says active, won't sync on restart
Sounds like a bug. 4.6.1 is out any minute - you might try that. There
was a replication bug that may be involved.

If you removed the tlog and index and restart it should resync, or
something is really crazy.

The clusterstate.json is a red herring. You have to merge the live nodes
info with the state to know the real state.

- Mark

http://www.about.me/markrmiller

On Jan 28, 2014, at 12:31 PM, Greg Preston gpres...@marinsoftware.com wrote:
> ** Using solrcloud 4.4.0 **
>
> I had to kill a running solrcloud node. There is still a replica for
> that shard, so everything is functional. [...]
>
> How can I force this node to do a recovery on restart?
>
> Thanks.
>
> -Greg
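Mark's point about merging live nodes with the state can be sketched as follows. This is an illustrative Python sketch, not Solr code: it applies the rule that a replica's recorded state in clusterstate.json only counts if its node also appears under /live_nodes in ZooKeeper; otherwise the replica is effectively down no matter what the state file says. The toy clusterstate follows the 4.x JSON layout, and the collection and node names are made up to match this thread:

```python
import json

def effective_state(clusterstate_json, live_nodes, node_name):
    # Find the replica hosted by node_name and merge its recorded state
    # with the live_nodes list, as Mark describes above.
    state = json.loads(clusterstate_json)
    for coll in state.values():
        for shard in coll.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                if replica.get("node_name") == node_name:
                    if node_name not in live_nodes:
                        return "down"            # recorded state is stale
                    return replica.get("state")  # e.g. "active", "recovering"
    return None  # node hosts no replica in this state file

# Toy clusterstate.json in the 4.x layout (names hypothetical):
cs = json.dumps({"collection1": {"shards": {"shard1": {"replicas": {
    "core_node1": {"node_name": "truffle-solr-4:8983_solr",
                   "state": "active"}}}}}})

# Node is recorded "active" but absent from /live_nodes:
print(effective_state(cs, live_nodes=[], node_name="truffle-solr-4:8983_solr"))
# prints: down
```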