Re: Dead node, but clusterstate.json says active, won't sync on restart

2014-01-29 Thread Greg Preston
 If you removed the tlog and index and restart it should resync, or
 something is really crazy.

It doesn't, or at least if it tries, it's somehow failing.  I'd be ok with
the sync failing for some reason if the node wasn't also serving queries.


-Greg


On Tue, Jan 28, 2014 at 11:10 AM, Mark Miller markrmil...@gmail.com wrote:

 Sounds like a bug. 4.6.1 is out any minute - you might try that. There was
 a replication bug that may be involved.

 If you removed the tlog and index and restart it should resync, or
 something is really crazy.

 The clusterstate.json is a red herring. You have to merge the live nodes
 info with the state to know the real state.

 - Mark

 http://www.about.me/markrmiller

  On Jan 28, 2014, at 12:31 PM, Greg Preston gpres...@marinsoftware.com
 wrote:
 
  ** Using solrcloud 4.4.0 **
 
  I had to kill a running solrcloud node.  There is still a replica for
 that
  shard, so everything is functional.  We've done some indexing while the
  node was killed.
 
  I'd like to bring back up the downed node and have it resync from the
 other
  replica.  But when I restart the downed node, it joins back up as active
  immediately, and doesn't resync.  I even wiped the data directory on the
  downed node, hoping that would force it to sync on restart, but it
 doesn't.
 
  I'm assuming this is related to the state still being listed as active in
  clusterstate.json for the downed node?  Since it comes back as active,
 it's
  serving queries and giving old results.
 
  How can I force this node to do a recovery on restart?
 
  Thanks.
 
 
  -Greg



Re: Dead node, but clusterstate.json says active, won't sync on restart

2014-01-29 Thread Mark Miller
What's in the logs of the node that won't recover on restart after clearing the index and tlog?

- Mark

On Jan 29, 2014, at 11:41 AM, Greg Preston gpres...@marinsoftware.com wrote:

 If you removed the tlog and index and restart it should resync, or
 something is really crazy.
 
 It doesn't, or at least if it tries, it's somehow failing.  I'd be ok with
 the sync failing for some reason if the node wasn't also serving queries.
 
 
 -Greg
 
 
 On Tue, Jan 28, 2014 at 11:10 AM, Mark Miller markrmil...@gmail.com wrote:
 
 Sounds like a bug. 4.6.1 is out any minute - you might try that. There was
 a replication bug that may be involved.
 
 If you removed the tlog and index and restart it should resync, or
 something is really crazy.
 
 The clusterstate.json is a red herring. You have to merge the live nodes
 info with the state to know the real state.
 
 - Mark
 
 http://www.about.me/markrmiller
 
 On Jan 28, 2014, at 12:31 PM, Greg Preston gpres...@marinsoftware.com
 wrote:
 
 ** Using solrcloud 4.4.0 **
 
 I had to kill a running solrcloud node.  There is still a replica for
 that
 shard, so everything is functional.  We've done some indexing while the
 node was killed.
 
 I'd like to bring back up the downed node and have it resync from the
 other
 replica.  But when I restart the downed node, it joins back up as active
 immediately, and doesn't resync.  I even wiped the data directory on the
 downed node, hoping that would force it to sync on restart, but it
 doesn't.
 
 I'm assuming this is related to the state still being listed as active in
 clusterstate.json for the downed node?  Since it comes back as active,
 it's
 serving queries and giving old results.
 
 How can I force this node to do a recovery on restart?
 
 Thanks.
 
 
 -Greg
 


Re: Dead node, but clusterstate.json says active, won't sync on restart

2014-01-29 Thread Greg Preston
I've attached the log of the downed node (truffle-solr-4).
This is the relevant log entry from the node it should replicate from
(truffle-solr-5):

[29 Jan 2014 19:31:29] [qtp1614415528-74] ERROR (org.apache.solr.common.SolrException) - org.apache.solr.common.SolrException: I was asked to wait on state recovering for truffle-solr-4:8983_solr but I still do not see the requested state. I see state: active live:true
    at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:966)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:191)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:368)

You can see that 4 is serving queries.  It appears that 4 tries to recover
from 5, but 5 is confused about the state of 4?  4 had an empty index and
tlog when it was started.
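
(A quick way to confirm what each replica is actually serving is to query each node directly with distrib=false and compare numFound. Below is a rough SolrJ sketch; the base URLs and the core name "marin" are illustrative assumptions, so substitute the real hosts and core.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Sketch: ask each replica for its local document count (distrib=false),
// so a stale or empty replica stands out immediately.
public class CompareReplicas {
    public static void main(String[] args) throws Exception {
        String[] cores = {
            "http://truffle-solr-4:8983/solr/marin",   // assumed host and core name
            "http://truffle-solr-5:8983/solr/marin"
        };
        for (String baseUrl : cores) {
            HttpSolrServer server = new HttpSolrServer(baseUrl);
            SolrQuery q = new SolrQuery("*:*");
            q.set("distrib", false);   // do not fan out; count only local docs
            q.setRows(0);
            long numFound = server.query(q).getResults().getNumFound();
            System.out.println(baseUrl + " -> " + numFound + " docs");
            server.shutdown();
        }
    }
}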

We will eventually upgrade to 4.6.x or 4.7.x, but we've got a pretty
extensive regression testing cycle, so there is some delay in upgrading
versions.



-Greg


On Wed, Jan 29, 2014 at 9:08 AM, Mark Miller markrmil...@gmail.com wrote:

 What's in the logs of the node that won't recover on restart after clearing the index and tlog?

 - Mark

 On Jan 29, 2014, at 11:41 AM, Greg Preston gpres...@marinsoftware.com
 wrote:

  If you removed the tlog and index and restart it should resync, or
  something is really crazy.
 
  It doesn't, or at least if it tries, it's somehow failing.  I'd be ok
 with
  the sync failing for some reason if the node wasn't also serving queries.
 
 
  -Greg
 
 
  On Tue, Jan 28, 2014 at 11:10 AM, Mark Miller markrmil...@gmail.com
 wrote:
 
  Sounds like a bug. 4.6.1 is out any minute - you might try that. There
 was
  a replication bug that may be involved.
 
  If you removed the tlog and index and restart it should resync, or
  something is really crazy.
 
  The clusterstate.json is a red herring. You have to merge the live nodes
  info with the state to know the real state.
 
  - Mark
 
  http://www.about.me/markrmiller
 
  On Jan 28, 2014, at 12:31 PM, Greg Preston 
 gpres...@marinsoftware.com
  wrote:
 
  ** Using solrcloud 4.4.0 **
 
  I had to kill a running solrcloud node.  There is still a replica for
  that
  shard, so everything is functional.  We've done some indexing while the
  node was killed.
 
  I'd like to bring back up the downed node and have it resync from the
  other
  replica.  But when I restart the downed node, it joins back up as
 active
  immediately, and doesn't resync.  I even wiped the data directory on
 the
  downed node, hoping that would force it to sync on restart, but it
  doesn't.
 
  I'm assuming this is related to the state still being listed as active
 in
  clusterstate.json for the downed node?  Since it comes back as active,
  it's
  serving queries and giving old results.
 
  How can I force this node to do a recovery on restart?
 
  Thanks.
 
 
  -Greg
 

[Attached log from truffle-solr-4 (truncated)]
[29 Jan 2014 19:28:57] [main] INFO  (org.eclipse.jetty.server.Server) - jetty-8.1.10.v20130312
[29 Jan 2014 19:28:57] [main] INFO  (org.eclipse.jetty.deploy.providers.ScanningAppProvider) - Deployment monitor /home/solr/solr/solr-4.4.0/example/contexts at interval 0
[29 Jan 2014 19:28:57] [main] INFO  (org.eclipse.jetty.deploy.DeploymentManager) - Deployable added: /home/solr/solr/solr-4.4.0/example/contexts/solr-jetty-context.xml
[29 Jan 2014 19:28:58] [main] INFO  

Dead node, but clusterstate.json says active, won't sync on restart

2014-01-28 Thread Greg Preston
** Using solrcloud 4.4.0 **

I had to kill a running solrcloud node.  There is still a replica for that
shard, so everything is functional.  We've done some indexing while the
node was killed.

I'd like to bring back up the downed node and have it resync from the other
replica.  But when I restart the downed node, it joins back up as active
immediately, and doesn't resync.  I even wiped the data directory on the
downed node, hoping that would force it to sync on restart, but it doesn't.

I'm assuming this is related to the state still being listed as active in
clusterstate.json for the downed node?  Since it comes back as active, it's
serving queries and giving old results.

How can I force this node to do a recovery on restart?

Thanks.


-Greg
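
(One option worth trying, offered here as a suggestion rather than something confirmed in the thread: the CoreAdmin API exposes a REQUESTRECOVERY action that asks a named core to go through recovery explicitly. The minimal Java sketch below is an illustration only; host, port, and the core name are examples, and it is worth confirming the action is available in a 4.4 build before relying on it.)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: trigger recovery for one core via the CoreAdmin REQUESTRECOVERY action.
// The host and core name below are placeholders; adjust to the actual node.
public class ForceRecovery {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://truffle-solr-4:8983/solr/admin/cores"
                + "?action=REQUESTRECOVERY&core=marin");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        System.out.println("HTTP " + conn.getResponseCode());
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        in.close();
        conn.disconnect();
    }
}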


Re: Dead node, but clusterstate.json says active, won't sync on restart

2014-01-28 Thread Shawn Heisey

On 1/28/2014 10:31 AM, Greg Preston wrote:

** Using solrcloud 4.4.0 **

I had to kill a running solrcloud node.  There is still a replica for that
shard, so everything is functional.  We've done some indexing while the
node was killed.

I'd like to bring back up the downed node and have it resync from the other
replica.  But when I restart the downed node, it joins back up as active
immediately, and doesn't resync.  I even wiped the data directory on the
downed node, hoping that would force it to sync on restart, but it doesn't.

I'm assuming this is related to the state still being listed as active in
clusterstate.json for the downed node?  Since it comes back as active, it's
serving queries and giving old results.

How can I force this node to do a recovery on restart?


This might be completely wrong, but hopefully it will help you: Perhaps 
a graceful stop of that node will result in the proper clusterstate so 
it will work the next time it's started? That may already be what you've 
done, so this may not help at all ... but you did say "kill", which might 
mean that it wasn't a clean shutdown of Solr.


Thanks,
Shawn



Re: Dead node, but clusterstate.json says active, won't sync on restart

2014-01-28 Thread Greg Preston
Thanks for the idea.  I tried it, and the state for the bad node, even
after an orderly shutdown, is still active in clusterstate.json.  I see
this in the logs on restart:

[28 Jan 2014 18:25:29] [RecoveryThread] ERROR (org.apache.solr.common.SolrException) - Error while trying to recover. core=marin:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was asked to wait on state recovering for truffle-solr-4:8983_solr but I still do not see the requested state. I see state: active live:true
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)





-Greg


On Tue, Jan 28, 2014 at 9:53 AM, Shawn Heisey s...@elyograg.org wrote:

 On 1/28/2014 10:31 AM, Greg Preston wrote:

 ** Using solrcloud 4.4.0 **

 I had to kill a running solrcloud node.  There is still a replica for that
 shard, so everything is functional.  We've done some indexing while the
 node was killed.

 I'd like to bring back up the downed node and have it resync from the
 other
 replica.  But when I restart the downed node, it joins back up as active
 immediately, and doesn't resync.  I even wiped the data directory on the
 downed node, hoping that would force it to sync on restart, but it
 doesn't.

 I'm assuming this is related to the state still being listed as active in
 clusterstate.json for the downed node?  Since it comes back as active,
 it's
 serving queries and giving old results.

 How can I force this node to do a recovery on restart?


 This might be completely wrong, but hopefully it will help you: Perhaps a
 graceful stop of that node will result in the proper clusterstate so it
 will work the next time it's started? That may already be what you've done,
 so this may not help at all ... but you did say "kill", which might mean
 that it wasn't a clean shutdown of Solr.

 Thanks,
 Shawn




Re: Dead node, but clusterstate.json says active, won't sync on restart

2014-01-28 Thread Mark Miller
Sounds like a bug. 4.6.1 is out any minute - you might try that. There was a 
replication bug that may be involved. 

If you removed the tlog and index and restart it should resync, or something is 
really crazy. 

The clusterstate.json is a red herring. You have to merge the live nodes info 
with the state to know the real state. 

- Mark

http://www.about.me/markrmiller
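
(For reference, one way to see the merged view described above is to read /clusterstate.json and /live_nodes directly from ZooKeeper and combine them. This is a rough sketch using the plain ZooKeeper Java client; the connect string localhost:2181 is an assumption, so point it at the real ensemble.)

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch: a replica marked "active" in clusterstate.json is only really active
// if its node_name also appears under /live_nodes (ephemeral znodes).
public class RealStateCheck {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();  // wait until the session is established
        try {
            byte[] stateJson = zk.getData("/clusterstate.json", false, null);
            List<String> liveNodes = zk.getChildren("/live_nodes", false);
            System.out.println("live nodes: " + liveNodes);
            System.out.println(new String(stateJson, "UTF-8"));
            // e.g. a replica with node_name truffle-solr-4:8983_solr is only
            // truly active if that name is present in the live nodes list.
        } finally {
            zk.close();
        }
    }
}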

 On Jan 28, 2014, at 12:31 PM, Greg Preston gpres...@marinsoftware.com wrote:
 
 ** Using solrcloud 4.4.0 **
 
 I had to kill a running solrcloud node.  There is still a replica for that
 shard, so everything is functional.  We've done some indexing while the
 node was killed.
 
 I'd like to bring back up the downed node and have it resync from the other
 replica.  But when I restart the downed node, it joins back up as active
 immediately, and doesn't resync.  I even wiped the data directory on the
 downed node, hoping that would force it to sync on restart, but it doesn't.
 
 I'm assuming this is related to the state still being listed as active in
 clusterstate.json for the downed node?  Since it comes back as active, it's
 serving queries and giving old results.
 
 How can I force this node to do a recovery on restart?
 
 Thanks.
 
 
 -Greg