Pardon me for hijacking the thread, but I'm curious about something you
said, Erick. I always thought that the point (in part) of going through
the pain of using ZooKeeper and creating replicas was so that the system
could seamlessly recover from catastrophic failures. Wouldn't an OOM
condition have a similar effect (or maybe Java is better at cleanup on
that kind of error)? The reason I ask is that I'm trying to set up a
Solr system that is highly available, and I'm a little surprised that a
kill -9 on one process on one machine could put the entire system in a
bad state. Is it common to have to address problems like this with
manual intervention in production systems? Ideally, I'd hope to be able
to set up a system where a single node dying a horrible death would
never require intervention.

On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> First of all, killing with -9 is A Very Bad Idea. You can
> leave write-lock files lying around. You can leave
> the state in an "interesting" place. You haven't given
> Solr a chance to tell ZooKeeper that it's going away
> (which would set the state to "down"). In short,
> when you do this you have to deal with the consequences
> yourself, one of which is this mismatch between
> cluster state and live_nodes.
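>
> As an illustration of that mismatch (a rough sketch, not an official
> recovery tool): you can cross-check what state.json claims against
> /live_nodes with SolrJ. The ZK address and collection name below are
> placeholders, and the exact client constructor varies a bit across
> Solr versions:
>
>   import org.apache.solr.client.solrj.impl.CloudSolrClient;
>   import org.apache.solr.common.cloud.ClusterState;
>   import org.apache.solr.common.cloud.DocCollection;
>   import org.apache.solr.common.cloud.Replica;
>   import org.apache.solr.common.cloud.Slice;
>
>   public class StaleReplicaReport {
>     public static void main(String[] args) throws Exception {
>       // Placeholder ZK ensemble address; adjust for your cluster.
>       try (CloudSolrClient client =
>                new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
>         client.connect();
>         ClusterState cs = client.getZkStateReader().getClusterState();
>         DocCollection coll = cs.getCollection("demo.public.tbl");
>         for (Slice shard : coll.getSlices()) {
>           for (Replica r : shard.getReplicas()) {
>             // A replica is only truly active if its recorded state is
>             // ACTIVE *and* its node is still present under /live_nodes.
>             boolean claimsActive = r.getState() == Replica.State.ACTIVE;
>             boolean nodeLive = cs.liveNodesContain(r.getNodeName());
>             if (claimsActive && !nodeLive) {
>               System.out.println(r.getName() + " on " + r.getNodeName()
>                   + " claims ACTIVE but its node is not live");
>             }
>           }
>         }
>       }
>     }
>   }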
>
> Now, with that rant done: the bin/solr script tries to stop Solr
> gracefully, but kills it forcibly if Solr doesn't stop nicely.
> Personally I think that timeout should be longer, but that's
> another story.
>
> The onlyIfDown='true' option is there specifically as a
> safety valve. It was provided for those who want to guard against
> typos and the like, so just don't specify it and you should be fine.
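>
> If you do want to script the cleanup, here's a minimal sketch of
> calling DELETEREPLICA without onlyIfDown, via plain HTTP from Java
> (the host is a placeholder; the collection/shard/replica names are
> taken from the error below):
>
>   import java.io.BufferedReader;
>   import java.io.InputStreamReader;
>   import java.net.HttpURLConnection;
>   import java.net.URL;
>   import java.nio.charset.StandardCharsets;
>
>   public class DeleteDeadReplica {
>     public static void main(String[] args) throws Exception {
>       // Omitting onlyIfDown lets the delete proceed even though
>       // state.json still (wrongly) records the replica as ACTIVE.
>       URL url = new URL("http://localhost:8983/solr/admin/collections"
>           + "?action=DELETEREPLICA"
>           + "&collection=demo.public.tbl"
>           + "&shard=shard0"
>           + "&replica=core_node4");
>       HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>       try (BufferedReader in = new BufferedReader(
>           new InputStreamReader(conn.getInputStream(),
>               StandardCharsets.UTF_8))) {
>         in.lines().forEach(System.out::println);
>       }
>     }
>   }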
>
> Best,
> Erick
>
> On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang <jey...@pivotal.io> wrote:
> > Hi all,
> >
> > Here's the situation.
> > I'm using Solr 5.3 in cloud mode.
> >
> > I have 4 nodes.
> >
> > After using "kill -9 pid-solr-node" to kill 2 of the nodes, the
> > replicas on those two nodes still show as "ACTIVE" in ZooKeeper's
> > state.json.
> >
> > The problem is, when I try to delete these down replicas with the
> > parameter onlyIfDown='true', it says:
> > "Delete replica failed: Attempted to remove replica :
> > demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
> > 'active'."
> >
> > From this link:
> > <
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
> >
> > <
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
> >
> > <
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
> >
> > <
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
> >
> >
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
> >
> > It says:
> > *NOTE*: when the node the replica is hosted on crashes, the replica's
> > state may remain ACTIVE in ZK. To determine if the replica is truly
> > active, you must also verify that its node (Replica.getNodeName(),
> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.html#getNodeName--)
> > is under /live_nodes in ZK (or use ClusterState.liveNodesContain(String),
> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/ClusterState.html#liveNodesContain-java.lang.String-).
> >
> > So, is this a bug?
> >
> > Regards,
> > Jerome
>
