The heartbeat that keeps the node alive is the connection it maintains with 
ZooKeeper.
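
For context, that heartbeat is really just the ZooKeeper session: each node 
registers an ephemeral node under /live_nodes, and the entry stays there as 
long as the client keeps its session alive. Roughly like this (a minimal 
sketch using the plain ZooKeeper client rather than Solr’s actual code; the 
host names and timeouts here are made up):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LiveNodeSketch {
  public static void main(String[] args) throws Exception {
    // Connect with a session timeout; the ZooKeeper client pings in the
    // background to keep the session alive.
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 15000, event -> {});

    // An ephemeral node lives only as long as the session. If the JVM dies or
    // stops heartbeating, ZooKeeper deletes it and the node drops out of
    // /live_nodes; a sick node that keeps its session still looks fine.
    zk.create("/live_nodes/host1:8983_solr", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    Thread.sleep(Long.MAX_VALUE); // stay "live" until the process exits
  }
}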

We don’t currently have anything built in that will actively make sure each 
node can serve queries and remove it from clusterstate.json if it cannot. If a 
replica maintains its connection with ZooKeeper and, in most cases, is 
accepting updates, it will appear up. Load balancing should handle the 
failures, but I guess it depends on how sticky the request failures are.

In the past, I’ve seen this handled on a different search engine by having a 
variety of external agent scripts that would occasionally attempt a query and, 
if things did not go right, kill the process so that it would be started up 
again (supervised process).
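
Something along those lines might look like the sketch below (just an 
illustration of that kind of external agent, not anything that ships with 
Solr; the URL, core name, and pkill pattern are assumptions you would adjust 
for your setup):

import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical external watchdog: issue a real query against the local node
// and, if it fails, kill the (supervised) Solr process so it gets restarted.
public class SolrWatchdog {
  public static void main(String[] args) {
    // URL and core name are assumptions; point this at a core you actually serve.
    String queryUrl = "http://localhost:8983/solr/collection1/select?q=*:*&rows=0";
    if (!canQuery(queryUrl)) {
      restartSolr();
    }
  }

  static boolean canQuery(String url) {
    try {
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      conn.setConnectTimeout(5000);
      conn.setReadTimeout(10000);
      return conn.getResponseCode() == 200;
    } catch (Exception e) {
      return false;
    }
  }

  static void restartSolr() {
    try {
      // Assumes the JVM runs under a supervisor (runit, systemd, etc.) that
      // will start it again once it is killed.
      Runtime.getRuntime().exec(new String[] {"pkill", "-f", "start.jar"}).waitFor();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}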

I’m not sure what the right long-term feature for Solr is here, but feel free 
to start a JIRA issue around it.

One simple improvement might even be a background thread that periodically 
checks some local health indicators (disk writability, a local query) and, 
depending on the results, pulls itself out of the mix as best it can (removes 
itself from clusterstate.json or simply closes its ZooKeeper connection).
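
Very roughly, something like this (a hypothetical sketch; the 
removeSelfFromClusterState/closeZkConnection hooks below don’t exist and just 
stand in for whatever ZkController plumbing a real patch would add):

import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical self-check thread; nothing like this is in Solr today.
public class SelfHealthCheck implements Runnable {

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start() {
    scheduler.scheduleAtFixedRate(this, 1, 1, TimeUnit.MINUTES);
  }

  @Override
  public void run() {
    if (!dataDirWritable() || !localQuerySucceeds()) {
      // Pull ourselves out of the mix: either publish a "down" state for our
      // replicas or drop the ZooKeeper session so the ephemeral node expires.
      removeSelfFromClusterState();   // hypothetical hook
      closeZkConnection();            // hypothetical hook
    }
  }

  private boolean dataDirWritable() {
    // One cheap "local reading": can we still write to the data directory?
    File probe = new File("/var/solr/data/.healthcheck"); // path is an assumption
    try {
      probe.delete(); // clear any stale probe file
      return probe.createNewFile() && probe.delete();
    } catch (Exception e) {
      return false;
    }
  }

  private boolean localQuerySucceeds() { /* e.g. hit the local /admin/ping handler */ return true; }
  private void removeSelfFromClusterState() { /* publish "down" for local replicas */ }
  private void closeZkConnection() { /* close the ZK client so the live node goes away */ }
}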

- Mark

http://about.me/markrmiller

On Mar 2, 2014, at 3:42 PM, Gregg Donovan <gregg...@gmail.com> wrote:

> We had a brief SolrCloud outage this weekend when a node's SSD began to
> fail but the node still appeared to be up to the rest of the SolrCloud
> cluster (i.e. still green in clusterstate.json). Distributed queries that
> reached this node would fail but whatever heartbeat keeps the node in the
> clusterstate.json must have continued to succeed.
> 
> We eventually had to power the node down to get it to be removed from
> clusterstate.json.
> 
> This is our first foray into SolrCloud, so I'm still somewhat fuzzy on what
> the default heartbeat mechanism is and how we may augment it to be sure
> that the disk is checked as part of the heartbeat and/or we verify that it
> can serve queries.
> 
> Any pointers would be appreciated.
> 
> Thanks!
> 
> --Gregg
