[ 
https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784908#action_12784908
 ] 

Patrick Hunt commented on SOLR-1277:
------------------------------------

bq. Any pointers on ways to deal with this?

From our experience with hbase (which is the only place we've seen this issue 
so far, at least to this extent) you need to think about:

1) client timeout value tradeoffs
2) effects of session expiration due to a gc pause, and potential ways to mitigate them

For 1) there is a tradeoff. (The good thing is that not all clients need to use 
the same timeout, so you can tune based on the client type; you can even have 
multiple sessions for a single client, each with its own timeout.) You can set 
the timeout higher, so if your zk client pauses you don't get expired. However, 
this also means that if your client crashes the session won't be expired until 
the timeout elapses, so the rest of your system will not be notified of the 
change (say you are doing leader election) for longer than you might like.
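
To make the tradeoff concrete, here is a minimal sketch (hypothetical connect 
string and timeout values) of giving two client types their own session 
timeouts with the standard ZooKeeper Java client; the server negotiates the 
actual timeout within the bounds it is configured with:

{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkSessionTimeouts {
    public static void main(String[] args) throws Exception {
        Watcher noop = new Watcher() {
            public void process(WatchedEvent event) { /* handle connection state events here */ }
        };

        // Short timeout: failures are detected quickly, but a gc pause or
        // network burp longer than ~5s expires the session.
        ZooKeeper searcher = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, noop);

        // Longer timeout for a component whose recovery is expensive: it
        // survives longer pauses, but a real crash goes unnoticed for up to ~30s.
        ZooKeeper indexer = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, noop);

        // ... use the handles, then close them when done
        searcher.close();
        indexer.close();
    }
}
{code}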

For 2) you need to think about the potential failure cases and their effects. 
a) Say your ZK client (solr component X) fails (the host crashes): do you need 
to know about this in 5 seconds, or 30? b) Say the host is network partitioned 
due to a burp in the network that lasts 5 seconds: is this ok, or does the rest 
of the solr system need to know about it? c) Say component X gc pauses for 4 
minutes: do you want the rest of the system to react immediately, or consider 
this "ok" and just wait around for a while for X to come back? Keep in mind 
that from the perspective of "the rest of your system" you can't tell the 
difference between a), b), or c) (etc...); from their viewpoint X is gone and 
they don't know why (unless it eventually comes back).
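
From the observers' side all three cases look the same: the ephemeral znode 
that X created just disappears. A minimal sketch (hypothetical znode path) of 
watching for that with the ZooKeeper Java client:

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ComponentXObserver implements Watcher {
    private static final String NODE = "/solr/live_nodes/componentX"; // hypothetical path
    private final ZooKeeper zk;

    public ComponentXObserver(ZooKeeper zk) throws Exception {
        this.zk = zk;
        watch();
    }

    private void watch() throws KeeperException, InterruptedException {
        // exists() registers the watch whether or not the znode is present right now
        Stat stat = zk.exists(NODE, this);
        if (stat == null) {
            onComponentXGone();
        }
    }

    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            // crash, partition, or long gc pause all look identical here:
            // the ephemeral znode is gone and we don't know why
            onComponentXGone();
        }
        try {
            watch(); // watches are one-shot, so re-register
        } catch (Exception e) {
            // connection trouble; retry/reconnect handled elsewhere
        }
    }

    private void onComponentXGone() {
        // trigger failover, leader re-election, etc.
    }
}
{code}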

In the hbase case session expiration is expensive because the region server 
master will reallocate the table (or some such). In your case the effects of X 
going down may not be very expensive. If so, having a low(er) session timeout 
for X may not be a problem: just deal with the session timeout when it does 
happen, since X will eventually come back.

If X recovery is expensive you may want to set the timeout very high, but as I 
said this makes the system less responsive if X has a real problem. Another 
option we explored with hbase is to use a "lease" recipe instead. Set a very 
high timeout, but have X update the znode (still ephemeral) every N seconds. If 
the rest of the system (whoever is interested in X's status) doesn't see an 
update from X in T seconds, then perhaps you log a warning ("where is X?"). Say 
you don't see an update from X in T*2 seconds: then page the operator ("warning, 
maybe problems with X"). Say you don't see an update in T*3 seconds (perhaps 
this is the timeout you use, in which case the znode is removed): consider X 
down, clean up, and enact recovery. These are made-up actions/times, but you can 
see what I'm getting at. With a lease it's not "all or nothing": you (solr) have 
the option to take actions based on the lease time, rather than just reacting to 
the znode being deleted in the typical case. The tradeoff here is that it's a 
bit more complicated for you - you need to implement the lease rather than just 
relying on the znode being deleted - you would of course still set a watch on 
the znode to get notified when it is removed (etc...)
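
A rough sketch of that lease recipe (made-up path, periods, and thresholds) 
using the plain ZooKeeper Java API: X refreshes a timestamp in the ephemeral 
znode on a schedule, and observers escalate based on how stale it is:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaseSketch {
    static final String LEASE = "/solr/leases/componentX"; // hypothetical path
    static final long T = 10_000;                          // made-up lease period, ms

    /** Component X side: create the ephemeral znode, then refresh it every few seconds. */
    static void startHeartbeat(final ZooKeeper zk) throws Exception {
        zk.create(LEASE, now(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                zk.setData(LEASE, now(), -1); // -1 = any version
            } catch (Exception e) {
                // log; session loss is handled by the normal watcher path
            }
        }, 5, 5, TimeUnit.SECONDS);
    }

    /** Observer side: poll the lease and escalate as it goes stale. */
    static void checkLease(ZooKeeper zk) throws Exception {
        byte[] data;
        try {
            data = zk.getData(LEASE, false, null);
        } catch (KeeperException.NoNodeException e) {
            // session finally expired and the znode was removed: X is down
            return;
        }
        long age = System.currentTimeMillis()
                 - Long.parseLong(new String(data, StandardCharsets.UTF_8));
        if (age > 3 * T) {
            // consider X down, clean up and enact recovery
        } else if (age > 2 * T) {
            // page the operator: "warning, maybe problems with X"
        } else if (age > T) {
            // log a warning: "where is X?"
        }
    }

    static byte[] now() {
        return Long.toString(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8);
    }
}
{code}

You would still set a watch on the znode as in the previous sketch, so the 
"all or nothing" signal remains as a backstop; the lease just gives you 
intermediate stages to react to.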


> Implement a Solr specific naming service (using Zookeeper)
> ----------------------------------------------------------
>
>                 Key: SOLR-1277
>                 URL: https://issues.apache.org/jira/browse/SOLR-1277
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, 
> SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes
> where if a server fails, indexing and searching don't stop and
> all of the partitions remain searchable. For configuration, the
> ability to centrally deploy a new configuration without servers
> going offline.
> We can start with basic failover and go from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop
> trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml
> or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another
> server (i.e. if a server gets hot, move the hot segments out to
> cooler servers). Ideally we'd have a way to detect hot segments
> and move them seamlessly. With NRT this becomes somewhat more
> difficult but not impossible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
