[ https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784908#action_12784908 ]
Patrick Hunt commented on SOLR-1277:
------------------------------------

bq. Any pointers on ways to deal with this?

From our experience with HBase (which is the only place we've seen this issue so far, at least to this extent) you need to think about:
1) client timeout value tradeoffs
2) effects of session expiration due to GC pause, and potential ways to mitigate them

For 1) there is a tradeoff (the good thing is that not all clients need to use the same timeout, so you can tune based on the client type; you can even have multiple sessions for a single client, each with its own timeout). You can set the timeout higher, so if your ZK client pauses you don't get expired; however, this also means that if your client crashes, the session won't be expired until the timeout elapses. This means that the rest of your system will not be notified of the change (say you are doing leader election) for longer than you might like.

For 2) you need to think about the potential failure cases and their effects:
a) Say your ZK client (Solr component X) fails (the host crashes). Do you need to know about this in 5 seconds, or 30?
b) Say the host is network partitioned due to a burp in the network that lasts 5 seconds. Is this OK, or does the rest of the Solr system need to know about it?
c) Say component X GC-pauses for 4 minutes. Do you want the rest of the system to react immediately, or consider this "ok" and just wait around for a while for X to come back?

Keep in mind that from the perspective of "the rest of your system" you don't know the difference between a), b), c) (etc.); from their viewpoint X is gone and they don't know why (unless it eventually comes back).

In HBase's case session expiration is expensive, as the region server master will reallocate the table (or some such). In your case the effects of X going down may not be very expensive. If this is the case then having a low(er) session timeout for X may not be a problem.
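Cases a) through c) can be made concrete with a small model of the observer's view. This is a hypothetical sketch in plain Java (no actual ZooKeeper calls; the class, method, and numbers are illustrative, not Solr or ZooKeeper API): the session timeout is simultaneously the detection latency for a real crash and the grace period that lets a transient pause go unnoticed.

```java
// Hypothetical model, not ZooKeeper API: what "the rest of the system"
// can observe about component X through X's ephemeral znode.
final class SessionModel {
    /**
     * Seconds after X goes silent at which the znode disappears and the
     * rest of the system is notified, or -1 if the session survives
     * (X resumes heartbeating before the timeout fires).
     *
     * @param sessionTimeoutSec the ZK session timeout chosen for X
     * @param outageSec how long X is silent: a crash is "forever"
     *                  (Integer.MAX_VALUE), a network burp maybe 5s,
     *                  a bad GC pause maybe 240s
     */
    static int notifiedAfterSec(int sessionTimeoutSec, int outageSec) {
        // The ensemble expires the session only once the timeout elapses
        // with no heartbeat; a shorter outage is never seen by observers.
        return outageSec > sessionTimeoutSec ? sessionTimeoutSec : -1;
    }

    public static void main(String[] args) {
        int timeout = 30; // seconds; the case-1 tradeoff, per client type
        // a) host crash: detected, but only after the full timeout
        System.out.println(notifiedAfterSec(timeout, Integer.MAX_VALUE)); // 30
        // b) 5s network burp: nobody else ever finds out
        System.out.println(notifiedAfterSec(timeout, 5));                 // -1
        // c) 4-minute GC pause: looks exactly like the crash in a) until
        //    (and unless) X comes back -- the session is already expired
        System.out.println(notifiedAfterSec(timeout, 240));               // 30
    }
}
```

Note that a) and c) produce identical observations, which is the indistinguishability point above: raising the timeout to tolerate longer pauses in c) directly worsens crash-detection latency in a).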
(Just deal with the session timeout when it does happen; X will eventually come back.) If X recovery is expensive you may want to set the timeout very high, but as I said, this makes the system less responsive if X has a real problem.

Another option we explored with HBase is to use a "lease" recipe instead. Set a very high timeout, but have X update the znode (still ephemeral) every N seconds. If the rest of the system (whoever is interested in X's status) doesn't see an update from X in T seconds, then perhaps you log a warning ("where is X?"). Say you don't see an update from X in T*2 seconds; then page the operator ("warning, maybe problems with X"). Say you don't see one in T*3 seconds (perhaps this is the session timeout you use, in which case the znode is removed); consider X down, clean up, and enact recovery. These are made-up actions/times, but you can see what I'm getting at. With a lease it's not "all or nothing": you (Solr) have the option to take actions based on the lease time, rather than just reacting to the znode being deleted in the typical case (all or nothing). The tradeoff here is that it's a bit more complicated for you, since you need to implement the lease rather than just relying on the znode being deleted (you would of course still set a watch on the znode to get notified when it is removed, etc.)

> Implement a Solr specific naming service (using Zookeeper)
> ----------------------------------------------------------
>
>                 Key: SOLR-1277
>                 URL: https://issues.apache.org/jira/browse/SOLR-1277
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes where if a server fails, indexing and searching don't stop and all of the partitions remain searchable.
> For configuration, the ability to centrally deploy a new configuration without servers going offline.
> We can start with basic failover and go from there.
> Features:
> * Automatic failover (i.e. when a server fails, clients stop trying to index to or search it)
> * Centralized configuration management (i.e. a new solrconfig.xml or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another server (i.e. if a server gets hot, move the hot segments out to cooler servers). Ideally we'd have a way to detect hot segments and move them seamlessly. With NRT this becomes somewhat more difficult, but not impossible?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.