On Thu, Mar 6, 2014 at 6:30 AM, Ravikumar Govindarajan <[email protected]> wrote:
> I came to know about the zk.session.timeout variable just now, while
> reading more about this problem.
>
> This will only trigger a dead-node notification after the configured
> timeout is exceeded. Setting it to 3-4 mins should be fine for OOMs and
> rolling restarts.

Well, it works that way for OOMs and for when the process drops hard (think `kill -9`). However, when a shard server is shut down it currently ends its session in ZooKeeper, thus triggering a layout change.

> The only extra thing I am looking for is to divert search calls to a
> read-only shard instance during this 3-4 minute window, to avoid
> mini-outages.

Yes, and I think that the controllers will automatically spread the queries across those servers that are online. The BlurClient class already takes a list of connection strings and treats all connections as equals. For example, its current use is to provide the client with all of the controllers' connection strings. Internally, if any one of the controllers goes down or has a network issue, another controller is automatically retried without the user having to do anything. There is back-off, ping, and pooling logic in the BlurClientManager that the BlurClient utilizes.

Aaron

> --
> Ravi
>
> On Thu, Mar 6, 2014 at 3:34 PM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > What do you think of giving some extra leeway for shard-server failover
> > cases?
> >
> > Ex: Whenever a shard-server process gets killed, the controller node
> > does not immediately update the layout, but rather marks it as a
> > suspect.
> >
> > When we have a read-only backup of the shard, searches can continue
> > unhindered. Indexing during this time can be diverted to a queue, which
> > will store and retry ops when the shard server comes online again.
> >
> > After a configured number of attempts (or amount of time), if the
> > shard server does not come up, then one controller server can
> > authoritatively mark it as down and update the layout.
> >
> > --
> > Ravi
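The retry behavior Aaron describes (a list of controller connection strings, all treated as equals, with the next one tried automatically on failure) can be sketched as below. This is an illustrative, self-contained sketch, not Blur's actual BlurClient/BlurClientManager code; the `FailoverClient` and `Call` names are hypothetical.

```java
import java.util.List;

/**
 * Hypothetical sketch of controller failover: try each connection in
 * order until one succeeds, mirroring the behavior described in the
 * thread. Real BlurClientManager also adds back-off, ping, and pooling.
 */
public class FailoverClient {

    /** A call that can be executed against one controller connection. */
    public interface Call<T> {
        T run(String connection) throws Exception;
    }

    /** Execute the call against the first healthy controller in the list. */
    public static <T> T execute(List<String> connections, Call<T> call) throws Exception {
        Exception last = null;
        for (String conn : connections) {   // all connections treated as equals
            try {
                return call.run(conn);      // first controller that answers wins
            } catch (Exception e) {
                last = e;                   // remember the failure, try the next one
            }
        }
        throw last;                         // every controller failed
    }
}
```

The point of the design is that the caller hands over the whole list once and never has to react to an individual controller going down.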
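Ravi's "suspect before dead" proposal can also be sketched: when a shard server's session drops, a controller marks it suspect and only declares it dead (and updates the layout) after a configured grace period, during which searches could continue against a read-only replica. This is a minimal sketch of the proposed idea, not anything implemented in Blur; the `SuspectTracker` name and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the proposed failover leeway: a shard server
 * whose session drops becomes a "suspect" and is only treated as dead
 * after graceMillis, giving it time to come back (e.g. rolling restart).
 */
public class SuspectTracker {
    private final long graceMillis;
    private final Map<String, Long> suspectSince = new HashMap<>();

    public SuspectTracker(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    /** Session lost: start the grace clock instead of updating the layout. */
    public void markSuspect(String server, long nowMillis) {
        suspectSince.putIfAbsent(server, nowMillis);
    }

    /** Server re-registered within the grace period: clear the suspicion. */
    public void markAlive(String server) {
        suspectSince.remove(server);
    }

    /** True only once the grace period has elapsed; only now would a
     *  controller authoritatively mark the server down. */
    public boolean isDead(String server, long nowMillis) {
        Long since = suspectSince.get(server);
        return since != null && (nowMillis - since) >= graceMillis;
    }
}
```

During the suspect window, index operations would be queued and replayed once the server returns, as the quoted email suggests.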
