>
> Well it works that way for OOMs and for when the process drops hard (think
> kill -9). However, when a shard server is shut down it currently ends its
> session in ZooKeeper, thus triggering a layout change.
Yes, maybe we can have a config to determine whether it should end or
maintain the session in ZK when doing a normal shutdown and subsequent
restart. That way, both MTTR-conscious and layout-conscious setups can be
supported.
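
A rough sketch of what I have in mind, where the config flag and the
shutdown hook are made up for illustration and nothing here exists in Blur
today:

    import org.apache.zookeeper.ZooKeeper;

    // Sketch only: "keep session on shutdown" is a hypothetical new
    // config, not an existing Blur property.
    public class ShardShutdown {
      public static void shutdown(ZooKeeper zooKeeper,
          boolean keepSessionOnShutdown) throws InterruptedException {
        if (keepSessionOnShutdown) {
          // Leave the session open: the shard server's ephemeral
          // registration node survives until zk.session.timeout expires,
          // so a quick restart re-attaches without a layout change.
          return;
        }
        // Current behavior: closing the session deletes the ephemeral
        // node immediately and triggers a layout change.
        zooKeeper.close();
      }
    }
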
How do you think we can detect that a particular shard-server is
struggling or shut down, so that incoming search requests can be routed to
some other server?
I am listing a few approaches off the top of my head:
1. Process baby-sitters like supervisord alerting the controllers
2. Tracking the first network exception in the controller and diverting to
a read-only instance, with periodic retries (sketched below)
3. Making a statistics-based decision from previous response times, etc.
4. Building some kind of leasing mechanism in ZK, etc.
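
For (2), a minimal sketch of what the controller could do; SearchClient
here is an imaginary stand-in for however the controller actually talks to
a shard server:

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch only; SearchClient is a made-up placeholder type.
    public class FailoverRouter {
      public interface SearchClient {
        String search(String query) throws IOException;
      }

      private static final long RETRY_INTERVAL_MS = 30000;
      private final AtomicLong suspectedAt = new AtomicLong(0);

      public String search(SearchClient primary,
          SearchClient readOnlyReplica, String query) throws IOException {
        long suspected = suspectedAt.get();
        boolean tryPrimary = suspected == 0
            || System.currentTimeMillis() - suspected >= RETRY_INTERVAL_MS;
        if (tryPrimary) {
          try {
            String result = primary.search(query);
            suspectedAt.set(0); // success clears the suspect flag
            return result;
          } catch (IOException networkError) {
            // First network exception: mark the shard server suspect.
            suspectedAt.set(System.currentTimeMillis());
          }
        }
        // Divert to the read-only instance until the next retry window.
        return readOnlyReplica.search(query);
      }
    }
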
--
Ravi
On Fri, Mar 7, 2014 at 8:01 AM, Aaron McCurry <[email protected]> wrote:
> On Thu, Mar 6, 2014 at 6:30 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > I came to know about the zk.session.timeout variable just now, while
> > reading more about this problem.
> >
> > This will only trigger the dead-node notification after the configured
> > timeout is exceeded. Setting it to 3-4 mins should be fine for OOMs and
> > rolling-restarts.
> >
>
> Well it works that way for OOMs and for when the process drops hard (think
> kill -9). However, when a shard server is shut down it currently ends its
> session in ZooKeeper, thus triggering a layout change.
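>
> To make the distinction concrete, here is roughly how the registration
> behaves (the connect string, timeout, and path below are just examples):
>
>     import org.apache.zookeeper.CreateMode;
>     import org.apache.zookeeper.ZooDefs;
>     import org.apache.zookeeper.ZooKeeper;
>
>     public class EphemeralDemo {
>       public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("zk1:2181", 180000, null);
>         zk.create("/blur/online-shard-nodes/shard-1", new byte[0],
>             ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
>
>         // kill -9 / OOM: the session lingers until zk.session.timeout
>         // expires, so the ephemeral node (and the layout) survives a
>         // fast restart.
>
>         // Clean shutdown: close() ends the session, the node is
>         // deleted immediately, and a layout change is triggered.
>         zk.close();
>       }
>     }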
>
>
> >
> > The only extra thing I am looking for is to divert search calls to a
> > read-only shard instance during this 3-4 min window, to avoid
> > mini-outages.
> >
>
> Yes, and I think that the controllers will automatically spread the queries
> across those servers that are online. The BlurClient class already takes a
> list of connection strings and treats all connections as equal. For
> example, its current use is to provide the client with all of the
> controllers' connection strings. Internally, if any one of the controllers
> goes down or has a network issue, another controller is automatically
> retried without the user having to do anything. There is back-off, ping,
> and pooling logic in the BlurClientManager that the BlurClient utilizes.
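>
> For example, usage today looks something like this (the host names and
> controller port are from memory, so double check them):
>
>     import org.apache.blur.thrift.BlurClient;
>     import org.apache.blur.thrift.generated.Blur.Iface;
>
>     public class ClientDemo {
>       public static void main(String[] args) throws Exception {
>         // Both controllers are treated as equals; retry, back-off,
>         // ping, and pooling happen inside BlurClientManager.
>         Iface client = BlurClient.getClient(
>             "controller1:40010,controller2:40010");
>         System.out.println(client.tableList());
>       }
>     }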
>
> Aaron
>
>
> >
> > --
> > Ravi
> >
> >
> >
> > On Thu, Mar 6, 2014 at 3:34 PM, Ravikumar Govindarajan <
> > [email protected]> wrote:
> >
> > > What do you think of giving some extra leeway for shard-server
> > > failover cases?
> > >
> > > Ex: Whenever a shard-server process gets killed, the controller-node
> > > does not immediately update the layout, but rather marks it as a
> > > suspect.
> > >
> > > When we have a read-only back-up of the shard, searches can continue
> > > unhindered. Indexing during this time can be diverted to a queue,
> > > which will store ops and retry them when the shard-server comes
> > > online again.
> > >
> > > If the shard-server does not come up within a configured number of
> > > attempts (or a time limit), then one controller-server can
> > > authoritatively mark it as down and update the layout.
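> > >
> > > A minimal sketch of the store-and-retry queue, where the op type and
> > > ShardWriter are imaginary placeholders:
> > >
> > >     import java.util.ArrayDeque;
> > >     import java.util.Queue;
> > >
> > >     // Sketch only; ShardWriter is a made-up placeholder type.
> > >     public class RetryQueue {
> > >       public interface ShardWriter {
> > >         boolean isOnline();
> > >         void apply(String indexOp) throws Exception;
> > >       }
> > >
> > >       private final Queue<String> pending = new ArrayDeque<String>();
> > >
> > >       public void submit(ShardWriter shard, String op) throws Exception {
> > >         if (shard.isOnline()) {
> > >           shard.apply(op);
> > >         } else {
> > >           pending.add(op); // store while the shard server is suspect
> > >         }
> > >       }
> > >
> > >       // Called once the shard server comes back online.
> > >       public void drain(ShardWriter shard) throws Exception {
> > >         while (!pending.isEmpty()) {
> > >           shard.apply(pending.poll());
> > >         }
> > >       }
> > >     }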
> > >
> > > --
> > > Ravi
> > >
> > >
> >
>