What do you think of allowing some extra leeway in shard-server failover cases?
Example: whenever a shard-server process gets killed, the controller-node does not update the layout immediately but instead marks the server as a suspect. If we have a read-only backup of the shard, searches can continue unhindered. Indexing during this time can be diverted to a queue, which stores the operations and retries them when the shard-server comes back online. If the shard-server does not come up within a configured number of attempts or amount of time, one controller-server can authoritatively mark it as down and update the layout. -- Ravi
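
To make the proposal easier to discuss, here is a rough sketch of the suspect/queue/mark-down flow as a small state machine. It is only an illustration under assumptions: `ShardTracker`, `report_killed`, `probe`, `can_search` and the `max_attempts` setting are hypothetical names, not existing APIs, and it counts failed probes rather than elapsed time to keep the example short.

```python
from collections import deque
from enum import Enum


class State(Enum):
    UP = "up"
    SUSPECT = "suspect"
    DOWN = "down"


class ShardTracker:
    """Controller-side view of one shard-server (hypothetical sketch)."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts  # assumed config knob
        self.state = State.UP
        self.failed_probes = 0
        self.pending = deque()            # index ops held while SUSPECT
        self.applied = []                 # stands in for the real shard index
        self.layout_updated = False

    def report_killed(self):
        # Leave the layout alone for now; only mark the server as a suspect.
        if self.state is State.UP:
            self.state = State.SUSPECT
            self.failed_probes = 0

    def can_search(self, has_read_only_backup):
        # Searches keep working during SUSPECT only if a read-only backup exists.
        if self.state is State.UP:
            return True
        return self.state is State.SUSPECT and has_read_only_backup

    def index(self, op):
        if self.state is State.UP:
            self.applied.append(op)
            return "applied"
        if self.state is State.SUSPECT:
            self.pending.append(op)       # divert to the retry queue
            return "queued"
        return "rejected"                 # DOWN: caller must follow the new layout

    def probe(self, alive):
        # Called once per retry attempt while the server is a suspect.
        if self.state is not State.SUSPECT:
            return
        if alive:
            # Server is back: replay queued ops in their original order.
            while self.pending:
                self.applied.append(self.pending.popleft())
            self.state = State.UP
            self.failed_probes = 0
            return
        self.failed_probes += 1
        if self.failed_probes >= self.max_attempts:
            # Give up: mark the server down and update the layout.
            # What happens to ops still in `pending` is left open here.
            self.state = State.DOWN
            self.layout_updated = True
```

With `max_attempts=2`, a server that is killed and comes back on the first probe gets its queued operations replayed and returns to `UP`; one that fails two probes in a row is marked `DOWN` and the layout is updated. How operations still sitting in the queue should be handled at that point is something the proposal would need to settle.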
