Clarification: if we restart the nodes after reloading the collection and
before pausing the process, recovery works fine.

On Thu, Jan 14, 2016 at 12:08 PM, Gili Nachum <gilinac...@gmail.com> wrote:

> Hi,
>
> Our Solr cluster runs on VMs that can freeze for longer than the ZK
> tick time (it's a non-critical CI/CD pipeline running on an overloaded
> ESX). When this happens, the node's shards are registered as down. Once
> the node is back, recovery takes place and all shard replicas end up in
> the active state. Everyone is happy.
>
> However, we noticed that recovery doesn't take place if the collection was
> reloaded and the server hasn't been restarted since. Shards end up in the
> down state. Before providing log messages, I wonder if this is a known issue?
>
> Reproducing recipe (assume two nodes):
> 1. Before starting: restart both solr1 and solr2: all shards are active.
> 2. Reload the collection
> 3. Cause disconnect by freezing the Java process:
> On solr2: kill -SIGSTOP <solr server pid>, then after 2 minutes
> kill -SIGCONT <solr server pid>
> 4. solr2 shard replicas are *Down* forever. No recovery.
>
> If we omit step #2, the cluster recovers as expected.
>
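
For anyone who wants to script the recipe above, here is a minimal sketch.
It is not the exact procedure we ran: the SOLR_URL, COLLECTION and
FREEZE_SECS values are placeholders I made up, and the function names are
only illustrative. It assumes the standard Collections API (RELOAD and
CLUSTERSTATUS) is reachable and that the script runs on the solr2 host.

```shell
#!/bin/sh
# Sketch of the reproduction recipe. All three values below are assumed
# placeholders; adjust them to your own cluster.
SOLR_URL="${SOLR_URL:-http://solr1:8983/solr}"
COLLECTION="${COLLECTION:-mycollection}"
FREEZE_SECS="${FREEZE_SECS:-120}"

# Step 2: reload the collection through the Collections API.
reload_collection() {
    curl -s "$SOLR_URL/admin/collections?action=RELOAD&name=$COLLECTION"
}

# Step 3: freeze a process for a number of seconds, then resume it.
# Arguments: <pid> <seconds>
freeze_process() {
    pid="$1"
    secs="$2"
    kill -STOP "$pid" || return 1   # suspend the process
    sleep "$secs"                   # stay frozen past the ZK timeout
    kill -CONT "$pid"               # let it continue
}

# Step 4: print the cluster state so the replica states can be inspected.
cluster_state() {
    curl -s "$SOLR_URL/admin/collections?action=CLUSTERSTATUS&collection=$COLLECTION&wt=json"
}
```

Typical use, with the Solr server pid in SOLR_PID: call reload_collection,
then freeze_process "$SOLR_PID" "$FREEZE_SECS", then cluster_state, and
check whether the solr2 replicas are still reported as down.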
