Clarification: if we restart the nodes after reloading the collection and before pausing, then recovery works fine.
On Thu, Jan 14, 2016 at 12:08 PM, Gili Nachum <gilinac...@gmail.com> wrote:
> Hi,
>
> Our Solr cluster is running on VMs that could freeze for more than the ZK
> tick time (it's a non-critical CI/CD pipeline running on an overloaded
> ESX). When this happens, the node's shards are registered as down. Then,
> when the node is back, recovery takes place and all shard replicas end up
> in the active state. Everyone is happy.
>
> However, we noticed that recovery doesn't take place if the collection was
> reloaded and the server hasn't restarted since. Shards end up in the down
> state. Before providing log messages, I wonder if this is a known issue?
>
> Reproducing recipe (assume two nodes):
> 1. Before starting: restart both solr1 and solr2: all shards are active.
> 2. Reload the collection.
> 3. Cause a disconnect by freezing the Java process:
>    On solr2: kill -SIGSTOP <solr server pid> and then, after 2 min,
>    kill -SIGCONT <solr server pid>
> 4. solr2 shard replicas are *Down* forever. No recovery.
>
> If we omit step #2, the cluster recovers as expected.
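
In case it helps anyone trying to reproduce this, here is a rough Python sketch of steps 2 to 4 of the recipe. It assumes a POSIX host (SIGSTOP/SIGCONT) and that the Collections API is reachable over plain HTTP. The base URL, collection name, and pid are placeholders, and the helper names are made up for illustration, not part of Solr.

```python
import os
import signal
import time
import urllib.parse
import urllib.request


def collections_url(base_url, action, **params):
    """Build a Solr Collections API URL (e.g. action=RELOAD or CLUSTERSTATUS)."""
    query = urllib.parse.urlencode({"action": action, "wt": "json", **params})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def freeze_process(pid, seconds):
    """Suspend a process with SIGSTOP, wait, then always resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        os.kill(pid, signal.SIGCONT)


def reproduce(solr_url, collection, solr2_pid, freeze_seconds=120):
    """Run steps 2-4; must be executed on the host where solr2 runs.

    solr_url, collection and solr2_pid are placeholders supplied by the caller.
    Returns the raw CLUSTERSTATUS response so replica states can be inspected.
    """
    # Step 2: reload the collection.
    urllib.request.urlopen(collections_url(solr_url, "RELOAD", name=collection)).read()
    # Step 3: freeze the solr2 JVM for longer than the ZK session timeout.
    freeze_process(solr2_pid, freeze_seconds)
    # Step 4: fetch the cluster state to see whether the replicas recovered.
    status_url = collections_url(solr_url, "CLUSTERSTATUS", collection=collection)
    with urllib.request.urlopen(status_url) as resp:
        return resp.read().decode("utf-8")
```

Calling something like reproduce("http://solr1:8983/solr", "mycoll", <solr server pid>) and polling CLUSTERSTATUS for a while afterwards should show whether the solr2 replicas ever leave the down state.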