Oops, that got omitted: v4.7.2. It also kept reproducing after upgrading to v4.9 (I was trying to see whether it had been fixed later on).

On Thu, Jan 14, 2016 at 5:26 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> Which version of Solr is this on?
>
> On Thu, Jan 14, 2016 at 4:10 PM, Gili Nachum <gilinac...@gmail.com> wrote:
> > Clarification: If we restart the nodes after reloading the collection and
> > before pausing, then recovery works fine.
> >
> > On Thu, Jan 14, 2016 at 12:08 PM, Gili Nachum <gilinac...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Our Solr cluster is running on VMs that could freeze for more than the ZK
> >> tick time (it's a non-critical CI/CD pipeline running on an overloaded
> >> ESX). When this happens, the node's shards are registered as down. Then,
> >> when the node is back, recovery takes place and all shard replicas end up
> >> in the active state. Everyone is happy.
> >>
> >> However, we noticed that recovery doesn't take place if the collection was
> >> reloaded and the server hasn't been restarted since. Shards end up in the
> >> down state. Before providing log messages, I wonder if this is a known
> >> issue?
> >>
> >> Reproducing recipe (assume two nodes):
> >> 1. Before starting: restart both solr1 and solr2: all shards are active.
> >> 2. Reload the collection.
> >> 3. Cause a disconnect by freezing the Java process:
> >>    On solr2: kill -SIGSTOP <solr server pid> and then, after 2 min,
> >>    kill -SIGCONT <solr server pid>
> >> 4. solr2 shard replicas are *Down* forever. No recovery.
> >>
> >> If we omit step #2, the cluster recovers as expected.
> >>
> >
>
> --
> Regards,
> Shalin Shekhar Mangar.
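
For anyone trying to reproduce this, step 3 of the recipe quoted above (stop the process, wait, resume it) could be scripted roughly as follows. This is only a sketch: the helper name freeze_process and its arguments are made up for illustration, the 120-second default just mirrors the "2 min" in the recipe, and the Solr server pid still has to be looked up separately.

```shell
#!/bin/sh
# Hypothetical helper (not part of Solr): suspend a process for a while,
# then resume it, to imitate a VM freeze longer than the ZK timeout.
# Usage: freeze_process <pid> [seconds]   (seconds defaults to 120)
freeze_process() {
    pid="$1"
    pause_seconds="${2:-120}"

    # SIGSTOP cannot be caught or ignored, so the JVM stops immediately.
    kill -STOP "$pid" || return 1

    # Stay frozen long enough for the ZooKeeper session to expire.
    sleep "$pause_seconds"

    # SIGCONT resumes the process exactly where it was stopped.
    kill -CONT "$pid"
}
```

Calling freeze_process <solr server pid> 120 on solr2 after reloading the collection should correspond to steps 2 and 3 of the recipe.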