Re: Solr 6.3.0 - recovery failed

2017-02-01 Thread Joe Obernberger
I brought down the whole cluster again, and brought up one server at a time, waiting for it to go green before launching another. Now all replicas are OK, including the one that was in the perma-recovery mode before. I do notice a large amount of network activity (basically pegging the

Re: Solr 6.3.0 - recovery failed

2017-02-01 Thread Joe Obernberger
Thank you. I do not see any "caused by" block in the solr.log. --- 2017-02-01 18:37:57.566 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34

Re: Solr 6.3.0 - recovery failed

2017-02-01 Thread Alessandro Benedetti
I can't debug the code now, but if you access the logs directly (not from the UI), is there any "caused by" associated with the recovery failure exception? Cheers On 1 Feb 2017 6:28 p.m., "Joe Obernberger" wrote: > In HDFS when a node fails it will leave

Re: Solr 6.3.0 - recovery failed

2017-02-01 Thread Joe Obernberger
When a node fails, it leaves behind write.lock files in HDFS. These files have to be manually removed; otherwise the shards/replicas that have write.lock files left behind will not start. Since I can't tell which physical node is hosting which shard/replica, I stop all the nodes,
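
A minimal sketch of how the leftover lock files described above might be located before removing them by hand. It assumes the listing comes from `hdfs dfs -ls -R`, where the last whitespace-separated field of each entry is the path. The `/solr` root directory and the helper names are hypothetical and not taken from this thread.

```python
import subprocess


def find_write_locks(ls_output: str) -> list[str]:
    """Return paths of write.lock files found in `hdfs dfs -ls -R` output."""
    locks = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip blank lines and "Found N items" summary lines.
        if len(fields) < 8:
            continue
        path = fields[-1]  # assumption: paths contain no spaces
        if path.endswith("/write.lock"):
            locks.append(path)
    return locks


def list_locks_under(root: str = "/solr") -> list[str]:
    """Run the recursive listing against a (hypothetical) Solr HDFS root."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", root],
        capture_output=True, text=True, check=True,
    )
    return find_write_locks(result.stdout)
```

Each reported path could then be removed with `hdfs dfs -rm <path>`, but only while the Solr node that owns that index is stopped, as in the procedure described in the message.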

Re: Solr 6.3.0 - recovery failed

2017-02-01 Thread Alessandro Benedetti
Ok, it is clearer now. You have 9 Solr nodes running, one per physical machine. So each node has a number of cores (both replicas and leaders). When the node died, a lot of indexes were corrupted. I still don't see why you restarted the other 8 working nodes (I was expecting you to restart only

Re: Solr 6.3.0 - recovery failed

2017-02-01 Thread Joe Obernberger
Thank you for the response. There are no virtual machines in the configuration. The collection has 45 shards with 3 replicas each spread across the 9 physical boxes; each box is running one copy of Solr. I've tried to restart just the one node after the other 8 (and all their
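
With 45 shards of 3 replicas each spread over 9 boxes, it can help to see which node hosts which replica. The sketch below groups replicas by node from a parsed response of the Solr Collections API `CLUSTERSTATUS` action. The function name is hypothetical, and the field names (`node_name`, `core`, `state`) are assumed to match the JSON that the action returns.

```python
from collections import defaultdict


def replicas_by_node(cluster_status: dict, collection: str) -> dict:
    """Group (shard, core, state) tuples by the node hosting each replica."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    by_node = defaultdict(list)
    for shard_name, shard in shards.items():
        for replica in shard["replicas"].values():
            by_node[replica["node_name"]].append(
                (shard_name, replica["core"], replica["state"])
            )
    return dict(by_node)
```

The input would be the decoded JSON from a request such as `/solr/admin/collections?action=CLUSTERSTATUS&collection=Worldline2New&wt=json` against any live node. Replicas whose state is not `active` are the ones worth checking for leftover lock files.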

Re: Solr 6.3.0 - recovery failed

2017-02-01 Thread alessandro.benedetti
Let me try to summarize. How many virtual machines are on top of the 9 physical ones? How many Solr processes (replicas)? If you had 1 node compromised, I assume you have replicas as well, right? Can you explain your replica configuration a little better? Why did you have to stop all the nodes? I

Solr 6.3.0 - recovery failed

2017-02-01 Thread Joe Obernberger
Hi All - I had one node in a 45-shard cluster (9 physical machines) run out of memory. I stopped all the nodes in the cluster and removed any lingering write.lock files in HDFS left over from the OOM. All the nodes recovered except one replica of one shard that happens to be on the node that ran out