I brought down the whole cluster again and brought up one server at a
time, waiting for it to go green before launching another. Now all the
replicas are OK, including the one that was stuck in perma-recovery
before. I do notice a large amount of network activity (basically
pegging the
Thank you. I do not see any "Caused by" block in the solr.log.
---
2017-02-01 18:37:57.566 INFO
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New
r:core_node34) [c:Worldline2New s:shard22 r:core_node34
I can't debug the code now, but if you access the logs directly (not
from the UI), is there any "Caused by" associated with the recovery
failure exception?
Cheers
On 1 Feb 2017 6:28 p.m., "Joe Obernberger"
wrote:
> In HDFS when a node fails it will leave
In HDFS when a node fails it will leave behind write.lock files in
HDFS. These files have to be manually removed; otherwise the
shards/replicas that have write.lock files left behind will not start.
Since I can't tell which physical node is hosting which shard/replica, I
stop all the nodes,
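For what it's worth, that cleanup can be scripted rather than hunted
down per core. The /solr index root below is an assumption; substitute
your own solr.hdfs.home value, and only do the removal with every Solr
node stopped:

```shell
# Assumed HDFS index root; substitute your solr.hdfs.home value.
INDEX_ROOT=/solr

if command -v hdfs >/dev/null 2>&1; then
    # List every leftover lock file under the index root.
    hdfs dfs -ls -R "$INDEX_ROOT" | awk '/write.lock$/ {print $NF}'
else
    echo "hdfs CLI not found; skipping" >&2
fi

# After reviewing the list, and with every Solr node stopped, remove them:
#   hdfs dfs -ls -R "$INDEX_ROOT" | awk '/write.lock$/ {print $NF}' \
#       | xargs -r -n 1 hdfs dfs -rm
```

The awk filter just pulls the path (the last field of the listing) for
each write.lock entry, so you can eyeball the list before deleting
anything.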
OK, it is clearer now.
You have 9 Solr nodes running, one per physical machine.
So each node has a number of cores (both replicas and leaders).
When the node died, you got a lot of corrupted indexes.
I still don't see why you restarted the other 8 working nodes (I was
expecting you to restart only
Thank you for the response.
There are no virtual machines in the configuration. The collection has
45 shards with 3 replicas each, spread across the 9 physical boxes; each
box is running one copy of Solr. I've tried to restart just the one
node after the other 8 (and all their
Let me try to summarize.
How many virtual machines on top of the 9 physical?
How many Solr processes (replicas?)
If you had 1 node compromised, I assume you have replicas as well, right?
Can you explain your replica configuration a little better?
Why did you have to stop all the nodes?
I
Hi All - I had one node in a 45 shard cluster (9 physical machines) run
out of memory. I stopped all the nodes in the cluster and removed any
write.lock files left lingering in HDFS by the OOM. All the nodes
recovered except one replica of one shard that happens to be on the node
that ran out