I brought down the whole cluster again and brought up one server at a time, waiting for it to go green before launching the next. Now all replicas are OK, including the one that was stuck in perma-recovery before. I do notice a large amount of network activity (basically pegging the interface) when a node is brought up. I suspect this is made worse by the fact that these nodes are not dataNodes in HDFS, so all of the index data has to come over the network.
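
For anyone following the same rolling-restart procedure, the "wait for green" step can be scripted by polling the Collections API between restarts.  This is only a sketch; the host, port, and collection name ("mycollection") are placeholders:

   # loop until no replica reports recovering/down/recovery_failed
   while curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&indent=on&wt=json" \
       | grep -qE '"state":"(recovering|down|recovery_failed)"'; do
     sleep 10
   done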

-Joe


On 2/1/2017 1:37 PM, Alessandro Benedetti wrote:
I can't debug the code now, but if you access the logs directly (not
from the UI), is there any "Caused by" associated with the recovery
failure exception?
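
(For what it's worth, a quick way to pull those out from the shell - the log path below is just the default for a service-style install, so adjust it to your SOLR_LOGS_DIR:)

   grep -n "RecoveryStrategy" /var/solr/logs/solr.log | tail
   grep -B 5 -A 20 "Caused by" /var/solr/logs/solr.log | less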
Cheers

On 1 Feb 2017 6:28 p.m., "Joe Obernberger" <joseph.obernber...@gmail.com>
wrote:

When a node fails, it leaves behind write.lock files in HDFS.
These files have to be removed manually; otherwise the shards/replicas that
still have write.lock files will not start.  Since I can't tell which
physical node is hosting which shard/replica, I stop all the nodes, delete
all the write.lock files in HDFS, and restart.
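
For reference, the cleanup looks roughly like this; the HDFS index root (/solr) and the example path are placeholders for whatever solr.hdfs.home points at:

   # list any leftover lock files under the Solr HDFS directory
   hdfs dfs -ls -R /solr | grep write.lock
   # with Solr stopped, remove them, e.g.:
   hdfs dfs -rm /solr/mycollection/core_node7/data/index/write.lock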

You are correct - only one replica is failing to start.  The other
replicas on the same physical node are coming up OK.  A picture is worth a
thousand words, so:
http://lovehorsepower.com/images/Cluster1.jpg

Errors:
http://lovehorsepower.com/images/ClusterSolr2.jpg

-Joe

On 2/1/2017 1:20 PM, Alessandro Benedetti wrote:

OK, it is clearer now.
You have 9 Solr nodes running, one per physical machine.
So each node hosts a number of cores (both replicas and leaders).
When the node died, you got a lot of corrupted indexes.
I still don't understand why you restarted the other 8 working nodes (I was
expecting you to restart only the failed one).

When you mention that only one replica is failing, do you mean that the
Solr node is up and running and only one Solr core (the replica of one
shard) keeps failing?
Or are all the local cores on that node failing to recover?
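
(One quick way to answer that is to dump the cluster state and look at what is reported for the replicas on that node; the host and port here are placeholders:)

   curl -s "http://node7:8983/solr/admin/collections?action=CLUSTERSTATUS&indent=on&wt=json" \
       | grep -E '"core"|"node_name"|"state"'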

Cheers

On 1 Feb 2017 6:07 p.m., "Joe Obernberger" <joseph.obernber...@gmail.com>
wrote:

Thank you for the response.
There are no virtual machines in the configuration.  The collection has 45
shards with 3 replicas each spread across the 9 physical boxes; each box is
running one copy of Solr.  I've tried to restart just the one node after
the other 8 (and all their shards/replicas) came up, but this one replica
seems to be stuck in perma-recovery.

Shard Count: 45
replicationFactor: 3
maxShardsPerNode: 50
router: compositeId
autoAddReplicas: false
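
For reference, those settings correspond to a Collections API CREATE call along these lines; the collection name and config set name are placeholders:

   curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=45&replicationFactor=3&maxShardsPerNode=50&router.name=compositeId&autoAddReplicas=false&collection.configName=myconfig"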

SOLR_JAVA_MEM options are -Xms16g -Xmx32g

_TUNE is:
"-XX:+UseG1GC \
-XX:MaxDirectMemorySize=8g \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=32m \
-XX:MaxGCPauseMillis=500 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:ParallelGCThreads=16 \
-XX:+UseLargePages \
-XX:-ResizePLAB \
-XX:+AggressiveOpts"

So far it has retried 22 times.  The cluster is accessible and OK, but I'm
afraid to continue indexing data if this one node is never going to come
back.  Thanks for the help!

-Joe



On 2/1/2017 12:58 PM, alessandro.benedetti wrote:

Let me try to summarize.
How many virtual machines are on top of the 9 physical machines?
How many Solr processes (replicas)?

You had 1 node compromised.
I assume you have replicas as well, right?

Can you explain your replica configuration a little better?
Why did you have to stop all the nodes?

I would expect you to stop the failing Solr node, clean up its index, and
restart it.
It would then automatically recover from the leader.
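
(A rough sketch of that sequence for one failed core, assuming the index lives in HDFS under /solr and a service-style install; all the names are placeholders:)

   sudo service solr stop     # on the affected node only
   # move the bad index aside rather than deleting it outright
   hdfs dfs -mv /solr/mycollection/core_node3/data/index \
               /solr/mycollection/core_node3/data/index.bad
   sudo service solr start
   # the core should then pull a fresh copy from the shard leader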

Something is suspicious here, let us know!

Cheers



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io