Thank you for the response.
There are no virtual machines in the configuration. The collection has 45 shards with 3 replicas each, spread across the 9 physical boxes; each box runs a single Solr instance. I tried restarting just the one node after the other 8 (and all their shards/replicas) had come up, but this one replica seems to be stuck in recovery permanently.

Shard Count: 45
replicationFactor: 3
maxShardsPerNode: 50
router: compositeId
autoAddReplicas: false
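
For scale, here is a rough back-of-the-envelope sketch of what these settings imply per box. It assumes the replicas are spread evenly across the 9 boxes, which is an assumption; the actual placement may differ.

```python
# Figures taken from the settings listed above.
shards = 45
replication_factor = 3
nodes = 9
max_shards_per_node = 50

# Every shard has replication_factor copies, so this is the total core count.
total_replicas = shards * replication_factor

# Assumption: replicas are distributed evenly across the physical boxes.
replicas_per_node = total_replicas / nodes

# Check that the even spread stays under the maxShardsPerNode limit.
within_limit = replicas_per_node <= max_shards_per_node

print(total_replicas, replicas_per_node, within_limit)
```

Under that assumption each Solr instance hosts 15 of the 135 replicas, well below the maxShardsPerNode limit of 50.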

SOLR_JAVA_MEM options are -Xms16g -Xmx32g

_TUNE is:
"-XX:+UseG1GC \
-XX:MaxDirectMemorySize=8g \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=32m \
-XX:MaxGCPauseMillis=500 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:ParallelGCThreads=16 \
-XX:+UseLargePages \
-XX:-ResizePLAB \
-XX:+AggressiveOpts"

So far it has retried 22 times. The cluster is accessible and OK, but I'm afraid to continue indexing data if this one node will never come back.
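
One way to see which replicas are not yet active is the Collections API CLUSTERSTATUS call. Below is a minimal Python sketch of that check, not a definitive tool. The base URL, the collection name "mycollection", and the shard, replica, and node names in the sample are hypothetical placeholders, and the sample response is hand-written to mirror the shape CLUSTERSTATUS returns.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_cluster_status(base_url, collection):
    """Fetch CLUSTERSTATUS as a dict; base_url looks like http://host:8983/solr."""
    params = urlencode(
        {"action": "CLUSTERSTATUS", "collection": collection, "wt": "json"}
    )
    with urlopen(f"{base_url}/admin/collections?{params}") as resp:
        return json.load(resp)


def non_active_replicas(status, collection):
    """Return (shard, replica, node, state) for each replica not in state 'active'."""
    shards = status["cluster"]["collections"][collection]["shards"]
    found = []
    for shard_name, shard in sorted(shards.items()):
        for replica_name, replica in sorted(shard["replicas"].items()):
            state = replica.get("state")
            if state != "active":
                found.append(
                    (shard_name, replica_name, replica.get("node_name"), state)
                )
    return found


# Hypothetical, hand-written sample in the CLUSTERSTATUS response layout.
sample = {
    "cluster": {
        "collections": {
            "mycollection": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {
                                "state": "active",
                                "node_name": "box1:8983_solr",
                            },
                        }
                    },
                    "shard2": {
                        "replicas": {
                            "core_node4": {
                                "state": "active",
                                "node_name": "box2:8983_solr",
                            },
                            "core_node5": {
                                "state": "recovering",
                                "node_name": "box9:8983_solr",
                            },
                        }
                    },
                }
            }
        }
    }
}

stuck = non_active_replicas(sample, "mycollection")
print(stuck)
```

Against a live cluster, the result of fetch_cluster_status(...) would be passed in place of the sample; polling it every so often would show whether the replica ever leaves the recovering state.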
Thanks for the help!

-Joe


On 2/1/2017 12:58 PM, alessandro.benedetti wrote:
Let me try to summarize.
How many virtual machines are running on top of the 9 physical boxes?
How many Solr processes (replicas)?

If you had 1 node compromised,
I assume you have replicas as well, right?

Can you explain your replica configuration in a little more detail?
Why did you have to stop all the nodes?

I would expect you to stop the failing Solr node, clean up its index, and
restart it.
It would then recover automatically from the leader.

Something is suspicious here, let us know!

Cheers



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
Sent from the Solr - User mailing list archive at Nabble.com.
