I can't debug the code right now, but if you access the logs directly (not from the UI), is there any "Caused by" associated with the recovery-failure exception? Cheers
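A minimal sketch of what to look for: the root cause usually sits on the "Caused by:" line of the stack trace. The sample trace below is invented for illustration; on a real node you would run the grep against solr.log itself (e.g. `grep -A1 'Caused by:' /var/solr/logs/solr.log` — the log path depends on your install).

```shell
# Invented sample trace, standing in for a real solr.log excerpt:
sample_trace='org.apache.solr.common.SolrException: Error opening new searcher
	at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java)
Caused by: org.apache.lucene.store.LockObtainFailedException: Index locked for write'

# Pull out the root-cause line:
root_cause=$(printf '%s\n' "$sample_trace" | grep 'Caused by:')
echo "$root_cause"
```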
On 1 Feb 2017 6:28 p.m., "Joe Obernberger" <joseph.obernber...@gmail.com> wrote:

> In HDFS, when a node fails it leaves behind write.lock files. These files
> have to be removed manually; otherwise the shards/replicas that have
> write.lock files left behind will not start. Since I can't tell which
> physical node is hosting which shard/replica, I stop all the nodes, delete
> all the write.lock files in HDFS, and restart.
>
> You are correct - only one replica is failing to start. The other replicas
> on the same physical node are coming up OK. A picture is worth a thousand
> words, so:
> http://lovehorsepower.com/images/Cluster1.jpg
>
> Errors:
> http://lovehorsepower.com/images/ClusterSolr2.jpg
>
> -Joe
>
> On 2/1/2017 1:20 PM, Alessandro Benedetti wrote:
>
>> Ok, it is clearer now.
>> You have 9 Solr nodes running, one per physical machine.
>> So each node hosts a number of cores (both replicas and leaders).
>> When the node died, a lot of your indexes got corrupted.
>> I still don't see why you restarted the other 8 working nodes (I was
>> expecting you to restart only the failed one).
>>
>> When you mention that only one replica is failing, do you mean that the
>> Solr node is up and running and only one Solr core (the replica of one
>> shard) keeps failing?
>> Or are all the local cores on that node failing to recover?
>>
>> Cheers
>>
>> On 1 Feb 2017 6:07 p.m., "Joe Obernberger" <joseph.obernber...@gmail.com>
>> wrote:
>>
>> Thank you for the response.
>> There are no virtual machines in the configuration. The collection has 45
>> shards with 3 replicas each, spread across the 9 physical boxes; each box
>> is running one copy of Solr. I've tried to restart just the one node after
>> the other 8 (and all their shards/replicas) came up, but this one replica
>> seems to be stuck in perma-recovery.
>>
>> Shard Count: 45
>> replicationFactor: 3
>> maxShardsPerNode: 50
>> router: compositeId
>> autoAddReplicas: false
>>
>> SOLR_JAVA_MEM options are -Xms16g -Xmx32g
>>
>> _TUNE is:
>> "-XX:+UseG1GC \
>> -XX:MaxDirectMemorySize=8g \
>> -XX:+PerfDisableSharedMem \
>> -XX:+ParallelRefProcEnabled \
>> -XX:G1HeapRegionSize=32m \
>> -XX:MaxGCPauseMillis=500 \
>> -XX:InitiatingHeapOccupancyPercent=75 \
>> -XX:ParallelGCThreads=16 \
>> -XX:+UseLargePages \
>> -XX:-ResizePLAB \
>> -XX:+AggressiveOpts"
>>
>> So far it has retried 22 times. The cluster is accessible and OK, but I'm
>> afraid to continue indexing data if this one node will never come back.
>> Thanks for the help!
>>
>> -Joe
>>
>> On 2/1/2017 12:58 PM, alessandro.benedetti wrote:
>>
>>> Let me try to summarize.
>>> How many virtual machines on top of the 9 physical ones?
>>> How many Solr processes (replicas)?
>>>
>>> If you had 1 node compromised,
>>> I assume you have replicas as well, right?
>>>
>>> Can you explain your replica configuration a little better?
>>> Why did you have to stop all the nodes?
>>>
>>> I would expect stopping the failing Solr node, cleaning up its index, and
>>> restarting it.
>>> It would then recover automatically from the leader.
>>>
>>> Something is suspicious here, let us know!
>>>
>>> Cheers
>>>
>>> -----
>>> Alessandro Benedetti
>>> Search Consultant, R&D Software Engineer, Director
>>> Sease Ltd. - www.sease.io
>>> --
>>> View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
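The manual write.lock cleanup Joe describes could be sketched roughly as below. The `/solr` HDFS path is an assumption (it depends on your HdfsDirectoryFactory configuration); against a live cluster, with the affected nodes stopped, the whole thing would be `hdfs dfs -ls -R /solr | awk '/write.lock$/ {print $NF}' | xargs -r -n1 hdfs dfs -rm`. Here the path-extraction step is run over an invented listing so the pipeline can be exercised without a cluster:

```shell
# Invented sample of `hdfs dfs -ls -R /solr` output (the real listing
# would cover every core's data/index directory):
sample_listing='-rw-r--r-- 3 solr solr 0 2017-02-01 12:00 /solr/collection1/core_node1/data/index/write.lock
-rw-r--r-- 3 solr solr 1024 2017-02-01 12:00 /solr/collection1/core_node1/data/index/segments_2'

# Keep only the stale lock files; the last field of each line is the path.
locks=$(printf '%s\n' "$sample_listing" | awk '/write.lock$/ {print $NF}')
echo "$locks"
```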