Re: Solr 6.3.0 - recovery failed
I brought down the whole cluster again and brought up one server at a time, waiting for it to go green before launching another. Now all replicas are OK, including the one that was in perma-recovery mode before. I do notice a large amount of network activity (basically pegging the interface) when a node is brought up. I suspect this is especially true since these nodes are not dataNodes in HDFS.

-Joe
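The one-node-at-a-time, wait-for-green restart can be scripted against the Collections API. A minimal sketch of the "is everything green yet?" check, assuming the nested shape of a CLUSTERSTATUS response (in a real run the dict would come from `http://host:port/solr/admin/collections?action=CLUSTERSTATUS&wt=json` rather than being hand-built):

```python
# Sketch: decide whether every replica of a collection is "active",
# i.e. the cluster is green and the next node can safely be restarted.
# The dict below mirrors the shape of a CLUSTERSTATUS JSON response.

def all_replicas_active(cluster_status: dict, collection: str) -> bool:
    """Return True only if every replica of every shard is in state 'active'."""
    coll = cluster_status["cluster"]["collections"][collection]
    for shard in coll["shards"].values():
        for replica in shard["replicas"].values():
            if replica.get("state") != "active":
                return False
    return True

# Hand-built example status with one replica still recovering:
status = {
    "cluster": {
        "collections": {
            "Worldline2New": {
                "shards": {
                    "shard22": {
                        "replicas": {
                            "core_node33": {"state": "active"},
                            "core_node34": {"state": "recovering"},
                        }
                    }
                }
            }
        }
    }
}
print(all_replicas_active(status, "Worldline2New"))  # False
```

A restart script would poll this in a loop after starting each node and only move on once it returns True.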
Re: Solr 6.3.0 - recovery failed
Thank you. I do not see any "Caused by" block in the solr.log.

---
2017-02-01 18:37:57.566 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Replay not started, or was not successful... still buffering updates.
2017-02-01 18:37:57.566 ERROR (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Recovery failed - trying again... (50)
2017-02-01 18:37:57.566 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Wait [12.0] seconds before trying to recover again (attempt=51)
2017-02-01 18:38:57.567 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Begin buffering updates. core=[Worldline2New_shard22_replica2]
2017-02-01 18:38:57.567 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.u.UpdateLog Restarting buffering. previous=RecoveryInfo{adds=0 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
2017-02-01 18:38:57.567 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.u.UpdateLog Starting to buffer updates. HDFSUpdateLog{state=BUFFERING, tlog=null}
2017-02-01 18:38:57.567 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Publishing state of core [Worldline2New_shard22_replica2] as recovering, leader is [http://cordelia:9100/solr/Worldline2New_shard22_replica1/] and I am [http://bilbo:9100/solr/Worldline2New_shard22_replica2/]
2017-02-01 18:38:57.586 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery command to [http://cordelia:9100/solr]; [WaitForState: action=PREPRECOVERY&core=Worldline2New_shard22_replica1&nodeName=bilbo:9100_solr&coreNodeName=core_node34&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
2017-02-01 18:38:57.644 INFO (zkCallback-5-thread-49-processing-n:bilbo:9100_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/Worldline2New/state.json] for collection [Worldline2New] has occurred - updating... (live nodes size: [9])
2017-02-01 18:39:04.594 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Starting Replication Recovery.
2017-02-01 18:39:04.594 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Attempting to replicate from [http://cordelia:9100/solr/Worldline2New_shard22_replica1/].
2017-02-01 18:39:04.604 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Master's generation: 12398
2017-02-01 18:39:04.612 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Master's version: 1485965089535
2017-02-01 18:39:04.612 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New r:core_node34) [c:Worldline2New s:shard22 r:core_node34 x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Slave's generation: 12358
2017-02-01 18:39:04.612 INFO (recoveryExecutor-3-thread-8-processing-n:bilbo:9100
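The tail of this excerpt shows why a full replication recovery kicks in: the follower ("Slave") is at index generation 12358 while the leader ("Master") is at 12398, so the IndexFetcher has to copy segments across. As an illustration only (not Solr's own logic), the gap can be pulled out of IndexFetcher log lines:

```python
import re

# Illustrative parser: extract the leader ("Master's") and follower
# ("Slave's") index generations from IndexFetcher log lines and report
# how far behind the recovering replica is.
GEN_RE = re.compile(r"(Master|Slave)'s generation: (\d+)")

def generation_gap(log_text: str) -> int:
    """Return leader generation minus follower generation."""
    gens = {who: int(num) for who, num in GEN_RE.findall(log_text)}
    return gens["Master"] - gens["Slave"]

log = """\
o.a.s.h.IndexFetcher Master's generation: 12398
o.a.s.h.IndexFetcher Slave's generation: 12358
"""
print(generation_gap(log))  # 40
```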
Re: Solr 6.3.0 - recovery failed
I can't debug the code now, but if you access the logs directly (not from the UI), is there any "Caused by" associated with the recovery failure exception?

Cheers
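The kind of root-cause scan Alessandro is asking for can be done with `grep -A5 'Caused by' solr.log` from a shell; a sketch of the same idea in Python (the sample log text and its hypothetical root cause are made up for illustration):

```python
def caused_by_lines(log_text: str, context: int = 2) -> list[str]:
    """Return each 'Caused by:' line plus a few lines of following context."""
    lines = log_text.splitlines()
    out = []
    for i, line in enumerate(lines):
        if "Caused by" in line:
            out.extend(lines[i : i + 1 + context])
    return out

# Hypothetical excerpt; the actual solr.log would be read from disk.
sample = """\
ERROR o.a.s.c.RecoveryStrategy Recovery failed - trying again... (50)
org.apache.solr.common.SolrException: Replication for recovery failed.
Caused by: java.io.IOException: (hypothetical root cause)
    at org.example.Fake.method(Fake.java:1)
"""
for hit in caused_by_lines(sample, context=1):
    print(hit)
```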
Re: Solr 6.3.0 - recovery failed
In HDFS, when a node fails it will leave behind write.lock files. These files have to be manually removed; otherwise the shards/replicas that have write.lock files left behind will not start. Since I can't tell which physical node is hosting which shard/replica, I stop all the nodes, delete all the write.lock files in HDFS, and restart.

You are correct - only one replica is failing to start. The other replicas on the same physical node are coming up OK. A picture is worth a thousand words, so:
http://lovehorsepower.com/images/Cluster1.jpg

Errors:
http://lovehorsepower.com/images/ClusterSolr2.jpg

-Joe
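Since the lock files live in HDFS, they can be found without knowing which node hosts which core. A sketch that filters an `hdfs dfs -ls -R <index root>` listing down to the lock files (the paths below are hypothetical; deleting them with `hdfs dfs -rm` is only safe once every Solr node is stopped):

```python
def find_write_locks(ls_output: str) -> list[str]:
    """Return the paths of write.lock files from `hdfs dfs -ls -R` output.

    The path is the last whitespace-separated field of each listing line.
    """
    locks = []
    for line in ls_output.splitlines():
        fields = line.split()
        if fields and fields[-1].endswith("write.lock"):
            locks.append(fields[-1])
    return locks

# Example listing with made-up paths; each returned path would then be
# removed with `hdfs dfs -rm <path>` while all Solr nodes are down.
listing = """\
-rw-r--r--   3 solr solr          0 2017-02-01 12:10 /solr/Worldline2New/core_node34/data/index/write.lock
-rw-r--r--   3 solr solr    1048576 2017-02-01 12:10 /solr/Worldline2New/core_node34/data/index/segments_1
"""
print(find_write_locks(listing))
```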
Re: Solr 6.3.0 - recovery failed
Ok, it is clearer now. You have 9 Solr nodes running, one per physical machine, so each node has a number of cores (both replicas and leaders). When the node died, you got a lot of indexes corrupted. I still don't see why you restarted the other 8 working nodes (I was expecting you to restart only the failed one).

When you mention that only one replica is failing, do you mean that the Solr node is up and running and only one Solr core (the replica of one shard) keeps failing? Or are all the local cores on that node failing to recover?

Cheers
Re: Solr 6.3.0 - recovery failed
Thank you for the response.

There are no virtual machines in the configuration. The collection has 45 shards with 3 replicas each, spread across the 9 physical boxes; each box is running one copy of Solr. I've tried to restart just the one node after the other 8 (and all their shards/replicas) came up, but this one replica seems to be in perma-recovery.

Shard Count: 45
replicationFactor: 3
maxShardsPerNode: 50
router: compositeId
autoAddReplicas: false

SOLR_JAVA_MEM options are -Xms16g -Xmx32g

_TUNE is:
"-XX:+UseG1GC \
-XX:MaxDirectMemorySize=8g \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=32m \
-XX:MaxGCPauseMillis=500 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:ParallelGCThreads=16 \
-XX:+UseLargePages \
-XX:-ResizePLAB \
-XX:+AggressiveOpts"

So far it has retried 22 times. The cluster is accessible and OK, but I'm afraid to continue indexing data if this one node will never come back. Thanks for the help!

-Joe
Re: Solr 6.3.0 - recovery failed
Let me try to summarize.
How many virtual machines on top of the 9 physical? How many Solr processes (replicas)?

If you had 1 node compromised, I assume you have replicas as well, right?

Can you explain your replica configuration a little better? Why did you have to stop all the nodes?

I would expect you to stop the failing Solr node, clean up its index, and restart; it would automatically recover from the leader. Something is suspicious here - let us know!

Cheers

---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
Sent from the Solr - User mailing list archive at Nabble.com.
Solr 6.3.0 - recovery failed
Hi All - I had one node in a 45 shard cluster (9 physical machines) run out of memory. I stopped all the nodes in the cluster and removed any write.lock files left lingering in HDFS from the OOM. All the nodes recovered except one replica of one shard that happens to be on the node that ran out of memory. The error is:

Error while trying to recover: org.apache.solr.common.SolrException: Replication for recovery failed.
    at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:159)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:408)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Anything I can check? The index is stored in HDFS. It seems to keep looping, retrying over and over. Thank you!

-Joe
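For scale: 45 shards at replicationFactor 3 (mentioned later in the thread) is 135 cores spread over 9 machines, roughly 15 cores per node, so a single OOM-killed machine carries one replica of up to a third of the shards. The arithmetic:

```python
# Layout arithmetic for the cluster described in this thread.
shards = 45
replication_factor = 3
nodes = 9

total_cores = shards * replication_factor  # cores in the whole collection
cores_per_node = total_cores // nodes      # cores hosted on each machine
# One machine holds at most one replica each of up to cores_per_node shards,
# so losing it degrades up to a third of the shards at once.

print(total_cores, cores_per_node)  # 135 15
```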