We have a 5-node SolrCloud cluster. When one Solr node's disk developed a problem and its RAID 5 array became degraded, a recovery was triggered on that node. The recovery then hung, and the node disappeared from the live_nodes list.
Could anyone help explain why this happens? Thanks!

The only meaningful call stacks are:

"zkCallback-4-thread-50-processing-n:sgdsolar17.swg.usma.ibm.com:8983_solr-EventThread" #7791 daemon prio=5 os_prio=0 tid=0x00007f7e26467800 nid=0x4df7 waiting on condition [0x00007f7e01adf000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00007f8315800070> (a java.util.concurrent.FutureTask)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
    at java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at org.apache.solr.update.DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:349)
    - locked <0x00007f7fd0cefd28> (a java.lang.Object)
    at org.apache.solr.core.CoreContainer.cancelCoreRecoveries(CoreContainer.java:617)
    at org.apache.solr.cloud.ZkController$1.command(ZkController.java:295)
    at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:158)
    at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:56)
    at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:132)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

"updateExecutor-2-thread-620-processing-n:sgdsolar17.swg.usma.ibm.com:8983_solr x:collection36_shard1_replica2 s:shard1 c:collection36 r:core_node1" #7779 prio=5 os_prio=0 tid=0x00007f7e8827e000 nid=0x4dea waiting on condition [0x00007f7ed0f9f000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00007f7fd562e860> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
    at org.apache.solr.update.VersionInfo.blockUpdates(VersionInfo.java:118)
    at org.apache.solr.update.UpdateLog.dropBufferedUpdates(UpdateLog.java:1140)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:467)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

--
View this message in context: http://lucene.472066.n3.nabble.com/Downgraded-Raid5-cause-endless-recovery-and-hang-tp4288677.html
Sent from the Solr - User mailing list archive at Nabble.com.