We have a 5-node SolrCloud cluster. When one Solr node's disk had an issue and
its RAID5 array became degraded, a recovery was triggered on that node.
However, the recovery hangs and the node disappears from the live_nodes list.

Could anyone comment on why this happens? Thanks!

The only meaningful call stacks are:
"zkCallback-4-thread-50-processing-n:sgdsolar17.swg.usma.ibm.com:8983_solr-EventThread"
#7791 daemon prio=5 os_prio=0 tid=0x00007f7e26467800 nid=0x4df7 waiting on
condition [0x00007f7e01adf000]
java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f8315800070> (a
java.util.concurrent.FutureTask)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
        at java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at
org.apache.solr.update.DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:349)
        - locked <0x00007f7fd0cefd28> (a java.lang.Object)
        at
org.apache.solr.core.CoreContainer.cancelCoreRecoveries(CoreContainer.java:617)
        at org.apache.solr.cloud.ZkController$1.command(ZkController.java:295)
        at
org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:158)
        at
org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:56)
        at
org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:132)
        at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
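The way I read this first stack: the ZooKeeper client's single EventThread is parked in FutureTask.get() with no timeout inside DefaultSolrCoreState.cancelRecovery(), called from the reconnect callback. Since that same event thread has to finish the reconnect work before the node's ephemeral live_nodes entry can be re-created, blocking it indefinitely would explain why the node stays out of live_nodes. Below is a minimal sketch of that blocking pattern only; the class and method names are made up, it is not Solr's actual code.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

public class EventThreadHangSketch {
    public static void main(String[] args) throws Exception {
        // Single event thread, analogous to ZooKeeper's ClientCnxn$EventThread,
        // which processes connection/watch events strictly one at a time.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();

        // Stand-in for the recovery task that never finishes (see the
        // updateExecutor stack below, which is parked on a write lock).
        FutureTask<Void> recoveryFuture = new FutureTask<>(() -> {
            Thread.sleep(Long.MAX_VALUE);
            return null;
        });
        new Thread(recoveryFuture, "recovery").start();

        // Reconnect callback: cancelRecovery() waits on the future with no
        // timeout -- the parked FutureTask.get() frame in the trace above.
        eventThread.submit(() -> {
            try {
                recoveryFuture.get();   // blocks forever
            } catch (Exception ignored) {
            }
        });

        // Any later event (e.g. re-registering the node under /live_nodes)
        // is queued behind the blocked callback and never runs.
        eventThread.submit(() -> System.out.println("would re-create live_nodes entry"));
    }
}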

"updateExecutor-2-thread-620-processing-n:sgdsolar17.swg.usma.ibm.com:8983_solr
x:collection36_shard1_replica2 s:shard1 c:collection36 r:core_node1" #7779
prio=5 os_prio=0 tid=0x00007f7e8827e000 nid=0x4dea waiting on condition
[0x00007f7ed0f9f000]
java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f7fd562e860> (a
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
        at org.apache.solr.update.VersionInfo.blockUpdates(VersionInfo.java:118)
        at
org.apache.solr.update.UpdateLog.dropBufferedUpdates(UpdateLog.java:1140)
        at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:467)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
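Putting the two stacks together, it looks like a wait chain: the zkCallback/EventThread waits on the recovery FutureTask, while the recovery thread itself is parked in VersionInfo.blockUpdates() trying to take the update log's write lock (a fair ReentrantReadWriteLock), presumably because some other thread still holds a read lock on it, e.g. an update stuck on the degraded disk. That reader-stuck-on-I/O scenario is my assumption and not visible in the dump; the sketch below only illustrates the lock pattern, it is not Solr's code.

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockHangSketch {
    public static void main(String[] args) throws Exception {
        // Fair read/write lock, matching the FairSync object in the trace.
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

        // Assumed scenario: an in-flight update holds a read lock and never
        // finishes (e.g. blocked on the degraded RAID5 volume).
        Thread stuckUpdate = new Thread(() -> {
            lock.readLock().lock();
            try {
                Thread.sleep(Long.MAX_VALUE);   // never returns
            } catch (InterruptedException ignored) {
            } finally {
                lock.readLock().unlock();
            }
        }, "stuck-update");
        stuckUpdate.start();
        Thread.sleep(100);  // give the reader time to acquire the lock

        // Equivalent of blockUpdates()/dropBufferedUpdates() in the recovery
        // thread: the write lock cannot be acquired while a read lock is held,
        // so this parks forever -- and the FutureTask that the zkCallback
        // thread is waiting on therefore never completes either.
        System.out.println("recovery: acquiring write lock...");
        lock.writeLock().lock();    // hangs here
        System.out.println("never reached");
    }
}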



