Streaming stuck on one node during Repair

Jake Maizel Fri, 02 Sep 2011 11:55:32 -0700

Hello,

I have one node of a cluster that is stuck in a streaming out state
sending to the node that is being repaired.


If I look at the AE thread in jconsole, I see this trace:

Name: AE-SERVICE-STAGE:1
State: WAITING on java.util.concurrent.FutureTask$Sync@7e3e0044
Total blocked: 0  Total waited: 23

Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
java.util.concurrent.FutureTask.get(FutureTask.java:83)
org.apache.cassandra.service.AntiEntropyService$Differencer.performStreamingRepair(AntiEntropyService.java:515)
org.apache.cassandra.service.AntiEntropyService$Differencer.run(AntiEntropyService.java:475)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

The Stream stage shows this trace:

Name: STREAM-STAGE:1
State: WAITING on org.apache.cassandra.utils.SimpleCondition@1158f928
Total blocked: 9  Total waited: 16

Stack trace:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:38)
org.apache.cassandra.streaming.StreamOutManager.waitForStreamCompletion(StreamOutManager.java:164)
org.apache.cassandra.streaming.StreamOut.transferSSTables(StreamOut.java:138)
org.apache.cassandra.service.AntiEntropyService$Differencer$1.runMayThrow(AntiEntropyService.java:511)
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
java.util.concurrent.FutureTask.run(FutureTask.java:138)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
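
Read together, the two traces suggest a chain of waits: AE-SERVICE-STAGE:1
is parked in FutureTask.get() on the task that STREAM-STAGE:1 is running,
and that task is itself waiting in SimpleCondition.await() for a stream
completion signal that has not arrived. Below is a minimal sketch of that
pattern using only java.util.concurrent. It is not Cassandra code: the
class and method names are hypothetical, a CountDownLatch stands in for
SimpleCondition, and the timed get() is only there to show how such a wait
could be bounded; as far as I can tell from the traces, 0.6.6 uses the
untimed get().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Cassandra.
class StuckStreamDemo {

    // Returns true if the outer wait timed out, i.e. the simulated
    // stream never signalled completion.
    static boolean outerWaitTimesOut(boolean peerAcknowledges, long timeoutMillis) {
        // Plays the role of the single-threaded stream stage.
        ExecutorService streamStage = Executors.newSingleThreadExecutor();
        // Stands in for the condition the stream task waits on.
        CountDownLatch streamComplete = new CountDownLatch(1);
        try {
            // The "transfer" task blocks until completion is signalled.
            Future<?> transfer = streamStage.submit(() -> {
                try {
                    streamComplete.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // If the receiving side never acknowledges, nothing signals.
            if (peerAcknowledges) {
                streamComplete.countDown();
            }

            // The caller (the AE stage in the traces) waits on the future.
            // An untimed get() would block forever when the signal is lost.
            try {
                transfer.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                transfer.cancel(true); // interrupt the waiting task
                return true;
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            streamStage.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("ack received, timed out: " + outerWaitTimesOut(true, 2000));
        System.out.println("ack lost, timed out: " + outerWaitTimesOut(false, 200));
    }
}
```

In the sketch the only ways out of the wait are the completion signal, the
timeout, or an interrupt, which is why I suspect a restart is the only
option on the real node.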

Is there a way to unstick these threads?  Or am I stuck restarting the
node and then rerunning the entire repair?  All the other nodes seem to
have completed properly, and one is still running.  I am thinking of
waiting until the current one finishes, then restarting the stuck node,
and once it's up, running repair again on the node that needs it.

Thoughts?

(0.6.6 on a 7-node cluster)



-- 
Jake Maizel
Head of Network Operations
Soundcloud

Mail & GTalk: j...@soundcloud.com
Skype: jakecloud

Rosenthaler strasse 13, 101 19, Berlin, DE
