[ https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Erik Onnen updated CASSANDRA-1766:
----------------------------------

    Attachment: CASSANDRA-1766.patch

I'm not sure it's exactly related, but I encountered an issue where a stream failed after anti-entropy (AE) and was left wedged with the following stack trace:

"STREAM-STAGE:1" prio=10 tid=0x00007ff2440a5800 nid=0x3c3c in Object.wait() [0x00007ff24a21f000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00007ff28884fad8> (a org.apache.cassandra.utils.SimpleCondition)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:38)
        - locked <0x00007ff28884fad8> (a org.apache.cassandra.utils.SimpleCondition)
        at org.apache.cassandra.streaming.StreamOutManager.waitForStreamCompletion(StreamOutManager.java:164)
        at org.apache.cassandra.streaming.StreamOut.transferSSTables(StreamOut.java:138)
        at org.apache.cassandra.service.AntiEntropyService$Differencer$1.runMayThrow(AntiEntropyService.java:511)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

We suspect this occurred because the destination node was in the drain state, although from reading the code it appears that any failed stream where the destination goes away would be susceptible to this issue. In this case, the StreamOutManager never unblocks, which makes subsequent repairs to any node that was pending transfer impossible.

I've attached a patch that smooths out some possible streaming issues:

* Catches streaming errors. As near as I can tell, if an error occurred during streaming because the remote node went away, it would bubble all the way out of the executor and not even be logged. Worse, it would keep the current pending file wedged and never allow it to be cleared. This patch removes the failed transfer when an IOException occurs. Arguably the handling should be more general.

* Allows manual purging of pending files to a host via JMX, which means un-sticking a wedged transfer no longer requires a restart of that node. Unfortunately it also removes the file, which could require anti-compaction again, but this was the least painful path through the code.

* Corrects an unlikely but potentially fatal scenario in which concurrent mutation of, and reads from, the file and fileMap references could produce dirty reads, by making both concurrency-safe collections. The only way I could see this happening is if someone ran repair multiple times in succession while streaming was in progress. That is unlikely but possible, and unsafe map reads can leave the JVM completely unresponsive.

I'm not entirely sure this is the right thing to do, but I thought I'd float it out there for review. Whatever the correct fix, I think there needs to be a way to cancel pending streams so that they aren't stuck.

> Streaming never makes progress
> ------------------------------
>
>                 Key: CASSANDRA-1766
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1766
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.6.7
>            Reporter: Brandon Williams
>             Fix For: 0.6.9
>
>         Attachments: CASSANDRA-1766.patch
>
>
> I have a client that can never complete a bootstrap.  AC finishes, streaming
> begins.  Stream initiate completes, and the sources wait on the transfer to
> finish, but progress is never made on any stream.  Nodetool reports streaming
> is happening, the socket is held open, but nothing happens.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
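
To illustrate the "catches streaming errors" point in the comment above, here is a minimal sketch of the general pattern, not the attached patch. Every name in it (StreamFailureSketch, Transfer, transferAndWait) is hypothetical, and it uses a plain CountDownLatch and current Java syntax rather than Cassandra's SimpleCondition. It assumes the failure surfaces as an IOException inside the worker task: the task logs the error and always signals completion, and the waiter uses a bounded wait so it cannot block forever.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; none of these names come from Cassandra.
class StreamFailureSketch {

    /** Stand-in for a file transfer that may fail when the peer goes away. */
    public interface Transfer {
        void run() throws IOException;
    }

    /**
     * Runs the transfer on a worker thread and waits for it to finish.
     * Returns true only if the transfer completed without error in time.
     */
    public static boolean transferAndWait(Transfer transfer, long timeoutMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch done = new CountDownLatch(1);
        AtomicBoolean succeeded = new AtomicBoolean(false);
        try {
            executor.submit(() -> {
                try {
                    transfer.run();
                    succeeded.set(true);
                } catch (IOException e) {
                    // Without this catch the exception would be captured by the
                    // Future returned from submit() and never logged.
                    System.err.println("stream failed: " + e.getMessage());
                } finally {
                    // Always signal, so the waiting thread is released on failure too.
                    done.countDown();
                }
            });
            // Bounded wait: a lost signal cannot wedge the caller indefinitely.
            boolean signalled = done.await(timeoutMillis, TimeUnit.MILLISECONDS);
            return signalled && succeeded.get();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(transferAndWait(() -> { }, 1000));
        System.out.println(transferAndWait(() -> {
            throw new IOException("destination went away");
        }, 1000));
    }
}
```

The important part is the finally block: whichever way the task ends, the waiter is released, which is the property the wedged STREAM-STAGE thread in the trace lacked.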
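
The second and third bullets in the comment above (manual purging of pending files, and concurrency-safe collections) can be sketched together. This is an assumption-laden illustration, not the patch itself: PendingFilesSketch, addPending, getPending, and purge are hypothetical names, and the host and file strings are made up. The purge method is the kind of operation that could be exposed over JMX; the JMX registration itself is left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration only; none of these names come from Cassandra.
class PendingFilesSketch {

    // host -> files queued for transfer to that host. Both the map and the
    // per-host queues are concurrency-safe, so a reader never observes a
    // half-updated structure while another thread is mutating it.
    private final ConcurrentMap<String, Queue<String>> pending = new ConcurrentHashMap<>();

    /** Queues a file for transfer to the given host. */
    public void addPending(String host, String file) {
        pending.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>()).add(file);
    }

    /** Returns a snapshot of the files currently queued for the host. */
    public List<String> getPending(String host) {
        Queue<String> files = pending.get(host);
        return files == null ? new ArrayList<>() : new ArrayList<>(files);
    }

    /**
     * Drops everything queued for the host and returns how many entries were
     * removed. A file added concurrently with a purge may be dropped as well;
     * a real implementation would need to decide how to handle that race.
     */
    public int purge(String host) {
        Queue<String> files = pending.remove(host);
        return files == null ? 0 : files.size();
    }

    public static void main(String[] args) {
        PendingFilesSketch registry = new PendingFilesSketch();
        registry.addPending("10.0.0.1", "example-1-Data.db"); // made-up names
        registry.addPending("10.0.0.1", "example-2-Data.db");
        System.out.println("purged " + registry.purge("10.0.0.1") + " pending files");
        System.out.println("remaining: " + registry.getPending("10.0.0.1"));
    }
}
```

Removing the whole per-host entry in one atomic step keeps the purge simple, at the cost noted in the comment: the dropped files have to be regenerated before streaming can be retried.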