[ 
https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Onnen updated CASSANDRA-1766:
----------------------------------

    Attachment: CASSANDRA-1766.patch

Not sure it's exactly related, but I encountered an issue where a stream failed 
after AE (anti-entropy repair) and was left wedged, with the following stack trace:

"STREAM-STAGE:1" prio=10 tid=0x00007ff2440a5800 nid=0x3c3c in Object.wait() 
[0x00007ff24a21f000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00007ff28884fad8> (a org.apache.cassandra.utils.SimpleCondition)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:38)
        - locked <0x00007ff28884fad8> (a org.apache.cassandra.utils.SimpleCondition)
        at org.apache.cassandra.streaming.StreamOutManager.waitForStreamCompletion(StreamOutManager.java:164)
        at org.apache.cassandra.streaming.StreamOut.transferSSTables(StreamOut.java:138)
        at org.apache.cassandra.service.AntiEntropyService$Differencer$1.runMayThrow(AntiEntropyService.java:511)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

We suspect that this occurred because the destination node was in drain state, 
although from reading the code it appears that any failed stream where the 
destination goes away would be susceptible to this issue. In this case, the 
StreamManager will never unblock, making subsequent repairs to any node that was 
pending transfer impossible.
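
To illustrate the wedge described above, here is a minimal sketch (not the 
attached patch, and not Cassandra code) of a completion object that releases 
its waiter on failure as well as on success. The class and method names are 
hypothetical, and it assumes something upstream notices that the destination 
went away and calls fail().

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the condition a stream source blocks on.
// None of these names come from the Cassandra code base.
class StreamCompletionSketch {
    private final CountDownLatch finished = new CountDownLatch(1);
    private volatile boolean succeeded = false;

    // Destination acknowledged the last file.
    void complete() {
        succeeded = true;
        finished.countDown();
    }

    // Destination went away, or an operator cancelled the transfer.
    void fail() {
        finished.countDown();
    }

    // Blocks until complete() or fail() is called; returns true only on success.
    boolean awaitOutcome() {
        try {
            finished.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        return succeeded;
    }
}
```

If only complete() ever signals the condition, a waiter whose destination 
disappears blocks forever, which matches the WAITING state in the trace above.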

I've attached a patch that smooths out some possible streaming issues:

* Catches streaming errors. Near as I can tell, if an error occurred during 
streaming because the remote node went away, it would bubble all the way out of 
the executor and not even be logged. Worse, it would keep the current pending 
file wedged and never allow it to be cleared. This patch will remove the failed 
transfer when an IOException occurs. It could be that it should be more general.
* Allows for manual purging of pending files to a host via JMX which means 
un-sticking a wedged transfer no longer requires a restart of that node. It 
also unfortunately results in removal of the file which could require 
anti-compaction again but this was the least painful path through the code.
* Corrects an unlikely but potentially fatal scenario where concurrent 
mutation of and reads from the file and fileMap references could result in 
dirty reads. Both are now concurrency-safe collections. The only way I could 
see this happening is if someone were to run repair multiple times in 
succession while streaming was happening. That is unlikely but possible, and 
the effects of unsafe map reads can include a completely unresponsive JVM.
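
The three points above could be sketched roughly as follows. This is only an 
illustration under assumptions, not the contents of the attached patch: 
PendingTransfersSketch, Sender, streamNext and purge are hypothetical names, 
the JMX wiring is omitted, and it is written for Java 8 or later.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

// Hypothetical tracker of files queued for streaming, keyed by destination host.
class PendingTransfersSketch {
    // Hypothetical hook standing in for the real network transfer.
    interface Sender {
        void send(String host, String file) throws IOException;
    }

    // Concurrency-safe collections, so overlapping repairs cannot corrupt the map.
    private final ConcurrentMap<String, Queue<String>> pending = new ConcurrentHashMap<>();

    void enqueue(String host, String file) {
        pending.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>()).add(file);
    }

    // Streams the head file for a host; returns true on success.
    // The file leaves the pending queue either way, so a failure cannot wedge it.
    boolean streamNext(String host, Sender sender) {
        Queue<String> files = pending.get(host);
        String file = (files == null) ? null : files.peek();
        if (file == null) {
            return false; // nothing pending for this host
        }
        try {
            sender.send(host, file);
            return true;
        } catch (IOException e) {
            // Log instead of letting the error vanish inside the executor.
            System.err.println("Streaming " + file + " to " + host + " failed: " + e);
            return false;
        } finally {
            files.remove(file);
        }
    }

    // What a management operation could delegate to: drop everything pending for a host.
    List<String> purge(String host) {
        Queue<String> files = pending.remove(host);
        return (files == null) ? new ArrayList<>() : new ArrayList<>(files);
    }

    int pendingCount(String host) {
        Queue<String> files = pending.get(host);
        return (files == null) ? 0 : files.size();
    }
}
```

Removing the file in the finally block is a simplification. As noted above, 
dropping a pending file may mean it has to be regenerated before the transfer 
can be retried.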


I'm not entirely sure this is the right thing to do, but I thought I'd float it 
out there for review. Whatever the correct fix, I think there needs to be a way 
to cancel pending streams so that they aren't stuck.

> Streaming never makes progress
> ------------------------------
>
>                 Key: CASSANDRA-1766
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1766
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.6.7
>            Reporter: Brandon Williams
>             Fix For: 0.6.9
>
>         Attachments: CASSANDRA-1766.patch
>
>
> I have a client that can never complete a bootstrap.  AC finishes, streaming 
> begins.  Stream initiate completes, and the sources wait on the transfer to 
> finish, but progress is never made on any stream.  Nodetool reports streaming 
> is happening, the socket is held open, but nothing happens.

