[ https://issues.apache.org/jira/browse/CASSANDRA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903538#action_12903538 ]
Nick Bailey commented on CASSANDRA-1216:
----------------------------------------

After some more thinking, I think there are two problems here.

* The timeout for waiting on a stream to complete - An arbitrary timeout here is not the right way to do this. What we really need is the concept of stream progress: we should be able to verify whether a stream is progressing and, based on that, retry it. CASSANDRA-1438 relates to this problem and could be modified to implement it.

* The timeout waiting for nodes to confirm replication - Ideally there would be no timeout here. The problem, though, is that if a node that should be receiving data goes down permanently, removeToken will wait forever. I think it's reasonable to have some sort of timeout in this case. A log message or error can indicate which machines were being waited on for replication; an administrator should know whether those machines went down or are still streaming, and that determines whether repair needs to be run. The alternative would be to periodically wake up and check that the nodes we are waiting on are still alive, which wouldn't be particularly hard to implement.

I don't think returning immediately from the call is the right approach; that is part of the reason this ticket was created. If replication fails somewhere, there is no feedback to the user. At least timing out eventually provides information about which machines we think failed to replicate data.

As for multiple remove calls and the coordinator going down: I think there should be a 'force' option for the case where the coordinator goes down and you believe the rest of the nodes completed the operation. To prevent multiple calls to removeToken, there should simply be a check that the coordinator is dead before another call can be performed.
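The stream-progress idea above could be sketched roughly as follows. This is only an illustration, not Cassandra's actual streaming API; the class name and method are hypothetical. The point is that a stalled stream is detected by comparing bytes transferred between successive checks, so only stalled streams get retried rather than everything hitting an arbitrary timeout.

```java
// Hypothetical sketch: detect whether a stream is making progress by
// comparing bytes transferred between checks. A stream that has not
// moved since the last check is a candidate for retry.
public class StreamProgress
{
    private long lastBytes = -1;

    /** Returns true if the stream transferred more bytes since the previous check. */
    public boolean checkProgress(long bytesTransferred)
    {
        boolean progressed = bytesTransferred > lastBytes;
        lastBytes = bytesTransferred;
        return progressed;
    }

    public static void main(String[] args)
    {
        StreamProgress progress = new StreamProgress();
        System.out.println(progress.checkProgress(1024));  // moved: true
        System.out.println(progress.checkProgress(1024));  // stalled: false
        System.out.println(progress.checkProgress(4096));  // moved again: true
    }
}
```

A periodic task could call checkProgress for each active stream and re-request only those that return false twice in a row, which is the kind of change CASSANDRA-1438 could be extended to cover.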
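The "timeout with feedback" behavior described above might look something like this minimal sketch. All names here are illustrative assumptions, not the real removeToken code path: a latch counts confirmations, and on timeout the coordinator logs exactly which endpoints never confirmed, so the operator knows whether to run repair.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: wait for replication confirmations with a timeout,
// and on expiry report exactly which endpoints never confirmed instead of
// failing silently or blocking forever.
public class ReplicationWait
{
    public static Set<String> awaitConfirmations(Set<String> pending,
                                                 CountDownLatch confirmations,
                                                 long timeout, TimeUnit unit)
            throws InterruptedException
    {
        // Block until every expected node confirms, or the timeout expires.
        boolean allConfirmed = confirmations.await(timeout, unit);
        if (!allConfirmed)
        {
            // Surface the outstanding endpoints so an operator can decide
            // whether those nodes died (run repair) or are still streaming.
            System.out.println("Timed out waiting for replication from: " + pending);
        }
        return pending;
    }

    public static void main(String[] args) throws InterruptedException
    {
        Set<String> pending = ConcurrentHashMap.newKeySet();
        pending.add("10.0.0.1");
        pending.add("10.0.0.2");
        CountDownLatch latch = new CountDownLatch(2);

        // Simulate one node confirming; the other stays silent (e.g. it died).
        pending.remove("10.0.0.1");
        latch.countDown();

        Set<String> unconfirmed = awaitConfirmations(pending, latch, 100, TimeUnit.MILLISECONDS);
        System.out.println("Unconfirmed: " + unconfirmed);
    }
}
```

The alternative mentioned above, periodically waking up and checking liveness, would replace the single await with a loop that drops endpoints the failure detector reports as dead.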
So besides those few changes above, I think we should either implement this partway with a timeout for stream replication, or postpone completion here until we add the concept of stream progress.

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1216
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch, 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
>
> this means that if something goes wrong during the re-replication (e.g. a source node is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart the process (other than the Big Hammer of running repair)