[ https://issues.apache.org/jira/browse/CASSANDRA-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399170#comment-15399170 ]
Kaide Mu commented on CASSANDRA-12008:
--------------------------------------

bq. It seems getStreamedRanges is querying the AVAILABLE_RANGES table instead of STREAMED_RANGES; that's why it's generating the "Undefined column name" error.

Unbelievable, but yes, that was the error. Already fixed it, thanks!

bq. Maybe it's not working because of the previous error? Perhaps it would help to add a unit test to StreamStateStoreTest to verify that updateStreamedRanges and getStreamedRanges are being populated correctly and working as expected. You can also add debug logs to troubleshoot.

Another silly error: I wasn't adding the {{StreamTransferTask}} to the {{SessionCompleteEvent}}. Fixed.

bq. SystemKeyspace.getStreamedRanges is being called from inside a for-loop, which may be inefficient; it's probably better to retrieve it before the loop and re-use it inside.

I've added a new strategy, please let me know what you think about it.

Some additional modifications: we are not going to pass the description to the {{StreamTransferTask}} constructor. If we did, it would raise an error, because {{StreamResultFuture}} is not yet initialized when the task is created, so {{StreamSession.description()}} would return null at creation time. Instead, we obtain the {{StreamSession}} from {{StreamTransferTask.getSession()}} when each {{StreamTransferTask}} completes, i.e. when {{StreamStateStore.handleStreamEvent}} is invoked. All this means we only pass the task's responsible keyspace.
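To make the loop-hoisting idea concrete, here is a minimal, self-contained sketch. All names in it ({{rangesToStream}}, the {{getStreamedRanges}} stand-in, plain {{String}} ranges and peer addresses) are hypothetical simplifications of Cassandra's {{SystemKeyspace.getStreamedRanges}}, {{Range<Token>}} and {{InetAddress}} types, not the actual patch:

```java
import java.util.*;

public class ResumableStreamSketch {

    // Hypothetical stand-in for SystemKeyspace.getStreamedRanges(keyspace):
    // a single table query returning, per peer, the ranges already streamed.
    static Map<String, Set<String>> getStreamedRanges(String keyspace) {
        Map<String, Set<String>> streamed = new HashMap<>();
        streamed.put("/127.0.0.3",
                     new HashSet<>(Collections.singletonList("(30,-92]")));
        return streamed;
    }

    // Fetch the already-streamed ranges ONCE, before the loop, and re-use
    // the result for every (peer, range) pair instead of re-querying inside
    // the loop. A LinkedHashSet per peer also drops duplicate ranges, so a
    // range handed to us twice is only considered (and logged) once.
    static List<String> rangesToStream(String keyspace,
                                       Map<String, List<String>> pendingByPeer) {
        Map<String, Set<String>> alreadyStreamed = getStreamedRanges(keyspace);
        List<String> toStream = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : pendingByPeer.entrySet()) {
            Set<String> done = alreadyStreamed.getOrDefault(
                    e.getKey(), Collections.emptySet());
            for (String range : new LinkedHashSet<>(e.getValue())) {
                if (done.contains(range)) {
                    // "Range ... already in <peer>, skipping"
                    continue;
                }
                toStream.add(range);
            }
        }
        return toStream;
    }

    public static void main(String[] args) {
        Map<String, List<String>> pending = new HashMap<>();
        // The same range twice (as getChangedRangesForLeaving produced in the
        // logs), plus one range already recorded as streamed.
        pending.put("/127.0.0.3",
                    Arrays.asList("(-92,30]", "(-92,30]", "(30,-92]"));
        System.out.println(rangesToStream("ks", pending)); // prints [(-92,30]]
    }
}
```

The one-query-then-loop shape avoids hitting the system table once per range; the set-based dedup is only a defensive guard against the duplicate ranges seen in the debug logs.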
Some minor details: I don't know whether there's a problem with the current implementation or something odd in the set-up, but it skips the same range twice:
{quote}
DEBUG [RMI TCP Connection(9)-127.0.0.1] 2016-07-29 12:48:36,301 StorageService.java:4556 - Range (3074457345618258602,-9223372036854775808] already in /127.0.0.3, skipping
DEBUG [RMI TCP Connection(9)-127.0.0.1] 2016-07-29 12:48:36,301 StorageService.java:4556 - Range (3074457345618258602,-9223372036854775808] already in /127.0.0.3, skipping
{quote}
I think it's the set-up itself, since {{StorageService.getChangedRangesForLeaving}} is also returning the same range twice:
{quote}
DEBUG [RMI TCP Connection(9)-127.0.0.1] 2016-07-29 12:48:36,289 StorageService.java:2526 - Range (3074457345618258602,-9223372036854775808] will be responsibility of /127.0.0.3
DEBUG [RMI TCP Connection(9)-127.0.0.1] 2016-07-29 12:48:36,294 StorageService.java:2526 - Range (3074457345618258602,-9223372036854775808] will be responsibility of /127.0.0.3
{quote}
You can find the latest working patch at: https://github.com/apache/cassandra/compare/trunk...kdmu:trunk-12008?expand=1

> Make decommission operations resumable
> --------------------------------------
>
>                 Key: CASSANDRA-12008
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12008
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Tom van der Woerdt
>            Assignee: Kaide Mu
>            Priority: Minor
>
> We're dealing with large data sets (multiple terabytes per node) and sometimes we need to add or remove nodes. These operations are very dependent on the entire cluster being up, so while we're joining a new node (which sometimes takes 6 hours or longer) a lot can go wrong, and in a lot of cases something does.
> It would be great if the ability to retry streams was implemented.
> Example to illustrate the problem:
> {code}
> 03:18 PM ~ $ nodetool decommission
> error: Stream failed
> -- StackTrace --
> org.apache.cassandra.streaming.StreamException: Stream failed
> 	at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
> 	at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
> 	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
> 	at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
> 	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
> 	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
> 	at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:210)
> 	at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186)
> 	at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:430)
> 	at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:622)
> 	at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:486)
> 	at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:274)
> 	at java.lang.Thread.run(Thread.java:745)
> 08:04 PM ~ $ nodetool decommission
> nodetool: Unsupported operation: Node in LEAVING state; wait for status to become normal or restart
> See 'nodetool help' or 'nodetool help <command>'.
> {code}
> Streaming failed, probably due to load:
> {code}
> ERROR [STREAM-IN-/<ipaddr>] 2016-06-14 18:05:47,275 StreamSession.java:520 - [Stream #<streamid>] Streaming error occurred
> java.net.SocketTimeoutException: null
> 	at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) ~[na:1.8.0_77]
> 	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.8.0_77]
> 	at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.8.0_77]
> 	at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:54) ~[apache-cassandra-3.0.6.jar:3.0.6]
> 	at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:268) ~[apache-cassandra-3.0.6.jar:3.0.6]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
> {code}
> If implementing retries is not possible, can we have a 'nodetool decommission resume'?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)