[ https://issues.apache.org/jira/browse/CASSANDRA-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402388#comment-15402388 ]
Paulo Motta commented on CASSANDRA-12008:
-----------------------------------------

Thanks for the update. This is looking better and we're nearly done; see follow-up below:

* Code
** Fix the indentation of {{logger.debug("DECOMMISSIONING")}}.
** The {{isDecommissioning.get()}} check should use a {{compareAndSet}} to avoid starting simultaneous decommission sessions; see the {{isRebuilding}} check. Also, add a test verifying that it's not possible to start multiple decommissions simultaneously, based on the solution from CASSANDRA-11687 to avoid test flakiness.
** In {{SessionCompleteEvent}}, use {{Collections.unmodifiableMap}} when copying the {{transferredRangesPerKeyspace}} map to avoid modifications to the map.
** In order to avoid allocating a {{HashSet}} when it's not necessary, replace this:
{noformat}
Set<Range<Token>> toBeUpdated = new HashSet<>();
if (transferredRangesPerKeyspace.containsKey(keyspace))
{
    toBeUpdated = transferredRangesPerKeyspace.get(keyspace);
}
{noformat}
with this:
{noformat}
Set<Range<Token>> toBeUpdated = transferredRangesPerKeyspace.get(keyspace);
if (toBeUpdated == null)
{
    toBeUpdated = new HashSet<>();
}
{noformat}
** {{Error while decommissioning node}} is never printed because the {{ExecutionException}} is being wrapped in a {{RuntimeException}} in {{unbootstrap}}. Perhaps you can modify {{unbootstrap}} to throw {{ExecutionException | InterruptedException}} and catch those in {{decommission}}, wrapping them in a {{RuntimeException}} there.

* dtests
** Simply running {{stress read}} will not fail if the keys are not there; you need to either compare the retrieved keys or check that there was no failure on the stress process (see {{bootstrap_test}} for examples).
** When verifying that the retrieved data is correct in {{resumable_decommission_test}}, you need to stop either node1 or node3 when querying the other, otherwise the data may be in only one of these nodes (while it must be in both nodes, since RF=2 and N=2).
** Perhaps reduce the number of keys to 10k so the test will be faster.
** In {{resumable_decommission_test}}, set {{stream_throughput_outbound_megabits_per_sec}} to {{1}} so the streaming will be slower and allow more time for interrupting.
** Perhaps it's better for {{InterruptDecommission}} to watch for {{rebuild from dc}}, since this is printed before {{"Executing streaming plan for Unbootstrap"}}.
** Instead of counting occurrences of {{decommission_error}}, you can add a {{self.fail("second rebuild should fail")}} after {{node2.nodetool('decommission')}}, and in the {{except}} clause perhaps check that the message {{Error while decommissioning node}} is printed in the logs - see the new version of {{simple_rebuild_test}} from CASSANDRA-11687.
** bq. I found that streamed range skipping behaviour log check-up is not working
*** This is probably because the {{Range (-2556370087840976503,-2548250017122308073] already in /127.0.0.3, skipping}} message is only printed to {{debug.log}}, so you should pass {{filename='debug.log'}} to {{watch_log_for}}.

When you modify {{StreamStateStore}} to {{updateStreamedRanges}} for requested ranges (ie. bootstrap), there could be a collision between received and transferred ranges for the same peer. While this collision will not show up in decommission, bootstrap, or rebuild, since we only transfer in one direction, it may be confusing and a source of problems in the future. So, in order to avoid creating another table to support that in the future, I think we can modify {{streamed_ranges}} to include an {{outgoing}} boolean primary key field indicating whether it's an incoming or outgoing transfer. WDYT [~yukim] [~kdmu]?
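For reference, the {{compareAndSet}} guard and the {{Collections.unmodifiableMap}} copy suggested above might look roughly like the sketch below. This is illustrative only; the class and method names here are made up and do not match the actual {{StorageService}} / {{SessionCompleteEvent}} code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch, not the actual Cassandra classes.
public class DecommissionSketch
{
    private final AtomicBoolean isDecommissioning = new AtomicBoolean(false);

    // compareAndSet flips false -> true atomically, so at most one caller
    // can win the race; a plain get()-then-set leaves a window where two
    // threads both see false and both start a session.
    public boolean tryStartDecommission()
    {
        return isDecommissioning.compareAndSet(false, true);
    }

    public void decommissionFinished()
    {
        isDecommissioning.set(false);
    }

    // Defensive copy wrapped in Collections.unmodifiableMap, so consumers
    // of the event cannot mutate its view of transferredRangesPerKeyspace.
    public static <K, V> Map<K, Set<V>> snapshot(Map<K, Set<V>> transferredRangesPerKeyspace)
    {
        return Collections.unmodifiableMap(new HashMap<>(transferredRangesPerKeyspace));
    }
}
```

A second {{tryStartDecommission()}} call returns {{false}} until {{decommissionFinished()}} is called, which is the behavior the proposed dtest from CASSANDRA-11687 would assert on.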
> Make decommission operations resumable
> --------------------------------------
>
>                 Key: CASSANDRA-12008
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12008
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Tom van der Woerdt
>            Assignee: Kaide Mu
>            Priority: Minor
>
> We're dealing with large data sets (multiple terabytes per node) and sometimes we need to add or remove nodes. These operations are very dependent on the entire cluster being up, so while we're joining a new node (which sometimes takes 6 hours or longer) a lot can go wrong and in a lot of cases something does.
> It would be great if the ability to retry streams was implemented.
> Example to illustrate the problem:
> {code}
> 03:18 PM ~ $ nodetool decommission
> error: Stream failed
> -- StackTrace --
> org.apache.cassandra.streaming.StreamException: Stream failed
> 	at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
> 	at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
> 	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
> 	at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
> 	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
> 	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
> 	at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:210)
> 	at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186)
> 	at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:430)
> 	at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:622)
> 	at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:486)
> 	at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:274)
> 	at java.lang.Thread.run(Thread.java:745)
> 08:04 PM ~ $ nodetool decommission
> nodetool: Unsupported operation: Node in LEAVING state; wait for status to become normal or restart
> See 'nodetool help' or 'nodetool help <command>'.
> {code}
> Streaming failed, probably due to load:
> {code}
> ERROR [STREAM-IN-/<ipaddr>] 2016-06-14 18:05:47,275 StreamSession.java:520 - [Stream #<streamid>] Streaming error occurred
> java.net.SocketTimeoutException: null
> 	at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) ~[na:1.8.0_77]
> 	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.8.0_77]
> 	at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.8.0_77]
> 	at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:54) ~[apache-cassandra-3.0.6.jar:3.0.6]
> 	at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:268) ~[apache-cassandra-3.0.6.jar:3.0.6]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
> {code}
> If implementing retries is not possible, can we have a 'nodetool decommission resume'?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)