[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Semb Wever updated CASSANDRA-18110: ------------------------------------------- Severity: Critical (was: Normal) > Streaming fails during multiple concurrent host replacements > ------------------------------------------------------------ > > Key: CASSANDRA-18110 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18110 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Bootstrap and Decommission > Reporter: Jon Meredith > Assignee: Jon Meredith > Priority: Urgent > Fix For: 4.1 > > > Running four concurrent host replacements on a 4.1.0 development cluster has > repeatably failed to complete bootstrap with all four hosts failing bootsrrap > and staying in JOINING, logging the message. > {code:java} > ERROR 2022-12-07T21:15:48,860 [main] > org.apache.cassandra.service.StorageService:2019 - Error while waiting on > bootstrap to complete. Bootstrap will have to be restarted. > {code} > Bootstrap fails as the the FileStreamTasks on the streaming followers > encounter an EOF while transmitting the files. > {code:java} > ERROR 2022-12-07T15:49:39,164 [NettyStreaming-Outbound-/1.2.3.4.7000:2] > org.apache.cassandra.streaming.StreamSession:718 - [Stream > #8d313690-7674-11ed-813f-95c261b64a82] Streaming error occurred on session > with peer 1.2.3.4:7000 through 1.2.3.4:40292 > org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The channel > this output stream was writing to has been closed > at > org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200) > ~[cassandra.jar] > at > org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158) > ~[cassandra.jar] > at > org.apache.cassandra.net.AsyncChannelOutputPlus.waitForSpace(AsyncChannelOutputPlus.java:140) > ~[cassandra.jar] > at > org.apache.cassandra.net.AsyncChannelOutputPlus.beginFlush(AsyncChannelOutputPlus.java:97) > ~[cassandra.jar] > at > org.apache.cassandra.net.AsyncStreamingOutputPlus.lambda$writeToChannel$0(AsyncStreamingOutputPlus.java:124) > ~[cassandra.jar] > at > org.apache.cassandra.db.streaming.CassandraCompressedStreamWriter.lambda$write$0(CassandraCompressedStreamWriter.java:89) > ~[cassandra.jar] > at > org.apache.cassandra.net.AsyncStreamingOutputPlus.writeToChannel(AsyncStreamingOutputPlus.java:120) > ~[cassandra.jar] > at > org.apache.cassandra.db.streaming.CassandraCompressedStreamWriter.write(CassandraCompressedStreamWriter.java:88) > ~[cassandra.jar] > at > org.apache.cassandra.db.streaming.CassandraOutgoingFile.write(CassandraOutgoingFile.java:177) > ~[cassandra.jar] > at > org.apache.cassandra.streaming.messages.OutgoingStreamMessage.serialize(OutgoingStreamMessage.java:87) > ~[cassandra.jar] > at > org.apache.cassandra.streaming.messages.OutgoingStreamMessage$1.serialize(OutgoingStreamMessage.java:45) > ~[cassandra.jar] > at > org.apache.cassandra.streaming.messages.OutgoingStreamMessage$1.serialize(OutgoingStreamMessage.java:34) > ~[cassandra.jar] > at > org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:39) > ~[cassandra.jar] > at > org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$FileStreamTask.run(StreamingMultiplexedChannel.java:311) > [cassandra.jar] > at > org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96) > [cassandra.jar] > at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) > [cassandra.jar] > at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) > [cassandra.jar] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > [netty-all-4.1.58.Final.jar:4.1.58.Final] > at java.lang.Thread.run(Thread.java:829) [?:?] > Suppressed: java.nio.channels.ClosedChannelException > at > org.apache.cassandra.net.AsyncStreamingOutputPlus.doFlush(AsyncStreamingOutputPlus.java:82) > ~[cassandra.jar] > at > org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:229) > ~[cassandra.jar] > at > org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:248) > ~[cassandra.jar] > at > org.apache.cassandra.streaming.async.NettyStreamingChannel$1.close(NettyStreamingChannel.java:141) > ~[cassandra.jar] > at > org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$FileStreamTask.run(StreamingMultiplexedChannel.java:312) > [cassandra.jar] > at > org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96) > [cassandra.jar] > at > org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) > [cassandra.jar] > at > org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) > [cassandra.jar] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > [netty-all-4.1.58.Final.jar:4.1.58.Final] > at java.lang.Thread.run(Thread.java:829) [?:?] > Caused by: io.netty.channel.unix.Errors$NativeIoException: > writevAddresses(..) failed: Connection timed out > {code} > Running a debug build with extra logging added to > org.apache.cassandra.streaming.StreamSession#progress shows long periods > where the streaming task does not log any progress events of up to 15 mins in > duration. After these stalls the next attempt to write hits EOF. > Example caught by extra logging in > org.apache.cassandra.streaming.StreamSession#progress > {code:java} > INFO 2022-12-07T15:10:56,149 [NettyStreaming-Outbound-/1.2.3.4.7000:2] > org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 22216704/25673094 bytes (86%) > s > ent to idx:0/1.2.3.4 for file /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db > bytes 22216704 out of total 25673094 > # ... 15min delay during which other streams can be seen transmitting to the > same and other hosts > INFO 2022-12-07T15:25:29,614 [NettyStreaming-Outbound-/1.2.3.4.7000:2] > org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 22282240/25673094 bytes (86%) > sent to idx:0/1.2.3.4 for file > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db bytes 22282240 out of total > 25673094 > INFO 2022-12-07T15:25:29,615 [NettyStreaming-Outbound-/1.2.3.4.7000:2] > org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 22347776/25673094 bytes (87%) > sent to idx:0/1.2.3.4 for file > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db bytes 22347776 out of total > 25673094 > INFO 2022-12-07T15:25:29,626 [NettyStreaming-Outbound-/1.2.3.4.7000:2] > org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 23855104/25673094 bytes (92%) > sent to idx:0/1.2.3.4 for file > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db bytes 23855104 out of total > 25673094 > INFO 2022-12-07T15:25:29,627 [NettyStreaming-Outbound-/1.2.3.4.7000:2] > org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 23920640/25673094 bytes (93%) > sent to idx:0/1.2.3.4 for file > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db bytes 23920640 out of total > 25673094 > # ...12min delay while other streams continue to transmit > INFO 2022-12-07T15:37:29,775 [NettyStreaming-Outbound-/1.2.3.4.7000:2] > org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 23986176/25673094 bytes (93%) > sent to idx:0/1.2.3.4 for file > /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db bytes 23986176 out of total > 25673094 > {code} > The issue has not succesfully been reproduced in other clusters or in a > dtest, but is reliable reproducible on the cluster. The hosts being used are > relatively large and have an available CPU count of 32 causing the streaming > system to allow up to 32 concurrent streams. > [https://github.com/apache/cassandra/blob/81c616826ab3c8aa96467771d0e074a874efdd77/src/java/org/apache/cassandra/streaming/async/StreamingMultiplexedChannel.java#L93] > One possible explanation is that the stream rate limiter is not being fair > and blocking sends long enough for TCP user timeout to fire and cause the > kernel to close the socket which is detected on the next write, another is > something is taking unusually long flushing the socket in > AsyncChannelOutputPlus.beginFlush. > Restarting the instances with the streaming TCP user timeout disabled on the > cluster (set to zero in the boostrapping instances config, and updated with > JMX on the other instances) allowed the host-replacements to complete > bootstrap successfully. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org