[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655745#comment-17655745 ]

Abe Ratnofsky commented on CASSANDRA-18110:
---

Left a review with a few questions on GitHub.

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: David Capwell
> Priority: Normal
> Fix For: 4.1.x
>
> Running four concurrent host replacements on a 4.1.0 development cluster has
> repeatably failed to complete bootstrap, with all four hosts failing bootstrap
> and staying in JOINING, logging the following message:
> {code:java}
> ERROR 2022-12-07T21:15:48,860 [main] org.apache.cassandra.service.StorageService:2019 - Error while waiting on bootstrap to complete. Bootstrap will have to be restarted.
> {code}
> Bootstrap fails as the FileStreamTasks on the streaming followers
> encounter an EOF while transmitting the files.
> {code:java}
> ERROR 2022-12-07T15:49:39,164 [NettyStreaming-Outbound-/1.2.3.4.7000:2] org.apache.cassandra.streaming.StreamSession:718 - [Stream #8d313690-7674-11ed-813f-95c261b64a82] Streaming error occurred on session with peer 1.2.3.4:7000 through 1.2.3.4:40292
> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The channel this output stream was writing to has been closed
>     at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200) ~[cassandra.jar]
>     at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158) ~[cassandra.jar]
>     at org.apache.cassandra.net.AsyncChannelOutputPlus.waitForSpace(AsyncChannelOutputPlus.java:140) ~[cassandra.jar]
>     at org.apache.cassandra.net.AsyncChannelOutputPlus.beginFlush(AsyncChannelOutputPlus.java:97) ~[cassandra.jar]
>     at org.apache.cassandra.net.AsyncStreamingOutputPlus.lambda$writeToChannel$0(AsyncStreamingOutputPlus.java:124) ~[cassandra.jar]
>     at org.apache.cassandra.db.streaming.CassandraCompressedStreamWriter.lambda$write$0(CassandraCompressedStreamWriter.java:89) ~[cassandra.jar]
>     at org.apache.cassandra.net.AsyncStreamingOutputPlus.writeToChannel(AsyncStreamingOutputPlus.java:120) ~[cassandra.jar]
>     at org.apache.cassandra.db.streaming.CassandraCompressedStreamWriter.write(CassandraCompressedStreamWriter.java:88) ~[cassandra.jar]
>     at org.apache.cassandra.db.streaming.CassandraOutgoingFile.write(CassandraOutgoingFile.java:177) ~[cassandra.jar]
>     at org.apache.cassandra.streaming.messages.OutgoingStreamMessage.serialize(OutgoingStreamMessage.java:87) ~[cassandra.jar]
>     at org.apache.cassandra.streaming.messages.OutgoingStreamMessage$1.serialize(OutgoingStreamMessage.java:45) ~[cassandra.jar]
>     at org.apache.cassandra.streaming.messages.OutgoingStreamMessage$1.serialize(OutgoingStreamMessage.java:34) ~[cassandra.jar]
>     at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:39) ~[cassandra.jar]
>     at org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$FileStreamTask.run(StreamingMultiplexedChannel.java:311) [cassandra.jar]
>     at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96) [cassandra.jar]
>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) [cassandra.jar]
>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) [cassandra.jar]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.58.Final.jar:4.1.58.Final]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
>     Suppressed: java.nio.channels.ClosedChannelException
>         at org.apache.cassandra.net.AsyncStreamingOutputPlus.doFlush(AsyncStreamingOutputPlus.java:82) ~[cassandra.jar]
>         at org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:229) ~[cassandra.jar]
>         at org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:248) ~[cassandra.jar]
>         at org.apache.cassandra.streaming.async.NettyStreamingChannel$1.close(NettyStreamingChannel.java:141) ~[cassandra.jar]
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655595#comment-17655595 ]

David Capwell commented on CASSANDRA-18110:
---

Update: I added a feature flag to disable tracking, and made the costs lower and more predictable (file progress does require an ObjectLongHashMap to track per-file progress).

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: David Capwell
> Priority: Normal
> Fix For: 4.1.x
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655544#comment-17655544 ]

David Capwell commented on CASSANDRA-18110:
---

bq. I'm referring to the pacing of updates to the vtable data, not the streaming data. In org.apache.cassandra.streaming.StreamingState#handleStreamEvent, only update this.sessions on a FILE_PROGRESS event if it hasn't been updated in X millis. For other event types, update this.sessions right away.

We could, but that doesn't avoid the 15m to build a single event that you and Jon saw.

bq. An even better solution would be to not re-create this.sessions but instead update it based on the deltas

Yep, that's what I did; I already sent out a patch doing this!

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1.x
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655542#comment-17655542 ]

David Capwell commented on CASSANDRA-18110:
---

Pushed an update that avoids the double counting of prepare and file progress events. The memory cost for file progress is lower than before this patch, but it is no longer just updating a counter (we need the delta between events).

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1.x
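
Editorial sketch of the "delta between events" idea from the comment above. This is not the code from the patch: the class and method names are hypothetical, a plain java.util.HashMap stands in for the ObjectLongHashMap mentioned elsewhere in the thread, and it assumes each file progress event carries the cumulative byte count for one file.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not Cassandra's StreamingState. */
final class FileProgressDeltas {
    // Last cumulative byte count seen per file (assumed to be what each event reports).
    private final Map<String, Long> lastBytesPerFile = new HashMap<>();
    private long totalBytes;

    /** Adds only the bytes that are new since the previous event for this file. */
    synchronized void onFileProgress(String file, long cumulativeBytes) {
        Long previous = lastBytesPerFile.put(file, cumulativeBytes);
        long delta = cumulativeBytes - (previous == null ? 0L : previous);
        totalBytes += delta;
    }

    synchronized long totalBytes() {
        return totalBytes;
    }
}
```

Because only the difference from the previous event is added, an event that repeats the same cumulative value contributes zero, which is one way to avoid counting the same progress twice. The cost is one map entry per file rather than a single counter.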
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655536#comment-17655536 ]

Abe Ratnofsky commented on CASSANDRA-18110:
---

> Changing the pacing could have side effects none of us are aware of...

I'm referring to the pacing of updates to the vtable data, not the streaming data. In org.apache.cassandra.streaming.StreamingState#handleStreamEvent, only update this.sessions on a FILE_PROGRESS event if it hasn't been updated in X millis. For other event types, update this.sessions right away. This is my first thought at least; it would reduce the time spent updating sessions when FILE_PROGRESS is frequent (which it should be).

An even better solution would be to not re-create this.sessions but instead update it based on the deltas, since the incoming event should only update a single ProgressInfo (peer, direction, file, etc.) and the handler is already synchronized.

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1.x
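
Editorial sketch of the pacing suggestion in the comment above: skip a FILE_PROGRESS update if the state was updated within the last X millis, and apply every other event type right away. All names here are hypothetical, the enum is a stand-in for the real event types, and the clock is injected only to make the behaviour easy to check. It is not code from the ticket.

```java
import java.util.function.LongSupplier;

/** Hypothetical illustration only; not Cassandra's StreamingState. */
final class ThrottledSessionsUpdater {
    // Stand-in for the stream event types discussed in the thread.
    enum EventType { STREAM_PREPARED, FILE_PROGRESS, STREAM_COMPLETE }

    private final long minIntervalMillis;   // the "X millis" from the comment
    private final LongSupplier clockMillis; // injected clock (an assumption, for testability)
    private long lastUpdateMillis;
    private boolean updatedOnce;
    private int updates;

    ThrottledSessionsUpdater(long minIntervalMillis, LongSupplier clockMillis) {
        this.minIntervalMillis = minIntervalMillis;
        this.clockMillis = clockMillis;
    }

    /** Returns true if the (imaginary) sessions snapshot was refreshed for this event. */
    synchronized boolean handle(EventType type) {
        long now = clockMillis.getAsLong();
        boolean tooSoon = updatedOnce && now - lastUpdateMillis < minIntervalMillis;
        if (type == EventType.FILE_PROGRESS && tooSoon)
            return false; // drop this refresh; a later event will pick up the progress
        lastUpdateMillis = now;
        updatedOnce = true;
        updates++;
        return true;
    }

    synchronized int updates() {
        return updates;
    }
}
```

Note that this only limits how often the snapshot is rebuilt; it does not make a single rebuild cheaper.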
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655530#comment-17655530 ]

David Capwell commented on CASSANDRA-18110:
---

More fun: SessionPreparedEvent is seen twice; we double-post the same event!

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1.x
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655525#comment-17655525 ]

David Capwell commented on CASSANDRA-18110:
---

Sent out a PR, but it seems my bytesReceived/bytesSent needs to be changed... this one is a bit more annoying to deal with.

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1.x
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654713#comment-17654713 ]

David Capwell commented on CASSANDRA-18110:
---

bq. For #3 - wouldn't disabling tracking mean that the virtual table isn't usable?

Correct, this allows you to break the vtable if the feature causes things to not be stable.

bq. I'm also in favor of #4.

I am working on a patch doing both #3 and #4; so fix the issue by doing less work, and allow opt-out...

bq. I was also thinking of changing the pacing (via debounce) for individual file progress events, so a progress event does not trigger handleStreamEvent if it was called within the last X millis.

I don't think we have enough people familiar with Streaming to make that safe... Even the vtable was a problem as no one is around who knows streaming! Changing the pacing could have side effects none of us are aware of...

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1.x
>
> Running four concurrent host replacements on a 4.1.0 development cluster has
> repeatably failed to complete bootstrap, with all four hosts failing bootstrap
> and staying in JOINING, logging the message:
> {code:java}
> ERROR 2022-12-07T21:15:48,860 [main]
> org.apache.cassandra.service.StorageService:2019 - Error while waiting on
> bootstrap to complete. Bootstrap will have to be restarted.
> {code}
> Bootstrap fails as the FileStreamTasks on the streaming followers
> encounter an EOF while transmitting the files.
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654633#comment-17654633 ]

Abe Ratnofsky commented on CASSANDRA-18110:
---

For #3 - wouldn't disabling tracking mean that the virtual table isn't usable? This would put us in a situation where you can either use the vtable or stream successfully on larger clusters, which seems to defeat the purpose of the vtable.

I'm also in favor of #4.

I was also thinking of changing the pacing (via debounce) for individual file progress events, so a progress event does not trigger handleStreamEvent if it was called within the last X millis.

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1.x
>
> Running four concurrent host replacements on a 4.1.0 development cluster has
> repeatably failed to complete bootstrap, with all four hosts failing bootstrap
> and staying in JOINING, logging the message:
> {code:java}
> ERROR 2022-12-07T21:15:48,860 [main]
> org.apache.cassandra.service.StorageService:2019 - Error while waiting on
> bootstrap to complete. Bootstrap will have to be restarted.
> {code}
> Bootstrap fails as the FileStreamTasks on the streaming followers
> encounter an EOF while transmitting the files.
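On the pacing idea in the comment above (a progress event should not trigger handleStreamEvent if one was handled within the last X millis), a minimal sketch of such a gate might look like the following. It assumes nothing about Cassandra's actual code: `ProgressEventGate`, its constructor arguments, and the injected clock are hypothetical names used only for illustration. Strictly speaking this is a leading-edge throttle rather than a true debounce.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Cassandra: admits at most one progress
// event per interval so the expensive listener work runs less often.
final class ProgressEventGate {
    private final long minIntervalNanos;
    private final LongSupplier nanoClock;      // injected so tests can control time
    private final AtomicLong lastAdmittedNanos;

    ProgressEventGate(long minIntervalMillis, LongSupplier nanoClock) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
        this.nanoClock = nanoClock;
        // Back-date the last admission so the very first event is let through.
        this.lastAdmittedNanos = new AtomicLong(nanoClock.getAsLong() - minIntervalNanos);
    }

    // Returns true if the caller should handle this event, false if it should be skipped.
    boolean tryAdmit() {
        long now = nanoClock.getAsLong();
        long last = lastAdmittedNanos.get();
        if (now - last < minIntervalNanos)
            return false;
        // Only one of several racing threads wins the right to handle the event.
        return lastAdmittedNanos.compareAndSet(last, now);
    }
}
```

In real use the clock would be `System::nanoTime`. A gate like this would presumably have to be bypassed for terminal events (file complete, session complete), otherwise the last reported state could stay stale; that is one of the side effects a pacing change would need to account for.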
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654630#comment-17654630 ]

David Capwell commented on CASSANDRA-18110:
---

[~jonmeredith] if you don't mind I can take this, as you found the issue was caused by the vtable I worked on.

Looking at the code / stack trace, the issue looks to be in org.apache.cassandra.streaming.StreamingState#handleStreamEvent, the line
{code}
sessions = Sessions.create(streamProgress.values());
{code}
The single usage of this field is org.apache.cassandra.db.virtual.StreamingVirtualTable#updateDataSet, which is only needed when the vtable is requested...

There are a few options I think we can take (can take multiple):
1) don't be eager and have org.apache.cassandra.streaming.StreamingState#sessions build on-demand from org.apache.cassandra.streaming.StreamingState#streamProgress
2) org.apache.cassandra.streaming.StreamingState#onSuccess and org.apache.cassandra.streaming.StreamingState#onFailure could build the sessions eagerly (like it does today), but push this logic to another thread
3) feature flag to allow tracking to be disabled (why I didn't do this in the first place)
4) Sessions is just received/sent so why do we need to compute using 100% of events? can we not just use a counter?

Given that building 1 session can take 15m, then #1 just breaks the vtable... for this reason I am in favor of #3 and #4... I'll look closer to see if I can write #4 and see if there are any tradeoffs I don't see yet.

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1.x
>
> Running four concurrent host replacements on a 4.1.0 development cluster has
> repeatably failed to complete bootstrap, with all four hosts failing bootstrap
> and staying in JOINING, logging the message:
> {code:java}
> ERROR 2022-12-07T21:15:48,860 [main]
> org.apache.cassandra.service.StorageService:2019 - Error while waiting on
> bootstrap to complete. Bootstrap will have to be restarted.
> {code}
> Bootstrap fails as the FileStreamTasks on the streaming followers
> encounter an EOF while transmitting the files.
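To make option #4 from the comment above concrete, here is a rough sketch of what "just use a counter" could mean: keep running totals and adjust them by a delta on each progress event, rather than re-summing every known file. This is not the actual patch. `RunningStreamTotals`, `onFileProgress`, and the use of a file name as the key are assumptions made for illustration, and the sketch assumes each file is reported by a single thread with non-decreasing byte counts.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration of option #4, not the actual patch: keep running
// totals that are adjusted by a delta on each event instead of re-summing
// every known file.
final class RunningStreamTotals {
    // Last reported byte count per file, needed to turn absolute progress into a delta.
    private final ConcurrentHashMap<String, Long> lastBytesByFile = new ConcurrentHashMap<>();
    private final LongAdder bytesReceived = new LongAdder();
    private final LongAdder filesReceived = new LongAdder();

    // Assumes each file is reported by a single thread, with non-decreasing currentBytes.
    void onFileProgress(String fileName, long currentBytes, long totalBytes) {
        Long previous = lastBytesByFile.put(fileName, currentBytes);
        long previousBytes = previous == null ? 0L : previous;
        bytesReceived.add(currentBytes - previousBytes);

        // Count a file once, the first time it is reported as fully received.
        boolean wasComplete = previous != null && previousBytes == totalBytes;
        if (currentBytes == totalBytes && !wasComplete)
            filesReceived.increment();
    }

    long bytesReceived() { return bytesReceived.sum(); }
    long filesReceived() { return filesReceived.sum(); }
}
```

With this shape each event costs a constant amount of work regardless of how many files the session has already received, and reading the totals for the vtable is a cheap sum of the counters.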
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654100#comment-17654100 ]

C. Scott Andreas commented on CASSANDRA-18110:
---

This one's a hall-of-famer for sure.

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1.x
>
> Running four concurrent host replacements on a 4.1.0 development cluster has
> repeatably failed to complete bootstrap, with all four hosts failing bootstrap
> and staying in JOINING, logging the message:
> {code:java}
> ERROR 2022-12-07T21:15:48,860 [main]
> org.apache.cassandra.service.StorageService:2019 - Error while waiting on
> bootstrap to complete. Bootstrap will have to be restarted.
> {code}
> Bootstrap fails as the FileStreamTasks on the streaming followers
> encounter an EOF while transmitting the files.
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651433#comment-17651433 ]

Jon Meredith commented on CASSANDRA-18110:
---

tl;dr an event listener added to collect data for the streaming status vtable is causing lock contention and starving the stream deserializer thread, which causes a TCP user timeout on the sender that fails the stream.

Details - CASSANDRA-17390 exposed streaming status over a virtual table by listening to all streaming events with class {{StreamingState}}. It maintains a {{SessionInfo}} structure for each peer active in streaming, tracking a {{ProgressInfo}} for each streamed file that tracks transferred/total bytes for the file, and updates it on every streaming event, calling {{org.apache.cassandra.streaming.StreamingState.Sessions#create}}.

Streaming events include {{FILE_PROGRESS}} events generated on the {{StreamDeserializingTask}} for every section (64kb) read by {{CompressedStreamReader}}, and for every file component for {{CassandraEntireSSTableStreamReader}}. They happen frequently during heavy streaming activity.

In the event handler, {{Sessions.create}} recreates a summary {{org.apache.cassandra.streaming.StreamingState.Sessions}} object by iterating over all of the active sessions/files it knows about (multiple times), summing received/total for bytes and #files. In one heap dump investigated, the top three sessions had 5612, 1780 and 1528 received files.

Generating the summary takes longer as the bootstrap proceeds and creates contention in the synchronized org.apache.cassandra.streaming.StreamResultFuture.fireStreamEvent method. Method synchronization is unfair and causes starvation for some of the {{StreamDeserializingTask}} threads that consume the {{AsyncStreamingInputPlus}}, slowing calls to netty to read from the streaming channel in {{rebuffer}}.

On the unfairly scheduled threads, eventually socket reads slow down enough to keep the TCP receive queue above the high water mark for 300s, preventing an ACK of pending bytes to the sender. This trips the TCP user timeout on the sender, causing it to close the channel before completing the transfer and failing the stream.

Raising the streaming TCP user timeout or disabling it entirely is still the correct workaround for this issue until it can be resolved.

{code:java}
"Stream-Deserializer-/1.2.3.4:7000-deadbeef" #176 daemon prio=5 os_prio=0 cpu=299135.87ms elapsed=8341.21s tid=0x7f25e2064a00 nid=0xe02f runnable [0x7f25b03a4000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.cassandra.streaming.SessionInfo.getTotalSizeInProgress(SessionInfo.java:172)
    at org.apache.cassandra.streaming.SessionInfo.getTotalSizeReceived(SessionInfo.java:125)
    at org.apache.cassandra.streaming.StreamingState$Sessions.create(StreamingState.java:379)
    at org.apache.cassandra.streaming.StreamingState.handleStreamEvent(StreamingState.java:255)
    - locked <0x000604ac7bf0> (a org.apache.cassandra.streaming.StreamingState)
    at org.apache.cassandra.streaming.StreamResultFuture.fireStreamEvent(StreamResultFuture.java:218)
    - locked <0x00060c3e57a0> (a org.apache.cassandra.streaming.StreamResultFuture)
    at org.apache.cassandra.streaming.StreamResultFuture.handleProgress(StreamResultFuture.java:208)
    at org.apache.cassandra.streaming.StreamSession.progress(StreamSession.java:1096)
    at org.apache.cassandra.db.streaming.CassandraCompressedStreamReader.read(CassandraCompressedStreamReader.java:96)
    at org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:84)
    - locked <0x00064a637510> (a org.apache.cassandra.db.streaming.CassandraIncomingFile)
    at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:55)
    at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:41)
    at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:50)
    at org.apache.cassandra.streaming.StreamDeserializingTask.run(StreamDeserializingTask.java:59)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)
{code}
other streaming threads blocked on the lock
{code:java}
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.cassandra.streaming.StreamResultFuture.fireStreamEvent(StreamResultFuture.java:214)
    - waiting to lock <0x00060c3e57a0> (a org.apache.cassandra.streaming.StreamResultFuture)
    at org.apache.cassandra.streaming.StreamResultFuture.handleProgress(StreamResultFuture.java:208)
    at org.apache.cassandra.streaming.StreamSession.progress(StreamSession.java
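The observation in the analysis above that generating the summary takes longer as the bootstrap proceeds can be illustrated with a toy cost model. This is only a sketch under simplifying assumptions that are not taken from the Cassandra code: files arrive one after another, every file produces the same number of progress events, each summary rebuild visits every file seen so far exactly once, and nothing is ever dropped from the tracked set. `SummaryCostModel` and its method names are hypothetical.

```java
// Toy cost model, not Cassandra code: counts how many per-file records are
// visited in total under two ways of maintaining a streaming summary.
final class SummaryCostModel {
    // The summary is rebuilt from scratch on every event, scanning all files seen so far.
    static long visitsWhenRebuiltPerEvent(int files, int eventsPerFile) {
        long visits = 0;
        for (int known = 1; known <= files; known++)
            visits += (long) known * eventsPerFile; // each event for this file scans 'known' files
        return visits;
    }

    // The summary is kept as running totals, so each event touches one record.
    static long visitsWithRunningTotals(int files, int eventsPerFile) {
        return (long) files * eventsPerFile;
    }
}
```

Under these assumptions the rebuild-per-event approach grows quadratically with the number of files: for 5612 files and a single event per file the model gives 15,750,078 visits, against 5,612 with running totals. All of that work happens while holding the lock in fireStreamEvent, which is consistent with the blocked threads in the dump above.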
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646377#comment-17646377 ]

Jon Meredith commented on CASSANDRA-18110:
---

Still no success reproducing this issue in a test environment. Given that all attempts to reproduce it have failed, and that we have finished investigating our theories of how this could happen without finding anything so far, this should not block the release. Downgrading severity to Normal; I will update this ticket with any future findings.

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.1
>
> Running four concurrent host replacements on a 4.1.0 development cluster has
> repeatably failed to complete bootstrap, with all four hosts failing bootstrap
> and staying in JOINING, logging the message:
> {code:java}
> ERROR 2022-12-07T21:15:48,860 [main]
> org.apache.cassandra.service.StorageService:2019 - Error while waiting on
> bootstrap to complete. Bootstrap will have to be restarted.
> {code}
> Bootstrap fails as the FileStreamTasks on the streaming followers
> encounter an EOF while transmitting the files.
[jira] [Commented] (CASSANDRA-18110) Streaming fails during multiple concurrent host replacements
[ https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645812#comment-17645812 ]

Jon Meredith commented on CASSANDRA-18110:
---

To clarify the workaround: on any new nodes being started, ensure the config is set with
{code:java}
internode_streaming_tcp_user_timeout: 0s
{code}
and on existing nodes in the cluster, the timeout can be reset with JMX without restarting
{code:java}
org.apache.cassandra.db/StorageService/InternodeStreamingTcpUserTimeoutInMS
{code}
before the bootstrap attempt.

> Streaming fails during multiple concurrent host replacements
>
> Key: CASSANDRA-18110
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jon Meredith
> Assignee: Jon Meredith
> Priority: Urgent
> Fix For: 4.1
>
> Running four concurrent host replacements on a 4.1.0 development cluster has
> repeatably failed to complete bootstrap, with all four hosts failing bootstrap
> and staying in JOINING, logging the message:
> {code:java}
> ERROR 2022-12-07T21:15:48,860 [main]
> org.apache.cassandra.service.StorageService:2019 - Error while waiting on
> bootstrap to complete. Bootstrap will have to be restarted.
> {code}
> Bootstrap fails as the FileStreamTasks on the streaming followers
> encounter an EOF while transmitting the files.
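For the JMX half of the workaround in the comment above, a small client along the following lines could set the attribute on an existing node. Treat it as a sketch under stated assumptions rather than a supported tool: it assumes the StorageService MBean is registered as `org.apache.cassandra.db:type=StorageService`, that `InternodeStreamingTcpUserTimeoutInMS` is an int-valued writable attribute, and that JMX is reachable on the given host and port without authentication or TLS. The class and method names are made up for this example.

```java
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, not shipped with Cassandra.
final class StreamingTimeoutWorkaround {
    // Assumed MBean name and attribute, following the path quoted in the comment above.
    static final String STORAGE_SERVICE_MBEAN = "org.apache.cassandra.db:type=StorageService";
    static final String TIMEOUT_ATTRIBUTE = "InternodeStreamingTcpUserTimeoutInMS";

    // Sets the streaming TCP user timeout to 0, mirroring the 0s config value.
    static void disableStreamingTcpUserTimeout(MBeanServerConnection connection) throws Exception {
        ObjectName storageService = new ObjectName(STORAGE_SERVICE_MBEAN);
        connection.setAttribute(storageService, new Attribute(TIMEOUT_ATTRIBUTE, 0));
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: StreamingTimeoutWorkaround <host> <jmx-port>");
            return;
        }
        // Standard RMI-based JMX service URL; no credentials are supplied here.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + args[0] + ":" + args[1] + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            disableStreamingTcpUserTimeout(connector.getMBeanServerConnection());
        }
    }
}
```

The change would need to be applied to every existing node that will stream to the replacements, before the bootstrap attempt, and it does not persist across a restart unless the yaml setting is changed as well.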