[ https://issues.apache.org/jira/browse/CASSANDRA-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305935#comment-17305935 ]
Gianluca Righetto commented on CASSANDRA-15892:
-----------------------------------------------

I investigated this further and got to the bottom of it: this is an actual runtime problem, not simply a flaky test. I have a patch available; the breakdown below explains what is going on.

*Dtest*:
- The dtest initially writes data to only 2 nodes. A third node is then started and _nodetool rebuild_ is invoked, so the sstables start streaming to node 3.
- The goal of the test is to make sure the rebuild process can resume later if the first attempt fails for whatever reason.
- To simulate a failure, byteman is used on node 3 to throw an exception (only once) when it receives a message from node 2.
- That makes node 2 receive a Stream Failed message back. If that message is processed and the event-loop thread tries to close the network channel while the writer thread is parked waiting for the stream to complete, a deadlock occurs.
- By the second time the dtest invokes _nodetool rebuild_, node 2 is already deadlocked, so the test hangs indefinitely (and eventually times out on Jenkins).

*Reproduction steps*:
In your IDE, set breakpoints at the following two lines, then attach the debugger to the node 2 process and make sure the first method below (_StreamSession#closeSession_) is executed before the line of the second breakpoint.
[https://github.com/apache/cassandra/blob/4d49308a3fbc354850126a5ec128b11a3aca4007/src/java/org/apache/cassandra/streaming/StreamSession.java#L485]
[https://github.com/apache/cassandra/blob/4d49308a3fbc354850126a5ec128b11a3aca4007/src/java/org/apache/cassandra/net/AsyncStreamingOutputPlus.java#L91]

*Patch*:
The patch handles the channel's _closeFuture_ callback asynchronously. That allows the channel to be fully closed before the [error handler|https://github.com/apache/cassandra/blob/4d49308a3fbc354850126a5ec128b11a3aca4007/src/java/org/apache/cassandra/net/AsyncChannelOutputPlus.java#L100] in _AsyncChannelOutputPlus_ runs to clean things up and unpark the thread, so no deadlock occurs.

C* patch: [https://github.com/grighetto/cassandra/pull/3]
Dtest patch: [https://github.com/apache/cassandra-dtest/pull/130] (removes the @flaky annotation and adds an ignore-log pattern)

*CircleCI Results*:
JVM 11 Dtests: [https://app.circleci.com/pipelines/github/grighetto/cassandra/32/workflows/77ef945f-8cf9-4c8b-84c6-13bc48078275/jobs/198]
JVM 8 Dtests: [https://app.circleci.com/pipelines/github/grighetto/cassandra/32/workflows/dfc132e9-e4c4-4ea5-bc8e-d56e73679bcf/jobs/189]

> JAVA 8/11: test_resumable_rebuild - rebuild_test.TestRebuild
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-15892
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15892
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/python
>            Reporter: Ekaterina Dimitrova
>            Assignee: Gianluca Righetto
>            Priority: Normal
>             Fix For: 4.0-rc
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> JAVA 11:
> test_resumable_rebuild - rebuild_test.TestRebuild
> Fails locally and in [CircleCI|https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1338]
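
To illustrate the idea behind the *Patch* section of the comment, here is a minimal sketch of the pattern it describes: a writer thread parks, and a callback attached to a close future is handed to a separate executor, so the thread that completes the close is never the one running the cleanup that unparks the writer. This is not the Cassandra or Netty code and it does not reproduce the deadlock. The names _AsyncCloseSketch_, _eventLoop_, _callbacks_ and _closeFuture_ are hypothetical, and a plain _CompletableFuture_ stands in for the channel's close future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

public class AsyncCloseSketch {

    /** Returns true if the parked writer thread was released after the close. */
    public static boolean run() throws InterruptedException {
        // Hypothetical stand-ins: one thread plays the event loop,
        // another runs the close callbacks off the event loop.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        ExecutorService callbacks = Executors.newSingleThreadExecutor();

        CompletableFuture<Void> closeFuture = new CompletableFuture<>();
        AtomicBoolean closed = new AtomicBoolean(false);

        // The writer parks until it is told the channel is closed.
        // The loop guards against spurious wakeups from park().
        Thread writer = new Thread(() -> {
            while (!closed.get()) {
                LockSupport.park();
            }
        });
        writer.setDaemon(true);
        writer.start();

        // Asynchronous handling: the cleanup runs on 'callbacks', not on
        // the thread that completes closeFuture.
        closeFuture.thenRunAsync(() -> {
            closed.set(true);
            LockSupport.unpark(writer);
        }, callbacks);

        // The "event loop" finishes closing the channel without having to
        // run the cleanup itself.
        eventLoop.execute(() -> closeFuture.complete(null));

        writer.join(5000);
        boolean released = !writer.isAlive();

        eventLoop.shutdown();
        callbacks.shutdown();
        return released;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("writer released: " + run());
    }
}
```

The design point is only the use of _thenRunAsync_ with a separate executor: the close completes first, and the cleanup that unparks the writer runs afterwards on another thread. How the actual patch schedules the callback is shown in the C* pull request linked in the comment.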