[ 
https://issues.apache.org/jira/browse/CASSANDRA-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305935#comment-17305935
 ] 

Gianluca Righetto commented on CASSANDRA-15892:
-----------------------------------------------

After further investigation I got to the bottom of this issue: it is an actual 
runtime problem, not simply a flaky test. I have a patch available, and I'll 
break the problem down below to make it easier to understand what's going on:

*Dtest*:
 - The dtest is designed to initially write data to only 2 nodes, then a third 
node is started and _nodetool rebuild_ is invoked, so the sstables start to be 
streamed to node 3.
 - The goal of the test is to make sure the rebuild process is able to continue 
later, in case the first attempt fails for whatever reason.
 - To simulate a failure, byteman is used on node 3 to throw an exception (only 
once) when it receives a message from node 2.
 - That makes node 2 receive a Stream Failed message back. If that message is 
processed and the event-loop thread tries to close the network channel while 
the writer thread is parked waiting for the stream to complete, a deadlock 
occurs.
 - By the second time the dtest invokes _nodetool rebuild_, node 2 is already 
deadlocked, so the test hangs indefinitely (and eventually times out on 
Jenkins).
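
The circular wait described in the list above can be modelled with plain JDK threads. This is only a minimal sketch under stated assumptions: it uses no Cassandra or Netty classes, the class and method names (_ParkedWriterDeadlockSketch_, _closeFinishes_) are hypothetical, and a single-threaded executor stands in for the channel's event loop. It shows a close callback blocking the loop while the only task that could unpark the writer is queued behind it on the same loop.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical model of the hang; none of these names exist in Cassandra.
final class ParkedWriterDeadlockSketch {

    /** Returns true if the simulated close completes in time, false if it is stuck. */
    static boolean closeFinishes(long timeoutMillis) {
        // A single daemon thread stands in for the channel's event loop.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "event-loop");
            t.setDaemon(true);
            return t;
        });
        AtomicBoolean released = new AtomicBoolean(false);

        // The writer parks until something running on the event loop releases it.
        Thread writer = new Thread(() -> {
            while (!released.get()) {
                LockSupport.park(); // the loop guards against spurious wakeups
            }
        }, "writer");
        writer.setDaemon(true);
        writer.start();

        Future<?> close = eventLoop.submit(() -> {
            // The handler that would unpark the writer is queued on the same
            // single-threaded loop, so it cannot run until this task returns.
            eventLoop.execute(() -> {
                released.set(true);
                LockSupport.unpark(writer);
            });
            // The close callback blocks the loop waiting for the writer: circular wait.
            try {
                writer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        try {
            close.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // neither thread made progress within the timeout
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            // Release the demo threads so nothing is left stuck after the call.
            eventLoop.shutdownNow();
            released.set(true);
            LockSupport.unpark(writer);
        }
    }

    public static void main(String[] args) {
        System.out.println("close finished: " + closeFinishes(500));
    }
}
```

Running _main_ should report that the close did not finish, which mirrors the indefinite hang the dtest observes on node 2.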

*Reproduction steps*:

In your IDE, set breakpoints at the two lines linked below, attach the 
debugger to the node 2 process, and make sure the first method 
(_StreamSession#closeSession_) is executed before the line of the second 
breakpoint.

[https://github.com/apache/cassandra/blob/4d49308a3fbc354850126a5ec128b11a3aca4007/src/java/org/apache/cassandra/streaming/StreamSession.java#L485]

[https://github.com/apache/cassandra/blob/4d49308a3fbc354850126a5ec128b11a3aca4007/src/java/org/apache/cassandra/net/AsyncStreamingOutputPlus.java#L91]

*Patch*:

The patch simply handles the channel's _closeFuture_ callback asynchronously. 
This allows the channel to be fully closed, after which the [error 
handler|https://github.com/apache/cassandra/blob/4d49308a3fbc354850126a5ec128b11a3aca4007/src/java/org/apache/cassandra/net/AsyncChannelOutputPlus.java#L100]
 in _AsyncChannelOutputPlus_ is executed to clean things up and unpark the 
writer thread, so no deadlock occurs.
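
The idea behind the patch can be illustrated with plain JDK threads. This is a rough sketch under stated assumptions, not the actual change in the pull request: the names (_AsyncCloseCallbackSketch_, _closeFinishes_, the "close-callback" executor) are hypothetical, and a single-threaded executor stands in for the channel's event loop. The blocking part of the close callback is handed to a separate thread, so the loop stays free to run the handler that unparks the writer.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical model of the fix; none of these names exist in Cassandra.
final class AsyncCloseCallbackSketch {

    private static ExecutorService daemonExecutor(String name) {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true);
            return t;
        });
    }

    /** Returns true if the simulated close completes within the timeout. */
    static boolean closeFinishes(long timeoutMillis) {
        ExecutorService eventLoop = daemonExecutor("event-loop");
        ExecutorService callbackPool = daemonExecutor("close-callback");
        AtomicBoolean released = new AtomicBoolean(false);
        CountDownLatch sessionClosed = new CountDownLatch(1);

        // The writer parks until the handler on the event loop releases it.
        Thread writer = new Thread(() -> {
            while (!released.get()) {
                LockSupport.park(); // the loop guards against spurious wakeups
            }
        }, "writer");
        writer.setDaemon(true);
        writer.start();

        eventLoop.execute(() -> {
            // Close callback: the blocking wait runs on another thread,
            // so this task returns immediately and the loop keeps going.
            callbackPool.execute(() -> {
                try {
                    writer.join();
                    sessionClosed.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // The handler queued on the loop can now run and unpark the writer.
            eventLoop.execute(() -> {
                released.set(true);
                LockSupport.unpark(writer);
            });
        });

        try {
            return sessionClosed.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Release the demo threads regardless of the outcome.
            eventLoop.shutdownNow();
            callbackPool.shutdownNow();
            released.set(true);
            LockSupport.unpark(writer);
        }
    }

    public static void main(String[] args) {
        System.out.println("close finished: " + closeFinishes(2000));
    }
}
```

In this model the writer is released and the close completes, because no task on the event loop ever waits for work that can only be done by the event loop itself.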

C* patch: [https://github.com/grighetto/cassandra/pull/3]
 Dtest patch: [https://github.com/apache/cassandra-dtest/pull/130] (remove 
@flaky annotation and add ignore-log pattern)

*CircleCI Results*:

JVM 11 Dtests: 
[https://app.circleci.com/pipelines/github/grighetto/cassandra/32/workflows/77ef945f-8cf9-4c8b-84c6-13bc48078275/jobs/198]
 JVM 8 Dtests: 
[https://app.circleci.com/pipelines/github/grighetto/cassandra/32/workflows/dfc132e9-e4c4-4ea5-bc8e-d56e73679bcf/jobs/189]

> JAVA 8/11: test_resumable_rebuild - rebuild_test.TestRebuild
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-15892
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15892
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/python
>            Reporter: Ekaterina Dimitrova
>            Assignee: Gianluca Righetto
>            Priority: Normal
>             Fix For: 4.0-rc
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> JAVA 11:
> test_resumable_rebuild - rebuild_test.TestRebuild
> Fails locally and in  
> [CircleCI | 
> [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1338]]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
