[ 
https://issues.apache.org/jira/browse/CASSANDRA-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308415#comment-17308415
 ] 

Gianluca Righetto commented on CASSANDRA-15892:
-----------------------------------------------

[~jasonstack] I attached a thread dump to this ticket, which should make things 
clearer, but this is the sequence of events:

Node 2 runs {{AsyncChannelOutputPlus#flush -> 
AsyncStreamingOutputPlus#doFlush}}, which contains the following:

{code}
        ChannelPromise promise = beginFlush(byteCount, 0, Integer.MAX_VALUE);
        channel.writeAndFlush(GlobalBufferPoolAllocator.wrap(flush), promise);
{code}
        
That promise is responsible for unparking the thread later (it is 
supposed to be completed when the flush either succeeds or fails).
Still in {{AsyncChannelOutputPlus#flush}}, the thread then calls 
{{waitUntilFlushed(0, 0)}}, and at this point it is parked.
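
As a rough illustration of the park/unpark handshake described above, here is a minimal sketch. It is not Cassandra's code: {{ParkUntilFlushedSketch}} and {{flushAndWait}} are hypothetical names, and a {{CompletableFuture}} stands in for the netty {{ChannelPromise}}. The point is only that the waiting thread stays parked until something completes the promise.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch; CompletableFuture stands in for netty's ChannelPromise.
class ParkUntilFlushedSketch {

    /** Parks the calling thread until the promise completes, successfully or not. */
    static boolean flushAndWait(CompletableFuture<Void> promise) {
        Thread waiter = Thread.currentThread();
        AtomicBoolean done = new AtomicBoolean(false);

        // The completion callback is the only thing that wakes the waiter.
        promise.whenComplete((ok, err) -> {
            done.set(true);
            LockSupport.unpark(waiter);
        });

        // park() may return spuriously, so re-check the flag in a loop.
        while (!done.get()) {
            LockSupport.park();
        }
        return !promise.isCompletedExceptionally();
    }

    public static void main(String[] args) {
        CompletableFuture<Void> promise = new CompletableFuture<>();
        // Simulate the event loop finishing the flush a little later.
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            promise.complete(null);
        }).start();
        System.out.println("flush succeeded: " + flushAndWait(promise));
    }
}
```

If nothing ever completes the promise, the loop never exits, which is the situation the rest of this comment describes.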

Node 2 then receives a SESSION_FAILED message, so it tries to close the channel. 
The problem is that the channel has a close listener that relies on a 
synchronized method, {{StreamSession#onChannelClose}}, and since the thread 
holding the session's monitor is parked, the listener blocks. In other words, 
netty can't properly close the channel and the flush promise above is never 
completed, so the thread is never unparked, hence the deadlock.
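
The cycle can be reproduced in miniature with the sketch below. It makes several assumptions: {{Session}} is a hypothetical stand-in for {{StreamSession}}, a {{CompletableFuture}} replaces the flush promise, the waiting thread is assumed to hold the session's monitor while it waits, and a timeout is added only so the demo terminates (the real wait has none).

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the lock cycle; none of these names are Cassandra's.
class FlushCloseDeadlockSketch {

    static class Session {
        final CompletableFuture<Void> flushPromise = new CompletableFuture<>();
        final CountDownLatch flushing = new CountDownLatch(1);

        // Waits for the flush promise while holding the session monitor.
        synchronized boolean flushAndWait(long timeoutMs) throws Exception {
            flushing.countDown();
            try {
                flushPromise.get(timeoutMs, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // without the demo timeout this would wait forever
            }
        }

        // Close listener: needs the same monitor before it can fail the promise.
        synchronized void onChannelClose() {
            flushPromise.completeExceptionally(new IOException("channel closed"));
        }
    }

    /** Returns true only if the waiter was woken by the promise before the timeout. */
    static boolean run(long timeoutMs) throws Exception {
        Session session = new Session();
        Thread closer = new Thread(() -> {
            try {
                session.flushing.await(); // wait until the flusher holds the monitor
            } catch (InterruptedException e) {
                return;
            }
            session.onChannelClose();     // blocks until the flusher gives up
        });
        closer.setDaemon(true);
        closer.start();
        boolean woken = session.flushAndWait(timeoutMs);
        closer.join();
        return woken;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("woken by close listener: " + run(200));
    }
}
```

In this sketch the close listener can only run after the waiter times out and releases the monitor, so the promise never wakes it. Making the wait happen outside the monitor, or making the close listener non-blocking, would break the cycle; which of these the actual fix uses is not stated here.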

> JAVA 8/11: test_resumable_rebuild - rebuild_test.TestRebuild
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-15892
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15892
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/python
>            Reporter: Ekaterina Dimitrova
>            Assignee: Gianluca Righetto
>            Priority: Normal
>             Fix For: 4.0-rc
>
>         Attachments: CASSANDRA-15892-07babf3c-node2-deadlock-thread-dump.txt, 
> Screenshot 2021-03-23 at 19.37.35.png, Screenshot 2021-03-23 at 19.53.51.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> JAVA 11:
> test_resumable_rebuild - rebuild_test.TestRebuild
> Fails locally and in 
> [CircleCI|https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1338]


