[ https://issues.apache.org/jira/browse/CASSANDRA-16061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189074#comment-17189074 ]
Berenguer Blasi commented on CASSANDRA-16061:
---------------------------------------------

This is the failure, for reference, as the ci-cass logs are gone now:

{noformat}
===Flaky Test Report===

test_move_forwards_and_cleanup failed; it passed 0 out of the required 1 times.
    <class 'ccmlib.node.TimeoutError'>
    02 Sep 2020 07:18:39 [node4] Missing: ['Starting listening for CQL clients']:
INFO [main] 2020-09-02 09:17:30,390 YamlConfigura.....
See system.log for remainder
    [<TracebackEntry /media/sf_VBoxSharedFolder/dtestsvbox/transient_replication_ring_test.py:301>,
     <TracebackEntry /media/sf_VBoxSharedFolder/dtestsvbox/transient_replication_ring_test.py:231>,
     <TracebackEntry /media/sf_VBoxSharedFolder/dtestsvbox/transient_replication_ring_test.py:47>,
     <TracebackEntry /media/sf_VBoxSharedFolder/dtestsvbox/src/ccm/ccmlib/node.py:798>,
     <TracebackEntry /media/sf_VBoxSharedFolder/dtestsvbox/src/ccm/ccmlib/node.py:591>,
     <TracebackEntry /media/sf_VBoxSharedFolder/dtestsvbox/src/ccm/ccmlib/node.py:548>]

===End Flaky Test Report===
{noformat}

It's hard to reproduce. There is some exotic race where [bootstrap.get()|https://github.com/apache/cassandra/blob/23ba48aa935d3f81e66b65285fa8e7972f94dcfe/src/java/org/apache/cassandra/service/StorageService.java#L1584] blocks because a default connection never completes, blocking [here|https://github.com/apache/cassandra/blob/23ba48aa935d3f81e66b65285fa8e7972f94dcfe/src/java/org/apache/cassandra/streaming/DefaultConnectionFactory.java#L50]. That just times out the test. It shouldn't happen, as the default connection has a built-in timeout. Even forcing a timeout myself when waiting on it doesn't do the trick. Somehow connecting to node1 is not possible.

I have been debugging this as much as I can. The netty code takes some time to penetrate, and I don't have a full grasp of it, although what I saw made sense.
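For readers less familiar with the symptom, here is a minimal sketch of the difference between an unbounded {{get()}} and a bounded one on a future that never completes. It is not Cassandra code and not a proposed fix (as noted above, forcing a timeout did not resolve the hang). It uses {{java.util.concurrent.CompletableFuture}} as a stand-in for the bootstrap future, and the class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; CompletableFuture stands in for the real bootstrap future.
public class BoundedWaitSketch {

    /**
     * Waits for the future with a deadline.
     * Returns true if it completed in time, false if the wait timed out.
     * A plain future.get() here would park the calling thread indefinitely
     * when nothing ever completes the future.
     */
    public static boolean completesWithin(CompletableFuture<?> future, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Nothing ever completes this future, similar to a connection that never finishes.
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        System.out.println("pending future completed: "
                + completesWithin(neverCompletes, 200, TimeUnit.MILLISECONDS));

        // An already completed future returns immediately.
        CompletableFuture<String> done = CompletableFuture.completedFuture("ok");
        System.out.println("completed future completed: "
                + completesWithin(done, 200, TimeUnit.MILLISECONDS));
    }
}
```

A bounded wait like this only turns the hang into a visible timeout; it does not explain why the underlying connection to node1 never completes.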
{{bootstrap.get()}} blocks on AbstractFuture [parking|https://github.com/google/guava/blob/v27.0/guava/src/com/google/common/util/concurrent/AbstractFuture.java#L523] the thread. If you google a bit, you'll find many people hitting blocked threads in this area, apparently because Guava makes some assumptions.

I am taking a break here, as I am not making progress and need to go back to the drawing board on how to approach this. If somebody wants to take up the challenge, feel free to assign it.

> transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_and_cleanup
> ------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16061
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16061
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/python
>            Reporter: Ekaterina Dimitrova
>            Assignee: Berenguer Blasi
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> Failing here, also locally:
> [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/312/workflows/da4ce69c-e778-467e-b9f3-27ab166a8321/jobs/1945]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)