[ https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981236#comment-15981236 ]
Christian Esken edited comment on CASSANDRA-13265 at 4/24/17 2:32 PM:
----------------------------------------------------------------------

First, here is a summary and the question I have: The tests work if I add "DatabaseDescriptor.daemonInitialization();" to the unit test of the affected branches. Is this a good idea, [~aweisberg]?

Now the long story:

This is the status for branch cassandra-13265-3.0:
- (/) Running unit tests in Eclipse: Works
- (/)/(?) CircleCI: All normal tests work fine. "Your build ran 4754 tests in junit with 0 failures". The build fails for me with: Target "stress-test" does not exist in the project "apache-cassandra". As "ant test" worked, I would guess that the patch is fine. I will reverify the specific unit test locally.

This is the status for branches cassandra-13265-3.11 and cassandra-13265-trunk:
- (/) Running unit tests in Eclipse: Works
- (x) Running unit tests with CircleCI or "ant test" fails, due to a non-initialized DatabaseDescriptor. When I add the following to the unit test of cassandra-13265-3.11, the unit test works.
{code}
DatabaseDescriptor.daemonInitialization();
{code}
Without that call, the test fails with:
{code}
[junit] Null Test: Caused an ERROR
[junit] null
[junit] java.lang.ExceptionInInitializerError
[junit] at java.lang.Class.forName0(Native Method)
[junit] at java.lang.Class.forName(Class.java:264)
[junit] Caused by: java.lang.NullPointerException
[junit] at org.apache.cassandra.config.DatabaseDescriptor.getWriteRpcTimeout(DatabaseDescriptor.java:1400)
[junit] at org.apache.cassandra.net.MessagingService$Verb$1.getTimeout(MessagingService.java:121)
[junit] at org.apache.cassandra.net.OutboundTcpConnectionTest.<clinit>(OutboundTcpConnectionTest.java:43)
{code}

> Expiration in OutboundTcpConnection can block the reader Thread
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-13265
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 1.8.0_112-b15)
> Linux 3.16
>            Reporter: Christian Esken
>            Assignee: Christian Esken
>             Fix For: 3.0.x
>
>         Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a
> Cassandra cluster fails to communicate with the other nodes. This can happen at any time, during peak load or low load. Restarting that single node fixes the issue.
> Before going into details, I want to state that I have analyzed the situation and am already developing a possible fix. Here is the analysis so far:
> - A thread dump in this situation showed 324 threads in the OutboundTcpConnection class that want to lock the backlog queue for doing expiration.
> - A class histogram shows 262508 instances of OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain number of queued messages, it starts thrashing itself to death. Each of the threads fully locks the queue for reading and writing by calling iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operations can it progress with actually writing to the queue.
> - Reading: Is also blocked, as 324 threads try to do iterator.next() and fully lock the queue.
> This means: Writing blocks the queue for reading, and readers might even be starved, which makes the situation even worse.
> -----
> The setup is:
> - 3-node cluster
> - replication factor 2
> - Consistency LOCAL_ONE
> - No remote DCs
> - high write throughput (100000 INSERT statements per second and more during peak times).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
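
The contention described in the issue (many threads each walking the whole backlog with iterator.next()) can be illustrated with a minimal sketch. It is not the actual CASSANDRA-13265 patch. It assumes the backlog is a java.util.concurrent.LinkedBlockingQueue, and the names GuardedBacklog, Message, expiresAt and expiring are hypothetical. The idea shown is one possible mitigation: let at most one thread run an expiration pass at a time, so the other threads skip the pass instead of queuing up for the lock.

```java
import java.util.Iterator;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; not the actual CASSANDRA-13265 patch.
public class GuardedBacklog {

    // Hypothetical stand-in for OutboundTcpConnection$QueuedMessage.
    public static final class Message {
        final long expiresAt;

        public Message(long expiresAt) {
            this.expiresAt = expiresAt;
        }
    }

    private final LinkedBlockingQueue<Message> backlog = new LinkedBlockingQueue<>();

    // True while some thread is running an expiration pass.
    private final AtomicBoolean expiring = new AtomicBoolean(false);

    public void enqueue(Message m) {
        backlog.add(m);
    }

    public int size() {
        return backlog.size();
    }

    // Drops every message whose expiresAt is <= now.
    // Returns the number of dropped messages, or -1 if another thread
    // is already expiring and this call skipped the pass.
    public int expire(long now) {
        if (!expiring.compareAndSet(false, true)) {
            return -1; // someone else is iterating; do not pile up on the lock
        }
        try {
            int dropped = 0;
            for (Iterator<Message> it = backlog.iterator(); it.hasNext();) {
                if (it.next().expiresAt <= now) {
                    it.remove();
                    dropped++;
                }
            }
            return dropped;
        } finally {
            expiring.set(false);
        }
    }

    public static void main(String[] args) {
        GuardedBacklog b = new GuardedBacklog();
        b.enqueue(new Message(10L));
        b.enqueue(new Message(50L));
        System.out.println(b.expire(20L)); // 1: only the first message has expired
        System.out.println(b.size());      // 1: the second message is still queued
    }
}
```

With this guard the pass itself still iterates the whole queue, so it does not remove the cost of a single expiration run. It only keeps hundreds of threads from repeating that run concurrently.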