[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-10-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197193#comment-16197193
 ] 

ASF GitHub Bot commented on CASSANDRA-13265:


Github user christian-esken commented on the issue:

https://github.com/apache/cassandra/pull/95
  
Closing PR, as it has been merged in all relevant banches. See 
https://issues.apache.org/jira/browse/CASSANDRA-13265


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.14, 3.11.0, 4.0
>
> Attachments: cassandra-13265-2.2-dtest_stdout.txt, 
> cassandra-13265-trun-dtest_stdout.txt, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-10-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197194#comment-16197194
 ] 

ASF GitHub Bot commented on CASSANDRA-13265:


Github user christian-esken closed the pull request at:

https://github.com/apache/cassandra/pull/95


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.14, 3.11.0, 4.0
>
> Attachments: cassandra-13265-2.2-dtest_stdout.txt, 
> cassandra-13265-trun-dtest_stdout.txt, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-10-09 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197195#comment-16197195
 ] 

Christian Esken commented on CASSANDRA-13265:
-

PR closed: https://github.com/apache/cassandra/pull/95

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.14, 3.11.0, 4.0
>
> Attachments: cassandra-13265-2.2-dtest_stdout.txt, 
> cassandra-13265-trun-dtest_stdout.txt, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-10-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195765#comment-16195765
 ] 

ASF GitHub Bot commented on CASSANDRA-13265:


Github user jeffjirsa commented on the issue:

https://github.com/apache/cassandra/pull/95
  
Hi @christian-esken , CASSANDRA-13265 has been merged, do you mind closing 
this PR? (we can't close it for you)



> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.14, 3.11.0, 4.0
>
> Attachments: cassandra-13265-2.2-dtest_stdout.txt, 
> cassandra-13265-trun-dtest_stdout.txt, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-05-05 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998310#comment-15998310
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


I'm sorry :-( I'll try to get a more concrete handle on fix versions earlier 
next time. It's always been something I've struggled calibrating because 
different committers are all over the map in terms of how far back they will 
backport things.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.14, 3.11.0, 4.0
>
> Attachments: cassandra-13265-2.2-dtest_stdout.txt, 
> cassandra-13265-trun-dtest_stdout.txt, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-05-05 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998277#comment-15998277
 ] 

Christian Esken commented on CASSANDRA-13265:
-

It is fine, I do not use 2.2. I was just wondering because you asked me to 
start with 2.2, which required more effort and made things a bit more 
complicated. If you hadn't asked, I would have done patches just for the HEAD 
of version 3 (cassandra-3.0) and version 4 (trunk). 

So, mission complete. Thanks Ariel for guiding me through my first Cassandra 
patch. :-)

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.14, 3.11.0, 4.0
>
> Attachments: cassandra-13265-2.2-dtest_stdout.txt, 
> cassandra-13265-trun-dtest_stdout.txt, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-05-04 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996951#comment-15996951
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


I decided to start from 3.0 because 2.2 is really critical fixes only and this 
issue has been broken since 2.1 without eliciting a lot of complaints until 
now. It's a judgement call. I took a quick poll and the response was that 3.0+ 
was the place to start.

Is this something that's impacting you in 2.2?

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.14, 3.11.0, 4.0
>
> Attachments: cassandra-13265-2.2-dtest_stdout.txt, 
> cassandra-13265-trun-dtest_stdout.txt, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-05-04 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996322#comment-15996322
 ] 

Christian Esken commented on CASSANDRA-13265:
-

Thanks. I have seen your commit in three branches. I did not yet see the 
changes in cassandra-2.2, when looking at  
https://github.com/apache/cassandra/commits/cassandra-2.2 . Is this an 
omission, or is the github repo is not current?

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.14, 3.11.0, 4.0
>
> Attachments: cassandra-13265-2.2-dtest_stdout.txt, 
> cassandra-13265-trun-dtest_stdout.txt, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-05-03 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994820#comment-15994820
 ] 

Christian Esken commented on CASSANDRA-13265:
-

Done. Squashed and pushed. 

I also removed my "stress-test" patch in the 3.0 branch, as it is not related 
and also does not look like a proper fix. As a reference, here is the patch:
{code}
-- case $CIRCLE_NODE_INDEX in 0) ant eclipse-warnings; ant test ;; 1) ant 
long-test ;; 2) ant test-compression ;; 3) ant stress-test ;;esac:
+- case $CIRCLE_NODE_INDEX in 0) ant eclipse-warnings; ant test ;; 1) ant 
long-test ;; 2) ant test-compression ;;esac:
{code}


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra-13265-2.2-dtest_stdout.txt, 
> cassandra-13265-trun-dtest_stdout.txt, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-05-01 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15991146#comment-15991146
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


So the dtests failing on the branch don't match the failures on trunk, but I 
looked at some of the other branches and they are also failing similar random 
tests. So I am running trunk using the branch job to see what the test failures 
look like in that scenario. I don't think you broke anything it's just doing 
due diligence is a pain right now.

I looked at the utests and they look fine. Right now it runs 4 containers, but 
the first container is most important as that is where the tests are closest to 
zero failures since we only used to run those on every commit. The others 
failures don't look related (famous last words).

Meanwhile can you squash the branches that aren't squashed?

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra-13265-2.2-dtest_stdout.txt, 
> cassandra-13265-trun-dtest_stdout.txt, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-27 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986786#comment-15986786
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


You don't need to get all the tests passing. Only the ones broken specifically 
by your changes. So you have to run the tests on the base branch 
(cassandra-2.2, cassandra-3.0, cassandra-3.11, cassandra-trunk). Compare the 
results to 
https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-2.2-test-all/ 
or the one specific to the version of your branch.

You also need to be aware that a test might just be unreliable in CircleCI. The 
unit tests seem to be reliable to me when I was running them with my 
circle.yml, but I am not sure about the one on trunk.

Certainly it runs additional tests which weren't running all the time before 
Circle and I don't think they were reliable when we started running them.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-27 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986426#comment-15986426
 ] 

Christian Esken commented on CASSANDRA-13265:
-

I am fixing the branches, while you work on the dtests. I will continue 
updating this comment as long as I work on it.
|| branch || sqaushed? || Unit Tests OK? || comment ||
|  cassandra-13265-3.0 |  no | (CircleCI  currently running) | |
|  cassandra-13265-3.11 | no | CircleCI  (/) | |
|  cassandra-13265-2.2  | yes | ant test   (/) | CicrleCI hasn't kicked off 
tests for the branch | 
|  cassandra-13265-trunk  | no | CircleCI (?)  | My unit test works. Bu there 
is a strange unrelated unit test failure: ClassNotFoundException: 
org.apache.cassandra.stress.CompactionStress |


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-26 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985008#comment-15985008
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


I think the stress-test target just doesn't exist in 3.0.

Adding {{DatabaseDescriptor.daemonInitialization()}} is the correct fix I think.

I a struggling to just get a run of the dtests for 2.2 and trunk before I 
merge. This runs output seems corrupted 
https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/30/

The trunk trunk run doesn't even appear.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-24 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981236#comment-15981236
 ] 

Christian Esken commented on CASSANDRA-13265:
-

First here is the summary: The tests work if I add 
"DatabaseDescriptor.daemonInitialization();" to the unit test of the affected 
branches. Is this a good idea, [~aweisberg]?

Now the long story:

This is the status for branch cassandra-13265-3.0:
- (/) Running unit tests in Eclipse: Works
 - (/)/(?) CircleCI: All normal tests work fine. "Your build ran 4754 tests in 
junit with 0 failures".  The build fails for me with: Target "stress-test" does 
not exist in the project "apache-cassandra". As "ant test" worked, I would 
guess that the patch is fine. I will reverify the specific unit test locally


This is the status for branch cassandra-13265-3.11 and cassandra-13265-trunk:
- (/) Running unit tests in Eclipse: Works
- (x) Running unit tests with CircleCI or "ant test" fails, due to 
non-initialized DatabaseDescriptor.
  When I add the following to the unit test of cassandra-13265-3.11, the unit 
test works. 
{code}
   DatabaseDescriptor.daemonInitialization();
{code}

{code}
[junit] Null Test:  Caused an ERROR
[junit] null
[junit] java.lang.ExceptionInInitializerError
[junit] at java.lang.Class.forName0(Native Method)
[junit] at java.lang.Class.forName(Class.java:264)
[junit] Caused by: java.lang.NullPointerException
[junit] at 
org.apache.cassandra.config.DatabaseDescriptor.getWriteRpcTimeout(DatabaseDescriptor.java:1400)
[junit] at 
org.apache.cassandra.net.MessagingService$Verb$1.getTimeout(MessagingService.java:121)
[junit] at 
org.apache.cassandra.net.OutboundTcpConnectionTest.(OutboundTcpConnectionTest.java:43)
{code}

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-21 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15978748#comment-15978748
 ] 

Christian Esken commented on CASSANDRA-13265:
-

I pulled the changes, fixed the CHANGES.txt and pushed everything again. Now 
CircleCI is kicking off the builds for the branches. Looks like we are getting 
somewhere. :-)

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-21 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15978730#comment-15978730
 ] 

Christian Esken commented on CASSANDRA-13265:
-

Hmm, I rebased on 2.2 and trunk. I am surprised that 3.11 is not current, as 
3.11 is not that old. I will now clean my repo, including deleting the bad 
"13625" branches and rebasing 3.0 and 3.11. 

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-20 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977067#comment-15977067
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


The branches are out of date. 3.11 ones doesn't have a circle.yml. They really 
need to be rebased onto something relatively recent so they can run in Circle.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-20 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976867#comment-15976867
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


That circle.yml is broken. For instance OutboundTcpConnectionTest failed to 
start because it NPEed in DatabaseDescriptor but it's not in the summary. Or 
maybe it's Circle. I'm not sure. I'm also not sure why your are only getting 
builds for trunk.

If you fix the unit test I think it's ready to commit. The dtests ran on Apache 
Jenkins for everything except trunk. Trunk timed out on one test so I'm going 
to try it again.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-20 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976279#comment-15976279
 ] 

Christian Esken commented on CASSANDRA-13265:
-

Unfortunately some test failed, not because of bugs but due to technical 
issues, mostly with 
"com.datastax.driver.core.exceptions.NoHostAvailableException". Are these the 
"dtest" issues in CircleCI you mentioned?

I tried to run the tests locally, but even "ant test" runs > 1 hour and keeps 
failing  with Timeout, NoHostAvailableException, or similar. I don't know why 
the tests fail, as my Laptop should be capable of doing it. I am frequently 
running a 3-node Cassandra on it via ccm and that works properly.

Currently I think I did all I can do. Let me know if I can check something 
else. What is your proposal how do we continue here?

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-19 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974885#comment-15974885
 ] 

Christian Esken commented on CASSANDRA-13265:
-

No problem. I was away for Easter, so I did not even notice you being busy. 
I just started my CircleCI test for the first time. Its working on the first 
branch (trunk) for an hour and is not complete yet, so I guess with all the 
branches it can take a day to complete. I have restarted the build with more 
parallelism and will send an update whenever that is complete. Hopefully that 
will create a more acceptable turnaround time. 
https://circleci.com/gh/christian-esken/cassandra/3

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-19 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974671#comment-15974671
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


Sorry I just had a really busy week last week and I've been trying to get 
Circle to the point it can run the dtests. I'm mostly there it's just a few 
failing tests that remain.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-19 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974590#comment-15974590
 ] 

Christian Esken commented on CASSANDRA-13265:
-

bq. For CHANGES.TXT the entry should go at the top of the list of entries for 
the version the change is for. I don't know why.
I also haven't seen this mentioned. Probably someone could add that to 
https://wiki.apache.org/cassandra/HowToContribute or 
http://cassandra.apache.org/doc/latest/development/how_to_commit.html . Anyhow 
I have fixed that.

bq. set up with CircleCI [...] Also you transposed 13625 and 13265 
I changed the branches to correct the transposing 13625 and 13265. I didn't 
find any other place than the branch names. I will try to find out about how to 
do the CircleCI stuff. Meanwhile here are the updated links:

https://github.com/christian-esken/cassandra/commits/cassandra-13265-2.2
https://github.com/christian-esken/cassandra/commits/cassandra-13265-3.0
https://github.com/christian-esken/cassandra/commits/cassandra-13265-3.11
https://github.com/christian-esken/cassandra/commits/cassandra-13265-trunk

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-13 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967870#comment-15967870
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


Nevermind I'll run them. I have to anyways for the dtests until we get the 
dtests running in CircleCI. I updated my copies. 

For CHANGES.TXT the entry should go at the top of the list of entries for the 
version the change is for. I don't know why.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-13 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967857#comment-15967857
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


OK, don't forget to get set up with CircleCI and post the block with the test 
results.

Also you transposed 13625 and 13265 :-)

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-13 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967484#comment-15967484
 ] 

Christian Esken commented on CASSANDRA-13265:
-

There were different reasons why the build failed, e.g. somehow Eclipse did not 
pick up the build parameters for 2.2 after "ant generate-eclipse-files" and the 
build was done with Java 8 language level (lambdas). Looks like building and 
testing in Eclipse alone is not enough, so I redid everything manually in the 
console and fixed the issues. As you recommended, I have created branches that 
follow your naming  (cassandra-13625-3.0) with squashed commits. The new 
branches are:

https://github.com/christian-esken/cassandra/commits/cassandra-13625-2.2
https://github.com/christian-esken/cassandra/commits/cassandra-13625-3.11
https://github.com/christian-esken/cassandra/commits/cassandra-13625-3.0
https://github.com/christian-esken/cassandra/commits/cassandra-13625-trunk

About CHANGES.TXT: I added changes to  all branches where in the appropriate 
versions. Please check, as the naming conventions within Cassandra are still 
not clear to me(e.g. there exists a 3.11 branch, a 3.0.11 release and a 3.11.0 
changelog entry).

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-11 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964943#comment-15964943
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


There seem to be some build issues in various branches? Maybe because I rebased?

You should register with CircleCI so it will automatically build and run the 
unit tests for you out of your repo when you commit.

||Code|utests|dtests||
|[2.2|https://github.com/aweisberg/cassandra/pull/new/cassandra-13265-2.2]|[utests|https://circleci.com/gh/aweisberg/cassandra/tree/cassandra-13265-2%2E2]||
|[3.0|https://github.com/aweisberg/cassandra/pull/new/cassandra-13265-3.0]|[utests|https://circleci.com/gh/aweisberg/cassandra/tree/cassandra-13265-3%2E0]||
|[3.11|https://github.com/aweisberg/cassandra/pull/new/cassandra-13265-3.11]|[utests|https://circleci.com/gh/aweisberg/cassandra/tree/cassandra-13265-3%2E11]||
|[trunk|https://github.com/aweisberg/cassandra/pull/new/cassandra-13265-trunk]|[utests|https://circleci.com/gh/aweisberg/cassandra/tree/cassandra-13625-trunk]||


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-11 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964613#comment-15964613
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


Squashing is preferred, but I like to keep the history. When I squash I create 
a second -squashed branch to hold the squash commit. I never delete branches 
and tolerate the cluttered namespace. If we used pull requests I would delete 
branches since the pull request preserves the information, but we don't :-( 
Since people tend to work on multiple tickets they don't name the branch 
cassandra-3.0 they do something like cassandra-13625-3.0. The commit process I 
follow is http://cassandra.apache.org/doc/latest/development/how_to_commit.html.

For the commit message don't list multiple reviewers just the one in the JIRA 
(me). I have been told that line is automatically parsed so we want to stick to 
the expected format. Also some OCD people want a line break in between the 
first and last line of the commit message.

Having CHANGES.TXT in the patch is helpful so I don't forget to add it in one 
branch. If it's not there and I follow the commit process I have to add it at 
each branch. For the entry also include the ticket number in parens at the end.

I'll kick off the tests now.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-11 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964522#comment-15964522
 ] 

Christian Esken commented on CASSANDRA-13265:
-

Done.  2 organizational topics left:

I will add the required line to the commit message. Looks OK?!?
 bq. patch by Christian Esken; reviewed by Ariel Weisberg  and Jason Brown for 
CASSANDRA-13265
My proposals for the CHANGES.txt would be the following text. Can you do that, 
Ariel? I do not know in which versions to add that, as they are upcoming 
versions.
 bq. Expire OutboundTcpConnection messages by a single Thread 

Here are the branches.The cassandra-3.0 is already squashed. If that branch is 
OK, I will also squash the other 3 branches.

https://github.com/christian-esken/cassandra/commits/cassandra-3.0
https://github.com/christian-esken/cassandra/commits/cassandra-3.11
https://github.com/christian-esken/cassandra/commits/trunk
https://github.com/christian-esken/cassandra/commits/cassandra-2.2

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-10 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963230#comment-15963230
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


Great! Most regular contributors are linking to a branch in their fork, but 
there are issues with that (people force update, delete branches). You don't 
have to attach a patch to get it to submit patch. 

I do need a patch for each branch. 2.2, 3.0, 3.11 and trunk. We try to push the 
work of dealing with the merge conflicts to the assignees over the committers. 
Use whatever method you prefer to submit the code.

When you have all of the patches I'll kick off test runs. It's going to take a 
while for the dtests to run right now.


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 3.0.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-04-10 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963095#comment-15963095
 ] 

Christian Esken commented on CASSANDRA-13265:
-

Done. My highest priority is the 3.0 branch. I created a patch (single file, 
squashed) for 3.0, that I also applied to my Github fork 
https://github.com/christian-esken/cassandra/commits/cassandra-3.0 . Please 
have a look at the attached file 
0001-3.0-Expire-OTC-messages-by-a-single-Thread.patch .

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-16 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928300#comment-15928300
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


Great! Almost there.

* [Use Config.PROPERTY_PREFIX with the property 
name.|https://github.com/apache/cassandra/pull/95/files#diff-c7ef124561c4cde1c906f28ad3883a88R139]
* [Still using big I integer 
here.|https://github.com/apache/cassandra/pull/95/files#diff-71f06c193f5b5e270cf8ac695164f43aR2685]
* [And 
here.|https://github.com/apache/cassandra/pull/95/files#diff-097eb77c5d9d80a48e7547fbb81822caR62]

After that the last step is to create 2.2, 3.0, and 3.11 branches. Then I'll 
run the tests for each of the branches and if they look good I'll commit.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-15 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926348#comment-15926348
 ] 

Christian Esken commented on CASSANDRA-13265:
-

bq.  I still think it's a good idea to avoid hard coding this kind of value so 
operators have options without recompiling. [...] A java property. 
(/) static final int BACKLOG_PURGE_SIZE = 
Integer.getInteger("OTC_BACKLOG_PURGE_SIZE", 1024);

bq. I think we should log the drops especially due to timeouts as they happen 
rather than at the end.
(/) I agree, and I did not change that. After the loop I simply add the 
unprocessed messages, which happen due to {{break inner;}}. I'll push today, so 
you have a chance to see that. 

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-15 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926015#comment-15926015
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


bq. I see. I thought it would just clutter the cassanda.yaml, as nobody ever 
would change the value. But if you feel it is important enough, I can do so.
So not a property in the YAML. A java property. So something like 
"Integer.getInteger("property", 1024);" instead of just "1024".
bq. PS: I will be soon on vacation for two weeks, so please don't wonder why 
you do not see any updates from me.
Enjoy


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-15 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925815#comment-15925815
 ] 

Christian Esken commented on CASSANDRA-13265:
-

bq. it's very low effort to add a property
I see. I thought it would just clutter the cassanda.yaml, as nobody ever would 
change the value. But if you feel it is important enough, I can do so.

bq. Not a huge deal, but I think we should log the drops especially due to 
timeouts as they happen
OK. I can follow your argument. I will rewrite it, also adding comments with 
explanation. 

PS: I will be soon on vacation for two weeks, so please don't wonder why you do 
not see any updates from me.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-14 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924658#comment-15924658
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


bq. I still think it's a good idea to avoid hard coding this kind of value so 
operators have options without recompiling.
I would like BACKLOG_PURGE_SIZE to be kept hard coded for now. It has been 
there for quite some time hard coded, and in the long term I do not think it 
should be kept as-is. For example it would better to purge on the number of 
actually DROPPABLE messages in the queue (or their weight if you want to extend 
even further)
I agree but I don't want to add more to this change set and I want to backport 
it to at least 2.2. I suggest it primarily because it's very low effort to add 
a property as opposed to a full YAML + JMX config option.

bq. At Github? I can do so. But no PR, right? I saw it mentioned that one 
should not open PR's for Cassandra on Github as they cannot be handled (it's 
just a mirror).
Project policy is to not use PRs even for comments. Code review comments go in 
JIRA. What I and some others do is link to a compare view of what we intend to 
merge. Don't delete the branches your links point to because it invalidates the 
information in JIRA.

bq. I looked in more detail, and I think this a flaw in the original code 
"suggested" me to do this: drainedMessages.clear() is called twice, and one 
time would be enough. IMO it would be better to only keep the one at the end of 
the method and also do the drop-counting for the drained messages there. This 
would also cover a rather exotic case of the catch (Exception e) in the run() 
method. If an Exception is thrown, then there is a danger of nothing being 
counted.
So there is the issue of not counting drops as they happen. The thread can 
block for a long time writing messages so if we wait to count drops they won't 
show up quite as promptly. Not a huge deal, but I think we should log the drops 
especially due to timeouts as they happen rather than at the end. It's 
definitely true that we don't need to clear the backlog twice and we could log 
the drops due to items remaining in the backlog there. The loop decrements a 
counter every time it runs so you know how many remaining elements are being 
dropped if you hit break inner.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-14 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924135#comment-15924135
 ] 

Christian Esken commented on CASSANDRA-13265:
-

Marking off feedback:
- (/) A smaller value could potentially ...
- (/)  You shouldn't need the check for null? Usually we "just" make sure its 
not null
OK.  I thought it might be possible to set this to null, but even JConsole 
refuses it.
- (/) Using a boxed integer makes it a bit confusing ...
  ACK. Happily changed that. Looks like I followed bad examples.
- (/) Avoid unrelated whitespace changes.
  OK. I missed that after moving the field.
- (x)  I still think it's a good idea to avoid hard coding this kind of value 
so operators have options without recompiling.
- (/) Fun fact. You don't need backlogNextExpirationTime to be volatile. You 
can piggyback on backlogExpirationActive to get the desired effects from the 
Java memory model. [...] I wouldn't change it ...
Yes.  I am  aware of that and using that technique often. Here I did not like 
it as visibility effects would not be obvious, unless explicitly documented. 
You are probably aware what Brian Goetz says about piggybacking in his JCIP 
book. BTW: A more obvious usage is for me in status fields, e.g. to make the 
results of a Future visible. I won't change it either, so marking this as done.
- (x)Breaking out the uber bike shedding this could be maybeExpireMessages.
- (x)Swap the order of these two stores so it doesn't do extra expirations.
  Ouch. That hurts. I was quite tired yesterday. :-|
- (?)  This is not quite correct you can't count drainCount as dropped because 
some of the drained messages may have been sent during iteration.
  In progress. I am wondering if we should include fixing the drop count it in 
this patch, as it will likely create even more conflicts. OTOH I have to touch 
some related methods anyhow. I will think about it.


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-14 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923915#comment-15923915
 ] 

Christian Esken commented on CASSANDRA-13265:
-

bq. This is not quite correct you can't count drainCount as dropped because 
some of the drained messages may have been sent during iteration.
I looked in more detail, and I think this a flaw in the original code 
"suggested" me to do this:  {{drainedMessages.clear()}} is called twice, and 
one time would be enough. IMO it would be better to only keep the one at the 
end of the method and also do the drop-counting for the drained messages there. 
This would also cover a rather exotic case of the {{catch (Exception e)}} in 
the {{run()}} method. If an Exception is thrown, then there is a danger of 
nothing being counted.

bq. Using a boxed integer
bq. You shouldn't need the check for null?
>From a brief check, this refers to a similar point. I saw many configuration 
>options to allow null and followed that route. I am absolutely happy to make 
>it non-boxed.

bq. The right way to do it is create a branch for all the versions where this 
is going to be fixed. Start at 2.2, merge to 3.0, merge to 3.11, then merge to 
trunk. 
At Github? I can do so. But no PR, right? I saw it mentioned that one should 
not open PR's for Cassandra on Github as they cannot be handled (it's just a 
mirror).


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-13 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15908545#comment-15908545
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


The right way to do it is create a branch for all the versions where this is 
going to be fixed. Start at 2.2, merge to 3.0, merge to 3.11, then merge to 
trunk. 

You can get away with one field. Check the next expiration time, CAS it to 
{{Long.MAX_VALUE}}, then when you are done store the next expiration time in 
it. Doing it with two fields also works. I wouldn't bother changing it.

If you set the default value via a property it will work fine. It will set it 
once when it loads the class at startup and then overwrite it with YAML 
contents or JMX invocations. Generally we do set the default value in config 
via assignment. Doing it via property gives yet another way to set the value, 
but it's the least important. It's more useful for things which aren't in the 
YAML. Adding YAML properties adds a bit of boiler plate.

* [A smaller value could potentially expire messages slightly sooner at the 
expense of more CPU time and queue contention while iterating the backlog of 
messages.
* [You shouldn't need the check for null? Usually we "just" make sure its not 
null and skip the 
boilerplate.|https://github.com/apache/cassandra/pull/95/files#diff-a8a9935b164cd23da473fd45784fd1ddR1973]
* [Avoid unrelated whitespace changes.| 
https://github.com/apache/cassandra/pull/95/files#diff-a8a9935b164cd23da473fd45784fd1ddL2034]
*  [I still think it's a good idea to avoid hard coding this kind of value so 
operators have options without 
recompiling.|https://github.com/apache/cassandra/pull/95/files#diff-c7ef124561c4cde1c906f28ad3883a88R139]
* Fun fact. You don't need {{backlogNextExpirationTime}} to be volatile. You 
can piggyback on {{backlogExpirationActive}} to get the desired effects from 
the Java memory model. A store to {{backlogExpirationActive}} makes a prior 
stores (by the current thread) to {{backlogNextExpirationTime}} visible. A read 
of {{backlogExpirationActive}} would make prior stores to 
{{backlogNextExpirationTimeVisible}} by the last writer to 
{{backlogExpirationActive}} visible. The volatile read is close to free so I 
wouldn't change it just so it's not sensitive to the order the fields are 
accessed in.
* [Breaking out the uber bike shedding this could be 
maybeExpireMessages.|https://github.com/apache/cassandra/pull/95/files#diff-c7ef124561c4cde1c906f28ad3883a88R600]
* [Swap the order of these two stores so it doesn't do extra 
expirations.|https://github.com/apache/cassandra/pull/95/files#diff-c7ef124561c4cde1c906f28ad3883a88R636]
* [Using a boxed integer makes it a bit confusing because now everyone has to 
know how null is handled. What's the diff between null and 0? Better to let 0 
be disabled and not have 
null.|https://github.com/apache/cassandra/pull/95/files#diff-097eb77c5d9d80a48e7547fbb81822caR62]
* [This is not quite correct you can't count drainCount as dropped because some 
of the drained messages may have been sent during 
iteration.|https://github.com/apache/cassandra/pull/95/files#diff-c7ef124561c4cde1c906f28ad3883a88R270]

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writ

[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-13 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907493#comment-15907493
 ] 

Christian Esken commented on CASSANDRA-13265:
-

The change is now finished, including configuration, MBean and testing. I also 
tested an interval of 0 ms, which is close to what Cassandra does today.

Please have  a look. What is the recommended way of getting this into the 
official Cassandra repo?  A patch, or would someone with write access take it 
directly from Github? This also has impact on how the merge conflicts will be 
resolved, as there are now 3 files with conflicts e branch has now merge 
conflicts on 3 files according to https://github.com/apache/cassandra/pull/95. 
I did not want to rebase my branch without asking.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-10 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905409#comment-15905409
 ] 

Christian Esken commented on CASSANDRA-13265:
-

I committed the change containing the configuration, as I would like some 
feedback whether I am on the right path. Please note that I did not yet have 
time for tests (planned for next Monday), but I thought it is better to give a 
chance to review the current changes.

Please note that I had to add back "AtomicBoolean backlogExpirationActive" 
Otherwise I cannot guarantee that only a single Thread is iterating the Queue, 
especially if a small expiration interval (1ms, or 0ms) is configured. The 
"AtomicLong backlogNextExpirationTime" could now be "volatile long".

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-10 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904949#comment-15904949
 ] 

Christian Esken commented on CASSANDRA-13265:
-

I am nearly done with the configuration, and have two questions about it:

1.  How to handle the default value? My approach is to pre-configure the 
default value in Config:
{code}
public static final int otc_backlog_expiration_interval_in_ms_default = 200;
public volatile Integer otc_backlog_expiration_interval_in_ms = 
otc_backlog_expiration_interval_in_ms_default;
{code}


- Additionally I will handle null values, that might come in via MBean in the 
getter of DatabaseDescriptor:
{code}
public static Integer getOtcBacklogExpirationInterval()
{
Integer confValue = conf.otc_backlog_expiration_interval_in_ms;
return confValue != null ? confValue : 
Config.otc_backlog_expiration_interval_in_ms_default;
}
{code}

2. How to read the config value? I am seeing some Integer.getInteger(propName, 
defaultValue), but this looks strange to me. I think changes from JMX would not 
even be reflected. Thus I am calling the getter from above: 
{{DatabaseDescriptor.getOtcBacklogExpirationInterval()}}. Is thte latter OK?


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-08 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901495#comment-15901495
 ] 

Christian Esken commented on CASSANDRA-13265:
-

I pushed the changes noted in the former comment.
I am plannig to do "Configurabilty and default value" tomorrow.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-08 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901414#comment-15901414
 ] 

Christian Esken commented on CASSANDRA-13265:
-

I will update the status while working on the individual topics:
(x)This needs to be configurable from the YAML and via JMX. 
(/)It should include drained messages as well.
(/)Typo, "thus letting"
(/)Extra line break
(/)We don't do/allow author tags
(x)Use TimeUnit
(x)It isn't using the constant.
(x)Just in case maybe assert the droppable/non-droppable status of the 
verbs. Or does it not matter since the tests will fail anyways?



> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-08 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901047#comment-15901047
 ] 

Christian Esken commented on CASSANDRA-13265:
-

bq. It should include drained messages as well
OMG, right. I thought about that concluded it is correct to exclude. But 
obviously the messages are already drained from the queue, so they must be 
added.

bq. We don't do/allow author tags
OOPS, always this oversmart IDE :-)

bq. This needs to be configurable
Oh. Even more work. This is getting bigger than anticipated, but I am happy to 
do it. Thanks for the hints on how do do the Configuration. I will work on it 
later this day or tomorrow.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-07 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1595#comment-1595
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


Awesome.
* [This needs to be configurable from the YAML and via JMX. Add it to 
{{StorageProxyMBean}}/{{StorageProxy}}, store the value in {{Config}} then get 
it from {{DatabaseDescriptor}} it will need to be a volatile field in 
{{Config}}. Add a setter and a getter. I have the setter return the previous 
value, but a lot of the existing ones don't. 10 seconds is also way too high as 
a default. I propose 200 milliseconds. It's an improvement over today, but 
still aggressive at trying to free memory. Also use 
{{java.util.concurrent.TimeUnit}} to convert to the value you 
want.|https://github.com/apache/cassandra/pull/95/files#diff-c7ef124561c4cde1c906f28ad3883a88R134]
* [It should include drained messages as well. To be 100% correct you would 
have to count how many messages you have iterated over and sent and then 
subtract.|https://github.com/apache/cassandra/pull/95/files#diff-c7ef124561c4cde1c906f28ad3883a88R261]
* [Typo, "thus 
letting"|https://github.com/apache/cassandra/pull/95/files#diff-c7ef124561c4cde1c906f28ad3883a88R604]
* [Extra line 
break|https://github.com/apache/cassandra/pull/95/files#diff-c7ef124561c4cde1c906f28ad3883a88R625]
* [We don't do/allow author 
tags|https://github.com/apache/cassandra/pull/95/files#diff-c16fff43d2aafe61a4219656d1ab6f9eR26]
* [Use 
TimeUnit|https://github.com/apache/cassandra/pull/95/files#diff-c16fff43d2aafe61a4219656d1ab6f9eR36]
* [It isn't using the 
constant.|https://github.com/apache/cassandra/pull/95/files#diff-c16fff43d2aafe61a4219656d1ab6f9eR118]
* [Just in case maybe assert the droppable/non-droppable status of the verbs. 
Or does it not matter since the tests will fail 
anyways?|https://github.com/apache/cassandra/pull/95/files#diff-c16fff43d2aafe61a4219656d1ab6f9eR33]

I had a sad and unfortunate thought. We are going about expiring wrong by 
counting messages. We really want to expire based on the weight of the queue. 
Also I really really hate that bounded queues that aren't weighted are a thing. 
Let's not do that here though since I want to backport this to other versions.


> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-07 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899773#comment-15899773
 ] 

Christian Esken commented on CASSANDRA-13265:
-

I added three changes:
- Implemented unit test
- Count dropped messages if Cassandra cannot write to the socket 
- Fix the QueuedMessage.isTimedOut(), which was prone to a System.nanoTime() 
wrap bug, as it used a check of type: aNanos < bNanos

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-06 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898266#comment-15898266
 ] 

Jason Brown commented on CASSANDRA-13265:
-

bq. we were supposed to mark those as dropped

We probably should count them as dropped as we are dropping them, huh ;) fwiw, 
looks like it's always been implemented this way, since CASSANDRA-3005

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-06 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898249#comment-15898249
 ] 

Ariel Weisberg commented on CASSANDRA-13265:


I think you are correct that we were supposed to mark those as dropped. 
[~jasobrown] do you agree?

I think the approximate nanoTime/currenTimeMillis approach where a single 
thread periodically updates the time is reasonable. If you added nanoTime to 
ApproximateTime with it's own configuration I think it would be fine to use it 
in this context.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-06 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897641#comment-15897641
 ] 

Christian Esken commented on CASSANDRA-13265:
-

I have one question about a code fragment. When the socket is not available the 
backlog is cleared, but no drops are counted. Looks like an omission to me, or 
is it intentional? {{dropped.addAndGet(backlog.size(); )}} would be an 
approximation. We likely cannot get closer as {{backlog.clear();}} does not 
tell how much elements were removed.

{code}
if (qm.isTimedOut())
dropped.incrementAndGet();
else if (socket != null || connect())
writeConnected(qm, count == 1 && backlog.isEmpty());
else
{
// clear out the queue, else gossip messages back up.
drainedMessages.clear();
// dropped.addAndGet(backlog.size()); //  TODO Should 
dropped statistics be counted in this case?
backlog.clear();
break inner;
}
{code}

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-06 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897121#comment-15897121
 ] 

Christian Esken commented on CASSANDRA-13265:
-

I was already looking into doing a unit test it but it requires access to the 
queue which means making it package level access and using 
{{@VisibleForTesting}}. I will do that tomorrow, unless there are arguments 
against it.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-03 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894783#comment-15894783
 ] 

Jason Brown commented on CASSANDRA-13265:
-

I'm +1 on the ticket, but [~aweisberg] should also weigh in.

As a note, we have {{ApproximateTime}} for an estimated time, but that works 
off of {{System.currentTimeMillis()}} rather than {{System.nanoTime()}}, so 
monotonicity.

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

2017-03-03 Thread Christian Esken (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894000#comment-15894000
 ] 

Christian Esken commented on CASSANDRA-13265:
-

Your "nit" totally makes sense, as {{nanoTime()}} is expensive if called very 
often. Good catch, actually I should have seen that myself. I committed the 
change, including the required change to RetriedQueuedMessage. I also removed 
two warnings by adding the generic types.

As a side note, one could even use an estimated time here, like 
EstimatorTimeSource which works with a regularly updated cached time. I was 
able to boost the performance of Triava Cache by factor 2 - 10 depending on the 
test. An example is in 
https://github.com/trivago/triava/blob/master/src/examples/java/com/trivago/examples/TimeSourceExample.java,
 estimatorTimeSource().

> Expiration in OutboundTcpConnection can block the reader Thread
> ---
>
> Key: CASSANDRA-13265
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>Reporter: Christian Esken
>Assignee: Christian Esken
> Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate to the other nodes. This can happen at any time, during peak load 
> or low load. Restarting that single node from the cluster fixes the issue.
> Before going in to details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A Threaddump in this situation showed  324 Threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> amount of queued messages, it starts thrashing itself to death. Each of the 
> Thread fully locks the Queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operation it can progress with actually 
> writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and 
> fully lock the Queue
> This means: Writing blocks the Queue for reading, and readers might even be 
> starved which makes the situation even worse.
> -
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DC's
>  - high write throughput (10 INSERT statements per second and more during 
> peak times).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)