[jira] [Comment Edited] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

Christian Esken (JIRA) Fri, 10 Mar 2017 03:46:33 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904949#comment-15904949
 ]


Christian Esken edited comment on CASSANDRA-13265 at 3/10/17 11:45 AM:
-----------------------------------------------------------------------

I am nearly done with the configuration, and have two questions about it:

1. How to handle the default value? My approach is to pre-configure the 
default value in Config:
{code}
    public static final int otc_backlog_expiration_interval_in_ms_default = 200;
    public volatile Integer otc_backlog_expiration_interval_in_ms = otc_backlog_expiration_interval_in_ms_default;
{code}

Additionally, I will handle null values, which might come in via the MBean, in 
the getter of DatabaseDescriptor:
{code}
    public static Integer getOtcBacklogExpirationInterval()
    {
        Integer confValue = conf.otc_backlog_expiration_interval_in_ms;
        return confValue != null ? confValue : Config.otc_backlog_expiration_interval_in_ms_default;
    }
{code}

2. How to read the config value? I am seeing some uses of 
Integer.getInteger(propName, defaultValue), but this looks strange to me. I 
think changes made via JMX would not even be reflected. Thus I am calling the 
getter from above: {{DatabaseDescriptor.getOtcBacklogExpirationInterval()}}. 
Is the latter OK?
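
To illustrate the concern in question 2, here is a minimal, self-contained sketch of the two read paths. It is not the Cassandra code: the class name, the property name and the setter are hypothetical stand-ins, and the volatile field only mimics the Config field shown above. Integer.getInteger(name, default) reads a JVM system property, so a value changed at runtime through a setter (as an MBean would do) is only visible through the getter.

```java
// Hypothetical stand-in for Config/DatabaseDescriptor; not the Cassandra classes.
class ExpirationIntervalDemo
{
    // Assumed property name, for illustration only.
    static final String PROP = "demo.otc_backlog_expiration_interval_in_ms";
    static final int DEFAULT_MS = 200;

    // Mimics the volatile Integer field in Config.
    static volatile Integer intervalMs = DEFAULT_MS;

    // What a JMX/MBean setter would end up doing (assumption).
    static void setInterval(Integer value)
    {
        intervalMs = value;
    }

    // Getter with null handling, as in the comment above.
    static Integer getInterval()
    {
        Integer value = intervalMs;
        return value != null ? value : DEFAULT_MS;
    }

    // Reads the JVM system property; unaffected by setInterval().
    static int getIntervalFromSystemProperty()
    {
        return Integer.getInteger(PROP, DEFAULT_MS);
    }

    public static void main(String[] args)
    {
        setInterval(500);
        System.out.println("getter:          " + getInterval());
        System.out.println("system property: " + getIntervalFromSystemProperty());
        setInterval(null);
        System.out.println("getter (null):   " + getInterval());
    }
}
```

In this sketch the system-property path keeps returning the default (unless the property was passed with -D at startup), while the getter follows the runtime change and falls back to the default on null.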




> Expiration in OutboundTcpConnection can block the reader Thread
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-13265
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 
> 1.8.0_112-b15)
> Linux 3.16
>            Reporter: Christian Esken
>            Assignee: Christian Esken
>         Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, 
> cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to 
> communicate with the other nodes. This can happen at any time, during peak 
> load or low load. Restarting that single node fixes the issue.
> Before going into details, I want to state that I have analyzed the 
> situation and am already developing a possible fix. Here is the analysis so 
> far:
> - A thread dump in this situation showed 324 threads in the 
> OutboundTcpConnection class that want to lock the backlog queue for doing 
> expiration.
> - A class histogram shows 262508 instances of 
> OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain 
> number of queued messages, it starts thrashing itself to death. Each of the 
> threads fully locks the queue for reading and writing by calling 
> iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operations can it progress with actually 
> writing to the queue.
> - Reading: Is also blocked, as 324 threads try to do iterator.next() and 
> fully lock the queue.
> This means: Writing blocks the queue for reading, and readers might even be 
> starved, which makes the situation even worse.
> -----
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DCs
>  - high write throughput (100000 INSERT statements per second and more during 
> peak times).
>  
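
The idea of limiting how often expiration may walk the backlog can be sketched as follows. This is an illustration under assumptions, not the patch for this issue: GatedBacklog, its QueuedMessage class and the method names are hypothetical, and a LinkedBlockingQueue is used as a stand-in for the backlog queue. At most one thread per interval wins the compare-and-set and iterates the queue; all other callers return immediately without touching the queue's locks.

```java
import java.util.Iterator;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of interval-gated backlog expiration.
class GatedBacklog
{
    // Simplified stand-in for a queued message with an expiration time.
    static final class QueuedMessage
    {
        final long expiresAtNanos;

        QueuedMessage(long expiresAtNanos)
        {
            this.expiresAtNanos = expiresAtNanos;
        }

        boolean isExpired(long nowNanos)
        {
            return nowNanos - expiresAtNanos >= 0;
        }
    }

    private final LinkedBlockingQueue<QueuedMessage> backlog = new LinkedBlockingQueue<>();
    private final long intervalNanos;
    private final AtomicLong nextExpirationNanos;

    GatedBacklog(long intervalNanos, long startNanos)
    {
        this.intervalNanos = intervalNanos;
        this.nextExpirationNanos = new AtomicLong(startNanos);
    }

    void enqueue(QueuedMessage message)
    {
        backlog.add(message);
    }

    int size()
    {
        return backlog.size();
    }

    // Returns the number of dropped messages, or -1 if this call skipped expiration.
    int expireIfDue(long nowNanos)
    {
        long next = nextExpirationNanos.get();
        if (nowNanos - next < 0)
            return -1; // interval has not elapsed yet
        if (!nextExpirationNanos.compareAndSet(next, nowNanos + intervalNanos))
            return -1; // another thread claimed this expiration pass

        int dropped = 0;
        for (Iterator<QueuedMessage> it = backlog.iterator(); it.hasNext();)
        {
            if (it.next().isExpired(nowNanos))
            {
                it.remove();
                dropped++;
            }
        }
        return dropped;
    }
}
```

A caller would pass System.nanoTime() as nowNanos; taking the time as a parameter only keeps the sketch easy to test.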



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
