[ https://issues.apache.org/jira/browse/CASSANDRA-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-5218:
--------------------------------------

    Priority: Minor  (was: Major)

> Log explosion when another cluster node is down and remaining node is 
> overloaded.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5218
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5218
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.7
>            Reporter: Sergey Olefir
>            Priority: Minor
>
> I have a Cassandra 1.1.7 cluster with 4 nodes in 2 datacenters (2+2).
> Replication is configured as DC1:2,DC2:2 (i.e. every node holds the entire
> data set).
> I am load-testing counter increments at a rate of about 10k per second. All
> writes are directed to the two nodes in DC1 (the DC2 nodes are basically
> backup). In total there are 100 separate clients, each executing 1-2 batch
> updates per second.
> We wanted to test what happens when one node goes down, so we brought down
> one node in DC1 (i.e. the node that was handling half of the incoming writes).
> This led to a complete explosion of logs on the remaining live node in DC1.
> There are hundreds of megabytes of logs within an hour, all basically saying
> the same thing:
> ERROR [ReplicateOnWriteStage:5653390] 2013-01-22 12:44:33,611 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[ReplicateOnWriteStage:5653390,5,main]
> java.lang.RuntimeException: java.util.concurrent.TimeoutException
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1275)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.util.concurrent.TimeoutException
>         at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:311)
>         at org.apache.cassandra.service.StorageProxy$7$1.runMayThrow(StorageProxy.java:585)
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1271)
>         ... 3 more
> The logs are completely swamped with this and are thus unusable. It may also
> negatively impact node performance.
> According to Aaron Morton:
> {quote}The error is the coordinator node protecting itself.
> Basically it cannot handle the volume of local writes + the writes for HH.
> The number of in flight hints is greater than…
>     private static volatile int maxHintsInProgress = 1024 * Runtime.getRuntime().availableProcessors();{quote}
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/node-down-log-explosion-tp7584932p7584957.html
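> For context, the back-pressure Aaron describes works roughly like the sketch
> below. This is a simplified illustration only, not the actual StorageProxy
> code; the class and method names here are assumptions made for readability.
> {code}
> import java.util.concurrent.TimeoutException;
> import java.util.concurrent.atomic.AtomicInteger;
> 
> // Simplified sketch of the coordinator's hint back-pressure check
> // (illustrative; not the real StorageProxy implementation).
> public class HintBackPressureSketch
> {
>     // Cap on hints allowed in flight at once, as quoted above.
>     private static volatile int maxHintsInProgress =
>             1024 * Runtime.getRuntime().availableProcessors();
> 
>     // Hints currently queued for down replicas.
>     private static final AtomicInteger hintsInProgress = new AtomicInteger();
> 
>     public static void writeHint(Runnable storeHint) throws TimeoutException
>     {
>         // Once the cap is exceeded, the coordinator refuses to buffer more
>         // hints and drops the write; this surfaces as the TimeoutException
>         // wrapped in the RuntimeException shown in the log above.
>         if (hintsInProgress.get() > maxHintsInProgress)
>             throw new TimeoutException();
> 
>         hintsInProgress.incrementAndGet();
>         try
>         {
>             storeHint.run(); // stand-in for actually persisting the hint
>         }
>         finally
>         {
>             hintsInProgress.decrementAndGet();
>         }
>     }
> }
> {code}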
> I think there are two issues here:
> (a) the same exception occurring for the same reason does not need to be
> spammed into the log many times per second;
> (b) the exception message ought to be clearer about the cause -- i.e. in this
> case some message about "overload" or "load shedding" would be appropriate
> (see the sketch below).
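> As a minimal sketch of what (a) and (b) could look like -- hypothetical code,
> not anything that exists in Cassandra today -- the coordinator could emit one
> explicit, rate-limited "load shedding" message per interval instead of a
> stack trace per dropped write:
> {code}
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicLong;
> 
> // Hypothetical rate-limited overload logger: one clear message per interval
> // summarising how many writes were shed, instead of one stack trace each.
> public class OverloadLogLimiter
> {
>     private static final long INTERVAL_NANOS = TimeUnit.SECONDS.toNanos(10);
>     private final AtomicLong lastLogged =
>             new AtomicLong(System.nanoTime() - INTERVAL_NANOS);
>     private final AtomicLong dropped = new AtomicLong();
> 
>     public void onDroppedWrite()
>     {
>         dropped.incrementAndGet();
>         long now = System.nanoTime();
>         long last = lastLogged.get();
>         if (now - last >= INTERVAL_NANOS && lastLogged.compareAndSet(last, now))
>         {
>             long count = dropped.getAndSet(0);
>             System.err.println("Coordinator overloaded, shedding load: dropped "
>                     + count + " hinted writes in the last 10s"
>                     + " (hints in flight exceed maxHintsInProgress)");
>         }
>     }
> }
> {code}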



--
This message was sent by Atlassian JIRA
(v6.1#6144)
