[ https://issues.apache.org/jira/browse/CASSANDRA-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-5218:
--------------------------------------
    Priority: Minor  (was: Major)

> Log explosion when another cluster node is down and remaining node is overloaded.
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5218
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5218
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.7
>            Reporter: Sergey Olefir
>            Priority: Minor
>
> I have a Cassandra 1.1.7 cluster with 4 nodes in 2 datacenters (2+2).
> Replication is configured as DC1:2,DC2:2 (i.e. every node holds the entire data).
> I am load-testing counter increments at a rate of about 10k per second. All
> writes are directed to the two nodes in DC1 (the DC2 nodes are basically a backup).
> In total there are 100 separate clients executing 1-2 batch updates per second.
> We wanted to test what happens if one node goes down, so we brought down one node
> in DC1 (i.e. a node that was handling half of the incoming writes).
> This led to a complete explosion of logs on the remaining live node in DC1.
> There are hundreds of megabytes of logs within an hour, all basically saying
> the same thing:
>
> ERROR [ReplicateOnWriteStage:5653390] 2013-01-22 12:44:33,611 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[ReplicateOnWriteStage:5653390,5,main]
> java.lang.RuntimeException: java.util.concurrent.TimeoutException
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1275)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.util.concurrent.TimeoutException
>         at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:311)
>         at org.apache.cassandra.service.StorageProxy$7$1.runMayThrow(StorageProxy.java:585)
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1271)
>         ... 3 more
>
> The logs are completely swamped with this and are thus unusable. It may also
> negatively impact node performance.
>
> According to Aaron Morton:
> {quote}The error is the coordinator node protecting itself.
> Basically it cannot handle the volume of local writes + the writes for HH.
> The number of in-flight hints is greater than…
> private static volatile int maxHintsInProgress = 1024 * Runtime.getRuntime().availableProcessors();{quote}
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/node-down-log-explosion-tp7584932p7584957.html
>
> I think there are two issues here:
> (a) the same exception occurring for the same reason doesn't need to be
> spammed into the log many times per second;
> (b) the exception message ought to be clearer about the cause -- i.e. in this case
> some message about "overload" or "load shedding" might be appropriate.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
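Issue (a) amounts to per-message log throttling: emit a recurring error once per time window together with a count of suppressed repeats, instead of once per occurrence. The sketch below is a hypothetical standalone helper (`RateLimitedLogger` is not Cassandra code, and its names are assumptions for illustration), not the fix that was eventually committed:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: throttle identical error messages so a repeating
// failure is logged once per window, plus a suppressed-occurrence count,
// rather than many times per second.
public class RateLimitedLogger {
    private final long windowNanos;
    // nanosecond timestamp at which each message was last actually emitted
    private final ConcurrentHashMap<String, AtomicLong> lastEmitted = new ConcurrentHashMap<>();
    // occurrences suppressed since each message was last emitted
    private final ConcurrentHashMap<String, AtomicLong> suppressed = new ConcurrentHashMap<>();

    public RateLimitedLogger(long windowMillis) {
        this.windowNanos = windowMillis * 1_000_000L;
    }

    // Returns true (and prints the line) only if the message is due now.
    public boolean shouldLog(String message, long nowNanos) {
        AtomicLong last = lastEmitted.computeIfAbsent(message, k -> new AtomicLong(-windowNanos));
        long prev = last.get();
        if (nowNanos - prev >= windowNanos && last.compareAndSet(prev, nowNanos)) {
            long skipped = suppressed.computeIfAbsent(message, k -> new AtomicLong()).getAndSet(0);
            System.out.println(skipped > 0
                ? message + " (" + skipped + " similar errors suppressed)"
                : message);
            return true;
        }
        suppressed.computeIfAbsent(message, k -> new AtomicLong()).incrementAndGet();
        return false;
    }

    public static void main(String[] args) {
        RateLimitedLogger log = new RateLimitedLogger(1000); // at most one line per second per message
        int emitted = 0;
        // simulate 10,000 identical TimeoutExceptions arriving within one window
        for (int i = 0; i < 10_000; i++)
            if (log.shouldLog("TimeoutException in ReplicateOnWriteStage", i))
                emitted++;
        System.out.println("emitted=" + emitted); // prints "emitted=1"; the other 9,999 are suppressed
    }
}
```

Issue (b) would then be a one-line change at the throttled call site: when the failure comes from the `maxHintsInProgress` cap, log an explicit "dropping writes, node overloaded" style message rather than a bare `TimeoutException`.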