Delaney Manders created CASSANDRA-4225:
------------------------------------------

             Summary: EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI
                 Key: CASSANDRA-4225
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4225
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.1.0
         Environment: Amazon Linux AMI release 2012.03
                      3.2.12-3.2.4.amzn1.x86_64
                      m1.xlarge

Nodes have:
Cassandra built and installed from source.
Ant binary (apache-ant-1.8.3-bin.tar.gz), automake (1.11.1), autoconf (2.64), libtool (2.2.10) installed from the AWS repository.

Sun Java:
> java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

Only system changes are:
    echo "root soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
    echo "root hard memlock unlimited" | sudo tee -a /etc/security/limits.conf

Setup scripts available.

The Cassandra cluster has two datacenters: DC1 with 8 nodes and DC2 with 4, DC2 being reserved for Hadoop jobs. DC2 nodes have not had the same frequency of hard crashes, though it has happened there too.

Storage is set up with 4 ephemeral drives in RAID for the commit log and 4 EBS drives in RAID for data.

Usage is exclusively write, with all mutations done as batch mutations, where each batch mutation adds or modifies a set of columns on a single key. There are ~2000 threads streaming batch mutations of varying size from a web edge, distributed across DC1. Client is Hector (1.0-5) w/ DynamicLoadBalancing.

In an effort to mitigate this issue, I've removed jna.jar & platform.jar from $CASSANDRA_HOME/lib, and set disk_access_mode: standard in $CASSANDRA_HOME/conf/cassandra.yaml. Neither has seemed to help.

            Reporter: Delaney Manders

At fairly random intervals, about once/day, one of my Cassandra nodes does a hard crash (kernel panic). I can find no errors in the system logs (/var/log/*), and no errors in the Cassandra logs.

On one machine I was watching as it went down, and caught the following message:

> Message from syslogd@domU-12-31-38-00-64-31 at May 3 18:24:17 ...
> kernel:[252906.019808] Oops: 0002 [#1] SMP

An AWS support guy found one entry in the console logs:

> [30178.298308] Pid: 2238, comm: java Not tainted 3.2.12-3.2.4.amzn1.x86_64 #1

I've replaced two of the nodes with new instances, but all are showing the same behaviour. It's very reproducible on my system, though it takes a little waiting. Leaving it running for another day or so is no big deal; I just need to restart Cassandra whenever I get alerted.

I'm open to any additional requested debugging steps before bailing and going back to 1.0.9.
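
For anyone collecting more evidence on a crash like this, the two kernel lines quoted in this report can be pulled out of saved console output with a small script. The following is a minimal Python sketch and not part of the original report. The function name scan_console_log and the regular expressions are illustrative assumptions, modelled only on the two lines shown here. Other kernel versions may format these lines differently.

```python
import re

# Assumed patterns, derived only from the two kernel lines quoted in this report.
OOPS_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+Oops:\s+(?P<code>[0-9a-fA-F]+)\s+\[#(?P<count>\d+)\]"
)
PID_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+Pid:\s+(?P<pid>\d+),\s+comm:\s+(?P<comm>\S+)"
    r"\s+(?P<taint>Not tainted|Tainted:\s+\S+)\s+(?P<kernel>\S+)"
)


def scan_console_log(text):
    """Return a list of dicts describing Oops and Pid lines found in text.

    Hypothetical helper: it only recognises the two line shapes above.
    """
    events = []
    for line in text.splitlines():
        match = OOPS_RE.search(line)
        if match:
            events.append({
                "kind": "oops",
                "uptime": float(match.group("uptime")),  # seconds since boot
                "code": match.group("code"),             # Oops error code, as text
            })
            continue
        match = PID_RE.search(line)
        if match:
            events.append({
                "kind": "process",
                "uptime": float(match.group("uptime")),
                "pid": int(match.group("pid")),
                "comm": match.group("comm"),             # name of the faulting process
                "kernel": match.group("kernel"),
            })
    return events


if __name__ == "__main__":
    import sys
    # Usage: python scan_console_log.py saved-console-output.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for event in scan_console_log(handle.read()):
            print(event)
```

Run against console output saved from each crashed instance, this gives a quick way to check whether every crash names the same process (java here) and the same kernel build. It reads only the log text and makes no claim about the cause of the panic.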