[ https://issues.apache.org/jira/browse/CASSANDRA-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paulo Motta updated CASSANDRA-10821:
------------------------------------
    Description:
We were writing to the DB from EC2 instances in us-east-1 at a rate of about 3000 writes per second, replication us-east:2 us-west:2, LeveledCompaction and DeflateCompressor.

After about 48 hours some nodes had over 800 pending compactions and a few of them started getting killed by the Linux OOM killer. Priam attempts to restart the nodes, but they fail because of corrupted saved_caches files.

Loading has finished, and the cluster is mostly idle, but 6 of the nodes were killed again last night by OOM.

This is the log message where the node won't restart:

ERROR [main] 2015-12-05 13:59:13,754 CassandraDaemon.java:635 - Detected unreadable sstables /media/ephemeral0/cassandra/saved_caches/KeyCache-ca.db, please check NEWS.txt and ensure that you have upgraded through all required intermediate versions, running upgradesstables

This is the dmesg where the node is terminated:

[360803.234422] Out of memory: Kill process 10809 (java) score 949 or sacrifice child
[360803.237544] Killed process 10809 (java) total-vm:438484092kB, anon-rss:29228012kB, file-rss:107576kB

This is what Compaction Stats look like currently:

pending tasks: 1096
                                  id   compaction type          keyspace      table    completed          total   unit   progress
93eb3200-9b58-11e5-b9f1-ffef1041ec45        Compaction   overlordpreprod   document   8670748796   839129219651   bytes      1.03%
                                            Compaction            system      hints           30     1921326518   bytes      0.00%
Active compaction remaining time :  27h33m47s

Only 6 of the 32 nodes have pending compactions, each on the order of 1000.

> OOM Killer terminates Cassandra when Compactions use too much memory then won't restart
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10821
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10821
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction
>         Environment: EC2 32 x i2.xlarge split between us-east-1a,c and us-west 2a,b
> Linux 4.1.10-17.31.amzn1.x86_64 #1 SMP Sat Oct 24 01:31:37 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
> Cassandra version: 2.2.3
>            Reporter: tbartold
>            Priority: Normal
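For readers who want to sanity-check the figures quoted in the report, here is a minimal Python sketch that converts the kB values in the OOM-killer line to GiB and recomputes the compaction progress from the completed and total byte counts. The input values are copied from the report; the function names (`oom_fields_gib`, `progress_percent`) are made up for this illustration and are not part of Cassandra, Priam, or any Linux tool.

```python
import re

# Values copied verbatim from the dmesg and compaction stats output in the report.
DMESG_LINE = ("[360803.237544] Killed process 10809 (java) "
              "total-vm:438484092kB, anon-rss:29228012kB, file-rss:107576kB")
COMPLETED_BYTES = 8670748796
TOTAL_BYTES = 839129219651

KIB_PER_GIB = 1024 * 1024  # the kernel reports these fields in kB (KiB)


def oom_fields_gib(line):
    """Return {field name: size in GiB} for each 'name:<n>kB' pair in the line."""
    return {name: int(kb) / KIB_PER_GIB
            for name, kb in re.findall(r"([\w-]+):(\d+)kB", line)}


def progress_percent(completed, total):
    """Percentage of a compaction finished, from completed and total bytes."""
    return 100.0 * completed / total


if __name__ == "__main__":
    for name, gib in oom_fields_gib(DMESG_LINE).items():
        print(f"{name}: {gib:.2f} GiB")
    print(f"progress: {progress_percent(COMPLETED_BYTES, TOTAL_BYTES):.2f}%")
```

Run against the values above, this puts the resident anonymous memory of the killed java process at roughly 27.9 GiB and reproduces the 1.03% progress shown for the `document` table compaction.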