[ 
https://issues.apache.org/jira/browse/CASSANDRA-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-10821:
------------------------------------
    Description: 
 

We were writing to the DB from EC2 instances in us-east-1 at a rate of about 
3000 writes per second, with replication us-east:2 us-west:2, LeveledCompaction 
and DeflateCompressor.

After about 48 hours some nodes had over 800 pending compactions and a few of 
them started getting killed by the Linux OOM killer. Priam attempts to restart 
the nodes, but they fail because of corrupted saved_caches files.

Loading has finished, and the cluster is mostly idle, but 6 of the nodes were 
killed again last night by OOM.

This is the log message from a node that fails to restart:

ERROR [main] 2015-12-05 13:59:13,754 CassandraDaemon.java:635 - Detected 
unreadable sstables /media/ephemeral0/cassandra/saved_caches/KeyCache-ca.db, 
please check NEWS.txt and ensure that you have upgraded through all required 
intermediate versions, running upgradesstables

This is the dmesg where the node is terminated:

[360803.234422] Out of memory: Kill process 10809 (java) score 949 or sacrifice 
child
[360803.237544] Killed process 10809 (java) total-vm:438484092kB, 
anon-rss:29228012kB, file-rss:107576kB
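
To put the dmesg figures above in perspective, here is a minimal sketch that converts them to GiB. It assumes the kernel's "kB" values are units of 1024 bytes; the function name oom_memory_gib is hypothetical and the two log lines are simply rejoined from the wrapped text above.

```python
import re

# The two dmesg lines from the report, rejoined into single lines.
DMESG = """\
[360803.234422] Out of memory: Kill process 10809 (java) score 949 or sacrifice child
[360803.237544] Killed process 10809 (java) total-vm:438484092kB, anon-rss:29228012kB, file-rss:107576kB
"""

# Assumption: the kernel's "kB" figures are KiB (1024 bytes).
KIB_PER_GIB = 1024 * 1024


def oom_memory_gib(text):
    """Return {field: GiB} for every 'name:<digits>kB' field in the text."""
    fields = re.findall(r"([a-z-]+):(\d+)kB", text)
    return {name: int(kib) / KIB_PER_GIB for name, kib in fields}


usage = oom_memory_gib(DMESG)
for name, gib in usage.items():
    print(f"{name}: {gib:.2f} GiB")
```

Under that assumption the resident anonymous memory (anon-rss) comes to roughly 27.9 GiB. An i2.xlarge is specified with about 30.5 GiB of RAM, so the Java process appears to have held nearly all physical memory on the instance when it was killed.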

This is what Compaction Stats look like currently:

pending tasks: 1096
id                                     compaction type   keyspace          table      completed    total          unit    progress
93eb3200-9b58-11e5-b9f1-ffef1041ec45   Compaction        overlordpreprod   document   8670748796   839129219651   bytes   1.03%
                                       Compaction        system            hints      30           1921326518     bytes   0.00%
Active compaction remaining time : 27h33m47s

Only 6 of the 32 nodes have pending compactions, each on the order of 1000.


> OOM Killer terminates Cassandra when Compactions use too much memory then 
> won't restart
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10821
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10821
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction
>         Environment: EC2 32 x i2.xlarge split between us-east-1a,c and 
> us-west-2a,b
> Linux  4.1.10-17.31.amzn1.x86_64 #1 SMP Sat Oct 24 01:31:37 UTC 2015 x86_64 
> x86_64 x86_64 GNU/Linux
> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
> Cassandra version: 2.2.3
>            Reporter: tbartold
>            Priority: Normal



