[ 
https://issues.apache.org/jira/browse/CASSANDRA-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581188#comment-14581188
 ] 

Benedict commented on CASSANDRA-9573:
-------------------------------------

I suspect we are catching the mmapped files via mlockall. We now map them on 
construction (instead of re-mapping them repeatedly for each reader), but our 
preflight checks appear to load the system tables prior to mlockall, so all 
system sstables present during startup would be forced to remain resident.
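
To make the ordering concern concrete: mlockall(MCL_CURRENT) pins whatever is mapped into the process at the moment it is called, so a mapping created in a constructor during startup would be included if construction happens first. Below is a minimal sketch of the "map once on construction" pattern, written in Python for brevity rather than Cassandra's Java. MappedFile and demo are hypothetical names for illustration and are not taken from the Cassandra code base.

```python
import mmap
import os
import tempfile


class MappedFile:
    """Hypothetical holder that maps a file once, when it is constructed."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # Single read-only mapping that lives as long as this object.
        # Anything mapped at this point already exists if the process
        # later locks its current mappings into memory.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def reader(self, offset, length):
        # Readers get a zero-copy view of the shared mapping
        # instead of creating a fresh mapping each time.
        return memoryview(self._map)[offset:offset + length]

    def close(self):
        self._map.close()
        self._file.close()


def demo():
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(b"0123456789")
        mapped = MappedFile(path)
        first = mapped.reader(0, 4)
        second = mapped.reader(4, 6)
        result = (bytes(first), bytes(second))
        # Views have to be released before the mapping can be closed.
        first.release()
        second.release()
        mapped.close()
        return result
    finally:
        os.remove(path)
```

Calling demo() returns the two slices read through the same mapping. The sketch does not call mlockall itself; it only shows why a file mapped at construction time is already present when a later lock call runs.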

It's proving difficult to reproduce locally for reasons unrelated to this 
ticket. Could you try running your script again with JNA disabled, and see if 
it behaves?

> OOM when loading compressed sstables (system.hints)
> ---------------------------------------------------
>
>                 Key: CASSANDRA-9573
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9573
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Alan Boudreault
>            Assignee: Benedict
>            Priority: Critical
>             Fix For: 2.2.0 rc2
>
>         Attachments: hs_err_pid11243.log, 
> java-hints-issue-2015-06-09.snapshot, system.log, yourkit.ss.tar.gz
>
>
> [~andrew.tolbert] discovered an issue while running endurance tests on 2.2. A 
> node was not able to start and was killed by the OOM killer.
> Briefly, Cassandra uses an excessive amount of memory when loading compressed 
> sstables (off-heap?). We initially saw the issue with system.hints 
> before knowing it was related to compression. system.hints uses lz4 
> compression by default. If we have an sstable of, say, 8-10G, Cassandra will be 
> killed by the OOM killer after 1-2 minutes. I can reproduce the bug 
> every time locally. 
> * the issue also happens if we have 10G of data split into 13MB sstables.
> * I can reproduce the issue if I put a lot of data in the system.hints table.
> * I cannot reproduce the issue with a standard table using the same 
> compression (LZ4). Something seems to be different when it's hints?
> You won't see anything in the node's system.log, but you'll see this in 
> /var/log/syslog.log:
> {code}
> Out of memory: Kill process 30777 (java) score 600 or sacrifice child
> {code}
> The issue has been introduced in this commit but is not related to the 
> performance issue in CASSANDRA-9240: 
> https://github.com/apache/cassandra/commit/aedce5fc6ba46ca734e91190cfaaeb23ba47a846
> Here is the core dump and some yourkit snapshots in attachments. I am not 
> sure you will be able to get useful information from them.
> core dump: http://dl.alanb.ca/core.tar.gz
> Not sure if this is related, but all dumps and snapshots point to 
> EstimatedHistogramReservoir ... and we can see many 
> javax.management.InstanceAlreadyExistsException: 
> org.apache.cassandra.metrics:... exceptions in system.log before it hangs 
> and then crashes.
> To reproduce the issue: 
> 1. create a cluster of 3 nodes
> 2. start the whole cluster
> 3. shut down node2 and node3
> 4. write 10-15G of data on node1 with replication factor 3. You should see a 
> lot of hints.
> 5. stop node1
> 6. start node2 and node3
> 7. start node1, you should OOM.
> //cc [~tjake] [~benedict] [~andrew.tolbert]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
