[ https://issues.apache.org/jira/browse/CASSANDRA-14355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992973#comment-16992973 ]
Chris Kistner commented on CASSANDRA-14355:
-------------------------------------------

[~benedict], thank you for the feedback.

After some extensive investigation on our end, we traced our issue to Cassandra Reaper terminating the Cassandra repair sessions if they don't finish within 30 minutes (the "hangingRepairTimeoutMins" default). Cassandra 3.11.4 then does not close those repair threads, so they retain ~200MB of data on heap in our case.

The other annoying part when this happens is that both Cassandra and Cassandra Reaper log the repair session as if it were successful, unless you go and look at the logs in detail.

I think I should rather create a new bug for Cassandra not closing the Repair session thread correctly, since in the original issue the "io.netty.util.concurrent.FastThreadLocalThread" class was referenced from "Native-Transport-Requests" and "ReadStage" threads, whereas ours were all from "Repair" threads.

We might be able to reproduce the Cassandra issue where it does not close Repair threads properly on 3.11.4, and then try the same on 3.11.5. So far we've only experienced these OOME issues in client production environments, and right now all our clients are in code/change freeze, so we won't be able to test 3.11.5 in a large production environment for about a month.

So for now we'll just schedule per-host and per-table repairs using cron scripts, doing a different host each day of the week, instead of using Cassandra Reaper, which might terminate our repair sessions.
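For anyone wanting to try the same workaround, here is a rough sketch of how a daily cron job could pick one host per weekday and build one "nodetool repair" command per table. The host names, keyspace, and table names are hypothetical placeholders, not our real setup, and the script only prints the commands (a dry run) rather than executing them. Selecting by weekday only covers every host if there are at most seven of them.

```python
import datetime

# Hypothetical inventory: replace with your own hosts, keyspace, and tables.
HOSTS = ["cass-node-1", "cass-node-2", "cass-node-3", "cass-node-4",
         "cass-node-5", "cass-node-6", "cass-node-7"]
KEYSPACE = "my_keyspace"
TABLES = ["table_a", "table_b"]


def host_for_day(day, hosts=HOSTS):
    """Pick the host to repair on the given date (Monday maps to hosts[0]).

    With more than seven hosts, the ones past index 6 are never selected.
    """
    return hosts[day.weekday() % len(hosts)]


def repair_commands(day, hosts=HOSTS, keyspace=KEYSPACE, tables=TABLES):
    """Build one 'nodetool repair' command per table for that day's host."""
    host = host_for_day(day, hosts)
    return [["nodetool", "-h", host, "repair", keyspace, table]
            for table in tables]


if __name__ == "__main__":
    # Dry run: print today's commands; a real cron job would execute
    # them one at a time so that only one repair runs per host.
    for cmd in repair_commands(datetime.date.today()):
        print(" ".join(cmd))
```

Running the tables sequentially keeps each repair session small, which is the point of avoiding a single long session that an external tool might cut off.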
> Memory leak
> -----------
>
>                 Key: CASSANDRA-14355
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14355
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Core
>         Environment: Debian Jessie, OpenJDK 1.8.0_151
>            Reporter: Eric Evans
>            Priority: Normal
>             Fix For: 3.11.x
>
>         Attachments: 01_Screenshot from 2018-04-04 14-24-00.png,
> 02_Screenshot from 2018-04-04 14-28-33.png, 03_Screenshot from 2018-04-04
> 14-24-50.png, LongGC_Dominator-Tree.png, LongGC_Histogram.png,
> LongGC_Problem-Suspect-1_FastThreadLocalThread.png, LongGC_nodetool_info.txt
>
>
> We're seeing regular, frequent {{OutOfMemoryError}} exceptions. Similar to
> CASSANDRA-13754, an analysis of the heap dumps shows the heap consumed by the
> {{threadLocals}} member of the instances of
> {{io.netty.util.concurrent.FastThreadLocalThread}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org