Hi Vincent,

Here are a few pointers for disabling swap:
- https://docs.datastax.com/en/cassandra/2.0/cassandra/install/installRecommendSettings.html
- http://stackoverflow.com/questions/22988824/why-swap-needs-to-be-turned-off-in-datastax-cassandra
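For reference, on most Linux distributions this boils down to something like the commands below. It's only a rough sketch of what the DataStax guide describes: review /etc/fstab by hand rather than trusting the sed one-liner, and the sysctl file name is arbitrary.

    # Turn off all active swap devices right away
    sudo swapoff --all

    # Make it permanent by commenting out the swap entries in /etc/fstab
    # (illustration only -- check the file manually before editing it)
    sudo sed -i.bak '/swap/ s/^/#/' /etc/fstab

    # If you must keep a swap partition around for other reasons, at least
    # tell the kernel to avoid using it as much as possible
    echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/99-cassandra.conf
    sudo sysctl -p /etc/sysctl.d/99-cassandra.conf

Once that's done, free -m should report 0 swap, and the kernel can no longer stall the JVM by paging parts of the heap out to disk.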
Tombstones are definitely the kind of object that can clutter your heap, lead to frequent GC pauses, and be part of why you run into OOMs from time to time. I can't say for sure though, as it is a bit more complex than that in practice. Your GC pauses are not crazy high, although a 5s pause should not happen on a healthy cluster.

Getting back to big partitions: I've had a case in production where a multi-GB partition filled a 26GB G1 heap while being compacted. Eventually the old gen took all the available space in the heap, leaving no room for the young gen, but it never actually OOMed. To be honest, I would have preferred an OOM to the 50s GC pauses we had, because such a slow node can (and did) affect the whole cluster.

I think you may have a combination of things happening here, and you should work on improving them all:

- Spot precisely which partitions are big, to understand why you have them (data modeling issue or misbehaving data source): look for "large partition" warnings in the Cassandra logs, they will give you the partition key.
- Try to reduce the number of tombstones you're reading, by changing your queries or data model, or by setting up a more aggressive tombstone pruning strategy:
  http://cassandra.apache.org/doc/latest/operating/compaction.html?highlight=unchecked_tombstone_compaction#common-options
  You could benefit from setting unchecked_tombstone_compaction to true and tuning both tombstone_threshold and tombstone_compaction_interval.
- Follow the recommended production settings and fully disable swap on your Cassandra nodes.

You might also want to scale down from the 20GB heap, as the OOM killer will stop your process either way, and a smaller heap might let you get an analyzable heap dump. Such a heap dump could tell us whether there are lots of tombstones on the heap when the JVM dies.
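To make this concrete, here is roughly what the first two items and the heap dump check look like from a shell on one of the nodes. This is only a sketch: foo.install_info is taken from the warnings you pasted below, the threshold values are starting points to tune rather than recommendations, the config path assumes a package install, and the exact wording of the large partition warning differs between Cassandra versions.

    # 1. Find the partition keys behind the large partition warnings
    #    (the message wording varies across versions, hence the loose pattern)
    grep -iE 'large (partition|row)' /var/log/cassandra/system.log

    # 2. Make tombstone compactions more aggressive on the noisy table
    cqlsh -e "ALTER TABLE foo.install_info
              WITH compaction = {
                'class': 'SizeTieredCompactionStrategy',
                'unchecked_tombstone_compaction': 'true',
                'tombstone_threshold': '0.1',
                'tombstone_compaction_interval': '86400'
              };"

    # 3. Check that a heap dump will actually be written on an OOM
    #    (stock cassandra-env.sh normally sets this flag already)
    grep -n 'HeapDumpOnOutOfMemoryError' /etc/cassandra/cassandra-env.sh

Keep in mind that ALTER TABLE ... WITH compaction replaces the whole compaction map, so the class and any other sub-option you rely on must be repeated in the statement. tombstone_threshold defaults to 0.2 and tombstone_compaction_interval to one day, so the values above only make pruning slightly more eager.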
I hope that's helpful, as there is no easy answer here and the problem should be narrowed down by fixing all the potential causes.

Cheers,

On Mon, Nov 21, 2016 at 5:10 PM Vincent Rischmann <m...@vrischmann.me> wrote:

> Thanks for your answer Alexander.
>
> We're writing constantly to the table, we estimate it's something like
> 1.5k to 2k writes per second. Some of these requests update a bunch of
> fields, some update fields + append something to a set.
> We don't read constantly from it, but when we do it's a lot of reads, up
> to 20k reads per second sometimes.
> For this particular keyspace everything is using the size tiered
> compaction strategy.
>
> - Every node is a physical server with an 8-core CPU, 32GB of RAM and
>   3TB of SSD.
> - Java version is 1.8.0_101 for all nodes except one which is using
>   1.8.0_111 (only for about a week I think, before that it used
>   1.8.0_101 too).
> - We're using the G1 GC. I looked at the 19th and on that day we had:
>   - 1505 GCs
>   - 2 old gen GCs which took around 5s each
>   - the rest were new gen GCs, with only 1 other over 1s. There were 15
>     to 20 GCs which took between 0.4 and 0.7s. The rest is between 250ms
>     and 400ms approximately.
>   Sometimes there are 3/4/5 GCs in a row in like 2 seconds, each taking
>   between 250ms and 400ms, but it's kinda rare from what I can see.
> - Regarding GC logs, I have them enabled, I've got a bunch of gc.log.X
>   files in /var/log/cassandra, but somehow I can't find any log files
>   for certain periods. On one node which crashed this morning I lost
>   like a week of GC logs, no idea what is happening there...
> - I'll just put a couple of warnings here, there are around 9k just for
>   today.
>
> WARN [SharedPool-Worker-8] 2016-11-21 17:02:00,497
> SliceQueryFilter.java:320 - Read 2001 live and 11129 tombstone cells in
> foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000
> columns were requested, slices=[-]
> WARN [SharedPool-Worker-1] 2016-11-21 17:02:02,559
> SliceQueryFilter.java:320 - Read 2001 live and 11064 tombstone cells in
> foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000
> columns were requested, slices=[di[42FB29E1-8C99-45BE-8A44-9480C50C6BC4]:!-]
> WARN [SharedPool-Worker-2] 2016-11-21 17:02:05,286
> SliceQueryFilter.java:320 - Read 2001 live and 11064 tombstone cells in
> foo.install_info for key: foo@IOS:7 (see tombstone_warn_threshold). 2000
> columns were requested, slices=[di[42FB29E1-8C99-45BE-8A44-9480C50C6BC4]:!-]
> WARN [SharedPool-Worker-11] 2016-11-21 17:02:08,860
> SliceQueryFilter.java:320 - Read 2001 live and 19966 tombstone cells in
> foo.install_info for key: foo@IOS:10 (see tombstone_warn_threshold). 2000
> columns were requested, slices=[-]
>
> So, we're guessing this is bad since it's warning us, however does this
> have a significant impact on the heap / GC? I don't really know.
>
> - cfstats tells me this:
>
> Average live cells per slice (last five minutes): 1458.029594846951
> Maximum live cells per slice (last five minutes): 2001.0
> Average tombstones per slice (last five minutes): 1108.2466913854232
> Maximum tombstones per slice (last five minutes): 22602.0
>
> - Regarding swap, it's not disabled anywhere, I must say we never really
>   thought about it. Does it provide a significant benefit?
>
> Thanks for your help, really appreciated!
>
> On Mon, Nov 21, 2016, at 04:13 PM, Alexander Dejanovski wrote:
>
> Vincent,
>
> only the 2.68GB partition is out of bounds here, all the others (<256MB)
> shouldn't be much of a problem.
> It could put pressure on your heap if it is often read and/or compacted.
> But to answer your question about the 1% harming the cluster, a few big
> partitions can definitely be a big problem depending on your access
> patterns.
> Which compaction strategy are you using on this table?
>
> Could you provide/check the following things on a node that crashed
> recently:
>
> - Hardware specifications (how many cores? how much RAM? Bare metal
>   or VMs?)
> - Java version
> - GC pauses throughout a day (grep GCInspector
>   /var/log/cassandra/system.log): check if you have many pauses that
>   take more than 1 second
> - GC logs at the time of a crash (if you don't produce any, you should
>   activate them in cassandra-env.sh)
> - Tombstone warnings in the logs and a high number of tombstones read in
>   cfstats
> - Make sure swap is disabled
>
> Cheers,
>
> On Mon, Nov 21, 2016 at 2:57 PM Vincent Rischmann <m...@vrischmann.me>
> wrote:
>
> @Vladimir
>
> We tried with 12GB and 16GB, the problem eventually appeared too.
> In this particular cluster we have 143 tables across 2 keyspaces.
>
> @Alexander
>
> We have one table with a max partition of 2.68GB, one of 256MB, a bunch
> with sizes varying between 10MB and 100MB or so. Then there's the rest
> with the max lower than 10MB.
>
> On the biggest, the 99% is around 60MB, 98% around 25MB, 95% around 5.5MB.
> On the one with a max of 256MB, the 99% is around 4.6MB, 98% around 2MB.
>
> Could the 1% here really have that much impact? We do write a lot to the
> biggest table and read quite often too, however I have no way to know if
> that big partition is ever read.
>
> On Mon, Nov 21, 2016, at 01:09 PM, Alexander Dejanovski wrote:
>
> Hi Vincent,
>
> one of the usual causes of OOMs is very large partitions.
> Could you check your nodetool cfstats output in search of large
> partitions? If you find one (or more), run nodetool cfhistograms on those
> tables to get a view of the partition size distribution.
>
> Thanks
>
> On Mon, Nov 21, 2016 at 12:01 PM Vladimir Yudovin <vla...@winguzone.com>
> wrote:
>
> Did you try any value in the 8-20GB range (e.g. 60-70% of physical
> memory)? Also, how many tables do you have across all keyspaces? Each
> table can consume a minimum of 1MB of Java heap.
>
> Best regards, Vladimir Yudovin,
> Winguzone <https://winguzone.com?from=list> - Hosted Cloud Cassandra.
> Launch your cluster in minutes.
>
> ---- On Mon, 21 Nov 2016 05:13:12 -0500, Vincent Rischmann
> <m...@vrischmann.me> wrote ----
>
> Hello,
>
> we have an 8-node Cassandra 2.1.15 cluster at work which is giving us a
> lot of trouble lately.
>
> The problem is simple: nodes regularly die because of an out of memory
> exception, or because the Linux OOM killer decides to kill the process.
> A couple of weeks ago we increased the heap to 20GB hoping it would solve
> the out of memory errors, but in fact it didn't; instead of getting an
> out of memory exception, the OOM killer killed the JVM.
>
> We reduced the heap on some nodes to 8GB to see if it would work better,
> but some nodes crashed again with an out of memory exception.
>
> I suspect some of our tables are badly modelled, which would cause
> Cassandra to allocate a lot of memory, however I don't know how to prove
> that and/or find which table is bad, and which query is responsible.
>
> I tried looking at metrics in JMX, and tried profiling using Mission
> Control, but it didn't really help; it's possible I missed it because I
> have no idea what to look for exactly.
>
> Anyone have some advice for troubleshooting this?
>
> Thanks.
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com


--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com