Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
On Apr 7, 2011, at 23:43, Jonathan Ellis wrote:

> The history is that, way back in the early days, we used to max it out
> the other way (MTT=128) but observed behavior is that objects that
> survive 1 new gen collection are very likely to survive forever.

Just a quick note: my own tests seem to indicate likewise. I've been running one machine in our cluster with MTT=8 for some time now, and I'm not seeing any real difference between MTT=1 and MTT=8, except for a (very) slight increase in CPU usage, consistent with the increased copying. We have a read-heavy cluster that uses RowCaches a lot, so YMMV.

/Janne
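For anyone who wants to run the same experiment, a minimal sketch of the relevant flag, assuming the stock cassandra-env.sh layout (the value 8 is illustrative; Cassandra ships with MaxTenuringThreshold=1):

    # cassandra-env.sh -- raise the tenuring threshold so objects age
    # through the survivor spaces instead of promoting to tenured after
    # surviving a single ParNew collection:
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"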
Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
2011/4/7 Jonathan Ellis jbel...@gmail.com:

> Hypothesis: it's probably the flush causing the CMS, not the snapshot
> linking.
> [...]
> The fix is a combination of reducing XX:CMSInitiatingOccupancyFraction
> and (possibly) increasing heap capacity if your heap is simply too full
> too much of the time. You can also mitigate it by increasing the phi
> threshold for the failure detector, so the node doing the GC doesn't
> mark everyone else as dead.
> [...] the real fix for that is
> https://issues.apache.org/jira/browse/CASSANDRA-2252.

Jonathan, do you have plans to backport this to the 0.7 branch? (It's very hard to tune CMS, and the task becomes much harder for people who are new to Java.)
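For readers in the same position, a minimal sketch of the two knobs referenced above, assuming the stock cassandra-env.sh and cassandra.yaml layouts (the values shown are illustrative starting points, not recommendations):

    # cassandra-env.sh -- start CMS earlier so it can finish before the
    # heap fills (the shipped default is 75):
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=65"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

    # cassandra.yaml -- raise the failure detector's phi threshold so a
    # node pausing for GC isn't marked dead as quickly (default is 8):
    phi_convict_threshold: 10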
Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
No, 2252 is not suitable for backporting to 0.7.

On Thu, Apr 7, 2011 at 7:33 AM, ruslan usifov ruslan.usi...@gmail.com wrote:

> Jonathan, do you have plans to backport this to the 0.7 branch? (It's
> very hard to tune CMS, and the task becomes much harder for people who
> are new to Java.)

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
I'll capture what we're seeing here for anyone else who may look into this in more detail later.

Our standard heap growth is ~300K between collections, with regular ParNew collections happening on average about every 4 seconds. All very healthy.

The memtable flush (where we see almost all our CMS activity) seems to have a balloon effect: despite a 64MB memtable size, over 512MB of heap is consumed in half a second. In addition to the hefty amount of garbage this creates, the MaxTenuringThreshold=1 setting means most of that garbage spills immediately into the tenured generation, which quickly fills and triggers a CMS. The rate of garbage overflowing into tenured seems to outstrip the concurrent mark worker, which is almost always interrupted, failing the concurrent collection (a concurrent mode failure). However, the tenured collection is usually hugely effective, recovering over half the total heap.

Two questions for the group then:

1) Does this seem like a sane amount of garbage (512MB) to generate when flushing a 64MB table to disk?

2) Is this possibly a case of MaxTenuringThreshold=1 working against Cassandra? The flush seems to create a lot of garbage very quickly, such that a normal CMS isn't even possible. I'm sure there was a reason to introduce this setting, but I'm not sure it's universally beneficial. Is there any history on the decision to opt for immediate promotion rather than letting objects age through the survivor spaces?
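For anyone wanting to reproduce this analysis, a sketch of the standard HotSpot logging flags that surface this behavior, added to cassandra-env.sh (the log path is a placeholder):

    # cassandra-env.sh -- GC logging to quantify the flush-time spike
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
    # Per-age survivor occupancy: shows how much each ParNew promotes
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"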
Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
On Thu, Apr 7, 2011 at 2:27 PM, Erik Onnen eon...@gmail.com wrote:

> 1) Does this seem like a sane amount of garbage (512MB) to generate
> when flushing a 64MB table to disk?

Sort of -- that's just about exactly the amount of space you'd expect 64MB of serialized data to take in memory. (Not very efficient, I know.) So you would expect that much to become available to GC after a flush.

Also, flush creates a buffer equal to in_memory_compaction_limit, so that will also generate a spike. I think you upgraded from 0.6 -- if the converter turned the row size warning limit into i_m_c_l, then it could be much larger.

Otherwise, I'm not sure why flush would consume that much *extra*. It smells like something unexpected in the flush code to me, but I don't see anything obvious: SSTableWriter serializes directly to the output stream without (m)any other allocations.

> 2) Is this possibly a case of MaxTenuringThreshold=1 working against
> Cassandra? [...] Is there any history on the decision to opt for
> immediate promotion rather than letting objects age through the
> survivor spaces?

The history is that, way back in the early days, we used to max it out the other way (MTT=128), but the observed behavior is that objects that survive one new-gen collection are very likely to survive forever. This fits with what we expect theoretically: read requests and ephemera from write requests live for a small number of ms, but memtable data is not GC-able until flush. (Row cache data, of course, is effectively unbounded in tenure.) Keeping long-lived data in a survivor space just makes new-gen collections take longer, since you are copying that data back and forth over and over. (We have advised some read-heavy customers to ramp up to MTT=16, so it's not a hard-and-fast rule, but it still feels like a reasonable starting point to me.)

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
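To rule out the 0.6-to-0.7 converter issue mentioned above, it's worth checking this setting in cassandra.yaml (name as in 0.7; 64 is the usual shipped default):

    # cassandra.yaml -- flush allocates a buffer of this size, so a value
    # inflated by the 0.6 config converter (e.g. carried over from the
    # old row-size warning limit) directly inflates the flush-time spike.
    in_memory_compaction_limit_in_mb: 64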
Flush / Snapshot Triggering Full GCs, Leaving Ring
Hello,

We're running a six-node 0.7.4 ring in EC2 on m1.xlarge instances with a 4GB heap (15GB total memory, 4 cores, dataset fits in RAM, storage on ephemeral disk). We've noticed a brief flurry of query failures during the night corresponding with our backup schedule. More specifically, our logs suggest that calling nodetool snapshot on a node triggers 12- to 16-second CMS GCs and a promotion failure resulting in a full stop-the-world collection, during which the node is marked dead by the ring until re-joining shortly after.

Here's a log from one of the nodes, along with system info and JVM options: https://gist.github.com/e12c6cae500e118676d1

At 13:15:00, our backup cron job runs, which calls nodetool flush, then nodetool snapshot. (After investigating, we noticed that calling both flush and snapshot is unnecessary, and have since updated the script to only call snapshot.) While writing memtables, we'll generally see a GC logged via Cassandra such as:

    GC for ConcurrentMarkSweep: 16113 ms, 1755422432 reclaimed leaving 1869123536 used; max is 4424663040

In the JVM GC logs, we'll often see a tenured promotion failure occurring during this collection, resulting in a full stop-the-world GC like this (different node):

    1180629.380: [CMS1180634.414: [CMS-concurrent-mark: 6.041/6.468 secs] [Times: user=8.00 sys=0.10, real=6.46 secs] (concurrent mode failure): 3904635K->1700629K(4109120K), 16.0548910 secs] 3958389K->1700629K(4185792K), [CMS Perm : 19610K->19601K(32796K)], 16.1057040 secs] [Times: user=14.39 sys=0.02, real=16.10 secs]

During the GC, the rest of the ring will shun the node, and when the collection completes, the node will mark all other hosts in the ring as dead. The node and ring stabilize shortly after, once detecting each other as up and completing hinted handoff (details in log).

Yesterday we enabled JNA on one of the nodes to prevent forking a subprocess to call `ln` during a snapshot, and we still observed a concurrent-mode-failure collection following a flush/snapshot, but the CMS was shorter (9 seconds) and did not result in the node being shunned from the ring.

While the query failures that result from this activity are brief, our retry threshold is set to 6 for timeout exceptions. We're concerned that we're exceeding that, and would like to figure out why we see long CMS collections + promotion failures triggering full GCs during a snapshot. Has anyone seen this, or have suggestions on how to prevent full GCs from occurring during a flush / snapshot?

Thanks,

- Scott

---
C. Scott Andreas
Engineer, Urban Airship, Inc.
http://www.urbanairship.com
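For context, a sketch of the simplified backup job after dropping the redundant flush (host is a placeholder; snapshot performs its own flush first, so a separate flush call just doubles the allocation spike):

    #!/bin/sh
    # Nightly Cassandra backup: snapshot only; nodetool snapshot flushes
    # memtables itself before hard-linking the SSTables.
    nodetool -h 127.0.0.1 snapshot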
Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
Hypothesis: it's probably the flush causing the CMS, not the snapshot linking.

Confirmation possibility #1: Add a logger.warn to CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be called, but let's rule it out.

Confirmation possibility #2: Force some flushes w/o snapshot.

Either way: concurrent mode failure is the easy GC problem. Hopefully you really are seeing mostly that -- it means the JVM didn't start CMS early enough, so it ran out of space before it could finish the concurrent collection and fell back to stop-the-world. The fix is a combination of reducing -XX:CMSInitiatingOccupancyFraction and (possibly) increasing heap capacity if your heap is simply too full too much of the time. You can also mitigate it by increasing the phi threshold for the failure detector, so the node doing the GC doesn't mark everyone else as dead.

(Eventually your heap will fragment and you will see STW collections due to promotion failure, but you should see that much less frequently. GC tuning to reduce fragmentation may be possible based on your workload, but that's out of scope here, and in any case the real fix for that is https://issues.apache.org/jira/browse/CASSANDRA-2252.)

On Wed, Apr 6, 2011 at 2:07 PM, C. Scott Andreas csco...@urbanairship.com wrote:

> Our logs suggest that calling nodetool snapshot on a node is triggering
> 12 to 16 second CMS GCs and a promotion failure resulting in a full
> stop-the-world collection, during which the node is marked dead by the
> ring until re-joining shortly after.
> [...]
> Has anyone seen this, or have suggestions on how to prevent full GCs
> from occurring during a flush / snapshot?

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
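A rough sketch of confirmation possibility #2 -- forcing a flush with no snapshot while watching the GC log (the keyspace name and log path are placeholders):

    # Flush without snapshotting; if the same CMS spike appears, the
    # snapshot's hard-linking is ruled out.
    nodetool -h 127.0.0.1 flush Keyspace1
    grep -i "concurrent mode failure" /var/log/cassandra/gc.log | tail -5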