Hypothesis: it's probably the flush that's triggering the CMS, not the
snapshot hard-linking.

Confirmation possibility #1: Add a logger.warn to
CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be
called, but let's rule it out.
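
For reference, here's a minimal sketch of the kind of diagnostic I mean
(hypothetical: it assumes the 0.7-era signature taking source and
destination File arguments, and an existing slf4j logger in CLibrary --
adjust to match your tree):

    // If JNA is handling the link, this fallback should never run, so a
    // warning here tells us definitively whether we're still exec'ing ln.
    public static void createHardLinkWithExec(File sourceFile, File destinationFile) throws IOException
    {
        logger.warn("Creating hard link via exec'd ln: {} -> {}",
                    sourceFile, destinationFile);
        // ... existing exec-based logic unchanged ...
    }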

Confirmation possibility #2: Force some flushes without a snapshot and
see if the same GC pattern shows up.
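
E.g., run something like the following a few times while tailing the GC
log (the keyspace name is a placeholder; in 0.7, nodetool flush takes a
keyspace and optionally specific column families):

    # flush memtables without taking a snapshot, then watch for CMS activity
    nodetool -h localhost flush YourKeyspace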

Either way: "concurrent mode failure" is the easy GC problem.
Hopefully you really are seeing mostly that -- this means the JVM
didn't start CMS early enough, so it ran out of space before it could
finish the concurrent collection, so it falls back to stop-the-world.
The fix is a combination of reducing XX:CMSInitiatingOccupancyFraction
and (possibly) increasing heap capacity if your heap is simply too
full too much of the time.
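
Concretely, wherever your JVM options are set (conf/cassandra-env.sh in
0.7) -- the 60 below is just an illustrative starting point, not a
recommendation; tune it against your own GC logs:

    # Start CMS when the old gen is 60% full (left to its own heuristics
    # the JVM may wait until ~92%), and force it to honor the setting:
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=60"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"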

You can also mitigate it by increasing the phi threshold for the
failure detector, so the node doing the GC doesn't mark everyone else
as dead.
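
That's phi_convict_threshold in cassandra.yaml (default 8); the value
below is a sketch of the kind of bump I mean, not a blessed number:

    # Higher values mean a node must be unresponsive longer before the
    # rest of the ring convicts it as down.
    phi_convict_threshold: 10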

(Eventually your heap will fragment and you will see STW collections
due to "promotion failed," but you should see that much less
frequently. GC tuning to reduce fragmentation may be possible based on
your workload, but that's out of scope here and in any case the "real"
fix for that is https://issues.apache.org/jira/browse/CASSANDRA-2252.)

On Wed, Apr 6, 2011 at 2:07 PM, C. Scott Andreas
<csco...@urbanairship.com> wrote:
> Hello,
>
> We're running a six-node 0.7.4 ring in EC2 on m1.xlarge instances with 4GB 
> heap (15GB total memory, 4 cores, dataset fits in RAM, storage on ephemeral 
> disk). We've noticed a brief flurry of query failures during the night 
> corresponding with our backup schedule. More specifically, our logs suggest 
> that calling "nodetool snapshot" on a node is triggering 12 to 16 second CMS 
> GCs and a promotion failure resulting in a full stop-the-world collection, 
> during which the node is marked dead by the ring until re-joining shortly 
> after.
>
> Here's a log from one of the nodes, along with system info and JVM options: 
> https://gist.github.com/e12c6cae500e118676d1
>
> At 13:15:00, our backup cron job runs, which calls nodetool flush, then 
> nodetool snapshot. (After investigating, we noticed that calling both flush 
> and snapshot is unnecessary, and have since updated the script to only call 
> snapshot). While writing memtables, we'll generally see a GC logged out via 
> Cassandra such as:
>
> "GC for ConcurrentMarkSweep: 16113 ms, 1755422432 reclaimed leaving 
> 1869123536 used; max is 4424663040."
>
> In the JVM GC logs, we'll often see a tenured promotion failure occurring 
> during this collection, resulting in a full stop-the-world GC like this 
> (different node):
>
> 1180629.380: [CMS1180634.414: [CMS-concurrent-mark: 6.041/6.468 secs] [Times: 
> user=8.00 sys=0.10, real=6.46 secs]
>  (concurrent mode failure): 3904635K->1700629K(4109120K), 16.0548910 secs] 
> 3958389K->1700629K(4185792K), [CMS Perm : 19610K->19601K(32796K)], 16.1057040 
> secs] [Times: user=14.39 sys=0.02, real=16.10 secs]
>
> During the GC, the rest of the ring will shun the node, and when the 
> collection completes, the node will mark all other hosts in the ring as dead. 
> The node and ring stabilize shortly after once detecting each other as up and 
> completing hinted handoff (details in log).
>
> Yesterday we enabled JNA on one of the nodes to prevent forking a subprocess 
> to call `ln` during snapshots, and we still observed a concurrent mode failure 
> collection following a flush/snapshot, though the CMS was shorter (9 seconds) 
> and did not result in the node being shunned from the ring.
>
> While the query failures that result from this activity are brief, our retry 
> threshold is set to 6 for timeout exceptions. We're concerned that we're 
> exceeding that, and would like to figure out why we see long CMS collections 
> + promotion failures triggering full GCs during a snapshot.
>
> Has anyone seen this, or have suggestions on how to prevent full GCs from 
> occurring during a flush / snapshot?
>
> Thanks,
>
> - Scott
>
> ---
>
> C. Scott Andreas
> Engineer, Urban Airship, Inc.
> http://www.urbanairship.com



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
