Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

2011-04-10 Thread Janne Jalkanen

On Apr 7, 2011, at 23:43, Jonathan Ellis wrote:

 The history is that, way back in the early days, we used to max it out
 the other way (MTT=128) but observed behavior is that objects that
 survive 1 new gen collection are very likely to survive forever.

Just a quick note: my own tests seem to indicate likewise - I've been running 
one machine in our cluster with MTT=8 for some time now, and I'm not seeing any 
real difference between MTT=1 and MTT=8, except for a (very) slight increase in 
CPU usage, consistent with the increased copying. We have a read-heavy cluster 
that uses RowCaches a lot, so YMMV.

/Janne



Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

2011-04-07 Thread ruslan usifov
2011/4/7 Jonathan Ellis jbel...@gmail.com

 Hypothesis: it's probably the flush causing the CMS, not the snapshot
 linking.

 Confirmation possibility #1: Add a logger.warn to
 CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be
 called, but let's rule it out.

 Confirmation possibility #2: Force some flushes w/o snapshot.

 Either way: concurrent mode failure is the easy GC problem.
 Hopefully you really are seeing mostly that -- this means the JVM
 didn't start CMS early enough, so it ran out of space before it could
 finish the concurrent collection, so it falls back to stop-the-world.
 The fix is a combination of reducing -XX:CMSInitiatingOccupancyFraction
 and (possibly) increasing heap capacity if your heap is simply too
 full too much of the time.

 You can also mitigate it by increasing the phi threshold for the
 failure detector, so the node doing the GC doesn't mark everyone else
 as dead.

 (Eventually your heap will fragment and you will see STW collections
 due to "promotion failed", but you should see that much less
 frequently. GC tuning to reduce fragmentation may be possible based on
 your workload, but that's out of scope here and in any case the real
 fix for that is https://issues.apache.org/jira/browse/CASSANDRA-2252.)


Jonathan, do you have plans to backport this to the 0.7 branch? (Because it's
very hard to tune CMS, and for people who are new to Java the task becomes much
harder.)


Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

2011-04-07 Thread Jonathan Ellis
No, 2252 is not suitable for backporting to 0.7.

On Thu, Apr 7, 2011 at 7:33 AM, ruslan usifov ruslan.usi...@gmail.com wrote:


 2011/4/7 Jonathan Ellis jbel...@gmail.com

 Hypothesis: it's probably the flush causing the CMS, not the snapshot
 linking.

 Confirmation possibility #1: Add a logger.warn to
 CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be
 called, but let's rule it out.

 Confirmation possibility #2: Force some flushes w/o snapshot.

 Either way: concurrent mode failure is the easy GC problem.
 Hopefully you really are seeing mostly that -- this means the JVM
 didn't start CMS early enough, so it ran out of space before it could
 finish the concurrent collection, so it falls back to stop-the-world.
 The fix is a combination of reducing -XX:CMSInitiatingOccupancyFraction
 and (possibly) increasing heap capacity if your heap is simply too
 full too much of the time.

 You can also mitigate it by increasing the phi threshold for the
 failure detector, so the node doing the GC doesn't mark everyone else
 as dead.

 (Eventually your heap will fragment and you will see STW collections
 due to "promotion failed", but you should see that much less
 frequently. GC tuning to reduce fragmentation may be possible based on
 your workload, but that's out of scope here and in any case the real
 fix for that is https://issues.apache.org/jira/browse/CASSANDRA-2252.)


 Jonathan, do you have plans to backport this to the 0.7 branch? (Because it's
 very hard to tune CMS, and for people who are new to Java the task becomes much
 harder.)




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

2011-04-07 Thread Erik Onnen
I'll capture what we're seeing here for anyone else who may look
into this in more detail later.

Our standard heap growth is ~300K in between collections with regular
ParNew collections happening on average about every 4 seconds. All
very healthy.

The memtable flush (where we see almost all our CMS activity) seems to
have a balloon effect: despite a 64MB memtable size, it causes over
512MB of heap to be consumed in half a second. In addition to the
hefty amount of garbage it creates, the MaxTenuringThreshold=1 setting
means most of that garbage spills immediately into the tenured
generation, which quickly fills and triggers a CMS. The rate of
garbage overflowing into tenured seems to outstrip the concurrent mark
worker, which is almost always interrupted, ending in a concurrent
mode failure (stop-the-world collection). However, the tenured
collection is usually hugely effective, recovering over half the total
heap.

Two questions for the group then:

1) Does this seem like a sane amount of garbage (512MB) to generate
when flushing a 64MB table to disk?
2) Is this possibly a case of the MaxTenuringThreshold=1 working
against cassandra? The flush seems to create a lot of garbage very
quickly such that normal CMS isn't even possible. I'm sure there was a
reason to introduce this setting but I'm not sure it's universally
beneficial. Is there any history on the decision to opt for immediate
promotion rather than using an adaptable number of survivor
generations?
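
(If it helps whoever digs into this later: the promotion behavior above is
easy to watch with tenuring-distribution logging. These are standard HotSpot
flags, not anything Cassandra-specific, and the log path is just an example:

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
  -XX:+PrintTenuringDistribution -Xloggc:/var/log/cassandra/gc.log

The PrintTenuringDistribution output shows how quickly surviving objects hit
the tenuring threshold and spill into the old generation.)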


Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

2011-04-07 Thread Jonathan Ellis
On Thu, Apr 7, 2011 at 2:27 PM, Erik Onnen eon...@gmail.com wrote:
 1) Does this seem like a sane amount of garbage (512MB) to generate
 when flushing a 64MB table to disk?

Sort of -- that's just about exactly the amount of space you'd expect
64MB of serialized data to take, in memory. (Not very efficient, I
know.)  So, you would expect that much to be available to GC, after a
flush.

Also, flush creates a buffer equal to in_memory_compaction_limit.  So
that will also generate a spike.  I think you upgraded from 0.6 -- if
the converter turned row size warning limit into i_m_c_l then it could
be much larger.
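
(Assuming 0.7's key name is in_memory_compaction_limit_in_mb, it's worth
double-checking what the 0.6 config conversion actually wrote into
cassandra.yaml. Purely as an illustration of the shape of the setting:

  # cassandra.yaml
  in_memory_compaction_limit_in_mb: 64

If it ended up far above your 64MB memtable size, that alone would explain a
big chunk of the flush-time allocation spike.)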

Otherwise, not sure why flush would consume that much *extra* though.
Smells like something unexpected in the flush code to me.  I don't see
anything obvious though.  SSTableWriter serializes directly to the
outputstream without (m)any other allocations.

 2) Is this possibly a case of the MaxTenuringThreshold=1 working
 against cassandra? The flush seems to create a lot of garbage very
 quickly such that normal CMS isn't even possible. I'm sure there was a
 reason to introduce this setting but I'm not sure it's universally
 beneficial. Is there any history on the decision to opt for immediate
 promotion rather than using an adaptable number of survivor
 generations?

The history is that, way back in the early days, we used to max it out
the other way (MTT=128) but observed behavior is that objects that
survive 1 new gen collection are very likely to survive forever.
This fits with what we expect theoretically: read requests and
ephemera from write requests will happen in a small number of ms, but
memtable data is not GC-able until flush. (Rowcache data of course is
effectively unbounded in tenure.)  Keeping long-lived data in a
survivor space just makes new gen collections take longer since you
are copying that data back and forth over and over.

(We have advised some read-heavy customers to ramp up to MTT=16, so
it's not a hard-and-fast rule, but it still feels like a reasonable
starting point to me.)
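
(For anyone who wants to experiment: MaxTenuringThreshold sits with the rest
of the GC options in conf/cassandra-env.sh in 0.7, so trying a different value
is a one-line change. 16 here is just the read-heavy example above, not a
general recommendation.

  JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=16"
)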

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Flush / Snapshot Triggering Full GCs, Leaving Ring

2011-04-06 Thread C. Scott Andreas
Hello,

We're running a six-node 0.7.4 ring in EC2 on m1.xlarge instances with 4GB heap 
(15GB total memory, 4 cores, dataset fits in RAM, storage on ephemeral disk). 
We've noticed a brief flurry of query failures during the night corresponding 
with our backup schedule. More specifically, our logs suggest that calling 
nodetool snapshot on a node is triggering 12 to 16 second CMS GCs and a 
promotion failure resulting in a full stop-the-world collection, during which 
the node is marked dead by the ring until re-joining shortly after.

Here's a log from one of the nodes, along with system info and JVM options: 
https://gist.github.com/e12c6cae500e118676d1

At 13:15:00, our backup cron job runs, which calls nodetool flush, then 
nodetool snapshot. (After investigating, we noticed that calling both flush and 
snapshot is unnecessary, and have since updated the script to only call 
snapshot). While writing memtables, we'll generally see a GC logged out via 
Cassandra such as:

GC for ConcurrentMarkSweep: 16113 ms, 1755422432 reclaimed leaving 1869123536 
used; max is 4424663040.

In the JVM GC logs, we'll often see a tenured promotion failure occurring 
during this collection, resulting in a full stop-the-world GC like this 
(different node):

1180629.380: [CMS1180634.414: [CMS-concurrent-mark: 6.041/6.468 secs] [Times: 
user=8.00 sys=0.10, real=6.46 secs] 
 (concurrent mode failure): 3904635K->1700629K(4109120K), 16.0548910 secs] 
3958389K->1700629K(4185792K), [CMS Perm : 19610K->19601K(32796K)], 16.1057040 
secs] [Times: user=14.39 sys=0.02, real=16.10 secs]

During the GC, the rest of the ring will shun the node, and when the collection 
completes, the node will mark all other hosts in the ring as dead. The node and 
ring stabilize shortly after once detecting each other as up and completing 
hinted handoff (details in log).

Yesterday we enabled JNA on one of the nodes to prevent forking a subprocess to 
call `ln` during a snapshot, and we still observed a concurrent mode failure 
collection following a flush/snapshot, but the CMS was shorter (9 seconds) and 
did not result in the node being shunned from the ring.

While the query failures that result from this activity are brief, our retry 
threshold is set to 6 for timeout exceptions. We're concerned that we're 
exceeding that, and would like to figure out why we see long CMS collections + 
promotion failures triggering full GCs during a snapshot.

Has anyone seen this, or have suggestions on how to prevent full GCs from 
occurring during a flush / snapshot?

Thanks,

- Scott

---

C. Scott Andreas
Engineer, Urban Airship, Inc.
http://www.urbanairship.com

Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

2011-04-06 Thread Jonathan Ellis
Hypothesis: it's probably the flush causing the CMS, not the snapshot linking.

Confirmation possibility #1: Add a logger.warn to
CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be
called, but let's rule it out.
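
Roughly along these lines (a sketch only -- match it to the actual method
signature in your 0.7.4 checkout of o.a.c.utils.CLibrary):

  private static void createHardLinkWithExec(File sourceFile, File destinationFile) throws IOException
  {
      // With JNA working, this exec-based fallback should never run; a warn
      // here confirms whether snapshot is actually forking `ln` subprocesses.
      logger.warn("Hard-linking via exec: " + sourceFile + " -> " + destinationFile);
      // ... existing exec logic unchanged ...
  }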

Confirmation possibility #2: Force some flushes w/o snapshot.
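
Something like the following is enough (keyspace argument is optional;
substitute your own):

  nodetool -h localhost flush YourKeyspace

If the concurrent mode failures show up on a plain flush, the snapshot
hard-linking is off the hook.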

Either way: concurrent mode failure is the easy GC problem.
Hopefully you really are seeing mostly that -- this means the JVM
didn't start CMS early enough, so it ran out of space before it could
finish the concurrent collection, so it falls back to stop-the-world.
The fix is a combination of reducing -XX:CMSInitiatingOccupancyFraction
and (possibly) increasing heap capacity if your heap is simply too
full too much of the time.
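
In conf/cassandra-env.sh terms that's something like the following -- the
numbers are illustrative starting points to tune against your own GC logs,
not recommendations:

  JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=65"
  JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
  MAX_HEAP_SIZE="6G"   # only if the heap really is too full too often

UseCMSInitiatingOccupancyOnly keeps HotSpot from drifting back to its own
heuristic once you've set the fraction explicitly.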

You can also mitigate it by increasing the phi threshold for the
failure detector, so the node doing the GC doesn't mark everyone else
as dead.
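
That's phi_convict_threshold in cassandra.yaml (the default is 8 in 0.7, if I
remember right); raising it makes the failure detector more tolerant of long
pauses, at the cost of taking longer to notice genuinely dead nodes.
Illustrative value only:

  # cassandra.yaml
  phi_convict_threshold: 12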

(Eventually your heap will fragment and you will see STW collections
due to "promotion failed", but you should see that much less
frequently. GC tuning to reduce fragmentation may be possible based on
your workload, but that's out of scope here and in any case the real
fix for that is https://issues.apache.org/jira/browse/CASSANDRA-2252.)

On Wed, Apr 6, 2011 at 2:07 PM, C. Scott Andreas
csco...@urbanairship.com wrote:
 Hello,

 We're running a six-node 0.7.4 ring in EC2 on m1.xlarge instances with 4GB 
 heap (15GB total memory, 4 cores, dataset fits in RAM, storage on ephemeral 
 disk). We've noticed a brief flurry of query failures during the night 
 corresponding with our backup schedule. More specifically, our logs suggest 
 that calling nodetool snapshot on a node is triggering 12 to 16 second CMS 
 GCs and a promotion failure resulting in a full stop-the-world collection, 
 during which the node is marked dead by the ring until re-joining shortly 
 after.

 Here's a log from one of the nodes, along with system info and JVM options: 
 https://gist.github.com/e12c6cae500e118676d1

 At 13:15:00, our backup cron job runs, which calls nodetool flush, then 
 nodetool snapshot. (After investigating, we noticed that calling both flush 
 and snapshot is unnecessary, and have since updated the script to only call 
 snapshot). While writing memtables, we'll generally see a GC logged out via 
 Cassandra such as:

 GC for ConcurrentMarkSweep: 16113 ms, 1755422432 reclaimed leaving 
 1869123536 used; max is 4424663040.

 In the JVM GC logs, we'll often see a tenured promotion failure occurring 
 during this collection, resulting in a full stop-the-world GC like this 
 (different node):

 1180629.380: [CMS1180634.414: [CMS-concurrent-mark: 6.041/6.468 secs] [Times: 
 user=8.00 sys=0.10, real=6.46 secs]
  (concurrent mode failure): 3904635K->1700629K(4109120K), 16.0548910 secs] 
 3958389K->1700629K(4185792K), [CMS Perm : 19610K->19601K(32796K)], 16.1057040 
 secs] [Times: user=14.39 sys=0.02, real=16.10 secs]

 During the GC, the rest of the ring will shun the node, and when the 
 collection completes, the node will mark all other hosts in the ring as dead. 
 The node and ring stabilize shortly after once detecting each other as up and 
 completing hinted handoff (details in log).

 Yesterday we enabled JNA on one of the nodes to prevent forking a subprocess to 
 call `ln` during a snapshot, and we still observed a concurrent mode failure 
 collection following a flush/snapshot, but the CMS was shorter (9 
 seconds) and did not result in the node being shunned from the ring.

 While the query failures that result from this activity are brief, our retry 
 threshold is set to 6 for timeout exceptions. We're concerned that we're 
 exceeding that, and would like to figure out why we see long CMS collections 
 + promotion failures triggering full GCs during a snapshot.

 Has anyone seen this, or have suggestions on how to prevent full GCs from 
 occurring during a flush / snapshot?

 Thanks,

 - Scott

 ---

 C. Scott Andreas
 Engineer, Urban Airship, Inc.
 http://www.urbanairship.com



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com