Hey all,

Our setup is 5 machines running Cassandra 0.7.0, each with 24GB of heap and 
1.5TB of disk, collocated in a DC. We're doing bulk imports from each of the 
nodes with RF = 2 and write consistency ANY (write perf is very important). 
The behavior we're seeing is this:


- Nodes often see each other as dead even though none of them actually goes 
down. I suspect this may be due to long GCs. Increasing the RPC timeout might 
help, but I'm not convinced that is the root of the problem. Note that in this 
case writes fail with an UnavailableException.

- As mentioned, long GCs. We see the ParNew GC doing a lot of smaller 
collections (a few hundred MB) which are very fast (a few hundred ms), but 
every once in a while the ConcurrentMarkSweep will take a LONG time (up to 15 
min!) to collect upwards of 15GB at once.

- On some nodes, we see a lot of pending MutationStage tasks build up (e.g. 
500K), which leads to the message "Dropped X MUTATION messages in the last 
5000ms", presumably meaning that Cassandra has decided not to write one of the 
replicas of the data. This is not a HUGE deal, but is less than ideal.

- The end result is that a bunch of writes end up failing with 
UnavailableExceptions, so not all of our data is getting into Cassandra.
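
To put numbers on the GC behavior described above, here is a minimal sketch 
that reads per-collector counts and cumulative pause time through the standard 
java.lang.management API. It reports on the JVM it runs in; pointing it at a 
Cassandra node would mean reading the same MXBeans over a remote JMX 
connection, which is not shown. The class name GcSummary and its helper are 
made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcSummary {
    // Average time per collection in ms; 0 when nothing has been collected
    // yet or the collector does not report a count (-1).
    static long averagePauseMillis(long collections, long totalMillis) {
        return collections > 0 ? totalMillis / collections : 0;
    }

    // One line per collector registered in this JVM (e.g. ParNew and
    // ConcurrentMarkSweep when CMS is in use).
    static List<String> summarize() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            long millis = gc.getCollectionTime();
            lines.add(gc.getName() + ": " + count + " collections, "
                    + millis + " ms total, "
                    + averagePauseMillis(count, millis) + " ms avg");
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : summarize()) {
            System.out.println(line);
        }
    }
}
```

Sampling this periodically and diffing the totals would show whether the 
stretches where nodes are marked dead line up with the long CMS collections.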

So my question is: what is the best way to avoid this behavior? Our memtable 
thresholds are fairly low (256MB), so there should be plenty of heap space to 
work with. We may experiment with write consistency ONE or ALL to see whether 
the perf hit is acceptable, but I wanted to get some opinions on why this might 
be happening. Thanks!
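
For reference when comparing ANY, ONE and ALL, here is a small sketch of how 
many acknowledgements a write waits for at each level, following the usual 
definitions (ANY and ONE need one, with a stored hint counting for ANY; QUORUM 
needs RF/2 + 1; ALL needs RF). The class and method names are hypothetical and 
this is not Cassandra's own code.

```java
class RequiredAcks {
    enum Level { ANY, ONE, QUORUM, ALL }

    // Number of acknowledgements a write waits for at the given level.
    // For ANY, the single acknowledgement may be a stored hint rather
    // than a write to an actual replica.
    static int requiredAcks(Level level, int replicationFactor) {
        switch (level) {
            case ANY:
            case ONE:
                return 1;
            case QUORUM:
                return replicationFactor / 2 + 1;
            case ALL:
                return replicationFactor;
            default:
                throw new IllegalArgumentException("unknown level: " + level);
        }
    }

    public static void main(String[] args) {
        int rf = 2; // replication factor from the setup described above
        for (Level level : Level.values()) {
            System.out.println(level + " with RF=" + rf + " waits for "
                    + requiredAcks(level, rf) + " ack(s)");
        }
    }
}
```

With RF = 2, QUORUM and ALL both wait for 2 acknowledgements, so either one 
would need both replicas of a key to be reachable for the write to succeed.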

-Jeffrey
