Try turning on GC logging in cassandra-env.sh, specifically:

        -XX:+PrintGCApplicationStoppedTime
        -Xloggc:/var/log/cassandra/gc.log

Look for lines like: "Total time for which application threads were
stopped: 52.8795600 seconds". Pauses longer than a few seconds may be
causing your problem.
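
To scan the log for those lines, something like the following might help. It
is only a sketch: the function name long_pauses and the default threshold of
1 second are my own choices, and it assumes the stopped-time message has the
exact wording quoted above.

```shell
# long_pauses FILE [THRESHOLD_SECONDS]
# Print every "application threads were stopped" line whose pause is longer
# than the threshold. Name and default threshold are illustrative only.
long_pauses() {
    log_file="$1"
    threshold="${2:-1}"
    awk -v limit="$threshold" '
        /Total time for which application threads were stopped: / {
            # Take the text right after "stopped: " (9 characters long);
            # adding 0 makes awk read the leading number as the pause length.
            rest = substr($0, index($0, "stopped: ") + 9)
            if (rest + 0 > limit) print
        }
    ' "$log_file"
}
```

For example, `long_pauses /var/log/cassandra/gc.log 5` would list only pauses
longer than five seconds.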

Stop-the-world GC is a real pain. In my cluster I was, and still am to some
extent, seeing each node go 'down' about 10-30 times a day, and up to a few
hundred times when running major compactions (counted by grepping through
the Cassandra system log). GC tuning is an art unto itself, but if this is
your problem, try:
        - lower the memtable flush thresholds
        - reduce the new gen size (the -Xmn setting, which is set
          explicitly in 0.7.1+)
        - reduce CMSInitiatingOccupancyFraction from 75 to 60 or so (maybe
          less)
        - set -XX:ParallelGCThreads=<NUMBER OF CPU CORES>
        - set -XX:ParallelCMSThreads=<NUMBER OF CPU CORES>
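
As a rough sketch, the JVM-level items in that list could look like this in
cassandra-env.sh. The 200M new gen size and the core count of 4 are
placeholders I made up, not recommendations. If your copy of the file already
sets any of these flags, edit the existing line rather than adding a second
one. The memtable flush thresholds are not JVM options, so they are not shown
here.

```shell
# Sketch only: every value below is a placeholder to tune for your hardware.

# Smaller young generation (200M is an arbitrary example size).
JVM_OPTS="$JVM_OPTS -Xmn200M"

# Start CMS collections earlier, at 60% old gen occupancy instead of 75%.
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=60"

# Match the GC thread counts to the number of CPU cores.
CORES=4   # assumption: replace with the real core count of the node
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=$CORES"
JVM_OPTS="$JVM_OPTS -XX:ParallelCMSThreads=$CORES"
```

Change one setting at a time and compare the gc.log pauses before and after,
otherwise it is hard to tell which change helped.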

Again, I would recommend you do some more research into GC tuning
(http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html is a
good place to start). Most of my recommendations above will probably reduce
the chance of your nodes going 'down' but may have pretty severe negative
performance impacts. In my cluster, I found that the measures needed to
ensure a node never (or rarely; it can't be completely prevented) went down
were just not worth it. I have ended up running the nodes closer to the wire
and living with an increased rate of client-side exceptions and nodes going
down for short periods.
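
If you want to keep an eye on how often nodes get marked down, the grep
through the system log can be scripted roughly like this. Note the
assumptions: the function name is made up, and the default pattern
"is now dead" is what I believe the gossiper logs when it marks a peer down,
so check it against your own system.log and pass a different pattern if yours
differs. It also assumes each log line carries a YYYY-MM-DD date, as in the
WARN lines quoted below.

```shell
# down_events_per_day FILE [PATTERN]
# Count matching log lines per calendar day. The default pattern is an
# assumption about the log wording; override it if your log differs.
down_events_per_day() {
    log_file="$1"
    pattern="${2:-is now dead}"
    grep -- "$pattern" "$log_file" |
        awk 'match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/) {
                 # Use the first YYYY-MM-DD found on the line as the key.
                 count[substr($0, RSTART, RLENGTH)]++
             }
             END { for (day in count) print day, count[day] }' |
        sort
}
```

Usage: `down_events_per_day /var/log/cassandra/system.log` prints one line per
day with the date followed by the number of matching lines.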

Dan

-----Original Message-----
From: Andy Skalet [mailto:aeska...@bitjug.com] 
Sent: February-17-11 4:18
To: Peter Schuller
Cc: user@cassandra.apache.org
Subject: Re: frequent client exceptions on 0.7.0

On Thu, Feb 17, 2011 at 12:37 AM, Peter Schuller
<peter.schul...@infidyne.com> wrote:
> Bottom line: Check /var/log/cassandra/system.log to begin with and see
> if it's reporting anything or being restarted.

Thanks, Peter.

In the system.log, I see quite a few of these across several machines.
 Everything else in the log is INFO level.

 WARN [ScheduledTasks:1] 2011-02-17 07:19:47,491 MessagingService.java
(line 545) Dropped 182 READ messages in the last 5000ms
 WARN [ScheduledTasks:1] 2011-02-17 08:10:06,142 MessagingService.java
(line 545) Dropped 31 READ messages in the last 5000ms
 WARN [ScheduledTasks:1] 2011-02-17 08:11:12,237 MessagingService.java
(line 545) Dropped 54 READ messages in the last 5000ms
 WARN [ScheduledTasks:1] 2011-02-17 08:11:17,392 MessagingService.java
(line 545) Dropped 487 READ messages in the last 5000ms

The machines are in EC2 with firewall permission to talk to each
other, so while not the most solid of network environments, at least
pretty common these days.  System is not going down, and cassandra
process is not dying.

Andy