Still looking for help!  We have stopped almost ALL traffic to the cluster, yet 
some nodes are still showing almost 1000% CPU for Cassandra with no iostat 
activity.  We were running cleanup on one of the nodes that was not showing 
load spikes; however, when I now try to stop that cleanup via nodetool stop 
cleanup, the java process for the stop command itself is at 1500% CPU and has 
not returned after 2 minutes.  This is VERY odd behavior.  Any ideas?  Hardware 
failure?  Network?  We are not seeing anything on either front, but wanted to 
get ideas.
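
For what it's worth, this is the kind of check we are running to see whether 
the CPU is going to GC threads or to a stuck cleanup/compaction thread (a rough 
sketch; the pgrep pattern is just an assumption about how the daemon shows up 
in ps on our boxes):

# Rough sketch: map the hottest Cassandra thread to a stack trace.
# Assumes the JVM is visible in ps as "CassandraDaemon" and that jstack
# (from the JDK) is on the PATH.
PID=$(pgrep -f CassandraDaemon)

# Per-thread CPU for the JVM, hottest first (pcpu is averaged over the
# thread's lifetime; top -H -p "$PID" gives an instantaneous view instead).
ps -Lp "$PID" -o tid,pcpu,comm --sort=-pcpu | head -20

# Convert the hottest thread id to hex and find it in a thread dump.
TID=$(ps -Lp "$PID" -o tid= --sort=-pcpu | head -1 | tr -d ' ')
NID=$(printf '0x%x' "$TID")
jstack "$PID" | grep -A 15 "nid=$NID"

# And checking whether the cleanup is actually making progress at all:
nodetool compactionstats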

Thanks

From: Keith Wright <kwri...@nanigans.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, August 20, 2013 8:32 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Nodes get stuck

Hi all,

    We are using C* 1.2.4 with vnodes and SSDs.  We have recently seen behavior 
where 3 of our nodes get locked up under high load in what appears to be a GC 
spiral while the rest of the cluster (7 nodes total) appears fine.  When I run 
tpstats, I see the following (assuming tpstats returns at all), and top shows 
Cassandra pegged at 2000% CPU.  Obviously we have a large number of blocked 
reads.  In the past I could explain this by unexpectedly wide rows, but we have 
handled that.  When the cluster starts to melt down like this, it's hard to get 
visibility into what's going on and what triggered the issue, as everything 
starts to pile up.  OpsCenter becomes unusable, and because the affected nodes 
are under GC pressure, getting any data via nodetool or JMX is also difficult.  
What do people do to handle these situations?  We are going to start graphing 
reads/writes/sec per CF to Ganglia in the hope that it helps.
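
One thing that still seems to answer when nodetool/JMX will not is jstat 
pointed straight at the JVM, so a crude check along these lines might at least 
confirm the GC-spiral theory (a sketch; assumes the JDK tools are installed on 
the nodes and the same CassandraDaemon pgrep pattern as above):

# Sample GC activity every second, 30 samples.  FGC/FGCT climbing fast while
# O (old-gen % used) stays pinned near 100 is the usual GC-spiral signature.
PID=$(pgrep -f CassandraDaemon)
jstat -gcutil "$PID" 1000 30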

Thanks

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
RequestResponseStage              0         0     1161495947         0                 0
MutationStage                     8         8      481721887         0                 0
ReadRepairStage                   0         0       85770600         0                 0
ReplicateOnWriteStage             0         0       21896804         0                 0
GossipStage                       0         0        1546196         0                 0
AntiEntropyStage                  0         0           5009         0                 0
MigrationStage                    0         0           1082         0                 0
MemtablePostFlusher               0         0          10178         0                 0
FlushWriter                       0         0           6081         0              2075
MiscStage                         0         0             57         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              6         0                 0
HintedHandoff                     1         1            246         0                 0

Message type           Dropped
RANGE_SLICE                482
READ_REPAIR                  0
BINARY                       0
READ                    515762
MUTATION                    39
_TRACE                       0
REQUEST_RESPONSE            29
