Still looking for help! We have stopped almost ALL traffic to the cluster, and some nodes are still showing almost 1000% CPU for Cassandra with no iostat activity. We were running cleanup on one of the nodes that was not showing load spikes; however, when I now attempt to stop cleanup there via nodetool stop cleanup, the java task for stopping cleanup itself is at 1500% CPU and has not returned after 2 minutes. This is VERY odd behavior. Any ideas? Hardware failure? Network? We are not seeing anything there, but wanted to get ideas.
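One way to see whether that 1000-1500% CPU is GC or application threads is to map the hottest OS threads to Java stack frames. A minimal sketch, assuming a standard JDK with jstack on the PATH (the PID and TID placeholders are hypothetical):

```shell
#!/bin/sh
# Sketch: correlate top's hottest threads with jstack output.
# top -H shows per-thread CPU with DECIMAL thread ids, while jstack
# reports them as hex in the "nid=0x..." field, so convert first.

tid_to_nid() {
  # Convert a decimal thread id (from top -H) to jstack's nid token.
  printf 'nid=0x%x' "$1"
}

hot_tids() {
  # Top 5 CPU-consuming thread ids of a process, one per line.
  pid="$1"
  top -b -H -n 1 -p "$pid" | awk 'NR > 7 { print $1 }' | head -5
}

# Usage (hypothetical PID/TID):
#   hot_tids <java-pid>
#   jstack <java-pid> | grep -A 20 "$(tid_to_nid <tid>)"
```

If the hot threads turn out to be "GC task thread" / "Concurrent Mark-Sweep GC Thread" rather than compaction or read threads, that points at a GC spiral rather than the cleanup itself.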
Thanks

From: Keith Wright <kwri...@nanigans.com>
Reply-To: user@cassandra.apache.org
Date: Tuesday, August 20, 2013 8:32 PM
To: user@cassandra.apache.org
Subject: Nodes get stuck

Hi all,

We are using C* 1.2.4 with vnodes and SSDs. We have seen behavior recently where 3 of our nodes get locked up under high load in what appears to be a GC spiral, while the rest of the cluster (7 nodes total) appears fine. When I run tpstats, I see the following (assuming tpstats returns at all), and top shows Cassandra pegged at 2000%. Obviously we have a large number of blocked reads. In the past I could explain this by unexpectedly wide rows, but we have handled that. When the cluster starts to melt down like this, it's hard to get visibility into what's going on and what triggered the issue, as everything starts to pile on. OpsCenter becomes unusable, and because the affected nodes are under GC pressure, getting any data via nodetool or JMX is also difficult. What do people do to handle these situations? We are going to start graphing reads/writes/sec per CF to Ganglia in the hope that it helps.

Thanks

Pool Name               Active  Pending  Completed   Blocked  All time blocked
ReadStage               256     381      1245117434  0        0
RequestResponseStage    0       0        1161495947  0        0
MutationStage           8       8        481721887   0        0
ReadRepairStage         0       0        85770600    0        0
ReplicateOnWriteStage   0       0        21896804    0        0
GossipStage             0       0        1546196     0        0
AntiEntropyStage        0       0        5009        0        0
MigrationStage          0       0        1082        0        0
MemtablePostFlusher     0       0        10178       0        0
FlushWriter             0       0        6081        0        2075
MiscStage               0       0        57          0        0
commitlog_archiver      0       0        0           0        0
AntiEntropySessions     0       0        0           0        0
InternalResponseStage   0       0        6           0        0
HintedHandoff           1       1        246         0        0

Message type      Dropped
RANGE_SLICE       482
READ_REPAIR       0
BINARY            0
READ              515762
MUTATION          39
_TRACE            0
REQUEST_RESPONSE  29
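Since nodetool and JMX become unresponsive once a node is deep in GC pressure, one common workaround is to snapshot tpstats continuously so there is history to look at after a meltdown. A minimal sketch; the host, interval, and log path are assumptions for illustration:

```shell
#!/bin/sh
# Sketch: periodically append timestamped tpstats samples to a log file,
# so thread-pool backlogs can be correlated with GC log entries later.
# HOST, INTERVAL, and LOG are placeholders; adjust for your environment.
HOST=${HOST:-localhost}
INTERVAL=${INTERVAL:-30}
LOG=${LOG:-/var/log/cassandra/tpstats-history.log}

ts() {
  # ISO-8601 UTC timestamp for correlating samples with GC logs.
  date -u +%Y-%m-%dT%H:%M:%SZ
}

snapshot() {
  printf '=== %s ===\n' "$(ts)"
  nodetool -h "$HOST" tpstats
}

# Collection loop (run under a supervisor or in screen/tmux):
#   while true; do snapshot >> "$LOG"; sleep "$INTERVAL"; done
```

Because each sample is cheap and written locally, the log keeps filling even while OpsCenter and ad-hoc nodetool calls are timing out, which helps pin down which pool (e.g. ReadStage) backed up first.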