Hey Robert, you might want to start by looking at Cassandra's statistics, either exposed via nodetool or, if you have a monitoring system, through the important metrics it tracks. I read this article a moment ago and I hope it helps you begin to understand where and how to determine the root cause: http://aryanet.com/blog/cassandra-garbage-collector-tuning
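To quantify what you're seeing before digging into tuning, it can help to pull the long CMS pauses out of system.log and look at how they trend over time. Here's a minimal sketch; the line format matches the GCInspector excerpt from your log, and the 10-second threshold is just an arbitrary cut-off I picked:

```python
import re

# Matches GCInspector lines like the ones in the excerpt below, e.g.
# "... GC for ConcurrentMarkSweep: 30219 ms for 1 collections, ..."
GC_LINE = re.compile(r'GC for ConcurrentMarkSweep: (\d+) ms')

def long_cms_pauses(log_lines, threshold_ms=10000):
    """Yield (timestamp, pause_ms) for each CMS pause above threshold_ms."""
    for line in log_lines:
        m = GC_LINE.search(line)
        if m and int(m.group(1)) > threshold_ms:
            parts = line.split()
            # In a GCInspector line, fields 3 and 4 are the date and time.
            yield parts[2] + ' ' + parts[3], int(m.group(1))

# Typical use: with open('/var/log/cassandra/system.log') as f:
#                  for ts, ms in long_cms_pauses(f): print(ts, ms, 'ms')
sample = ('INFO [ScheduledTasks:1] 2015-03-31 02:49:57,480 GCInspector.java '
          '(line 116) GC for ConcurrentMarkSweep: 30219 ms for 1 collections, '
          '7664429440 used; max is 8329887744')
print(list(long_cms_pauses([sample])))
```

If the pauses are back-to-back like in your excerpt, the node is spending essentially all its time in GC, which matches the "death by GC" picture.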
jason

On Tue, Mar 31, 2015 at 8:22 PM, Robert Wille <rwi...@fold3.com> wrote:
> I moved my site over to Cassandra a few months ago, and everything has been
> just peachy until a few hours ago (yes, it would be in the middle of the
> night) when my entire cluster suffered death by GC. By death by GC, I mean
> this:
>
> [rwille@cas031 cassandra]$ grep GC system.log | head -5
> INFO [ScheduledTasks:1] 2015-03-31 02:49:57,480 GCInspector.java (line 116)
> GC for ConcurrentMarkSweep: 30219 ms for 1 collections, 7664429440 used; max
> is 8329887744
> INFO [ScheduledTasks:1] 2015-03-31 02:50:32,180 GCInspector.java (line 116)
> GC for ConcurrentMarkSweep: 30673 ms for 1 collections, 7707488712 used; max
> is 8329887744
> INFO [ScheduledTasks:1] 2015-03-31 02:51:05,108 GCInspector.java (line 116)
> GC for ConcurrentMarkSweep: 30453 ms for 1 collections, 7693634672 used; max
> is 8329887744
> INFO [ScheduledTasks:1] 2015-03-31 02:51:38,787 GCInspector.java (line 116)
> GC for ConcurrentMarkSweep: 30691 ms for 1 collections, 7686028472 used; max
> is 8329887744
> INFO [ScheduledTasks:1] 2015-03-31 02:52:12,452 GCInspector.java (line 116)
> GC for ConcurrentMarkSweep: 30346 ms for 1 collections, 7701401200 used; max
> is 8329887744
>
> I’m pretty sure I know what triggered it. When I first started developing to
> Cassandra, I found the IN clause to be supremely useful, and I used it a lot.
> Later I figured out it was a bad thing and repented and fixed my code, but I
> missed one spot. A maintenance task spent a couple of hours repeatedly
> issuing queries with IN clauses with 1000 items in the clause and the whole
> system went belly up.
>
> I get that my bad queries caused Cassandra to require more heap than was
> available, but here’s what I don’t understand. When the crap hit the fan, the
> maintenance task died due to a timeout error, but the cluster never
> recovered. I would have expected that when I was no longer issuing the bad
> queries, that the heap would get cleaned up and life would resume to normal.
> Can anybody help me understand why Cassandra wouldn’t recover? How is it that
> GC pressure will cause heap to be permanently uncollectable?
>
> This makes me pretty worried. I can fix my code, but I don’t really have
> control over spikes. If memory pressure spikes, I can tolerate some timeouts
> and errors, but if it can’t come back when the pressure is gone, that seems
> pretty bad.
>
> Any insights would be greatly appreciated
>
> Robert
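On the IN-clause fix you mention: rather than one query with 1000 partition keys, the usual advice is to issue one small query per key (or per small chunk) and let the driver run them concurrently, so no single coordinator has to materialize the whole result set. A minimal sketch of the chunking side; the session/query usage in the comment is hypothetical, just to show where the chunks would go:

```python
def chunked(keys, size):
    """Split a large key list into batches of at most `size` keys."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

# Hypothetical usage with an async-capable driver session (names assumed):
#   futures = [session.execute_async(select_stmt, [chunk])
#              for chunk in chunked(ids, 10)]
#   rows = [row for f in futures for row in f.result()]

ids = list(range(1000))
batches = list(chunked(ids, 10))
print(len(batches), len(batches[0]))   # 100 batches of 10 keys each
```

That spreads the load across coordinators and bounds the per-request heap cost, instead of concentrating it the way a 1000-item IN does.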