I moved my site over to Cassandra a few months ago, and everything has been just peachy until a few hours ago (yes, it would be in the middle of the night) when my entire cluster suffered death by GC. By death by GC, I mean this:

[rwille@cas031 cassandra]$ grep GC system.log | head -5
INFO [ScheduledTasks:1] 2015-03-31 02:49:57,480 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30219 ms for 1 collections, 7664429440 used; max is 8329887744
INFO [ScheduledTasks:1] 2015-03-31 02:50:32,180 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30673 ms for 1 collections, 7707488712 used; max is 8329887744
INFO [ScheduledTasks:1] 2015-03-31 02:51:05,108 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30453 ms for 1 collections, 7693634672 used; max is 8329887744
INFO [ScheduledTasks:1] 2015-03-31 02:51:38,787 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30691 ms for 1 collections, 7686028472 used; max is 8329887744
INFO [ScheduledTasks:1] 2015-03-31 02:52:12,452 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30346 ms for 1 collections, 7701401200 used; max is 8329887744

I’m pretty sure I know what triggered it. When I first started developing against Cassandra, I found the IN clause to be supremely useful, and I used it a lot. Later I figured out it was a bad thing, repented, and fixed my code, but I missed one spot. A maintenance task spent a couple of hours repeatedly issuing queries with 1000 items in the IN clause, and the whole system went belly up.

I get that my bad queries caused Cassandra to require more heap than was available, but here’s what I don’t understand. When the crap hit the fan, the maintenance task died due to a timeout error, but the cluster never recovered. I would have expected that once I was no longer issuing the bad queries, the heap would get cleaned up and life would return to normal. Can anybody help me understand why Cassandra wouldn’t recover? How is it that GC pressure can leave the heap permanently uncollectable?

This makes me pretty worried. I can fix my code, but I don’t really have control over spikes.
If memory pressure spikes, I can tolerate some timeouts and errors, but if the cluster can’t come back once the pressure is gone, that seems pretty bad.

Any insights would be greatly appreciated.

Robert
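
To put numbers on the log excerpt above, here is a minimal Python sketch that pulls the pause time and heap occupancy out of each GCInspector line. It assumes the "used" figure is the heap in use at the moment the line was logged, and that the lines follow exactly the format quoted above; the function name summarize_gc and the regular expression are made up for illustration and are not part of Cassandra.

```python
import re

# Matches the GCInspector message format quoted in the post (an assumption:
# other Cassandra versions may word this line differently).
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms for (?P<count>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)


def summarize_gc(lines):
    """Return a list of (pause_ms, fraction_of_max_heap_used), one per matching line."""
    results = []
    for line in lines:
        match = GC_LINE.search(line)
        if match:
            pause_ms = int(match["ms"])
            fraction = int(match["used"]) / int(match["max"])
            results.append((pause_ms, fraction))
    return results


# The five lines from the post, trimmed to the part the pattern needs.
sample = [
    "GC for ConcurrentMarkSweep: 30219 ms for 1 collections, 7664429440 used; max is 8329887744",
    "GC for ConcurrentMarkSweep: 30673 ms for 1 collections, 7707488712 used; max is 8329887744",
    "GC for ConcurrentMarkSweep: 30453 ms for 1 collections, 7693634672 used; max is 8329887744",
    "GC for ConcurrentMarkSweep: 30691 ms for 1 collections, 7686028472 used; max is 8329887744",
    "GC for ConcurrentMarkSweep: 30346 ms for 1 collections, 7701401200 used; max is 8329887744",
]

for pause_ms, fraction in summarize_gc(sample):
    print(f"pause={pause_ms} ms, heap in use={fraction:.1%} of max")
```

Run against the full system.log (for example by passing an open file object as lines), this shows whether occupancy ever drops between collections or stays pinned near the maximum, which is the behaviour the question is about.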