I moved my site over to Cassandra a few months ago, and everything has been 
just peachy until a few hours ago (yes, it would be in the middle of the night) 
when my entire cluster suffered death by GC. By death by GC, I mean this:

[rwille@cas031 cassandra]$ grep GC system.log | head -5
 INFO [ScheduledTasks:1] 2015-03-31 02:49:57,480 GCInspector.java (line 116) GC 
for ConcurrentMarkSweep: 30219 ms for 1 collections, 7664429440 used; max is 
8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:50:32,180 GCInspector.java (line 116) GC 
for ConcurrentMarkSweep: 30673 ms for 1 collections, 7707488712 used; max is 
8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:51:05,108 GCInspector.java (line 116) GC 
for ConcurrentMarkSweep: 30453 ms for 1 collections, 7693634672 used; max is 
8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:51:38,787 GCInspector.java (line 116) GC 
for ConcurrentMarkSweep: 30691 ms for 1 collections, 7686028472 used; max is 
8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:52:12,452 GCInspector.java (line 116) GC 
for ConcurrentMarkSweep: 30346 ms for 1 collections, 7701401200 used; max is 
8329887744

I’m pretty sure I know what triggered it. When I first started developing to 
Cassandra, I found the IN clause to be supremely useful, and I used it a lot. 
Later I figured out it was a bad thing and repented and fixed my code, but I 
missed one spot. A maintenance task spent a couple of hours repeatedly issuing 
queries with IN clauses with 1000 items in the clause and the whole system went 
belly up.

I get that my bad queries caused Cassandra to require more heap than was 
available, but here’s what I don’t understand. When the crap hit the fan, the 
maintenance task died due to a timeout error, but the cluster never recovered. 
I would have expected that when I was no longer issuing the bad queries, that 
the heap would get cleaned up and life would resume to normal. Can anybody help 
me understand why Cassandra wouldn’t recover? How is it that GC pressure will 
cause heap to be permanently uncollectable?

This makes me pretty worried. I can fix my code, but I don’t really have 
control over spikes. If memory pressure spikes, I can tolerate some timeouts 
and errors, but if it can’t come back when the pressure is gone, that seems 
pretty bad.

Any insights would be greatly appreciated

Robert


Reply via email to