Hey Robert, you might want to start by looking at Cassandra's statistics,
either exposed via nodetool or, if you have a monitoring system, by
watching the important metrics there. I read this article a moment ago
and hope it helps you begin to understand where to look and how to
determine the root cause:
http://aryanet.com/blog/cassandra-garbage-collector-tuning
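
As a first pass, a few nodetool commands that should be available on a 2.x
cluster (run them on each node; exact output format can vary a bit between
versions, so treat this as a rough checklist rather than gospel):

```shell
# Thread pool stats: growing "pending" or "blocked" counts are a classic
# sign the node is overloaded and falling behind
nodetool tpstats

# Per-node summary, including current heap usage
nodetool info

# Per-column-family stats: memtable sizes, SSTable counts, latencies
nodetool cfstats

# Whether compactions are backed up (backlogged compactions hold memory)
nodetool compactionstats
```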

jason

On Tue, Mar 31, 2015 at 8:22 PM, Robert Wille <rwi...@fold3.com> wrote:
> I moved my site over to Cassandra a few months ago, and everything has been 
> just peachy until a few hours ago (yes, it would be in the middle of the 
> night) when my entire cluster suffered death by GC. By death by GC, I mean 
> this:
>
> [rwille@cas031 cassandra]$ grep GC system.log | head -5
>  INFO [ScheduledTasks:1] 2015-03-31 02:49:57,480 GCInspector.java (line 116) 
> GC for ConcurrentMarkSweep: 30219 ms for 1 collections, 7664429440 used; max 
> is 8329887744
>  INFO [ScheduledTasks:1] 2015-03-31 02:50:32,180 GCInspector.java (line 116) 
> GC for ConcurrentMarkSweep: 30673 ms for 1 collections, 7707488712 used; max 
> is 8329887744
>  INFO [ScheduledTasks:1] 2015-03-31 02:51:05,108 GCInspector.java (line 116) 
> GC for ConcurrentMarkSweep: 30453 ms for 1 collections, 7693634672 used; max 
> is 8329887744
>  INFO [ScheduledTasks:1] 2015-03-31 02:51:38,787 GCInspector.java (line 116) 
> GC for ConcurrentMarkSweep: 30691 ms for 1 collections, 7686028472 used; max 
> is 8329887744
>  INFO [ScheduledTasks:1] 2015-03-31 02:52:12,452 GCInspector.java (line 116) 
> GC for ConcurrentMarkSweep: 30346 ms for 1 collections, 7701401200 used; max 
> is 8329887744
>
> I’m pretty sure I know what triggered it. When I first started developing 
> against Cassandra, I found the IN clause to be supremely useful, and I used it a lot. 
> Later I figured out it was a bad thing and repented and fixed my code, but I 
> missed one spot. A maintenance task spent a couple of hours repeatedly 
> issuing queries with IN clauses with 1000 items in the clause and the whole 
> system went belly up.
>
> I get that my bad queries caused Cassandra to require more heap than was 
> available, but here’s what I don’t understand. When the crap hit the fan, the 
> maintenance task died due to a timeout error, but the cluster never 
> recovered. I would have expected that once I was no longer issuing the bad 
> queries, the heap would get cleaned up and life would return to normal. 
> Can anybody help me understand why Cassandra wouldn’t recover? How is it that 
> GC pressure will cause heap to be permanently uncollectable?
>
> This makes me pretty worried. I can fix my code, but I don’t really have 
> control over spikes. If memory pressure spikes, I can tolerate some timeouts 
> and errors, but if it can’t come back when the pressure is gone, that seems 
> pretty bad.
>
> Any insights would be greatly appreciated.
>
> Robert
>
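
On the IN clause specifically: a common workaround is to break a big key
list into many small queries, so no single coordinator has to materialize
one enormous result set on its heap. A minimal sketch in plain Python
(the names `chunk_keys` and `MAX_KEYS_PER_QUERY` are mine, not from any
driver; the chunk size is just an illustrative guess):

```python
# Instead of one SELECT ... WHERE id IN (<1000 keys>), split the key list
# into small slices and issue one query per slice.
MAX_KEYS_PER_QUERY = 20  # illustrative; tune for your data model

def chunk_keys(keys, size=MAX_KEYS_PER_QUERY):
    """Yield successive slices of `keys`, each no longer than `size`."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

keys = list(range(1000))
chunks = list(chunk_keys(keys))
print(len(chunks))                   # 50 slices of 20 keys each
print(max(len(c) for c in chunks))   # 20
```

Each slice can then be bound to its own prepared statement and run
sequentially, or in parallel with your driver's async API, which spreads
the load across coordinators instead of piling it onto one heap.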
