On Fri, Mar 3, 2017 at 11:18 AM, Shravan Ch <chall...@outlook.com> wrote:
> More than 30 Cassandra servers in the primary DC went down with the OOM
> exception below. What puzzles me is the scale at which it happened (all in
> the same minute). I will share some more details below.

You'd be surprised; when it's the result of aberrant data or workload,
having many nodes OOM at once is more common than you might think.

> System Log: http://pastebin.com/iPeYrWVR

The traceback shows the OOM occurring during a read (a slice), not a
write. What do your data model and queries look like? Do you do
deletes (TTLs, maybe)? Did the OOM result in a heap dump?
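If there's no dump, it's worth arranging one before this recurs. A minimal
sketch, assuming the stock cassandra-env.sh layout (the dump path is just an
example; point it at a volume with enough free space):

    # cassandra-env.sh -- illustrative additions
    JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
    # directory is an assumption; the JVM writes java_pid<pid>.hprof there
    JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=/var/lib/cassandra"

A dump captured at the moment of the OOM will show exactly which objects
filled the heap, which beats guessing from the GC log.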

> GC Log: http://pastebin.com/CzNNGs0r
>
> During the OOM I saw a lot of WARNings like the one below (these had been
> appearing for quite some time, maybe weeks):
>
> WARN  [SharedPool-Worker-81] 2017-03-01 19:55:41,209 BatchStatement.java:252
> - Batch of prepared statements for [keyspace.table] is of size 225455,
> exceeding specified threshold of 65536 by 159919.
>
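That batch warning deserves attention in its own right: a logged batch is
materialized whole on the coordinator, so a steady stream of ~220 KB batches
is a classic way to put pressure on the heap. The threshold in the message
comes from cassandra.yaml; shown here only to identify the knob (65536 bytes
suggests yours is already raised to 64), since the real fix is to split the
batches client-side:

    # cassandra.yaml (2.1.x) -- batch size warning threshold, in KB
    batch_size_warn_threshold_in_kb: 64

Logged batches buy atomicity, not throughput; for bulk writes, individual
asynchronous statements (or batches scoped to a single partition) are much
kinder to the coordinator.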
> Environment:
> We are using Apache Cassandra 2.1.9 on a multi-DC cluster: a primary DC
> (more C* nodes, on SSDs, where the apps run) and a secondary DC
> (geographically remote, more like a DR for the primary) on SAS drives.
> Cassandra config:
>
> Java 1.8.0_65
> Garbage Collector: G1GC
> memtable_allocation_type: offheap_objects
>
> Since this OOM I am seeing a huge pile-up of hints on the majority of the
> nodes, and the pending hints keep going up. I increased the hinted handoff
> core threads to 6, but that did not help (I admit I only tried this on one
> node).
>
> nodetool compactionstats -H
> pending tasks: 3
>    compaction type     keyspace     table    completed      total   unit   progress
>         Compaction       system     hints      28.5 GB   92.38 GB  bytes     30.85%
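On the hints themselves: raising the delivery thread count on one node will
not make a dent in a backlog this size. For reference, the relevant
cassandra.yaml knobs (2.1 defaults shown, as I recall them):

    # cassandra.yaml (2.1.x) -- hint delivery knobs, defaults
    hinted_handoff_throttle_in_kb: 1024
    max_hints_delivery_threads: 2

    # if the backlog is unrecoverable, you can drop it and repair instead:
    nodetool truncatehints
    nodetool repair

With 92 GB of hints compacting in system.hints, truncating and running a
repair may well be faster than letting delivery drain naturally, but that
is a call to make against your consistency requirements.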



-- 
Eric Evans
john.eric.ev...@gmail.com
