[jira] [Comment Edited] (CASSANDRA-10937) OOM on multiple nodes on write load (v. 3.0.0), problem also present on DSE-4.8.3, but there it survives more time

Jack Krupansky (JIRA) Mon, 18 Jan 2016 09:17:33 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105495#comment-15105495
 ]


Jack Krupansky edited comment on CASSANDRA-10937 at 1/18/16 5:16 PM:
---------------------------------------------------------------------

I still don't see any reason to believe that there is a bug here and that the 
primary issue is that you are overloading the cluster. Sure, Cassandra should 
do a better job of shedding/failing excessive incoming requests, and there is 
an open Jira ticket to add just such a freature, but even with that new 
feature, the net effect will be the same - it will still be up to the 
application and operations to properly size the cluster and throttle 
application load before it gets to Cassandra.

OOM is not typically an indication of a software bug. Sure, sometimes code has 
memory leaks, but with a highly dynamic system such as Cassandra, it typically 
means either a misconfigured JVM or just very heavy load. Sometimes OOM simply 
means that there is a lot of background processing going on (like compactions 
or hinted handoff) that is having trouble keeping up with incoming requests. 
Sometimes OOM occurs because you have too large a heap which defers GC but then 
GC takes too long and further incoming requests simply generate more pressure 
on the heap faster than that massive GC can deal with it.

It is indeed tricky to make sure the JVM has enough heap but not too much. DSE 
typically runs with a larger heap by default. You can try increasing your heap 
to 10 or 12G. But if you make the heap too big, the big GC can bite you as 
described above. In that case, the heap needs to be reduced. Typically you 
don't need a heap smaller than 8 GB. If OOM occurs with a 8 GB heap it 
typically means the load on that node is simply too heavy.

Be sure to review the recommendations in this blog post on reasonable 
recommendations:
http://www.datastax.com/dev/blog/how-not-to-benchmark-cassandra

A few questions that will help us better understand what you are really trying 
to do:

1. How much reading are you doing and when relative to writes?
2. Are you doing any updates or deletes? (Cause compaction, which can fall 
behind your write/update load.)
3. How much data is on the cluster (rows)?
4. How many tables?
5. What RF? RF=3 would be the recommendation, but if you have a heavy read load 
you may need RF=5, although heavy load usually means you just need a lot more 
nodes so that the fraction of incoming requests going to a particular node are 
dramatically reduced. RF>3 is only needed if there is high load for each 
particular row or partition.
6. Have you tested using cassandra-stress? That's the gold standard around here.
7. Are your clients using token-aware routing? (Otherwise a write must be 
bounced from the coordinating node to the node owning the token for the 
partition key.)
8. Are you using batches for your writes? If, so, do all the writes in one 
batch have the same partition key? (If not, adds more network hops.)
9. What expectations did you have as to how many writes/reads a given number of 
nodes should be able to handle?



was (Author: jkrupan):
I still don't see any reason to believe that there is a bug here and that the 
primary issue is that you are overloading the cluster. Sure, Cassandra should 
do a better job of shedding/failing excessive incoming requests, and there is 
an open Jira ticket to add just such a freature, but even with that new 
feature, the net effect will be the same - it will still be up to the 
application and operations to properly size the cluster and throttle 
application load before it gets to Cassandra.

OOM is not typically an indication of a software bug. Sure, sometimes code has 
memory leaks, but with a highly dynamic system such as Cassandra, it typically 
means either a misconfigured JVM or just very heavy load. Sometimes OOM simply 
means that there is a lot of background processing going on (like compactions 
or hinted handoff) that is having trouble keeping up with incoming requests. 
Sometimes OOM occurs because you have too large a heap which defers GC but then 
GC takes too long and further incoming requests simply generate more pressure 
on the heap faster than that massive GC can deal with it.

It is indeed tricky to make sure the JVM has enough heap but not too much. DSE 
typically runs with a larger heap by default. You can try increasing your heap 
to 10 or 12G. But if you make the heap too big, the big GC can bite you as 
described above. In that case, the heap needs to be reduced. Typically you 
don't need a heap smaller than 8 GB. If OOM occurs with a 8 GB heap it 
typically means the load on that node is simply too heavy.

Be sure to review the recommendations in this blog post on reasonable 
recommendations:
http://www.datastax.com/dev/blog/how-not-to-benchmark-cassandra

A few questions that will help us better understand what you are really trying 
to do:

1. How much reading are you doing and when relative to writes?
2. Are you doing any updates or deletes? (Cause compaction, which can fall 
behind your write/update load.)
3. How much data is on the cluster (rows)?
4. How many tables?
5. What RF? RF=3 would be the recommendation, but if you have a heavy read load 
you may need RF=5.
6. Have you tested using cassandra-stress? That's the gold standard around here.
7. Are your clients using token-aware routing? (Otherwise a write must be 
bounced from the coordinating node to the node owning the token for the 
partition key.)
8. Are you using batches for your writes? If, so, do all the writes in one 
batch have the same partition key? (If not, adds more network hops.)


> OOM on multiple nodes on write load (v. 3.0.0), problem also present on 
> DSE-4.8.3, but there it survives more time
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10937
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10937
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra : 3.0.0
> Installed as open archive, no connection to any OS specific installer.
> Java:
> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
> OS :
> Linux version 2.6.32-431.el6.x86_64 
> (mockbu...@x86-023.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
> Hat 4.4.7-4) (GCC) ) #1 SMP Sun Nov 10 22:19:54 EST 2013
> We have:
> 8 guests ( Linux OS as above) on 2 (VMWare managed) physical hosts. Each 
> physical host keeps 4 guests.
> Physical host parameters(shared by all 4 guests):
> Model: HP ProLiant DL380 Gen9
> Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
> 46 logical processors.
> Hyperthreading - enabled
> Each guest assigned to have:
> 1 disk 300 Gb for seq. log (NOT SSD)
> 1 disk 4T for data (NOT SSD)
> 11 CPU cores
> Disks are local, not shared.
> Memory on each host -  24 Gb total.
> 8 (or 6, tested both) Gb - cassandra heap
> (lshw and cpuinfo attached in file test2.rar)
>            Reporter: Peter Kovgan
>            Priority: Critical
>         Attachments: gc-stat.txt, more-logs.rar, some-heap-stats.rar, 
> test2.rar, test3.rar, test4.rar, test5.rar, test_2.1.rar, 
> test_2.1_logs_older.rar, test_2.1_restart_attempt_log.rar
>
>
> 8 cassandra nodes.
> Load test started with 4 clients(different and not equal machines), each 
> running 1000 threads.
> Each thread assigned in round-robin way to run one of 4 different inserts. 
> Consistency->ONE.
> I attach the full CQL schema of tables and the query of insert.
> Replication factor - 2:
> create keyspace OBLREPOSITORY_NY with replication = 
> {'class':'NetworkTopologyStrategy','NY':2};
> Initiall throughput is:
> 215.000  inserts /sec
> or
> 54Mb/sec, considering single insert size a bit larger than 256byte.
> Data:
> all fields(5-6) are short strings, except one is BLOB of 256 bytes.
> After about a 2-3 hours of work, I was forced to increase timeout from 2000 
> to 5000ms, for some requests failed for short timeout.
> Later on(after aprox. 12 hous of work) OOM happens on multiple nodes.
> (all failed nodes logs attached)
> I attach also java load client and instructions how set-up and use 
> it.(test2.rar)
> Update:
> Later on test repeated with lesser load (100000 mes/sec) with more relaxed 
> CPU (idle 25%), with only 2 test clients, but anyway test failed.
> Update:
> DSE-4.8.3 also failed on OOM (3 nodes from 8), but here it survived 48 hours, 
> not 10-12.
> Attachments:
> test2.rar -contains most of material
> more-logs.rar - contains additional nodes logs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (CASSANDRA-10937) OOM on multiple nodes on write load (v. 3.0.0), problem also present on DSE-4.8.3, but there it survives more time

Reply via email to