[jira] [Updated] (CASSANDRA-10937) OOM on multiple nodes on write load (v. 3.0.0), problem absent on DSE-4.8.3, so it is a bug!

Peter Kovgan (JIRA) Mon, 28 Dec 2015 22:01:49 -0800

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Peter Kovgan updated CASSANDRA-10937:
-------------------------------------
    Description: 
8 Cassandra nodes.

The load test was started with 4 clients (different, non-identical machines), 
each running 1000 threads.
Each thread is assigned round-robin to run one of 4 different inserts. 
Consistency level: ONE.

I attach the full CQL schema of the tables and the insert queries.

Replication factor - 2:
create keyspace OBLREPOSITORY_NY with replication = 
{'class':'NetworkTopologyStrategy','NY':2};

Initial throughput is:
215,000 inserts/sec
or
54 MB/sec, assuming a single insert is slightly larger than 256 bytes.
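
As a rough cross-check of the figures above, the byte rate is the insert rate multiplied by the payload size. This is only a sketch: it assumes exactly 256 bytes per insert (the BLOB alone; the short string fields add a little more, which the report does not quantify), and the function name is illustrative.

```python
def payload_rate(inserts_per_sec: int, bytes_per_insert: int) -> tuple[float, float]:
    """Return the client-side payload rate as (decimal MB/sec, binary MiB/sec)."""
    bytes_per_sec = inserts_per_sec * bytes_per_insert
    return bytes_per_sec / 1_000_000, bytes_per_sec / (1024 * 1024)

# 215,000 inserts/sec with an assumed 256-byte payload
mb, mib = payload_rate(215_000, 256)
print(f"{mb:.2f} MB/sec, {mib:.2f} MiB/sec")
```

Under this assumption the result is about 55 MB/sec in decimal units and about 52.5 MiB/sec in binary units, which is in line with the reported 54 MB/sec.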

Data:
all fields (5-6) are short strings, except one, which is a BLOB of 256 bytes.

After about 2-3 hours of running, I was forced to increase the timeout from 
2000 to 5000 ms, because some requests failed with the shorter timeout.

Later on (after approx. 12 hours of running) OOM happens on multiple nodes.
(logs of all failed nodes attached)

I also attach the Java load client and instructions on how to set it up and 
use it (test2.rar).

Update:

Later the test was repeated with a lighter load (100,000 messages/sec) and a 
more relaxed CPU (25% idle), with only 2 test clients, but the test failed anyway.

Finally (29/12/15), I tested DSE-4.8.3 with a load of 100,000 messages/sec and 
it survived. The installation pattern was the same.

I think this is a bug, because the OOM happens at a later stage, when the 
system has run for 10 hours and the accumulated data on each node is about 
250 GB. The problem definitely grows with time.
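
A back-of-the-envelope estimate of the data volume per node can be compared with the figure above. The sketch assumes 256 bytes per insert, replication factor 2, even distribution across the 8 nodes, and no compression, compaction or per-row storage overhead. These simplifications are assumptions, not measurements from the report, and the function name is hypothetical.

```python
def per_node_gb(inserts_per_sec: int, bytes_per_insert: int, hours: float,
                replication_factor: int, nodes: int) -> float:
    """Estimate raw payload stored per node, in decimal GB."""
    seconds = hours * 3600
    total_bytes = inserts_per_sec * bytes_per_insert * seconds * replication_factor
    return total_bytes / nodes / 1e9

# The two load levels mentioned in the report, over a 10-hour run
for rate in (100_000, 215_000):
    print(rate, round(per_node_gb(rate, 256, 10, 2, 8), 1))
```

Under these assumptions the 100,000 messages/sec load gives roughly 230 GB per node after 10 hours, the same order as the reported figure of about 250 GB. The 215,000 inserts/sec load would give roughly 495 GB.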


Attachments:
test2.rar - contains most of the material
more-logs.rar - contains logs from additional nodes






  was:
8 Cassandra nodes.

The load test was started with 4 clients (different, non-identical machines), 
each running 1000 threads.
Each thread is assigned round-robin to run one of 4 different inserts. 
Consistency level: ONE.

I attach the full CQL schema of the tables and the insert queries.

Replication factor - 2:
create keyspace OBLREPOSITORY_NY with replication = 
{'class':'NetworkTopologyStrategy','NY':2};

Initial throughput is:
215,000 inserts/sec
or
54 MB/sec, assuming a single insert is slightly larger than 256 bytes.

Data:
all fields (5-6) are short strings, except one, which is a BLOB of 256 bytes.

After about 2-3 hours of running, I was forced to increase the timeout from 
2000 to 5000 ms, because some requests failed with the shorter timeout.

Later on (after approx. 12 hours of running) OOM happens on multiple nodes.
(logs of all failed nodes attached)

I also attach the Java load client and instructions on how to set it up and 
use it.

The test is important for our strategic project and we hope the problem is 
fixable.

Attachments:
test2.rar - contains most of the material
more-logs.rar - contains logs from additional nodes





> OOM on multiple nodes on write load (v. 3.0.0), problem absent on DSE-4.8.3, 
> so it is a bug!
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10937
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10937
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra : 3.0.0
> Installed as open archive, no connection to any OS specific installer.
> Java:
> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
> OS :
> Linux version 2.6.32-431.el6.x86_64 
> (mockbu...@x86-023.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
> Hat 4.4.7-4) (GCC) ) #1 SMP Sun Nov 10 22:19:54 EST 2013
> We have:
> 8 guests ( Linux OS as above) on 2 (VMWare managed) physical hosts. Each 
> physical host keeps 4 guests.
> Physical host parameters(shared by all 4 guests):
> Model: HP ProLiant DL380 Gen9
> Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
> 46 logical processors.
> Hyperthreading - enabled
> Each guest is assigned:
> 1 disk, 300 GB, for seq. log (NOT SSD)
> 1 disk, 4 TB, for data (NOT SSD)
> 11 CPU cores
> Disks are local, not shared.
> Memory on each host - 24 GB total.
> 8 (or 6, tested both) GB - Cassandra heap
> (lshw and cpuinfo attached in file test2.rar)
>            Reporter: Peter Kovgan
>            Priority: Critical
>         Attachments: gc-stat.txt, more-logs.rar, some-heap-stats.rar, 
> test2.rar, test3.rar, test4.rar
>
>
> 8 Cassandra nodes.
> The load test was started with 4 clients (different, non-identical 
> machines), each running 1000 threads.
> Each thread is assigned round-robin to run one of 4 different inserts. 
> Consistency level: ONE.
> I attach the full CQL schema of the tables and the insert queries.
> Replication factor - 2:
> create keyspace OBLREPOSITORY_NY with replication = 
> {'class':'NetworkTopologyStrategy','NY':2};
> Initial throughput is:
> 215,000 inserts/sec
> or
> 54 MB/sec, assuming a single insert is slightly larger than 256 bytes.
> Data:
> all fields (5-6) are short strings, except one, which is a BLOB of 256 bytes.
> After about 2-3 hours of running, I was forced to increase the timeout from 
> 2000 to 5000 ms, because some requests failed with the shorter timeout.
> Later on (after approx. 12 hours of running) OOM happens on multiple nodes.
> (logs of all failed nodes attached)
> I also attach the Java load client and instructions on how to set it up and 
> use it (test2.rar).
> Update:
> Later the test was repeated with a lighter load (100,000 messages/sec) and a 
> more relaxed CPU (25% idle), with only 2 test clients, but the test failed 
> anyway.
> Finally (29/12/15), I tested DSE-4.8.3 with a load of 100,000 messages/sec 
> and it survived. The installation pattern was the same.
> I think this is a bug, because the OOM happens at a later stage, when the 
> system has run for 10 hours and the accumulated data on each node is about 
> 250 GB. The problem definitely grows with time.
> Attachments:
> test2.rar - contains most of the material
> more-logs.rar - contains logs from additional nodes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
