[ https://issues.apache.org/jira/browse/CASSANDRA-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633494#comment-14633494 ]
Łukasz Mrożkiewicz commented on CASSANDRA-9798:
-----------------------------------------------

Hi Benedict,

I've updated C* to 2.1.8 and ran the test at 50Mb/s (same result as at 200Mb/s, no difference). I also suspected this problem was caused by hardware saturation, but I can't find any metric that shows resource pressure:

CPU: 92-98% idle.
JVM memory: the problem also occurred when only 3GB of the 8GB heap was used.

iostat shows there is no write queue on the disks (this is a typical view of iostat):

{code}
Device: rrqm/s wrqm/s  r/s  w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00  0.00
sda       0.00   5.00 0.00 1.00  0.00  0.02    48.00     0.00  3.00  3.00  0.30
sdc       0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00  0.00
sdd       0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00  0.00
sde       0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00  0.00
sdf       0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00  0.00
sdg       0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00  0.00
sdh       0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00  0.00
sdi       0.00   0.00 0.00 0.00  0.00  0.00     0.00     0.00  0.00  0.00  0.00
{code}

nodetool tpstats shows that MemtablePostFlush pending tasks are increasing while completed is constant:

{code}
MemtableFlushWriter     6  29  37  0  0
MemtablePostFlush       1  33  47  0  0
MemtableReclaimMemory   0   0  53  0  0
{code}

I noticed one new thing: on one node the dropped MutationStage count is ~200k (and it has been stable for an hour), but the pending MutationStage count is continuously increasing: it is now at 850k. Nothing appeared in the logs:

{code}
INFO 11:21:15 Enqueuing flush of hints: 111288602 (5%) on-heap, 0 (0%) off-heap
WARN 11:40:40 Batch of prepared statements for [] is of size 6102, exceeding specified threshold of 5120 by 982.
WARN 11:47:16 Batch of prepared statements for [] is of size 5640, exceeding specified threshold of 5120 by 520.
WARN 11:50:22 Batch of prepared statements for [] is of size 5481, exceeding specified threshold of 5120 by 361.
{code}

It is now 12:12, and the log has only 4 lines in the last 50 minutes. All resources are idle.
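For anyone watching for the same stall, the pending counts above can be pulled out of the tpstats output mechanically rather than by eye. A minimal sketch (the column layout assumed here is the 2.1-era one shown above: pool name, active, pending, completed, ...; the tpstats output is stubbed with a heredoc carrying the figures from this report, and in practice you would pipe `nodetool tpstats` in instead):

```shell
# Print every thread pool whose pending column (field 3) is non-zero.
# A pool whose pending count grows while completed stays constant across
# repeated samples is the stall signature described above.
awk 'NF >= 6 && $3 ~ /^[0-9]+$/ && $3 > 0 { print $1, "pending:", $3 }' <<'EOF'
MemtableFlushWriter 6 29 37 0 0
MemtablePostFlush 1 33 47 0 0
MemtableReclaimMemory 0 0 53 0 0
EOF
```

With the stubbed figures this reports MemtableFlushWriter (pending 29) and MemtablePostFlush (pending 33) and skips MemtableReclaimMemory, whose pending count is 0; running it in a loop (e.g. under `watch`) against live `nodetool tpstats` output shows whether the pending numbers are climbing.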
> Cassandra seems to have deadlocks during flush operations
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9798
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9798
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 4x HP Gen9 DL360 servers
> 2x8 CPUs each (Intel(R) Xeon E5-2667 v3 @ 3.20GHz)
> 6x900GB 10kRPM disks for data
> 1x900GB 10kRPM disk for commitlog
> 64GB RAM
> ETH: 10Gb/s
> Red Hat Enterprise Linux Server release 6.6 (Santiago) 2.6.32-504.el6.x86_64
> java build 1.8.0_45-b14 (OpenJDK) (tested on Oracle Java 8 too)
>            Reporter: Łukasz Mrożkiewicz
>             Fix For: 2.1.x
>
>         Attachments: cassandra.2.1.8.log, cassandra.log, cassandra.yaml, gc.log.0.current
>
>
> Hi,
> We noticed a problem with dropped MutationStages. Usually, on one random node, the following situation occurs:
> MutationStage: "active" is full, "pending" is increasing, "completed" is stalled.
> MemtableFlushWriter: "active" is 6, "pending" is 25, "completed" is stalled.
> MemtablePostFlush: "active" is 1, "pending" is 29, "completed" is stalled.
> After some time (30s-10min) the pending mutations are dropped and everything works again.
> When it happened:
> 1. CPU idle is ~95%.
> 2. No long GC pauses or elevated GC activity.
> 3. Memory usage is 3.5GB of the 8GB heap.
> 4. Only writes are being processed by Cassandra.
> 5. Problems appeared when LOAD > 400GB/node.
> 6. Cassandra 2.1.6.
> There is a gap in the logs:
> {code}
> INFO 08:47:01 Timed out replaying hints to /192.168.100.83; aborting (0 delivered)
> INFO 08:47:01 Enqueuing flush of hints: 7870567 (0%) on-heap, 0 (0%) off-heap
> INFO 08:47:30 Enqueuing flush of table1: 95301807 (4%) on-heap, 0 (0%) off-heap
> INFO 08:47:31 Enqueuing flush of table1: 60462632 (3%) on-heap, 0 (0%) off-heap
> INFO 08:47:31 Enqueuing flush of table2: 76973746 (4%) on-heap, 0 (0%) off-heap
> INFO 08:47:31 Enqueuing flush of table1: 84290135 (4%) on-heap, 0 (0%) off-heap
> INFO 08:47:32 Enqueuing flush of table3: 56926652 (3%) on-heap, 0 (0%) off-heap
> INFO 08:47:32 Enqueuing flush of table1: 85124218 (4%) on-heap, 0 (0%) off-heap
> INFO 08:47:33 Enqueuing flush of table2: 95663415 (4%) on-heap, 0 (0%) off-heap
> INFO 08:47:58 CompactionManager 2 39
> INFO 08:47:58 Writing Memtable-table2@1767938721(13843064 serialized bytes, 162359 ops, 4%/0% of on/off-heap limit)
> INFO 08:47:58 Writing Memtable-hints@1433125911(478703 serialized bytes, 424 ops, 0%/0% of on/off-heap limit)
> INFO 08:47:58 Writing Memtable-table2@1318583275(11783615 serialized bytes, 137378 ops, 4%/0% of on/off-heap limit)
> INFO 08:47:58 Enqueuing flush of compactions_in_progress: 969 (0%) on-heap, 0 (0%) off-heap
> INFO 08:47:58 Writing Memtable-table1@541175113(17221327 serialized bytes, 180792 ops, 4%/0% of on/off-heap limit)
> INFO 08:47:58 Writing Memtable-table1@1361154669(27138519 serialized bytes, 273472 ops, 6%/0% of on/off-heap limit)
> INFO 08:48:03 2176 MUTATION messages dropped in last 5000ms
> {code}
> Use case:
> 100% writes at 100Mb/s; a couple of CFs with ~10 columns each; max cell size 100B.
> Both CMS and G1GC were tested - no difference.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)