[ https://issues.apache.org/jira/browse/CASSANDRA-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635158#comment-14635158 ]
Łukasz Mrożkiewicz commented on CASSANDRA-9798:
-----------------------------------------------

Thanks, Benedict, for the support.
How could those threads consume all the CPU when top shows more than 90% idle?

> Cassandra seems to have deadlocks during flush operations
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9798
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9798
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 4x HP Gen9 dl 360 servers
> 2x8 cpu each (Intel(R) Xeon E5-2667 v3 @ 3.20GHz)
> 6x900GB 10kRPM disk for data
> 1x900GB 10kRPM disk for commitlog
> 64GB ram
> ETH: 10Gb/s
> Red Hat Enterprise Linux Server release 6.6 (Santiago) 2.6.32-504.el6.x86_64
> java build 1.8.0_45-b14 (openjdk) (tested on oracle java 8 too)
>            Reporter: Łukasz Mrożkiewicz
>             Fix For: 2.1.x
>
>         Attachments: cassandra.2.1.8.log, cassandra.log, cassandra.yaml, cassandra.yaml, gc.log.0.current, stack.txt, topHbn1.txt
>
>
> Hi,
> We noticed a problem with dropped MutationStage messages. Usually, on one random node, the following situation occurs:
> MutationStage: "active" is full, "pending" is increasing, "completed" is stalled.
> MemtableFlushWriter: "active" is 6, "pending" is 25, "completed" is stalled.
> MemtablePostFlush: "active" is 1, "pending" is 29, "completed" is stalled.
> After some time (30s-10min) the pending mutations are dropped and everything works again.
> When it happened:
> 1. CPU idle is ~95%
> 2. No long GC pauses or increased activity.
> 3. Memory usage is 3.5GB of 8GB
> 4. Only writes are processed by Cassandra
> 5. The problems appeared when LOAD > 400GB/node
> 6. Cassandra 2.1.6
> There is a gap in the logs:
> {code}
> INFO  08:47:01 Timed out replaying hints to /192.168.100.83; aborting (0 delivered)
> INFO  08:47:01 Enqueuing flush of hints: 7870567 (0%) on-heap, 0 (0%) off-heap
> INFO  08:47:30 Enqueuing flush of table1: 95301807 (4%) on-heap, 0 (0%) off-heap
> INFO  08:47:31 Enqueuing flush of table1: 60462632 (3%) on-heap, 0 (0%) off-heap
> INFO  08:47:31 Enqueuing flush of table2: 76973746 (4%) on-heap, 0 (0%) off-heap
> INFO  08:47:31 Enqueuing flush of table1: 84290135 (4%) on-heap, 0 (0%) off-heap
> INFO  08:47:32 Enqueuing flush of table3: 56926652 (3%) on-heap, 0 (0%) off-heap
> INFO  08:47:32 Enqueuing flush of table1: 85124218 (4%) on-heap, 0 (0%) off-heap
> INFO  08:47:33 Enqueuing flush of table2: 95663415 (4%) on-heap, 0 (0%) off-heap
> INFO  08:47:58 CompactionManager                 2        39
> INFO  08:47:58 Writing Memtable-table2@1767938721(13843064 serialized bytes, 162359 ops, 4%/0% of on/off-heap limit)
> INFO  08:47:58 Writing Memtable-hints@1433125911(478703 serialized bytes, 424 ops, 0%/0% of on/off-heap limit)
> INFO  08:47:58 Writing Memtable-table2@1318583275(11783615 serialized bytes, 137378 ops, 4%/0% of on/off-heap limit)
> INFO  08:47:58 Enqueuing flush of compactions_in_progress: 969 (0%) on-heap, 0 (0%) off-heap
> INFO  08:47:58 Writing Memtable-table1@541175113(17221327 serialized bytes, 180792 ops, 4%/0% of on/off-heap limit)
> INFO  08:47:58 Writing Memtable-table1@1361154669(27138519 serialized bytes, 273472 ops, 6%/0% of on/off-heap limit)
> INFO  08:48:03 2176 MUTATION messages dropped in last 5000ms
> {code}
> Use case:
> 100% write - 100Mb/s, a couple of CFs with ~10 columns each. Max cell size 100B
> CMS and G1GC tested - no difference



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)