[ 
https://issues.apache.org/jira/browse/CASSANDRA-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633246#comment-14633246
 ] 

Benedict commented on CASSANDRA-9798:
-------------------------------------

Hmm. Looking at your notes, you mention that this began happening once your 
data per node exceeded a certain size. I suspect you are hitting disk 
performance problems that slow down memtable flushes or commit log writes. If 
this is the case, there's very little that can be done about it besides buying 
more hardware. There's no other reason for this to occur in C* only above a 
certain utilisation threshold.

If you can rule that out, we can start capturing thread dumps of the process 
to see whether something problematic is happening. The latest version of 
Cassandra (2.1.8) has improved logging in this area, which would also help 
confirm whether this is indeed the situation.
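
A minimal sketch of how such thread dumps could be captured and summarised, 
assuming the JDK's jstack tool is on the PATH and is run as the same user as 
the Cassandra process. The function names, file naming and defaults below are 
illustrative assumptions, not part of Cassandra:

```python
import re
import subprocess
import time
from collections import Counter
from pathlib import Path

# jstack prints one "java.lang.Thread.State: <STATE>" line per Java thread.
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def capture_thread_dumps(pid, count=5, interval_s=10.0, out_dir=".", cmd=("jstack",)):
    """Run `cmd <pid>` `count` times, `interval_s` seconds apart.

    Each dump is written to its own file; the list of files is returned.
    `cmd` is a parameter only so another dump tool can be swapped in.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        result = subprocess.run(
            [*cmd, str(pid)], capture_output=True, text=True, check=True
        )
        path = out / f"threaddump-{pid}-{i:03d}.txt"
        path.write_text(result.stdout)
        written.append(path)
        if i < count - 1:
            time.sleep(interval_s)
    return written


def count_thread_states(dump_text):
    """Count threads per state (BLOCKED, WAITING, RUNNABLE, ...) in one dump."""
    return Counter(STATE_RE.findall(dump_text))
```

Calling capture_thread_dumps(pid) with the Cassandra process id while the 
stall is in progress, then comparing count_thread_states() across the dumps, 
should show whether the flush threads stay BLOCKED or WAITING on the same 
frames between samples.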

> Cassandra seems to have deadlocks during flush operations
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9798
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9798
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 4x HP Gen9 dl 360 servers
> 2x8 cpu each (Intel(R) Xeon E5-2667 v3 @ 3.20GHz)
> 6x900GB 10kRPM disk for data
> 1x900GB 10kRPM disk for commitlog
> 64GB ram
> ETH: 10Gb/s
> Red Hat Enterprise Linux Server release 6.6 (Santiago) 2.6.32-504.el6.x86_64
> java build 1.8.0_45-b14 (openjdk) (tested on oracle java 8 too)
>            Reporter: Łukasz Mrożkiewicz
>             Fix For: 2.1.x
>
>         Attachments: cassandra.log, cassandra.yaml, gc.log.0.current
>
>
> Hi,
> We noticed a problem with dropped mutations in the MutationStage. Usually, on 
> one random node, the following happens:
> MutationStage: "active" is full, "pending" is increasing, "completed" is 
> stalled.
> MemtableFlushWriter: "active" is 6, "pending" is 25, "completed" is stalled.
> MemtablePostFlush: "active" is 1, "pending" is 29, "completed" is stalled.
> After some time (30s-10min) the pending mutations are dropped and everything 
> works again.
> When it happened:
> 1. CPU idle is ~95%
> 2. no long GC pauses or increased GC activity
> 3. memory usage is 3.5GB of 8GB
> 4. only writes are processed by Cassandra
> 5. the problems appeared when LOAD > 400GB/node
> 6. Cassandra 2.1.6
> There is a gap in the logs:
> {code}
> INFO  08:47:01 Timed out replaying hints to /192.168.100.83; aborting (0 delivered)
> INFO  08:47:01 Enqueuing flush of hints: 7870567 (0%) on-heap, 0 (0%) off-heap
> INFO  08:47:30 Enqueuing flush of table1: 95301807 (4%) on-heap, 0 (0%) off-heap
> INFO  08:47:31 Enqueuing flush of table1: 60462632 (3%) on-heap, 0 (0%) off-heap
> INFO  08:47:31 Enqueuing flush of table2: 76973746 (4%) on-heap, 0 (0%) off-heap
> INFO  08:47:31 Enqueuing flush of table1: 84290135 (4%) on-heap, 0 (0%) off-heap
> INFO  08:47:32 Enqueuing flush of table3: 56926652 (3%) on-heap, 0 (0%) off-heap
> INFO  08:47:32 Enqueuing flush of table1: 85124218 (4%) on-heap, 0 (0%) off-heap
> INFO  08:47:33 Enqueuing flush of table2: 95663415 (4%) on-heap, 0 (0%) off-heap
> INFO  08:47:58 CompactionManager                 2        39
> INFO  08:47:58 Writing Memtable-table2@1767938721(13843064 serialized bytes, 162359 ops, 4%/0% of on/off-heap limit)
> INFO  08:47:58 Writing Memtable-hints@1433125911(478703 serialized bytes, 424 ops, 0%/0% of on/off-heap limit)
> INFO  08:47:58 Writing Memtable-table2@1318583275(11783615 serialized bytes, 137378 ops, 4%/0% of on/off-heap limit)
> INFO  08:47:58 Enqueuing flush of compactions_in_progress: 969 (0%) on-heap, 0 (0%) off-heap
> INFO  08:47:58 Writing Memtable-table1@541175113(17221327 serialized bytes, 180792 ops, 4%/0% of on/off-heap limit)
> INFO  08:47:58 Writing Memtable-table1@1361154669(27138519 serialized bytes, 273472 ops, 6%/0% of on/off-heap limit)
> INFO  08:48:03 2176 MUTATION messages dropped in last 5000ms
> {code}
> Use case:
> 100% writes at 100Mb/s, a couple of CFs with ~10 columns each, max cell size 
> 100B.
> CMS and G1GC both tested: no difference.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
