[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414913#comment-15414913 ]
Benedict commented on CASSANDRA-11363:
--------------------------------------

This blocking behaviour and default queue limit were carried forward from the prior code, so I'm afraid I don't have any insights. It may be that the increased baseline performance of 2.1 permits worse outlier states to accumulate if the user exploits it.

The old code was using the jboss MemoryAwareExecutorService, but estimated the size of each request as 1. A value of 128 does seem very small for users performing very small operations, but conversely a few large reads could destroy the box, so we will have complaints whatever we pick. Perhaps configuring this parameter should be explicitly called out in whatever best-practices docs we have.

Ideally, this limit would be removed entirely and better dynamic constraints applied - I think we already have some tickets for keeping the number of requests at a coordinator constrained. If that were dealt with (for all request types), this limit could be removed entirely.

> High Blocked NTR When Connecting
> --------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>            Assignee: T Jake Luciani
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack, max_queued_ntr_property.txt, thread-queue-2.1.txt
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the machine load increases to very high levels (> 120 on an 8-core machine) and native transport requests get blocked in tpstats.
> I was able to reproduce this with both CMS and G1GC, as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the number of total requests being processed at a given time (as the latter increases with the former in our system).
> Currently there are between 600 and 800 client connections on each machine, and each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a viable option cluster-wide.
> Here is the output from tpstats:
> {code}
> Pool Name                    Active   Pending   Completed   Blocked   All time blocked
> MutationStage                     0         8     8387821         0                  0
> ReadStage                         0         0      355860         0                  0
> RequestResponseStage              0         7     2532457         0                  0
> ReadRepairStage                   0         0         150         0                  0
> CounterMutationStage             32       104      897560         0                  0
> MiscStage                         0         0           0         0                  0
> HintedHandoff                     0         0          65         0                  0
> GossipStage                       0         0        2338         0                  0
> CacheCleanupExecutor              0         0           0         0                  0
> InternalResponseStage             0         0           0         0                  0
> CommitLogArchiver                 0         0           0         0                  0
> CompactionExecutor                2       190         474         0                  0
> ValidationExecutor                0         0           0         0                  0
> MigrationStage                    0         0          10         0                  0
> AntiEntropyStage                  0         0           0         0                  0
> PendingRangeCalculator            0         0         310         0                  0
> Sampler                           0         0           0         0                  0
> MemtableFlushWriter               1        10          94         0                  0
> MemtablePostFlush                 1        34         257         0                  0
> MemtableReclaimMemory             0         0          94         0                  0
> Native-Transport-Requests       128       156      387957        16             278451
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place, the load on the machine went back to healthy, and when the flight recording finished the load went back to > 100.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
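
The queueing behaviour discussed above - a fixed-size queue (default 128) where submitting threads block when it fills, which is what surfaces as "Blocked" Native-Transport-Requests in tpstats - can be illustrated with a minimal Java sketch. This is not Cassandra's actual executor; the class and names (`BlockingNtrSketch`, `newBoundedExecutor`, `MAX_QUEUED`) are hypothetical, and a caller-blocks rejection handler is assumed as a stand-in for the real implementation:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class BlockingNtrSketch {
    // Hypothetical stand-in for the queue limit discussed above (default 128).
    static final int MAX_QUEUED = 128;

    public static ExecutorService newBoundedExecutor(int threads) {
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(MAX_QUEUED);
        // When the queue is full, the submitting (caller) thread blocks until
        // space frees up -- this blocking is what tpstats counts as "Blocked".
        RejectedExecutionHandler blockOnFull = (task, executor) -> {
            try {
                executor.getQueue().put(task); // blocks instead of dropping
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RejectedExecutionException("interrupted while blocked", e);
            }
        };
        return new ThreadPoolExecutor(threads, threads,
                0L, TimeUnit.MILLISECONDS, queue, blockOnFull);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService exec = newBoundedExecutor(4);
        AtomicInteger done = new AtomicInteger();
        // Submit far more tasks than the queue can hold: no task is dropped,
        // the submitter simply stalls whenever 128 tasks are already queued.
        for (int i = 0; i < 1000; i++)
            exec.execute(done::incrementAndGet);
        exec.shutdown();
        exec.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(done.get()); // prints 1000
    }
}
```

The trade-off Benedict describes follows directly from this shape: with every request "costing" 1 queue slot regardless of size, 128 is tight for many tiny operations, yet a handful of very large reads admitted by a bigger limit could exhaust the heap.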