[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375615#comment-15375615 ]
Romain Hardouin commented on CASSANDRA-11363:
---------------------------------------------

I also see this behavior after upgrading to 2.1.14 (from 2.0.17); this metric was 0 before the upgrade. I increased the native transport max threads to 256, but blocked NTR stayed at 0.6%. So far the most effective change has been moving memtables off-heap (less GC time => less load => fewer all-time-blocked NTR), but I still see up to 0.3% all time blocked...

> High Blocked NTR When Connecting
> --------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>            Assignee: Paulo Motta
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the machine load
> increases to very high levels (> 120 on an 8-core machine) and native transport requests
> get blocked in tpstats.
> I was able to reproduce this with both CMS and G1GC, and on both JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the total number of
> requests being processed at a given time (the latter increases with the former in our
> system).
> Currently there are between 600 and 800 client connections on each machine, and each
> machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a viable option
> cluster-wide.
> Here is the output from tpstats:
> {code}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> MutationStage                     0         8        8387821         0                 0
> ReadStage                         0         0         355860         0                 0
> RequestResponseStage              0         7        2532457         0                 0
> ReadRepairStage                   0         0            150         0                 0
> CounterMutationStage             32       104         897560         0                 0
> MiscStage                         0         0              0         0                 0
> HintedHandoff                     0         0             65         0                 0
> GossipStage                       0         0           2338         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> InternalResponseStage             0         0              0         0                 0
> CommitLogArchiver                 0         0              0         0                 0
> CompactionExecutor                2       190            474         0                 0
> ValidationExecutor                0         0              0         0                 0
> MigrationStage                    0         0             10         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> PendingRangeCalculator            0         0            310         0                 0
> Sampler                           0         0              0         0                 0
> MemtableFlushWriter               1        10             94         0                 0
> MemtablePostFlush                 1        34            257         0                 0
> MemtableReclaimMemory             0         0             94         0                 0
> Native-Transport-Requests       128       156         387957        16            278451
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place, the load on
> the machine went back to healthy, and when the flight recording finished the load went
> back to > 100.
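For reference, the "all time blocked" percentages quoted in the comment above (0.6%, 0.3%) can be derived from the Native-Transport-Requests line of {{nodetool tpstats}}. The sketch below is illustrative only: it assumes the ratio is computed as all-time-blocked / (completed + all-time-blocked), which may not be exactly how Romain computed his figures. The two mitigations he mentions presumably map to the 2.1 cassandra.yaml options {{native_transport_max_threads}} and {{memtable_allocation_type}} (e.g. {{offheap_objects}}).

{code}
#!/usr/bin/env python
# Minimal sketch (not from the ticket): compute the all-time-blocked ratio
# for the Native-Transport-Requests pool from `nodetool tpstats` output.
import subprocess

def ntr_blocked_pct(tpstats_text):
    for line in tpstats_text.splitlines():
        if line.strip().startswith("Native-Transport-Requests"):
            # Columns: pool name, Active, Pending, Completed, Blocked, All time blocked
            fields = line.split()
            completed = int(fields[3])
            all_time_blocked = int(fields[5])
            # Assumed formula: blocked requests as a share of everything that hit the pool
            return 100.0 * all_time_blocked / (completed + all_time_blocked)
    return None

if __name__ == "__main__":
    output = subprocess.check_output(["nodetool", "tpstats"]).decode("utf-8")
    pct = ntr_blocked_pct(output)
    if pct is None:
        print("Native-Transport-Requests pool not found in tpstats output")
    else:
        print("Native-Transport-Requests all time blocked: %.2f%%" % pct)
{code}

Applied to the tpstats output pasted above (278451 blocked vs. 387957 completed), this formula gives roughly 42%, far beyond the sub-1% figures in the comment, which underlines how severe the reporter's case is.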