[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409734#comment-15409734 ]
Romain Hardouin edited comment on CASSANDRA-11363 at 8/9/16 12:37 PM:
----------------------------------------------------------------------

I see a lower blocked-NTR percentage with 1024 max queued requests. I attached {{max_queued_ntr_property.txt}} to set this value in {{cassandra-env.sh}}, and it turns out that 1536 was a good value in my case: I don't see any blocked NTR so far. That said, it's just a workaround because the root cause might be elsewhere. Anyway, I think it's better to have a property to set this value instead of a hard-coded number. WDYT?

UPDATE: I had to increase the value up to {{-Dcassandra.max_queued_native_transport_requests=3072}} on the other DC (same cluster) in order to see 0 blocked NTR.


> High Blocked NTR When Connecting
> --------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>            Assignee: Paulo Motta
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack, max_queued_ntr_property.txt, thread-queue-2.1.txt
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the machine load increases to very high levels (> 120 on an 8-core machine) and native transport requests get blocked in tpstats.
> I was able to reproduce this with both CMS and G1GC, as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the number of total requests being processed at a given time (as the latter increases with the former in our system).
> Currently there are between 600 and 800 client connections on each machine, and each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a viable option cluster-wide.
> Here is the output from tpstats:
> {code}
> Pool Name                     Active   Pending   Completed   Blocked   All time blocked
> MutationStage                      0         8     8387821         0                  0
> ReadStage                          0         0      355860         0                  0
> RequestResponseStage               0         7     2532457         0                  0
> ReadRepairStage                    0         0         150         0                  0
> CounterMutationStage              32       104      897560         0                  0
> MiscStage                          0         0           0         0                  0
> HintedHandoff                      0         0          65         0                  0
> GossipStage                        0         0        2338         0                  0
> CacheCleanupExecutor               0         0           0         0                  0
> InternalResponseStage              0         0           0         0                  0
> CommitLogArchiver                  0         0           0         0                  0
> CompactionExecutor                 2       190         474         0                  0
> ValidationExecutor                 0         0           0         0                  0
> MigrationStage                     0         0          10         0                  0
> AntiEntropyStage                   0         0           0         0                  0
> PendingRangeCalculator             0         0         310         0                  0
> Sampler                            0         0           0         0                  0
> MemtableFlushWriter                1        10          94         0                  0
> MemtablePostFlush                  1        34         257         0                  0
> MemtableReclaimMemory              0         0          94         0                  0
> Native-Transport-Requests        128       156      387957        16             278451
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place, the load on the machine went back to healthy, and when the flight recording finished, the load went back to > 100.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
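As an illustrative sketch of the property-based approach the comment proposes: only the property name {{cassandra.max_queued_native_transport_requests}} comes from the attached patch; the class name and the 128 fallback are assumptions standing in for the previously hard-coded value, not the actual Cassandra source.

```java
// Hypothetical sketch -- not the real Cassandra code. It shows the idea
// from the comment: read the NTR queue limit from a JVM system property
// instead of a hard-coded constant.
public class NativeTransportConfig {
    // Integer.getInteger reads -Dcassandra.max_queued_native_transport_requests
    // from the JVM command line; 128 is an assumed fallback when the flag
    // is absent.
    static final int MAX_QUEUED_REQUESTS =
        Integer.getInteger("cassandra.max_queued_native_transport_requests", 128);

    public static void main(String[] args) {
        System.out.println("max queued NTR: " + MAX_QUEUED_REQUESTS);
    }
}
```

In {{cassandra-env.sh}} the flag would then be passed along the lines of {{JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=3072"}}, matching the value the comment ended up needing on the second DC.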