[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409734#comment-15409734 ]
Romain Hardouin edited comment on CASSANDRA-11363 at 8/9/16 12:37 PM:
----------------------------------------------------------------------

I see a lower blocked-NTR percentage with 1024 max queued requests. I attached {{max_queued_ntr_property.txt}} to set this value in {{cassandra-env.sh}}, and it turns out that 1536 was a good value in my case: I don't see any blocked NTR so far. That said, it's just a workaround because the root cause might be elsewhere. Anyway, I think it's better to have a property to set this value instead of a hard-coded number. WDYT?

UPDATE: I had to increase the value up to {{-Dcassandra.max_queued_native_transport_requests=3072}} on the other DC (same cluster) in order to see 0 blocked NTR.


> High Blocked NTR When Connecting
> --------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>            Assignee: Paulo Motta
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack, max_queued_ntr_property.txt, thread-queue-2.1.txt
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the machine load increases to very high levels (> 120 on an 8-core machine) and native transport requests get blocked in tpstats.
> I was able to reproduce this with both CMS and G1GC, as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the number of total requests being processed at a given time (as the latter increases with the former in our system).
> Currently there are between 600 and 800 client connections on each machine, and each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a viable option cluster-wide.
> Here is the output from tpstats:
> {code}
> Pool Name                     Active   Pending   Completed   Blocked   All time blocked
> MutationStage                      0         8     8387821         0                  0
> ReadStage                          0         0      355860         0                  0
> RequestResponseStage               0         7     2532457         0                  0
> ReadRepairStage                    0         0         150         0                  0
> CounterMutationStage              32       104      897560         0                  0
> MiscStage                          0         0           0         0                  0
> HintedHandoff                      0         0          65         0                  0
> GossipStage                        0         0        2338         0                  0
> CacheCleanupExecutor               0         0           0         0                  0
> InternalResponseStage              0         0           0         0                  0
> CommitLogArchiver                  0         0           0         0                  0
> CompactionExecutor                 2       190         474         0                  0
> ValidationExecutor                 0         0           0         0                  0
> MigrationStage                     0         0          10         0                  0
> AntiEntropyStage                   0         0           0         0                  0
> PendingRangeCalculator             0         0         310         0                  0
> Sampler                            0         0           0         0                  0
> MemtableFlushWriter                1        10          94         0                  0
> MemtablePostFlush                  1        34         257         0                  0
> MemtableReclaimMemory              0         0          94         0                  0
> Native-Transport-Requests        128       156      387957        16             278451
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place, the load on the machine went back to healthy, and when the flight recording finished, the load went back to > 100.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
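As an illustrative sketch of the property-based approach the comment proposes: only the property name {{cassandra.max_queued_native_transport_requests}} comes from the attached patch; the class name and the 128 fallback are assumptions standing in for the previously hard-coded value, not the actual Cassandra source.

```java
// Hypothetical sketch -- not the real Cassandra code. It shows the idea
// from the comment: read the NTR queue limit from a JVM system property
// instead of a hard-coded constant.
public class NativeTransportConfig {
    // Integer.getInteger reads -Dcassandra.max_queued_native_transport_requests
    // from the JVM command line; 128 is an assumed fallback when the flag
    // is absent.
    static final int MAX_QUEUED_REQUESTS =
        Integer.getInteger("cassandra.max_queued_native_transport_requests", 128);

    public static void main(String[] args) {
        System.out.println("max queued NTR: " + MAX_QUEUED_REQUESTS);
    }
}
```

In {{cassandra-env.sh}} the flag would then be passed along the lines of {{JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=3072"}}, matching the value the comment ended up needing on the second DC.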