[ 
https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375615#comment-15375615
 ] 

Romain Hardouin edited comment on CASSANDRA-11363 at 7/13/16 8:53 PM:
----------------------------------------------------------------------

I also see this behavior after upgrading to 2.1.14 (from 2.0.17). The blocked NTR 
JMX counter was at 0 before the upgrade.
I increased the native transport max threads to 256, but blocked NTR were still at 
0.6%.
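
For context, a minimal sketch of the cassandra.yaml change referred to above, assuming 
the 2.1 parameter name native_transport_max_threads (the shipped default is 128):
{code}
# cassandra.yaml -- sketch, assuming Cassandra 2.1 parameter names
# Cap on the native transport (CQL binary protocol) request thread pool.
# The packaged default is 128; raising it did not remove the blocked NTR here.
native_transport_max_threads: 256
{code}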

For now, the most effective change was moving memtables offheap (less GC time => 
less load => fewer all time blocked NTR). But I still see up to 0.3% of all time 
blocked...
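
A minimal sketch of the offheap memtable settings in cassandra.yaml, assuming the 2.1 
option names (offheap_objects is the most aggressive of the allocation types; the 
space cap is optional and defaults to a fraction of the heap if left unset):
{code}
# cassandra.yaml -- sketch, assuming Cassandra 2.1 option names
# Allocate memtable cell objects off the Java heap to reduce GC pressure.
memtable_allocation_type: offheap_objects
# Optional explicit cap on offheap memtable space; leave unset to use the default.
# memtable_offheap_space_in_mb: 2048
{code}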

Note: I see this behavior on two different clusters (both hosted on AWS). On the 
cluster using LCS, the load after the upgrade to 2.1 is higher than it was on 2.0, 
and that cluster shows the worst blocked NTR ratio.



> High Blocked NTR When Connecting
> --------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>            Assignee: Paulo Motta
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the 
> machine load increases to very high levels (> 120 on an 8 core machine) and 
> native transport requests get blocked in tpstats.
> I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the total number of 
> requests being processed at a given time (the latter increases with the former in 
> our system).
> Currently there are between 600 and 800 client connections on each machine, and 
> each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a 
> viable option cluster-wide.
> Here is the output from tpstats:
> {code}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> MutationStage                     0         8        8387821         0                 0
> ReadStage                         0         0         355860         0                 0
> RequestResponseStage              0         7        2532457         0                 0
> ReadRepairStage                   0         0            150         0                 0
> CounterMutationStage             32       104         897560         0                 0
> MiscStage                         0         0              0         0                 0
> HintedHandoff                     0         0             65         0                 0
> GossipStage                       0         0           2338         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> InternalResponseStage             0         0              0         0                 0
> CommitLogArchiver                 0         0              0         0                 0
> CompactionExecutor                2       190            474         0                 0
> ValidationExecutor                0         0              0         0                 0
> MigrationStage                    0         0             10         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> PendingRangeCalculator            0         0            310         0                 0
> Sampler                           0         0              0         0                 0
> MemtableFlushWriter               1        10             94         0                 0
> MemtablePostFlush                 1        34            257         0                 0
> MemtableReclaimMemory             0         0             94         0                 0
> Native-Transport-Requests       128       156         387957        16            278451
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place, 
> the load on the machine went back to healthy, and when the flight recording 
> finished the load went back to > 100.
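
For reference, the operational commands behind the report above (the tpstats capture 
and the per-node binary protocol toggle) are likely the following nodetool 
invocations; a sketch, to be verified against the installed 2.1 version:
{code}
# Capture thread pool statistics, including blocked Native-Transport-Requests
nodetool tpstats

# Stop and later restart the native (CQL binary) protocol on a single node
nodetool disablebinary
nodetool enablebinary
{code}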



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
