[ https://issues.apache.org/jira/browse/CASSANDRA-7522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Shuler resolved CASSANDRA-7522.
---------------------------------------
    Resolution: Not a Problem

Closing observation - this would be good for the mailing list.

> Bootstrapping a single node spikes cluster-wide p95 latencies
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-7522
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7522
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: AWS, i2.2xlarge HVM instances
>            Reporter: Mike Heffner
>
> We've recently run some tests with Cassandra 2.0.9, largely because we are interested in the streaming improvements in the 2.0.x series (see CASSANDRA-5726). However, our results so far show that even with 2.0.x, the impact of streaming is still quite large and hard to control.
>
> Our test environment was a 9-node 2.0.9 ring running on AWS i2.2xlarge HVM instances with Oracle JVM 1.7.0_55. Each node is configured to use vnodes with 256 tokens. We tested expanding this ring to 12 nodes, bootstrapping each new node with different throttle settings applied around the ring:
>
> 1st node:
> * no throttle, stream/compaction throughput = 0
>
> 2nd node:
> * stream throughput = 200
> * compaction throughput = 256
>
> 3rd node:
> * stream throughput = 50
> * compaction throughput = 65
>
> This is a graph of p95 write latencies (the ring was not taking reads) showing each node bootstrapping, left to right. The p95 latencies go from about 200ms to ~500ms:
> http://snapshots.librato.com/instrument/5j9l3qiq-7462.png
>
> The write latencies appear to be largely driven by CPU, as shown by:
> http://snapshots.librato.com/instrument/xsfb688i-7463.png
>
> Network graphs show that the joining nodes follow approximately the same bandwidth pattern:
> http://snapshots.librato.com/instrument/ljvkvg6y-7464.png
>
> What are the expected performance behaviors during bootstrapping / ring expansion? The storage load in this test was fairly small, so the duration of the spikes was short; at a much larger production load we would need to sustain these spikes for hours. As far as we could tell, the throttle controls did not help.
>
> These are our current config changes:
> {code}
> -concurrent_reads: 32
> -concurrent_writes: 32
> +concurrent_reads: 64
> +concurrent_writes: 64
> -memtable_flush_queue_size: 4
> +memtable_flush_queue_size: 5
> -rpc_server_type: sync
> +rpc_server_type: hsha
> -#concurrent_compactors: 1
> +concurrent_compactors: 6
> -cross_node_timeout: false
> +cross_node_timeout: true
> -# phi_convict_threshold: 8
> +phi_convict_threshold: 12
> -endpoint_snitch: SimpleSnitch
> +endpoint_snitch: Ec2Snitch
> -internode_compression: all
> +internode_compression: none
> {code}
>
> Heap settings:
> {code}
> export MAX_HEAP_SIZE="10G"
> export HEAP_NEWSIZE="2G"
> {code}
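For reference, the ticket does not state exactly how the per-node throttles above were applied; a minimal sketch, assuming the standard stream/compaction throttle knobs in 2.0.x were used (values shown are the 2nd test node's settings):

{code}
# Runtime throttles, set via nodetool on a node (0 disables the throttle):
nodetool setstreamthroughput 200       # outbound streaming cap, Mb/s
nodetool setcompactionthroughput 256   # compaction cap, MB/s

# Equivalent cassandra.yaml settings, read at startup:
# stream_throughput_outbound_megabits_per_sec: 200
# compaction_throughput_mb_per_sec: 256
{code}

Note that the stream throttle caps outbound streaming per node, and during bootstrap it is the existing replicas that stream data out, so to limit load on the rest of the ring the throttle generally needs to be applied on the existing nodes as well, not only on the joining node.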