Hi everyone,

Last week I ran some tests to estimate the latency overhead introduced in a Cassandra cluster by a multi-availability-zone (multi-AZ) setup on AWS EC2.
I started a Cassandra cluster of 6 nodes deployed across 3 different AZs (2 nodes per AZ). Then I used cassandra-stress to run an INSERT (write) test of 20M entries with a replication factor of 3. Right after, I ran cassandra-stress again to READ 10M entries.

I got the following unexpected result (median/95th percentile/99th percentile):

Single-AZ, CL=ONE -> 1.06ms/7.41ms/55.81ms
Multi-AZ, CL=ONE -> 1.16ms/38.14ms/47.75ms

Basically, switching to the multi-AZ setup increased the 95th percentile latency by ~30ms. That's too much, considering that the average network latency between AZs on AWS is ~1ms.

Since I couldn't find anything to explain those results, I decided to run cassandra-stress specifying only a single node entry (i.e. "--nodes node1" instead of "--nodes node1,node2,node3,node4,node5,node6") and, surprisingly, the latency went back to 5.9ms.

To recap:

Multi-AZ, CL=ONE, "--nodes node1,node2,node3,node4,node5,node6" -> 95th percentile: 38.14ms
Multi-AZ, CL=ONE, "--nodes node1" -> 95th percentile: 5.9ms

For the sake of completeness, I ran a further test using a consistency level of LOCAL_QUORUM, and it did not show any large variance between using a single node and using multiple ones.

Do you guys know what could be the reason?

The tests were executed on an m3.xlarge (network optimized) using the DataStax AMI 2.6.3 running Cassandra v2.0.15.

Thank you in advance for your help.

Cheers,
Alessandro
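
P.S. For anyone who wants to compare the two CL=ONE runs side by side, the per-percentile differences can be worked out from the figures quoted above. This is only a minimal sketch: the dictionary names and the `percentile_deltas` helper are hypothetical names chosen for illustration, and the numbers are simply copied from the results in this message.

```python
# Percentile read latencies (ms) copied from the CL=ONE results in this message.
SINGLE_AZ = {"median": 1.06, "p95": 7.41, "p99": 55.81}
MULTI_AZ = {"median": 1.16, "p95": 38.14, "p99": 47.75}


def percentile_deltas(baseline, candidate):
    """Return candidate - baseline for each percentile, rounded to 2 decimals.

    Hypothetical helper for illustration; not part of cassandra-stress.
    """
    return {name: round(candidate[name] - baseline[name], 2) for name in baseline}


if __name__ == "__main__":
    for name, delta in percentile_deltas(SINGLE_AZ, MULTI_AZ).items():
        # A positive value means the multi-AZ run was slower at that percentile.
        print(f"{name}: {delta:+.2f} ms")
```

With these numbers the median moves by about 0.1ms, the 95th percentile by about 30.7ms, and the 99th percentile is about 8ms lower in the multi-AZ run, so the jump is concentrated at the 95th percentile.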