Unexpected high internode network activity

Gianluca Borello Thu, 25 Feb 2016 18:52:28 -0800

Hello,

We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications.
There's a total of 21 nodes across 3 AWS availability zones, c3.2xlarge
instances.


The configuration is pretty standard, we use the default settings that come
with the datastax AMI and the driver in our application is configured to
use lz4 compression. The keyspace where all the activity happens has RF 3
and we read and write at quorum to get strong consistency.

While analyzing our monthly bill, we noticed that the amount of network
traffic related to Cassandra was significantly higher than expected. After
breaking it down by port, it seems like over any given time, the internode
network activity is 6-7 times higher than the traffic on port 9042, whereas
we would expect something around 2-3 times, given the replication factor
and the consistency level of our queries.

For example, this is the network traffic broken down by port and direction
over a few minutes, measured as sum of each node:

Port 9042 from client to cluster (write queries): 1 GB
Port 9042 from cluster to client (read queries): 1.5 GB
Port 7000: 35 GB, which must be divided by two because the traffic is
always directed to another instance of the cluster, so that makes it 17.5
GB generated traffic

The traffic on port 9042 completely matches our expectations, we do about
100k write operations writing 10KB binary blobs for each query, and a bit
more reads on the same data.

According to our calculations, in the worst case, when the coordinator of
the query is not a replica for the data, this should generate about (1 +
1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more.

Also, hinted handoffs are disabled and nodes are healthy over the period of
observation, and I get the same numbers across pretty much every time
window, even including an entire 24 hours period.

I tried to replicate this problem in a test environment so I connected a
client to a test cluster done in a bunch of Docker containers (same
parameters, essentially the only difference is the
GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I
expect, the amount of traffic on port 7000 is between 2 and 3 times the
amount of traffic on port 9042 and the queries are pretty much the same
ones.

Before doing more analysis, I was wondering if someone has an explanation
on this problem, since perhaps we are missing something obvious here?

Thanks

Unexpected high internode network activity

Reply via email to