> On Nov 26, 2020, at 9:53 AM, Jens Fischer wrote:
>
> Hi,
>
> we run a Cassandra cluster with three DCs. We noticed that the traffic
> incurred by running the Cluster is significant.
>
> Consider the following simplified IoT scenario:
>
> * time series data from devices in the field is received at Node A
> * Node A inserts the data into DC 1
> * DC 1 replicates the data within the DC and two the other two DCs
>
> The traffic this produces is significant. The numbers below are based on
> observing the incoming and outgoing traffic on the node level:
>
> * I call the bandwidth for receiving the the data on Node A "base bandwidth"
> * Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth
> * Replication to each of the other data centres takes 5 times the base
> bandwidth
> * overall we see a “bandwidth amplification” of ~ 13x (3+5+5)
>
You didn’t specify consistency levels or replication factors so it’s hard to
check your math.
Here’s what I’d expect
If you do RF=3 per DC and have 3 DCs, a write of size A is written to the
cluster using coordinator C
C sends that write to replicas R1, R2, and R3 in the local DC
C sends the write to F2 and F3 - forwarders - one in each remote DC
F2 sends the write to R1-2, R2-2 in the remote DC2 and itself (F2 will be a
replica), each replica sends an ack back to C
F3 sends the write to R1-3, R2-3 in the remote DC3 and itself (F3 will be a
replica), each replica sends an ack back to C
You can avoid one extra write using token aware routing and making C a replica
(R1, for example)
Given this, I don’t see how a remote DC is 5x A - it should be cross DC/WAN
cost A into the forwarder and 2A out of the forwarder (local traffic ,
cross-AZ/rack but not WAN), with trivial ACK cost to the original DC.
If you’re seeing more than this, it may be something other than pure writes -
anti entropy repair, hints, read repair are all possible, and would have
different causes and fixes.
Most people who get to this level of calculation are doing so because they’re
trying to solve a problem, and the common problem is that cross-AZ traffic in
cloud providers is expensive at scale. If that’s why you’re asking, compression
is your obvious win, and reducing RF is your alternative option (3/3/3 is super
expensive - how many dcs take writes directly and which consistency level are
you using? What’s the point of having 9 copies of the data? Would 1 copy per dc
be enough if you’re doing global quorum? Would 2 copies in the cold DCs be
enough if you’re only reading / writing from one DC?).
> My questions:
>
> 1. Would you considers these factors expected behaviour?
13 seems high. 9 seems more correct unless you’re double counting sending and
receiving.
> 2. Are there ways to reduce the traffic through configuration?
Compression, reducing RF, maybe mitigation with longer timeouts to avoid double
sending hints.
>
> A few additional notes on the setup:
>
> * use NetworkTopologyStrategy for replication and cassandra-rackdc.properties
> to configure the GossipingPropertyFileSnitch
> * internode_compression is set to dc
> * inter_dc_tcp_nodelay is set to false
>
> Any help is highly appreciated!
>
> Best Regards
> Jens
> Geschäftsführer: Oliver Koch (CEO), Jean-Baptiste Cornefert, Christoph
> Ostermann, Hermann Schweizer, Bianca Swanston
> Amtsgericht Kempten/Allgäu, Registernummer: 10655, Steuernummer
> 127/137/50792, USt.-IdNr. DE272208908