Hi!
I have a design question about the communication layer in Cassandra.
Some context first:
1. I am using Cassandra in a high inter-node communication latency
environment, e.g latency between nodes is higher than 60 ms, like in
multi-datacenter environments.
2. Currently, I am running my tests in a three node cluster, with
replication factor equal to three.
3. I am working in a low throughput scenario, where I am trying to
reduce the latency as much as possible for quorum operations (both
reads and writes)
4. All my query-related messages between nodes can be considered
small (all query-related messages would most likely go through the
smallMessageChannel).
I am trying to understand some of the design decisions made on the
org.apache.cassandra.net.async.OutboundMessagingConnection and
org.apache.cassandra.net.async.OutboundMessagingPool.
If I disable completely coalescing (e.g. using otc_coalescing_strategy
as disabled), and I send two almost simultaneous request that affects
the same set of replicas, after the coordinator sends the message to one
of the data nodes. In the data node, the fastest query would complete
and use the OutboundMessagingConnection and block until a TCP ACK is
received, then all the other pending messages in the data node would be
sent in the next TCP PSH. I notice this by running tcpdump and Wireshark
in the nodes and analyzing the TCP connections between nodes.
My question is: Why is Cassandra using TCP in a synchronous manner? and
not using some type of windowing to allow more messages to go in the
wire before waiting for an ack?. Is this because of some of the
guarantees that Cassandra provides?
I know coalescing could help, but if the messages are separated by more
than the window (200 us by default in the fix setting), then it would
need to wait for a whole RTT (60 ms), before it is sent, so for many
operations in my system, I see that the latency perceived is on average
2 RTT instead of the quorum round trip that should be the slowest of all
the communication paths.
I don't know enough about Netty and TCP to figure out this by myself,
but I think I understand the basics of OutboundMessagingConnection and
OutboundMessagingConnectionPool, so I am not sure why that requirement
for synchronous TCP PSH and ACK is needed.
Thanks a lot for your help! Cassandra code is really good, and I was
able to understand many of the components thanks to it.
Best regards,
Enrique Saurez