Hello, before going deeper into your proposal, I just have to ask: have you tried e.g. a custom LoadBalancingPolicy in the driver, if you use one?
I can imagine that if the driver detects a node to be down, then based on its "distance" / the DC it belongs to, you might start to create a different query plan which talks to a remote DC, and similar. https://docs.datastax.com/en/drivers/java/4.2/com/datastax/oss/driver/api/core/loadbalancing/LoadBalancingPolicy.html

On Mon, Dec 1, 2025 at 8:54 PM Qc L <[email protected]> wrote:

> Hello team,
>
> I’d like to propose adding a Remote Quorum consistency level to Cassandra
> for handling simultaneous host unavailability in a local data center that
> cannot achieve the required quorum.
>
> Background
> NetworkTopologyStrategy is the most commonly used replication strategy at
> Uber, and we use LOCAL_QUORUM for reads and writes in many use cases. Our
> Cassandra deployment in each data center currently relies on a majority of
> replicas being healthy to consistently achieve local quorum.
>
> Current behavior
> When a local data center in a Cassandra deployment experiences outages,
> network isolation, or maintenance events, the EACH_QUORUM / LOCAL_QUORUM
> consistency levels will fail for both reads and writes if enough replicas
> in the local data center are unavailable. In this configuration,
> simultaneous host unavailability can temporarily prevent the cluster from
> reaching the required quorum for reads and writes. For applications that
> require high availability and a seamless user experience, this can lead to
> service downtime and a noticeable drop in overall availability. See
> attached figure 1. [image: figure1.png]
>
> Proposed Solution
> To prevent this issue and ensure a seamless user experience, we can use a
> Remote Quorum consistency level as a fallback mechanism in scenarios where
> local replicas are unavailable.
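For intuition, the quorum arithmetic the quoted proposal relies on can be modeled in a few lines (a standalone sketch, not Cassandra code; the class and method names here are mine):

```java
// Minimal standalone model of per-DC quorum math (illustrative, not Cassandra code).
public class QuorumMath {
    // Replicas that must respond for a quorum: floor(rf / 2) + 1.
    public static int blockFor(int rf) {
        return rf / 2 + 1;
    }

    // LOCAL_QUORUM can succeed only if enough local replicas are alive.
    public static boolean localQuorumReachable(int rf, int liveLocalReplicas) {
        return liveLocalReplicas >= blockFor(rf);
    }
}
```

With RF=3 per data center, blockFor is 2; if two of the three local replicas are down, LOCAL_QUORUM cannot be met even though a remote DC with healthy replicas could still form a quorum, which is the gap the proposal targets.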
> Remote Quorum in Cassandra refers to a read or write operation that
> achieves quorum (a majority of replicas) in a remote data center, rather
> than relying solely on replicas within the local data center. See attached
> figure 2. [image: figure2.png]
>
> We will add a feature to override the read/write consistency level on the
> server side. When local replicas are not available, we will override the
> server-side write consistency level from EACH_QUORUM to Remote Quorum.
> Note that implementing this change on the client side would require
> protocol changes in CQL, so we only add it on the server side, where it
> can be used internally by the server.
>
> For example, given the following Cassandra setup:
>
> CREATE KEYSPACE ks WITH REPLICATION = {
>   'class': 'NetworkTopologyStrategy',
>   'cluster1': 3, 'cluster2': 3, 'cluster3': 3 };
>
> The selected approach for this design is to explicitly configure a backup
> data center mapping for the local data center, where each data center
> defines its preferred failover target. For example:
>
> remote_quorum_target_data_center:
>   cluster1: cluster2
>   cluster2: cluster3
>   cluster3: cluster1
>
> Implementation
> We propose the following features for Cassandra to address data center
> failure scenarios:
>
> 1. Introduce a new consistency level called Remote Quorum.
> 2. A feature to override the read/write consistency level on the server
>    side (controlled by a feature flag), with a nodetool command to turn
>    the server-side failover on/off.
>
> Why Remote Quorum is useful
> As shown in figure 2, with a data center failure in one of the data
> centers, LOCAL_QUORUM and EACH_QUORUM will fail since two replicas are
> unavailable and cannot meet the quorum requirement in the local data
> center.
>
> During the incident, we can use nodetool to enable failover to Remote
> Quorum. With the failover enabled, reads and writes fail over to the
> remote data center, which avoids the availability drop.
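The server-side override step the quoted proposal describes (fall back from a local quorum to a quorum in the configured backup DC, gated by a flag) could be sketched roughly like this; the mapping mirrors the proposal's remote_quorum_target_data_center example, but the class, fields, and method are hypothetical, not the actual Cassandra implementation:

```java
import java.util.Map;

// Illustrative sketch of the proposed server-side fallback decision
// (hypothetical names; not actual Cassandra code).
public class RemoteQuorumFallback {
    // Mirrors the remote_quorum_target_data_center mapping from the proposal.
    static final Map<String, String> BACKUP_DC = Map.of(
            "cluster1", "cluster2",
            "cluster2", "cluster3",
            "cluster3", "cluster1");

    // In the proposal this would be toggled via a nodetool command.
    static boolean fallbackEnabled = false;

    // Decide which DC's quorum the coordinator should wait for.
    public static String targetDc(String localDc, boolean localQuorumReachable) {
        if (localQuorumReachable || !fallbackEnabled) {
            return localDc; // normal LOCAL_QUORUM / EACH_QUORUM path
        }
        // Remote Quorum: wait for a quorum in the configured backup DC instead.
        return BACKUP_DC.getOrDefault(localDc, localDc);
    }
}
```

Keeping the failover target an explicit per-DC mapping (rather than picking any healthy DC) makes the behavior during an incident predictable and easy to reason about.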
>
> Any thoughts or concerns?
> Thanks in advance for your feedback,
>
> Qiaochu
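Regarding the driver-side alternative suggested at the top of the thread: the essence of a query plan that prefers local replicas and falls back to a remote DC when the local ones are down can be modeled as below. This is a standalone sketch, not the DataStax driver API; the real hook would be LoadBalancingPolicy.newQueryPlan, and the types here are mine:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone model of a DC-aware query plan with remote-DC fallback
// (illustrative only; not the driver's LoadBalancingPolicy interface).
public class FallbackQueryPlan {
    public record Node(String address, String dc, boolean up) {}

    // Live local-DC nodes come first; if none are live, fall back to any live node.
    public static List<Node> newQueryPlan(List<Node> nodes, String localDc) {
        List<Node> plan = new ArrayList<>();
        for (Node n : nodes)
            if (n.up() && n.dc().equals(localDc)) plan.add(n);
        if (plan.isEmpty())
            for (Node n : nodes)
                if (n.up()) plan.add(n);
        return plan;
    }
}
```

The trade-off versus the server-side proposal is that every client must carry this policy (and its failover configuration), whereas a server-side override is applied once per cluster and toggled centrally via nodetool.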
