Hi Stefan,

Thanks for raising the question about using a custom LoadBalancingPolicy! Remote quorum and LB-based failover aim to solve a similar problem (maintaining availability during datacenter degradation), so I did a comparison of the server-side remote quorum solution versus relying on driver-side logic.
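For context on the driver-side option, a custom policy along the lines Stefan links below would look roughly like the following sketch against the 4.x Java driver's LoadBalancingPolicy interface. This is a minimal illustration, not something we run: the class name, DC names, and the "too few local nodes up" threshold are all hypothetical, and a real policy would also need plan shuffling, token awareness, and proper configuration.

import com.datastax.oss.driver.api.core.context.DriverContext;
import com.datastax.oss.driver.api.core.loadbalancing.LoadBalancingPolicy;
import com.datastax.oss.driver.api.core.loadbalancing.NodeDistance;
import com.datastax.oss.driver.api.core.metadata.Node;
import com.datastax.oss.driver.api.core.metadata.NodeState;
import com.datastax.oss.driver.api.core.session.Request;
import com.datastax.oss.driver.api.core.session.Session;
import java.util.Map;
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch: prefer local-DC nodes, append a backup DC's nodes to
// the query plan when too few local nodes are up. Activated via
// basic.load-balancing-policy.class in the driver configuration.
public class DcFailoverPolicy implements LoadBalancingPolicy {

  private static final String LOCAL_DC = "cluster1";   // hypothetical DC names
  private static final String BACKUP_DC = "cluster2";
  private static final int MIN_LOCAL_UP = 2;           // e.g. quorum of RF=3

  private final Map<UUID, Node> nodes = new ConcurrentHashMap<>();

  public DcFailoverPolicy(DriverContext context, String profileName) {
    // The driver instantiates policies reflectively with this signature.
  }

  @Override
  public void init(Map<UUID, Node> initialNodes, DistanceReporter distanceReporter) {
    nodes.putAll(initialNodes);
    for (Node node : initialNodes.values()) {
      // Report both DCs as reachable so the driver pools connections to them.
      distanceReporter.setDistance(
          node,
          LOCAL_DC.equals(node.getDatacenter()) ? NodeDistance.LOCAL : NodeDistance.REMOTE);
    }
  }

  @Override
  public Queue<Node> newQueryPlan(Request request, Session session) {
    Queue<Node> plan = new ConcurrentLinkedQueue<>();
    int localUp = 0;
    for (Node node : nodes.values()) {
      if (node.getState() == NodeState.UP && LOCAL_DC.equals(node.getDatacenter())) {
        plan.add(node);
        localUp++;
      }
    }
    if (localUp < MIN_LOCAL_UP) {
      // Local DC looks degraded: append backup-DC nodes as fallback coordinators.
      for (Node node : nodes.values()) {
        if (node.getState() == NodeState.UP && BACKUP_DC.equals(node.getDatacenter())) {
          plan.add(node);
        }
      }
    }
    return plan;
  }

  // Node objects are live views, so getState() above stays current; we only
  // need to track additions and removals here.
  @Override public void onAdd(Node node) { nodes.put(node.getHostId(), node); }
  @Override public void onUp(Node node) {}
  @Override public void onDown(Node node) {}
  @Override public void onRemove(Node node) { nodes.remove(node.getHostId()); }
  @Override public void close() {}
}

Even with something like this, the driver only controls which nodes it tries as coordinators; for LOCAL_QUORUM, "local" is evaluated relative to the coordinator's DC, so routing to a remote coordinator quietly changes what the quorum means. That is part of why we concluded the semantics belong on the server: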
- For our use cases, we found that client-side failover is expensive to operate and leads to fragmented behavior during incidents. We also prefer failover to be triggered only when needed, not automatically. A server-side mechanism gives us controlled, predictable behavior and makes the system easier to operate.

- Remote quorum is implemented on the server: the coordinator can intentionally form a quorum using replicas in a backup region when the local region cannot satisfy LOCAL_QUORUM or EACH_QUORUM. The database determines the fallback region and how replicas are selected, and it ensures that the consistency guarantees still hold. By keeping this logic internal to the server, we provide unified, consistent behavior across all clients without requiring any application changes.

Happy to discuss further if helpful.

On Mon, Dec 1, 2025 at 12:57 PM Štefan Miklošovič <[email protected]> wrote:

> Hello,
>
> before going deeper into your proposal ... I just have to ask, have you
> tried e.g. a custom LoadBalancingPolicy in the driver, if you use that?
>
> I can imagine that if the driver detects a node to be down, then based on
> its "distance" / the DC it belongs to, you might start to create a
> different query plan which talks to a remote DC, and similar.
>
> https://docs.datastax.com/en/drivers/java/4.2/com/datastax/oss/driver/api/core/loadbalancing/LoadBalancingPolicy.html
>
> On Mon, Dec 1, 2025 at 8:54 PM Qc L <[email protected]> wrote:
>
>> Hello team,
>>
>> I'd like to propose adding a remote quorum consistency level to
>> Cassandra for handling simultaneous host unavailability in the local
>> data center when the required quorum cannot be achieved.
>>
>> Background
>> NetworkTopologyStrategy is the most commonly used replication strategy
>> at Uber, and we use LOCAL_QUORUM for reads and writes in many use cases.
>> Our Cassandra deployment in each data center currently relies on a
>> majority of replicas being healthy to consistently achieve local quorum.
>>
>> Current behavior
>> When a local data center in a Cassandra deployment experiences outages,
>> network isolation, or maintenance events, the EACH_QUORUM / LOCAL_QUORUM
>> consistency levels will fail for both reads and writes if enough
>> replicas in that local data center are unavailable. In this
>> configuration, simultaneous host unavailability can temporarily prevent
>> the cluster from reaching the required quorum for reads and writes. For
>> applications that require high availability and a seamless user
>> experience, this can lead to service downtime and a noticeable drop in
>> overall availability. See attached figure 1. [image: figure1.png]
>>
>> Proposed Solution
>> To prevent this issue and ensure a seamless user experience, we can use
>> the remote quorum consistency level as a fallback mechanism in scenarios
>> where local replicas are unavailable. Remote quorum in Cassandra refers
>> to a read or write operation that achieves quorum (a majority of
>> replicas) in a remote data center, rather than relying solely on
>> replicas within the local data center. See attached figure 2.
>>
>> [image: figure2.png]
>>
>> We will add a feature to override the read/write consistency level on
>> the server side. When local replicas are not available, we will override
>> the server-side write consistency level from each quorum to remote
>> quorum. Note that implementing this change on the client side would
>> require protocol changes in CQL, so we only add it on the server side,
>> where it is used internally by the server.
>>
>> For example, given the following Cassandra setup:
>>
>> CREATE KEYSPACE ks WITH REPLICATION = {
>>     'class': 'NetworkTopologyStrategy',
>>     'cluster1': 3,
>>     'cluster2': 3,
>>     'cluster3': 3
>> };
>>
>> The selected approach for this design is to explicitly configure a
>> backup data center mapping for the local data center, where each data
>> center defines its preferred failover target. For example:
>>
>> remote_quorum_target_data_center:
>>     cluster1: cluster2
>>     cluster2: cluster3
>>     cluster3: cluster1
>>
>> Implementation
>> We propose the following features for Cassandra to address data center
>> failure scenarios:
>>
>> 1. Introduce a new consistency level called remote quorum.
>> 2. A feature to override the read/write consistency level on the server
>> side (controlled by a feature flag), with a nodetool command to turn the
>> server-side failover on and off.
>>
>> Why remote quorum is useful
>> As shown in figure 2, when one of the data centers fails, LOCAL_QUORUM
>> and EACH_QUORUM will fail, since two replicas are unavailable and the
>> quorum requirement cannot be met in the local data center.
>>
>> During the incident, we can use nodetool to enable failover to remote
>> quorum. With failover enabled, reads and writes fail over to the remote
>> data center, which avoids the availability drop.
>>
>> Any thoughts or concerns?
>> Thanks in advance for your feedback,
>>
>> Qiaochu
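PS, for concreteness on the operator side: during an incident like figure 2, the workflow we have in mind is roughly the following. The command names are placeholders, nothing here exists in nodetool today; the actual interface would be defined as part of the implementation.

nodetool enableremotequorumfailover    # hypothetical: quorum falls back to the configured backup DC
nodetool disableremotequorumfailover   # hypothetical: restore normal LOCAL_QUORUM / EACH_QUORUM behavior

With the keyspace above (RF=3 per data center) and the cluster1 -> cluster2 mapping, a quorum operation coordinated in cluster1 would, while failover is enabled, be satisfied by 2 of the 3 replicas in cluster2 instead of cluster1.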
