For the record, there is whole section about this here (sorry for pasting
the same link twice in my original mail)

https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/load_balancing/index.html#cross-datacenter-failover

Interesting to see your perspective, I can see how doing something without
application changes might seem appealing.

I wonder what others think about this.

Regards

On Mon, Dec 1, 2025 at 11:39 PM Qc L <[email protected]> wrote:

> Hi Stefan,
>
> Thanks for raising the question about using a custom LoadBalancingPolicy!
> Remote quorum and LB-based failover aim to solve similar problems,
> maintaining availability during datacenter degradation, so I did a
> comparison study on the server-side remote quorum solution versus relying
> on driver-side logic.
>
>
>    - For our use cases, we found that client-side failover is expensive
>    to operate and leads to fragmented behavior during incidents. We also
>    prefer failover to be triggered only when needed, not automatically. A
>    server-side mechanism gives us controlled, predictable behavior and makes
>    the system easier to operate.
>    - Remote quorum is implemented on the server, where the coordinator
>    can intentionally form a quorum using replicas in a backup region when the
>    local region cannot satisfy LOCAL_QUORUM or EACH_QUORUM. The database
>    determines the fallback region, how replicas are selected, and ensures
>    consistency guarantees still hold.
>
> By keeping this logic internal to the server, we provide a unified,
> consistent behavior across all clients without requiring any application
> changes.
>
> Happy to discuss further if helpful.
>
> On Mon, Dec 1, 2025 at 12:57 PM Štefan Miklošovič <[email protected]>
> wrote:
>
>> Hello,
>>
>> before going deeper into your proposal ... I just have to ask, have you
>> tried e.g. custom LoadBalancingPolicy in driver, if you use that?
>>
>> I can imagine that if the driver detects a node to be down then based on
>> its "distance" / dc it belongs to you might start to create a different
>> query plan which talks to remote DC and similar.
>>
>>
>> https://docs.datastax.com/en/drivers/java/4.2/com/datastax/oss/driver/api/core/loadbalancing/LoadBalancingPolicy.html
>>
>>
>> https://docs.datastax.com/en/drivers/java/4.2/com/datastax/oss/driver/api/core/loadbalancing/LoadBalancingPolicy.html
>>
>> On Mon, Dec 1, 2025 at 8:54 PM Qc L <[email protected]> wrote:
>>
>>> Hello team,
>>>
>>> I’d like to propose adding the remote quorum to Cassandra consistency
>>> level for handling simultaneous hosts unavailability in the local data
>>> center that cannot achieve the required quorum.
>>>
>>> Background
>>> NetworkTopologyStrategy is the most commonly used strategy at Uber, and
>>> we use Local_Quorum for read/write in many use cases. Our Cassandra
>>> deployment in each data center currently relies on majority replicas being
>>> healthy to consistently achieve local quorum.
>>>
>>> Current behavior
>>> When a local data center in a Cassandra deployment experiences outages,
>>> network isolation, or maintenance events, the EACH_QUORUM / LOCAL_QUORUM
>>> consistency level will fail for both reads and writes if enough replicas in
>>> that the wlocal data center are unavailable. In this configuration,
>>> simultaneous hosts unavailability can temporarily prevent the cluster from
>>> reaching the required quorum for reads and writes. For applications that
>>> require high availability and a seamless user experience, this can lead to
>>> service downtime and a noticeable drop in overall availability. see
>>> attached figure1.[image: figure1.png]
>>>
>>> Proposed Solution
>>> To prevent this issue and ensure a seamless user experience, we can use
>>> the Remote Quorum consistency level as a fallback mechanism in scenarios
>>> where local replicas are unavailable. Remote Quorum in Cassandra refers to
>>> a read or write operation that achieves quorum (a majority of replicas) in
>>> the remote data center, rather than relying solely on replicas within the
>>> local data center. see attached Figure2.
>>>
>>> [image: figure2.png]
>>>
>>> We will add a feature to do read/write consistency level override on the
>>> server side. When local replicas are not available, we will overwrite the
>>> server side write consistency level from each quorum to remote quorum. Note
>>> that, implementing this change in client side will require some protocol
>>> changes in CQL, we only add this on server side which can only be used by
>>> server internal.
>>>
>>> For example, giving the following Cassandra setup
>>>
>>>
>>>
>>>
>>>
>>> *CREATE KEYSPACE ks WITH REPLICATION = {  'class':
>>> 'NetworkTopologyStrategy',  'cluster1': 3,  'cluster2': 3,  'cluster3': 3};*
>>>
>>> The selected approach for this design is to explicitly configure a
>>> backup data center mapping for the local data center, where each data
>>> center defines its preferred failover target. For example
>>> *remote_quorum_target_data_center:*
>>>
>>>
>>> *  cluster1: cluster2  cluster2: cluster3  cluster3: cluster1*
>>>
>>> Implementations
>>> We proposed the following feature to Cassandra to address data center
>>> failure scenarios
>>>
>>>    1. Introduce a new Consistency level called remote quorum.
>>>    2. Feature to do read/write consistency level override on server
>>>    side. (This can be controlled by a feature flag). Use Node tools command 
>>> to
>>>    turn on/off the server failback
>>>
>>> Why remote quorum is useful
>>> As shown in the figure 2, we have the data center failure in one of the
>>> data centers, Local quorum and each quorum will fail since two replicas are
>>> unavailable and cannot meet the quorum requirement in the local data center.
>>>
>>> During the incident, we can use the nodetool to enable failover to
>>> remote quorum. With the failover enabled, we can failover to the remote
>>> data center for read and write, which avoids the available drop.
>>>
>>> Any thoughts or concerns?
>>> Thanks in advance for your feedback,
>>>
>>> Qiaochu
>>>
>>>

Reply via email to