FYI there's a healthy Slack thread discussing this here: 
https://the-asf.slack.com/archives/CK23JSY2K/p1762834946972609 

From that thread, the remaining concerns (iiuc) are:
 - cases where a replacing node is removed and we return the original 
being-replaced node to the cluster,
 - cases where multiple nodes are replacing and gossip leaves different 
coordinators seeing some, or none, of those nodes as JOINING/NORMAL.

Given these concerns, and the possibility that operators are doing things in 
unexpected ways, the feature flag is warranted on non-trunk branches.

But does trunk need the flag? It sounds like neither concern exists in trunk, 
but there was a desire to do it with more changes in trunk, which I'm not 
grokking…?
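For anyone skimming, here's a quick back-of-envelope check of the 
quorum-intersection argument from the ticket, as I understand it. This is a 
toy model (names and helpers are mine, not Cassandra code), covering only the 
RF=3, QUORUM write + QUORUM read, C → D replacement case from Runtian's 
example:

```python
from itertools import combinations

# Toy model (not Cassandra code): RF=3, QUORUM/QUORUM, C being replaced by D.
RF = 3
pending = {"D"}  # D is the pending replacement for C

def quorum(rf):
    return rf // 2 + 1

# Today: blockFor is inflated by the pending replica (3 acks for RF=3 QUORUM).
block_for_today = quorum(RF) + len(pending)
# Proposed: pending excluded, only naturals count (back to 2 acks).
block_for_proposed = quorum(RF)

# Per the ticket's example, every successful QUORUM write during the
# replacement is stored on a quorum of naturals, e.g. {A, B}.
write_set = {"A", "B"}

# Any QUORUM read - before the replacement completes (replicas {A, B, C})
# or after ({A, B, D}) - consults some 2-subset of the replica set; check
# that every such read set intersects the write set.
for replicas in ({"A", "B", "C"}, {"A", "B", "D"}):
    for read_set in combinations(sorted(replicas), quorum(RF)):
        assert write_set & set(read_set), "quorum intersection violated"

print(block_for_today, block_for_proposed)  # 3 2
```

This obviously doesn't prove the general all-CL-pairs claim in the ticket, 
just the worked example; the general W_eff + R_eff > RF argument is in 
CASSANDRA-20993.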



> On 3 Dec 2025, at 18:32, Runtian Liu <[email protected]> wrote:
> 
> Hi all,
> Just bumping this thread in case it was missed the first time.
> I’ve updated CASSANDRA-20993 with a detailed Correctness / Safety section 
> that explains why excluding the pending replacement node from blockFor during 
> node replacement does not weaken read-after-write guarantees for any 
> combination of write CL and read CL. The key point is that the effective 
> number of natural replicas that must acknowledge a write (and be consulted 
> for a read) is unchanged; we only stop inflating blockFor with the pending 
> replacement.
> For example, in the common RF=3, QUORUM write + QUORUM read case, the proof 
> shows that during a C → D replacement:
>     • Every successful QUORUM write is still guaranteed to be stored on a 
> quorum of naturals (e.g., A and B), and
>     • Every QUORUM read—both before and after the replacement completes—must 
> intersect {A, B}, so it always sees the latest value.
> The more general argument in the ticket covers all CL pairs and shows that 
> the standard condition W_eff + R_eff > RF holds (or not) exactly as before; 
> the change only removes unnecessary write timeouts when the pending 
> replacement is slow.
> If you have concerns about the correctness argument, or think there are 
> corner cases I’m missing (e.g., particular CL combinations or topology 
> transitions), I’d really appreciate feedback on the JIRA or in this thread.
> Thanks,
> Runtian
> 
> On Tue, Nov 25, 2025 at 4:44 PM Runtian Liu <[email protected]> wrote:
> Hi everyone,
> I’d like to start a discussion about adjusting how Cassandra calculates 
> blockFor during node replacements. The JIRA tracking this proposal is here:
> https://issues.apache.org/jira/browse/CASSANDRA-20993
> Problem Background
> Today, during a replacement, the pending replica is always included when 
> determining the required acknowledgments. For example, with RF=3 and 
> LOCAL_QUORUM, the coordinator waits for three responses instead of two. Since 
> replacement nodes are often bootstrapping and slow to respond, this can 
> result in write timeouts or increased write latency—even though the client 
> only requested acknowledgments from the natural replicas.
> This behavior effectively breaks the client contract by requiring more 
> responses than the specified consistency level.
> Proposed Change
> For replacement scenarios only, exclude pending replicas from blockFor and 
> require acknowledgments solely from natural replicas. Pending nodes will 
> still receive writes, but their responses will not count toward satisfying 
> the consistency level.
> Responses from the node being replaced would also be ignored. Although it is 
> uncommon for a replaced node to become reachable again, adding this safeguard 
> avoids ambiguity and ensures correctness if that situation occurs.
> This change would be disabled by default and controlled via a feature flag to 
> avoid affecting existing deployments.
> In my view, this behavior is effectively a bug because the coordinator waits 
> for more acknowledgments than the client requested, leading to avoidable 
> failures or latency. Since the issue affects correctness from the client 
> perspective rather than introducing new semantics, it would be valuable to 
> include this fix in the 4.x branches as well, with the behavior disabled by 
> default where needed.
> Motivation
> This change:
>     • Prevents unnecessary write timeouts during replacements
> 
>     • Reduces write latency by eliminating dependence on a busy pending 
> replica
> 
>     • Aligns server behavior with client expectations
> Current Status
> A PR for 4.1 is available here for review:
> https://github.com/apache/cassandra/pull/4494
> Feedback is welcome on both the implementation and the approach.
> Next Steps
> I’d appreciate input on:
>     • Any correctness concerns for replacement scenarios
> 
>     • Whether a feature-flagged approach is acceptable
> 
> Thanks in advance for your feedback,
> Runtian
> 
> 
