[ 
https://issues.apache.org/jira/browse/CASSANDRA-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ling Mao updated CASSANDRA-20798:
---------------------------------
    Description: 
h4. How to reproduce
{code:sql}
CREATE KEYSPACE c_simple_3_1 WITH replication = {'class': 'SimpleStrategy',
'replication_factor' : '3/1'};
use c_simple_3_1;
CREATE TABLE users (
    user_id varchar PRIMARY KEY,
    first varchar,
    last varchar,
    age int
) WITH read_repair = 'NONE';
cqlsh:c_simple_3_1> consistency;
Current consistency level is ONE.
INSERT INTO users (user_id, first, last, age)
               VALUES ('jsmith', 'John', 'Smith', 42);
INSERT INTO users (user_id, first, last, age)
               VALUES ('foo', 'foo', 'foo', 18);
INSERT INTO users (user_id, first, last, age)
               VALUES ('bar', 'bar', 'bar', 19);
INSERT INTO users (user_id, first, last, age)
               VALUES ('abc', 'abc', 'abc', 20);
cqlsh:c_simple_3_1> SELECT * FROM users;
NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>: <Error from server: code=0000 [Server error] message="java.lang.AssertionError">})
cqlsh:c_simple_3_1> SELECT * FROM users where user_id = 'jsmith';
 user_id | age | first | last
---------+-----+-------+-------
  jsmith |  42 |  John | Smith
(1 rows)
cqlsh:c_simple_3_1> SELECT * FROM users where user_id = 'jsmith';
NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>: <Error from server: code=0000 [Server error] message="java.lang.AssertionError">})
{code}
h4. Exception stack
{code:java}
ERROR [Native-Transport-Requests-1] 2025-07-23 19:06:16,161 ExceptionHandlers.java:246 - Unexpected exception during request; channel = [id: 0x65a448a1, L:/127.0.0.1:9042 - R:/127.0.0.1:55121]
java.lang.AssertionError: null
 at org.apache.cassandra.exceptions.UnavailableException.create(UnavailableException.java:39)
 at org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicas(ReplicaPlans.java:200)
 at org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicasForRead(ReplicaPlans.java:139)
 at org.apache.cassandra.locator.ReplicaPlans.forRangeRead(ReplicaPlans.java:964)
 at org.apache.cassandra.locator.ReplicaPlans.forRangeRead(ReplicaPlans.java:945)
 at org.apache.cassandra.service.reads.range.ReplicaPlanIterator.computeNext(ReplicaPlanIterator.java:89)
 at org.apache.cassandra.service.reads.range.ReplicaPlanIterator.computeNext(ReplicaPlanIterator.java:46)
 at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
 at com.google.common.collect.Iterators$PeekingImpl.hasNext(Iterators.java:1191)
 at org.apache.cassandra.service.reads.range.ReplicaPlanMerger.computeNext(ReplicaPlanMerger.java:61)
 at org.apache.cassandra.service.reads.range.ReplicaPlanMerger.computeNext(ReplicaPlanMerger.java:34)
 at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
 at org.apache.cassandra.service.reads.range.RangeCommandIterator.computeNext(RangeCommandIterator.java:126)
 at org.apache.cassandra.service.reads.range.RangeCommandIterator.computeNext(RangeCommandIterator.java:74)
 at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
 at org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:90)
 at org.apache.cassandra.cql3.statements.SelectStatement.process(SelectStatement.java:1053)
 at org.apache.cassandra.cql3.statements.SelectStatement.processResults(SelectStatement.java:636)
 at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:610)
 at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:404)
 at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:154)
 at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:304)
 at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:399)
 at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:386)
 at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:117)
 at org.apache.cassandra.transport.Message$Request.execute(Message.java:259)
 at org.apache.cassandra.transport.Dispatcher.processRequest(Dispatcher.java:423)
 at org.apache.cassandra.transport.Dispatcher.processRequest(Dispatcher.java:442)
 at org.apache.cassandra.transport.Dispatcher.processRequest(Dispatcher.java:469)
 at org.apache.cassandra.transport.Dispatcher$RequestProcessor.run(Dispatcher.java:314)
 at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:99)
 at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
 at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
 at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:150)
 at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at java.base/java.lang.Thread.run(Thread.java:829)
{code}
h4. Analysis

Executing the same CQL query against the same coordinator should always give 
the same result, but under transient replication it does not: in the 
reproduction above, the same SELECT succeeds once and then fails.

After some debugging, we found the root cause: the candidate replicas are 
selected in a non-deterministic order.

If a transient replica happens to be the first element in the candidates list, 
the read can violate the requirement that at least one full replica must be 
involved, which leads to the inconsistent behavior.
{code:java}
EndpointsForToken candidates = candidatesForRead(keyspace, indexQueryPlan, consistencyLevel, forTokenReadLive.all());
EndpointsForToken contacts = contactForRead(metadata.locator, replicationStrategy, consistencyLevel,
                                            retry.equals(AlwaysSpeculativeRetryPolicy.INSTANCE), candidates);
{code}
In {{org.apache.cassandra.locator.ReplicaPlans#assureSufficientLiveReplicas}}:
{code:java}
default:
    int live = allLive.size();
    int full = Replicas.countFull(allLive);
    if (live < blockFor || full < blockForFullReplicas)
    {
        if (logger.isTraceEnabled())
            logger.trace("Live nodes {} do not satisfy ConsistencyLevel ({} required)", Iterables.toString(allLive), blockFor);
        throw UnavailableException.create(consistencyLevel, blockFor, blockForFullReplicas, live, full);
    }
{code}
In this case, if the list contains only transient replicas, {{full}} will be 0. 
If {{blockForFullReplicas}} is 1, the condition {{full(0) < 
blockForFullReplicas(1)}} is satisfied, so the branch that calls 
{{UnavailableException.create}} is taken. According to the stack trace above, 
the {{AssertionError}} is raised inside that call.
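The arithmetic of that check can be tried out in isolation. The following is only a minimal sketch, not Cassandra code: the {{Replica}} class, {{countFull}} and {{isInsufficient}} are hypothetical stand-ins that mirror the condition quoted above, assuming a single contacted replica with {{blockFor = 1}} and {{blockForFullReplicas = 1}}.

```java
import java.util.Arrays;
import java.util.List;

final class LiveReplicaCheckSketch
{
    /** Hypothetical stand-in for a replica: an endpoint name plus a full/transient flag. */
    static final class Replica
    {
        final String endpoint;
        final boolean full;

        Replica(String endpoint, boolean full)
        {
            this.endpoint = endpoint;
            this.full = full;
        }
    }

    /** Counts the full replicas in the list (same role as Replicas.countFull in the snippet above). */
    static int countFull(List<Replica> replicas)
    {
        int full = 0;
        for (Replica r : replicas)
            if (r.full)
                full++;
        return full;
    }

    /** True when the quoted condition (live < blockFor || full < blockForFullReplicas) holds. */
    static boolean isInsufficient(List<Replica> live, int blockFor, int blockForFullReplicas)
    {
        return live.size() < blockFor || countFull(live) < blockForFullReplicas;
    }

    public static void main(String[] args)
    {
        Replica fullReplica = new Replica("127.0.0.1", true);
        Replica transientReplica = new Replica("127.0.0.3", false);

        // A full replica alone satisfies both requirements.
        System.out.println(isInsufficient(Arrays.asList(fullReplica), 1, 1));      // false
        // A transient replica alone gives full = 0 < 1, so the failure branch is taken.
        System.out.println(isInsufficient(Arrays.asList(transientReplica), 1, 1)); // true
    }
}
```

Which of the two cases occurs depends only on which replica ends up in the list, which matches the intermittent failure seen in the reproduction.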
h4. Solution

To solve this, we introduce a method _*reorderWithOneFullReplicaFirst*_ that 
attempts to ensure at least one full replica is placed at the head of the 
replica list, without disturbing the rest of the ordering (which is typically 
proximity-based).
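One way such a reordering could look is sketched below. This is an illustration under stated assumptions, not the actual patch: the {{Replica}} class is a hypothetical stand-in, and the method operates on a plain {{List}} rather than on Cassandra's endpoint collections. It moves the first full replica to the head and keeps every other element in its original relative order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReorderSketch
{
    /** Hypothetical stand-in for a replica: an endpoint name plus a full/transient flag. */
    static final class Replica
    {
        final String endpoint;
        final boolean full;

        Replica(String endpoint, boolean full)
        {
            this.endpoint = endpoint;
            this.full = full;
        }

        @Override
        public String toString()
        {
            return endpoint + (full ? "(full)" : "(transient)");
        }
    }

    /** Moves the first full replica to the head; all other elements keep their relative order. */
    static List<Replica> reorderWithOneFullReplicaFirst(List<Replica> candidates)
    {
        int firstFull = -1;
        for (int i = 0; i < candidates.size(); i++)
        {
            if (candidates.get(i).full)
            {
                firstFull = i;
                break;
            }
        }

        // Already headed by a full replica, or no full replica available: leave the list as is.
        if (firstFull <= 0)
            return candidates;

        List<Replica> reordered = new ArrayList<>(candidates.size());
        reordered.add(candidates.get(firstFull));
        for (int i = 0; i < candidates.size(); i++)
            if (i != firstFull)
                reordered.add(candidates.get(i));
        return reordered;
    }

    public static void main(String[] args)
    {
        List<Replica> candidates = Arrays.asList(new Replica("127.0.0.3", false),
                                                 new Replica("127.0.0.1", true),
                                                 new Replica("127.0.0.2", true));
        // prints [127.0.0.1(full), 127.0.0.3(transient), 127.0.0.2(full)]
        System.out.println(reorderWithOneFullReplicaFirst(candidates));
    }
}
```

When no full replica is present the list is returned unchanged, which is why the description says the method only "attempts" to put a full replica first.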


> Fix non deterministic reads in transient replication
> ----------------------------------------------------
>
>                 Key: CASSANDRA-20798
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20798
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Feature/Transient Replication
>            Reporter: Ling Mao
>            Assignee: Ling Mao
>            Priority: Normal
>          Time Spent: 10m
>  Remaining Estimate: 0h


