[ https://issues.apache.org/jira/browse/CASSANDRA-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sylvain Lebresne resolved CASSANDRA-12043.
------------------------------------------
    Resolution: Fixed
      Reviewer: Jason Brown
 Fix Version/s: 3.9, 3.0.9, 2.2.7, 2.1.15

I still needed to merge the branch upwards and wait on CI results for all branches. This is now done and the tests look "fine" (no failure appears related), so committed. Thanks.

| [2.1|https://github.com/pcmanus/cassandra/commits/12043-2.1] | [utests|http://cassci.datastax.com/job/pcmanus-12043-2.1-testall/] | [dtests|http://cassci.datastax.com/job/pcmanus-12043-2.1-dtest/] |
| [2.2|https://github.com/pcmanus/cassandra/commits/12043-2.2] | [utests|http://cassci.datastax.com/job/pcmanus-12043-2.2-testall/] | [dtests|http://cassci.datastax.com/job/pcmanus-12043-2.2-dtest/] |
| [3.0|https://github.com/pcmanus/cassandra/commits/12043-3.0] | [utests|http://cassci.datastax.com/job/pcmanus-12043-3.0-testall/] | [dtests|http://cassci.datastax.com/job/pcmanus-12043-3.0-dtest/] |
| [3.9|https://github.com/pcmanus/cassandra/commits/12043-3.9] | [utests|http://cassci.datastax.com/job/pcmanus-12043-3.9-testall/] | [dtests|http://cassci.datastax.com/job/pcmanus-12043-3.9-dtest/] |

> Syncing most recent commit in CAS across replicas can cause all CAS queries in the CQL partition to fail
> --------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12043
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12043
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: sankalp kohli
>            Assignee: Sylvain Lebresne
>             Fix For: 2.1.15, 2.2.7, 3.0.9, 3.9
>
> We update the most recent commit on requiredParticipant replicas if they are out of sync during the prepare round in the beginAndRepairPaxos method. We keep doing this in a loop until the requiredParticipant replicas have the same most recent commit or we hit the timeout.
> Say we have 3 machines A, B and C, and gc_grace on the table is 10 days. We do a CAS write at time 0 and it goes to A and B but not to C.
> C will get the hint later but will not update the most recent commit in the paxos table. This is how CAS hints work.
> In the paxos table, whose gc_grace=0, most_recent_commit on A and B is inserted with timestamp 0 and a TTL of 10 days. After 10 days, this insert becomes a tombstone at time 0 until it is compacted away, since gc_grace=0.
> Do a CAS read after, say, 1 day on the same CQL partition, and this time the prepare phase involves A and C. most_recent_commit on C for this CQL partition is empty. A sends its most_recent_commit to C with a timestamp of 0 and a TTL of 10 days. This most_recent_commit on C will expire on the 11th day, since it was inserted after 1 day.
> most_recent_commit is now in sync on A, B and C; however, on A and B it will expire on the 10th day, whereas on C it will expire on the 11th day, since it was inserted one day later.
> Do another CAS read after 10 days, when most_recent_commit on A and B has expired and is treated as a tombstone until compacted. In this CAS read, say A and C are involved in the prepare phase. most_recent_commit will not match between them, since it has expired on A but is still present on C. This causes most_recent_commit to be applied to A with a timestamp of 0 and a TTL of 10 days. If A has not compacted away the original, expired most_recent_commit, this new write to most_recent_commit won't be visible on reads, since there is a tombstone with the same timestamp (a delete wins over data with the same timestamp).
> Another round of prepare will follow, and again A will say it does not know about the most_recent_commit (it is covered by the original write, which is now a tombstone), and C will again try to send the write to A. This can keep going until the request times out, or until only A and B are involved in the prepare phase.
> When A's original most_recent_commit, which is now a tombstone, is compacted, all the inserts it was covering will come live.
> This will in turn again get played to another replica. This ping-pong can keep going for a long time.
> The issue is that most_recent_commit expires at different times across replicas. When it gets replayed to a replica to bring it in sync, we again set the TTL from that point.
> During the CAS read that timed out, most_recent_commit was being sent to another replica in a loop. Even successful requests will loop a couple of times when A and C are involved, and then succeed once the replicas that respond are A and B. So this has an impact on latencies as well.
> These timeouts get worse when a machine is down, as no progress can be made: the machine with the unexpired commit is always involved in the CAS prepare round. Also, with range movements, the new machine gaining the range has an empty most recent commit and receives the commit at a later time, causing the same issue.
> Repro steps:
> 1. The paxos TTL is max(3 hours, gc_grace), as defined in SystemKeyspace.paxosTtl(). Change this method to not impose a minimum TTL of 3 hours: SystemKeyspace.paxosTtl() should be return metadata.getGcGraceSeconds(); instead of return Math.max(3 * 3600, metadata.getGcGraceSeconds());. We do this so that we don't need to wait 3 hours.
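The step-1 change can be sketched as below. Both return statements are taken from the issue text; `CFMetaData` here is only a minimal stand-in interface for the real table-metadata type, of which just `getGcGraceSeconds()` is assumed.

```java
// Sketch of the SystemKeyspace.paxosTtl() change described in repro step 1.
// CFMetaData is a stand-in; only getGcGraceSeconds() is used.
public class PaxosTtlSketch {
    interface CFMetaData { int getGcGraceSeconds(); }

    // Original: a floor of 3 hours, so paxos state outlives short gc_grace values.
    static int paxosTtlOriginal(CFMetaData metadata) {
        return Math.max(3 * 3600, metadata.getGcGraceSeconds());
    }

    // Modified for the repro: use gc_grace directly, so the test does not
    // have to wait 3 hours before paxos state can expire.
    static int paxosTtlForRepro(CFMetaData metadata) {
        return metadata.getGcGraceSeconds();
    }

    public static void main(String[] args) {
        CFMetaData table = () -> 120; // gc_grace_seconds = 120, as in the repro
        System.out.println(paxosTtlOriginal(table)); // 10800 (the 3-hour floor wins)
        System.out.println(paxosTtlForRepro(table)); // 120
    }
}
```

With the modified method, the `gc_grace_seconds=120` set in the repro below directly becomes the paxos TTL, so expiry can be observed within minutes.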
> 2. Create a 3 node cluster, with the code change suggested above, with machines A, B and C:
> CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
> use test;
> CREATE TABLE users (a int PRIMARY KEY, b int);
> alter table users WITH gc_grace_seconds=120;
> consistency QUORUM;
> Bring down machine C.
> INSERT INTO users (a, b) VALUES (1, 1) IF NOT EXISTS;
> Nodetool flush on machines A and B.
> Bring up the down machine B.
> consistency SERIAL;
> tracing on;
> Wait 80 seconds.
> Bring up machine C.
> select * from users where a = 1;
> Wait 40 seconds.
> select * from users where a = 1; -- All queries from this point forward will time out.
> One potential fix could be to set the TTL based on the remaining time left on the other replicas: the full TTL minus the age of the write, where the write's timestamp is calculated from the ballot (which uses server time).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
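The fix proposed at the end of the issue (replay the commit with only its remaining TTL, not a fresh one) can be sketched as follows. This is a hedged illustration, not Cassandra's actual implementation; the method name `remainingTtl` and its parameters are invented for this sketch, with only the arithmetic (TTL minus the write's age, with the write time derived from the ballot) taken from the issue text.

```java
// Hedged sketch of the proposed fix: when replaying most_recent_commit to an
// out-of-sync replica, shrink the TTL by the commit's age, so every replica
// expires it at roughly the same wall-clock time. Names are illustrative.
public class AdjustedPaxosTtl {
    // paxosTtlSec:  the full paxos TTL (gc_grace in the repro).
    // writeTimeSec: the commit's write time in seconds, derived from its
    //               ballot (which uses server time).
    // nowSec:       current server time in seconds.
    static int remainingTtl(int paxosTtlSec, long writeTimeSec, long nowSec) {
        long remaining = paxosTtlSec - (nowSec - writeTimeSec);
        // Never write a non-positive TTL: at that point the commit has
        // already expired everywhere and need not be replayed at all.
        return (int) Math.max(remaining, 0);
    }

    public static void main(String[] args) {
        // Repro numbers: TTL 120s, commit written at t=0, replayed to C at t=80.
        System.out.println(remainingTtl(120, 0, 80));  // 40: C now expires at t=120, same as A and B
        System.out.println(remainingTtl(120, 0, 130)); // 0: already expired, skip the replay
    }
}
```

With this adjustment, the replayed copy on C in the repro above would expire at t=120 together with A's and B's copies, removing the staggered-expiry window that causes the prepare-phase ping-pong.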