[ https://issues.apache.org/jira/browse/CASSANDRA-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sylvain Lebresne resolved CASSANDRA-12043.
------------------------------------------
    Resolution: Fixed
      Reviewer: Jason Brown
 Fix Version/s: 3.9, 3.0.9, 2.2.7, 2.1.15

I still needed to merge the branch upwards and wait on CI results for all branches. This is now done and the tests look "fine" (no failure appears related), so committed. Thanks.

| [2.1|https://github.com/pcmanus/cassandra/commits/12043-2.1] | [utests|http://cassci.datastax.com/job/pcmanus-12043-2.1-testall/] | [dtests|http://cassci.datastax.com/job/pcmanus-12043-2.1-dtest/] |
| [2.2|https://github.com/pcmanus/cassandra/commits/12043-2.2] | [utests|http://cassci.datastax.com/job/pcmanus-12043-2.2-testall/] | [dtests|http://cassci.datastax.com/job/pcmanus-12043-2.2-dtest/] |
| [3.0|https://github.com/pcmanus/cassandra/commits/12043-3.0] | [utests|http://cassci.datastax.com/job/pcmanus-12043-3.0-testall/] | [dtests|http://cassci.datastax.com/job/pcmanus-12043-3.0-dtest/] |
| [3.9|https://github.com/pcmanus/cassandra/commits/12043-3.9] | [utests|http://cassci.datastax.com/job/pcmanus-12043-3.9-testall/] | [dtests|http://cassci.datastax.com/job/pcmanus-12043-3.9-dtest/] |

> Syncing most recent commit in CAS across replicas can cause all CAS queries in the CQL partition to fail
> --------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12043
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12043
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: sankalp kohli
>            Assignee: Sylvain Lebresne
>             Fix For: 2.1.15, 2.2.7, 3.0.9, 3.9
>
> We update the most recent commit on requiredParticipant replicas if they are out of sync during the prepare round in the beginAndRepairPaxos method. We keep doing this in a loop until the requiredParticipant replicas have the same most recent commit or we hit the timeout.
> Say we have 3 machines A, B and C, and gc_grace on the table is 10 days. We do a CAS write at time 0 and it goes to A and B but not to C.
> C will get the hint later but will not update the most recent commit in the paxos table. This is how CAS hints work.
> In the paxos table, whose gc_grace=0, most_recent_commit on A and B is inserted with timestamp 0 and a TTL of 10 days. After 10 days, this insert becomes a tombstone at time 0 until it is compacted away, since gc_grace=0.
> Do a CAS read after, say, 1 day on the same CQL partition, and this time the prepare phase involves A and C. most_recent_commit on C for this CQL partition is empty. A sends its most_recent_commit to C with a timestamp of 0 and a TTL of 10 days. This most_recent_commit on C will expire on the 11th day, since it was inserted after 1 day.
> most_recent_commit is now in sync on A, B and C; however, on A and B it will expire on the 10th day, whereas on C it will expire on the 11th day, since it was inserted one day later.
> Do another CAS read after 10 days, when most_recent_commit on A and B has expired and is treated as a tombstone until compacted. In this CAS read, say A and C are involved in the prepare phase. most_recent_commit will not match between them, since it has expired on A but is still present on C. This causes most_recent_commit to be applied to A with a timestamp of 0 and a TTL of 10 days. If A has not compacted away the original, expired most_recent_commit, this new write to most_recent_commit won't be visible on reads, since there is a tombstone with the same timestamp (a delete wins over data with the same timestamp).
> Another round of prepare will follow, and again A will say it does not know about the most_recent_commit (it is covered by the original write, which is now a tombstone), and C will again try to send the write to A. This can keep going until the request times out, or until only A and B are involved in the prepare phase.
> When A's original most_recent_commit, which is now a tombstone, is compacted, all the inserts it was covering will come live.
> This will in turn again get played to another replica. This ping-pong can keep going for a long time.
> The issue is that most_recent_commit expires at different times across replicas. When it gets replayed to a replica to bring it in sync, we again set the TTL from that point.
> During the CAS read that timed out, most_recent_commit was being sent to another replica in a loop. Even successful requests will loop a couple of times when A and C are involved, and then succeed once the replicas that respond are A and B. So this has an impact on latencies as well.
> These timeouts get worse when a machine is down, as no progress can be made: the machine with the unexpired commit is always involved in the CAS prepare round. Also, with range movements, the new machine gaining the range has an empty most recent commit and receives the commit at a later time, causing the same issue.
> Repro steps:
> 1. The paxos TTL is max(3 hours, gc_grace), as defined in SystemKeyspace.paxosTtl(). Change this method to not impose a minimum TTL of 3 hours: SystemKeyspace.paxosTtl() should be return metadata.getGcGraceSeconds(); instead of return Math.max(3 * 3600, metadata.getGcGraceSeconds());. We do this so that we don't need to wait 3 hours.
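The step-1 change can be sketched as below. Both return statements are taken from the issue text; `CFMetaData` here is only a minimal stand-in interface for the real table-metadata type, of which just `getGcGraceSeconds()` is assumed.

```java
// Sketch of the SystemKeyspace.paxosTtl() change described in repro step 1.
// CFMetaData is a stand-in; only getGcGraceSeconds() is used.
public class PaxosTtlSketch {
    interface CFMetaData { int getGcGraceSeconds(); }

    // Original: a floor of 3 hours, so paxos state outlives short gc_grace values.
    static int paxosTtlOriginal(CFMetaData metadata) {
        return Math.max(3 * 3600, metadata.getGcGraceSeconds());
    }

    // Modified for the repro: use gc_grace directly, so the test does not
    // have to wait 3 hours before paxos state can expire.
    static int paxosTtlForRepro(CFMetaData metadata) {
        return metadata.getGcGraceSeconds();
    }

    public static void main(String[] args) {
        CFMetaData table = () -> 120; // gc_grace_seconds = 120, as in the repro
        System.out.println(paxosTtlOriginal(table)); // 10800 (the 3-hour floor wins)
        System.out.println(paxosTtlForRepro(table)); // 120
    }
}
```

With the modified method, the `gc_grace_seconds=120` set in the repro below directly becomes the paxos TTL, so expiry can be observed within minutes.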
> 2. Create a 3 node cluster, with the code change suggested above, with machines A, B and C:
> CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
> use test;
> CREATE TABLE users (a int PRIMARY KEY, b int);
> alter table users WITH gc_grace_seconds=120;
> consistency QUORUM;
> Bring down machine C.
> INSERT INTO users (a, b) VALUES (1, 1) IF NOT EXISTS;
> Nodetool flush on machines A and B.
> Bring up the down machine B.
> consistency SERIAL;
> tracing on;
> Wait 80 seconds.
> Bring up machine C.
> select * from users where a = 1;
> Wait 40 seconds.
> select * from users where a = 1; -- All queries from this point forward will time out.
> One potential fix could be to set the TTL based on the remaining time left on the other replicas: the full TTL minus the age of the write, where the write's timestamp is calculated from the ballot (which uses server time).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
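The fix proposed at the end of the issue (replay the commit with only its remaining TTL, not a fresh one) can be sketched as follows. This is a hedged illustration, not Cassandra's actual implementation; the method name `remainingTtl` and its parameters are invented for this sketch, with only the arithmetic (TTL minus the write's age, with the write time derived from the ballot) taken from the issue text.

```java
// Hedged sketch of the proposed fix: when replaying most_recent_commit to an
// out-of-sync replica, shrink the TTL by the commit's age, so every replica
// expires it at roughly the same wall-clock time. Names are illustrative.
public class AdjustedPaxosTtl {
    // paxosTtlSec:  the full paxos TTL (gc_grace in the repro).
    // writeTimeSec: the commit's write time in seconds, derived from its
    //               ballot (which uses server time).
    // nowSec:       current server time in seconds.
    static int remainingTtl(int paxosTtlSec, long writeTimeSec, long nowSec) {
        long remaining = paxosTtlSec - (nowSec - writeTimeSec);
        // Never write a non-positive TTL: at that point the commit has
        // already expired everywhere and need not be replayed at all.
        return (int) Math.max(remaining, 0);
    }

    public static void main(String[] args) {
        // Repro numbers: TTL 120s, commit written at t=0, replayed to C at t=80.
        System.out.println(remainingTtl(120, 0, 80));  // 40: C now expires at t=120, same as A and B
        System.out.println(remainingTtl(120, 0, 130)); // 0: already expired, skip the replay
    }
}
```

With this adjustment, the replayed copy on C in the repro above would expire at t=120 together with A's and B's copies, removing the staggered-expiry window that causes the prepare-phase ping-pong.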