[ 
https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116578#comment-17116578
 ] 

Sylvain Lebresne commented on CASSANDRA-12126:
----------------------------------------------

It definitively doesn't look good that this messages comes so late, but I feel 
this is a serious issue of the {{SERIAL}}/{{LOCAL_SERIAL}} consistency levels 
since this breaks the basic guarantee they exist to provide, and as such should 
be fixed all the way down 3.0, and the sooner, the better.

In an attempt to sum this up quickly, the problem we have here affects both 
serial reads _and_ LWT updates that do not apply (whose condition evaluates to 
{{false}}). In both case, while the current code replays "effectively 
committed" proposals (those whose proposal has been accepted by majority of 
replica) with {{beginAndRepairPaxos}}, neither make proposals of their own, so 
nothing will prevent a proposal accepted by a minority of replica (say just 
one) to be later replayed (and thus committed).

I've pushed [2 in-jvm 
dtests|https://github.com/pcmanus/cassandra/commit/3442277905362b38e0d6a2b8170916fcfd18d469]
 that demonstrate the issue for both cases (again, serial reads and 
non-applying updates). They use "filters" to selectively drop messages to make 
failure consistent but aren't otherwise very involved.

As [~kohlisankalp] mentioned initially, the "simplest"\[1\] way to fix this 
that I see is to commit an empty update in both cases. Actually committing, 
which sets the {{mostRecentCommit}} value in the Paxos state, ensures that no 
prior proposal can ever be replayed. I've pushed a patch to do so on 3.0/3.11 
below (will merge up on 4.0, but wanted to make sure we're ok on the approach 
first):

||version||
| [3.0|https://github.com/pcmanus/cassandra/commits/C-12126-3.0] |
| [3.11|https://github.com/pcmanus/cassandra/commits/C-12126-3.11] |

The big downside of this patch however is the performance impact. Currently, a 
{{SERIAL}} read (that finds nothing in progress it needs to replay) is 2 
round-trips (a prepare phase, followed by the actual read). With this patch, it 
is 3 round-trips as we have to propose our empty commit and get acceptance (we 
don't really have to wait for responses on the commit though), which will be 
noticeable for performance sensitive use-cases. Similarly, the performance of 
LWT that don't apply will be impacted.

That said, I don't seen another approach to fixing this that would be as 
acceptable for 3.0/3.11 in terms of risks, and imo 'slower but correct' beats 
'faster but broken' any day, so I'm in favor of moving forward with this fix.

Opinions?



\[1\]: I mean by that both the simplicity of the change, but also of validating 
that this fix the problem at hand without creating new correctness problems.


> CAS Reads Inconsistencies 
> --------------------------
>
>                 Key: CASSANDRA-12126
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12126
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Feature/Lightweight Transactions, Legacy/Coordination
>            Reporter: Sankalp Kohli
>            Assignee: Sylvain Lebresne
>            Priority: Normal
>              Labels: LWT, pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> While looking at the CAS code in Cassandra, I found a potential issue with 
> CAS Reads. Here is how it can happen with RF=3
> 1) You issue a CAS Write and it fails in the propose phase. A machine replies 
> true to a propose and saves the commit in accepted filed. The other two 
> machines B and C does not get to the accept phase. 
> Current state is that machine A has this commit in paxos table as accepted 
> but not committed and B and C does not. 
> 2) Issue a CAS Read and it goes to only B and C. You wont be able to read the 
> value written in step 1. This step is as if nothing is inflight. 
> 3) Issue another CAS Read and it goes to A and B. Now we will discover that 
> there is something inflight from A and will propose and commit it with the 
> current ballot. Now we can read the value written in step 1 as part of this 
> CAS read.
> If we skip step 3 and instead run step 4, we will never learn about value 
> written in step 1. 
> 4. Issue a CAS Write and it involves only B and C. This will succeed and 
> commit a different value than step 1. Step 1 value will never be seen again 
> and was never seen before. 
> If you read the Lamport “paxos made simple” paper and read section 2.3. It 
> talks about this issue which is how learners can find out if majority of the 
> acceptors have accepted the proposal. 
> In step 3, it is correct that we propose the value again since we dont know 
> if it was accepted by majority of acceptors. When we ask majority of 
> acceptors, and more than one acceptors but not majority has something in 
> flight, we have no way of knowing if it is accepted by majority of acceptors. 
> So this behavior is correct. 
> However we need to fix step 2, since it caused reads to not be linearizable 
> with respect to writes and other reads. In this case, we know that majority 
> of acceptors have no inflight commit which means we have majority that 
> nothing was accepted by majority. I think we should run a propose step here 
> with empty commit and that will cause write written in step 1 to not be 
> visible ever after. 
> With this fix, we will either see data written in step 1 on next serial read 
> or will never see it which is what we want. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to