[ https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630688#comment-16630688 ]
Jeffrey F. Lukman commented on CASSANDRA-12126:
-----------------------------------------------

During our testing with our model checker, we limit the number of Paxos rounds for each query, because otherwise we can get stuck in a very long sequence of message exchanges among the nodes without making any progress. So we execute only one round of Paxos for each query.

To clarify our test and tell the whole story, here is what happened in detail:
* We first prepared the 3-node cluster with the test.tests table as the initial table structure and, yes, the initial table began with: {name:'testing', owner:'user_1', value1:null, value2:null, value3:null}
* Next, we ran the model checker, which starts the 3-node cluster.
* We injected the 3 client requests in order: query 1, then query 2, then query 3. This causes query 1's ballot number < query 2's ballot number < query 3's ballot number.
* This means that, at the beginning, the model checker already sees 9 prepare messages in its queue, which it will reorder in some way.
* When the bug manifested, we ended up with:
** Node X's prepare messages proceed and all nodes respond with true back to node X.
** Node X sends its propose message with value_1='A' to itself first and gets a true response as well.
** At this moment, node X's inProgress value is updated to the proposed value, value_1='A'.
** But then node Y's prepare messages proceed and all nodes respond with true back to node Y, because node Y's prepare messages have a higher ballot number.
** But when node Y is about to send its propose messages, it realizes that the current data does not fulfill the IF condition, so it does not proceed to the propose messages.
--> Client request 2 to node Y is therefore rejected.
** Node X's propose messages to nodes Y and Z then continue, and both return false to node X.
** At this point node X should be able to retry Paxos with a higher ballot number, but since we limit each query to one round of Paxos, client request 1 to node X times out.
** Lastly, node Z sends its prepare messages to all nodes and gets true responses from all of them, because its ballot number is higher as well.
** At this point, if node X's response message is returned to node Z first, node Z will realize that node X still has an inProgress value (value_1='A'). This causes node Z to send propose messages and commit messages, but for client request 1, using the current highest ballot number. Here we have our first data update saved: value_1='A', value_2=null, value_3=null.
** Back to our constraint of one Paxos round for each query: we ended up not retrying client request-3 because we reached the timeout.
* To sum up:
** client request-1: Timed out
** client request-2: Rejected
** client request-3: Timed out

This gives us an inconsistency between the client side and the server side: all requests actually failed, but when we read the end result again from all nodes, we get value_1='A', value_2=null, value_3=null.

I made a wrong statement at the end of my first comment:
{quote}9. Therefore, we ended up having client request 1 stored to the server, although client request-3 was the one that is said successful.
{quote}
Client request-3 should instead be reported as failed due to timeout.

> CAS Reads Inconsistencies
> --------------------------
>
>                 Key: CASSANDRA-12126
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12126
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: sankalp kohli
>            Priority: Major
>              Labels: LWT
>
> While looking at the CAS code in Cassandra, I found a potential issue with
> CAS Reads.
> Here is how it can happen with RF=3:
> 1) You issue a CAS Write and it fails in the propose phase. Machine A replies
> true to a propose and saves the commit in the accepted field. The other two
> machines, B and C, do not get to the accept phase.
> Current state is that machine A has this commit in the paxos table as accepted
> but not committed, and B and C do not.
> 2) Issue a CAS Read and it goes to only B and C. You won't be able to read the
> value written in step 1. This step is as if nothing is in flight.
> 3) Issue another CAS Read and it goes to A and B. Now we will discover that
> there is something in flight from A and will propose and commit it with the
> current ballot. Now we can read the value written in step 1 as part of this
> CAS read.
> If we skip step 3 and instead run step 4, we will never learn about the value
> written in step 1.
> 4) Issue a CAS Write and it involves only B and C. This will succeed and
> commit a different value than step 1. The step 1 value will never be seen again
> and was never seen before.
> If you read Lamport's “Paxos Made Simple” paper, section 2.3
> talks about this issue, which is how learners can find out if a majority of the
> acceptors have accepted the proposal.
> In step 3, it is correct that we propose the value again since we don't know
> if it was accepted by a majority of acceptors. When we ask a majority of
> acceptors, and more than one acceptor but not a majority has something in
> flight, we have no way of knowing if it is accepted by a majority of acceptors.
> So this behavior is correct.
> However, we need to fix step 2, since it causes reads to not be linearizable
> with respect to writes and other reads. In this case, we know that a majority
> of acceptors have no in-flight commit, which means a majority confirms that
> nothing was accepted by a majority. I think we should run a propose step here
> with an empty commit, and that will cause the write written in step 1 to not be
> visible ever after.
> With this fix, we will either see the data written in step 1 on the next serial read
> or will never see it, which is what we want.
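
Steps 1 to 3 of the issue description, and the proposed empty-propose change to step 2, can be reproduced in a toy model. What follows is only a minimal sketch under stated assumptions and is not Cassandra's code: the Acceptor class, the EMPTY marker, serial_read and run are hypothetical names invented for this illustration, ballots are plain integers, and the model leaves out timeouts, the IF condition, and tracking of the most recent commit.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

EMPTY = "<empty>"  # hypothetical marker standing in for an empty proposal


@dataclass
class Acceptor:
    """Toy replica state (hypothetical; not Cassandra's paxos table)."""
    promised: int = 0
    accepted: Optional[Tuple[int, str]] = None  # (ballot, value), not yet committed
    committed: Optional[str] = None

    def prepare(self, ballot: int):
        # Promise only ballots higher than anything promised so far.
        if ballot <= self.promised:
            return False, None
        self.promised = ballot
        return True, self.accepted

    def propose(self, ballot: int, value: str) -> bool:
        # Accept unless a higher ballot has been promised in the meantime.
        if ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted = (ballot, value)
        return True

    def commit(self, value: str) -> None:
        self.committed = value


def serial_read(ballot: int, replicas: List[Acceptor], empty_propose: bool):
    """One-round serial read against the given quorum of replicas."""
    replies = [r.prepare(ballot) for r in replicas]
    assert all(ok for ok, _ in replies)
    in_flight = [acc for _, acc in replies if acc is not None]
    latest = max(in_flight, key=lambda p: p[0]) if in_flight else None
    if latest is not None and latest[1] != EMPTY:
        # Something real is in flight: re-propose and commit it (step 3).
        for r in replicas:
            r.propose(ballot, latest[1])
        for r in replicas:
            r.commit(latest[1])
    elif empty_propose:
        # Proposed change: nothing in flight on this quorum, so propose an
        # empty value that supersedes any older minority proposal.
        for r in replicas:
            r.propose(ballot, EMPTY)
    return next((r.committed for r in replicas if r.committed is not None), None)


def run(empty_propose: bool):
    a, b, c = Acceptor(), Acceptor(), Acceptor()
    # Step 1: CAS write with ballot 1; the propose reaches only A, so it fails.
    for r in (a, b, c):
        r.prepare(1)
    a.propose(1, "v1")
    first = serial_read(2, [b, c], empty_propose)   # step 2: goes to B and C
    second = serial_read(3, [a, b], empty_propose)  # step 3: goes to A and B
    return first, second


if __name__ == "__main__":
    print(run(empty_propose=False))  # (None, 'v1')
    print(run(empty_propose=True))   # (None, None)
```

Without the empty propose, the second read returns a value that the first read did not see, even though no write succeeded in between. With it, the step 1 value is superseded on B by the empty proposal carrying a higher ballot, so the later read no longer replays it.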