[ https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629632#comment-16629632 ]
Jeffrey F. Lukman commented on CASSANDRA-12438: ----------------------------------------------- Hi [~benedict] , Following the bug description, we integrate our model checker with Cassandra-v3.7. We grabbed the code from the github repository. Regarding the scheme, here is the initial scheme that we have prepared before we inject any queries in the model checker path execution: * CREATE KEYSPACE test WITH REPLICATION = \{'class': 'SimpleStrategy', 'replication_factor': 3}; * CREATE TABLE tests (name text PRIMARY KEY, owner text, value_1 text, value_2 text, value_3 text, value_4 text, value_5 text, value_6 text, value_7 text); Regarding the operations/queries, here are the details of them: * INSERT INTO test.tests (name, owner, value_1, value_2, value_3, value_4, value_5, value_6, value_7) VALUES ('cass-12438', 'user_1', 'A1', 'B1', 'C1', 'D1', 'E1', 'F1', 'G1') IF NOT EXISTS; * Client Request 2: UPDATE test.tests SET value_1 = 'A2', owner = 'user_2' WHERE name = 'cass-12438' IF owner = 'user_1'; * Client Request 3: UPDATE test.tests SET value_1 = 'A3', owner = 'user_3' WHERE name = 'cass-12438' IF owner = 'user_2'; The messages from these queries here are the one that the model checker control and reorder in some way, so that we ended up reproducing this bug. > Data inconsistencies with lightweight transactions, serial reads, and > rejoining node > ------------------------------------------------------------------------------------ > > Key: CASSANDRA-12438 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12438 > Project: Cassandra > Issue Type: Bug > Reporter: Steven Schaefer > Priority: Major > > I've run into some issues with data inconsistency in a situation where a > single node is rejoining a 3-node cluster with RF=3. I'm running 3.7. > I have a client system which inserts data into a table with around 7 columns, > named let's say A-F,id, and version. LWTs are used to make the inserts and > updates. > Typically what happens is there's an insert of values id, V_a1, V_b1, ... , > version=1, then another process will pick up rows with for example A=V_a1 and > subsequently update A to V_a2 and version=2. Yet another process will watch > for A=V_a2 to then make a second update to the same column, and set version > to 3, with end result being <id, V_a3, V_b1, ... , V_f1, version=3> There's a > secondary index on this A column (there's only a few possible values for A so > not worried about the cardinality issue), though I've reproed with the new > SASI index too. > If one of the nodes is down, there's still 2 alive for quorum so inserts can > still happen. When I bring up the downed node, sometimes I get really weird > state back which ultimately crashes the client system that's talking to > Cassandra. > When reading I always select all the columns, but there is a conditional > where clause that A=V_a2 (e.g. SELECT * FROM table WHERE A=V_a2). This read > is for processing any rows with V_a2, and ultimately updating to V_a3 when > complete. While periodically polling for A=V_a2 it is of course possible for > the poller to to observe the old V_a2 value while the other parts of the > client system process and make the update to V_a3, and that's generally ok > because of the LWTs used for updates, an occassionaly wasted reprocessing run > ins't a big deal, but when reading at serial I always expect to get the > original values for columns that were never updated too. If a paxos update is > in progress then I expect that completed before its value(s) returned. But > instead, the read seems to be seeing the partial commit of the LWT, returning > the old V_2a value for the changed column, but no values whatsoever for the > other columns. From the example above, instead of getting <id, V_a3 V_b1, ... > , version=3>, or even the older <id, V_a2, V_b1, ..., version=2> (either of > which I expect and are ok), I get only <id, V_a2, version=2>, so the rest of > the columns end up null, which I never expect. However this isn't persistent, > Cassandra does end up consistent, which I see via sstabledump and cqlsh after > the fact. > In my client system logs I record the insert / updates, and this > inconsistency happens around the same time as the update from V_a2 to V_a3, > hence my comment about Cassandra seeing a partial commit. So that leads me to > suspect that perhaps due to the where clause in my read query for A=V_a2, > perhaps one of the original good nodes already has the new V_a3 value, so it > doesn't return this row for the select query, but the other good node and the > one that was down still have the old value V_a2, so those 2 nodes return what > they have. The one that was down doesn't yet have the original insert, just > the update from V_a1 -> V_a2 (again I suspect, it's not been easy to verify), > which would explain where <id, V_a2, version=2> comes from, that's all it > knows about. However since it's a serial quorum read, I'd expect some sort of > exception as neither of the remaining 2 nodes with A=V_a2 would be able to > come to a quorum on the values for all the columns, as I'd expect the other > good node to return <id, V_a2, V_b1, ..., version=2> > I know at some point nodetool repair should be run on this node, but I'm > concerned about a window of time between when the node comes back up and > repair starts/completes. It almost seems like if a node goes down the safest > bet is to remove it from the cluster and rebuild, instead of simply > restarting the node? However I haven't tested that to see if it runs into a > similar situation. > It is of course possible to work around the inconsistency for now by > detecting and ignoring it in the client system, but if there is indeed a bug > I hope we can identify it and ultimately resolve it. > I'm also curious if this relates to CASSANDRA-12126, and also CASSANDRA-11219 > may be relevant. > I've been reproducing with a combination of manually stopping one node, > running a test script I have to trigger the client system to insert data, > then manually restarting the node and waiting. It's consistently > inconsistent, reproducing on most attempts > Summary timeline: > {noformat} > 1. Shut down third node of three. > 2. Insert <id, V_a1, V_b1, ... , version=1> (along with many others) > 3. Start the third node. (start occurs concurrently with 4 & 5) > 4. Update <id, V_a2, V_b1, ... , version=2> (along with others for which > A=V_a1) > 5. Update <id, V_a3, V_b1, ... , version=3> (along with many others for > which A=V_a2) > 5. Read > a. Expected: <id, V_a2, V_b1, ... , version=2> OR <id, V_a3, V_b1, > ... , version=3> > b. Actual: <id, V_a2, version=2> // some fields are null > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org