Steven Schaefer created CASSANDRA-12438:
-------------------------------------------

             Summary: Data inconsistencies with lightweight transactions, serial reads, and rejoining node
                 Key: CASSANDRA-12438
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12438
             Project: Cassandra
          Issue Type: Bug
            Reporter: Steven Schaefer


I've run into some issues with data inconsistency in a situation where a single 
node is rejoining a 3-node cluster with RF=3. I'm running 3.7.

I have a client system which inserts data into a table with around 7 columns, 
named, let's say, A-F, id, and version. LWTs are used for both the inserts and 
the updates.
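
For reference, the schema and the initial LWT insert look roughly like the 
following (keyspace, table, and column names are placeholders, and I'm assuming 
the usual IF NOT EXISTS form for the insert; the real statements differ only in 
naming):
{noformat}
-- Placeholder schema; real names/types differ
CREATE TABLE ks.items (
    id      uuid PRIMARY KEY,
    a       text,
    b       text,
    c       text,
    d       text,
    e       text,
    f       text,
    version int
);

-- Initial insert done as an LWT
INSERT INTO ks.items (id, a, b, c, d, e, f, version)
VALUES (uuid(), 'V_a1', 'V_b1', 'V_c1', 'V_d1', 'V_e1', 'V_f1', 1)
IF NOT EXISTS;
{noformat}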

Typically what happens is there's an insert of values <id, V_a1, V_b1, ... , 
version=1>, then another process picks up rows with, for example, A=V_a1 and 
subsequently updates A to V_a2 and version to 2. Yet another process watches 
for A=V_a2 and then makes a second update to the same column, setting version 
to 3, with the end result being <id, V_a3, V_b1, ... , V_f1, version=3>. 
There's a secondary index on this A column (there are only a few possible 
values for A, so I'm not worried about the cardinality issue), though I've 
reproduced this with the new SASI index too.
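
The index and the two conditional updates are roughly as follows (again 
placeholder names, and the IF conditions are simplified; the SASI repro just 
swaps the index for a CREATE CUSTOM INDEX ... USING 
'org.apache.cassandra.index.sasi.SASIIndex'):
{noformat}
-- Secondary index on the A column
CREATE INDEX items_a_idx ON ks.items (a);

-- First processor: picks up A=V_a1 rows and moves them to V_a2
UPDATE ks.items SET a = 'V_a2', version = 2
WHERE id = ? IF a = 'V_a1' AND version = 1;

-- Second processor: picks up A=V_a2 rows and moves them to V_a3
UPDATE ks.items SET a = 'V_a3', version = 3
WHERE id = ? IF a = 'V_a2' AND version = 2;
{noformat}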

If one of the nodes is down, there are still 2 alive for quorum, so inserts can 
still happen. When I bring the downed node back up, I sometimes get really 
strange state back, which ultimately crashes the client system that's talking 
to Cassandra.

When reading I always select all the columns, but there is a conditional WHERE 
clause on A=V_a2 (e.g. SELECT * FROM table WHERE A=V_a2). This read is for 
processing any rows with V_a2, and ultimately updating them to V_a3 when 
complete. While periodically polling for A=V_a2 it is of course possible for 
the poller to observe the old V_a2 value while the other parts of the client 
system process and make the update to V_a3, and that's generally OK because of 
the LWTs used for updates; an occasionally wasted reprocessing run isn't a big 
deal. But when reading at SERIAL I always expect to also get the original 
values for the columns that were never updated, and if a Paxos update is in 
progress I expect it to be completed before its value(s) are returned. Instead, 
the read seems to be seeing a partial commit of the LWT, returning the old V_a2 
value for the changed column but no values whatsoever for the other columns. 
From the example above, instead of getting <id, V_a3, V_b1, ... , version=3>, 
or even the older <id, V_a2, V_b1, ..., version=2> (either of which I expect 
and am fine with), I get only <id, V_a2, version=2>, so the rest of the columns 
end up null, which I never expect. However, this isn't persistent; Cassandra 
does end up consistent, which I can see via sstabledump and cqlsh after the 
fact.
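
In cqlsh terms the problematic read is essentially the following (the client 
system issues the equivalent through the driver with SERIAL read consistency; 
the expected/observed rows are from the example above):
{noformat}
CONSISTENCY SERIAL;
SELECT id, a, b, c, d, e, f, version FROM ks.items WHERE a = 'V_a2';

-- Expected: <id, V_a2, V_b1, ..., version=2> or <id, V_a3, V_b1, ..., version=3>
-- Observed (occasionally, while the third node is rejoining):
--           <id, V_a2, null, null, null, null, null, version=2>
{noformat}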

In my client system logs I record the inserts/updates, and this inconsistency 
happens around the same time as the update from V_a2 to V_a3, hence my comment 
about Cassandra seeing a partial commit. That leads me to suspect that, because 
of the WHERE clause in my read query for A=V_a2, one of the original good nodes 
already has the new V_a3 value, so it doesn't return the row for the select 
query, while the other good node and the one that was down still have the old 
value V_a2, so those 2 nodes return what they have. The node that was down 
doesn't yet have the original insert, just the update from V_a1 -> V_a2 (again, 
I suspect; it hasn't been easy to verify), which would explain where <id, V_a2, 
version=2> comes from: that's all it knows about. However, since it's a serial 
quorum read, I'd expect some sort of exception, as neither of the remaining 2 
nodes with A=V_a2 would be able to come to a quorum on the values for all the 
columns; I'd expect the other good node to return <id, V_a2, V_b1, ..., 
version=2>.
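
To illustrate that suspicion, the per-node state I'm guessing exists at the 
moment of the bad read would be something like:
{noformat}
node1 (up the whole time):  <id, V_a3, V_b1, ..., V_f1, version=3>  // already updated to V_a3, so it
                                                                    // doesn't match the A=V_a2 query
node2 (up the whole time):  <id, V_a2, V_b1, ..., V_f1, version=2>  // matches and has the full row
node3 (just restarted):     <id, V_a2, version=2>                   // only saw the V_a1 -> V_a2 update,
                                                                    // never the original insert
{noformat}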

I know at some point nodetool repair should be run on this node, but I'm 
concerned about the window of time between when the node comes back up and 
when repair starts/completes. It almost seems like, if a node goes down, the 
safest bet is to remove it from the cluster and rebuild it instead of simply 
restarting it? However, I haven't tested that to see whether it runs into a 
similar situation.
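
For clarity, the repair I mean is just a plain per-table repair once the node 
is back up (placeholder keyspace/table names):
{noformat}
nodetool repair ks items
{noformat}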

It is of course possible to work around the inconsistency for now by detecting 
and ignoring it in the client system, but if there is indeed a bug I hope we 
can identify it and ultimately resolve it.

I'm also curious whether this relates to CASSANDRA-12126; CASSANDRA-11219 may 
be relevant as well.

I've been reproducing this with a combination of manually stopping one node, 
running a test script I have that triggers the client system to insert data, 
then manually restarting the node and waiting. It's consistently inconsistent, 
reproducing on most attempts.

Summary timeline:
{noformat}
1.  Shut down the third node of three.
2.  Insert <id, V_a1, V_b1, ... , version=1> (along with many others)
3.  Start the third node. (the start occurs concurrently with 4 & 5)
4.  Update <id, V_a2, V_b1, ... , version=2> (along with others for which A=V_a1)
5.  Update <id, V_a3, V_b1, ... , version=3> (along with many others for which A=V_a2)
6.  Read
    a.  Expected: <id, V_a2, V_b1, ... , version=2> OR <id, V_a3, V_b1, ... , version=3>
    b.  Actual: <id, V_a2, version=2> // some fields are null
{noformat}



