[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node

2018-09-27 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630602#comment-16630602
 ] 

Jeffrey F. Lukman commented on CASSANDRA-12438:
---

Yes, we also performed some reads after all related messages of the client 
requests had been executed, to verify consistency among the nodes.

We ran this query:

SELECT * FROM test.tests WHERE name = 'cass-12438';

We executed this query against each node using cqlsh.
When the bug manifests, nodes X and Y return the expected result, while node Z 
returns the buggy result.
The data are therefore inconsistent among the nodes.
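
Concretely, the per-node check looks roughly like this in cqlsh (the host 
addresses are placeholders; test.tests is the table from our reproduction setup):

-- connect cqlsh to each node in turn, e.g. (placeholder addresses):
--   cqlsh 10.0.0.1    # node X
--   cqlsh 10.0.0.2    # node Y
--   cqlsh 10.0.0.3    # node Z
CONSISTENCY ONE;  -- let the contacted coordinator answer from a single replica
SELECT * FROM test.tests WHERE name = 'cass-12438';
-- when the bug manifests: X and Y return the fully populated row,
-- while Z returns a row with some columns null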

> Data inconsistencies with lightweight transactions, serial reads, and 
> rejoining node
> 
>
> Key: CASSANDRA-12438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12438
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Steven Schaefer
>Priority: Major
>
> I've run into some issues with data inconsistency in a situation where a 
> single node is rejoining a 3-node cluster with RF=3. I'm running 3.7.
> I have a client system which inserts data into a table with around 7 columns, 
> named, let's say, A-F, id, and version. LWTs are used to make the inserts and 
> updates.
> Typically what happens is there's an insert of values id, V_a1, V_b1, ..., 
> version=1, then another process will pick up rows with, for example, A=V_a1 and 
> subsequently update A to V_a2 and version=2. Yet another process will watch 
> for A=V_a2 to then make a second update to the same column and set version 
> to 3, with the end result being <id, V_a3, V_b1, ..., version=3>. There's a 
> secondary index on this A column (there are only a few possible values for A, 
> so I'm not worried about the cardinality issue), though I've reproduced with 
> the new SASI index too.
> If one of the nodes is down, there are still 2 alive for quorum, so inserts can 
> still happen. When I bring up the downed node, sometimes I get really weird 
> state back which ultimately crashes the client system that's talking to 
> Cassandra.
> When reading I always select all the columns, but there is a conditional 
> WHERE clause on A=V_a2 (e.g. SELECT * FROM table WHERE A=V_a2). This read 
> is for processing any rows with V_a2, and ultimately updating to V_a3 when 
> complete. While periodically polling for A=V_a2 it is of course possible for 
> the poller to observe the old V_a2 value while the other parts of the 
> client system process and make the update to V_a3, and that's generally OK 
> because of the LWTs used for updates; an occasionally wasted reprocessing run 
> isn't a big deal. But when reading at SERIAL I always expect to also get the 
> original values for columns that were never updated. If a Paxos update is 
> in progress then I expect it to be completed before its value(s) are returned. 
> Instead, the read seems to be seeing a partial commit of the LWT, returning 
> the old V_a2 value for the changed column, but no values whatsoever for the 
> other columns. From the example above, instead of getting 
> <id, V_a3, V_b1, ..., version=3>, or even the older 
> <id, V_a2, V_b1, ..., version=2> (either of which I expect and are OK), 
> I get only <id, V_a2>, so the rest of the columns end up null, which I never 
> expect. However this isn't persistent; Cassandra does end up consistent, 
> which I see via sstabledump and cqlsh after the fact.
> In my client system logs I record the inserts/updates, and this 
> inconsistency happens around the same time as the update from V_a2 to V_a3, 
> hence my comment about Cassandra seeing a partial commit. That leads me to 
> suspect that, perhaps due to the WHERE clause in my read query for A=V_a2, 
> one of the original good nodes already has the new V_a3 value, so it 
> doesn't return this row for the select query, but the other good node and the 
> one that was down still have the old value V_a2, so those 2 nodes return what 
> they have. The one that was down doesn't yet have the original insert, just 
> the update from V_a1 -> V_a2 (again I suspect; it's not been easy to verify), 
> which would explain where <id, V_a2> comes from; that's all it 
> knows about. However, since it's a serial quorum read, I'd expect some sort of 
> exception, as neither of the remaining 2 nodes with A=V_a2 would be able to 
> come to a quorum on the values for all the columns; I'd expect the other 
> good node to return <id, V_a2, V_b1, ..., version=2>.
> I know at some point nodetool repair should be run on this node, but I'm 
> concerned about the window of time between when the node comes back up and 
> repair starts/completes. It almost seems like if a node goes down the safest 
> bet is to remove it from the cluster and rebuild, instead of simply 
> restarting the node? However I haven't tested that to see if it runs into a 
> similar situation.
> It is of course possible to work around the inconsistency for now by 
> detecting and ignoring it in the client system, but if there is indeed a bug 
> I hope we can identify it and ultimately resolve it.
> I'm also curious if this relates to CASSANDRA-12126, and also CASSANDRA-11219 
> may be relevant.
> I've been reproducing with a combination of manually 

[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node

2018-09-27 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629887#comment-16629887
 ] 

Benedict commented on CASSANDRA-12438:
--

You must also be performing some reads? What reads are you performing to verify 
the cluster state, at what consistency level, and routed to which coordinator?


[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node

2018-09-26 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629632#comment-16629632
 ] 

Jeffrey F. Lukman commented on CASSANDRA-12438:
---

Hi [~benedict] ,

Following the bug description, we integrated our model checker with 
Cassandra v3.7.
We grabbed the code from the GitHub repository.

Regarding the schema, here is the initial schema that we prepared before we 
inject any queries in the model checker's path execution:
 * CREATE KEYSPACE test WITH REPLICATION = {'class': 'SimpleStrategy', 
'replication_factor': 3};
 * CREATE TABLE tests (name text PRIMARY KEY, owner text, value_1 text, value_2 
text, value_3 text, value_4 text, value_5 text, value_6 text, value_7 text);

Regarding the operations/queries, here are the details:
 * Client Request 1: INSERT INTO test.tests (name, owner, value_1, value_2, 
value_3, value_4, value_5, value_6, value_7) VALUES ('cass-12438', 'user_1', 
'A1', 'B1', 'C1', 'D1', 'E1', 'F1', 'G1') IF NOT EXISTS;
 * Client Request 2: UPDATE test.tests SET value_1 = 'A2', owner = 'user_2' 
WHERE name = 'cass-12438' IF owner = 'user_1';
 * Client Request 3: UPDATE test.tests SET value_1 = 'A3', owner = 'user_3' 
WHERE name = 'cass-12438' IF owner = 'user_2';

The messages generated by these queries are the ones that the model checker 
controls and reorders, which is how we ended up reproducing this bug.


[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node

2018-09-26 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629454#comment-16629454
 ] 

Benedict commented on CASSANDRA-12438:
--

There's a lot of information here that I haven't fully parsed, partly 
because of the pseudo-code (it's helpful to post actual schemas and 
operations/queries).

However, if you are performing a QUORUM read of *just* {{V_a2/3}}, by itself 
(to any node: X, Y, or Z), before querying node Z directly at ONE, then it's 
probable you are encountering CASSANDRA-14593.

The best workaround for this would be to always query all of the columns/rows 
you want to see updated atomically. Never select a subset.  
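
A minimal sketch of that workaround, assuming the test.tests schema posted 
elsewhere in this thread (the subset query is only an illustration of what to 
avoid):

-- avoid: reading only a subset of the columns that the LWTs write together;
-- per the discussion here, read-repair of such a subset can surface a
-- partially-visible row
SELECT owner, value_1 FROM test.tests WHERE name = 'cass-12438';

-- prefer: always read the full set of columns that are written atomically,
-- so the row is observed (and repaired) as a whole
SELECT * FROM test.tests WHERE name = 'cass-12438';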

 

You could also patch your Cassandra instance to not persist the results of 
read-repair.  The upcoming 4.0 release will have the ability to disable it for 
exactly this reason, but this probably won't be released for several months.


[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node

2018-09-26 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629434#comment-16629434
 ] 

Jeffrey F. Lukman commented on CASSANDRA-12438:
---

Hi all,

 

Our team from UCARE at the University of Chicago has been able to reproduce this 
bug consistently with our model checker.
Here are the workload and scenario for the bug:

Workload: a 3-node cluster (let's call the nodes X, Y, and Z), 1 crash event, 
1 reboot event, and 3 client requests (node X is the coordinator node for all 
client requests).

Scenario:
 # Start the 3 nodes and set CONSISTENCY = ONE.
 # Inject client request 1 as described in the bug description:
Insert <id, V_a1, V_b1, ...> (along with many others)
 # But before any PREPARE messages have been sent by node X, node Z crashes.
 # Client request 1 is successfully committed on nodes X and Y.
 # Reboot node Z.
 # Inject client requests 2 & 3 as described in the bug description:
Update <A=V_a2> (along with others, for rows for which A=V_a1)
Update <A=V_a3> (along with many others, for rows for which A=V_a2)
(Update 3 can also be omitted if we want to simplify the bug scenario.)
 # If client request 2 finishes first, without interference from client 
request 3, then we expect to see (as sketched below):
<id, V_a3, V_b1, ...> (request 3's condition A=V_a2 is met, so both updates apply)
If client request 3 interferes with client request 2, or is executed before 
client request 2 for any reason, then we expect to see:
<id, V_a2, V_b1, ...> (request 3's condition is not met, so only request 2 applies)
 # But our model checker shows that if we do a read request to node Z, we see a 
partial row in which some fields are null, while a read request to node X or Y 
returns the complete result.
This means we end up with an inconsistent view among the nodes.

If we run this scenario with CONSISTENCY = ALL, we do not see this bug happen.
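
For reference, here is a rough cqlsh sketch of the two orderings above, using 
the queries listed in our other comment (the [applied] values are what the LWT 
conditions imply, not captured output):

-- Ordering A: request 2 completes before request 3
UPDATE test.tests SET value_1 = 'A2', owner = 'user_2'
  WHERE name = 'cass-12438' IF owner = 'user_1';   -- [applied] = True
UPDATE test.tests SET value_1 = 'A3', owner = 'user_3'
  WHERE name = 'cass-12438' IF owner = 'user_2';   -- [applied] = True
-- expected final row: owner='user_3', value_1='A3', other columns from the insert

-- Ordering B: request 3 runs before (or interferes with) request 2
UPDATE test.tests SET value_1 = 'A3', owner = 'user_3'
  WHERE name = 'cass-12438' IF owner = 'user_2';   -- [applied] = False (owner is still 'user_1')
UPDATE test.tests SET value_1 = 'A2', owner = 'user_2'
  WHERE name = 'cass-12438' IF owner = 'user_1';   -- [applied] = True
-- expected final row: owner='user_2', value_1='A2', other columns from the insert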

We are happy to assist you in debugging this issue.
