[jira] [Commented] (CASSANDRA-12126) CAS Reads Inconsistencies
[ https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630850#comment-16630850 ] Jeffrey F. Lukman commented on CASSANDRA-12126: ---

Thank you for your responses, [~jjordan] and [~kohlisankalp]. I think you have cleared up a misunderstanding for me (and our team): a timeout is a "gray area" in which the client cannot determine whether a request has been successfully processed.

One thing we would like to point out, based on the earlier discussion in this bug description:

{quote}However we need to fix step 2, since it caused reads to not be linearizable with respect to writes and other reads. In this case, we know that majority of acceptors have no inflight commit which means we have majority that nothing was accepted by majority. I think we should run a propose step here with empty commit and that will cause write written in step 1 to not be visible ever after.{quote}

What we originally tried to mimic with our model checker was exactly this scenario: node Y sees that a majority of nodes have no inProgress value, but node Z then sees the inProgress value on node X and tries to repair and commit it. So we confirm that we can also see this behavior:

{quote}2: Read -> Nothing
3: Read -> Something{quote}

We read nothing on node Y, yet node Z reads something on the next request. To sum up, our scenario explains this behavior as follows: node Y does not try to repair the Paxos session, because node X's prepare response arrives last; node Y therefore ignores node X's prepare response and bases its decision not to repair on the other responses. But during node Z's client request, node Z decides to repair the Paxos session based on node X's existing inProgress value_1='A', because node X's prepare response arrives early (1st or 2nd).
This causes node Y and node Z to react inconsistently (although each reaction is correct under the original Paxos algorithm). One possible way to avoid this inconsistency is for each node to decide whether or not to repair a Paxos session based on the complete view of the alive nodes; then, even if node X's response arrives last with an inProgress value, node Y would still repair the Paxos session.

> CAS Reads Inconsistencies
> --
>
> Key: CASSANDRA-12126
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12126
> Project: Cassandra
> Issue Type: Bug
> Components: Coordination
> Reporter: sankalp kohli
> Priority: Major
> Labels: LWT
>
> While looking at the CAS code in Cassandra, I found a potential issue with
> CAS Reads. Here is how it can happen with RF=3:
> 1) You issue a CAS Write and it fails in the propose phase. Machine A replies
> true to a propose and saves the commit in the accepted field. The other two
> machines B and C do not get to the accept phase.
> Current state is that machine A has this commit in the paxos table as accepted
> but not committed, and B and C do not.
> 2) Issue a CAS Read and it goes to only B and C. You won't be able to read the
> value written in step 1. This step is as if nothing is inflight.
> 3) Issue another CAS Read and it goes to A and B. Now we will discover that
> there is something inflight from A and will propose and commit it with the
> current ballot. Now we can read the value written in step 1 as part of this
> CAS read.
> If we skip step 3 and instead run step 4, we will never learn about the value
> written in step 1.
> 4) Issue a CAS Write and it involves only B and C. This will succeed and
> commit a different value than step 1. The step 1 value will never be seen again
> and was never seen before.
> If you read the Lamport "Paxos Made Simple" paper, section 2.3 talks about
> this issue: how learners can find out whether a majority of the acceptors
> have accepted the proposal.
> In step 3, it is correct that we propose the value again, since we don't know
> whether it was accepted by a majority of acceptors. When we ask a majority of
> acceptors, and more than one acceptor, but not a majority, has something in
> flight, we have no way of knowing whether it was accepted by a majority of
> acceptors. So this behavior is correct.
> However we need to fix step 2, since it causes reads to not be linearizable
> with respect to writes and other reads. In this case, we know that a majority
> of acceptors have no inflight commit, which means we know that nothing was
> accepted by a majority. I think we should run a propose step here with an
> empty commit, and that will cause the write from step 1 to never be visible
> afterwards.
> With this fix, we will either see the data written in step 1 on the next
> serial read or will never see it, which is what we want.
-- This message was sent by Atlassian JIRA (
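The ordering-dependent repair decision discussed in the comment above can be sketched in a few lines of Python. This is a simplified model, not Cassandra's actual Paxos code: a coordinator acts on the first quorum of prepare responses and repairs only if one of those responses carries an accepted-but-uncommitted value.

```python
# Sketch (not Cassandra code) of why coordinators Y and Z react differently
# depending on the order in which prepare responses arrive. Node X holds an
# accepted-but-uncommitted value; a coordinator acts as soon as a bare
# majority of responses is in, ignoring later arrivals.

def decide_repair(responses_in_arrival_order, quorum=2):
    """Return the in-progress value the coordinator will repair, or None.

    Each response is a (node, in_progress_value) pair, listed in arrival
    order. Only the first `quorum` responses are considered.
    """
    for _node, in_progress in responses_in_arrival_order[:quorum]:
        if in_progress is not None:
            return in_progress   # repair: re-propose this value
    return None                  # proceed as if nothing is in flight

# Coordinator Y: node X's response arrives last, outside the quorum -> no repair.
assert decide_repair([("Y", None), ("Z", None), ("X", "A")]) is None
# Coordinator Z: node X's response arrives first, inside the quorum -> repair 'A'.
assert decide_repair([("X", "A"), ("Y", None), ("Z", None)]) == "A"
```

If the decision instead waited for responses from all alive nodes, as suggested in the comment, both orderings above would see node X's value and repair it.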
[jira] [Commented] (CASSANDRA-12126) CAS Reads Inconsistencies
[ https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630688#comment-16630688 ] Jeffrey F. Lukman commented on CASSANDRA-12126: ---

During our testing with our model checker, we limit the number of Paxos rounds for each query; otherwise it is possible to get stuck in a very long sequence of message interleavings among the nodes without making any progress. So we execute only one round of Paxos per query.

To clarify our test and tie the whole story together, here is what happened in detail:
* We first prepared the 3-node cluster with the test.tests table as the initial table structure, and yes, the initial table began with: {name:'testing', owner:'user_1', value_1:null, value_2:null, value_3:null}
* Next, we ran the model checker, which starts the 3-node cluster.
* We injected the 3 client requests in order: query 1, then query 2, then query 3. This gives query 1 a ballot number < query 2's ballot number < query 3's ballot number.
* This means that, at the start, the model checker already sees 9 prepare messages in its queue, which it will reorder in some way.
* When the bug manifests, we end up with the following:
** Node X's prepare messages proceed, and all nodes respond true to node X.
** Node X sends its propose message with value_1='A' to itself first and gets a true response as well.
** At this moment, node X's inProgress value is updated to the proposed value, value_1='A'.
** Then node Y's prepare messages proceed, and all nodes respond true to node Y, because node Y's prepare messages carry a higher ballot number.
** But when node Y is about to send its propose messages, it realizes that the current data does not satisfy the IF condition, so it does not proceed to the propose messages.
--> Client request 2 to node Y is therefore rejected.
** Node X's propose messages then continue to nodes Y and Z; both are answered with false to node X.
** At this point node X would normally retry the Paxos with a higher ballot number, but since we limit each query to one Paxos round, client request 1 to node X times out.
** Lastly, node Z sends its prepare messages to all nodes and gets true responses from all of them, because its ballot number is higher still.
** At this point, if node X's response message is returned to node Z first, node Z will realize that node X still has an inProgress value (value_1='A'). This causes node Z to send propose messages and commit messages, but for client request 1, using the current highest ballot number. Here we have our first data update saved: value_1='A', value_2=null, value_3=null.
** Because of our one-round-of-Paxos-per-query constraint, client request 3 is never retried, so it reaches timeout.
* To sum up:
** client request-1: Timed out
** client request-2: Rejected
** client request-3: Timed out

So we get an inconsistency between the client side and the server side: from the clients' perspective all requests failed, but when we read the end result from all nodes, we get value_1='A', value_2=null, value_3=null.

I made a wrong statement at the end of my first comment: {quote}9. Therefore, we ended up having client request 1 stored to the server, although client request-3 was the one that is said successful.{quote} It should be: client request-3 failed due to timeout.
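The ballot mechanics behind the scenario above, where node X's delayed propose messages are rejected after node Y's higher-ballot prepare, can be sketched with a minimal Paxos-acceptor model. This is an illustrative sketch, not Cassandra's implementation:

```python
# Minimal acceptor sketch: an acceptor promises the highest ballot it has
# seen and rejects propose messages carrying a lower ballot.

class Acceptor:
    def __init__(self):
        self.promised = 0      # highest ballot promised so far
        self.accepted = None   # (ballot, value) accepted but not committed

    def receive_prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted   # promise, plus any in-progress value
        return False, None

    def receive_propose(self, ballot, value):
        if ballot >= self.promised:
            self.accepted = (ballot, value)
            return True
        return False

x, y, z = Acceptor(), Acceptor(), Acceptor()

# Ballot 1 (node X) prepares everywhere, then proposes only to itself.
for n in (x, y, z):
    assert n.receive_prepare(1)[0]
assert x.receive_propose(1, "A")   # X now holds in-progress value_1='A'

# Ballot 2 (node Y) prepares everywhere with a higher ballot.
for n in (x, y, z):
    ok, _ = n.receive_prepare(2)
    assert ok

# X's delayed ballot-1 proposes now reach Y and Z -> both rejected.
assert not y.receive_propose(1, "A")
assert not z.receive_propose(1, "A")
```

With only one Paxos round allowed per query, node X cannot retry at a higher ballot, which is why client request 1 times out in the scenario above.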
[jira] [Comment Edited] (CASSANDRA-12126) CAS Reads Inconsistencies
[ https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629685#comment-16629685 ] Jeffrey F. Lukman edited comment on CASSANDRA-12126 at 9/27/18 3:19 PM:

To complete our scenario, here is our Cassandra setup: we run the scenario with Cassandra v2.0.15. Here is the schema that we use:
* CREATE KEYSPACE test WITH REPLICATION = \{'class': 'SimpleStrategy', 'replication_factor': 3};
* CREATE TABLE tests ( name text PRIMARY KEY, owner text, value_1 text, value_2 text, value_3 text);

Here are the queries that we submit:
* client request to node X (1st): UPDATE test.tests SET value_1 = 'A' WHERE name = 'testing' IF owner = 'user_1';
* client request to node Y (2nd): UPDATE test.tests SET value_2 = 'B' WHERE name = 'testing' IF value_1 = 'A';
* client request to node Z (3rd): UPDATE test.tests SET value_3 = 'C' WHERE name = 'testing' IF value_1 = 'A';

To confirm: when the bug manifests, the end result will be value_1='A', value_2=null, value_3=null.

[~jjirsa], regarding our tool: at this point, it is not open to the public.
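Assuming the schema and queries above, the committed outcome when the bug manifests can be checked with a plain-Python model of the conditional updates. The helper `cas_update` is hypothetical, standing in for the LWT IF evaluation, and the commit order is the one the model checker produced:

```python
# Sketch of the three conditional updates applied to the initial row, in the
# order they are effectively decided: request 2's IF check runs before
# request 1's value is committed, request 1 is later repaired and committed
# by node Z, and request 3 times out and is never retried.

row = {"name": "testing", "owner": "user_1",
       "value_1": None, "value_2": None, "value_3": None}

def cas_update(row, sets, cond_col, cond_val):
    """Apply SET ... IF cond_col = cond_val; return True iff applied."""
    if row[cond_col] != cond_val:
        return False
    row.update(sets)
    return True

# Request 2 (node Y): IF value_1 = 'A' fails -- nothing is committed yet.
assert not cas_update(row, {"value_2": "B"}, "value_1", "A")
# Request 1 (repaired and committed by node Z): IF owner = 'user_1' holds.
assert cas_update(row, {"value_1": "A"}, "owner", "user_1")
# Request 3 (node Z's own query) reached timeout and never applied.
assert row == {"name": "testing", "owner": "user_1",
               "value_1": "A", "value_2": None, "value_3": None}
```

The final dictionary matches the end state reported above: value_1='A', value_2=null, value_3=null, even though every client call failed.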
[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node
[ https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630602#comment-16630602 ] Jeffrey F. Lukman commented on CASSANDRA-12438: ---

Yes, we also performed some reads after all messages related to the client requests had been executed, to verify consistency among the nodes. We ran this query: SELECT * FROM test.tests WHERE name = 'cass-12438'; We executed this query against each node using cqlsh. When the bug manifests, nodes X and Y return the expected result, while node Z returns the buggy result. The data are therefore inconsistent among the nodes.

> Data inconsistencies with lightweight transactions, serial reads, and
> rejoining node
>
> Key: CASSANDRA-12438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12438
> Project: Cassandra
> Issue Type: Bug
> Reporter: Steven Schaefer
> Priority: Major
>
> I've run into some issues with data inconsistency in a situation where a
> single node is rejoining a 3-node cluster with RF=3. I'm running 3.7.
> I have a client system which inserts data into a table with around 7 columns,
> named let's say A-F, id, and version. LWTs are used to make the inserts and
> updates.
> Typically what happens is there's an insert of values id, V_a1, V_b1, ...,
> version=1, then another process will pick up rows with for example A=V_a1 and
> subsequently update A to V_a2 and version=2. Yet another process will watch
> for A=V_a2 to then make a second update to the same column, and set version
> to 3. There's a secondary index on this A column (there's only a few possible
> values for A, so not worried about the cardinality issue), though I've
> reproed with the new SASI index too.
> If one of the nodes is down, there's still 2 alive for quorum so inserts can
> still happen.
> When I bring up the downed node, sometimes I get really weird state back,
> which ultimately crashes the client system that's talking to Cassandra.
> When reading I always select all the columns, but there is a conditional
> where clause that A=V_a2 (e.g. SELECT * FROM table WHERE A=V_a2). This read
> is for processing any rows with V_a2, and ultimately updating to V_a3 when
> complete. While periodically polling for A=V_a2 it is of course possible for
> the poller to observe the old V_a2 value while the other parts of the
> client system process and make the update to V_a3, and that's generally OK
> because of the LWTs used for updates; an occasionally wasted reprocessing run
> isn't a big deal. But when reading at serial I always expect to get the
> original values for columns that were never updated too. If a paxos update is
> in progress then I expect that to be completed before its value(s) are
> returned. But instead, the read seems to be seeing the partial commit of the
> LWT, returning the old V_a2 value for the changed column, but no values
> whatsoever for the other columns. From the example above, instead of getting
> the full version=3 row, or even the older one (either of which I expect and
> is OK), I get only the changed column, so the rest of the columns end up
> null, which I never expect. However this isn't persistent; Cassandra does
> end up consistent, which I see via sstabledump and cqlsh after the fact.
> In my client system logs I record the inserts/updates, and this
> inconsistency happens around the same time as the update from V_a2 to V_a3,
> hence my comment about Cassandra seeing a partial commit. So that leads me to
> suspect that perhaps, due to the where clause in my read query for A=V_a2,
> one of the original good nodes already has the new V_a3 value, so it
> doesn't return this row for the select query, but the other good node and the
> one that was down still have the old value V_a2, so those 2 nodes return what
> they have. The one that was down doesn't yet have the original insert, just
> the update from V_a1 -> V_a2 (again I suspect; it's not been easy to verify),
> which would explain where the partial row comes from; that's all it knows
> about. However, since it's a serial quorum read, I'd expect some sort of
> exception, as neither of the remaining 2 nodes with A=V_a2 would be able to
> come to a quorum on the values for all the columns, as I'd expect the other
> good node to return the full row.
> I know at some point nodetool repair should be run on this node, but I'm
> concerned about the window of time between when the node comes back up and
> repair starts/completes. It almost seems like, if a node goes down, the
> safest bet is to remove it from the cluster and rebuild, instead of simply
> restarting the node? However I haven't tested that to see if it runs into a
> similar situation.
> It is of course possib
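The per-node verification read described in the comment at the top of this message can be sketched as follows. The node labels and row values here are illustrative assumptions, not output from a real cluster:

```python
# Sketch: issue the same SELECT against each node and flag any divergence.
# More than one distinct row among the nodes means the replicas disagree.

def find_inconsistencies(per_node_rows):
    """Group nodes by the row they returned; >1 group means divergence."""
    groups = {}
    for node, row in per_node_rows.items():
        groups.setdefault(tuple(sorted(row.items())), []).append(node)
    return groups

# Hypothetical results of SELECT * FROM test.tests WHERE name = 'cass-12438';
expected = {"name": "cass-12438", "owner": "user_3", "value_1": "A3"}
buggy = {"name": "cass-12438", "owner": None, "value_1": "A2"}  # partial row

groups = find_inconsistencies({"X": expected, "Y": expected, "Z": buggy})
assert len(groups) == 2   # X and Y agree; Z returns the buggy partial row
```

In the reported scenario, nodes X and Y fall into one group and node Z into another, which is exactly the cross-node inconsistency the comment describes.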
[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node
[ https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629632#comment-16629632 ] Jeffrey F. Lukman commented on CASSANDRA-12438: ---

Hi [~benedict],

Following the bug description, we integrated our model checker with Cassandra v3.7. We grabbed the code from the GitHub repository. Here is the initial schema that we prepared before injecting any queries into the model checker's execution path:
* CREATE KEYSPACE test WITH REPLICATION = \{'class': 'SimpleStrategy', 'replication_factor': 3};
* CREATE TABLE tests (name text PRIMARY KEY, owner text, value_1 text, value_2 text, value_3 text, value_4 text, value_5 text, value_6 text, value_7 text);

Here are the details of the operations/queries:
* Client Request 1: INSERT INTO test.tests (name, owner, value_1, value_2, value_3, value_4, value_5, value_6, value_7) VALUES ('cass-12438', 'user_1', 'A1', 'B1', 'C1', 'D1', 'E1', 'F1', 'G1') IF NOT EXISTS;
* Client Request 2: UPDATE test.tests SET value_1 = 'A2', owner = 'user_2' WHERE name = 'cass-12438' IF owner = 'user_1';
* Client Request 3: UPDATE test.tests SET value_1 = 'A3', owner = 'user_3' WHERE name = 'cass-12438' IF owner = 'user_2';

The messages from these queries are the ones that the model checker controls and reorders, which is how we ended up reproducing this bug.
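Since each UPDATE's IF clause keys off the owner written by the previous request, the three requests above only all apply when executed in order. A plain-Python sanity check of that chain (the helper names are hypothetical stand-ins for the LWT semantics):

```python
# Sketch of the conditional chain: INSERT ... IF NOT EXISTS, then two
# UPDATE ... IF owner = <previous owner> requests.

def lwt_insert_if_not_exists(table, key, columns):
    """INSERT ... IF NOT EXISTS analogue: applied only if the row is absent."""
    if key in table:
        return False
    table[key] = dict(columns)
    return True

def lwt_update_if_owner(table, key, sets, expected_owner):
    """UPDATE ... IF owner = <expected_owner> analogue."""
    row = table.get(key)
    if row is None or row["owner"] != expected_owner:
        return False
    row.update(sets)
    return True

# In order, every request applies.
table = {}
assert lwt_insert_if_not_exists(table, "cass-12438",
                                {"owner": "user_1", "value_1": "A1"})
assert lwt_update_if_owner(table, "cass-12438",
                           {"owner": "user_2", "value_1": "A2"}, "user_1")
assert lwt_update_if_owner(table, "cass-12438",
                           {"owner": "user_3", "value_1": "A3"}, "user_2")
assert table["cass-12438"] == {"owner": "user_3", "value_1": "A3"}

# Out of order, request 3's condition fails: owner is still 'user_1'.
table2 = {}
lwt_insert_if_not_exists(table2, "cass-12438",
                         {"owner": "user_1", "value_1": "A1"})
assert not lwt_update_if_owner(table2, "cass-12438",
                               {"owner": "user_3", "value_1": "A3"}, "user_2")
```

This dependency between the requests is what gives the model checker's message reorderings their power: any interleaving that breaks the owner chain changes which requests can apply.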
[jira] [Comment Edited] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node
[ https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629434#comment-16629434 ] Jeffrey F. Lukman edited comment on CASSANDRA-12438 at 9/27/18 1:38 AM:

Hi all,

Our team from UCARE, University of Chicago, has been able to reproduce this bug consistently with our model checker. Here are the workload and scenario of the bug:

Workload: 3-node cluster (let's call the nodes X, Y, and Z), 1 crash and 1 reboot event, and 3 client requests (node X is the coordinator node for all client requests).

Scenario:
# Start the 3 nodes and set CONSISTENCY = ONE.
# Inject client request 1 as described in this bug description: Insert (along with many others)
# But before node X has sent any PREPARE messages, node Z crashes.
# Client request 1 is successfully committed on nodes X and Y.
# Reboot node Z.
# Inject client requests 2 & 3 as described in this bug description: Update (along with others for which A=V_a1) and Update (along with many others for which A=V_a2) (**Update 3 can also be left out if we want to simplify the bug scenario.)
# If only client request-2 finishes, we expect to see: If client request-2 and then client request-3 are committed, we expect to see: The remaining possibility is that both client requests 2 & 3 fail with timeouts, in which case we expect to see:
# But our model checker shows that, if we do a read request to node Z, we will see: // some fields are null. If we do a read request to node X or Y, we get a complete result (as expected in step 7).

This means we end up with an inconsistent view among the nodes (X and Y differ from Z). If we run this scenario with CONSISTENCY.ALL, we do not see this bug happen. We are happy to assist you in debugging this issue.
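One way to picture how node Z can end up with a partial row in the scenario above: it misses the INSERT while down, then receives only the later UPDATE's changed columns after rebooting. This is a sketch under assumed, highly simplified replication mechanics, not Cassandra's actual write path:

```python
# Sketch: per-replica column stores. Z is down for the insert, so it only
# ever applies the later update's columns, leaving a partial row.

def apply_mutation(store, key, columns):
    """Merge a mutation's columns into a replica's local store."""
    store.setdefault(key, {}).update(columns)

x_store, y_store, z_store = {}, {}, {}

insert = {"owner": "user_1", "value_1": "A1", "value_2": "B1"}
update2 = {"owner": "user_2", "value_1": "A2"}   # only the changed columns

# Z is down for the insert: only X and Y apply it.
for store in (x_store, y_store):
    apply_mutation(store, "cass-12438", insert)

# Z is back for update 2 (CONSISTENCY ONE lets it proceed before repair).
for store in (x_store, y_store, z_store):
    apply_mutation(store, "cass-12438", update2)

assert x_store["cass-12438"]["value_2"] == "B1"     # full row on X and Y
assert "value_2" not in z_store["cass-12438"]       # partial row on node Z
```

A read served from node Z alone then returns the partial row with null columns, matching the buggy result in step 8 above, until repair brings Z back in sync.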
[jira] [Commented] (CASSANDRA-12126) CAS Reads Inconsistencies
[ https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629455#comment-16629455 ] Jeffrey F. Lukman commented on CASSANDRA-12126: --- Hi all, Our team from UCARE, University of Chicago, has been able to reproduce a manifestation similar to this bug consistently with our model checker. (Our scenario is different from the one [~kohlisankalp] proposed.) Here are the workload and scenario of the bug:
Workload: 3-node cluster, 3 client requests (but no crash event)
Scenario:
# Start the 3-node cluster and inject each of the 3 client requests into a different node (nodes X, Y, and Z).
# Node X sends its prepare messages (ballot number=1) to all nodes and all nodes accept them.
# Node X sends its propose message to itself, setting its own inProgress value to "X".
# Node Y sends its prepare messages (ballot number=2) to all nodes. This also invalidates the rest of node X's propose messages, because their ballot number is smaller than that of node Y's prepare messages.
# In our scenario, the prepare responses from nodes Y and Z arrive before the prepare response from node X, so node Y never learns that node X has already accepted value "X" (step 3).
# But since the query of client request 2 has an IF condition (IF value_1='X'), node Y does not continue sending propose messages to the other nodes. Up to this point, none of the queries has been committed to the server.
# Node Z now sends its prepare messages to all nodes and all nodes accept them.
# In our scenario, node X returns its response first this time, which also lets node Z know about X's inProgress value "X". From here, node Z proposes and commits client request-1 (with value "X") instead of client request-3.
# Therefore, we end up with client request-1 stored on the server, although client request-3 was the one reported as successful.
We are ready to assist if any further information is needed.
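The arrival-order dependence in steps 5 and 8 can be sketched with a toy model (a deliberately simplified Python sketch, not Cassandra's actual code; node names and the helper are illustrative): a coordinator acts on the first quorum of prepare responses it receives, so whether it learns about node X's in-progress value depends purely on which responses arrive first.

```python
# Toy model of the Paxos prepare round described above (illustrative only).
# The coordinator collects prepare responses and proceeds as soon as it has
# a majority; a late response carrying an in-progress value is never seen.

QUORUM = 2  # majority of a 3-node cluster

def decide_repair(responses_in_arrival_order):
    """Collect prepare responses until a quorum is reached, then decide
    whether an in-progress value must be repaired (re-proposed).
    Each response is (node_name, in_progress_value_or_None)."""
    seen = []
    for node, in_progress in responses_in_arrival_order:
        seen.append((node, in_progress))
        if len(seen) >= QUORUM:
            break  # coordinator proceeds as soon as it has a majority
    # repair iff some response within the quorum carried an accepted value
    values = [v for _, v in seen if v is not None]
    return values[0] if values else None

# Node Y's view: X's response (with in-progress value "X") arrives last,
# so the quorum is formed from Y and Z alone -> no repair happens.
assert decide_repair([("Y", None), ("Z", None), ("X", "X")]) is None

# Node Z's view: X's response arrives first -> the quorum includes X,
# so Z repairs (re-proposes) value "X".
assert decide_repair([("X", "X"), ("Y", None)]) == "X"
```

Both outcomes are legal under the original Paxos algorithm, which is exactly why nodes Y and Z react inconsistently to the same cluster state.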
> CAS Reads Inconsistencies
> --
>
> Key: CASSANDRA-12126
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12126
> Project: Cassandra
> Issue Type: Bug
> Components: Coordination
> Reporter: sankalp kohli
> Priority: Major
> Labels: LWT
>
> While looking at the CAS code in Cassandra, I found a potential issue with CAS Reads. Here is how it can happen with RF=3:
> 1) You issue a CAS Write and it fails in the propose phase. Machine A replies true to the propose and saves the commit in its accepted field. The other two machines, B and C, do not get to the accept phase. The current state is that machine A has this commit in the paxos table as accepted but not committed, and B and C do not.
> 2) Issue a CAS Read and it goes to only B and C. You won't be able to read the value written in step 1; this read behaves as if nothing is in flight.
> 3) Issue another CAS Read and it goes to A and B. Now we will discover that there is something in flight from A and will propose and commit it with the current ballot. Now we can read the value written in step 1 as part of this CAS read.
> If we skip step 3 and instead run step 4, we will never learn about the value written in step 1.
> 4) Issue a CAS Write and it involves only B and C. This will succeed and commit a different value than step 1. The step-1 value will never be seen again and was never seen before.
> Section 2.3 of Lamport's "Paxos Made Simple" paper talks about this issue: how learners can find out whether a majority of the acceptors have accepted a proposal.
> In step 3, it is correct that we propose the value again, since we don't know whether it was accepted by a majority of acceptors. When we ask a majority of acceptors, and more than one acceptor (but not a majority) has something in flight, we have no way of knowing whether it was accepted by a majority of acceptors. So this behavior is correct.
> However, we need to fix step 2, since it causes reads to not be linearizable with respect to writes and other reads. In this case, a majority of acceptors report no in-flight commit, which means we know from a majority that nothing was accepted by a majority. I think we should run a propose step here with an empty commit, which will cause the write from step 1 to never be visible afterward.
> With this fix, we will either see the data written in step 1 on the next serial read or never see it, which is what we want. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
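The proposed fix can be sketched as follows (an illustrative Python sketch of the decision rule, not a patch against Cassandra; function and value names are hypothetical): when a quorum of prepare responses carries no in-flight commit, the serial read proposes an empty commit at its own ballot instead of silently proceeding, so the value accepted only on machine A can never be completed later.

```python
# Sketch of the suggested step-2 fix: a serial read always proposes at its
# ballot -- either the in-flight value it discovered (step-3 behavior) or an
# empty commit that seals the ballot (the fix), superseding any value that
# was accepted by fewer than a majority of acceptors.

def serial_read_action(quorum_responses, ballot):
    """quorum_responses: in-progress values reported by a quorum of
    acceptors (None = nothing accepted). Returns the proposal to make."""
    in_flight = [v for v in quorum_responses if v is not None]
    if in_flight:
        # existing step-3 behavior: finish the in-flight commit at our ballot
        return ("propose", ballot, in_flight[0])
    # proposed fix for step 2: propose an empty commit so the write that
    # reached only machine A can never become visible afterward
    return ("propose", ballot, "EMPTY")

assert serial_read_action([None, None], ballot=5) == ("propose", 5, "EMPTY")
assert serial_read_action(["v1", None], ballot=5) == ("propose", 5, "v1")
```

Either way the read resolves the ambiguity: the step-1 value is committed now or excluded forever, which restores linearizability across steps 2 and 3.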
[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node
[ https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629434#comment-16629434 ] Jeffrey F. Lukman commented on CASSANDRA-12438: --- Hi all, Our team from UCARE, University of Chicago, has been able to reproduce this bug consistently with our model checker. Here are the workload and scenario of the bug:
Workload: 3-node cluster (let's call the nodes X, Y, and Z), 1 crash and 1 reboot event, with 3 client requests (where node X is the coordinator node for all client requests).
Scenario:
# Start the 3 nodes and set CONSISTENCY = ONE.
# Inject client request 1 as described in this bug description: Insert (along with many others).
# Before any PREPARE messages have been sent by node X, node Z crashes.
# Client request 1 is successfully committed on nodes X and Y.
# Reboot node Z.
# Inject client requests 2 & 3 as described in this bug description: Update (along with others for which A=V_a1), Update (along with many others for which A=V_a2). (**Update 3 can also be omitted if we want to simplify the bug scenario.)
# If client request-2 finishes first without interference from client request-3, then we expect to see: If client request-3 interferes with client request-2, or is executed before client request-2 for any reason, then we expect to see:
# But our model checker shows that, if we issue a read request to node Z, we will see: // some fields are null But if we issue a read request to node X or Y, we will get a complete result.
This means we end up with an inconsistent view among the nodes. If we run this scenario with CONSISTENCY.ALL, we do not see this bug happen. We are happy to assist you in debugging this issue.
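The LWT update chain driving this workload (insert version=1, then conditional updates move A from V_a1 to V_a2 to V_a3, per the bug description below) can be mimicked with a toy compare-and-set store. This is a hypothetical in-memory sketch of the CAS semantics, not the Cassandra driver API; column and value names follow the report.

```python
# Toy compare-and-set mimicking the LWT chain from the report: each watcher
# updates column A only IF it still holds the expected value, which is what
# makes a lost race harmless in the healthy (non-buggy) case.

def cas_update(row, expected_a, new_a, new_version):
    """Apply the update only if column A still holds the expected value
    (the IF clause of the LWT). Returns True when the update applied."""
    if row.get("A") != expected_a:
        return False  # condition failed: another process got there first
    row["A"], row["version"] = new_a, new_version
    return True

row = {"id": 1, "A": "V_a1", "B": "V_b1", "version": 1}  # initial insert
assert cas_update(row, "V_a1", "V_a2", 2)      # first watcher fires
assert not cas_update(row, "V_a1", "V_a2", 2)  # a retry loses the race
assert cas_update(row, "V_a2", "V_a3", 3)      # second watcher fires
assert row["version"] == 3 and row["B"] == "V_b1"  # untouched columns survive
```

The bug report is precisely that the last invariant fails on the rejoining node: a serial read returns the matched column but nulls for columns that were never updated.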
> Data inconsistencies with lightweight transactions, serial reads, and > rejoining node > > > Key: CASSANDRA-12438 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12438 > Project: Cassandra > Issue Type: Bug >Reporter: Steven Schaefer >Priority: Major > > I've run into some issues with data inconsistency in a situation where a > single node is rejoining a 3-node cluster with RF=3. I'm running 3.7. > I have a client system which inserts data into a table with around 7 columns, > named let's say A-F,id, and version. LWTs are used to make the inserts and > updates. > Typically what happens is there's an insert of values id, V_a1, V_b1, ... , > version=1, then another process will pick up rows with for example A=V_a1 and > subsequently update A to V_a2 and version=2. Yet another process will watch > for A=V_a2 to then make a second update to the same column, and set version > to 3, with end result being There's a > secondary index on this A column (there's only a few possible values for A so > not worried about the cardinality issue), though I've reproed with the new > SASI index too. > If one of the nodes is down, there's still 2 alive for quorum so inserts can > still happen. When I bring up the downed node, sometimes I get really weird > state back which ultimately crashes the client system that's talking to > Cassandra. > When reading I always select all the columns, but there is a conditional > where clause that A=V_a2 (e.g. SELECT * FROM table WHERE A=V_a2). This read > is for processing any rows with V_a2, and ultimately updating to V_a3 when > complete. 
> While periodically polling for A=V_a2 it is of course possible for the poller to observe the old V_a2 value while the other parts of the client system process and make the update to V_a3, and that's generally ok because of the LWTs used for updates; an occasionally wasted reprocessing run isn't a big deal. But when reading at serial I always expect to get the original values for columns that were never updated too. If a paxos update is in progress, then I expect it to be completed before its value(s) are returned. But instead, the read seems to be seeing the partial commit of the LWT, returning the old V_a2 value for the changed column, but no values whatsoever for the other columns. From the example above, instead of getting , version=3>, or even the older (either of which I expect and are ok), I get only , so the rest of the columns end up null, which I never expect. However, this isn't persistent; Cassandra does end up consistent, which I see via sstabledump and cqlsh after the fact.
> In my client system logs I record the inserts / updates, and this inconsistency happens around the same time as the update from V_a2 to V_a3, hence my comment about Cassandra seeing a partial commit. So that leads me to suspect that perhaps due to the where clause in my read query for A=V_a2, perhaps one of the original good
[jira] [Commented] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277100#comment-15277100 ] Jeffrey F. Lukman commented on CASSANDRA-11724: --- [~jeromatron]: okay, I will try this again and report later whether this config causes a different result or not. For now, can you help me by confirming whether you also see the Workload-4 bug? Workload-4: running a 512-node cluster with some data, then decommissioning a node. On our side, we see a high number of false failure detections.
> False Failure Detection in Big Cassandra Cluster
>
> Key: CASSANDRA-11724
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Jeffrey F. Lukman
> Labels: gossip, node-failure
> Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, Workload4.jpg, experiment-result.txt
>
> We are running some testing on Cassandra v2.2.5 stable in a big cluster. The setting in our testing is that each machine has 16 cores and runs 8 Cassandra instances, and we test 32, 64, 128, 256, and 512 instances of Cassandra. We use the default number of vnodes for each instance, which is 256. The data and log directories are on an in-memory tmpfs file system.
> We run several types of workloads on this Cassandra cluster:
> Workload1: Just start the cluster
> Workload2: Start half of the cluster, wait until it gets into a stable condition, and run the other half of the cluster
> Workload3: Start half of the cluster, wait until it gets into a stable condition, load some data, and run the other half of the cluster
> Workload4: Start the cluster, wait until it gets into a stable condition, load some data, and decommission one node
> For this testing, we measure the total number of false failure detections inside the cluster.
> By false failure detection, we mean that, for example, instance-1 marks instance-2 down, but instance-2 is not down. We dug deeper into the root cause and found that instance-1 had not received any heartbeat from instance-2 after some time because instance-2 was running a long computation process.
> Here I attach the graphs of each workload result.
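The reported mechanism can be sketched with a minimal timeout-based detector (an illustrative simplification in Python; Cassandra actually uses a phi-accrual failure detector, and the threshold below is invented): a peer that stalls in a long computation stops emitting heartbeats and gets marked down even though its process is alive.

```python
# Minimal heartbeat-timeout failure detector (illustrative, much simpler
# than Cassandra's phi-accrual detector). A long local computation on the
# monitored peer delays its heartbeats and produces a false DOWN verdict.

TIMEOUT = 10.0  # seconds without a heartbeat before declaring a peer down
                # (hypothetical threshold, chosen for the example)

def is_marked_down(last_heartbeat_at, now):
    """True when the monitoring node would mark the peer as down."""
    return now - last_heartbeat_at > TIMEOUT

# instance-2 heartbeats at t=0, then goes silent for 28.5 s (the average
# worst applyStateLocally() time reported elsewhere in this thread):
assert is_marked_down(last_heartbeat_at=0.0, now=28.5)     # false positive
assert not is_marked_down(last_heartbeat_at=0.0, now=5.0)  # healthy case
```

The false positive here is not a detector bug per se: any timeout-based detector trades detection speed against tolerance for slow peers, which is why long gossip-stage stalls at large cluster sizes inflate the false-detection counts.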
[jira] [Comment Edited] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275388#comment-15275388 ] Jeffrey F. Lukman edited comment on CASSANDRA-11724 at 5/7/16 9:17 PM: --- {quote} Are you not waiting two minutes between starting each node to join the ring? {quote} Hi Jeremy, No, I haven't waited 2 minutes between starting each node. So are you saying that even when I want to bootstrap a new cluster, say a 512-node cluster, I have to add the nodes one by one and wait 2 minutes between additions? That means bootstrapping 512 nodes requires waiting 511 * 2 minutes = 1022 minutes = *17 hours*? What about the decommission-one-node problem? In our 4th workload, we first started X nodes, waited until the cluster was stable, then decommissioned a node, and we still see a large number of false failure detections. For 512 nodes, we measured around *90,000+* false failure detections.
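The waiting-time arithmetic in the comment above, spelled out (a trivial sketch; the two-minute pause is the figure quoted from the ring-join advice, not a measured value):

```python
# Sequential bootstrap cost when joining nodes one at a time with a
# two-minute pause between consecutive additions.

nodes = 512
pauses = nodes - 1       # 511 gaps between consecutive joins
minutes = pauses * 2     # 1022 minutes of pure waiting
hours = minutes / 60     # roughly 17 hours

assert minutes == 1022
assert 17.0 < hours < 17.1
```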
[jira] [Commented] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275388#comment-15275388 ] Jeffrey F. Lukman commented on CASSANDRA-11724: --- {quote} Are you not waiting two minutes between starting each node to join the ring? {quote} Hi Jeremy, No, I haven't waited 2 minutes between starting each node. So are you saying that even when I want to bootstrap a new cluster, say a 512-node cluster, I have to add the nodes one by one and wait 2 minutes between additions? That means bootstrapping 512 nodes requires waiting 511 * 2 minutes = 1022 minutes = *17+ hours*? What about the decommission-one-node problem? In our 4th workload, we first started X nodes, waited until the cluster was stable, then decommissioned a node, and we still see a large number of false failure detections. For 512 nodes, we measured around *90,000+* false failure detections.
[jira] [Comment Edited] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275362#comment-15275362 ] Jeffrey F. Lukman edited comment on CASSANDRA-11724 at 5/7/16 8:06 PM: --- Hi Sylvain, I've tried to dig deeper into why this bug happens at large cluster scale. I found that when we start 256 Cassandra instances (or more) simultaneously, the {{applyStateLocally()}} function of Gossiper.java can take a long time in the init phase. On some nodes, I saw that it can take *60+ seconds*; the worst {{applyStateLocally()}} running time that I have seen so far is *120 seconds*. The average worst running time of {{applyStateLocally()}} across all nodes is *28.5 seconds*. I believe this number can get even worse when bootstrapping 512 Cassandra instances. I'm still trying to understand why this happens and why it only happens in the initialization phase. I attached the running time records that I have from one of my experiments in the file experiment-result.txt. Note: this time, I ran the experiment on 16 machines with 16 instances running on each machine; these were the machines available for my last experiment.
[jira] [Issue Comment Deleted] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeffrey F. Lukman updated CASSANDRA-11724: -- Comment: was deleted (was: Hi Sylvain, I've tried to go deeper on why this bug happen in big scale of cluster. I found out that when we start 256 Cassandra instances (or more) simultaneously, the {{applyStateLocally()}} function of Gossiper.java can take a long time in the init phase. In some nodes, I saw that some of the nodes can take *60+ seconds*. The worst {{applyStateLocally()}} running time that I see so far is *120 seconds*. The average worst running time of {{applyStateLocally()}} from all nodes is *28.5 seconds*. I believe this number can get even worst in bootstrapping 512 Cassandra instances. I'm still trying to understand why this can happen and why this only happen in the initialization phase. I attached the running time records that I have from one of my experiment.)
[jira] [Comment Edited] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275362#comment-15275362 ] Jeffrey F. Lukman edited comment on CASSANDRA-11724 at 5/7/16 7:52 PM: --- Hi Sylvain, I've tried to dig deeper into why this bug happens at large cluster scale. I found that when we start 256 Cassandra instances (or more) simultaneously, the {{applyStateLocally()}} function of Gossiper.java can take a long time in the init phase. On some nodes, I saw that it can take *60+ seconds*; the worst {{applyStateLocally()}} running time that I have seen so far is *120 seconds*. The average worst running time of {{applyStateLocally()}} across all nodes is *28.5 seconds*. I believe this number can get even worse when bootstrapping 512 Cassandra instances. I'm still trying to understand why this happens and why it only happens in the initialization phase. I attached the running time records that I have from one of my experiments in the file experiment-result.txt.
[jira] [Updated] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeffrey F. Lukman updated CASSANDRA-11724: -- Attachment: experiment-result.txt I ran 256 instances of Cassandra and measured the running time of the Gossiper functions. Here is the bottleneck function that I see in my experiment.
[jira] [Comment Edited] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275359#comment-15275359 ] Jeffrey F. Lukman edited comment on CASSANDRA-11724 at 5/7/16 7:49 PM:
---
Hi Sylvain,
I've tried to dig deeper into why this bug happens at large cluster scale. I found that when we start 256 Cassandra instances (or more) simultaneously, the {{applyStateLocally()}} function of Gossiper.java can take a long time during the init phase. On some nodes, a single pass can take *60+ seconds*. The worst {{applyStateLocally()}} running time I have seen so far is *120 seconds*, and the average worst running time across all nodes is *28.5 seconds*. I believe these numbers can get even worse when bootstrapping 512 Cassandra instances. I'm still trying to understand why this happens, and why it happens only in the initialization phase. I attached the running-time records from one of my experiments.
> False Failure Detection in Big Cassandra Cluster
>
> Key: CASSANDRA-11724
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Jeffrey F. Lukman
> Labels: gossip, node-failure
> Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, Workload4.jpg
>
> We are running tests on Cassandra v2.2.5 stable in a big cluster. In our setup each machine has 16 cores and runs 8 Cassandra instances, and we test 32, 64, 128, 256, and 512 instances of Cassandra. We use the default number of vnodes per instance, which is 256. The data and log directories are on an in-memory tmpfs file system.
> We run several types of workloads on this Cassandra cluster:
> Workload1: Just start the cluster
> Workload2: Start half of the cluster, wait until it gets into a stable condition, and start the other half of the cluster
> Workload3: Start half of the cluster, wait until it gets into a stable condition, load some data, and start the other half of the cluster
> Workload4: Start the cluster, wait until it gets into a stable condition, load some data, and decommission one node
> For this testing, we measure the total number of false failure detections inside the cluster. By false failure detection we mean that, for example, instance-1 marks instance-2 as down, but instance-2 is not down. Digging into the root cause, we found that instance-1 had not received any heartbeat from instance-2 for some time because instance-2 was running a long computation.
> Here I attach the graphs of each workload's results.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
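To make the observation above concrete, here is a minimal, hypothetical Java sketch (the class and method names other than {{applyStateLocally()}} are invented for illustration, not Cassandra's actual code) of how one might time each state-apply pass and flag the slow passes reported in this comment:

```java
import java.util.List;

// Hedged sketch: time a gossip state-apply loop the way one might
// instrument Gossiper.applyStateLocally(), and warn when one pass
// exceeds a threshold. All names here except applyStateLocally()
// are hypothetical stand-ins.
public class GossipTimingProbe {
    // 1 second, purely illustrative; the comment above reports 60-120s passes.
    static final long WARN_THRESHOLD_NANOS = 1_000_000_000L;

    // Applies every pending endpoint-state update and returns elapsed nanos.
    public static long timedApply(List<Runnable> endpointStateUpdates) {
        long start = System.nanoTime();
        for (Runnable update : endpointStateUpdates)
            update.run(); // stand-in for merging one endpoint's gossip state
        long elapsed = System.nanoTime() - start;
        if (elapsed > WARN_THRESHOLD_NANOS)
            System.err.println("state-apply pass took " + elapsed / 1_000_000 + " ms");
        return elapsed;
    }

    public static void main(String[] args) {
        List<Runnable> updates = List.of(
            () -> {},  // merging endpoint state A (stub)
            () -> {}   // merging endpoint state B (stub)
        );
        System.out.println("elapsed nanos: " + timedApply(updates));
    }
}
```

With 256+ instances bootstrapping at once, each node receives state for hundreds of endpoints in one pass, which is consistent with a single pass growing to tens of seconds.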
[jira] [Updated] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeffrey F. Lukman updated CASSANDRA-11724:
--
Description:
We are running tests on Cassandra v2.2.5 stable in a big cluster. In our setup each machine has 16 cores and runs 8 Cassandra instances, and we test 32, 64, 128, 256, and 512 instances of Cassandra. We use the default number of vnodes per instance, which is 256. The data and log directories are on an in-memory tmpfs file system.
We run several types of workloads on this Cassandra cluster:
Workload1: Just start the cluster
Workload2: Start half of the cluster, wait until it gets into a stable condition, and start the other half of the cluster
Workload3: Start half of the cluster, wait until it gets into a stable condition, load some data, and start the other half of the cluster
Workload4: Start the cluster, wait until it gets into a stable condition, load some data, and decommission one node
For this testing, we measure the total number of false failure detections inside the cluster. By false failure detection we mean that, for example, instance-1 marks instance-2 as down, but instance-2 is not down. Digging into the root cause, we found that instance-1 had not received any heartbeat from instance-2 for some time because instance-2 was running a long computation.
Here I attach the graphs of each workload's results.

was: the same description, except Workload4 read: "Start the cluster, wait until it gets into a stable condition, and decommission one node" (without loading data).
[jira] [Updated] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeffrey F. Lukman updated CASSANDRA-11724:
--
Description:
We are running tests on Cassandra v2.2.5 stable in a big cluster. In our setup each machine has 16 cores and runs 8 Cassandra instances, and we test 32, 64, 128, 256, and 512 instances of Cassandra. We use the default number of vnodes per instance, which is 256. The data and log directories are on an in-memory tmpfs file system.
We run several types of workloads on this Cassandra cluster:
Workload1: Just start the cluster
Workload2: Start half of the cluster, wait until it gets into a stable condition, and start the other half of the cluster
Workload3: Start half of the cluster, wait until it gets into a stable condition, load some data, and start the other half of the cluster
Workload4: Start the cluster, wait until it gets into a stable condition, and decommission one node
For this testing, we measure the total number of false failure detections inside the cluster. By false failure detection we mean that, for example, instance-1 marks instance-2 as down, but instance-2 is not down. Digging into the root cause, we found that instance-1 had not received any heartbeat from instance-2 for some time because instance-2 was running a long computation.
Here I attach the graphs of each workload's results.

was: the same description, except the setup read: "each machine runs 8 cassandra instances" (without the 16-core detail).
[jira] [Updated] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeffrey F. Lukman updated CASSANDRA-11724:
--
Fix Version/s: (was: 2.2.5)
[jira] [Created] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
Jeffrey F. Lukman created CASSANDRA-11724:
-
Summary: False Failure Detection in Big Cassandra Cluster
Key: CASSANDRA-11724
URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
Project: Cassandra
Issue Type: Bug
Components: Core
Reporter: Jeffrey F. Lukman
Fix For: 2.2.5
Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, Workload4.jpg

We are running tests on Cassandra v2.2.5 stable in a big cluster. In our setup each machine runs 8 Cassandra instances, and we test 32, 64, 128, 256, and 512 instances of Cassandra. We use the default number of vnodes per instance, which is 256. The data and log directories are on an in-memory tmpfs file system.
We run several types of workloads on this Cassandra cluster:
Workload1: Just start the cluster
Workload2: Start half of the cluster, wait until it gets into a stable condition, and start the other half of the cluster
Workload3: Start half of the cluster, wait until it gets into a stable condition, load some data, and start the other half of the cluster
Workload4: Start the cluster, wait until it gets into a stable condition, and decommission one node
For this testing, we measure the total number of false failure detections inside the cluster. By false failure detection we mean that, for example, instance-1 marks instance-2 as down, but instance-2 is not down. Digging into the root cause, we found that instance-1 had not received any heartbeat from instance-2 for some time because instance-2 was running a long computation.
Here I attach the graphs of each workload's results.
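The false-positive mechanism described above can be sketched as follows. This is a deliberately simplified, hypothetical fixed-timeout detector (Cassandra's real detector is the phi-accrual failure detector in FailureDetector.java, which adapts to heartbeat-interval history); all names here are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of heartbeat-timeout failure detection: a peer is
// convicted as "down" when no heartbeat has been recorded within the
// timeout, even if its process is merely busy with a long computation.
public class SimpleFailureDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    public SimpleFailureDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called when a gossip heartbeat from `endpoint` is applied locally.
    public void report(String endpoint, long nowMillis) {
        lastHeartbeat.put(endpoint, nowMillis);
    }

    // Returns false once the heartbeat gap exceeds the timeout, whether or
    // not the peer is actually dead -- the source of a false detection.
    public boolean isAlive(String endpoint, long nowMillis) {
        Long last = lastHeartbeat.get(endpoint);
        return last != null && nowMillis - last <= timeoutMillis;
    }

    public static void main(String[] args) {
        SimpleFailureDetector fd = new SimpleFailureDetector(10_000L);
        fd.report("instance-2", 0L);
        // Within the timeout: still considered alive.
        System.out.println(fd.isAlive("instance-2", 5_000L));  // true
        // instance-2 stuck in a long computation stops gossiping;
        // instance-1 convicts it even though it is still running.
        System.out.println(fd.isAlive("instance-2", 60_000L)); // false
    }
}
```

This is why long pauses in gossip processing (such as the slow {{applyStateLocally()}} passes reported in the comments) translate directly into false failure detections.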