[jira] [Commented] (CASSANDRA-12126) CAS Reads Inconsistencies

2018-09-27 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630850#comment-16630850
 ] 

Jeffrey F. Lukman commented on CASSANDRA-12126:
---

Thank you for your responses, [~jjordan] and [~kohlisankalp].
You have cleared up a misunderstanding for me (and our team): a timeout is a 
"gray area" for the client, which cannot tell whether the request was 
successfully processed.
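
For illustration, this is roughly how that gray area shows up on the client side. The sketch below is only an assumption of ours about typical client code: it uses the DataStax Java driver 3.x (not something this ticket prescribes) and reuses the first UPDATE from our scenario; a CAS write that times out has to be treated as "outcome unknown":

{code:java}
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class CasWriteOutcome {
    // Sketch only: 'session' is an already-connected Session.
    static void casWrite(Session session) {
        try {
            ResultSet rs = session.execute(new SimpleStatement(
                    "UPDATE test.tests SET value_1 = 'A' WHERE name = 'testing' IF owner = 'user_1'"));
            // wasApplied() tells us whether the IF condition held and the value was committed.
            System.out.println("applied = " + rs.wasApplied());
        } catch (WriteTimeoutException e) {
            if (e.getWriteType() == WriteType.CAS) {
                // Gray area: the proposal may or may not have been accepted by a replica,
                // and may still be committed later by another coordinator's Paxos repair.
                // The client can only treat the outcome as unknown (e.g. re-read at SERIAL).
            }
        }
    }
}
{code}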

One thing that we would like to point out, based on the earlier discussion in 
this bug description:
{quote}However we need to fix step 2, since it caused reads to not be 
linearizable with respect to writes and other reads. In this case, we know that 
majority of acceptors have no inflight commit which means we have majority that 
nothing was accepted by majority. I think we should run a propose step here 
with empty commit and that will cause write written in step 1 to not be visible 
ever after.
{quote}
What we originally tried to mimic with our model checker was exactly this 
scenario: node Y saw that a majority of nodes had no in-progress value, but 
node Z then saw the in-progress value from node X and tried to repair and 
commit it.
So we confirm that we can also see this behavior:
{quote}2: Read -> Nothing
3: Read -> Something
{quote}
Node Y read nothing, yet node Z read something on the next request.



To sum up, our scenario explains this behavior as follows: node Y does not try 
to repair the Paxos round because node X's prepare response arrives last, so 
node Y ignores that response and bases its decision not to repair on the 
others.
In node Z's client request, however, node Z decides to repair the Paxos round 
based on node X's existing in-progress value_1='A', because node X's prepare 
response arrives early (1st or 2nd). This makes node Y and node Z react 
inconsistently (although both reactions are correct under the original Paxos 
algorithm).


One way to avoid these inconsistent reactions might be for each node to decide 
whether to repair a Paxos round based on the complete view of the live nodes 
rather than on the first quorum of responses; then, even if node X's response 
arrives last with an in-progress value, node Y would still repair the Paxos 
round.
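
To make that idea concrete, here is a toy, self-contained sketch (Java 17; not Cassandra's actual prepare handling, and every name in it is made up) contrasting a decision taken on the first quorum of prepare responses with one taken on the complete set of responses from live replicas:

{code:java}
import java.util.List;
import java.util.Optional;

// A prepare response may carry the most recently accepted-but-uncommitted
// proposal known to that replica.
record PrepareResponse(String replica, Optional<String> inProgressValue) {}

public class RepairDecision {

    // Quorum-based decision (simplified): whether we repair depends on WHICH
    // replicas happen to answer first.
    static Optional<String> decideOnFirstQuorum(List<PrepareResponse> arrivalOrder, int quorum) {
        return arrivalOrder.stream()
                .limit(quorum)
                .map(PrepareResponse::inProgressValue)
                .flatMap(Optional::stream)
                .findFirst();                         // repair this value if present
    }

    // Alternative sketched above: consider every live replica, so a late
    // response carrying an in-progress value is not ignored.
    static Optional<String> decideOnAllLiveReplicas(List<PrepareResponse> allLiveResponses) {
        return allLiveResponses.stream()
                .map(PrepareResponse::inProgressValue)
                .flatMap(Optional::stream)
                .findFirst();
    }

    public static void main(String[] args) {
        // Node X accepted value_1='A', but its prepare response arrives last.
        List<PrepareResponse> arrival = List.of(
                new PrepareResponse("Y", Optional.empty()),
                new PrepareResponse("Z", Optional.empty()),
                new PrepareResponse("X", Optional.of("value_1='A'")));
        System.out.println(decideOnFirstQuorum(arrival, 2));   // Optional.empty -> no repair
        System.out.println(decideOnAllLiveReplicas(arrival));  // Optional[value_1='A'] -> repair
    }
}
{code}

The obvious trade-off is that waiting for every live replica costs latency and changes availability, so this is only meant to illustrate the idea, not to claim it is the right fix.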

> CAS Reads Inconsistencies 
> --
>
> Key: CASSANDRA-12126
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12126
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: sankalp kohli
>Priority: Major
>  Labels: LWT
>
> While looking at the CAS code in Cassandra, I found a potential issue with 
> CAS Reads. Here is how it can happen with RF=3:
> 1) You issue a CAS Write and it fails in the propose phase. Machine A replies 
> true to the propose and saves the commit in its accepted field. The other two 
> machines, B and C, do not get to the accept phase. 
> The current state is that machine A has this commit in its paxos table as 
> accepted but not committed, and B and C do not. 
> 2) Issue a CAS Read and it goes to only B and C. You won't be able to read the 
> value written in step 1. This step behaves as if nothing is in flight. 
> 3) Issue another CAS Read and it goes to A and B. Now we will discover that 
> there is something in flight from A and will propose and commit it with the 
> current ballot. Now we can read the value written in step 1 as part of this 
> CAS read.
> If we skip step 3 and instead run step 4, we will never learn about the value 
> written in step 1. 
> 4) Issue a CAS Write and it involves only B and C. This will succeed and 
> commit a different value than step 1. The step 1 value will never be seen 
> again and was never seen before. 
> If you read Lamport's "Paxos Made Simple" paper, section 2.3 talks about this 
> issue: how learners can find out whether a majority of the acceptors have 
> accepted a proposal. 
> In step 3, it is correct that we propose the value again, since we don't know 
> whether it was accepted by a majority of acceptors. When we ask a majority of 
> acceptors, and more than one acceptor but not a majority has something in 
> flight, we have no way of knowing whether it was accepted by a majority of 
> acceptors. So this behavior is correct. 
> However, we need to fix step 2, since it causes reads to not be linearizable 
> with respect to writes and other reads. In this case, we know that a majority 
> of acceptors have no in-flight commit, which means a majority has told us that 
> nothing was accepted by a majority. I think we should run a propose step here 
> with an empty commit, and that will cause the write from step 1 to never be 
> visible afterwards. 
> With this fix, we will either see the data written in step 1 on the next 
> serial read or never see it, which is what we want. 




[jira] [Commented] (CASSANDRA-12126) CAS Reads Inconsistencies

2018-09-27 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630688#comment-16630688
 ] 

Jeffrey F. Lukman commented on CASSANDRA-12126:
---

During testing with our model checker, we limit the number of Paxos rounds for 
each query; otherwise we can get stuck in a very long sequence of messages 
among the nodes without making any progress. So we execute only one round of 
Paxos per query.

To shed more light on our test and tie the whole story together, here is what 
happened in detail:
 * We first prepared the 3-node cluster with the test.tests table as the 
initial table structure and yes, the initial table began with:
{name:'testing', owner:'user_1', value_1:null, value_2:null, value_3:null}
 * Next, we run the model checker, which starts the 3-node cluster.
 * We inject the 3 client requests in order: query 1, then query 2, then query 3.
This makes query 1's ballot number < query 2's ballot number < query 3's 
ballot number.
 * This means that, at the beginning, the model checker already sees 9 prepare 
messages in its queue that can be delivered in any order (see the reordering 
sketch below).
 * When the bug manifests, we end up with:
 ** Node X's prepare messages proceed and all nodes respond true to node X.
 ** Node X sends its propose message with value_1='A' to itself first and gets 
a true response as well.
 ** At this moment, node X's inProgress value is updated to the proposed value, 
value_1='A'.
 ** But then node Y's prepare messages proceed and all nodes respond true to 
node Y, because node Y's prepare messages carry a higher ballot number.
 ** When node Y is about to send its propose messages, it realizes that the 
current data does not satisfy the IF condition, so it does not send them. --> 
Client request 2 to node Y is therefore rejected.
 ** Node X's remaining propose messages to nodes Y and Z are then both answered 
with false.
 ** At this point node X could retry the Paxos with a higher ballot number, but 
since we limit each query to one Paxos round, client request 1 to node X times 
out.
 ** Lastly, node Z sends its prepare messages to all nodes and gets true 
responses from all of them, because its ballot number is higher still.
 ** At this point, if node X's prepare response is returned to node Z first, 
node Z realizes that node X still has an in-progress value (value_1='A'). This 
causes node Z to send propose and commit messages, but for client request 1, 
using the current highest ballot number.
Here we have our first data update saved: value_1='A', value_2=null, 
value_3=null.
 ** Because of our one-Paxos-round-per-query constraint, client request-3 is 
not retried once it reaches timeout.
 * To sum up:
 ** client request-1: timed out
 ** client request-2: rejected
 ** client request-3: timed out

So we get an inconsistency between the client side and the server side: all 
requests failed from the client's point of view, but when we read the end 
result back from all nodes, we get value_1='A', value_2=null, value_3=null.
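
As for the reordering sketch referenced above: here is a minimal, self-contained illustration (plain Java; this is not our actual tool, which is not open to the public) of what it means for a model checker to explore delivery orders of the 9 queued prepare messages:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class ReorderSketch {

    // Recursively enumerate every delivery order of the pending messages.
    static void explore(List<String> pending, List<String> deliveredSoFar) {
        if (pending.isEmpty()) {
            System.out.println(deliveredSoFar);       // one complete delivery order
            return;
        }
        for (int i = 0; i < pending.size(); i++) {
            List<String> rest = new ArrayList<>(pending);
            String next = rest.remove(i);
            List<String> delivered = new ArrayList<>(deliveredSoFar);
            delivered.add(next);
            explore(rest, delivered);
        }
    }

    public static void main(String[] args) {
        // 3 coordinators x 3 replicas = 9 pending PREPARE messages, as in the scenario above.
        List<String> prepares = new ArrayList<>();
        for (String from : List.of("X", "Y", "Z"))
            for (String to : List.of("X", "Y", "Z"))
                prepares.add("PREPARE " + from + "->" + to);
        explore(prepares, new ArrayList<>());         // 9! = 362,880 orders for this step alone
    }
}
{code}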

 

I made a wrong statement at the end of my first comment:
{quote}9. Therefore, we ended up having client request 1 stored to the server, 
although client request-3 was the one that is said successful.
{quote}
It should say that client request-3 failed due to timeout.

[jira] [Comment Edited] (CASSANDRA-12126) CAS Reads Inconsistencies

2018-09-27 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629685#comment-16629685
 ] 

Jeffrey F. Lukman edited comment on CASSANDRA-12126 at 9/27/18 3:19 PM:


To complete our scenario, here is our Cassandra setup:
 We run the scenario with Cassandra v2.0.15.
 Here is the schema that we use:
 * CREATE KEYSPACE test WITH REPLICATION = \{'class': 'SimpleStrategy', 
'replication_factor': 3};
 * CREATE TABLE tests ( name text PRIMARY KEY, owner text, value_1 text, 
value_2 text, value_3 text);

Here are the queries that we submit:
 * client request to node X (1st): UPDATE test.tests SET value_1 = 'A' WHERE 
name = 'testing' IF owner = 'user_1';
 * client request to node Y (2nd): UPDATE test.tests SET value_2 = 'B' WHERE 
name = 'testing' IF value_1 = 'A';
 * client request to node Z (3rd): UPDATE test.tests SET value_3 = 'C' WHERE 
name = 'testing' IF value_1 = 'A';

To confirm: when the bug manifests, the end result is value_1='A', 
value_2=null, value_3=null.

[~jjirsa], regarding our tool: at this point it is not open to the public.


was (Author: jeffreyflukman):
To complete our scenario, here is the setup for our Cassandra:
We run the scenario with Cassandra-v2.0.15.
Here is the scheme that we use:
 * 
CREATE KEYSPACE test WITH REPLICATION = \{'class': 'SimpleStrategy', 
'replication_factor': 3};
 * 
CREATE TABLE tests ( name text PRIMARY KEY, owner text, value_1 text, value_2 
text, value_3 text);

Here are the queries that we submit:
 * client request to node X (1st): UPDATE test.tests SET value_1 = 'A' WHERE 
name = 'testing' IF owner = 'user_1';
 * client request to node Y (2nd): UPDATE test.tests SET value_2 = 'B' WHERE 
name = 'testing' IF value_1 = 'A';
 * client request to node Z (3rd): UPDATE test.tests SET value_3 = 'C' WHERE 
name = 'testing' IF value_1 = 'A';

To confirm, when the bug is manifested, the end result will be: value_1='A', 
value_2=null, value_3=null



[~jjirsa], regarding our tool, at this point, it is not open for public. 





[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node

2018-09-27 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630602#comment-16630602
 ] 

Jeffrey F. Lukman commented on CASSANDRA-12438:
---

Yes, we also performed reads after all messages related to the client requests 
had been executed, to verify consistency among the nodes.

We run this query:

SELECT * FROM test.tests WHERE name = 'cass-12438';

We executed this query against each node using cqlsh.
When the bug manifests, nodes X and Y return the expected result, while node Z 
returns the buggy result.
Therefore, the data are inconsistent among the nodes.
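
For completeness, the same per-node check can be scripted. The sketch below is an assumption about how one could do it, not part of our tool: it uses the DataStax Java driver 3.x, made-up node addresses, and a WhiteListPolicy so that each node in turn coordinates the CL=ONE read, mirroring the per-node cqlsh check above:

{code:java}
import java.net.InetSocketAddress;
import java.util.Collections;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.WhiteListPolicy;

public class PerNodeRead {
    public static void main(String[] args) {
        String[] nodes = {"10.0.0.1", "10.0.0.2", "10.0.0.3"};   // hypothetical addresses of X, Y, Z
        for (String node : nodes) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint(node)
                    .withLoadBalancingPolicy(new WhiteListPolicy(new RoundRobinPolicy(),
                            Collections.singletonList(new InetSocketAddress(node, 9042))))
                    .build();
                 Session session = cluster.connect()) {
                SimpleStatement stmt = new SimpleStatement(
                        "SELECT * FROM test.tests WHERE name = 'cass-12438'");
                stmt.setConsistencyLevel(ConsistencyLevel.ONE);  // read coordinated by this node only
                Row row = session.execute(stmt).one();
                System.out.println(node + " -> " + row);
            }
        }
    }
}
{code}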

> Data inconsistencies with lightweight transactions, serial reads, and 
> rejoining node
> 
>
> Key: CASSANDRA-12438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12438
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Steven Schaefer
>Priority: Major
>
> I've run into some issues with data inconsistency in a situation where a 
> single node is rejoining a 3-node cluster with RF=3. I'm running 3.7.
> I have a client system which inserts data into a table with around 7 columns, 
> named, let's say, A-F, id, and version. LWTs are used to make the inserts and 
> updates.
> Typically what happens is there's an insert of values id, V_a1, V_b1, ..., 
> version=1, then another process will pick up rows with for example A=V_a1 and 
> subsequently update A to V_a2 and version=2. Yet another process will watch 
> for A=V_a2 to then make a second update to the same column, and set version 
> to 3, with end result being <id, V_a3, V_b1, ..., version=3>. There's a 
> secondary index on this A column (there are only a few possible values for A, 
> so I'm not worried about the cardinality issue), though I've reproduced this 
> with the new SASI index too.
> If one of the nodes is down, there's still 2 alive for quorum so inserts can 
> still happen. When I bring up the downed node, sometimes I get really weird 
> state back which ultimately crashes the client system that's talking to 
> Cassandra. 
> When reading I always select all the columns, but there is a conditional 
> WHERE clause that A=V_a2 (e.g. SELECT * FROM table WHERE A=V_a2). This read 
> is for processing any rows with V_a2, and ultimately updating to V_a3 when 
> complete. While periodically polling for A=V_a2 it is of course possible for 
> the poller to observe the old V_a2 value while the other parts of the 
> client system process and make the update to V_a3, and that's generally OK 
> because of the LWTs used for updates; an occasionally wasted reprocessing run 
> isn't a big deal, but when reading at SERIAL I always expect to get the 
> original values for columns that were never updated too. If a Paxos update is 
> in progress then I expect it to be completed before its value(s) are returned. 
> But instead, the read seems to be seeing the partial commit of the LWT, 
> returning the old V_a2 value for the changed column, but no values whatsoever 
> for the other columns. From the example above, instead of getting <id, V_a3, 
> V_b1, ..., version=3>, or even the older <id, V_a2, V_b1, ..., version=2> 
> (either of which I expect and are OK), I get only <id, V_a2>, so the rest of 
> the columns end up null, which I never expect. However this isn't persistent; 
> Cassandra does end up consistent, which I see via sstabledump and cqlsh after 
> the fact.
> In my client system logs I record the insert / updates, and this 
> inconsistency happens around the same time as the update from V_a2 to V_a3, 
> hence my comment about Cassandra seeing a partial commit. So that leads me to 
> suspect that, perhaps due to the WHERE clause in my read query for A=V_a2, 
> one of the original good nodes already has the new V_a3 value, so it 
> doesn't return this row for the select query, but the other good node and the 
> one that was down still have the old value V_a2, so those 2 nodes return what 
> they have. The one that was down doesn't yet have the original insert, just 
> the update from V_a1 -> V_a2 (again I suspect; it's not been easy to verify), 
> which would explain where <id, V_a2> comes from: that's all it knows about. 
> However, since it's a serial quorum read, I'd expect some sort of exception, 
> as neither of the remaining 2 nodes with A=V_a2 would be able to come to a 
> quorum on the values for all the columns, and I'd expect the other good node 
> to return <id, V_a2, V_b1, ..., version=2>.
> I know at some point nodetool repair should be run on this node, but I'm 
> concerned about a window of time between when the node comes back up and 
> repair starts/completes. It almost seems like if a node goes down the safest 
> bet is to remove it from the cluster and rebuild, instead of simply 
> restarting the node? However I haven't tested that to see if it runs into a 
> similar situation.
> It is of course possib

[jira] [Commented] (CASSANDRA-12126) CAS Reads Inconsistencies

2018-09-26 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629685#comment-16629685
 ] 

Jeffrey F. Lukman commented on CASSANDRA-12126:
---

To complete our scenario, here is our Cassandra setup:
We run the scenario with Cassandra v2.0.15.
Here is the schema that we use:
 * CREATE KEYSPACE test WITH REPLICATION = \{'class': 'SimpleStrategy', 
'replication_factor': 3};
 * CREATE TABLE tests ( name text PRIMARY KEY, owner text, value_1 text, value_2 
text, value_3 text);

Here are the queries that we submit:
 * client request to node X (1st): UPDATE test.tests SET value_1 = 'A' WHERE 
name = 'testing' IF owner = 'user_1';
 * client request to node Y (2nd): UPDATE test.tests SET value_2 = 'B' WHERE 
name = 'testing' IF value_1 = 'A';
 * client request to node Z (3rd): UPDATE test.tests SET value_3 = 'C' WHERE 
name = 'testing' IF value_1 = 'A';

To confirm: when the bug manifests, the end result is value_1='A', 
value_2=null, value_3=null.



[~jjirsa], regarding our tool: at this point it is not open to the public.







[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node

2018-09-26 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629632#comment-16629632
 ] 

Jeffrey F. Lukman commented on CASSANDRA-12438:
---

Hi [~benedict],

Following the bug description, we integrated our model checker with 
Cassandra v3.7.
We grabbed the code from the GitHub repository.

Regarding the schema, here is the initial schema we prepared before injecting 
any queries in the model checker's path execution:
 * CREATE KEYSPACE test WITH REPLICATION = \{'class': 'SimpleStrategy', 
'replication_factor': 3};
 * CREATE TABLE tests (name text PRIMARY KEY, owner text, value_1 text, value_2 
text, value_3 text, value_4 text, value_5 text, value_6 text, value_7 text);

Regarding the operations/queries, here are the details of them:
 * Client Request 1: INSERT INTO test.tests (name, owner, value_1, value_2, 
value_3, value_4, value_5, value_6, value_7) VALUES ('cass-12438', 'user_1', 
'A1', 'B1', 'C1', 'D1', 'E1', 'F1', 'G1') IF NOT EXISTS;
 * Client Request 2: UPDATE test.tests SET value_1 = 'A2', owner = 'user_2' 
WHERE name = 'cass-12438' IF owner = 'user_1';
 * Client Request 3: UPDATE test.tests SET value_1 = 'A3', owner = 'user_3' 
WHERE name = 'cass-12438' IF owner = 'user_2';

The messages generated by these queries are the ones that the model checker 
controls and reorders, which is how we ended up reproducing this bug.


[jira] [Comment Edited] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node

2018-09-26 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629434#comment-16629434
 ] 

Jeffrey F. Lukman edited comment on CASSANDRA-12438 at 9/27/18 1:38 AM:


Hi all,

 

Our team from UCARE, University of Chicago, has been able to reproduce this bug 
consistently with our model checker.
Here are the workload and the scenario of the bug:

Workload: a 3-node cluster (let's call the nodes X, Y, and Z), 1 crash and 
1 reboot event, with 3 client requests (node X is the coordinator for all 
client requests).

Scenario:
 # Start the 3 nodes and set CONSISTENCY = ONE.
 # Inject client request 1 as described in this bug description:
Insert <id, V_a1, V_b1, ..., version=1> (along with many others)
 # But before node X has sent any PREPARE messages, node Z crashes.
 # Client request 1 is successfully committed on nodes X and Y.
 # Reboot node Z.
 # Inject client requests 2 & 3 as described in this bug description:
Update <id, V_a2, ..., version=2> (along with others, for which A=V_a1)
Update <id, V_a3, ..., version=3> (along with many others, for which A=V_a2)
(** Update 3 can also be omitted if we want to simplify the bug scenario)
 # If only client request-2 finished, then we expect to see:
<id, V_a2, V_b1, ..., version=2>

If client request-2 and then client request-3 are committed, then we expect 
to see:
<id, V_a3, V_b1, ..., version=3>

The remaining possibility is that both client request-2 and request-3 failed 
and reached timeout, in which case we expect to see:
<id, V_a1, V_b1, ..., version=1>

 # But our model checker shows that, if we do a read request to node Z, we see:
<id, V_a2> // some fields are null
But if we do a read request to node X or Y, we get a complete result
(as expected in step 7).
This means we end up with an inconsistent view among the nodes (X and Y differ 
from Z).

If we run this scenario with CONSISTENCY.ALL, we do not see this bug.

We are happy to assist you in debugging this issue.


was (Author: jeffreyflukman):
Hi all,

 

Our team from UCARE University of Chicago, have been able to reproduce this bug 
consistently with our model checker.
Here are the workload and scenario of the bug:

Workload: 3 node-cluster (let's call them node X, Y, and Z), 1 Crash, 1 Reboot 
events, with 3 client requests (where node X will be the coordinator node for 
all client requests.

Scenario:
 # Start the 3 nodes and setup the CONSISTENCY = ONE.
 # Inject client request 1 as described in this bug description:
Insert  (along with many others)
 # But before any PREPARE messages have been sent by the node X, node Z has 
crashed.
 # Client request 1 is successfully committed in node X and Y.
 # Reboot node Z.
 # Inject client request 2 & 3 as described in this bug description:
Update  (along with others for which A=V_a1)
Update  (along with many others for which 
A=V_a2)
(**Although Update 3 can also be ignored if we want to simplify the bug 
scenario)
 # If client request-2 finished first without being interfered by client 
request-3, then we expect to see:

If the client request-3 interfere client request-2 or is executed before client 
request-2 for any reason, then we expect to see:

 # But our model checker shows that, if we do a read request to node Z, then we 
will see: // some fields are null
But if we do a read request to node X or Y, then we will get a complete result.
Which means we end up in an inconsistency view among the nodes.

If we run this scenario with CONSISTENCY.ALL we will not see this bug to happen.

We are happy to assist you guys to debug this issue.


[jira] [Commented] (CASSANDRA-12126) CAS Reads Inconsistencies

2018-09-26 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629455#comment-16629455
 ] 

Jeffrey F. Lukman commented on CASSANDRA-12126:
---

Hi all,

Our team from UCARE, University of Chicago, has been able to consistently 
reproduce a similar manifestation of this bug with our model checker. (Our 
scenario is different from the one [~kohlisankalp] proposed.)
Here are the workload and the scenario of the bug:

Workload: a 3-node cluster, 3 client requests (no crash event)

Scenario:
 # Start the 3-node cluster and inject the 3 client requests to 3 different 
nodes (nodes X, Y, Z).
 # Node X sends its prepare messages (ballot number=1) to all nodes and all 
nodes accept them.
 # Node X sends its propose message to itself, causing its inProgress value to 
become "X".
 # Node Y sends its prepare messages (ballot number=2) to all nodes.
This also invalidates the rest of node X's propose messages, because their 
ballot number is smaller than that of node Y's prepare messages.
 # In our scenario, the prepare responses from nodes Y and Z arrive before the 
prepare response from node X, so node Y does not recognize that node X has 
already accepted value "X" (step 3).
 # But since the query of client request 2 has an IF condition (IF value_1='X'), 
node Y does not go on to send propose messages to all nodes.
Up to this point, none of the queries has been committed to the server.
 # Node Z now sends its prepare messages to all nodes and all nodes accept them.
 # In our scenario, node X now returns its response first, which also lets 
node Z know about node X's inProgress value "X".
From here, node Z will propose and commit client request-1 (with value "X") 
instead of client request-3.
 # Therefore, we ended up having client request 1 stored on the server, 
although client request-3 was the one reported as successful.

We are ready to assist if any further information is needed.

 

 







[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node

2018-09-26 Thread Jeffrey F. Lukman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629434#comment-16629434
 ] 

Jeffrey F. Lukman commented on CASSANDRA-12438:
---

Hi all,

 

Our team from UCARE, University of Chicago, has been able to reproduce this bug 
consistently with our model checker.
Here are the workload and the scenario of the bug:

Workload: a 3-node cluster (let's call the nodes X, Y, and Z), 1 crash and 
1 reboot event, with 3 client requests (node X is the coordinator for all 
client requests).

Scenario:
 # Start the 3 nodes and set CONSISTENCY = ONE.
 # Inject client request 1 as described in this bug description:
Insert <id, V_a1, V_b1, ..., version=1> (along with many others)
 # But before node X has sent any PREPARE messages, node Z crashes.
 # Client request 1 is successfully committed on nodes X and Y.
 # Reboot node Z.
 # Inject client requests 2 & 3 as described in this bug description:
Update <id, V_a2, ..., version=2> (along with others, for which A=V_a1)
Update <id, V_a3, ..., version=3> (along with many others, for which A=V_a2)
(** Update 3 can also be omitted if we want to simplify the bug scenario)
 # If client request-2 finished first, without being interfered with by client 
request-3, then we expect to see:

If client request-3 interferes with client request-2, or is executed before 
client request-2 for any reason, then we expect to see:

 # But our model checker shows that, if we do a read request to node Z, we 
see: <id, V_a2> // some fields are null
But if we do a read request to node X or Y, we get a complete result.
This means we end up with an inconsistent view among the nodes.

If we run this scenario with CONSISTENCY.ALL, we do not see this bug.

We are happy to assist you in debugging this issue.


[jira] [Commented] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-09 Thread Jeffrey F. Lukman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277100#comment-15277100
 ] 

Jeffrey F. Lukman commented on CASSANDRA-11724:
---

[~jeromatron]: okay, I will try this again and report back later on whether 
this config makes a difference.

For now, can you help me by confirming whether you also see the Workload-4 bug?
Workload-4: run a 512-node cluster with some data, then decommission a node.
On our side, we see a high number of false failure detections.

> False Failure Detection in Big Cassandra Cluster
> 
>
> Key: CASSANDRA-11724
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Jeffrey F. Lukman
>  Labels: gossip, node-failure
> Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, 
> Workload4.jpg, experiment-result.txt
>
>
> We are running some testing on Cassandra v2.2.5 stable in a big cluster. The 
> setting in our testing is that each machine has 16 cores and runs 8 Cassandra 
> instances, and we test with 32, 64, 128, 256, and 512 instances of 
> Cassandra. We use the default number of vnodes for each instance, which is 
> 256. The data and log directories are on an in-memory tmpfs file system.
> We run several types of workloads on this Cassandra cluster:
> Workload1: Just start the cluster
> Workload2: Start half of the cluster, wait until it gets into a stable 
> condition, and start the other half of the cluster
> Workload3: Start half of the cluster, wait until it gets into a stable 
> condition, load some data, and start the other half of the cluster
> Workload4: Start the cluster, wait until it gets into a stable condition, 
> load some data, and decommission one node
> For this testing, we measure the total number of false failure detections 
> inside the cluster. By false failure detection we mean that, for example, 
> instance-1 marks instance-2 down, but instance-2 is not down. We dug deeper 
> into the root cause and found that instance-1 had not received any heartbeat 
> from instance-2 after some time, because instance-2 was running a long 
> computation.
> Here I attach the graphs of each workload result.
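
A minimal, self-contained sketch of the failure mode described above (Cassandra's real detector is the gossip-driven phi-accrual FailureDetector, not this toy timeout check; all names below are made up): a peer that stalls in a long local computation stops heartbeating and can be marked down even though its process is alive.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ToyFailureDetector {
    private final Map<String, Long> lastHeartbeatMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    ToyFailureDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void onHeartbeat(String peer) {
        lastHeartbeatMillis.put(peer, System.currentTimeMillis());
    }

    boolean isAlive(String peer) {
        Long last = lastHeartbeatMillis.get(peer);
        return last != null && System.currentTimeMillis() - last <= timeoutMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        ToyFailureDetector fd = new ToyFailureDetector(2_000);
        fd.onHeartbeat("instance-2");
        // instance-2 now stalls in a long computation (e.g. applying a large amount
        // of gossip state) and sends no heartbeats for a while...
        Thread.sleep(3_000);
        // prints false -> instance-1 would now (falsely) mark instance-2 down
        System.out.println("instance-2 alive? " + fd.isAlive("instance-2"));
    }
}
{code}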





[jira] [Comment Edited] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-07 Thread Jeffrey F. Lukman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275388#comment-15275388
 ] 

Jeffrey F. Lukman edited comment on CASSANDRA-11724 at 5/7/16 9:17 PM:
---

{quote}
Are you not waiting two minutes between starting each node to join the ring?
{quote}

Hi Jeremy,

No, I haven't waited 2 minutes between starting each node. Are you saying that, 
even when I want to bootstrap a new cluster, say a 512-node cluster, I have to 
add the nodes one by one and wait 2 minutes between nodes? That means that to 
bootstrap 512 nodes I would need to wait 511 * 2 minutes = 1022 minutes, 
roughly *17 hours*?

What about the decommission-one-node problem?
In our 4th workload, we first started X nodes, waited until the cluster was 
stable, and then decommissioned a node.
And we still see a large number of false failure detections. For 512 nodes, we 
measured around *90,000+* false failure detections.


was (Author: jeffreyflukman):
{quote}
Are you not waiting two minutes between starting each node to join the ring?
{quote}

Hi Jeremy,

No, I haven't waited for 2 minutes for starting each node. So do you say, even 
when I want to bootstrap a new cluster, let's say I want to bootstrap a 
512-nodes cluster, I have to add the nodes one by one and between adding the 
nodes, I have to wait for 2 minutes? It means, when I want to bootstrap a 
512-nodes, I need to wait for 511 * 2 minutes = 1022 minutes = *17+ hours*?

What about the decommission one node problem?
In our 4th workload, our workload shown that we first started X nodes, we 
waited until it is stable, then we decommissioned a node.
And we still see a big number of false failure detection. For 512 nodes, we 
measured around *90,000+* false failure detection.






[jira] [Commented] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-07 Thread Jeffrey F. Lukman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275388#comment-15275388
 ] 

Jeffrey F. Lukman commented on CASSANDRA-11724:
---

{quote}
Are you not waiting two minutes between starting each node to join the ring?
{quote}

Hi Jeremy,

No, I haven't waited 2 minutes between starting each node. Are you saying that, 
even when I want to bootstrap a new cluster, say a 512-node cluster, I have to 
add the nodes one by one and wait 2 minutes between nodes? That means that to 
bootstrap 512 nodes I would need to wait 511 * 2 minutes = 1022 minutes, 
roughly *17+ hours*?

What about the decommission-one-node problem?
In our 4th workload, we first started X nodes, waited until the cluster was 
stable, and then decommissioned a node.
And we still see a large number of false failure detections. For 512 nodes, we 
measured around *90,000+* false failure detections.






[jira] [Comment Edited] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-07 Thread Jeffrey F. Lukman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275362#comment-15275362
 ] 

Jeffrey F. Lukman edited comment on CASSANDRA-11724 at 5/7/16 8:06 PM:
---

Hi Sylvain,

I've tried to dig deeper into why this bug happens at large cluster scale.
I found that when we start 256 Cassandra instances (or more) simultaneously, 
the {{applyStateLocally()}} function of Gossiper.java can take a long time in 
the init phase.
On some nodes it can take *60+ seconds*.
The worst {{applyStateLocally()}} running time that I have seen so far is 
*120 seconds*.
The average worst running time of {{applyStateLocally()}} across all nodes is 
*28.5 seconds*.
I believe these numbers get even worse when bootstrapping 512 Cassandra 
instances.

I'm still trying to understand why this happens and why it only happens in 
the initialization phase.

I attached the running-time records from one of my experiments in the file 
experiment-result.txt.
Note: this time, I ran the experiment on 16 machines with 16 instances per 
machine; these were the machines I could get for my last experiment.
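
For reference, a hypothetical timing wrapper of the kind used to collect these numbers (the real measurement points are inside Gossiper.applyStateLocally(), which is not reproduced here; the threshold and names below are made up):

{code:java}
public class ApplyStateTimer {
    static final long WARN_THRESHOLD_MS = 1_000;

    // Time an arbitrary piece of work and warn when it runs unusually long.
    static void timed(String label, Runnable work) {
        long start = System.nanoTime();
        work.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (elapsedMs > WARN_THRESHOLD_MS) {
            System.out.println(label + " took " + elapsedMs
                    + " ms; gossip heartbeats are not processed meanwhile");
        }
    }

    public static void main(String[] args) {
        // Stand-in for the real applyStateLocally() call.
        timed("applyStateLocally", () -> {
            try { Thread.sleep(1_500); } catch (InterruptedException ignored) { }
        });
    }
}
{code}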


was (Author: jeffreyflukman):
Hi Sylvain,

I've tried to go deeper on why this bug happen in big scale of cluster.
I found out that when we start 256 Cassandra instances (or more) simultaneously,
the {{applyStateLocally()}} function of Gossiper.java can take a long time in 
the init phase.
In some nodes, I saw that some of the nodes can take *60+ seconds*. 
The worst {{applyStateLocally()}} running time that I see so far is *120 
seconds*. 
The average worst running time of {{applyStateLocally()}} from all nodes is 
*28.5 seconds*.
I believe this number can get even worst in bootstrapping 512 Cassandra 
instances.

I'm still trying to understand why this can happen and why this only happen in 
the initialization phase.

I attached the running time records that I have from one of my experiment in 
file : experiment-result.txt.






[jira] [Issue Comment Deleted] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-07 Thread Jeffrey F. Lukman (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey F. Lukman updated CASSANDRA-11724:
--
Comment: was deleted

(was: Hi Sylvain,

I've tried to dig deeper into why this bug happens at large cluster scale.
I found out that when we start 256 Cassandra instances (or more) simultaneously, 
the {{applyStateLocally()}} function in Gossiper.java can take a long time during 
the init phase.
On some nodes, I saw {{applyStateLocally()}} take *60+ seconds*.
The worst {{applyStateLocally()}} running time I have seen so far is *120 
seconds*.
The average of the per-node worst {{applyStateLocally()}} running times across 
all nodes is *28.5 seconds*.
I believe this number can get even worse when bootstrapping 512 Cassandra 
instances.

I'm still trying to understand why this can happen and why it only happens in 
the initialization phase.

I attached the running-time records from one of my experiments.)






[jira] [Comment Edited] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-07 Thread Jeffrey F. Lukman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275362#comment-15275362
 ] 

Jeffrey F. Lukman edited comment on CASSANDRA-11724 at 5/7/16 7:52 PM:
---

Hi Sylvain,

I've tried to dig deeper into why this bug happens at large cluster scale.
I found out that when we start 256 Cassandra instances (or more) simultaneously, 
the {{applyStateLocally()}} function in Gossiper.java can take a long time during 
the init phase.
On some nodes, I saw {{applyStateLocally()}} take *60+ seconds*.
The worst {{applyStateLocally()}} running time I have seen so far is *120 
seconds*.
The average of the per-node worst {{applyStateLocally()}} running times across 
all nodes is *28.5 seconds*.
I believe this number can get even worse when bootstrapping 512 Cassandra 
instances.

I'm still trying to understand why this can happen and why it only happens in 
the initialization phase.

I attached the running-time records from one of my experiments in the file 
experiment-result.txt.


was (Author: jeffreyflukman):
I ran 256 Cassandra instances and measured the running times of the Gossiper 
functions. Here is the bottleneck function that I observed in my experiment.






[jira] [Updated] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-07 Thread Jeffrey F. Lukman (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey F. Lukman updated CASSANDRA-11724:
--
Attachment: experiment-result.txt

I ran 256 Cassandra instances and measured the running times of the Gossiper 
functions. Here is the bottleneck function that I observed in my experiment.
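
As an illustration of how such per-call running times can be captured, the sketch 
below wraps a gossip-stage task with plain wall-clock timing. The class name, the 
placeholder task and the log output are assumptions for the example only, not the 
actual Cassandra code path or the instrumentation used in this experiment.

{code:java}
import java.util.concurrent.TimeUnit;

// Illustrative only: wall-clock timing around a gossip-stage task. The placeholder
// Runnable stands in for the real work (e.g. applying remote endpoint states);
// this is not the actual Cassandra code path.
public final class TimedGossipTask
{
    static void runTimed(String label, Runnable task)
    {
        long startNanos = System.nanoTime();
        try
        {
            task.run();
        }
        finally
        {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            System.out.println(label + " took " + elapsedMs + " ms");
        }
    }

    public static void main(String[] args)
    {
        runTimed("applyStateLocally", () -> {
            try
            {
                Thread.sleep(50); // placeholder for the real gossip work
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });
    }
}
{code}

In a real measurement the elapsed time would more likely be appended to a 
per-node record (such as experiment-result.txt) than printed to stdout.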






[jira] [Comment Edited] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-07 Thread Jeffrey F. Lukman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275359#comment-15275359
 ] 

Jeffrey F. Lukman edited comment on CASSANDRA-11724 at 5/7/16 7:49 PM:
---

Hi Sylvain,

I've tried to dig deeper into why this bug happens at large cluster scale.
I found out that when we start 256 Cassandra instances (or more) simultaneously, 
the {{applyStateLocally()}} function in Gossiper.java can take a long time during 
the init phase.
On some nodes, I saw {{applyStateLocally()}} take *60+ seconds*.
The worst {{applyStateLocally()}} running time I have seen so far is *120 
seconds*.
The average of the per-node worst {{applyStateLocally()}} running times across 
all nodes is *28.5 seconds*.
I believe this number can get even worse when bootstrapping 512 Cassandra 
instances.

I'm still trying to understand why this can happen and why it only happens in 
the initialization phase.

I attached the running-time records from one of my experiments.


was (Author: jeffreyflukman):
Hi Sylvain,

I've tried to dig deeper into why this bug happens at large cluster scale.
I found out that when we start 256 Cassandra instances (or more) simultaneously, 
the {{applyStateLocally()}} function in Gossiper.java can take a long time during 
the init phase.
On some nodes, I saw {{applyStateLocally()}} take *60+ seconds*.
The worst {{applyStateLocally()}} running time I have seen so far is *120 
seconds*.
The average of the per-node worst {{applyStateLocally()}} running times across 
all nodes is *28.5 seconds*.
I believe this number can get even worse when bootstrapping 512 Cassandra 
instances.

I'm still trying to understand why this can happen and why it only happens in 
the initialization phase.

> False Failure Detection in Big Cassandra Cluster
> 
>
> Key: CASSANDRA-11724
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Jeffrey F. Lukman
>  Labels: gossip, node-failure
> Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, 
> Workload4.jpg
>
>
> We are running some testing on Cassandra v2.2.5 stable in a big cluster. In 
> our setup each machine has 16 cores and runs 8 Cassandra instances, and we 
> test with 32, 64, 128, 256, and 512 Cassandra instances in total. We use the 
> default number of vnodes for each instance, which is 256. The data and log 
> directories are on an in-memory tmpfs file system.
> We run several types of workloads on this Cassandra cluster:
> Workload1: Just start the cluster
> Workload2: Start half of the cluster, wait until it gets into a stable 
> condition, and start the other half of the cluster
> Workload3: Start half of the cluster, wait until it gets into a stable 
> condition, load some data, and start the other half of the cluster
> Workload4: Start the cluster, wait until it gets into a stable condition, 
> load some data, and decommission one node
> For this testing, we measure the total number of false failure detections 
> inside the cluster. By false failure detection we mean that, for example, 
> instance-1 marks instance-2 as down even though instance-2 is not down. We 
> dug deeper into the root cause and found that instance-1 had not received 
> any heartbeat from instance-2 for some time because instance-2 was running a 
> long computation.
> Here I attach the graphs of each workload's results.





[jira] [Commented] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-07 Thread Jeffrey F. Lukman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275359#comment-15275359
 ] 

Jeffrey F. Lukman commented on CASSANDRA-11724:
---

Hi Sylvain,

I've tried to dig deeper into why this bug happens at large cluster scale.
I found out that when we start 256 Cassandra instances (or more) simultaneously, 
the {{applyStateLocally()}} function in Gossiper.java can take a long time during 
the init phase.
On some nodes, I saw {{applyStateLocally()}} take *60+ seconds*.
The worst {{applyStateLocally()}} running time I have seen so far is *120 
seconds*.
The average of the per-node worst {{applyStateLocally()}} running times across 
all nodes is *28.5 seconds*.
I believe this number can get even worse when bootstrapping 512 Cassandra 
instances.

I'm still trying to understand why this can happen and why it only happens in 
the initialization phase.






[jira] [Updated] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-05 Thread Jeffrey F. Lukman (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey F. Lukman updated CASSANDRA-11724:
--
Description: 
We are running some testing on Cassandra v2.2.5 stable in a big cluster. In our 
setup each machine has 16 cores and runs 8 Cassandra instances, and we test with 
32, 64, 128, 256, and 512 Cassandra instances in total. We use the default number 
of vnodes for each instance, which is 256. The data and log directories are on an 
in-memory tmpfs file system.
We run several types of workloads on this Cassandra cluster:
Workload1: Just start the cluster
Workload2: Start half of the cluster, wait until it gets into a stable 
condition, and start the other half of the cluster
Workload3: Start half of the cluster, wait until it gets into a stable 
condition, load some data, and start the other half of the cluster
Workload4: Start the cluster, wait until it gets into a stable condition, load 
some data, and decommission one node

For this testing, we measure the total number of false failure detections inside 
the cluster. By false failure detection we mean that, for example, instance-1 
marks instance-2 as down even though instance-2 is not down. We dug deeper into 
the root cause and found that instance-1 had not received any heartbeat from 
instance-2 for some time because instance-2 was running a long computation.

Here I attach the graphs of each workload's results.
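
To make the metric concrete: a false failure detection is counted whenever an 
instance reports a peer as down while that peer's process is in fact alive. The 
sketch below shows one possible way to derive that count from per-instance views; 
the class name, the sample views and the ground-truth set are invented for 
illustration and are not the tooling used in the experiment.

{code:java}
import java.util.Map;
import java.util.Set;

// Illustrative only: counting "false failure detections" -- an instance reports a
// peer DOWN while that peer is actually alive. Views and ground truth are sample data.
public final class FalseDetectionCount
{
    public static void main(String[] args)
    {
        // each instance's view of its peers: peer id -> reported state
        Map<String, Map<String, String>> viewPerInstance = Map.of(
                "instance-1", Map.of("instance-2", "DOWN", "instance-3", "UP"),
                "instance-2", Map.of("instance-1", "UP",   "instance-3", "UP"),
                "instance-3", Map.of("instance-1", "UP",   "instance-2", "UP"));

        // ground truth: which processes were really running at observation time
        Set<String> actuallyAlive = Set.of("instance-1", "instance-2", "instance-3");

        long falseDetections = viewPerInstance.values().stream()
                .flatMap(view -> view.entrySet().stream())
                .filter(e -> "DOWN".equals(e.getValue()) && actuallyAlive.contains(e.getKey()))
                .count();

        System.out.println("false failure detections: " + falseDetections); // 1 in this sample
    }
}
{code}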

  was:
We are running some testing on Cassandra v2.2.5 stable in a big cluster. In our 
setup each machine has 16 cores and runs 8 Cassandra instances, and we test with 
32, 64, 128, 256, and 512 Cassandra instances in total. We use the default number 
of vnodes for each instance, which is 256. The data and log directories are on an 
in-memory tmpfs file system.
We run several types of workloads on this Cassandra cluster:
Workload1: Just start the cluster
Workload2: Start half of the cluster, wait until it gets into a stable 
condition, and start the other half of the cluster
Workload3: Start half of the cluster, wait until it gets into a stable 
condition, load some data, and start the other half of the cluster
Workload4: Start the cluster, wait until it gets into a stable condition, and 
decommission one node

For this testing, we measure the total number of false failure detections inside 
the cluster. By false failure detection we mean that, for example, instance-1 
marks instance-2 as down even though instance-2 is not down. We dug deeper into 
the root cause and found that instance-1 had not received any heartbeat from 
instance-2 for some time because instance-2 was running a long computation.

Here I attach the graphs of each workload's results.







[jira] [Updated] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-05 Thread Jeffrey F. Lukman (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey F. Lukman updated CASSANDRA-11724:
--
Description: 
We are running some testing on Cassandra v2.2.5 stable in a big cluster. The 
setting in our testing is that each machine has 16-cores and runs 8 cassandra 
instances, and our testing is 32, 64, 128, 256, and 512 instances of Cassandra. 
We use the default number of vnodes for each instance which is 256. The data 
and log directories are on in-memory tmpfs file system.
We run several types of workloads on this Cassandra cluster:
Workload1: Just start the cluster
Workload2: Start half of the cluster, wait until it gets into a stable 
condition, and run another half of the cluster
Workload3: Start half of the cluster, wait until it gets into a stable 
condition, load some data, and run another half of the cluster
Workload4: Start the cluster, wait until it gets into a stable condition and 
decommission one node

For this testing, we measure the total numbers of false failure detection 
inside the cluster. By false failure detection, we mean that, for example, 
instance-1 marks the instance-2 down, but the instance-2 is not down. We dig 
deeper into the root cause and find out that instance-1 has not received any 
heartbeat after some time from instance-2 because the instance-2 run a long 
computation process.

Here I attach the graphs of each workload result.

  was:
We are running some testing on Cassandra v2.2.5 stable in a big cluster. In our 
setup each machine runs 8 Cassandra instances, and we test with 32, 64, 128, 256, 
and 512 Cassandra instances in total. We use the default number of vnodes for 
each instance, which is 256. The data and log directories are on an in-memory 
tmpfs file system.
We run several types of workloads on this Cassandra cluster:
Workload1: Just start the cluster
Workload2: Start half of the cluster, wait until it gets into a stable 
condition, and start the other half of the cluster
Workload3: Start half of the cluster, wait until it gets into a stable 
condition, load some data, and start the other half of the cluster
Workload4: Start the cluster, wait until it gets into a stable condition, and 
decommission one node

For this testing, we measure the total number of false failure detections inside 
the cluster. By false failure detection we mean that, for example, instance-1 
marks instance-2 as down even though instance-2 is not down. We dug deeper into 
the root cause and found that instance-1 had not received any heartbeat from 
instance-2 for some time because instance-2 was running a long computation.

Here I attach the graphs of each workload's results.


> False Failure Detection in Big Cassandra Cluster
> 
>
> Key: CASSANDRA-11724
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Jeffrey F. Lukman
>  Labels: gossip, node-failure
> Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, 
> Workload4.jpg
>
>
> We are running some testing on Cassandra v2.2.5 stable in a big cluster. In 
> our setup each machine has 16 cores and runs 8 Cassandra instances, and we 
> test with 32, 64, 128, 256, and 512 Cassandra instances in total. We use the 
> default number of vnodes for each instance, which is 256. The data and log 
> directories are on an in-memory tmpfs file system.
> We run several types of workloads on this Cassandra cluster:
> Workload1: Just start the cluster
> Workload2: Start half of the cluster, wait until it gets into a stable 
> condition, and start the other half of the cluster
> Workload3: Start half of the cluster, wait until it gets into a stable 
> condition, load some data, and start the other half of the cluster
> Workload4: Start the cluster, wait until it gets into a stable condition, and 
> decommission one node
> For this testing, we measure the total number of false failure detections 
> inside the cluster. By false failure detection we mean that, for example, 
> instance-1 marks instance-2 as down even though instance-2 is not down. We 
> dug deeper into the root cause and found that instance-1 had not received 
> any heartbeat from instance-2 for some time because instance-2 was running a 
> long computation.
> Here I attach the graphs of each workload's results.





[jira] [Updated] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-05 Thread Jeffrey F. Lukman (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey F. Lukman updated CASSANDRA-11724:
--
Fix Version/s: (was: 2.2.5)

> False Failure Detection in Big Cassandra Cluster
> 
>
> Key: CASSANDRA-11724
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Jeffrey F. Lukman
>  Labels: gossip, node-failure
> Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, 
> Workload4.jpg
>
>
> We are running some testing on Cassandra v2.2.5 stable in a big cluster. In 
> our setup each machine runs 8 Cassandra instances, and we test with 32, 64, 
> 128, 256, and 512 Cassandra instances in total. We use the default number of 
> vnodes for each instance, which is 256. The data and log directories are on 
> an in-memory tmpfs file system.
> We run several types of workloads on this Cassandra cluster:
> Workload1: Just start the cluster
> Workload2: Start half of the cluster, wait until it gets into a stable 
> condition, and start the other half of the cluster
> Workload3: Start half of the cluster, wait until it gets into a stable 
> condition, load some data, and start the other half of the cluster
> Workload4: Start the cluster, wait until it gets into a stable condition, and 
> decommission one node
> For this testing, we measure the total number of false failure detections 
> inside the cluster. By false failure detection we mean that, for example, 
> instance-1 marks instance-2 as down even though instance-2 is not down. We 
> dug deeper into the root cause and found that instance-1 had not received 
> any heartbeat from instance-2 for some time because instance-2 was running a 
> long computation.
> Here I attach the graphs of each workload's results.





[jira] [Created] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster

2016-05-05 Thread Jeffrey F. Lukman (JIRA)
Jeffrey F. Lukman created CASSANDRA-11724:
-

 Summary: False Failure Detection in Big Cassandra Cluster
 Key: CASSANDRA-11724
 URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Jeffrey F. Lukman
 Fix For: 2.2.5
 Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, Workload4.jpg

We are running some testing on Cassandra v2.2.5 stable in a big cluster. In our 
setup each machine runs 8 Cassandra instances, and we test with 32, 64, 128, 256, 
and 512 Cassandra instances in total. We use the default number of vnodes for 
each instance, which is 256. The data and log directories are on an in-memory 
tmpfs file system.
We run several types of workloads on this Cassandra cluster:
Workload1: Just start the cluster
Workload2: Start half of the cluster, wait until it gets into a stable 
condition, and start the other half of the cluster
Workload3: Start half of the cluster, wait until it gets into a stable 
condition, load some data, and start the other half of the cluster
Workload4: Start the cluster, wait until it gets into a stable condition, and 
decommission one node

For this testing, we measure the total number of false failure detections inside 
the cluster. By false failure detection we mean that, for example, instance-1 
marks instance-2 as down even though instance-2 is not down. We dug deeper into 
the root cause and found that instance-1 had not received any heartbeat from 
instance-2 for some time because instance-2 was running a long computation.

Here I attach the graphs of each workload's results.
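
The root cause described above maps directly onto the failure detector's 
behaviour: Cassandra uses a phi-accrual detector, so the longer a node goes 
without seeing a peer's heartbeat, the higher phi climbs, and once phi exceeds 
the convict threshold (8 by default) the peer is marked down. The sketch below is 
a deliberately simplified, self-contained version of that idea, not the actual 
{{FailureDetector}} implementation; the window handling, units and sample 
timings are assumptions for the example.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified sketch of the phi-accrual idea: track recent heartbeat inter-arrival
// times and compute phi from the time since the last heartbeat, using the common
// exponential approximation phi = (t / mean inter-arrival) / ln(10). This is not
// Cassandra's FailureDetector; window handling and units are simplified.
public final class PhiSketch
{
    private static final double PHI_FACTOR = 1.0 / Math.log(10.0);

    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private long lastHeartbeatMs = -1;

    void heartbeat(long nowMs)
    {
        if (lastHeartbeatMs >= 0)
        {
            intervalsMs.addLast(nowMs - lastHeartbeatMs);
            if (intervalsMs.size() > 1000)
                intervalsMs.removeFirst();   // keep a bounded window of samples
        }
        lastHeartbeatMs = nowMs;
    }

    double phi(long nowMs)
    {
        if (intervalsMs.isEmpty())
            return 0.0;
        double meanMs = intervalsMs.stream().mapToLong(Long::longValue).average().orElse(1.0);
        return PHI_FACTOR * (nowMs - lastHeartbeatMs) / meanMs;
    }

    public static void main(String[] args)
    {
        PhiSketch detector = new PhiSketch();
        long now = 0;
        for (int i = 0; i < 30; i++)             // ~1 s heartbeats while the peer is healthy
            detector.heartbeat(now += 1_000);

        System.out.printf("phi after  2 s of silence: %.1f%n", detector.phi(now + 2_000));   // well below 8
        System.out.printf("phi after 60 s of silence: %.1f%n", detector.phi(now + 60_000));  // far above 8 -> convicted
    }
}
{code}

Under this simplified model, with roughly 1-second heartbeats phi crosses 8 after 
about 18-19 seconds of silence, so an instance that stalls for 60+ seconds while 
processing gossip can easily convict peers that are perfectly healthy.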



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)