[jira] [Comment Edited] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-08-23 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139527#comment-16139527
 ] 

Michael Fong edited comment on CASSANDRA-11748 at 8/24/17 4:08 AM:
---

Hi, [~mbyrd]

Thanks for looking further into this problem. And yes, the above-mentioned items 
were the symptoms we observed (and suspected) when the OOM issue happened. I 
fully agree with your analysis of this particular problem - thanks for answering 
the questions so thoroughly! If we can locate the logs from back then, we will 
check and share them on this ticket. 

Thanks again for working on this issue!


was (Author: mcfongtw):
Hi, [~mbyrd]

Thanks for looking further into this problem. And yes, the above-mentioned items 
were the symptoms we observed (and suspected) when the OOM issue happened. I 
fully agree with your analysis of this particular problem - thanks for answering 
the questions so thoroughly! If we have located the logs from back then, we will 
check and share them on this ticket. 

Thanks again for working on this issue!

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes in deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple guideline of our rolling upgrade process
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> The node2  keeps submitting the migration task over 100+ times to the other 
> node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information and then 
> receives and submits migration tasks: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have 200+ column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.
> P.S.2: The over-requested schema migration tasks eventually have 
> InternalResponseStage perform a schema merge operation. Since each merge 
> requires a compaction, the responses are much slower to consume than to 
> receive. Thus, the backlog of incoming schema migration payloads consumes all 
> of the heap space and ultimately ends in OOM!






[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-08-23 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139527#comment-16139527
 ] 

Michael Fong commented on CASSANDRA-11748:
--

Hi, [~mbyrd]

Thanks for looking further into this problem. And yes, the above-mentioned items 
were the symptoms we observed (and suspected) when the OOM issue happened. I 
fully agree with your analysis of this particular problem - thanks for answering 
the questions so thoroughly! If we have located the logs from back then, we will 
check and share them on this ticket. 

Thanks again for working on this issue!

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes in deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple guideline of our rolling upgrade process
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> The node2  keeps submitting the migration task over 100+ times to the other 
> node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information and then 
> receives and submits migration tasks: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have 200+ column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.
> P.S.2: The over-requested schema migration tasks eventually have 
> InternalResponseStage perform a schema merge operation. Since each merge 
> requires a compaction, the responses are much slower to consume than to 
> receive. Thus, the backlog of incoming schema migration payloads consumes all 
> of the heap space and ultimately ends in OOM!






[jira] [Comment Edited] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-08-10 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122718#comment-16122718
 ] 

Michael Fong edited comment on CASSANDRA-11748 at 8/11/17 2:35 AM:
---

Hi, guys, 

Thanks for putting some time on this issue, and this is an awesome discussion 
thread. 

When we reported this issue a year ago, we ended up patching C* (v2.0) with an 
approach similar to CASSANDRA-13569 (a rough sketch of that kind of per-endpoint 
cap follows after the list below), but we later found it was not addressing the 
root problem, just piling patches on top of one another as time went by. In my 
humble opinion, I am not sure we want many more kinds of soft/hard caps to 
reduce the risk of running into OOM. Instead, we could probably look deeper into 
the causes behind the current working model, such as: 
1. Migration checks and requests are fired asynchronously, and all the messages 
pile up at the receiver end, which merges the schema one by one in 
{code:java}Schema.instance.mergeSchemaAndAnnounceVersion(){code}
2. The responding node sends the receiver a complete copy of the schema, instead 
of a delta computed from the diff between the two nodes.
3. Last but not least, the most mysterious problem leading to OOM, which we 
could not figure out back then, is that hundreds of migration tasks are all 
fired nearly simultaneously, within ~2 s. The number of RPCs does not match the 
number of nodes in the cluster, but is close to the number of seconds the node 
took to reboot. 
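For illustration only, here is a minimal sketch of that kind of per-endpoint cap 
(this is not Cassandra's actual MigrationManager code; the class and method names 
are made up): at most one pending pull per endpoint, no matter how many gossip 
rounds report the same mismatch.
{code:java}
import java.net.InetAddress;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DedupedSchemaPuller
{
    private final Set<InetAddress> inFlight = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();

    /** Called every time gossip reports a schema mismatch for an endpoint. */
    public void maybeSchedulePull(InetAddress endpoint, long delayMillis, Runnable pullOnce)
    {
        // Only the first mismatch report per endpoint schedules a task; the
        // hundreds of follow-up reports are ignored until that task finishes.
        if (!inFlight.add(endpoint))
            return;

        executor.schedule(() -> {
            try
            {
                pullOnce.run();
            }
            finally
            {
                inFlight.remove(endpoint);
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
    }
}
{code}
The obvious trade-off, as discussed above, is what happens when that single 
in-flight pull fails.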

Maybe there are already other tickets addressing these items that I am not 
aware of. 

Thanks.

Michael Fong


was (Author: mcfongtw):
Hi, guys, 

Thanks for putting some time on this issue, and this is an awesome discussion 
thread. 

When we reported this issue a year ago, we ended up patching C* (v2.0) with an 
approach similar to CASSANDRA-13569, but we later found it was not addressing 
the root problem, just piling patches on top of one another as time went by. In 
my humble opinion, I am not sure we want many more kinds of soft/hard caps to 
reduce the risk of running into OOM. Instead, we could probably look deeper into 
the causes behind the current working model, such as: 
1. Migration checks and requests are fired asynchronously, and all the messages 
pile up at the receiver end, which merges the schema one by one in 
{code:java}Schema.instance.mergeAndAnnounceVersion(){code}
2. The responding node sends the receiver a complete copy of the schema, instead 
of a delta computed from the diff between the two nodes.
3. Last but not least, the most mysterious problem leading to OOM, which we 
could not figure out back then, is that hundreds of migration tasks are all 
fired nearly simultaneously, within ~2 s. The number of RPCs does not match the 
number of nodes in the cluster, but is close to the number of seconds the node 
took to reboot. 

Maybe there are already other tickets addressing these items that I am not 
aware of. 

Thanks.

Michael Fong

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes in deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple guideline of our rolling upgrade process
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> The node2  keeps submitting the migration task over 100+ times to the 

[jira] [Comment Edited] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-08-10 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122718#comment-16122718
 ] 

Michael Fong edited comment on CASSANDRA-11748 at 8/11/17 2:34 AM:
---

Hi, guys, 

Thanks for putting some time on this issue, and this is an awesome discussion 
thread. 

When we reported this issue a year ago, we ended up patching C* (v2.0) with an 
approach similar to CASSANDRA-13569, but we later found it was not addressing 
the root problem, just piling patches on top of one another as time went by. In 
my humble opinion, I am not sure we want many more kinds of soft/hard caps to 
reduce the risk of running into OOM. Instead, we could probably look deeper into 
the causes behind the current working model, such as: 
1. Migration checks and requests are fired asynchronously, and all the messages 
pile up at the receiver end, which merges the schema one by one in 
{code:java}Schema.instance.mergeAndAnnounceVersion(){code}
2. The responding node sends the receiver a complete copy of the schema, instead 
of a delta computed from the diff between the two nodes.
3. Last but not least, the most mysterious problem leading to OOM, which we 
could not figure out back then, is that hundreds of migration tasks are all 
fired nearly simultaneously, within ~2 s. The number of RPCs does not match the 
number of nodes in the cluster, but is close to the number of seconds the node 
took to reboot. 

Maybe there are already other tickets addressing these items that I am not 
aware of. 

Thanks.

Michael Fong


was (Author: mcfongtw):
Hi, guys, 

Thanks for putting some time on this issue, and this is an awesome discussion 
thread. 

When we reported this issue a year ago, we ended up patching C* (v2.0) with an 
approach similar to CASSANDRA-13569, but we later found it was not addressing 
the root problem, just piling patches on top of one another as time went by. In 
my humble opinion, I am not sure we want many more kinds of soft/hard caps to 
reduce the risk of running into OOM. Instead, we could probably look deeper into 
the causes behind the current working model, such as: 
1. Migration checks and requests are fired asynchronously, and all the messages 
pile up at the receiver end, which merges the schema one by one in 
{code:java}
Schema.instance.mergeAndAnnounceVersion()
{code}
2. The responding node sends the receiver a complete copy of the schema, instead 
of a delta computed from the diff between the two nodes.
3. Last but not least, the most mysterious problem leading to OOM, which we 
could not figure out back then, is that hundreds of migration tasks are all 
fired nearly simultaneously, within ~2 s. The number of RPCs does not match the 
number of nodes in the cluster, but is close to the number of seconds the node 
took to reboot. 

Maybe there are already other tickets addressing these items that I am not 
aware of. 

Thanks.

Michael Fong

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes in deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple guideline of our rolling upgrade process
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> The node2  keeps submitting the migration task over 100+ times to the 

[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-08-10 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122718#comment-16122718
 ] 

Michael Fong commented on CASSANDRA-11748:
--

Hi, guys, 

Thanks for putting some time on this issue, and this is an awesome discussion 
thread. 

When we reported this issue a year ago, we ended up patching C* (v2.0) with an 
approach similar to CASSANDRA-13569, but we later found it was not addressing 
the root problem, just piling patches on top of one another as time went by. In 
my humble opinion, I am not sure we want many more kinds of soft/hard caps to 
reduce the risk of running into OOM. Instead, we could probably look deeper into 
the causes behind the current working model, such as: 
1. Migration checks and requests are fired asynchronously, and all the messages 
pile up at the receiver end, which merges the schema one by one in 
{code:java}
Schema.instance.mergeAndAnnounceVersion()
{code}
2. The responding node sends the receiver a complete copy of the schema, instead 
of a delta computed from the diff between the two nodes.
3. Last but not least, the most mysterious problem leading to OOM, which we 
could not figure out back then, is that hundreds of migration tasks are all 
fired nearly simultaneously, within ~2 s. The number of RPCs does not match the 
number of nodes in the cluster, but is close to the number of seconds the node 
took to reboot. 

Maybe there are already other tickets addressing these items that I am not 
aware of. 

Thanks.

Michael Fong

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes in deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple guideline of our rolling upgrade process
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> The node2  keeps submitting the migration task over 100+ times to the other 
> node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information and then 
> receives and submits migration tasks: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have 200+ column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.
> P.S.2: The over-requested schema migration tasks will eventually have 
> InternalResponseStage perform a schema merge operation. Since each merge 
> requires a compaction, the responses are much slower to consume. Thus, the 
> 

[jira] [Comment Edited] (CASSANDRA-13569) Schedule schema pulls just once per endpoint

2017-06-23 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061094#comment-16061094
 ] 

Michael Fong edited comment on CASSANDRA-13569 at 6/23/17 11:44 PM:


Hi, [~spo...@gmail.com]

I agree with you that even a ScheduledExecutor on MigrationTask would fail in 
rare cases. 

In CASSANDRA-11748, we had patched our own v2.0 source code with a similar idea 
that limits the schema pull to once per endpoint. However, we later observed a 
corner case: when two nodes with different schema versions boot up at the same 
time, one node runs slightly - a few seconds - faster than the other. The first 
node requests a schema pull, which fails because the other node has not yet 
finished initialization. 

There is a huge difference between the v2.0 and 3.x code bases, and I do not 
know if this corner case still exists. Here is the problematic code snippet for 
your reference: 
{code:java}
if (epState == null)
{code}
This check would probably not prevent the case above. Also, in your patch, if 
the ScheduledFuture reports its state as done, things could get much messier, 
since the schema migration would never happen. (A rough sketch of one way to 
guard against that follows below.) 
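Purely as an illustration of that concern - hypothetical code, not the actual 
patch or Cassandra internals; the version-lookup hooks are assumptions - one way 
to avoid "the migration never happens" is to re-check agreement when the 
one-shot pull completes and reschedule only if the versions still differ:
{code:java}
import java.net.InetAddress;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

public class RetryingSchemaPuller
{
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    private final Supplier<UUID> localVersion;               // our own schema version
    private final Function<InetAddress, UUID> remoteVersion; // version gossip last saw for the peer

    public RetryingSchemaPuller(Supplier<UUID> localVersion,
                                Function<InetAddress, UUID> remoteVersion)
    {
        this.localVersion = localVersion;
        this.remoteVersion = remoteVersion;
    }

    /** Pull once after the delay; if the peer still disagrees, try again later. */
    public void pullUntilAgreement(InetAddress endpoint, Runnable pullOnce, long delayMs)
    {
        executor.schedule(() -> {
            pullOnce.run();
            // The peer may simply not have finished starting up yet, so a failed
            // or fruitless pull reschedules itself instead of giving up for good.
            if (!localVersion.get().equals(remoteVersion.apply(endpoint)))
                pullUntilAgreement(endpoint, pullOnce, delayMs);
        }, delayMs, TimeUnit.MILLISECONDS);
    }
}
{code}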

Sincerely,

Michael Fong



was (Author: mcfongtw):
Hi, [~spo...@gmail.com]

I agree with you that even a ScheduledExecutor on MigrationTask would fail in 
rare cases. 

In CASSANDRA-11748, we had patched our own v2.0 source code with a similar idea 
that limits the schema pull to once per endpoint. However, we later observed a 
corner case: when two nodes with different schema versions boot up at the same 
time, one node runs slightly - a few seconds - faster than the other. The first 
node requests a schema pull, which fails because the other node has not yet 
finished initialization. 

There is a huge difference between the v2.0 and 3.x code bases, and I do not 
know if this corner case still persists. Here is the problematic code snippet 
for your reference: 
{code:java}
if (epState == null)
{code}
This check would probably not prevent the case above. Also, in your patch, if 
the ScheduledFuture reports its state as done, things could get much messier, 
since the schema migration would never happen. 

Sincerely,

Michael Fong


> Schedule schema pulls just once per endpoint
> 
>
> Key: CASSANDRA-13569
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13569
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Schema mismatches detected through gossip will get resolved by calling 
> {{MigrationManager.maybeScheduleSchemaPull}}. This method may decide to 
> schedule execution of {{MigrationTask}}, but only after using a 
> {{MIGRATION_DELAY_IN_MS = 6}} delay (for reasons unclear to me). 
> Meanwhile, as long as the migration task hasn't been executed, we'll continue 
> to have schema mismatches reported by gossip and will have corresponding 
> {{maybeScheduleSchemaPull}} calls, which will schedule further tasks with the 
> mentioned delay. Some local testing shows that dozens of tasks for the same 
> endpoint will eventually be executed, causing the same stormy behavior for 
> this very endpoint.
> My proposal would be to simply not schedule new tasks for the same endpoint, 
> in case we still have pending tasks waiting for execution after 
> {{MIGRATION_DELAY_IN_MS}}.






[jira] [Commented] (CASSANDRA-13569) Schedule schema pulls just once per endpoint

2017-06-23 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061094#comment-16061094
 ] 

Michael Fong commented on CASSANDRA-13569:
--

Hi, [~spo...@gmail.com]

I agree with you that even a ScheduledExecutor on MigrationTask would fail in 
rare cases. 

In CASSANDRA-11748, we had patched our own v2.0 source code with a similar idea 
that limits the schema pull to once per endpoint. However, we later observed a 
corner case: when two nodes with different schema versions boot up at the same 
time, one node runs slightly - a few seconds - faster than the other. The first 
node requests a schema pull, which fails because the other node has not yet 
finished initialization. 

There is a huge difference between the v2.0 and 3.x code bases, and I do not 
know if this corner case still persists. Here is the problematic code snippet 
for your reference: 
{code:java}
if (epState == null)
{code}
This check would probably not prevent the case above. Also, in your patch, if 
the ScheduledFuture reports its state as done, things could get much messier, 
since the schema migration would never happen. 

Sincerely,

Michael Fong


> Schedule schema pulls just once per endpoint
> 
>
> Key: CASSANDRA-13569
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13569
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Schema mismatches detected through gossip will get resolved by calling 
> {{MigrationManager.maybeScheduleSchemaPull}}. This method may decide to 
> schedule execution of {{MigrationTask}}, but only after using a 
> {{MIGRATION_DELAY_IN_MS = 6}} delay (for reasons unclear to me). 
> Meanwhile, as long as the migration task hasn't been executed, we'll continue 
> to have schema mismatches reported by gossip and will have corresponding 
> {{maybeScheduleSchemaPull}} calls, which will schedule further tasks with the 
> mentioned delay. Some local testing shows that dozens of tasks for the same 
> endpoint will eventually be executed, causing the same stormy behavior for 
> this very endpoint.
> My proposal would be to simply not schedule new tasks for the same endpoint, 
> in case we still have pending tasks waiting for execution after 
> {{MIGRATION_DELAY_IN_MS}}.






[jira] [Comment Edited] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-06-22 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060262#comment-16060262
 ] 

Michael Fong edited comment on CASSANDRA-11748 at 6/23/17 1:06 AM:
---

Hi, [~mbyrd], 


Thanks for looking into this issue. 

If my memory serves me correctly, we observed that the number of schema 
migration request and response messages exchanged between two nodes is linearly 
related to: 
1. the # of gossip messages a node sent to the other node that had not yet been 
responded to, since the other node was in the process of restarting; 
2. the # of seconds during which the two nodes were blocked from internal 
communication.

It is also true that we had *a lot* of tables - a few hundred, secondary indices 
included - and that makes each round of schema migration more expensive. Our 
workaround was to add a throttle control on the # of schema migration tasks 
requested in the v2.0 source code, and that seemed to work just fine (a rough 
sketch of that kind of throttle follows below). This makes sense, as each schema 
migration task requested a full copy of the schema, as far as I remember. Hence, 
requesting migration 100+ times is likely inefficient, per se.

Last but not least, the root cause of ending up with a different schema version 
is still unknown; that is, a node's schema version is A, but it has B as its 
schema version after the C* instance restarts. This happens seemingly at random, 
and we are uncertain how to reproduce it. Our best guesses are (see the sketch 
below): 
1. Some input to the schema hash calculation differs after restarting the C* 
instance - maybe a timestamp. 
2. At the file system level, a schema migration was not successfully flushed 
onto disk before the process was killed.
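To make guess (1) concrete - as a hypothetical illustration only, not the real 
version calculation - the schema version is essentially a digest over the schema 
tables, so any restart-dependent input changes the resulting UUID even though 
the logical schema did not change:
{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

public class SchemaVersionSketch
{
    /** Digest a canonical, sorted rendering of the schema rows into a version UUID. */
    public static UUID version(Iterable<String> canonicalSchemaRows) throws NoSuchAlgorithmException
    {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (String row : canonicalSchemaRows)
            md5.update(row.getBytes(StandardCharsets.UTF_8));
        // If any row carries a restart-dependent value (a timestamp, a tombstone,
        // a different ordering), the UUID changes although the schema is the same.
        return UUID.nameUUIDFromBytes(md5.digest());
    }
}
{code}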

Thanks.

Michael Fong


was (Author: mcfongtw):
Hi, [~mbyrd], 


Thanks for looking into this issue. 

If my memory serves me correctly, we observed that the number of schema 
migration request and response messages exchanged between two nodes is linearly 
related to: 
1. the # of gossip messages a node sent to the other node that had not yet been 
responded to, since the other node was in the process of restarting; 
2. the # of seconds during which the two nodes were blocked from internal 
communication.

It is also true that we had *a lot* of tables - over 500 - and that makes each 
round of schema migration more expensive. Our workaround was to add a throttle 
control on the # of schema migration tasks requested in the v2.0 source code, 
and that seemed to work just fine. This makes sense, as each schema migration 
task requested a full copy of the schema, as far as I remember. Hence, 
requesting migration 100+ times is likely inefficient, per se.

Last but not least, the root cause of ending up with a different schema version 
is still unknown; that is, a node's schema version is A, but it has B as its 
schema version after the C* instance restarts. This happens seemingly at random, 
and we are uncertain how to reproduce it. Our best guesses are: 
1. Some input to the schema hash calculation differs after restarting the C* 
instance - maybe a timestamp. 
2. At the file system level, a schema migration was not successfully flushed 
onto disk before the process was killed.

Thanks.

Michael Fong

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes in deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple guideline of our rolling upgrade process
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After 

[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-06-22 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060262#comment-16060262
 ] 

Michael Fong commented on CASSANDRA-11748:
--

Hi, [~mbyrd], 


Thanks for looking into this issue. 

If my memory serves me correctly, we observed that the number of schema 
migration request and response messages exchanged between two nodes is linearly 
related to: 
1. the # of gossip messages a node sent to the other node that had not yet been 
responded to, since the other node was in the process of restarting; 
2. the # of seconds during which the two nodes were blocked from internal 
communication.

It is also true that we had *a lot* of tables - over 500 - and that makes each 
round of schema migration more expensive. Our workaround was to add a throttle 
control on the # of schema migration tasks requested in the v2.0 source code, 
and that seemed to work just fine. This makes sense, as each schema migration 
task requested a full copy of the schema, as far as I remember. Hence, 
requesting migration 100+ times is likely inefficient, per se.

Last but not least, the root cause of ending up with a different schema version 
is still unknown; that is, a node's schema version is A, but it has B as its 
schema version after the C* instance restarts. This happens seemingly at random, 
and we are uncertain how to reproduce it. Our best guesses are: 
1. Some input to the schema hash calculation differs after restarting the C* 
instance - maybe a timestamp. 
2. At the file system level, a schema migration was not successfully flushed 
onto disk before the process was killed.

Thanks.

Michael Fong

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes in deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple guideline of our rolling upgrade process
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> The node2  keeps submitting the migration task over 100+ times to the other 
> node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information and then 
> receives and submits migration tasks: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have 200+ column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.
> P.S.2: The over-requested schema migration tasks will eventually have 
> InternalResponseStage perform a schema merge operation. Since this 

[jira] [Updated] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2016-05-12 Thread Michael Fong (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Fong updated CASSANDRA-11748:
-
Description: 
We have observed multiple times when a multi-node C* (v2.0.17) cluster ran into 
OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 

Here is the simple guideline of our rolling upgrade process
1. Update the schema on a node, and wait until all nodes are in schema version 
agreement - via nodetool describecluster
2. Restart a Cassandra node
3. After the restart, there is a chance that the restarted node has a different 
schema version.
4. All nodes in the cluster start to rapidly exchange schema information, and 
any node could run into OOM. 

The following is the system.log output that occurred in one of our 2-node 
cluster test beds:
--
Before rebooting node 2:
Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java 
(line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java 
(line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

After rebooting node 2, 
Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b

The node2  keeps submitting the migration task over 100+ times to the other 
node.
INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
/192.168.88.33 has restarted, now UP
INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
Updating topology for /192.168.88.33
...
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) 
Submitting migration task for /192.168.88.33
... ( over 100+ times)
--
On the other hand, Node 1 keeps updating its gossip information and then 
receives and submits migration tasks: 

INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) 
InetAddress /192.168.88.34 is now UP
...
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
MigrationRequestVerbHandler.java (line 41) Received migration request from 
/192.168.88.34.
…… ( over 100+ times)
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
127) submitting migration task for /192.168.88.34
.  (over 50+ times)

As a side note, we have 200+ column families defined in the Cassandra 
database, which may be related to this amount of RPC traffic.

P.S.2: The over-requested schema migration tasks eventually have 
InternalResponseStage perform a schema merge operation. Since each merge 
requires a compaction, the responses are much slower to consume than to 
receive. Thus, the backlog of incoming schema migration payloads consumes all 
of the heap space and ultimately ends in OOM!
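To illustrate the producer/consumer imbalance described in P.S.2 - a hypothetical 
sketch only, not how Cassandra's stages are actually wired - if the merge stage 
drains payloads far more slowly than migration responses arrive and the queue is 
unbounded, the backlog lives on the heap; a bounded queue that rejects new 
payloads is one blunt way to cap it:
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedMergeBacklog<T>
{
    private final BlockingQueue<T> pending;

    public BoundedMergeBacklog(int capacity)
    {
        // Capacity bounds how much un-merged schema payload can sit on the heap.
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    /** Producer side (incoming migration responses): drop when the backlog is full. */
    public boolean offer(T schemaPayload)
    {
        return pending.offer(schemaPayload);
    }

    /** Consumer side (the slow merge-and-compact stage): drain one payload at a time. */
    public T take() throws InterruptedException
    {
        return pending.take();
    }
}
{code}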


  was:
We have observed multiple times when a multi-node C* (v2.0.17) cluster ran into 
OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 

Here is the simple guideline of our rolling upgrade process
1. Update the schema on a node, and wait until all nodes are in schema version 
agreement - via nodetool describecluster
2. Restart a Cassandra node
3. After the restart, there is a chance that the restarted node has a different 
schema version.
4. All nodes in the cluster start to rapidly exchange schema information, and 
any node could run into OOM. 

The following is the system.log output that occurred in one of our 2-node 
cluster test beds:
--
Before rebooting node 2:
Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java 
(line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java 
(line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

After rebooting node 2, 
Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b

The node2  keeps submitting the migration task over 100+ times to the other 
node.
INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
/192.168.88.33 has restarted, now UP
INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
Updating topology for /192.168.88.33
...
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) 
Submitting migration task for /192.168.88.33
... ( over 100+ times)
--
On the other hand, Node 1 keeps updating its gossip information and then 
receives and submits migration tasks: 

INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) 
InetAddress /192.168.88.34 is now UP
...
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
MigrationRequestVerbHandler.java 

[jira] [Comment Edited] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2016-05-12 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279442#comment-15279442
 ] 

Michael Fong edited comment on CASSANDRA-11748 at 5/12/16 8:06 AM:
---

The reason why the schema version changes after a restart is still unknown. 
However, having a different schema version and thereby flooding the Cassandra 
heap space seems pretty easy to reproduce. 
All we did was:
1. Block gossip communication between the nodes of a 2-node cluster via iptables.
2. Keep updating the schema on one node so that the schema versions differ.
3. Remove the firewall rule.
4. We then see the message storm of schema information being exchanged, and 
Cassandra may run into OOM if the heap size is small.

P.S. It seems somewhat related to the number of gossip messages accumulated. 
Once the firewall rule is removed, the gossip messages flood into the other 
node and eventually trigger StorageService.onAlive(), which schedules a schema 
pull for each gossip message handled. 


was (Author: michael.fong):
The reason why the schema version changes after a restart is still unknown. 
However, having a different schema version and thereby flooding the Cassandra 
heap space seems pretty easy to reproduce. 
All we did was:
1. Block gossip communication between the nodes of a 2-node cluster via iptables.
2. Keep updating the schema on one node so that the schema versions differ.
3. Remove the firewall rule.
4. We then see the message storm of schema information being exchanged, and 
Cassandra may run into OOM if the heap size is small.

P.S. It seems somewhat related to the number of schema changes; the more 
changes, the greater the scale of the message exchange.

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes in deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Priority: Critical
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple guideline of our rolling upgrade process
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> The node2  keeps submitting the migration task over 100+ times to the other 
> node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information and then 
> receives and submits migration tasks: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have 200+ column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.





[jira] [Comment Edited] (CASSANDRA-6862) Schema versions mismatch on large ring

2016-05-10 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279521#comment-15279521
 ] 

Michael Fong edited comment on CASSANDRA-6862 at 5/11/16 4:29 AM:
--

Hi,

We have recently seen a schema version mismatch issue while doing a rolling 
upgrade from 1.2.19 to 2.0.17. What is worse, the mismatch can lead to a rapid 
and massive exchange of schema information across nodes, and sometimes to a 
node OOM. I opened a ticket regarding this at CASSANDRA-11748.

Has anyone observed this heap over-usage from exchanging schema information before?


was (Author: michael.fong):
Hi,

We have recently seen a schema version mismatch issue while doing a rolling 
upgrade from 1.2.19 to 2.0.17. What is worse, the mismatch can lead to a rapid 
and massive exchange of schema information across nodes, and sometimes to a 
node OOM. I opened a ticket regarding this at CASSANDRA-11748.

Have you observed this heap over-usage from exchanging schema information before?

> Schema versions mismatch on large ring
> --
>
> Key: CASSANDRA-6862
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6862
> Project: Cassandra
>  Issue Type: Bug
> Environment: 2.0
>Reporter: Oleg Anastasyev
>Assignee: Oleg Anastasyev
> Fix For: 2.0.7
>
> Attachments: 6862-v2.txt, SchemaVersionLiveColumns.txt
>
>
> We have a large cluster with several hundred nodes in 1 ring. Sometimes, 
> especially after massive restarts, the schema versions reported on different 
> nodes mismatch.
> We investigated and found that the difference is not in the schema itself, but 
> in tombstones (different nodes have a different set of tombstones applied to 
> the system tables).
> Fixed by digesting the schema with the tombstones removed first.





[jira] [Commented] (CASSANDRA-6862) Schema versions mismatch on large ring

2016-05-10 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279521#comment-15279521
 ] 

Michael Fong commented on CASSANDRA-6862:
-

Hi,

We have recently seen a schema version mismatch issue while doing a rolling 
upgrade from 1.2.19 to 2.0.17. What is worse, the mismatch can lead to a rapid 
and massive exchange of schema information across nodes, and sometimes to a 
node OOM. I opened a ticket regarding this at CASSANDRA-11748.

Have you observed this heap over-usage from exchanging schema information before?

> Schema versions mismatch on large ring
> --
>
> Key: CASSANDRA-6862
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6862
> Project: Cassandra
>  Issue Type: Bug
> Environment: 2.0
>Reporter: Oleg Anastasyev
>Assignee: Oleg Anastasyev
> Fix For: 2.0.7
>
> Attachments: 6862-v2.txt, SchemaVersionLiveColumns.txt
>
>
> We have a large cluster with several hundred nodes in 1 ring. Sometimes, 
> especially after massive restarts, the schema versions reported on different 
> nodes mismatch.
> We investigated and found that the difference is not in the schema itself, but 
> in tombstones (different nodes have a different set of tombstones applied to 
> the system tables).
> Fixed by digesting the schema with the tombstones removed first.





[jira] [Comment Edited] (CASSANDRA-8165) Do not include tombstones in schema version computation

2016-05-10 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279499#comment-15279499
 ] 

Michael Fong edited comment on CASSANDRA-8165 at 5/11/16 4:12 AM:
--

Hi, 

We have recently seen a schema version mismatch issue while doing a rolling 
upgrade from 1.2.19 to 2.0.17. What is worse, the mismatch can lead to a rapid 
and massive exchange of schema information across nodes, and sometimes to a 
node OOM. I opened a ticket regarding this at CASSANDRA-11748.

Have you observed the OOM situation before?



was (Author: michael.fong):
Hi, 

We have recently seen a schema version mismatch issue while doing a rolling 
upgrade from 1.2.19 to 2.0.17. What is worse, the mismatch can lead to a rapid 
and massive exchange of schema information across nodes, and sometimes to a 
node OOM. I opened a ticket regarding this at 
https://issues.apache.org/jira/browse/CASSANDRA-11748


> Do not include tombstones in schema version computation
> ---
>
> Key: CASSANDRA-8165
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8165
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Vishy Kasar
>
> During 1.2.19 migration, we found the schema version mismatch issue. On 
> digging further, Sankalp found this was due to inclusion of the tombstones by 
> 1.2.19 due to increased gc_grace_seconds. 
> It seems wrong to include ephemeral data like tombstones in MD5 computation. 
> Can this be avoided?





[jira] [Commented] (CASSANDRA-8165) Do not include tombstones in schema version computation

2016-05-10 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279499#comment-15279499
 ] 

Michael Fong commented on CASSANDRA-8165:
-

Hi, 

We have recently seen a schema version mismatch issue while doing a rolling 
upgrade from 1.2.19 to 2.0.17. What is worse, the mismatch can lead to a rapid 
and massive exchange of schema information across nodes, and sometimes to a 
node OOM. I opened a ticket regarding this at 
https://issues.apache.org/jira/browse/CASSANDRA-11748


> Do not include tombstones in schema version computation
> ---
>
> Key: CASSANDRA-8165
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8165
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Vishy Kasar
>
> During 1.2.19 migration, we found the schema version mismatch issue. On 
> digging further, Sankalp found this was due to inclusion of the tombstones by 
> 1.2.19 due to increased gc_grace_seconds. 
> It seems wrong to include ephemeral data like tombstones in MD5 computation. 
> Can this be avoided?





[jira] [Updated] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2016-05-10 Thread Michael Fong (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Fong updated CASSANDRA-11748:
-
Priority: Critical  (was: Major)

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
>Reporter: Michael Fong
>Priority: Critical
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple guideline of our rolling upgrade process
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task to the other node over 100 times.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, then receives 
> and submits migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have over 200 column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2016-05-10 Thread Michael Fong (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Fong updated CASSANDRA-11748:
-
Environment: 
Rolling upgrade process from 1.2.19 to 2.0.17. 
CentOS 6.6
Occurred on different C* nodes across deployments of different scales (2G ~ 5G)

  was:
Rolling upgrade process from 1.2.19 to 2.0.17. 
CentOS 6.6


> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes across deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Priority: Critical
>
> We have observed multiple times that a multi-node C* (v2.0.17) cluster ran 
> into OOM at bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the outline of our rolling upgrade process:
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any of them could run into OOM. 
> The following is the system.log output from one of our 2-node cluster test 
> beds
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task to the other node over 100 times.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, then receives 
> and submits migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have over 200 column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2016-05-10 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279442#comment-15279442
 ] 

Michael Fong commented on CASSANDRA-11748:
--

The reason why the schema version changes after a restart is still unknown. 
However, getting different schema versions to flood the Cassandra heap seems 
fairly easy to reproduce. 
All we had to do was:
1. Block gossip communication between the nodes of a 2-node cluster via iptables
2. Keep updating the schema on one node so that the schema versions differ
3. Remove the firewall rule
4. Watch the message storm of schema-information exchange; Cassandra may run 
into OOM if the heap size is small.

P.S. It seems related to the number of schema changes; the more changes, the 
greater the scale of the message exchange.
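
For illustration only, here is a minimal, hypothetical sketch of the kind of 
per-endpoint deduplication that could keep such a storm bounded: a schema pull is 
submitted only when the remote version differs from the local one and no pull for 
that exact version is already in flight for that peer. The class and method names 
are invented; this is not the actual Cassandra migration code path.

import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch, not Cassandra source: bound the number of concurrent schema pulls per peer.
public class MigrationTaskDeduper
{
    private final UUID localSchemaVersion;
    private final Map<InetAddress, UUID> inFlight = new HashMap<>();

    public MigrationTaskDeduper(UUID localSchemaVersion)
    {
        this.localSchemaVersion = localSchemaVersion;
    }

    // Returns true only when a schema pull should actually be submitted for this endpoint.
    public synchronized boolean shouldSubmit(InetAddress endpoint, UUID remoteSchemaVersion)
    {
        if (remoteSchemaVersion.equals(localSchemaVersion))
            return false; // versions already agree, nothing to pull

        if (remoteSchemaVersion.equals(inFlight.get(endpoint)))
            return false; // a pull for this exact version is already in flight for this peer

        inFlight.put(endpoint, remoteSchemaVersion);
        return true;
    }

    // Called when the pull completes (or fails), so a later mismatch can trigger a new pull.
    public synchronized void onPullComplete(InetAddress endpoint)
    {
        inFlight.remove(endpoint);
    }
}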

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
>Reporter: Michael Fong
>
> We have observed multiple times that a multi-node C* (v2.0.17) cluster ran 
> into OOM at bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the outline of our rolling upgrade process:
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any of them could run into OOM. 
> The following is the system.log output from one of our 2-node cluster test 
> beds
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task to the other node over 100 times.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, then receives 
> and submits migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have over 200 column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2016-05-10 Thread Michael Fong (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Fong updated CASSANDRA-11748:
-
Since Version: 2.0.17

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
>Reporter: Michael Fong
>
> We have observed multiple times that a multi-node C* (v2.0.17) cluster ran 
> into OOM at bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the outline of our rolling upgrade process:
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any of them could run into OOM. 
> The following is the system.log output from one of our 2-node cluster test 
> beds
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task to the other node over 100 times.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, then receives 
> and submits migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have over 200 column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2016-05-10 Thread Michael Fong (JIRA)
Michael Fong created CASSANDRA-11748:


 Summary: Schema version mismatch may lead to Cassandra OOM at 
bootstrap during a rolling upgrade process
 Key: CASSANDRA-11748
 URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
 Project: Cassandra
  Issue Type: Bug
 Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
CentOS 6.6
Reporter: Michael Fong


We have observed multiple times that a multi-node C* (v2.0.17) cluster ran into 
OOM at bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 

Here is the outline of our rolling upgrade process:
1. Update the schema on a node, and wait until all nodes are in schema version 
agreement - via nodetool describecluster (a helper sketch for this check follows 
the log excerpt below)
2. Restart a Cassandra node
3. After the restart, there is a chance that the restarted node has a different 
schema version.
4. All nodes in the cluster start to rapidly exchange schema information, and any 
of them could run into OOM. 

The following is the system.log output from one of our 2-node cluster test beds
--
Before rebooting node 2:
Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java 
(line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java 
(line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

After rebooting node 2, 
Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b

Node 2 keeps submitting the migration task to the other node over 100 times.
INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
/192.168.88.33 has restarted, now UP
INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
Updating topology for /192.168.88.33
...
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) 
Submitting migration task for /192.168.88.33
... ( over 100+ times)
--
On the other hand, Node 1 keeps updating its gossip information, then receives 
and submits migration tasks afterwards: 

INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) 
InetAddress /192.168.88.34 is now UP
...
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
MigrationRequestVerbHandler.java (line 41) Received migration request from 
/192.168.88.34.
…… ( over 100+ times)
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
127) submitting migration task for /192.168.88.34
.  (over 50+ times)

As a side note, we have over 200 column families defined in the Cassandra 
database, which may be related to this amount of RPC traffic.
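
Since step 1 of the guideline above gates each restart on schema agreement, that 
check can be scripted. The helper below is an illustrative sketch only: it assumes 
nodetool is on the PATH and simply counts the distinct schema-version UUIDs that 
appear in the describecluster output, which may need adjusting to the exact output 
format of a given Cassandra version.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative operator helper (assumes nodetool is on PATH): returns true when
// `nodetool describecluster` output mentions exactly one schema-version UUID.
public class SchemaAgreementCheck
{
    private static final Pattern UUID_PATTERN = Pattern.compile(
        "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    public static boolean schemaInAgreement() throws IOException, InterruptedException
    {
        Process process = new ProcessBuilder("nodetool", "describecluster")
                              .redirectErrorStream(true)
                              .start();
        Set<String> versions = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8)))
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                Matcher m = UUID_PATTERN.matcher(line);
                while (m.find())
                    versions.add(m.group().toLowerCase()); // collect every schema version seen
            }
        }
        process.waitFor();
        return versions.size() == 1; // exactly one version listed across all reachable nodes
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(schemaInAgreement() ? "schema in agreement" : "schema mismatch");
    }
}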




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11624) Scrub does not seem to work on previously marked corrupted SSTables

2016-04-21 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251392#comment-15251392
 ] 

Michael Fong edited comment on CASSANDRA-11624 at 4/21/16 6:57 AM:
---

Looking at the source code of Cassandra-2.0.17, it seems the cause might come 
from the following logic:

From org.apache.cassandra.db.compaction.CompactionManager:
…
public void performScrub(ColumnFamilyStore cfStore, final boolean skipCorrupted,
                         final boolean checkData) throws InterruptedException, ExecutionException
{
    performAllSSTableOperation(cfStore, new AllSSTablesOperation()
    {
…
private void performAllSSTableOperation(final ColumnFamilyStore cfs,
                                        final AllSSTablesOperation operation) throws InterruptedException, ExecutionException
{
    final Iterable<SSTableReader> sstables = cfs.markAllCompacting();
…
org.apache.cassandra.db.ColumnFamilyStore:
…
public Iterable<SSTableReader> markAllCompacting()
{
    Callable<Iterable<SSTableReader>> callable = new Callable<Iterable<SSTableReader>>()
    {
        public Iterable<SSTableReader> call() throws Exception
        {
            assert data.getCompacting().isEmpty() : data.getCompacting();
            Iterable<SSTableReader> sstables = Lists.newArrayList(
                AbstractCompactionStrategy.filterSuspectSSTables(getSSTables()));
                // <- filters out all previously marked suspect SSTables
            if (Iterables.isEmpty(sstables))
                return null;
…
--
It seems scrub will not run if the SSTable was previously marked as corrupted 
(i.e. blacklisted); however, wouldn't this defeat the original purpose of the 
scrub operation?
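
To make the distinction concrete, the hypothetical sketch below contrasts the 
compaction-style selection that the quoted code performs (suspect SSTables filtered 
out) with a scrub-style selection that keeps them, since blacklisted files are 
exactly what scrub is meant to repair. The interface and method names are invented 
for illustration and are not the Cassandra API.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Cassandra API: contrast the two selection policies.
public class ScrubSelectionSketch
{
    // Minimal stand-in for an SSTable reader with a "suspect" (blacklisted) flag.
    public interface SSTable
    {
        boolean isMarkedSuspect();
    }

    // What the quoted markAllCompacting() path effectively does: drop suspect SSTables.
    public static List<SSTable> compactionSelection(List<SSTable> candidates)
    {
        List<SSTable> selected = new ArrayList<>();
        for (SSTable sstable : candidates)
            if (!sstable.isMarkedSuspect())
                selected.add(sstable);
        return selected;
    }

    // A scrub-oriented selection would keep every SSTable, including blacklisted ones,
    // because those are precisely the files the operator is trying to repair.
    public static List<SSTable> scrubSelection(List<SSTable> candidates)
    {
        return new ArrayList<>(candidates);
    }
}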



was (Author: michael.fong):
Looking at the source code of Cassandra-2.0.17, it seems the cause might come 
from the following logic:

From org.apache.cassandra.db.compaction.CompactionManager:
…
public void performScrub(ColumnFamilyStore cfStore, final boolean skipCorrupted,
                         final boolean checkData) throws InterruptedException, ExecutionException
{
    performAllSSTableOperation(cfStore, new AllSSTablesOperation()
    {
…
private void performAllSSTableOperation(final ColumnFamilyStore cfs,
                                        final AllSSTablesOperation operation) throws InterruptedException, ExecutionException
{
    final Iterable<SSTableReader> sstables = cfs.markAllCompacting();
…
org.apache.cassandra.db.ColumnFamilyStore:
…
public Iterable<SSTableReader> markAllCompacting()
{
    Callable<Iterable<SSTableReader>> callable = new Callable<Iterable<SSTableReader>>()
    {
        public Iterable<SSTableReader> call() throws Exception
        {
            assert data.getCompacting().isEmpty() : data.getCompacting();
            Iterable<SSTableReader> sstables = Lists.newArrayList(
                AbstractCompactionStrategy.filterSuspectSSTables(getSSTables()));
                // <- filters out all previously marked suspect SSTables
            if (Iterables.isEmpty(sstables))
                return null;
…
--
It seems scrub will not run if the SSTable was previously marked as corrupted; 
however, wouldn't this defeat the original purpose of the scrub operation?


> Scrub does not seem to work on previously marked corrupted SSTables
> ---
>
> Key: CASSANDRA-11624
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11624
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Michael Fong
>
> We ran into a scenario where scrub does not seem to work on a previously 
> marked-as-corrupted SSTable. 
> Here is the log snippet related to the corrupted SSTable and the scrub attempt:
> ERROR [ReadStage:174] 2016-03-17 04:14:39,658 CassandraDaemon.java (line 258) 
> Exception in thread Thread[ReadStage:174,5,main]
> java.lang.RuntimeException: 
> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: 
> mmap segment underflow; remaining is 10197 but 30062 requested for 
> /data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-ic-2-Data.db
> at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2022)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: 
> java.io.IOException: mmap segment underflow; remaining is 10197 but 30062 
> requested for 
> /data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-ic-2-Data.db
> at 
> org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:97)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:65)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
> at 
> 

[jira] [Comment Edited] (CASSANDRA-11624) Scrub does not seem to work on previously marked corrupted SSTables

2016-04-21 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251392#comment-15251392
 ] 

Michael Fong edited comment on CASSANDRA-11624 at 4/21/16 6:48 AM:
---

Looking at the source code of Cassandra-2.0.17, it seems the cause might come 
from the following logic:

From org.apache.cassandra.db.compaction.CompactionManager:
…
public void performScrub(ColumnFamilyStore cfStore, final boolean skipCorrupted,
                         final boolean checkData) throws InterruptedException, ExecutionException
{
    performAllSSTableOperation(cfStore, new AllSSTablesOperation()
    {
…
private void performAllSSTableOperation(final ColumnFamilyStore cfs,
                                        final AllSSTablesOperation operation) throws InterruptedException, ExecutionException
{
    final Iterable<SSTableReader> sstables = cfs.markAllCompacting();
…
org.apache.cassandra.db.ColumnFamilyStore:
…
public Iterable<SSTableReader> markAllCompacting()
{
    Callable<Iterable<SSTableReader>> callable = new Callable<Iterable<SSTableReader>>()
    {
        public Iterable<SSTableReader> call() throws Exception
        {
            assert data.getCompacting().isEmpty() : data.getCompacting();
            Iterable<SSTableReader> sstables = Lists.newArrayList(
                AbstractCompactionStrategy.filterSuspectSSTables(getSSTables()));
                // <- filters out all previously marked suspect SSTables
            if (Iterables.isEmpty(sstables))
                return null;
…
--
It seems scrub will not run if the SSTable was previously marked as corrupted; 
however, wouldn't this defeat the original purpose of the scrub operation?



was (Author: michael.fong):
Looking at the source code of Cassandra-2.0.17, it seems the cause might come 
from the following logic:

From org.apache.cassandra.db.compaction.CompactionManager:
…
public void performScrub(ColumnFamilyStore cfStore, final boolean skipCorrupted,
                         final boolean checkData) throws InterruptedException, ExecutionException
{
    performAllSSTableOperation(cfStore, new AllSSTablesOperation()
    {
…
private void performAllSSTableOperation(final ColumnFamilyStore cfs,
                                        final AllSSTablesOperation operation) throws InterruptedException, ExecutionException
{
    final Iterable<SSTableReader> sstables = cfs.markAllCompacting();
…
org.apache.cassandra.db.ColumnFamilyStore:
…
public Iterable<SSTableReader> markAllCompacting()
{
    Callable<Iterable<SSTableReader>> callable = new Callable<Iterable<SSTableReader>>()
    {
        public Iterable<SSTableReader> call() throws Exception
        {
            assert data.getCompacting().isEmpty() : data.getCompacting();
            Iterable<SSTableReader> sstables = Lists.newArrayList(
                AbstractCompactionStrategy.filterSuspectSSTables(getSSTables()));
                // <- filters out all previously marked suspect SSTables
            if (Iterables.isEmpty(sstables))
                return null;
…
--
It seems scrub will not run if the SSTable was previously marked as corrupted; 
however, wouldn't this defeat the original purpose of the scrub operation?


> Scrub does not seem to work on previously marked corrupted SSTables
> ---
>
> Key: CASSANDRA-11624
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11624
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Michael Fong
>
> We ran into a scenario where scrub does not seem to work on a previously 
> marked-as-corrupted SSTable. 
> Here is the log snippet related to the corrupted SSTable and the scrub attempt:
> ERROR [ReadStage:174] 2016-03-17 04:14:39,658 CassandraDaemon.java (line 258) 
> Exception in thread Thread[ReadStage:174,5,main]
> java.lang.RuntimeException: 
> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: 
> mmap segment underflow; remaining is 10197 but 30062 requested for 
> /data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-ic-2-Data.db
> at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2022)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: 
> java.io.IOException: mmap segment underflow; remaining is 10197 but 30062 
> requested for 
> /data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-ic-2-Data.db
> at 
> org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:97)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:65)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
> at 
> 

[jira] [Comment Edited] (CASSANDRA-11624) Scrub does not seem to work on previously marked corrupted SSTables

2016-04-21 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251392#comment-15251392
 ] 

Michael Fong edited comment on CASSANDRA-11624 at 4/21/16 6:48 AM:
---

Looking at the source code of Cassandra-2.0.17, it seems the cause might come 
from the following logic:

From org.apache.cassandra.db.compaction.CompactionManager:
…
public void performScrub(ColumnFamilyStore cfStore, final boolean skipCorrupted,
                         final boolean checkData) throws InterruptedException, ExecutionException
{
    performAllSSTableOperation(cfStore, new AllSSTablesOperation()
    {
…
private void performAllSSTableOperation(final ColumnFamilyStore cfs,
                                        final AllSSTablesOperation operation) throws InterruptedException, ExecutionException
{
    final Iterable<SSTableReader> sstables = cfs.markAllCompacting();
…
org.apache.cassandra.db.ColumnFamilyStore:
…
public Iterable<SSTableReader> markAllCompacting()
{
    Callable<Iterable<SSTableReader>> callable = new Callable<Iterable<SSTableReader>>()
    {
        public Iterable<SSTableReader> call() throws Exception
        {
            assert data.getCompacting().isEmpty() : data.getCompacting();
            Iterable<SSTableReader> sstables = Lists.newArrayList(
                AbstractCompactionStrategy.filterSuspectSSTables(getSSTables()));
                // <- filters out all previously marked suspect SSTables
            if (Iterables.isEmpty(sstables))
                return null;
…
--
It seems scrub will not run if the SSTable was previously marked as corrupted; 
however, wouldn't this defeat the original purpose of the scrub operation?



was (Author: michael.fong):
Looking at the source code of Cassandra-2.0.17, it seems the cause might come 
from the following logic:

From org.apache.cassandra.db.compaction.CompactionManager:
…
public void performScrub(ColumnFamilyStore cfStore, final boolean skipCorrupted,
                         final boolean checkData) throws InterruptedException, ExecutionException
{
    performAllSSTableOperation(cfStore, new AllSSTablesOperation()
    {
…
private void performAllSSTableOperation(final ColumnFamilyStore cfs,
                                        final AllSSTablesOperation operation) throws InterruptedException, ExecutionException
{
    final Iterable<SSTableReader> sstables = cfs.markAllCompacting();
…
org.apache.cassandra.db.ColumnFamilyStore:
…
public Iterable<SSTableReader> markAllCompacting()
{
    Callable<Iterable<SSTableReader>> callable = new Callable<Iterable<SSTableReader>>()
    {
        public Iterable<SSTableReader> call() throws Exception
        {
            assert data.getCompacting().isEmpty() : data.getCompacting();
            Iterable<SSTableReader> sstables = Lists.newArrayList(
                AbstractCompactionStrategy.filterSuspectSSTables(getSSTables()));
                // <- filters out all previously marked suspect SSTables
            if (Iterables.isEmpty(sstables))
                return null;
…


> Scrub does not seem to work on previously marked corrupted SSTables
> ---
>
> Key: CASSANDRA-11624
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11624
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Michael Fong
>
> We ran into a scenario where scrub does not seem to work on a previously 
> marked-as-corrupted SSTable. 
> Here is the log snippet related to the corrupted SSTable and the scrub attempt:
> ERROR [ReadStage:174] 2016-03-17 04:14:39,658 CassandraDaemon.java (line 258) 
> Exception in thread Thread[ReadStage:174,5,main]
> java.lang.RuntimeException: 
> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: 
> mmap segment underflow; remaining is 10197 but 30062 requested for 
> /data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-ic-2-Data.db
> at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2022)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: 
> java.io.IOException: mmap segment underflow; remaining is 10197 but 30062 
> requested for 
> /data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-ic-2-Data.db
> at 
> org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:97)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:65)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
> at 
> org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:238)
> at 
> 

[jira] [Commented] (CASSANDRA-11624) Scrub does not seem to work on previously marked corrupted SSTables

2016-04-21 Thread Michael Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251392#comment-15251392
 ] 

Michael Fong commented on CASSANDRA-11624:
--

Looking at the source code of Cassandra-2.0.17, it seems the cause might come 
from the following logic:

From org.apache.cassandra.db.compaction.CompactionManager:
…
public void performScrub(ColumnFamilyStore cfStore, final boolean skipCorrupted,
                         final boolean checkData) throws InterruptedException, ExecutionException
{
    performAllSSTableOperation(cfStore, new AllSSTablesOperation()
    {
…
private void performAllSSTableOperation(final ColumnFamilyStore cfs,
                                        final AllSSTablesOperation operation) throws InterruptedException, ExecutionException
{
    final Iterable<SSTableReader> sstables = cfs.markAllCompacting();
…
org.apache.cassandra.db.ColumnFamilyStore:
…
public Iterable<SSTableReader> markAllCompacting()
{
    Callable<Iterable<SSTableReader>> callable = new Callable<Iterable<SSTableReader>>()
    {
        public Iterable<SSTableReader> call() throws Exception
        {
            assert data.getCompacting().isEmpty() : data.getCompacting();
            Iterable<SSTableReader> sstables = Lists.newArrayList(
                AbstractCompactionStrategy.filterSuspectSSTables(getSSTables()));
                // <- filters out all previously marked suspect SSTables
            if (Iterables.isEmpty(sstables))
                return null;
…


> Scrub does not seem to work on previously marked corrupted SSTables
> ---
>
> Key: CASSANDRA-11624
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11624
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Michael Fong
>
> We ran into a scenario where scrub does not seem to work on a previously 
> marked-as-corrupted SSTable. 
> Here is the log snippet related to the corrupted SSTable and the scrub attempt:
> ERROR [ReadStage:174] 2016-03-17 04:14:39,658 CassandraDaemon.java (line 258) 
> Exception in thread Thread[ReadStage:174,5,main]
> java.lang.RuntimeException: 
> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: 
> mmap segment underflow; remaining is 10197 but 30062 requested for 
> /data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-ic-2-Data.db
> at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2022)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: 
> java.io.IOException: mmap segment underflow; remaining is 10197 but 30062 
> requested for 
> /data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-ic-2-Data.db
> at 
> org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:97)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:65)
> at 
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
> at 
> org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:238)
> at 
> org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:62)
> at 
> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:250)
> at 
> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:53)
> at 
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1642)
> at 
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1461)
> at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:340)
> at 
> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:89)
> at 
> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1445)
> at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2010)
> ... 3 more
>  INFO [CompactionExecutor:98] 2016-03-17 04:14:39,693 OutputHandler.java 
> (line 42) Scrubbing 
> SSTableReader(path='/data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-jb-11-Data.db')
>  (2230223 bytes)
>  INFO [CompactionExecutor:98] 2016-03-17 04:14:39,751 OutputHandler.java 
> (line 42) Scrub of 
> SSTableReader(path='/data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-jb-11-Data.db')
>  complete: 2 rows in new sstable and 0 empty (tombstoned) rows dropped
> --
> Below is the file information around that time
> --
> -bash-4.1$ ls -alF 

[jira] [Created] (CASSANDRA-11624) Scrub does not seem to work on previously marked corrupted SSTables

2016-04-21 Thread Michael Fong (JIRA)
Michael Fong created CASSANDRA-11624:


 Summary: Scrub does not seem to work on previously marked 
corrupted SSTables
 Key: CASSANDRA-11624
 URL: https://issues.apache.org/jira/browse/CASSANDRA-11624
 Project: Cassandra
  Issue Type: Bug
Reporter: Michael Fong


We ran into a scenario where scrub does not seem to work on a previously 
marked-as-corrupted SSTable. 

Here is the log snippet related to the corrupted SSTable and the scrub attempt:
ERROR [ReadStage:174] 2016-03-17 04:14:39,658 CassandraDaemon.java (line 258) 
Exception in thread Thread[ReadStage:174,5,main]
java.lang.RuntimeException: 
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: 
mmap segment underflow; remaining is 10197 but 30062 requested for 
/data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-ic-2-Data.db
at 
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2022)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: 
java.io.IOException: mmap segment underflow; remaining is 10197 but 30062 
requested for 
/data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-ic-2-Data.db
at 
org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:97)
at 
org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:65)
at 
org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:42)
at 
org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:238)
at 
org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:62)
at 
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:250)
at 
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:53)
at 
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1642)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1461)
at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:340)
at 
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:89)
at 
org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1445)
at 
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2010)
... 3 more
 INFO [CompactionExecutor:98] 2016-03-17 04:14:39,693 OutputHandler.java (line 
42) Scrubbing 
SSTableReader(path='/data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-jb-11-Data.db')
 (2230223 bytes)
 INFO [CompactionExecutor:98] 2016-03-17 04:14:39,751 OutputHandler.java (line 
42) Scrub of 
SSTableReader(path='/data/ng/db/data/wsg/dpStatusRealTime/wsg-dpStatusRealTime-jb-11-Data.db')
 complete: 2 rows in new sstable and 0 empty (tombstoned) rows dropped

--
Below is the file information around that time
--

-bash-4.1$ ls -alF /data/ng/db/data/wsg/dpStatusRealTime/
total 2328
drwxr-xr-x   2 root root    4096 Mar 17 04:14 ./
drwxr-xr-x 264 root root   12288 Mar 16 06:48 ../
-rw-r--r--   1 root root   72995 Mar 16 07:08 wsg-dpStatusRealTime-ic-2-Data.db
-rw-r--r--   1 root root      75 Mar 16 07:08 wsg-dpStatusRealTime-ic-2-Digest.sha1
-rw-r--r--   1 root root      16 Mar 16 07:08 wsg-dpStatusRealTime-ic-2-Filter.db
-rw-r--r--   1 root root     132 Mar 16 07:08 wsg-dpStatusRealTime-ic-2-Index.db
-rw-r--r--   1 root root    5956 Mar 16 07:08 wsg-dpStatusRealTime-ic-2-Statistics.db
-rw-r--r--   1 root root     244 Mar 16 07:20 wsg-dpStatusRealTime-ic-2-Summary.db
-rw-r--r--   1 root root      72 Mar 16 07:08 wsg-dpStatusRealTime-ic-2-TOC.txt
-rw-r--r--   1 root root     144 Mar 17 04:14 wsg-dpStatusRealTime-jb-12-CRC.db
-rw-r--r--   1 root root 2230223 Mar 17 04:14 wsg-dpStatusRealTime-jb-12-Data.db
-rw-r--r--   1 root root      76 Mar 17 04:14 wsg-dpStatusRealTime-jb-12-Digest.sha1
-rw-r--r--   1 root root     336 Mar 17 04:14 wsg-dpStatusRealTime-jb-12-Filter.db
-rw-r--r--   1 root root    1424 Mar 17 04:14 wsg-dpStatusRealTime-jb-12-Index.db
-rw-r--r--   1 root root    6004 Mar 17 04:14 wsg-dpStatusRealTime-jb-12-Statistics.db
-rw-r--r--   1 root root     244 Mar 17 04:14 wsg-dpStatusRealTime-jb-12-Summary.db
-rw-r--r--   1 root root      79 Mar 17 04:14 wsg-dpStatusRealTime-jb-12-TOC.txt
--
1. Please note that the corrupted file is in the (ic) format, which is from 1.2.19. 
This test bed was upgraded, and an upgradesstables run was attempted, a day ago. 
There has been some