[jira] [Assigned] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2019-12-05 Thread Matt Byrd (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd reassigned CASSANDRA-11748:
-

Assignee: (was: Nirmal Singh KPS)

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Core
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes across deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Priority: Urgent
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is a simple outline of our rolling upgrade process:
> 1. Update schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task over 100 times to the other node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, followed by 
> receiving and submitting migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have over 200 column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.
> P.S.2: Each of the over-requested schema migration tasks eventually has 
> InternalResponseStage perform a schema merge operation. Since this operation 
> requires a compaction for each merge, it is much slower to consume. Thus, the 
> back-pressure of incoming schema migration content objects consumes all of 
> the heap space and ultimately ends up in OOM!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2018-10-25 Thread Matt Byrd (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663994#comment-16663994
 ] 

Matt Byrd edited comment on CASSANDRA-11748 at 10/25/18 4:46 PM:
-

I think it would be great to try and fix these related issues in the 4.0 
timeframe. I'd be keen on trying the approach outlined above; I'll have a go at 
sketching it out in a PR to see what folks think.
To reiterate what I believe to be the fundamental problem:
the way we tee up a schema pull whenever a relevant gossip event shows a node 
with a different schema version results in far too many superfluous pulls for 
the same schema contents. When there are sufficient endpoints and a sufficiently 
large schema, doing so can lead to the instance OOMing.

The proposed solution solves this by decoupling the schema pulls from the 
incoming gossip messages, instead using gossip to update the node's view of 
which other nodes have which schema version and then having a thread 
periodically check and attempt to resolve any inconsistencies.
There are some details to flesh out, and I think an important part will be to 
ensure we have tests that demonstrate the issues and demonstrate we've fixed 
them. I'm hoping that we can perhaps leverage 
[CASSANDRA-14821|https://issues.apache.org/jira/browse/CASSANDRA-14821] to do 
so, though we may want to augment this with dtests or something else.
Let me know if you have any thoughts on the above approach; perhaps a sketch in 
code will help better illuminate it and help flush out potential problems. 
[~iamaleksey] / [~spo...@gmail.com] / [~michael.fong] / [~jjirsa] 
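
To make the idea concrete, a minimal sketch of the proposed decoupling might look 
something like the following (hypothetical names, not the actual Cassandra API): 
gossip handling only records which endpoint advertises which schema version, and a 
periodic task de-duplicates those observations and requests each distinct remote 
version at most once per round.
{code:java}
// Hypothetical sketch only: gossip records versions; a periodic task resolves them.
import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class SchemaVersionTracker
{
    /** Stand-in for whatever actually sends a migration (schema pull) request. */
    public interface SchemaPuller
    {
        void requestSchemaFrom(InetAddress endpoint);
    }

    private final Map<InetAddress, UUID> endpointVersions = new ConcurrentHashMap<>();
    private final ScheduledExecutorService resolver = Executors.newSingleThreadScheduledExecutor();

    /** Called from gossip handling instead of submitting a migration task directly. */
    public void onRemoteSchemaVersion(InetAddress endpoint, UUID version)
    {
        endpointVersions.put(endpoint, version);
    }

    public void start(Supplier<UUID> localVersion, SchemaPuller puller)
    {
        resolver.scheduleWithFixedDelay(() -> resolveOnce(localVersion.get(), puller), 60, 60, TimeUnit.SECONDS);
    }

    private void resolveOnce(UUID localVersion, SchemaPuller puller)
    {
        // Pick one endpoint per distinct remote version, so a hundred endpoints on the
        // same (different) version result in a single pull rather than a hundred.
        Map<UUID, InetAddress> oneEndpointPerVersion = new HashMap<>();
        endpointVersions.forEach((endpoint, version) -> {
            if (!version.equals(localVersion))
                oneEndpointPerVersion.putIfAbsent(version, endpoint);
        });
        oneEndpointPerVersion.forEach((version, endpoint) -> puller.requestSchemaFrom(endpoint));
    }
}
{code}
The important property is that many endpoints gossiping the same unfamiliar version 
result in one pull per round, not one pull per gossip event.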


was (Author: mbyrd):
I think it would be great to try and fix these related issues in the 4.0 
timeframe. I'd be keen on trying the approach outlined above; I'll have a go at 
sketching it out in a PR to see what folks think.
To reiterate what I believe to be the fundamental problem:
the way we tee up a schema pull whenever a relevant gossip event shows a node 
with a different schema version results in far too many superfluous pulls for 
the same schema contents. When there are sufficient endpoints and a sufficiently 
large schema, doing so can lead to the instance OOMing.

The proposed solution solves this by decoupling the schema pulls from the 
incoming gossip messages, instead using gossip to update the node's view of 
which other nodes have which schema version and then having a thread 
periodically check and attempt to resolve any inconsistencies.
There are some details to flesh out, and I think an important part will be to 
ensure we have tests that demonstrate the issues and demonstrate we've fixed 
them. I'm hoping that we can perhaps leverage 
[CASSANDRA-14821|https://issues.apache.org/jira/browse/CASSANDRA-14821] to do 
so, though we may want to augment this with dtests or something else.
Let me know if you have any thoughts on the above approach; perhaps a sketch in 
code will help better illuminate it and help flush out potential problems. 
[~iamaleksey][~spo...@gmail.com][~michael.fong][~jjirsa] 

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes across deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is a simple outline of our rolling upgrade process:
> 1. Update schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java 

[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2018-10-25 Thread Matt Byrd (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663994#comment-16663994
 ] 

Matt Byrd commented on CASSANDRA-11748:
---

I think it would be great to try and fix these related issues in the 4.0 
timeframe. I'd be keen on trying the approach outlined above; I'll have a go at 
sketching it out in a PR to see what folks think.
To reiterate what I believe to be the fundamental problem:
the way we tee up a schema pull whenever a relevant gossip event shows a node 
with a different schema version results in far too many superfluous pulls for 
the same schema contents. When there are sufficient endpoints and a sufficiently 
large schema, doing so can lead to the instance OOMing.

The proposed solution solves this by decoupling the schema pulls from the 
incoming gossip messages, instead using gossip to update the node's view of 
which other nodes have which schema version and then having a thread 
periodically check and attempt to resolve any inconsistencies.
There are some details to flesh out, and I think an important part will be to 
ensure we have tests that demonstrate the issues and demonstrate we've fixed 
them. I'm hoping that we can perhaps leverage 
[CASSANDRA-14821|https://issues.apache.org/jira/browse/CASSANDRA-14821] to do 
so, though we may want to augment this with dtests or something else.
Let me know if you have any thoughts on the above approach; perhaps a sketch in 
code will help better illuminate it and help flush out potential problems. 
[~iamaleksey][~spo...@gmail.com][~michael.fong][~jjirsa] 

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes across deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is a simple outline of our rolling upgrade process:
> 1. Update schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task over 100 times to the other node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, followed by 
> receiving and submitting migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have over 200 column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.
> P.S.2: Each of the over-requested schema migration tasks eventually has 
> InternalResponseStage perform a schema merge operation. Since this operation 
> requires a 

[jira] [Created] (CASSANDRA-14531) Only include data owned by the node in totals for repaired, un-repaired and pending repair.

2018-06-19 Thread Matt Byrd (JIRA)
Matt Byrd created CASSANDRA-14531:
-

 Summary: Only include data owned by the node in totals for 
repaired, un-repaired and pending repair.
 Key: CASSANDRA-14531
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14531
 Project: Cassandra
  Issue Type: Improvement
  Components: Metrics, Repair
Reporter: Matt Byrd
 Fix For: 4.x


If there is data which is left over from a topology change and is not yet 
cleaned up, it will be included in the total for BytesRepaired, BytesUnrepaired 
or BytesPendingRepair metrics.
 This can distort the total and lead to misleading metrics (albeit potentially 
short-lived).
 As an operator if you wanted to keep track of percent repaired, you might not 
have an accurate idea of the relevant percent repaired under such conditions.

I propose we only include sstables owned by the node in the totals for 
BytesRepaired, BytesUnrepaired, BytesPendingRepair and PercentRepaired. It 
feels more logical to only emit metrics like repaired/un-repaired for data 
which can actually be repaired.

When an SStable is partially owned by the node, we can compute the size which 
falls within the token-range by binary searching the index for the uncompressed 
offsets.
 We can finally also emit a metric which consists of all the data which is not 
owned by the node.
 This might also be helpful for operators to discover whether there is data 
which is not owned by the node and hence the need to run cleanup.   

One slight complication is that with a large number of sstables and a reasonable 
number of vnodes, computing these values now becomes a bit expensive. There is 
probably a way of keeping some of these metrics updated online rather than 
re-computing them periodically, though this might be a bit fiddly. Alternatively, 
using something like an interval tree or some other data structure might be enough 
to ensure it performs sufficiently and doesn't add undue overhead.
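
As a rough illustration of the binary-search idea (hypothetical layout, not the 
actual index code), given per-partition tokens in sorted order together with their 
uncompressed data offsets, the bytes of an sstable falling inside one owned range 
could be estimated as follows:
{code:java}
// Rough illustration only: estimate the uncompressed bytes of an sstable that fall
// inside one owned token range, given parallel arrays of partition tokens (sorted)
// and their uncompressed data offsets. Not the actual Cassandra index API.
public class OwnedBytesEstimator
{
    /**
     * @param sortedTokens  per-partition tokens, ascending
     * @param offsets       uncompressed data offset of each partition, same order
     * @param sstableLength total uncompressed length of the sstable
     * @param rangeStart    owned range start token (exclusive)
     * @param rangeEnd      owned range end token (inclusive)
     */
    public static long bytesInRange(long[] sortedTokens, long[] offsets, long sstableLength,
                                    long rangeStart, long rangeEnd)
    {
        long startOffset = offsetOfFirstTokenGreaterThan(sortedTokens, offsets, rangeStart, sstableLength);
        long endOffset = offsetOfFirstTokenGreaterThan(sortedTokens, offsets, rangeEnd, sstableLength);
        return endOffset - startOffset;
    }

    /** Binary search for the data offset of the first partition whose token is greater than the given token. */
    private static long offsetOfFirstTokenGreaterThan(long[] sortedTokens, long[] offsets,
                                                      long token, long sstableLength)
    {
        int lo = 0, hi = sortedTokens.length;
        while (lo < hi)
        {
            int mid = (lo + hi) >>> 1;
            if (sortedTokens[mid] <= token)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo == sortedTokens.length ? sstableLength : offsets[lo];
    }
}
{code}
Summing this over the node's owned ranges gives the owned portion of a partially 
owned sstable; the remainder would feed the "not owned" metric.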



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-14531) Only include data owned by the node in totals for repaired, un-repaired and pending repair.

2018-06-19 Thread Matt Byrd (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd reassigned CASSANDRA-14531:
-

Assignee: Matt Byrd

> Only include data owned by the node in totals for repaired, un-repaired and 
> pending repair.
> ---
>
> Key: CASSANDRA-14531
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14531
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Metrics, Repair
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> If there is data which is left over from a topology change and is not yet 
> cleaned up, it will be included in the total for BytesRepaired, 
> BytesUnrepaired or BytesPendingRepair metrics.
>  This can distort the total and lead to misleading metrics (albeit 
> potentially short-lived).
>  As an operator if you wanted to keep track of percent repaired, you might 
> not have an accurate idea of the relevant percent repaired under such 
> conditions.
> I propose we only include sstables owned by the node in the totals for 
> BytesRepaired, BytesUnrepaired, BytesPendingRepair and PercentRepaired. It 
> feels more logical to only emit metrics like repaired/un-repaired for data 
> which can actually be repaired.
> When an SStable is partially owned by the node, we can compute the size which 
> falls within the token-range by binary searching the index for the 
> uncompressed offsets.
>  We can finally also emit a metric which consists of all the data which is 
> not owned by the node.
>  This might also be helpful for operators to discover whether there is data 
> which is not owned by the node and hence the need to run cleanup.   
> One slight complication is that with a large number of sstables and a 
> reasonable number of vnodes, computing these values now becomes a bit 
> expensive. There is probably a way of keeping some of these metrics updated 
> online rather than re-computing them periodically, though this might be a bit 
> fiddly. Alternatively, using something like an interval tree or some other 
> data structure might be enough to ensure it performs sufficiently and doesn't 
> add undue overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13557) allow different NUMACTL_ARGS to be passed in

2017-08-16 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129276#comment-16129276
 ] 

Matt Byrd commented on CASSANDRA-13557:
---

for reference the commit is actually here I believe:
[af20226dcadc6f15e245b3c786233d783d77b914|https://github.com/apache/cassandra/commit/af20226dcadc6f15e245b3c786233d783d77b914]

> allow different NUMACTL_ARGS to be passed in
> 
>
> Key: CASSANDRA-13557
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13557
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 3.0.15, 3.11.1, 4.0
>
>
> Currently in bin/cassandra the following is hardcoded:
> NUMACTL_ARGS="--interleave=all"
> Ideally users of cassandra/bin could pass in a different set of NUMACTL_ARGS 
> if they wanted to say bind the process to a socket for cpu/memory reasons, 
> rather than having to comment out/modify this line in the deployed 
> cassandra/bin. e.g as described in:
> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
> This could be done by just having the default be set to "--interleave=all" 
> but pickup any value which has already been set for the variable NUMACTL_ARGS.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-08-16 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129137#comment-16129137
 ] 

Matt Byrd commented on CASSANDRA-11748:
---

Hi [~mcfongtw],
Hopefully I'm interpreting your comments correctly; I believe they're further 
analysis of the particular problem rather than suggestions for improvement?
Firstly I agree that having an unbounded number of concurrent migration tasks 
is the root of the problem (along with the other pre-condition of having a 
suitably large schema and somehow missing a schema update, either being down or 
being on another major version from where the change took place):
{quote}
1. Have migration checks and requests fired asynchronously and finally stack up 
all the messages at the receiver end, merging the schema one-by-one at
{code}
Schema.instance.mergeSchemaAndAnnounceVersion()
{code}
{quote}
That is, I'd address the root cause rather than try to de-dupe the schema 
mutations at the receiver end (which might help reduce how much is retained on 
the heap but ultimately doesn't get at the heart of the problem).

{quote}
2. Send the receiver the complete copy of schema, instead of delta copy of 
schema out of diff between two nodes.
{quote}
Sending the whole copy of the schema came into play here: 
https://issues.apache.org/jira/browse/CASSANDRA-1391.
I believe reverting this behaviour is probably out of scope of any 3.0 update, 
but perhaps for a future patch we can negotiate the delta rather than sending 
the whole schema. This would be a good improvement, but I don't think it's 
strictly necessary for solving this particular problem.

{quote}
3. Last but not least, the most mysterious problem that leads to OOM, and which 
we could not figure out back then, is that there are hundreds of migration 
tasks all fired nearly simultaneously, within 2 s. The number of RPCs does not 
match the number of nodes in the cluster, but is close to the number of seconds 
taken for the node to reboot.
{quote}
It's possible there is something else going on in addition here, although one 
thing that I've observed (as mentioned above) is that due to all the heap 
pressure from the large mutations sent concurrently, the node itself can pause 
for several seconds and hence both be marked as DOWN by the remote nodes and 
mark those remote nodes DOWN itself, followed by then marking them UP and doing 
another schema pull as a result. This spiral often results in many more 
migration tasks than are necessary, before either OOMing out or finally 
applying the required schema change.
If you still have your logs, you could check roughly how many UP messages 
for other endpoints occurred on a problematic instance and compare that to the 
number of migration tasks.

At any rate, I believe either rate limiting the migration tasks (globally or 
per schema version) or coming up with an alternative mechanism which 
serialises the schema pulls should address the problem.

I'll take a look at a proposition by [~iamaleksey] to pass the information 
about schema versions to a map and move the actual triggering of pull requests 
onto a frequently run periodic task, which reads this map and decides an 
appropriate course of action to resolve the schema difference. (This way we can 
collect all this information arriving asynchronously and for example 
de-duplicate repeated calls for the same schema version for different 
endpoints.) 
I think the main advantage of such an approach (as opposed to limiting the 
number of migration tasks by schema version) is that it removes the possibility 
of ending up with a stale schema due to the limiting; however, it's worth noting 
that doing the limit per schema version and expiring the limits already goes a 
long way to reducing this possibility. I'll try and dig up that version of the 
patch for reference/comparison.
[~iamaleksey], [~spod] Please let me know if there is anything in particular 
about the way you want this to behave or you feel I've misrepresented the idea 
in any way. One further thing that did occur to me was that trying to balance 
avoiding superfluous schema pulls against ensuring we converge as quickly as 
possible might necessitate some degree of parallelism. For example, if we pick 
a node to pull schema from and it's partitioned off, so we don't hear back for 
a while (or ever), we probably want to be proactively 
scheduling a pull from elsewhere to avoid waiting too long for a timeout. I'm 
sure there will be some other details to work out too but I think the general 
approach makes sense.
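
To sketch that last point (stand-in interface, not the real migration code): a 
serialised pull could bound how long it waits on any one endpoint before moving on 
to another endpoint advertising the same version, which gets most of the benefit 
without full parallelism.
{code:java}
// Hypothetical sketch: try endpoints advertising the target schema version one at a
// time, with a per-endpoint timeout so a partitioned endpoint doesn't stall convergence.
import java.net.InetAddress;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SchemaPullWithFallback
{
    /** Stand-in for the real migration request; completes when the endpoint has answered. */
    public interface MigrationRequester
    {
        CompletableFuture<Void> requestSchema(InetAddress endpoint);
    }

    public static boolean pullFromAny(List<InetAddress> endpointsWithVersion,
                                      MigrationRequester requester,
                                      long perEndpointTimeoutSeconds)
    {
        for (InetAddress endpoint : endpointsWithVersion)
        {
            try
            {
                // Bound how long we wait on each endpoint before moving on to the next one.
                requester.requestSchema(endpoint).get(perEndpointTimeoutSeconds, TimeUnit.SECONDS);
                return true;
            }
            catch (TimeoutException e)
            {
                // The endpoint may be partitioned off; try the next one advertising this version.
            }
            catch (ExecutionException e)
            {
                // The request failed outright; try the next endpoint.
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
{code}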

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> 

[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-08-10 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122194#comment-16122194
 ] 

Matt Byrd commented on CASSANDRA-11748:
---

{quote}
But we should at least take the schema Ids and/or endpoints into account as 
well. It just doesn't make sense to queue 50 requests for the same schema Id 
and potentially drop requests for a different schema afterwards.
{quote}
Yes, I did also have a patch with an expiring map of schema-version to counter 
and was limiting it per schema version, but decided to keep it simple, since 
the single limit sufficed for a particular scenario. Less relevant, but it also 
provides some protection in the rather strange case that there are actually 
lots of different schema versions in the cluster. I could resurrect the schema 
version patch, but it sounds like we're considering a slightly different 
approach.

{quote}
Schedule that pull with a delay instead, give the new node a chance to pull the 
new schema from one of the nodes in the cluster. It'll most likely converge by 
the time the delay has passed, so we'd just abort the request if schema 
versions now match.
{quote}
Once a node has been up for MIGRATION_DELAY_IN_MS and doesn't have an empty 
schema, it will always schedule the task to pull schema with a delay of 
MIGRATION_DELAY_IN_MS and then do a further check within the task itself to see 
if the schema versions still differ before asking for schema.

Though admittedly this problem does still exist: if two nodes start up at the 
same time, they may pull from each other.
I suppose we're going to schedule a pull from a newer node too; then, assuming 
we successively merge the schema together, we hopefully end up at the final 
desired state? Although in the interim it's possible a node might 
come into play with a slightly older schema, but I suppose that can just happen 
whenever a DOWN node comes up with out-of-date schema.

It's also possible that if the node is so overwhelmed by the reverse problem, 
it won't have made it to the correct schema version within MIGRATION_DELAY_IN_MS 
and hence will start sending its old schema back to all the other nodes in the 
cluster. Fortunately the sending happens on the migration stage, so it is single 
threaded and less likely to cause OOMs. 
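
For reference, a toy version of that expiring map of schema-version to counter 
might look like the following (illustrative only, not the actual patch):
{code:java}
// Illustrative only: allow at most MAX_INFLIGHT migration tasks per remote schema
// version, with the counter expiring after a window so a stale limit cannot prevent
// the node from ever pulling that schema again.
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class MigrationTaskLimiter
{
    private static final int MAX_INFLIGHT = 3;
    private static final long EXPIRY_MILLIS = 60_000;

    private static final class Window
    {
        final long startedAt = System.currentTimeMillis();
        int submitted;
    }

    private final Map<UUID, Window> windows = new HashMap<>();

    /** Returns true if a migration task for this remote schema version may be submitted now. */
    public synchronized boolean tryAcquire(UUID remoteSchemaVersion)
    {
        Window window = windows.get(remoteSchemaVersion);
        if (window == null || System.currentTimeMillis() - window.startedAt > EXPIRY_MILLIS)
        {
            window = new Window();
            windows.put(remoteSchemaVersion, window);
        }
        if (window.submitted >= MAX_INFLIGHT)
            return false;
        window.submitted++;
        return true;
    }
}
{code}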



> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes across deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is a simple outline of our rolling upgrade process:
> 1. Update schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task over 100 times to the other node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, followed by 
> receiving and submitting migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 

[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-08-07 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16117330#comment-16117330
 ] 

Matt Byrd commented on CASSANDRA-11748:
---

Hey [~iamaleksey],
I know as part of https://issues.apache.org/jira/browse/CASSANDRA-10699 and 
related JIRAS you are intent on reworking schema quite a bit. I'm trying to 
determine whether the migration limit patches linked above will still be 
necessary in addition to the changes you're making.
It sounded like the serialised schema itself might become a bit cheaper 
(reducing the heap cost of sending the big serialised mutation); however, the 
fundamental PULL model of getting schema changes on startup wouldn't change.
I.e. if you are a node in a large cluster and have been down whilst a schema 
change occurs, when you start up you will still ask for schema from all the 
other nodes as they appear in your view, and only stop asking when you've 
successfully applied the schema.
I suppose if this is the case, then we probably still need the above-linked 
patch for trunk - what do you think?
Thanks,
Matt

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes across deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is a simple outline of our rolling upgrade process:
> 1. Update schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task over 100 times to the other node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, followed by 
> receiving and submitting migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have over 200 column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.
> P.S.2: Each of the over-requested schema migration tasks eventually has 
> InternalResponseStage perform a schema merge operation. Since this operation 
> requires a compaction for each merge, it is much slower to consume. Thus, the 
> back-pressure of incoming schema migration content objects consumes all of 
> the heap space and ultimately ends up in OOM!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-8076) Expose an mbean method to poll for repair job status

2017-08-02 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111857#comment-16111857
 ] 

Matt Byrd commented on CASSANDRA-8076:
--

Looks like this issue was trying to address a similar problem described in:
https://issues.apache.org/jira/browse/CASSANDRA-13480

In the patch there I added a method which allows one to get the parent repair 
status of a given repair:

https://github.com/apache/cassandra/commit/20d5ce8b9b587be2f0b7bc5765254e8dc6e0bd3b

This is sort of similar to the method mentioned here.
Additionally, nodetool now also checks for this status when notifications are 
lost, and periodically, so we don't hang indefinitely.
Those using JMX directly can do something analogous if they desire.

[~yukim] Would you mind taking a quick look at CASSANDRA-13480 and, if 
appropriate, closing this as a duplicate of that?

> Expose an mbean method to poll for repair job status
> 
>
> Key: CASSANDRA-8076
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8076
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Philip S Doctor
>Assignee: Yuki Morishita
> Attachments: 8076-2.0.txt
>
>
> Given the int reply-id from forceRepairAsync, allow a client to request the 
> status of this ID via jmx.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-06-29 Thread Matt Byrd (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd updated CASSANDRA-13480:
--
Status: Ready to Commit  (was: Patch Available)

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
>  Labels: repair
> Fix For: 4.x
>
>
> When a JMX lost notification occurs, sometimes the lost notification in 
> question is the notification which lets RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this: if, on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> JMX to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.
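
As an illustration of the behaviour described above (RepairStatusPoller and its 
method are stand-ins, not the real Cassandra API): on 
JMXConnectionNotification.NOTIFS_LOST, or after waiting too long for the completion 
notification, the client falls back to polling the status endpoint.
{code:java}
// Hypothetical client-side sketch: fall back to polling when notifications are lost.
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class RepairWatcher implements NotificationListener
{
    /** Stand-in for the queryable repair-status endpoint; not the real Cassandra API. */
    public interface RepairStatusPoller
    {
        boolean isRepairFinished(int parentRepairId);
    }

    private final int parentRepairId;
    private final RepairStatusPoller poller;
    private final CountDownLatch done = new CountDownLatch(1);

    public RepairWatcher(int parentRepairId, RepairStatusPoller poller)
    {
        this.parentRepairId = parentRepairId;
        this.poller = poller;
    }

    @Override
    public void handleNotification(Notification notification, Object handback)
    {
        // Degraded path: the connector dropped notifications, so re-check by polling.
        // (Normal COMPLETE/ERROR progress handling for the repair tag is omitted here.)
        if (JMXConnectionNotification.NOTIFS_LOST.equals(notification.getType())
            && poller.isRepairFinished(parentRepairId))
            done.countDown();
    }

    /** Wait for completion, polling periodically in case NOTIFS_LOST itself was lost. */
    public void await() throws InterruptedException
    {
        while (!done.await(30, TimeUnit.SECONDS))
        {
            if (poller.isRepairFinished(parentRepairId))
                return;
        }
    }
}
{code}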



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-06-27 Thread Matt Byrd (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd updated CASSANDRA-13480:
--
Reviewer: Chris Lohfink  (was: Blake Eggleston)

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
>  Labels: repair
> Fix For: 4.x
>
>
> When a JMX lost notification occurs, sometimes the lost notification in 
> question is the notification which lets RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this: if, on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> JMX to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13557) allow different NUMACTL_ARGS to be passed in

2017-06-20 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056599#comment-16056599
 ] 

Matt Byrd commented on CASSANDRA-13557:
---

|3.0|3.11|Trunk|
|[branch|https://github.com/Jollyplum/cassandra/tree/13557]|[branch|https://github.com/Jollyplum/cassandra/tree/13557-3.11]|[branch|https://github.com/Jollyplum/cassandra/tree/13557]|
|[testall|https://circleci.com/gh/Jollyplum/cassandra/19#tests/containers/3]|[testall|https://circleci.com/gh/Jollyplum/cassandra/20]|[testall|https://circleci.com/gh/Jollyplum/cassandra/6]|

> allow different NUMACTL_ARGS to be passed in
> 
>
> Key: CASSANDRA-13557
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13557
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> Currently in bin/cassandra the following is hardcoded:
> NUMACTL_ARGS="--interleave=all"
> Ideally users of cassandra/bin could pass in a different set of NUMACTL_ARGS 
> if they wanted to say bind the process to a socket for cpu/memory reasons, 
> rather than having to comment out/modify this line in the deployed 
> cassandra/bin. e.g as described in:
> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
> This could be done by just having the default be set to "--interleave=all" 
> but pickup any value which has already been set for the variable NUMACTL_ARGS.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-06-19 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054926#comment-16054926
 ] 

Matt Byrd commented on CASSANDRA-11748:
---

|3.0|3.11|Trunk|
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|[branch|https://github.com/Jollyplum/cassandra/tree/11748-3.11]|[branch|https://github.com/Jollyplum/cassandra/tree/11748]|
|[dtest|]|[dtest|]|[dtest|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/93/testReport/]|
|[testall|https://circleci.com/gh/Jollyplum/cassandra/15]|[testall|https://circleci.com/gh/Jollyplum/cassandra/16]|[testall|https://circleci.com/gh/Jollyplum/cassandra/17]|

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes across deployments of different scales (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is a simple outline of our rolling upgrade process:
> 1. Update schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds:
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task over 100 times to the other node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, followed by 
> receiving and submitting migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> As a side note, we have over 200 column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.
> P.S.2: Each of the over-requested schema migration tasks eventually has 
> InternalResponseStage perform a schema merge operation. Since this operation 
> requires a compaction for each merge, it is much slower to consume. Thus, the 
> back-pressure of incoming schema migration content objects consumes all of 
> the heap space and ultimately ends up in OOM!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-06-19 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054909#comment-16054909
 ] 

Matt Byrd edited comment on CASSANDRA-13480 at 6/19/17 11:11 PM:
-

||Trunk|||
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|
|[dtest|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/98/]|
|[testall|https://circleci.com/gh/Jollyplum/cassandra/14]|



was (Author: mbyrd):
||Trunk|||
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|
|[dtest|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/98/]||[testall|https://circleci.com/gh/Jollyplum/cassandra/14]|


> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> When a JMX lost notification occurs, sometimes the lost notification in 
> question is the notification which lets RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this: if, on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> JMX to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-06-19 Thread Matt Byrd (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd updated CASSANDRA-13480:
--
 Reviewer: Blake Eggleston
Reproduced In: 3.0.13, 2.1.16, 4.x  (was: 2.1.16, 3.0.13, 4.x)
   Status: Patch Available  (was: Open)

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> When a JMX lost notification occurs, sometimes the lost notification in 
> question is the notification which lets RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this: if, on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> JMX to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-06-19 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054909#comment-16054909
 ] 

Matt Byrd edited comment on CASSANDRA-13480 at 6/19/17 11:03 PM:
-

||Trunk|||
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|
|[dtest|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/98/]||[testall|https://circleci.com/gh/Jollyplum/cassandra/14]|



was (Author: mbyrd):
||Trunk|||
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|
|[testall|https://circleci.com/gh/Jollyplum/cassandra/14]|
|[dtests|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/98/]|

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> When a JMX lost notification occurs, sometimes the lost notification in 
> question is the notification which lets RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this: if, on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> JMX to receive all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost and so for good measure I have made RepairRunner poll 
> periodically to see if there were any notifications that had been sent but we 
> didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach, I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios 
> but in this test we don't even send that many notifications so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> jmx as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-06-19 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054909#comment-16054909
 ] 

Matt Byrd edited comment on CASSANDRA-13480 at 6/19/17 11:02 PM:
-

||Trunk|||
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|
|[testall|https://circleci.com/gh/Jollyplum/cassandra/14]|
|[dtests|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/98/]|


was (Author: mbyrd):

||Trunk|||
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|[testall|https://circleci.com/gh/Jollyplum/cassandra/14]|[dtests|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/98/]|

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> When a JMX lost notification occurs, sometimes the lost notification in 
> question is the one which lets RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE, or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this: if, on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> JMX to fetch all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's also possible that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost, so for good measure I have made RepairRunner poll 
> periodically to see whether there were any notifications that had been sent 
> but we didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach. I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios, 
> but in this test we don't even send that many notifications, so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> JMX as far as I can tell.






[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-06-19 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054909#comment-16054909
 ] 

Matt Byrd commented on CASSANDRA-13480:
---


||Trunk|||
|[branch|https://github.com/Jollyplum/cassandra/tree/13480]|[testall|https://circleci.com/gh/Jollyplum/cassandra/14]|[dtests|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/98/]|

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> When a JMX lost notification occurs, sometimes the lost notification in 
> question is the one which lets RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE, or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this: if, on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> JMX to fetch all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's also possible that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost, so for good measure I have made RepairRunner poll 
> periodically to see whether there were any notifications that had been sent 
> but we didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach. I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios, 
> but in this test we don't even send that many notifications, so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> JMX as far as I can tell.






[jira] [Commented] (CASSANDRA-13570) allow sub-range repairs (specifying -et -st) for a preview of repaired data

2017-06-12 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046826#comment-16046826
 ] 

Matt Byrd commented on CASSANDRA-13570:
---

Done, thanks

> allow sub-range repairs (specifying -et -st) for a preview of repaired data
> ---
>
> Key: CASSANDRA-13570
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13570
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.0
>
>
> I don't see any inherent reason why preview repairs of repaired 
> data shouldn't allow specifying start and end tokens. 
> The restriction seems to come from the fact that incremental=true in 
> RepairOption; that is the case, but it's not truly an incremental repair 
> since we're only previewing.
> {code:java}
> if (option.isIncremental() && !option.isGlobal())
> {
>     throw new IllegalArgumentException("Incremental repairs cannot be run against a subset of tokens or ranges");
> }
> {code}
> It would be helpful to allow this, so that operators could sequence a sweep 
> over the entirety of the token-space in a more gradual fashion.
> Also it might help in examining which portions of the token-space differ.
> Can anyone see any reasons for not allowing this?
> I.e. just changing the above to something like:
> {code:java}
> if (option.isIncremental() && !option.getPreviewKind().isPreview() && !option.isGlobal())
> {
>     throw new IllegalArgumentException("Incremental repairs cannot be run against a subset of tokens or ranges");
> }
> {code}






[jira] [Commented] (CASSANDRA-13569) Schedule schema pulls just once per endpoint

2017-06-09 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044683#comment-16044683
 ] 

Matt Byrd commented on CASSANDRA-13569:
---

[~spod] Yes, avoiding multiple schema migrations in flight per endpoint seems 
like a strict improvement.
Maybe the issue in CASSANDRA-11748 can be addressed there separately, or perhaps, 
if CASSANDRA-10699 reworks the mechanism, CASSANDRA-11748 will no longer be a 
problem.

> Schedule schema pulls just once per endpoint
> 
>
> Key: CASSANDRA-13569
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13569
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Schema mismatches detected through gossip will get resolved by calling 
> {{MigrationManager.maybeScheduleSchemaPull}}. This method may decide to 
> schedule execution of {{MigrationTask}}, but only after using a 
> {{MIGRATION_DELAY_IN_MS = 60000}} delay (for reasons unclear to me). 
> Meanwhile, as long as the migration task hasn't been executed, we'll continue 
> to have schema mismatches reported by gossip and will have corresponding 
> {{maybeScheduleSchemaPull}} calls, which will schedule further tasks with the 
> mentioned delay. Some local testing shows that dozens of tasks for the same 
> endpoint will eventually be executed, causing the same stormy behavior 
> for these very endpoints.
> My proposal would be to simply not schedule new tasks for the same endpoint, 
> in case we still have pending tasks waiting for execution after 
> {{MIGRATION_DELAY_IN_MS}}.
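
As a rough illustration of the proposal quoted above, de-duplicating scheduled pulls per endpoint might look something like the following. This is a sketch with made-up names and an assumed delay value, not Cassandra's actual MigrationManager code.

{code:java}
import java.net.InetAddress;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: schedule at most one pending schema pull per endpoint.
class SchemaPullScheduler
{
    private static final long MIGRATION_DELAY_IN_MS = 60000; // assumed delay

    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    private final Set<InetAddress> pendingPulls = ConcurrentHashMap.newKeySet();

    void maybeScheduleSchemaPull(InetAddress endpoint, Runnable pullTask)
    {
        // If a pull for this endpoint is already queued, don't schedule another.
        if (!pendingPulls.add(endpoint))
            return;

        executor.schedule(() ->
        {
            try
            {
                pullTask.run();
            }
            finally
            {
                pendingPulls.remove(endpoint);
            }
        }, MIGRATION_DELAY_IN_MS, TimeUnit.MILLISECONDS);
    }
}
{code}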






[jira] [Commented] (CASSANDRA-13569) Schedule schema pulls just once per endpoint

2017-06-02 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035393#comment-16035393
 ] 

Matt Byrd commented on CASSANDRA-13569:
---

Sure, no problem [~spo...@gmail.com].
Yes, adding jitter to MIGRATION_DELAY_IN_MS could help once we're past: 
{code:java}
| runtimeMXBean.getUptime() < MIGRATION_DELAY_IN_MS)
{code}
However, it doesn't help on startup.
Initially, in trying to solve CASSANDRA-11748, I did also think about adding a 
random delay for even this branch (where we've only been up a short amount 
of time).
This just didn't seem that straightforward to do while also guaranteeing that we 
wouldn't hit the problem described in CASSANDRA-11748.
How do you know what is enough random delay? What if you actually delay getting 
the schema legitimately?
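
To make the jitter idea concrete, it would amount to something like the following. This is an illustrative sketch only; the base delay value is assumed and the class is not part of the actual code.

{code:java}
import java.util.concurrent.ThreadLocalRandom;

// Sketch: add random jitter to the base schema-pull delay so that nodes which
// detect the same mismatch don't all pull at the same moment.
final class MigrationDelay
{
    private static final long MIGRATION_DELAY_IN_MS = 60000; // assumed base delay

    static long jitteredDelayMs()
    {
        long jitter = ThreadLocalRandom.current().nextLong(MIGRATION_DELAY_IN_MS / 2);
        return MIGRATION_DELAY_IN_MS + jitter;
    }
}
{code}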

I suppose the concerns in this ticket are similar but not exactly the same as 
CASSANDRA-11748, though I admit that rate-limiting the number of schema pulls 
per endpoint to one at a time seems sensible and might help a bit with 
CASSANDRA-11748.
The schema is being pulled repeatedly from the same instances in 
CASSANDRA-11748, but I'm not sure rate limiting alone as described above will 
definitely solve it. Perhaps it will make it less likely to OOM, but we're 
still going to have a lot of incoming serialised schemas from lots of nodes, and 
we're still left with this rough limit to scalability of "number of 
nodes * size of serialised schema" (albeit perhaps with a different threshold).

Some upcoming changes in CASSANDRA-10699 and related tickets may also make the 
problem in CASSANDRA-11748 less likely, since part of the problem is that 
we're sending the entire serialised schema inside a mutation, which can end up 
being quite large if you have lots of tables or lots of columns in lots of 
tables.

Also, for reference, I believe the migration delay was added in the following 
ticket, in order to give a schema alteration sufficient time to propagate from 
the node where it changed, 
and not have a migration task race with this change and pull the whole schema 
instead of receiving the delta:
https://issues.apache.org/jira/browse/CASSANDRA-5025

> Schedule schema pulls just once per endpoint
> 
>
> Key: CASSANDRA-13569
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13569
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Distributed Metadata
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Schema mismatches detected through gossip will get resolved by calling 
> {{MigrationManager.maybeScheduleSchemaPull}}. This method may decide to 
> schedule execution of {{MigrationTask}}, but only after using a 
> {{MIGRATION_DELAY_IN_MS = 60000}} delay (for reasons unclear to me). 
> Meanwhile, as long as the migration task hasn't been executed, we'll continue 
> to have schema mismatches reported by gossip and will have corresponding 
> {{maybeScheduleSchemaPull}} calls, which will schedule further tasks with the 
> mentioned delay. Some local testing shows that dozens of tasks for the same 
> endpoint will eventually be executed, causing the same stormy behavior 
> for these very endpoints.
> My proposal would be to simply not schedule new tasks for the same endpoint, 
> in case we still have pending tasks waiting for execution after 
> {{MIGRATION_DELAY_IN_MS}}.






[jira] [Comment Edited] (CASSANDRA-10699) Make schema alterations strongly consistent

2017-06-02 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035130#comment-16035130
 ] 

Matt Byrd edited comment on CASSANDRA-10699 at 6/2/17 5:56 PM:
---

In particular I'm interested in how to avoid concurrent schema changes causing 
problems.
[~iamaleksey] Is the plan to use Paxos to linearise the schema changes?
btw the original assignee change was not intentional.


was (Author: mbyrd):
In particular I'm interested in how to avoid concurrent schema changes causing 
problems.
[~iamaleksey] Is the plan to use Paxos to linearise the schema changes?

> Make schema alterations strongly consistent
> ---
>
> Key: CASSANDRA-10699
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10699
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
> Fix For: 4.0
>
>
> Schema changes do not necessarily commute. This has been the case before 
> CASSANDRA-5202, but now is particularly problematic.
> We should employ a strongly consistent protocol instead of relying on 
> marshalling {{Mutation}} objects with schema changes.






[jira] [Assigned] (CASSANDRA-10699) Make schema alterations strongly consistent

2017-06-02 Thread Matt Byrd (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd reassigned CASSANDRA-10699:
-

Assignee: Aleksey Yeschenko  (was: Matt Byrd)

> Make schema alterations strongly consistent
> ---
>
> Key: CASSANDRA-10699
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10699
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
> Fix For: 4.0
>
>
> Schema changes do not necessarily commute. This has been the case before 
> CASSANDRA-5202, but now is particularly problematic.
> We should employ a strongly consistent protocol instead of relying on 
> marshalling {{Mutation}} objects with schema changes.






[jira] [Commented] (CASSANDRA-10699) Make schema alterations strongly consistent

2017-06-02 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035130#comment-16035130
 ] 

Matt Byrd commented on CASSANDRA-10699:
---

In particular I'm interested in how to avoid concurrent schema changes causing 
problems.
[~iamaleksey] Is the plan to use Paxos to linearise the schema changes?

> Make schema alterations strongly consistent
> ---
>
> Key: CASSANDRA-10699
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10699
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Aleksey Yeschenko
>Assignee: Matt Byrd
> Fix For: 4.0
>
>
> Schema changes do not necessarily commute. This has been the case before 
> CASSANDRA-5202, but now is particularly problematic.
> We should employ a strongly consistent protocol instead of relying on 
> marshalling {{Mutation}} objects with schema changes.






[jira] [Assigned] (CASSANDRA-10699) Make schema alterations strongly consistent

2017-06-02 Thread Matt Byrd (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd reassigned CASSANDRA-10699:
-

Assignee: Matt Byrd

> Make schema alterations strongly consistent
> ---
>
> Key: CASSANDRA-10699
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10699
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Aleksey Yeschenko
>Assignee: Matt Byrd
> Fix For: 4.0
>
>
> Schema changes do not necessarily commute. This has been the case before 
> CASSANDRA-5202, but now is particularly problematic.
> We should employ a strongly consistent protocol instead of relying on 
> marshalling {{Mutation}} objects with schema changes.






[jira] [Created] (CASSANDRA-13570) allow sub-range repairs (specifying -et -st) for a preview of repaired data

2017-06-02 Thread Matt Byrd (JIRA)
Matt Byrd created CASSANDRA-13570:
-

 Summary: allow sub-range repairs (specifying -et -st) for a 
preview of repaired data
 Key: CASSANDRA-13570
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13570
 Project: Cassandra
  Issue Type: Improvement
Reporter: Matt Byrd
Assignee: Matt Byrd
Priority: Minor
 Fix For: 4.x


I don't see any inherent reason why preview repairs of repaired data shouldn't 
allow specifying start and end tokens. 
The restriction seems to come from the fact that incremental=true in 
RepairOption; that is the case, but it's not truly an incremental repair since 
we're only previewing.
{code:java}
if (option.isIncremental() && !option.isGlobal())
{
    throw new IllegalArgumentException("Incremental repairs cannot be run against a subset of tokens or ranges");
}
{code}
It would be helpful to allow this, so that operators could sequence a sweep 
over the entirety of the token-space in a more gradual fashion.
Also it might help in examining which portions of the token-space differ.
Can anyone see any reasons for not allowing this?
I.e. just changing the above to something like:
{code:java}
if (option.isIncremental() && !option.getPreviewKind().isPreview() && !option.isGlobal())
{
    throw new IllegalArgumentException("Incremental repairs cannot be run against a subset of tokens or ranges");
}
{code}






[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may lead to Cassandra OOM at bootstrap during a rolling upgrade process

2017-06-02 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034971#comment-16034971
 ] 

Matt Byrd commented on CASSANDRA-11748:
---

So I believe the crux of this problem is:
On startup, if our schema differs from that of one of the nodes on the same 
messaging-service version as ourselves, we'll pull the schema from said node 
upon marking it as UP. With a large cluster and a large schema, we're therefore 
pulling many copies of the serialised schema onto the heap, causing pressure on 
the heap and eventually OOMs.
To make matters worse, when we hit the resulting GC pauses, the other nodes seem 
to get marked DOWN and then UP again, which triggers the schema pulls once more.
As a result the instance OOMs on startup; with a large enough schema and 
cluster this is probably deterministic.

This can happen when a node has been down for a while and has missed a schema 
change, or if the given upgrade path results in a schema version change which 
somehow is not reflected quickly enough locally, so that maybeScheduleSchemaPull 
runs and decides to pull the schema remotely.

When you start up and see hundreds of nodes all with the same schema version 
that you need, it probably doesn't make much sense to pull it from every single 
one of them. If instead we just limit the number of schema migration tasks in 
flight, we can limit or stop this behaviour from occurring.

I've got a patch which does just this and fixes a dtest reproduction I've 
written.
I had some other variants that limited the number of in-flight tasks per schema 
version, for example, but it seemed that a straightforward limit was sufficient.

Admittedly I'm not certain that the upgrade problem still exists, but starting 
a node without the latest schema should still cause this problem.

There is an expiry on the limit, to avoid getting stuck in a state where the 
in-flight counter isn't decremented properly (which during testing I found 
can occur whenever a message fails to even be sent 
properly, so that neither the failure nor the success callback is ever called).
I'll attach some links shortly.
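
To make the approach concrete, a limit with an expiry might look roughly like the following. This is an illustrative sketch with assumed values and names, not the actual patch.

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: cap the number of schema-pull (migration) tasks in flight, and let
// the permits expire so a lost callback cannot wedge the counter forever.
class InflightMigrationLimiter
{
    private static final int MAX_INFLIGHT = 4;                              // assumed cap
    private static final long EXPIRY_NANOS = TimeUnit.MINUTES.toNanos(5);   // assumed expiry

    private final AtomicInteger inflight = new AtomicInteger();
    private volatile long oldestStartNanos = System.nanoTime();

    boolean tryAcquire()
    {
        // If permits have been held too long (e.g. a callback was never invoked),
        // reset the counter rather than blocking schema pulls indefinitely.
        if (inflight.get() > 0 && System.nanoTime() - oldestStartNanos > EXPIRY_NANOS)
            inflight.set(0);

        while (true)
        {
            int current = inflight.get();
            if (current >= MAX_INFLIGHT)
                return false;
            if (inflight.compareAndSet(current, current + 1))
            {
                if (current == 0)
                    oldestStartNanos = System.nanoTime();
                return true;
            }
        }
    }

    void release()
    {
        inflight.updateAndGet(n -> Math.max(0, n - 1));
    }
}
{code}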

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred in different C* node of different scale of deployment (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times when a multi-node C* (v2.0.17) cluster ran 
> into OOM in bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple guideline of our rolling upgrade process
> 1. Update the schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and any 
> node could run into OOM. 
> The following is the system.log output from one of our 2-node cluster test 
> beds
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task over 100+ times to the other 
> node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, followed by 
> receiving and submitting migrationTask afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 

[jira] [Created] (CASSANDRA-13557) allow different NUMACTL_ARGS to be passed in

2017-05-26 Thread Matt Byrd (JIRA)
Matt Byrd created CASSANDRA-13557:
-

 Summary: allow different NUMACTL_ARGS to be passed in
 Key: CASSANDRA-13557
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13557
 Project: Cassandra
  Issue Type: Improvement
  Components: Configuration
Reporter: Matt Byrd
Assignee: Matt Byrd
Priority: Minor
 Fix For: 4.x


Currently in bin/cassandra the following is hardcoded:
NUMACTL_ARGS="--interleave=all"
Ideally, users of cassandra/bin could pass in a different set of NUMACTL_ARGS if 
they wanted to, say, bind the process to a socket for CPU/memory reasons, rather 
than having to comment out or modify this line in the deployed cassandra/bin, e.g. 
as described in:
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

This could be done by just having the default be set to "--interleave=all" but 
picking up any value which has already been set for the variable NUMACTL_ARGS.






[jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-04-28 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989556#comment-15989556
 ] 

Matt Byrd commented on CASSANDRA-13480:
---

So the patch I currently have also caches the notifications for repairs for a 
limited time on the co-ordinator; it was initially targeting a release where we 
didn't yet have the repair history tables.
I suppose there is a concern that caching these notifications could, under some 
circumstances, cause unwanted extra heap usage 
(similarly to the notifications buffer, although at least here we're only 
caching a subset that we care more about).
So using the repair history tables instead and exposing this information via JMX 
seems like a reasonable alternative.
There are perhaps a couple of kinks to work out, but I'll have a go at adapting 
the patch that I have to work in this way.
For one, we only have the cmd id int sent back to the nodetool process (rather 
than the parent session id which the internal table is partition-keyed off).
We could either keep track of the cmd id int -> parent session uuid mapping on 
the co-ordinator, either in memory cached to expire or in another internal table,
or we could parse the uuid out of the notification sent for the start of the 
parent repair.
Parsing the message is a bit brittle though, and not foolproof in theory (we 
could miss that notification also).
Ideally, I suppose, running a repair could return and communicate on the basis of 
the parent session uuid rather than the int cmd id, but this is a pretty major 
overhaul and has all sorts of compatibility questions.
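
As a rough illustration of the first option (an in-memory, expiring cmd id to parent-session mapping on the co-ordinator), something like the following could work. The class name, bound and expiry are assumptions, not the actual patch.

{code:java}
import java.util.UUID;
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Sketch: remember which parent repair session a given nodetool command id maps
// to, for a bounded time, so missed notifications can be looked up later.
class RepairCommandRegistry
{
    private final Cache<Integer, UUID> cmdToParentSession = CacheBuilder.newBuilder()
            .maximumSize(1000)                   // assumed bound
            .expireAfterWrite(1, TimeUnit.HOURS) // assumed expiry
            .build();

    void register(int cmd, UUID parentSessionId)
    {
        cmdToParentSession.put(cmd, parentSessionId);
    }

    UUID parentSessionFor(int cmd)
    {
        return cmdToParentSession.getIfPresent(cmd); // null if expired or unknown
    }
}
{code}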

> nodetool repair can hang forever if we lose the notification for the repair 
> completing/failing
> --
>
> Key: CASSANDRA-13480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 4.x
>
>
> When a JMX lost notification occurs, sometimes the lost notification in 
> question is the one which lets RepairRunner know that the repair is 
> finished (ProgressEventType.COMPLETE, or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever. 
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this: if, on receiving a notification that notifications have been lost 
> (JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
> JMX to fetch all the relevant notifications we're interested in, we can 
> replay those we missed and avoid this scenario.
> It's also possible that the JMXConnectionNotification.NOTIFS_LOST itself 
> might be lost, so for good measure I have made RepairRunner poll 
> periodically to see whether there were any notifications that had been sent 
> but we didn't receive (scoped just to the particular tag for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new 
> endpoint and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different 
> approach. I also tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios, 
> but in this test we don't even send that many notifications, so I'm not 
> surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with 
> JMX as far as I can tell.






[jira] [Created] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing

2017-04-27 Thread Matt Byrd (JIRA)
Matt Byrd created CASSANDRA-13480:
-

 Summary: nodetool repair can hang forever if we lose the 
notification for the repair completing/failing
 Key: CASSANDRA-13480
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Reporter: Matt Byrd
Assignee: Matt Byrd
Priority: Minor
 Fix For: 4.x


When a JMX lost notification occurs, sometimes the lost notification in 
question is the one which lets RepairRunner know that the repair is 
finished (ProgressEventType.COMPLETE, or even ERROR for that matter).
This results in the nodetool process running the repair hanging forever. 

I have a test which reproduces the issue here:
https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test

To fix this: if, on receiving a notification that notifications have been lost 
(JMXConnectionNotification.NOTIFS_LOST), we instead query a new endpoint via 
JMX to fetch all the relevant notifications we're interested in, we can 
replay those we missed and avoid this scenario.

It's also possible that the JMXConnectionNotification.NOTIFS_LOST itself might 
be lost, so for good measure I have made RepairRunner poll periodically to 
see whether there were any notifications that had been sent but we didn't 
receive (scoped just to the particular tag for the given repair).

Users who don't use nodetool but go via JMX directly can still use this new 
endpoint and implement similar behaviour in their clients as desired.
I'm also expiring the notifications which have been kept on the server side.
Please let me know if you've any questions or can think of a different 
approach. I also tried setting:
 JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
but this didn't fix the test. I suppose it might help under certain scenarios, 
but in this test we don't even send that many notifications, so I'm not 
surprised it doesn't fix it.
It seems like getting lost notifications is always a potential problem with JMX 
as far as I can tell.
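
For illustration, the client-side handling could look roughly like the sketch below. The replay interface is hypothetical, standing in for the new endpoint described above; only the javax.management types are real.

{code:java}
import java.util.List;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

// Hypothetical replay endpoint: returns the repair notifications the server
// still holds for a given tag, newer than the last sequence number we saw.
interface RepairNotificationSource
{
    List<Notification> notificationsSince(String tag, long lastSequenceSeen);
}

// Sketch of a listener that replays missed notifications instead of hanging.
class RepairNotificationHandler implements NotificationListener
{
    private final RepairNotificationSource source;
    private final String repairTag;
    private long lastSequenceSeen;

    RepairNotificationHandler(RepairNotificationSource source, String repairTag)
    {
        this.source = source;
        this.repairTag = repairTag;
    }

    @Override
    public void handleNotification(Notification notification, Object handback)
    {
        if (JMXConnectionNotification.NOTIFS_LOST.equals(notification.getType()))
        {
            // Some notifications were dropped: replay whatever the server still
            // holds for this repair, rather than waiting forever.
            for (Notification missed : source.notificationsSince(repairTag, lastSequenceSeen))
                process(missed);
            return;
        }
        process(notification);
    }

    private void process(Notification notification)
    {
        lastSequenceSeen = Math.max(lastSequenceSeen, notification.getSequenceNumber());
        // ... track progress and detect ProgressEventType.COMPLETE / ERROR here.
    }
}
{code}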






[jira] [Commented] (CASSANDRA-13307) The specification of protocol version in cqlsh means the python driver doesn't automatically downgrade protocol version.

2017-04-13 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968247#comment-15968247
 ] 

Matt Byrd commented on CASSANDRA-13307:
---

Hey [~michaelsembwever], did you still want me to take a look? 
It sounds like the failures can be explained by flakiness? 

> The specification of protocol version in cqlsh means the python driver 
> doesn't automatically downgrade protocol version.
> 
>
> Key: CASSANDRA-13307
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13307
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
>  Labels: doc-impacting
> Fix For: 3.11.x
>
>
> Hi,
> Looks like we've regressed on the issue described in:
> https://issues.apache.org/jira/browse/CASSANDRA-9467
> In that we're no longer able to connect from newer cqlsh versions
> (e.g. trunk) to older versions of Cassandra with a lower version of the 
> protocol (e.g. 2.1 with protocol version 3).
> The problem seems to be that we're relying on the ability for the client to 
> automatically downgrade protocol version implemented in Cassandra here:
> https://issues.apache.org/jira/browse/CASSANDRA-12838
> and utilised in the python client here:
> https://datastax-oss.atlassian.net/browse/PYTHON-240
> The problem however comes when we implemented:
> https://datastax-oss.atlassian.net/browse/PYTHON-537
> "Don't downgrade protocol version if explicitly set" 
> (included when we bumped from 3.5.0 to 3.7.0 of the python driver as part of 
> fixing: https://issues.apache.org/jira/browse/CASSANDRA-11534)
> And we do explicitly specify the protocol version in bin/cqlsh.py.
> I've got a patch which just adds an option to explicitly specify the protocol 
> version (for those who want to do that) and otherwise defaults to not 
> setting the protocol version, i.e. using the protocol version from the client 
> which we ship, which should by default be the same protocol as the server.
> Then it should downgrade gracefully as was intended. 
> Let me know if that seems reasonable.
> Thanks,
> Matt





[jira] [Commented] (CASSANDRA-13307) The specification of protocol version in cqlsh means the python driver doesn't automatically downgrade protocol version.

2017-03-29 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947667#comment-15947667
 ] 

Matt Byrd commented on CASSANDRA-13307:
---

Hey [~tjake] are you at all keen to review? or shall I see if someone else can?
Thanks

> The specification of protocol version in cqlsh means the python driver 
> doesn't automatically downgrade protocol version.
> 
>
> Key: CASSANDRA-13307
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13307
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 3.11.x
>
>
> Hi,
> Looks like we've regressed on the issue described in:
> https://issues.apache.org/jira/browse/CASSANDRA-9467
> In that we're no longer able to connect from newer cqlsh versions
> (e.g. trunk) to older versions of Cassandra with a lower version of the 
> protocol (e.g. 2.1 with protocol version 3).
> The problem seems to be that we're relying on the ability for the client to 
> automatically downgrade protocol version implemented in Cassandra here:
> https://issues.apache.org/jira/browse/CASSANDRA-12838
> and utilised in the python client here:
> https://datastax-oss.atlassian.net/browse/PYTHON-240
> The problem however comes when we implemented:
> https://datastax-oss.atlassian.net/browse/PYTHON-537
> "Don't downgrade protocol version if explicitly set" 
> (included when we bumped from 3.5.0 to 3.7.0 of the python driver as part of 
> fixing: https://issues.apache.org/jira/browse/CASSANDRA-11534)
> And we do explicitly specify the protocol version in bin/cqlsh.py.
> I've got a patch which just adds an option to explicitly specify the protocol 
> version (for those who want to do that) and otherwise defaults to not 
> setting the protocol version, i.e. using the protocol version from the client 
> which we ship, which should by default be the same protocol as the server.
> Then it should downgrade gracefully as was intended. 
> Let me know if that seems reasonable.
> Thanks,
> Matt





[jira] [Updated] (CASSANDRA-13307) The specification of protocol version in cqlsh means the python driver doesn't automatically downgrade protocol version.

2017-03-08 Thread Matt Byrd (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd updated CASSANDRA-13307:
--
Status: Patch Available  (was: Open)

https://github.com/Jollyplum/cassandra/commit/b52b27810bf0d3bb9caafe21fde6120cf53c7382
https://github.com/apache/cassandra/pull/96
https://github.com/Jollyplum/cassandra/tree/13307

> The specification of protocol version in cqlsh means the python driver 
> doesn't automatically downgrade protocol version.
> 
>
> Key: CASSANDRA-13307
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13307
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Matt Byrd
>Assignee: Matt Byrd
>Priority: Minor
> Fix For: 3.11.x
>
>
> Hi,
> Looks like we've regressed on the issue described in:
> https://issues.apache.org/jira/browse/CASSANDRA-9467
> In that we're no longer able to connect from newer cqlsh versions
> (e.g. trunk) to older versions of Cassandra with a lower version of the 
> protocol (e.g. 2.1 with protocol version 3).
> The problem seems to be that we're relying on the ability for the client to 
> automatically downgrade protocol version implemented in Cassandra here:
> https://issues.apache.org/jira/browse/CASSANDRA-12838
> and utilised in the python client here:
> https://datastax-oss.atlassian.net/browse/PYTHON-240
> The problem however comes when we implemented:
> https://datastax-oss.atlassian.net/browse/PYTHON-537
> "Don't downgrade protocol version if explicitly set" 
> (included when we bumped from 3.5.0 to 3.7.0 of the python driver as part of 
> fixing: https://issues.apache.org/jira/browse/CASSANDRA-11534)
> And we do explicitly specify the protocol version in bin/cqlsh.py.
> I've got a patch which just adds an option to explicitly specify the protocol 
> version (for those who want to do that) and otherwise defaults to not 
> setting the protocol version, i.e. using the protocol version from the client 
> which we ship, which should by default be the same protocol as the server.
> Then it should downgrade gracefully as was intended. 
> Let me know if that seems reasonable.
> Thanks,
> Matt





[jira] [Created] (CASSANDRA-13307) The specification of protocol version in cqlsh means the python driver doesn't automatically downgrade protocol version.

2017-03-07 Thread Matt Byrd (JIRA)
Matt Byrd created CASSANDRA-13307:
-

 Summary: The specification of protocol version in cqlsh means the 
python driver doesn't automatically downgrade protocol version.
 Key: CASSANDRA-13307
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13307
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Reporter: Matt Byrd
Assignee: Matt Byrd
Priority: Minor


Hi,
Looks like we've regressed on the issue described in:
https://issues.apache.org/jira/browse/CASSANDRA-9467
In that we're no longer able to connect from newer cqlsh versions
(e.g. trunk) to older versions of Cassandra with a lower version of the protocol 
(e.g. 2.1 with protocol version 3).

The problem seems to be that we're relying on the ability for the client to 
automatically downgrade protocol version implemented in Cassandra here:
https://issues.apache.org/jira/browse/CASSANDRA-12838
and utilised in the python client here:
https://datastax-oss.atlassian.net/browse/PYTHON-240

The problem however comes when we implemented:
https://datastax-oss.atlassian.net/browse/PYTHON-537
"Don't downgrade protocol version if explicitly set" 
(included when we bumped from 3.5.0 to 3.7.0 of the python driver as part of 
fixing: https://issues.apache.org/jira/browse/CASSANDRA-11534)

And we do explicitly specify the protocol version in bin/cqlsh.py.

I've got a patch which just adds an option to explicitly specify the protocol 
version (for those who want to do that) and otherwise defaults to not 
setting the protocol version, i.e. using the protocol version from the client 
which we ship, which should by default be the same protocol as the server.
Then it should downgrade gracefully as was intended. 
Let me know if that seems reasonable.
Thanks,
Matt






[jira] [Commented] (CASSANDRA-7688) Add data sizing to a system table

2015-01-30 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299242#comment-14299242
 ] 

Matt Byrd commented on CASSANDRA-7688:
--

So I suppose the reason for suggesting exposing the same call via CQL 
was that, at least abstractly, it was clear what this meant.
I concede that plumbing all this through might not be straightforward.

The problem with putting it in a system table is: what exactly do you put there?

The current computation is a somewhat expensive on-demand computation that is 
generally done relatively rarely.

Was your intent to just periodically execute this function and dump the results 
into system tables?
Or did you have something different in mind?

 Add data sizing to a system table
 -

 Key: CASSANDRA-7688
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688
 Project: Cassandra
  Issue Type: New Feature
Reporter: Jeremiah Jordan
Assignee: Aleksey Yeschenko
 Fix For: 2.1.3


 Currently you can't implement something similar to describe_splits_ex purely 
 from a native protocol driver.  
 https://datastax-oss.atlassian.net/browse/JAVA-312 is open to expose easily 
 getting ownership information to a client in the java-driver.  But you still 
 need the data sizing part to get splits of a given size.  We should add the 
 sizing information to a system table so that native clients can get to it.





[jira] [Created] (CASSANDRA-8052) OOMs from allocating large arrays when deserializing (e.g probably corrupted EstimatedHistogram data)

2014-10-03 Thread Matt Byrd (JIRA)
Matt Byrd created CASSANDRA-8052:


 Summary: OOMs from allocating large arrays when deserializing (e.g 
probably corrupted EstimatedHistogram data)
 Key: CASSANDRA-8052
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8052
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: linux
Reporter: Matt Byrd



We've seen nodes with what are presumably corrupted sstables repeatedly OOM on 
attempted startup with a message such as:
{code}
java.lang.OutOfMemoryError: Java heap space
 at org.apache.cassandra.utils.EstimatedHistogram$EstimatedHistogramSerializer.deserialize(EstimatedHistogram.java:266)
 at org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:292)
 at org.apache.cassandra.io.sstable.SSTableMetadata$SSTableMetadataSerializer.deserialize(SSTableMetadata.java:282)
 at org.apache.cassandra.io.sstable.SSTableReader.openMetadata(SSTableReader.java:234)
 at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:194)
 at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
 at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
{code}

It's probably not a coincidence that it's throwing an exception here, since this 
seems to be the first byte read from the file.

Presumably the correct operational process is just to replace the node; 
however, I was wondering whether, more generally, we might want to validate 
lengths when we deserialise things.
This would avoid allocating large byte buffers that cause unpredictable OOMs, and 
instead throw an exception to be handled as appropriate.

In this particular instance, there is no need for an unduly large size for the 
estimated histogram.
Admittedly things are slightly different in 2.1, though I suspect a similar 
thing might have happened with:
{code}
   int numComponents = in.readInt();
   // read toc
   Map<MetadataType, Integer> toc = new HashMap<>(numComponents); 
{code}
Doing a find-usages of DataInputStream.readInt() reveals quite a few places 
where an int is read in and then an ArrayList, array or map of that size is 
created.
In some cases this size might validly vary over a Java int, 
or be in a performance-critical or delicate piece of code where one doesn't 
want such checks.
Also, there are other checksums and mechanisms at play which make some input 
less likely to be corrupted.

However, is it maybe worth a pass over instances of this type of input, to try 
to avoid such cases where it makes sense?
Perhaps there are less likely but worse failure modes present and hidden, 
e.g. if the deserialisation happens to be for a message sent to some or all 
nodes say.
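
For illustration, the kind of guard being suggested might look like the following sketch; the bound is an arbitrary assumption, not a value from the codebase.

{code:java}
import java.io.DataInput;
import java.io.IOException;

// Sketch: validate a length read from disk/network before allocating, so a
// corrupt value raises an exception instead of driving a huge allocation.
final class SafeLengths
{
    private static final int MAX_REASONABLE_LENGTH = 1 << 20; // assumed sanity bound

    static int readValidatedLength(DataInput in) throws IOException
    {
        int length = in.readInt();
        if (length < 0 || length > MAX_REASONABLE_LENGTH)
            throw new IOException("Suspicious length " + length + "; data is probably corrupt");
        return length;
    }

    // Usage (illustrative): long[] bucketOffsets = new long[readValidatedLength(in)];
}
{code}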





[jira] [Commented] (CASSANDRA-7688) Add data sizing to a system table

2014-08-08 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091315#comment-14091315
 ] 

Matt Byrd commented on CASSANDRA-7688:
--

Originally I was just thinking of exposing the same method available in Thrift 
via some CQL syntax, i.e. essentially from StorageProxy:
   public List<Pair<Range<Token>, Long>> getSplits(String keyspaceName, String 
cfName, Range<Token> range, int keysPerSplit, CFMetaData metadata)

This in turn actually operates on the index intervals in memory, getting 
appropriately sized splits given the samples taken.

Can you please elaborate on what the idea is behind storing this info in a 
system table?
It would seem that you would need to keep doing the above computation or 
something similar and write the result to a system table.
I would have thought it'd be easier to just expose the StorageProxy call via 
CQL?

 Add data sizing to a system table
 -

 Key: CASSANDRA-7688
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688
 Project: Cassandra
  Issue Type: New Feature
Reporter: Jeremiah Jordan
 Fix For: 2.1.1


 Currently you can't implement something similar to describe_splits_ex purely 
 from a native protocol driver.  
 https://datastax-oss.atlassian.net/browse/JAVA-312 is open to expose easily 
 getting ownership information to a client in the java-driver.  But you still 
 need the data sizing part to get splits of a given size.  We should add the 
 sizing information to a system table so that native clients can get to it.





[jira] [Commented] (CASSANDRA-7543) Assertion error when compacting large row with map//list field or range tombstone

2014-07-16 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064110#comment-14064110
 ] 

Matt Byrd commented on CASSANDRA-7543:
--

Thanks for looking at this.
With the attached patch my repro script no longer reproduces the problem.
 It might also be nice to include the value of openedMarkerSize in the debug 
log line for the dataSize, if only to avoid confusion when debugging; however, 
it's not strictly necessary.


 Assertion error when compacting large row with map//list field or range 
 tombstone
 -

 Key: CASSANDRA-7543
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7543
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: linux
Reporter: Matt Byrd
Assignee: Yuki Morishita
  Labels: compaction, map
 Fix For: 1.2.19

 Attachments: 0001-add-rangetombstone-test.patch, 
 0002-fix-rangetomebstone-not-included-in-LCR-size-calc.patch


 Hi,
 So in a couple of clusters we're hitting this problem when compacting large 
 rows with a schema which contains the map data-type.
 Here is an example of the error:
 {code}
 java.lang.AssertionError: incorrect row data size 87776427 written to 
 /cassandra/X/Y/X-Y-tmp-ic-2381-Data.db; correct is 87845952
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:162)
 org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:163)
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
  
 org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
  
 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
 {code}
 I have a python script which reproduces the problem, by just writing lots of 
 data to a single partition key with a schema that contains the map data-type.
 I added some debug logging and found that the difference in bytes seen in the 
 reproduction (255) was due to the following pieces of data being written:
 {code}
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:42,891 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 [java.nio.HeapByteBuffer[pos=0 lim=34 cap=34], java.nio.HeapByteBuffer[pos=0 
 lim=34 cap=34]](deletedAt=1405237116014999, localDeletion=1405237116) 
 startPosition: 262476 endPosition: 262561 diff: 85 
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:43,007 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 org.apache.cassandra.db.Column@3e5b5939 startPosition: 328157 endPosition: 
 328242 diff: 85 
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:44,159 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 org.apache.cassandra.db.Column@fc3299b startPosition: 984105 endPosition: 
 984190 diff: 85
 {code}
 So looking at the code you can see that there are extra range tombstones 
 written on the column index border (in ColumnIndex where 
 tombstoneTracker.writeOpenedMarker is called) which aren't accounted for in 
 LazilyCompactedRow.columnSerializedSize.
 This is where the difference comes from in the assertion error, so the 
 solution is just to account for this data.
 I have a patch which does just this, by keeping track of the extra data 
 written out via tombstoneTracker.writeOpenedMarker in ColumnIndex and adding 
 it back to the dataSize in LazilyCompactedRow.java, where it serialises out 
 the row size.
 After applying the patch the reproduction stops producing the AssertionError.
 I know this is not a problem in 2.0+ because of single-pass compaction; 
 however, there are lots of 1.2 clusters out there still which might run into 
 this.
 Please let me know if you've any questions.
 Thanks,
 Matt





[jira] [Created] (CASSANDRA-7543) Assertion error when compacting large row with map//list field or range tombstone

2014-07-14 Thread Matt Byrd (JIRA)
Matt Byrd created CASSANDRA-7543:


 Summary: Assertion error when compacting large row with map//list 
field or range tombstone
 Key: CASSANDRA-7543
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7543
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: linux cassandra 1.2.16
Reporter: Matt Byrd


Hi,
So in a couple of clusters we're hitting this problem when compacting large 
rows with a schema which contains the map data-type.
Here is an example of the error:
{code}
java.lang.AssertionError: incorrect row data size 87776427 written to 
/cassandra/X/Y/X-Y-tmp-ic-2381-Data.db; correct is 87845952
org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:162)
org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:163)
org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
 
org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
{code}

I have a python script which reproduces the problem, by just writing lots of 
data to a single partition key with a schema that contains the map data-type.

I added some debug logging and found that the difference in bytes seen in the 
reproduction (255) was due to the following pieces of data being written:
{code}
DEBUG [CompactionExecutor:3] 2014-07-13 00:38:42,891 ColumnIndex.java (line 
168) DATASIZE writeOpenedMarker columnIndex: 
org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
[java.nio.HeapByteBuffer[pos=0 lim=34 cap=34], java.nio.HeapByteBuffer[pos=0 
lim=34 cap=34]](deletedAt=1405237116014999, localDeletion=1405237116) 
startPosition: 262476 endPosition: 262561 diff: 85 
DEBUG [CompactionExecutor:3] 2014-07-13 00:38:43,007 ColumnIndex.java (line 
168) DATASIZE writeOpenedMarker columnIndex: 
org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
org.apache.cassandra.db.Column@3e5b5939 startPosition: 328157 endPosition: 
328242 diff: 85 
DEBUG [CompactionExecutor:3] 2014-07-13 00:38:44,159 ColumnIndex.java (line 
168) DATASIZE writeOpenedMarker columnIndex: 
org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
org.apache.cassandra.db.Column@fc3299b startPosition: 984105 endPosition: 
984190 diff: 85
{code}

So looking at the code you can see that there are extra range tombstones 
written on the column index border (in ColumnIndex where 
tombstoneTracker.writeOpenedMarker is called) which aren't accounted for in 
LazilyCompactedRow.columnSerializedSize.

This is where the difference comes from in the assertion error, so the solution 
is just to account for this data.
I have a patch which does just this, by keeping track of the extra data written 
out via tombstoneTracker.writeOpenedMarker in ColumnIndex and adding it back to 
the dataSize in LazilyCompactedRow.java, where it serialises out the row size.
After applying the patch the reproduction stops producing the AssertionError.

I know this is not a problem in 2.0+ because of single-pass compaction; however, 
there are lots of 1.2 clusters out there still which might run into this.
Please let me know if you've any questions.

Thanks,
Matt
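
As a schematic of the accounting fix described above (names simplified and illustrative; the real change is in ColumnIndex and LazilyCompactedRow in 1.2):

{code:java}
// Sketch: count the bytes of each range-tombstone marker re-opened at a
// column-index boundary, and fold that total into the expected row data size
// so the size written by compaction matches what was actually serialised.
class OpenedMarkerAccounting
{
    private long openedMarkerBytes = 0;

    // Called wherever the re-opened marker is written out
    // (i.e. around tombstoneTracker.writeOpenedMarker), with its serialised size.
    void onOpenedMarkerWritten(long serialisedSize)
    {
        openedMarkerBytes += serialisedSize;
    }

    // Expected row data size = serialised columns + re-opened markers.
    long expectedRowDataSize(long columnSerializedSize)
    {
        return columnSerializedSize + openedMarkerBytes;
    }
}
{code}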






[jira] [Commented] (CASSANDRA-7543) Assertion error when compacting large row with map//list field or range tombstone

2014-07-14 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061463#comment-14061463
 ] 

Matt Byrd commented on CASSANDRA-7543:
--

I believe this is the same issue, which wasn't fixed in 1.2.x, the 
recommendation being to move to 2.0 where the error was unlikely to occur.


 Assertion error when compacting large row with map//list field or range 
 tombstone
 -

 Key: CASSANDRA-7543
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7543
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: linux cassandra 1.2.16
Reporter: Matt Byrd
  Labels: compaction, map

 Hi,
 So in a couple of clusters we're hitting this problem when compacting large 
 rows with a schema which contains the map data-type.
 Here is an example of the error:
 {code}
 java.lang.AssertionError: incorrect row data size 87776427 written to 
 /cassandra/X/Y/X-Y-tmp-ic-2381-Data.db; correct is 87845952
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:162)
 org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:163)
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
  
 org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
  
 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
 {code}
 I have a python script which reproduces the problem, by just writing lots of 
 data to a single partition key with a schema that contains the map data-type.
 I added some debug logging and found that the difference in bytes seen in the 
 reproduction (255) was due to the following pieces of data being written:
 {code}
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:42,891 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 [java.nio.HeapByteBuffer[pos=0 lim=34 cap=34], java.nio.HeapByteBuffer[pos=0 
 lim=34 cap=34]](deletedAt=1405237116014999, localDeletion=1405237116) 
 startPosition: 262476 endPosition: 262561 diff: 85 
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:43,007 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 org.apache.cassandra.db.Column@3e5b5939 startPosition: 328157 endPosition: 
 328242 diff: 85 
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:44,159 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 org.apache.cassandra.db.Column@fc3299b startPosition: 984105 endPosition: 
 984190 diff: 85
 {code}
 So looking at the code you can see that there are extra range tombstones 
 written on the column index border (in ColumnIndex where 
 tombstoneTracker.writeOpenedMarker is called) which aren't accounted for in 
 LazilyCompactedRow.columnSerializedSize.
 This is where the difference comes from in the assertion error, so the 
 solution is just to account for this data.
 I have a patch which does just this, by keeping track of the extra data 
 written out via tombstoneTracker.writeOpenedMarker in ColumnIndex and adding 
 it back to the dataSize in LazilyCompactedRow.java, where it serialises out 
 the row size.
 After applying the patch the reproduction stops producing the AssertionError.
 I know this is not a problem in 2.0+ because of single-pass compaction, 
 however there are lots of 1.2 clusters out there still which might run into 
 this.
 Please let me know if you've any questions.
 Thanks,
 Matt



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-7543) Assertion error when compacting large row with map//list field or range tombstone

2014-07-14 Thread Matt Byrd (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd updated CASSANDRA-7543:
-

Environment: linux  (was: linux cassandra 1.2.16)

 Assertion error when compacting large row with map//list field or range 
 tombstone
 -

 Key: CASSANDRA-7543
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7543
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: linux
Reporter: Matt Byrd
  Labels: compaction, map

 Hi,
 So in a couple of clusters we're hitting this problem when compacting large 
 rows with a schema which contains the map data-type.
 Here is an example of the error:
 {code}
 java.lang.AssertionError: incorrect row data size 87776427 written to 
 /cassandra/X/Y/X-Y-tmp-ic-2381-Data.db; correct is 87845952
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:162)
 org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:163)
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
  
 org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
  
 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
 {code}
 I have a python script which reproduces the problem, by just writing lots of 
 data to a single partition key with a schema that contains the map data-type.
 I added some debug logging and found that the difference in bytes seen in the 
 reproduction (255) was due to the following pieces of data being written:
 {code}
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:42,891 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 [java.nio.HeapByteBuffer[pos=0 lim=34 cap=34], java.nio.HeapByteBuffer[pos=0 
 lim=34 cap=34]](deletedAt=1405237116014999, localDeletion=1405237116) 
 startPosition: 262476 endPosition: 262561 diff: 85 
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:43,007 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 org.apache.cassandra.db.Column@3e5b5939 startPosition: 328157 endPosition: 
 328242 diff: 85 
 DEBUG [CompactionExecutor:3] 2014-07-13 00:38:44,159 ColumnIndex.java (line 
 168) DATASIZE writeOpenedMarker columnIndex: 
 org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: 
 org.apache.cassandra.db.Column@fc3299b startPosition: 984105 endPosition: 
 984190 diff: 85
 {code}
 So looking at the code you can see that there are extra range tombstones 
 written on the column index border (in ColumnIndex where 
 tombstoneTracker.writeOpenedMarker is called) which aren't accounted for in 
 LazilyCompactedRow.columnSerializedSize.
 This is where the difference comes from in the assertion error, so the 
 solution is just to account for this data.
 I have a patch which does just this, by keeping track of the extra data 
 written out via tombstoneTracker.writeOpenedMarker in ColumnIndex and adding 
 it back to the dataSize in LazilyCompactedRow.java, where it serialises out 
 the row size.
 After applying the patch the reproduction stops producing the AssertionError.
 I know this is not a problem in 2.0+ because of single-pass compaction, 
 however there are lots of 1.2 clusters out there still which might run into 
 this.
 Please let me know if you've any questions.
 Thanks,
 Matt



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-7533) Let MAX_OUTSTANDING_REPLAY_COUNT be configurable

2014-07-14 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061506#comment-14061506
 ] 

Matt Byrd commented on CASSANDRA-7533:
--

Just to add a bit more context: we had a single instance of Cassandra get 
fairly stuck replaying commitlogs.
It was burning through 2000%+ CPU for over four hours with no end in sight, so 
we killed it, removed the commit logs, brought it back up and ran repair. (This 
was in QA, thankfully.)

The problem can easily be reproduced by just writing 100,000 CQL rows (range 
deletes) to the same partition key, stopping Cassandra and starting it again.
I admit this is somewhat of an anti-pattern, but it is still quite a dramatic 
effect from not very much data.
The problem exercised here is that:
1. We contend in the memtable to do this insert in a CAS loop.
2. The work done in this loop becomes ever more expensive, because 
RangeTombstoneList.dataSize is iterated over to compute the size.

Point 2 is effectively fixed in 2.1 with all the off-heap allocation, where the 
dataSize calculation effectively becomes more online.
To resolve this problem in 2.0 you could also keep this tally of dataSize 
online, or maybe start keeping it online once the list is sufficiently big to 
cause a problem (see the sketch below).
Doing this seemed to help a lot, but far simpler was just toggling the 
concurrency of the commitlog replay, which can be achieved by lowering 
MAX_OUTSTANDING_REPLAY_COUNT (in our case setting this to 1 seemed to help).
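A minimal sketch of the "keep the tally online" idea (placeholder class and 
field names, not the actual RangeTombstoneList code): maintain the running 
dataSize as entries are added, instead of re-walking the whole list on every 
insert.
{code}
import java.util.ArrayList;
import java.util.List;

// Sketch only: keeps the size tally online rather than recomputing it by
// iterating the whole list on every insert (the hot loop described above).
public class OnlineSizeList
{
    private final List<byte[]> entries = new ArrayList<byte[]>();
    private long dataSize = 0; // running tally, updated as entries are added

    public void add(byte[] entry)
    {
        entries.add(entry);
        dataSize += entry.length; // O(1) per insert instead of O(n)
    }

    public long dataSize()
    {
        return dataSize; // no iteration required
    }

    public static void main(String[] args)
    {
        OnlineSizeList list = new OnlineSizeList();
        for (int i = 0; i < 100000; i++) // roughly the scale of the repro above
            list.add(new byte[16]);
        System.out.println(list.dataSize()); // 1600000
    }
}
{code}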

Thanks,
Matt


 Let MAX_OUTSTANDING_REPLAY_COUNT be configurable
 

 Key: CASSANDRA-7533
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7533
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jeremiah Jordan
Assignee: Yuki Morishita
Priority: Minor
 Fix For: 2.0.10


 There are some workloads where commit log replay will run into contention 
 issues with multiple things updating the same partition.  Through some 
 testing it was found that lowering CommitLogReplayer.java 
 MAX_OUTSTANDING_REPLAY_COUNT can help with this issue.
 The calculations added in CASSANDRA-6655 are one such place things get 
 bottlenecked.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-5345) Potential problem with GarbageCollectorMXBean

2014-03-25 Thread Matt Byrd (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd updated CASSANDRA-5345:
-

Reproduced In: 1.0.7

 Potential problem with GarbageCollectorMXBean
 -

 Key: CASSANDRA-5345
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5345
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.0.7
 Environment: JVM:JVM vendor/version: Java HotSpot(TM) 64-Bit Server 
 VM/1.6.0_30  typical 6 node 2 availability zone Multi DC cluster on linux vms 
 with
 and mx4j-tools.jar and jna.jar both on path. Default configuration bar token 
 setup(equispaced), sensible cassandra-topology.properties file and use of 
 said snitch.
Reporter: Matt Byrd
Assignee: Ryan McGuire
Priority: Trivial

 I am not certain this is definitely a bug, but I thought it might be worth 
 posting to see if someone with more JVM//JMX knowledge could disprove my 
 reasoning. Apologies if I've failed to understand something.
 We've seen an intermittent problem where there is an uncaught exception in 
 the scheduled task of logging gc results in GcInspector.java:
 {code}
 ...
  ERROR [ScheduledTasks:1] 2013-03-08 01:09:06,335 
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
 Thread[ScheduledTasks:1,5,main]
 java.lang.reflect.UndeclaredThrowableException
 at $Proxy0.getName(Unknown Source)
 at 
 org.apache.cassandra.service.GCInspector.logGCResults(GCInspector.java:95)
 at 
 org.apache.cassandra.service.GCInspector.access$000(GCInspector.java:41)
 at org.apache.cassandra.service.GCInspector$1.run(GCInspector.java:85)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: javax.management.InstanceNotFoundException: 
 java.lang:name=ParNew,type=GarbageCollector
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1094)
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:662)
 at 
 com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
 at 
 com.sun.jmx.mbeanserver.MXBeanProxy$GetHandler.invoke(MXBeanProxy.java:106)
 at com.sun.jmx.mbeanserver.MXBeanProxy.invoke(MXBeanProxy.java:148)
 at 
 javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:248)
 ... 13 more
 ...
 {code}
 I think the problem, may be caused by the following reasoning:
 In GcInspector we populate a list of mxbeans when the GcInspector instance is 
 instantiated:
 {code}
 ...
 List<GarbageCollectorMXBean> beans = new ArrayList<GarbageCollectorMXBean>();
 MBeanServer server = ManagementFactory.getPlatformMBeanServer();
 try
 {
 ObjectName gcName = new 
 ObjectName(ManagementFactory.GARBAGE_COLLECTOR_MXBEAN_DOMAIN_TYPE + ",*");
 for (ObjectName name : server.queryNames(gcName, null))
 {
 GarbageCollectorMXBean gc = 
 ManagementFactory.newPlatformMXBeanProxy(server, name.getCanonicalName(), 
 GarbageCollectorMXBean.class);
 beans.add(gc);
 }
 }
 catch (Exception e)
 {
 throw new RuntimeException(e);
 }
 ...
 {code}
 Cassandra then periodically calls:
 {code}
 ...
 private void logGCResults()
 {
 for (GarbageCollectorMXBean gc : beans)
 {
 Long previousTotal = gctimes.get(gc.getName());
 ...
 {code}
 In the oracle javadocs, they seem to suggest that these beans could disappear 
 at any time.(I'm not sure why when or how this might happen)
 http://docs.oracle.com/javase/6/docs/api/
 See: getGarbageCollectorMXBeans
 {code}
 ...
 public static List<GarbageCollectorMXBean> getGarbageCollectorMXBeans()
 Returns a list of GarbageCollectorMXBean objects in the Java virtual machine. 
 The Java virtual machine may have one or more 

[jira] [Updated] (CASSANDRA-5345) Potential problem with GarbageCollectorMXBean

2014-03-25 Thread Matt Byrd (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd updated CASSANDRA-5345:
-

Reproduced In:   (was: 1.0.7)
Since Version: 1.0.7

 Potential problem with GarbageCollectorMXBean
 -

 Key: CASSANDRA-5345
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5345
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.0.7
 Environment: JVM:JVM vendor/version: Java HotSpot(TM) 64-Bit Server 
 VM/1.6.0_30  typical 6 node 2 availability zone Multi DC cluster on linux vms 
 with
 and mx4j-tools.jar and jna.jar both on path. Default configuration bar token 
 setup(equispaced), sensible cassandra-topology.properties file and use of 
 said snitch.
Reporter: Matt Byrd
Assignee: Ryan McGuire
Priority: Trivial

 I am not certain this is definitely a bug, but I thought it might be worth 
 posting to see if someone with more JVM//JMX knowledge could disprove my 
 reasoning. Apologies if I've failed to understand something.
 We've seen an intermittent problem where there is an uncaught exception in 
 the scheduled task of logging gc results in GcInspector.java:
 {code}
 ...
  ERROR [ScheduledTasks:1] 2013-03-08 01:09:06,335 
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
 Thread[ScheduledTasks:1,5,main]
 java.lang.reflect.UndeclaredThrowableException
 at $Proxy0.getName(Unknown Source)
 at 
 org.apache.cassandra.service.GCInspector.logGCResults(GCInspector.java:95)
 at 
 org.apache.cassandra.service.GCInspector.access$000(GCInspector.java:41)
 at org.apache.cassandra.service.GCInspector$1.run(GCInspector.java:85)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: javax.management.InstanceNotFoundException: 
 java.lang:name=ParNew,type=GarbageCollector
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1094)
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:662)
 at 
 com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
 at 
 com.sun.jmx.mbeanserver.MXBeanProxy$GetHandler.invoke(MXBeanProxy.java:106)
 at com.sun.jmx.mbeanserver.MXBeanProxy.invoke(MXBeanProxy.java:148)
 at 
 javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:248)
 ... 13 more
 ...
 {code}
 I think the problem, may be caused by the following reasoning:
 In GcInspector we populate a list of mxbeans when the GcInspector instance is 
 instantiated:
 {code}
 ...
 List<GarbageCollectorMXBean> beans = new ArrayList<GarbageCollectorMXBean>();
 MBeanServer server = ManagementFactory.getPlatformMBeanServer();
 try
 {
 ObjectName gcName = new 
 ObjectName(ManagementFactory.GARBAGE_COLLECTOR_MXBEAN_DOMAIN_TYPE + ",*");
 for (ObjectName name : server.queryNames(gcName, null))
 {
 GarbageCollectorMXBean gc = 
 ManagementFactory.newPlatformMXBeanProxy(server, name.getCanonicalName(), 
 GarbageCollectorMXBean.class);
 beans.add(gc);
 }
 }
 catch (Exception e)
 {
 throw new RuntimeException(e);
 }
 ...
 {code}
 Cassandra then periodically calls:
 {code}
 ...
 private void logGCResults()
 {
 for (GarbageCollectorMXBean gc : beans)
 {
 Long previousTotal = gctimes.get(gc.getName());
 ...
 {code}
 In the oracle javadocs, they seem to suggest that these beans could disappear 
 at any time.(I'm not sure why when or how this might happen)
 http://docs.oracle.com/javase/6/docs/api/
 See: getGarbageCollectorMXBeans
 {code}
 ...
 public static List<GarbageCollectorMXBean> getGarbageCollectorMXBeans()
 Returns a list of GarbageCollectorMXBean objects in the Java virtual machine. 
 The Java virtual machine may 

[jira] [Updated] (CASSANDRA-5345) Potential problem with GarbageCollectorMXBean

2014-03-25 Thread Matt Byrd (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Byrd updated CASSANDRA-5345:
-

Priority: Major  (was: Trivial)

 Potential problem with GarbageCollectorMXBean
 -

 Key: CASSANDRA-5345
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5345
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.0.7
 Environment: JVM:JVM vendor/version: Java HotSpot(TM) 64-Bit Server 
 VM/1.6.0_30  typical 6 node 2 availability zone Multi DC cluster on linux vms 
 with
 and mx4j-tools.jar and jna.jar both on path. Default configuration bar token 
 setup(equispaced), sensible cassandra-topology.properties file and use of 
 said snitch.
Reporter: Matt Byrd
Assignee: Ryan McGuire

 I am not certain this is definitely a bug, but I thought it might be worth 
 posting to see if someone with more JVM//JMX knowledge could disprove my 
 reasoning. Apologies if I've failed to understand something.
 We've seen an intermittent problem where there is an uncaught exception in 
 the scheduled task of logging gc results in GcInspector.java:
 {code}
 ...
  ERROR [ScheduledTasks:1] 2013-03-08 01:09:06,335 
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
 Thread[ScheduledTasks:1,5,main]
 java.lang.reflect.UndeclaredThrowableException
 at $Proxy0.getName(Unknown Source)
 at 
 org.apache.cassandra.service.GCInspector.logGCResults(GCInspector.java:95)
 at 
 org.apache.cassandra.service.GCInspector.access$000(GCInspector.java:41)
 at org.apache.cassandra.service.GCInspector$1.run(GCInspector.java:85)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: javax.management.InstanceNotFoundException: 
 java.lang:name=ParNew,type=GarbageCollector
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1094)
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:662)
 at 
 com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
 at 
 com.sun.jmx.mbeanserver.MXBeanProxy$GetHandler.invoke(MXBeanProxy.java:106)
 at com.sun.jmx.mbeanserver.MXBeanProxy.invoke(MXBeanProxy.java:148)
 at 
 javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:248)
 ... 13 more
 ...
 {code}
 I think the problem, may be caused by the following reasoning:
 In GcInspector we populate a list of mxbeans when the GcInspector instance is 
 instantiated:
 {code}
 ...
 List<GarbageCollectorMXBean> beans = new ArrayList<GarbageCollectorMXBean>();
 MBeanServer server = ManagementFactory.getPlatformMBeanServer();
 try
 {
 ObjectName gcName = new 
 ObjectName(ManagementFactory.GARBAGE_COLLECTOR_MXBEAN_DOMAIN_TYPE + ",*");
 for (ObjectName name : server.queryNames(gcName, null))
 {
 GarbageCollectorMXBean gc = 
 ManagementFactory.newPlatformMXBeanProxy(server, name.getCanonicalName(), 
 GarbageCollectorMXBean.class);
 beans.add(gc);
 }
 }
 catch (Exception e)
 {
 throw new RuntimeException(e);
 }
 ...
 {code}
 Cassandra then periodically calls:
 {code}
 ...
 private void logGCResults()
 {
 for (GarbageCollectorMXBean gc : beans)
 {
 Long previousTotal = gctimes.get(gc.getName());
 ...
 {code}
 In the oracle javadocs, they seem to suggest that these beans could disappear 
 at any time.(I'm not sure why when or how this might happen)
 http://docs.oracle.com/javase/6/docs/api/
 See: getGarbageCollectorMXBeans
 {code}
 ...
 public static List<GarbageCollectorMXBean> getGarbageCollectorMXBeans()
 Returns a list of GarbageCollectorMXBean objects in the Java virtual machine. 
 The Java virtual machine may have one or more GarbageCollectorMXBean objects. 
 It 

[jira] [Commented] (CASSANDRA-5345) Potential problem with GarbageCollectorMXBean

2014-03-25 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13946228#comment-13946228
 ] 

Matt Byrd commented on CASSANDRA-5345:
--

The cluster in question was running 1.0.7; however, the code in question has 
remained static since well before that and doesn't look to have changed since 
(though admittedly the problem could somehow be being caused elsewhere, the JVM 
maybe?).
I've upped the priority to Major.

Have you been able to reproduce, or seen the problem anywhere else?
Any further details about how your environment is set up and how you deploy may 
also help those trying to reproduce.
Some common, but perhaps coincidental, things about the two occurrences:
1. Virtual machines (though not both AWS).
2. Multi-DC; I wouldn't have thought this would be relevant, but Arya does seem 
to see the problem after removing a DC.
3. Slightly old JVM versions...

I no longer have access to the cluster where we saw this previously, but let me 
know if I can help in any other way.

 Potential problem with GarbageCollectorMXBean
 -

 Key: CASSANDRA-5345
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5345
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.0.7
 Environment: JVM:JVM vendor/version: Java HotSpot(TM) 64-Bit Server 
 VM/1.6.0_30  typical 6 node 2 availability zone Multi DC cluster on linux vms 
 with
 and mx4j-tools.jar and jna.jar both on path. Default configuration bar token 
 setup(equispaced), sensible cassandra-topology.properties file and use of 
 said snitch.
Reporter: Matt Byrd
Assignee: Ryan McGuire

 I am not certain this is definitely a bug, but I thought it might be worth 
 posting to see if someone with more JVM//JMX knowledge could disprove my 
 reasoning. Apologies if I've failed to understand something.
 We've seen an intermittent problem where there is an uncaught exception in 
 the scheduled task of logging gc results in GcInspector.java:
 {code}
 ...
  ERROR [ScheduledTasks:1] 2013-03-08 01:09:06,335 
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
 Thread[ScheduledTasks:1,5,main]
 java.lang.reflect.UndeclaredThrowableException
 at $Proxy0.getName(Unknown Source)
 at 
 org.apache.cassandra.service.GCInspector.logGCResults(GCInspector.java:95)
 at 
 org.apache.cassandra.service.GCInspector.access$000(GCInspector.java:41)
 at org.apache.cassandra.service.GCInspector$1.run(GCInspector.java:85)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
 at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: javax.management.InstanceNotFoundException: 
 java.lang:name=ParNew,type=GarbageCollector
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1094)
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:662)
 at 
 com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
 at 
 com.sun.jmx.mbeanserver.MXBeanProxy$GetHandler.invoke(MXBeanProxy.java:106)
 at com.sun.jmx.mbeanserver.MXBeanProxy.invoke(MXBeanProxy.java:148)
 at 
 javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:248)
 ... 13 more
 ...
 {code}
 I think the problem, may be caused by the following reasoning:
 In GcInspector we populate a list of mxbeans when the GcInspector instance is 
 instantiated:
 {code}
 ...
 List<GarbageCollectorMXBean> beans = new ArrayList<GarbageCollectorMXBean>();
 MBeanServer server = ManagementFactory.getPlatformMBeanServer();
 try
 {
 ObjectName gcName = new 
 ObjectName(ManagementFactory.GARBAGE_COLLECTOR_MXBEAN_DOMAIN_TYPE + ",*");
 for (ObjectName name : server.queryNames(gcName, null))
 {
 GarbageCollectorMXBean gc = 
 ManagementFactory.newPlatformMXBeanProxy(server, name.getCanonicalName(), 
 

[jira] [Created] (CASSANDRA-6797) compaction and scrub data directories race on startup

2014-03-03 Thread Matt Byrd (JIRA)
Matt Byrd created CASSANDRA-6797:


 Summary: compaction and scrub data directories race on startup
 Key: CASSANDRA-6797
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6797
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: macos (and linux)
Reporter: Matt Byrd
Priority: Minor


 
Hi,  

On doing a rolling restart of a 2.0.5 cluster in several environments I'm 
seeing the following error:
{code}

 INFO [CompactionExecutor:1] 2014-03-03 17:11:07,549 CompactionTask.java (line 
115) Compacting 
[SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13-Data.db'),
 SSTableReader(path='/Users/Matthew/.ccm/compactio
n_race/node1/data/system/local/system-local-jb-15-Data.db'), 
SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-16-Data.db'),
 
SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/syst
em-local-jb-14-Data.db')]
 INFO [CompactionExecutor:1] 2014-03-03 17:11:07,557 ColumnFamilyStore.java 
(line 254) Initializing system_traces.sessions
 INFO [CompactionExecutor:1] 2014-03-03 17:11:07,560 ColumnFamilyStore.java 
(line 254) Initializing system_traces.events
 WARN [main] 2014-03-03 17:11:07,608 ColumnFamilyStore.java (line 473) Removing 
orphans for 
/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13: 
[CompressionInfo.db, Filter.db, Index.db, TOC.txt, Summary.db, Data.db, 
Statistics.
db]
ERROR [main] 2014-03-03 17:11:07,609 CassandraDaemon.java (line 479) Exception 
encountered during startup
java.lang.AssertionError: attempted to delete non-existing file 
system-local-jb-13-CompressionInfo.db
at 
org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:111)
at 
org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:106)
at 
org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:476)
at 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:462)
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:552)
 INFO [CompactionExecutor:1] 2014-03-03 17:11:07,612 CompactionTask.java (line 
275) Compacted 4 sstables to 
[/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-17,].
  10,963 bytes to 5,572 (~50% of original) in 57ms = 0.093226MB/s.  4 total 
partitions merged to 1.  Partition merge counts were {4:1, }

{code}
This seems like a potential race, since compactions are occurring whilst the 
existing data directories are being scrubbed.
Probably an in-progress compaction looks like an incomplete one, which results 
in an attempt to scrub it whilst it is still in progress.
On the attempt to delete in scrubDataDirectories we discover that the file no 
longer exists, presumably because it has now been compacted away.
This then causes an assertion error and the node fails to start up.

Here is a ccm script which just stops and starts a 3-node 2.0.5 cluster 
repeatedly.
It seems to reproduce the problem fairly reliably, in fewer than ten iterations:

{code}
#!/bin/bash

ccm create compaction_race -v 2.0.5
ccm populate -n 3
ccm start

for i in $(seq 0 1000); do 
echo $i;
ccm stop
ccm start
grep ERR ~/.ccm/compaction_race/*/logs/system.log;
done

{code}
 
Someone else should probably confirm that this is what is going wrong; 
however, if it is, the solution might be as simple as disabling autocompaction 
slightly earlier in CassandraDaemon.setup (see the sketch below).

Alternatively, if there isn't a good reason why we first scrub the system 
tables and then scrub all keyspaces (including the system keyspace), you could 
perhaps scrub only the non-system keyspaces on the second pass.
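As a rough illustration of the first option (a sketch under my own assumptions 
about the ordering; the method names are placeholders, not the real 
CassandraDaemon/CompactionManager calls), the startup sequence would switch 
autocompaction off before any scrubbing and only switch it back on afterwards:
{code}
// Sketch of the proposed startup ordering, with placeholder methods standing in
// for the real ColumnFamilyStore/CompactionManager calls.
public class StartupOrderingSketch
{
    public static void main(String[] args)
    {
        disableAutoCompaction();   // proposed: do this before any scrubbing starts
        scrubSystemKeyspace();     // no background compaction can now race with the scrub
        scrubNonSystemKeyspaces(); // per the alternative, skip re-scrubbing the system keyspace here
        enableAutoCompaction();    // only re-enable once scrubbing has finished
    }

    private static void disableAutoCompaction()   { System.out.println("autocompaction disabled"); }
    private static void scrubSystemKeyspace()     { System.out.println("scrubbing system keyspace"); }
    private static void scrubNonSystemKeyspaces() { System.out.println("scrubbing user keyspaces"); }
    private static void enableAutoCompaction()    { System.out.println("autocompaction enabled"); }
}
{code}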

Please let me know if there is anything else I can provide.
Thanks,
Matt




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6797) compaction and scrub data directories race on startup

2014-03-03 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918938#comment-13918938
 ] 

Matt Byrd commented on CASSANDRA-6797:
--

I think this may be the same or a similar issue, but since the repro is more 
complicated and the environment windows, I thought I'd file this ticket also.

 compaction and scrub data directories race on startup
 -

 Key: CASSANDRA-6797
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6797
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: macos (and linux)
Reporter: Matt Byrd
Priority: Minor
  Labels: compaction, concurrency, starting

  
 Hi,  
 On doing a rolling restarting of a 2.0.5 cluster in several environments I'm 
 seeing the following error:
 {code}
  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,549 CompactionTask.java 
 (line 115) Compacting 
 [SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13-Data.db'),
  SSTableReader(path='/Users/Matthew/.ccm/compactio
 n_race/node1/data/system/local/system-local-jb-15-Data.db'), 
 SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-16-Data.db'),
  
 SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/syst
 em-local-jb-14-Data.db')]
  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,557 ColumnFamilyStore.java 
 (line 254) Initializing system_traces.sessions
  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,560 ColumnFamilyStore.java 
 (line 254) Initializing system_traces.events
  WARN [main] 2014-03-03 17:11:07,608 ColumnFamilyStore.java (line 473) 
 Removing orphans for 
 /Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13:
  [CompressionInfo.db, Filter.db, Index.db, TOC.txt, Summary.db, Data.db, 
 Statistics.
 db]
 ERROR [main] 2014-03-03 17:11:07,609 CassandraDaemon.java (line 479) 
 Exception encountered during startup
 java.lang.AssertionError: attempted to delete non-existing file 
 system-local-jb-13-CompressionInfo.db
 at 
 org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:111)
 at 
 org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:106)
 at 
 org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:476)
 at 
 org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
 at 
 org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:462)
 at 
 org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:552)
  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,612 CompactionTask.java 
 (line 275) Compacted 4 sstables to 
 [/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-17,].
   10,963 bytes to 5,572 (~50% of original) in 57ms = 0.093226MB/s.  4 total 
 partitions merged to 1.  Partition merge counts were {4:1, }
 {code}
 Seems like a potential race, since compactions are occurring whilst the 
 existing data directories are being scrubbed.
 Probably an in progress compaction looks like an incomplete one and results 
 in it being attempted to be scrubbed whilst in progress. 
 On the attempt to delete in the scrubDataDirectories we discover that it no 
 longer exists, presumably because it has now been compacted away. 
 This then causes an assertion error and the node fails to start up. 
 Here is a ccm script which just stops and starts a 3 node 2.0.5 cluster 
 repeatedly. 
 It seems to fairly reliably reproduce the problem, in less than ten 
 iterations: 
 {code}
 #!/bin/bash
 ccm create compaction_race -v 2.0.5
 ccm populate -n 3
 ccm start
 for i in $(seq 0 1000); do 
 echo $i;
 ccm stop
 ccm start
 grep ERR ~/.ccm/compaction_race/*/logs/system.log;
 done
 {code}
  
 Someone else should probably confirm that this is what is going wrong,  
 however if it is, the solution might be as simple as to disable 
 autocompactions slightly earlier in CassandraDaemon.setup. 
  
 Or alternatively if there isn't a good reason why we are first scrubbing the 
 system tables and then scrubbing all keyspaces (including the system 
 keyspace), you could perhaps just scrub solely the non system keyspaces on 
 the second scrub.
 Please let me know if there is anything else I can provide.
 Thanks,
 Matt



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (CASSANDRA-6797) compaction and scrub data directories race on startup

2014-03-03 Thread Matt Byrd (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918938#comment-13918938
 ] 

Matt Byrd edited comment on CASSANDRA-6797 at 3/4/14 5:18 AM:
--

I think CASSANDRA-6795 may be the same or a similar issue, but since the 
reproduction is more complicated and the environment is Windows, I thought I'd 
file this ticket also.


was (Author: mbyrd):
I think this may be the same or a similar issue, but since the repro is more 
complicated and the environment windows, I thought I'd file this ticket also.

 compaction and scrub data directories race on startup
 -

 Key: CASSANDRA-6797
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6797
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: macos (and linux)
Reporter: Matt Byrd
Priority: Minor
  Labels: compaction, concurrency, starting

  
 Hi,  
 On doing a rolling restarting of a 2.0.5 cluster in several environments I'm 
 seeing the following error:
 {code}
  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,549 CompactionTask.java 
 (line 115) Compacting 
 [SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13-Data.db'),
  SSTableReader(path='/Users/Matthew/.ccm/compactio
 n_race/node1/data/system/local/system-local-jb-15-Data.db'), 
 SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-16-Data.db'),
  
 SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/syst
 em-local-jb-14-Data.db')]
  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,557 ColumnFamilyStore.java 
 (line 254) Initializing system_traces.sessions
  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,560 ColumnFamilyStore.java 
 (line 254) Initializing system_traces.events
  WARN [main] 2014-03-03 17:11:07,608 ColumnFamilyStore.java (line 473) 
 Removing orphans for 
 /Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13:
  [CompressionInfo.db, Filter.db, Index.db, TOC.txt, Summary.db, Data.db, 
 Statistics.
 db]
 ERROR [main] 2014-03-03 17:11:07,609 CassandraDaemon.java (line 479) 
 Exception encountered during startup
 java.lang.AssertionError: attempted to delete non-existing file 
 system-local-jb-13-CompressionInfo.db
 at 
 org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:111)
 at 
 org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:106)
 at 
 org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:476)
 at 
 org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
 at 
 org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:462)
 at 
 org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:552)
  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,612 CompactionTask.java 
 (line 275) Compacted 4 sstables to 
 [/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-17,].
   10,963 bytes to 5,572 (~50% of original) in 57ms = 0.093226MB/s.  4 total 
 partitions merged to 1.  Partition merge counts were {4:1, }
 {code}
 Seems like a potential race, since compactions are occurring whilst the 
 existing data directories are being scrubbed.
 Probably an in progress compaction looks like an incomplete one and results 
 in it being attempted to be scrubbed whilst in progress. 
 On the attempt to delete in the scrubDataDirectories we discover that it no 
 longer exists, presumably because it has now been compacted away. 
 This then causes an assertion error and the node fails to start up. 
 Here is a ccm script which just stops and starts a 3 node 2.0.5 cluster 
 repeatedly. 
 It seems to fairly reliably reproduce the problem, in less than ten 
 iterations: 
 {code}
 #!/bin/bash
 ccm create compaction_race -v 2.0.5
 ccm populate -n 3
 ccm start
 for i in $(seq 0 1000); do 
 echo $i;
 ccm stop
 ccm start
 grep ERR ~/.ccm/compaction_race/*/logs/system.log;
 done
 {code}
  
 Someone else should probably confirm that this is what is going wrong,  
 however if it is, the solution might be as simple as to disable 
 autocompactions slightly earlier in CassandraDaemon.setup. 
  
 Or alternatively if there isn't a good reason why we are first scrubbing the 
 system tables and then scrubbing all keyspaces (including the system 
 keyspace), you could perhaps just scrub solely the non system keyspaces on 
 the second scrub.
 Please let me know if there is anything else I can provide.
 Thanks,
 Matt



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (CASSANDRA-5345) Potential problem with GarbageCollectorMXBean

2013-03-14 Thread Matt Byrd (JIRA)
Matt Byrd created CASSANDRA-5345:


 Summary: Potential problem with GarbageCollectorMXBean
 Key: CASSANDRA-5345
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5345
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.0.7
 Environment: JVM:JVM vendor/version: Java HotSpot(TM) 64-Bit Server 
VM/1.6.0_30  typical 6 node 2 availability zone Multi DC cluster on linux vms 
with
and mx4j-tools.jar and jna.jar both on path. Default configuration bar token 
setup(equispaced), sensible cassandra-topology.properties file and use of said 
snitch.
Reporter: Matt Byrd
Priority: Trivial


I am not certain this is definitely a bug, but I thought it might be worth 
posting to see if someone with more JVM/JMX knowledge could disprove my 
reasoning. Apologies if I've failed to understand something.

We've seen an intermittent problem where there is an uncaught exception in the 
scheduled task of logging gc results in GcInspector.java:

{code}
...
 ERROR [ScheduledTasks:1] 2013-03-08 01:09:06,335 AbstractCassandraDaemon.java 
(line 139) Fatal exception in thread Thread[ScheduledTasks:1,5,main]
java.lang.reflect.UndeclaredThrowableException
at $Proxy0.getName(Unknown Source)
at 
org.apache.cassandra.service.GCInspector.logGCResults(GCInspector.java:95)
at 
org.apache.cassandra.service.GCInspector.access$000(GCInspector.java:41)
at org.apache.cassandra.service.GCInspector$1.run(GCInspector.java:85)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: javax.management.InstanceNotFoundException: 
java.lang:name=ParNew,type=GarbageCollector
at 
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1094)
at 
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:662)
at 
com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
at 
com.sun.jmx.mbeanserver.MXBeanProxy$GetHandler.invoke(MXBeanProxy.java:106)
at com.sun.jmx.mbeanserver.MXBeanProxy.invoke(MXBeanProxy.java:148)
at 
javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:248)
... 13 more
...
{code}

I think the problem may be caused by the following reasoning:

In GcInspector we populate a list of mxbeans when the GcInspector instance is 
instantiated:

{code}
...
List<GarbageCollectorMXBean> beans = new ArrayList<GarbageCollectorMXBean>();
MBeanServer server = ManagementFactory.getPlatformMBeanServer();
try
{
    ObjectName gcName = new ObjectName(ManagementFactory.GARBAGE_COLLECTOR_MXBEAN_DOMAIN_TYPE + ",*");
    for (ObjectName name : server.queryNames(gcName, null))
    {
        GarbageCollectorMXBean gc = ManagementFactory.newPlatformMXBeanProxy(server, name.getCanonicalName(),
                                                                             GarbageCollectorMXBean.class);
        beans.add(gc);
    }
}
catch (Exception e)
{
    throw new RuntimeException(e);
}
...
{code}

Cassandra then periodically calls:

{code}
...
private void logGCResults()
{
for (GarbageCollectorMXBean gc : beans)
{
Long previousTotal = gctimes.get(gc.getName());
...
{code}

In the Oracle javadocs, they seem to suggest that these beans could disappear 
at any time (I'm not sure why, when or how this might happen):
http://docs.oracle.com/javase/6/docs/api/
See: getGarbageCollectorMXBeans

{code}
...
public static List<GarbageCollectorMXBean> getGarbageCollectorMXBeans()
Returns a list of GarbageCollectorMXBean objects in the Java virtual machine. 
The Java virtual machine may have one or more GarbageCollectorMXBean objects. 
It may add or remove GarbageCollectorMXBean during execution.
Returns:
a list of GarbageCollectorMXBean objects.
...
{code}
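If that javadoc caveat is indeed what bites here, one defensive sketch (my own 
placeholder code, not the project's actual fix) is to re-query the beans on 
every logging pass via ManagementFactory.getGarbageCollectorMXBeans() instead of 
caching proxies at construction time, and to tolerate a bean vanishing 
mid-iteration:
{code}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch only: re-resolve the beans on every pass so a collector that is added
// or removed at runtime cannot leave us holding a proxy to a deregistered MBean.
public class GcLoggingSketch
{
    public static void logGcResults()
    {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
        {
            try
            {
                System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                                   + " timeMs=" + gc.getCollectionTime());
            }
            catch (RuntimeException e)
            {
                // If the bean vanished between the query and the call, skip it rather
                // than letting the scheduled task die with an uncaught exception.
            }
        }
    }

    public static void main(String[] args)
    {
        logGcResults();
    }
}
{code}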

Correct me if I'm wrong, but do you think this javadoc caveat might be causing the problem? 
That somehow the JVM decides to remove the GarbageCollectorMXBean temporarily 
or