[jira] [Comment Edited] (CASSANDRA-4417) invalid counter shard detected
[ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549412#comment-13549412 ]

Janne Jalkanen edited comment on CASSANDRA-4417 at 1/10/13 7:58 AM:

I'm seeing this while running repair -pr. Three-node cluster, RF 3. Straight upgrade from 1.0.12 to 1.1.8; no topology changes. I see two invalid shard IDs whose counts differ by more than one - sometimes even by 3000 or more. It seems random to my eyes. Our counters are in a composite column family, no TTLs in use. We *mostly* increment by one, but sometimes by more.

I did disablegossip, disablethrift, drain, shutdown, upgrade, restart on every node in a rolling fashion. Then I ran upgradesstables and repair -pr on every node once the entire cluster had been upgraded.

Environment is Ubuntu Linux 12.04 LTS; the JVM is OpenJDK 7u9.

The last repair picked up 497 invalid counter shards, and we have approximately 8 million counters, of which about a hundred are incremented each second (and sometimes subtracted from if our read repair kicks in - we have our own in-app repair for certain low values). All the counter writes are batched at 100 increments/batch. So this only affects a really small subset, though it's rather annoying when it happens, as it means that you can never really trust the counters to be even in the ballpark :-/

invalid counter shard detected
---

Key: CASSANDRA-4417
URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
Project: Cassandra
Issue Type: Bug
Components: Core
Affects Versions: 1.1.1
Environment: Amazon Linux
Reporter: Senthilvel Rangaswamy
Attachments: cassandra-mck.log.bz2, err.txt

Seeing errors like these:

2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1--6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1--6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard

What does it mean?

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
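To make the quoted error message concrete, here is a minimal Python sketch (not Cassandra's actual Java implementation) of the shard reconciliation rule the log describes: each shard is a (counter id, clock, count) triple; two shards with the same id and the same clock must carry the same count, and when they differ only in count Cassandra picks the highest to self-heal.

```python
def merge_shards(a, b):
    """Reconcile two counter shards, each a (counter_id, clock, count) tuple.
    A toy model of the rule stated in the 'invalid counter shard' log line."""
    id_a, clock_a, count_a = a
    id_b, clock_b, count_b = b
    assert id_a == id_b, "shards belong to different counters"
    if clock_a != clock_b:
        # Normal case: the shard with the higher clock supersedes the other.
        return a if clock_a > clock_b else b
    if count_a != count_b:
        # Invalid case reported by the error: same clock, different count.
        # The node logs the error and picks the highest count to self-heal.
        return a if count_a > count_b else b
    return a  # identical shards

# The two shards from the quoted error log:
healed = merge_shards(("17bfd850-ac52-11e1--6ecd0b5b61e7", 1, 13),
                      ("17bfd850-ac52-11e1--6ecd0b5b61e7", 1, 1))
```

Here `healed` is the (…, 1, 13) shard, which is why the reporters above see the surviving count jump to the larger of the two conflicting values.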
[jira] [Comment Edited] (CASSANDRA-4417) invalid counter shard detected
[ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492245#comment-13492245 ]

Mck SembWever edited comment on CASSANDRA-4417 at 11/7/12 10:20 AM:

Sylvain, here's the log from one node. For most of the log we were running 1.0.8, and then at line 2883399 we upgraded (this was the first node to upgrade) to 1.1.6. The error message comes every few seconds. Our counters are sub-columns inside supercolumns.

We completed the upgrade on all nodes, then restarted again (because JNA was missing). We are now running upgradesstables, but that's not in this logfile. The error messages still appear.

An operational problem we've had recently is that we had one node down for about a month (faulty RAID controller), and when we finally brought the node back into the cluster, nightly repairs would never finish. In the end we just disabled nightly repairs (we don't have tombstones) with the plan that an upgrade and upgradesstables would bring us back to a state where repairs would work again. I have no idea if this can be related.
[jira] [Comment Edited] (CASSANDRA-4417) invalid counter shard detected
[ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485333#comment-13485333 ]

Eric Lubow edited comment on CASSANDRA-4417 at 10/27/12 2:22 AM:

We are getting this on DSE 2.2 (C* 1.1.5) on a new node during bootstrap. We upgraded the cluster from C* 1.0.10 about 10 days ago; upgradesstables was run on every node and we repaired the entire cluster. We've been getting this error sporadically on various nodes at various points, but it's not consistent. I've double- and triple-checked every node looking for sstable files named *-hd-* and I don't see any (assuming that's enough to tell that the sstable has been upgraded). If this error is an effect of requiring one to run upgradesstables, then how would it happen during a bootstrap? All nodes involved in this cluster are 1.1.5.
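The filename check Eric describes can be sketched in Python: scan the data directories for SSTable files whose names still contain the `-hd-` format marker he grepped for. The data directory path below is an assumption (a common default); substitute whatever your `data_file_directories` setting points at.

```python
import glob
import os

def old_format_sstables(data_dir="/var/lib/cassandra/data"):
    """Return SSTable files whose names still contain the '-hd-' marker,
    i.e. files that upgradesstables has not yet rewritten (per the check
    described in the comment above). data_dir is an assumed default path."""
    pattern = os.path.join(data_dir, "*", "*-hd-*")
    return sorted(glob.glob(pattern))
```

An empty result is what Eric observed, which is what makes the error during bootstrap surprising if stale on-disk format were the cause.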
[jira] [Comment Edited] (CASSANDRA-4417) invalid counter shard detected
[ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453537#comment-13453537 ]

Omid Aladini edited comment on CASSANDRA-4417 at 9/12/12 10:18 AM:

{quote}
A simple workaround is to use batch commit log, but that has a potentially important performance impact.
{quote}

I'm a bit confused about why the batch commit log would solve the problem. If Cassandra crashes before the batch is fsynced, the counter mutations in the batch which it was the leader for will still be lost, although they might have been applied on other replicas. The difference would be that the mutations won't be acknowledged to the client, and since counters aren't idempotent, the client won't know whether to retry or not. Am I missing something?
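Omid's point about non-idempotence can be shown with a toy Python sketch (not Cassandra code): a counter increment applied twice adds twice, so a client whose write was not acknowledged cannot safely retry — it must choose between a possible undercount (give up) and a possible overcount (retry a write that in fact survived on other replicas).

```python
class Counter:
    """Toy counter illustrating why increments are not idempotent."""
    def __init__(self):
        self.value = 0

    def increment(self, delta):
        self.value += delta  # applying the same increment twice adds twice

c = Counter()
c.increment(1)   # the write succeeds, but suppose the ack is lost
c.increment(1)   # the client retries "the same" increment
assert c.value == 2  # overcounted relative to the intended total of 1
```

Contrast this with an idempotent write (e.g. setting a cell to a value), where a blind retry is always safe.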
[jira] [Comment Edited] (CASSANDRA-4417) invalid counter shard detected
[ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450766#comment-13450766 ]

Charles Brophy edited comment on CASSANDRA-4417 at 9/8/12 3:39 AM:

We have a six-node cluster [1.1.3, JDK 1.6.33, CentOS 6] with even key range balance, random partitioner, and replication factor 2. I get these errors immediately after running nodetool repair, but ONLY if a streaming repair happens as a result. We are serving live updates to our counters from our clickstream. My guess is that the sstable being streamed between the servers winds up becoming out of date for the duration of the streaming process and ends up containing these duplicates that are vetted during the subsequent compaction. In any case, for us it is 100% reproducible via: nodetool repair -> streaming repair -> subsequent compaction. Let me know if you need more details. Hope this helps!