[ https://issues.apache.org/jira/browse/CASSANDRA-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012034#comment-13012034 ]
Sylvain Lebresne commented on CASSANDRA-2324:
---------------------------------------------

The problem is that the ranges repair hashes are not actual node ranges. Consider the following ring (RF=2), where tokens lie in [0..12] to simplify, and where everything is consistent:

{noformat}
                _.-""""-._
              .'          `.   C (token: 11)
             /              \  [11,3],[3,7]
            |                |
            |                |
A (token: 3)|                |
[3,7],[7,11] \              /
              `._        _.'   B (token: 7)
                 `-....-'      [7,11],[11,3]
{noformat}

Now say I run a repair on node A. The problem is that the Merkle tree ranges are built by dividing the full range in two recursively. In this example, the ranges in the tree will for instance be [0,2], [2,4], [4,6], [6,8], [8,10] and [10,12]. If you compare the hashes for A and B on those ranges, chances are you'll find mismatches for [6,8] and [10,12] (because A doesn't have anything in [11,12] while B does, and B doesn't have anything in [6,7] while A does). As a consequence, the ranges [7,8] and [10,11] will be repaired, even though there are no inconsistencies.

What this means in practice is that it will be very rare for anti-entropy to actually consider the nodes in sync; it will almost surely "repair" something, even if the nodes are perfectly consistent. It's very easy to check, by the way: with a cluster like the one above (3 nodes, RF=2), with as few as 5 keys in the whole cluster I'm able to have repair do repairs over and over again.

Now the good question is: how bad is it? I'm not sure; it depends a bit. On a 3-node cluster (RF=2), I tried inserting 1M keys with stress (stress -l 2) and triggered a repair afterwards. The number of (unnecessarily) repaired keys was around 150 for a given node (it varies slightly from run to run because there is some randomness in the creation of the Merkle tree), corresponding to ~44KB streamed (that is the amount transferred to the node where repair was run, so for the total operation it's twice that, since we stream both ways).
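The mismatch above can be sketched in a few lines of Java (a standalone illustration, not Cassandra code; class and method names are made up): the leaves [6,8] and [10,12] overlap the range (7,11] that A and B both replicate, but spill past a node boundary, so the two nodes hash different key sets there.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of why some Merkle tree leaves mismatch even when all
// replicas agree. Illustrative names only, not Cassandra internals.
public class MerkleBoundarySketch {

    // Return the tree leaves that are compared between two replicas
    // (i.e. overlap the range both replicate) but extend past a node
    // boundary, so each replica hashes keys the other does not store.
    static List<String> mismatchedLeaves(int[][] leaves, int sharedLo, int sharedHi) {
        List<String> out = new ArrayList<>();
        for (int[] l : leaves) {
            boolean compared  = l[0] < sharedHi && l[1] > sharedLo;   // overlaps shared range
            boolean contained = l[0] >= sharedLo && l[1] <= sharedHi; // fully inside it
            if (compared && !contained)
                out.add(Arrays.toString(l));
        }
        return out;
    }

    public static void main(String[] args) {
        // Leaves from halving [0,12] recursively, as in the example above.
        int[][] leaves = {{0,2},{2,4},{4,6},{6,8},{8,10},{10,12}};
        // A (token 3) and B (token 7) both replicate (7,11] in the RF=2 ring.
        System.out.println(mismatchedLeaves(leaves, 7, 11)); // [[6, 8], [10, 12]]
    }
}
```

If the leaf boundaries were instead forced to coincide with the node tokens 7 and 11, every compared leaf would be contained in the shared range and consistent replicas would always hash equal.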
That's ~0.02% of the keys (a given node holds ~666,666 keys). It's bad to do useless work, but not a really big deal. However, the fewer keys we have, the worse it gets (and the bigger our rows are, the more useless transfer we do). Running the same experiment with only 10K keys inserted, ~190 keys are uselessly repaired. That's now close to 3% of the load. It also gets worse with increasing replication factor.

To fix this, we would need the ranges in the Merkle tree to "share" the node range boundaries. An interesting way to do this would be to have the coordinating node hand out a list of ranges for which to compute Merkle trees, and each node would compute one tree per range (for the coordinating node, that would be RF trees). A nice thing about this is that it would leave room for optimizing repair, since a node would only need to do a validation compaction on the ranges asked for, which means that only the coordinator node would validate all of its data. The neighbors would do less work.

> Repair transfers more data than necessary
> -----------------------------------------
>
>                 Key: CASSANDRA-2324
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2324
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.0
>            Reporter: Brandon Williams
>            Assignee: Sylvain Lebresne
>             Fix For: 0.7.5
>
>
> To repro: 3 node cluster, stress.java 1M rows with -x KEYS and -l 2. The
> index is enough to make some mutations drop (about 20-30k total in my tests).
> Repair afterwards will repair a large amount of ranges the first time.
> However, each subsequent run will repair the same set of small ranges every
> time. INDEXED_RANGE_SLICE in stress never fully works. Counting rows with
> sstablekeys shows there are 2M rows total as expected, however when trying to
> count the indexed keys, I get exceptions like:
> {noformat}
> Exception in thread "main" java.io.IOException: Key out of order!
> DecoratedKey(101571366040797913119296586470838356016,
> 0707ab782c5b5029d28a5e6d508ef72f0222528b5e28da3b7787492679dc51b96f868e0746073e54bc173be927049d0f51e25a6a95b3268213b8969abf40cea7d7)
> > DecoratedKey(12639574763031545147067490818595764132,
> 0bc414be3093348a2ad389ed28f18f0cc9a044b2e98587848a0d289dae13ed0ad479c74654900eeffc6236)
>         at org.apache.cassandra.tools.SSTableExport.enumeratekeys(SSTableExport.java:206)
>         at org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:388)
> {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira