[ 
https://issues.apache.org/jira/browse/CASSANDRA-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012034#comment-13012034
 ] 

Sylvain Lebresne commented on CASSANDRA-2324:
---------------------------------------------

The problem is that the ranges over which repair computes hashes are not actual node ranges.

Let's consider the following ring (RF=2), where to simplify I consider tokens to be
in [0..12], and where everything is consistent:
{noformat}
                  _.-""""-._
 C (token: 11)  .'          `.
 [11,3][3,7]   /              \
              |                |
              |                | A(token: 3)
              |                | [3,7],[7,11]
               \              /
                `._        _.'
       B (token: 7)`-....-'
       [7,11],[11,3]
{noformat}
Now say I run a repair on node A. The problem is that the Merkle tree ranges
are built by recursively halving the full range. This means that in this
example, the ranges in the tree will be, for instance, [0,2], [2,4], [4,6],
[6,8], [8,10] and [10,12].
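
To make the halving concrete, here is a minimal, purely illustrative sketch
(plain Java, not Cassandra's MerkleTree code). The exact leaf ranges depend on
the chosen tree depth; the point is that none of them are tied to the node
token boundaries:
{noformat}
import java.util.ArrayList;
import java.util.List;

public class HalvingSketch
{
    // Recursively halve [left, right] down to the requested depth and collect
    // the leaf ranges, the way a Merkle tree of that depth would.
    static void split(int left, int right, int depth, List<int[]> leaves)
    {
        if (depth == 0)
        {
            leaves.add(new int[]{ left, right });
            return;
        }
        int mid = (left + right) / 2;
        split(left, mid, depth - 1, leaves);
        split(mid, right, depth - 1, leaves);
    }

    public static void main(String[] args)
    {
        List<int[]> leaves = new ArrayList<>();
        split(0, 12, 2, leaves);
        // Prints [0,3] [3,6] [6,9] [9,12]: the boundaries 6 and 9 fall in the
        // middle of the node ranges [3,7] and [7,11], so the leaves [6,9] and
        // [9,12] each straddle two different replica ranges.
        for (int[] leaf : leaves)
            System.out.print("[" + leaf[0] + "," + leaf[1] + "] ");
        System.out.println();
    }
}
{noformat}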

If you compare the hashes for A and B on those ranges, chances are you'll find
mismatches for [6,8] and [10,12] (because A doesn't have anything on [11,12]
while B does, and B doesn't have anything on [6,7] while A does). As a
consequence, the ranges [7,8] and [10,11] will be repaired, even though there
are no inconsistencies.
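
The subranges actually streamed are just the intersections of those mismatching
tree ranges with [7,11], the only range A and B have in common. A tiny,
purely illustrative check (not repair code):
{noformat}
public class OverlapSketch
{
    // Intersection of two ranges, or null if they don't overlap.
    static int[] intersect(int[] a, int[] b)
    {
        int left = Math.max(a[0], b[0]);
        int right = Math.min(a[1], b[1]);
        return left < right ? new int[]{ left, right } : null;
    }

    public static void main(String[] args)
    {
        int[] shared = { 7, 11 };                       // the range A and B have in common
        int[][] mismatching = { { 6, 8 }, { 10, 12 } }; // tree ranges whose hashes differ
        for (int[] r : mismatching)
        {
            int[] streamed = intersect(r, shared);
            // Prints [7,8] then [10,11]: consistent data that gets "repaired" anyway.
            System.out.println("[" + streamed[0] + "," + streamed[1] + "]");
        }
    }
}
{noformat}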

What that means in practice is that it will be very rare for anti-entropy to
actually consider the nodes in sync; it will almost surely "repair" something,
even if the nodes are perfectly consistent. It's very easy to check btw: with a
cluster like the one above (3 nodes, RF=2) and as few as 5 keys for the whole
cluster, I'm able to have repair do repairs over and over again.

Now the good question is: how bad is it? I'm not sure, it depends a bit.

On a 3-node cluster (RF=2), I tried inserting 1M keys with stress (stress -l
2) and triggered a repair afterwards. The number of (unnecessarily) repaired keys
was around 150 for a given node (it varies slightly from run to run because
there is some randomness in the creation of the Merkle tree), corresponding to
~44KB streamed (that is the amount transferred to the node where repair was
run, so for the total operation it's twice this, since we stream both ways).
That's ~0.02% of keys (a given node has ~666,666 keys). It's bad to do
useless work, but not really a big deal.

However, the fewer keys we have, the worse it gets (and the bigger our rows
are, the more useless transfer we do). With the same experiment inserting only
10K keys, there are 190 keys uselessly repaired. That's now close to 3% of the
load. It also gets worse with increasing replication factor.
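
For reference, the back-of-the-envelope arithmetic behind those percentages
(assuming keys are evenly distributed):
{noformat}
1M keys,  RF=2, 3 nodes:  2,000,000 / 3 ≈ 666,666 keys per node
                          150 / 666,666 ≈ 0.02%
10K keys, RF=2, 3 nodes:  20,000 / 3 ≈ 6,666 keys per node
                          190 / 6,666 ≈ 2.9%, i.e. close to 3%
{noformat}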


To fix this, we would need the ranges in the Merkle tree to "share" the node
range boundaries. An interesting way to do this would be to have the
coordinating node give a list of ranges for which to calculate Merkle trees, and
each node would compute one tree per range (for the coordinating node, that would
be #RF trees). A nice thing with this is that it would leave room to
optimize repair, since a node would need to do a validation compaction only on
the ranges asked for, which means that only the coordinator node would validate
all its data. The neighbors would do less work.
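
To make the proposal concrete, here is a rough sketch of what the coordinator
side could look like. All names and types here are hypothetical, not the actual
AntiEntropyService API; trees are reduced to opaque hashes:
{noformat}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerRangeRepairSketch
{
    // Hypothetical minimal range type for the sketch.
    static class Range { final long left, right; Range(long l, long r) { left = l; right = r; } }

    interface Neighbor
    {
        // The neighbor validates only the requested ranges and returns one
        // tree (here an opaque hash) per range, not one tree over all its data.
        Map<Range, byte[]> requestTrees(List<Range> ranges);
    }

    byte[] validateLocally(Range r) { return new byte[0]; }         // placeholder
    void streamDifferences(Range r, byte[] mine, byte[] theirs) { } // placeholder

    // The coordinator computes #RF trees, one per replica range it owns, and
    // asks each neighbor only for trees over the ranges it shares with it.
    void repair(List<Range> localRanges, Map<Neighbor, List<Range>> sharedWith)
    {
        Map<Range, byte[]> mine = new HashMap<>();
        for (Range r : localRanges)
            mine.put(r, validateLocally(r)); // validation compaction limited to r

        for (Map.Entry<Neighbor, List<Range>> e : sharedWith.entrySet())
        {
            Map<Range, byte[]> theirs = e.getKey().requestTrees(e.getValue());
            for (Range r : e.getValue())
                streamDifferences(r, mine.get(r), theirs.get(r));
        }
    }
}
{noformat}
Since every tree boundary is now a node range boundary, two replicas that are
in sync hash identical data and nothing gets streamed unnecessarily.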


> Repair transfers more data than necessary
> -----------------------------------------
>
>                 Key: CASSANDRA-2324
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2324
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.0
>            Reporter: Brandon Williams
>            Assignee: Sylvain Lebresne
>             Fix For: 0.7.5
>
>
> To repro: 3 node cluster, stress.java 1M rows with -x KEYS and -l 2.  The 
> index is enough to make some mutations drop (about 20-30k total in my tests). 
>  Repair afterwards will repair a large amount of ranges the first time.  
> However, each subsequent run will repair the same set of small ranges every 
> time.  INDEXED_RANGE_SLICE in stress never fully works.  Counting rows with 
> sstablekeys shows there are 2M rows total as expected, however when trying to 
> count the indexed keys, I get exceptions like:
> {noformat}
> Exception in thread "main" java.io.IOException: Key out of order! 
> DecoratedKey(101571366040797913119296586470838356016, 
> 0707ab782c5b5029d28a5e6d508ef72f0222528b5e28da3b7787492679dc51b96f868e0746073e54bc173be927049d0f51e25a6a95b3268213b8969abf40cea7d7)
>  > DecoratedKey(12639574763031545147067490818595764132, 
> 0bc414be3093348a2ad389ed28f18f0cc9a044b2e98587848a0d289dae13ed0ad479c74654900eeffc6236)
>         at 
> org.apache.cassandra.tools.SSTableExport.enumeratekeys(SSTableExport.java:206)
>         at 
> org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:388)
> {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
