[ https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
T Jake Luciani updated CASSANDRA-8911: -------------------------------------- Fix Version/s: 3.x > Consider Mutation-based Repairs > ------------------------------- > > Key: CASSANDRA-8911 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8911 > Project: Cassandra > Issue Type: Improvement > Reporter: Tyler Hobbs > Fix For: 3.x > > > We should consider a mutation-based repair to replace the existing streaming > repair. While we're at it, we could do away with a lot of the complexity > around merkle trees. > I have not planned this out in detail, but here's roughly what I'm thinking: > * Instead of building an entire merkle tree up front, just send the "leaves" > one-by-one. Instead of dealing with token ranges, make the leaves primary > key ranges. The PK ranges would need to be contiguous, so that the start of > each range would match the end of the previous range. (The first and last > leaves would need to be open-ended on one end of the PK range.) This would be > similar to doing a read with paging. > * Once one page of data is read, compute a hash of it and send it to the > other replicas along with the PK range that it covers and a row count. > * When the replicas receive the hash, the perform a read over the same PK > range (using a LIMIT of the row count + 1) and compare hashes (unless the row > counts don't match, in which case this can be skipped). > * If there is a mismatch, the replica will send a mutation covering that > page's worth of data (ignoring the row count this time) to the source node. > Here are the advantages that I can think of: > * With the current repair behavior of streaming, vnode-enabled clusters may > need to stream hundreds of small SSTables. This results in increased compact > ion load on the receiving node. With the mutation-based approach, memtables > would naturally merge these. > * It's simple to throttle. For example, you could give a number of rows/sec > that should be repaired. > * It's easy to see what PK range has been repaired so far. This could make > it simpler to resume a repair that fails midway. > * Inconsistencies start to be repaired almost right away. > * Less special code \(?\) > * Wide partitions are no longer a problem. > There are a few problems I can think of: > * Counters. I don't know if this can be made safe, or if they need to be > skipped. > * To support incremental repair, we need to be able to read from only > repaired sstables. Probably not too difficult to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)