[ https://issues.apache.org/jira/browse/CASSANDRA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191146#comment-13191146 ]
Sylvain Lebresne commented on CASSANDRA-3721:
---------------------------------------------

I did a quick pass on the patches. It seems to me that the refactoring of AntiEntropyService in this patch is largely orthogonal to the issue at hand. All that seems needed for this issue is to allow sending the treeRequests one after the other, and that should be doable with 2 lines in RepairJob.addTree(), and maybe a few more lines to send the snapshot commands. This would have the advantage of making it clear that the patch isn't breaking anything.

I am not saying that the AntiEntropyService synchronization code is the cleanest we have, and maybe a refactoring could improve that. I'm not convinced such a refactoring is necessary at this point, but if you care enough about it, I'm not strongly against it either. I do want to point out that doing that refactoring as part of this ticket will almost surely put it out of reach for 1.1, as it will make review more complicated and make it unreasonable to shove this in a handful of days before the freeze.

As a side note, I spotted 2 changes that seem gratuitous and don't appear to improve the code:
* In TreeRequestVerbHandler.doVerb, you renamed the variables. However, I think the new name, cloneRequest, is misleading, as we're not really doing a clone.
* Is there a reason to change RepairFuture to not be a Future anymore? Even if we don't really use it, it can be convenient to have it implement the native Future interface, especially given it's called RepairFuture.

> Staggering repair
> -----------------
>
> Key: CASSANDRA-3721
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3721
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Affects Versions: 1.1
> Reporter: Vijay
> Assignee: Vijay
> Priority: Minor
> Fix For: 1.1
>
> Attachments: 0001-staggering-repair-with-snapshot.patch
>
>
> Currently repair runs on all the nodes at once, causing the range of data
> to be hot (higher latency on reads).
> Sequence:
> 1) Send a repair request to all of the nodes so we can hold the references of
> the SSTables (point at which repair was initiated)
> 2) Send Validation on one node at a time (once completed will release
> references).
> 3) Hold the reference of the tree in the requesting node and once everything
> is complete start diff.
> We can also serialize the streaming part so that not more than 1 node is
> involved in the streaming.
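
The three-step sequence in the issue description can be sketched roughly as follows. This is only an illustration of the ordering (snapshot everywhere first, then validate one node at a time, then diff once every tree is held), not the code from the attached patch. The class `StaggeredRepairSketch`, the method `run`, and the string event log are all hypothetical names invented for this sketch; none of them are Cassandra APIs, and the real work (snapshot commands, tree requests, Merkle tree comparison) is replaced by log entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the staggered repair ordering; not Cassandra code.
class StaggeredRepairSketch {

    /**
     * Simulates the sequence for the given node names and returns the
     * ordered list of events, so the ordering can be inspected.
     */
    static List<String> run(List<String> nodes) {
        List<String> log = new ArrayList<>();
        // Trees held on the requesting node until every validation is done.
        Map<String, String> trees = new LinkedHashMap<>();

        // Step 1: ask every node to snapshot, so each one holds references
        // to its SSTables as of the moment repair was initiated.
        for (String node : nodes) {
            log.add("snapshot " + node);
        }

        // Step 2: request validation from one node at a time. In a real
        // system the next request would only go out once the previous
        // tree has come back; the loop models that by being sequential.
        for (String node : nodes) {
            log.add("validate " + node);
            trees.put(node, "tree(" + node + ")");
        }

        // Step 3: only after all trees are held, compare each pair.
        List<String> held = new ArrayList<>(trees.keySet());
        for (int i = 0; i < held.size(); i++) {
            for (int j = i + 1; j < held.size(); j++) {
                log.add("diff " + held.get(i) + " " + held.get(j));
            }
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : run(Arrays.asList("A", "B", "C"))) {
            System.out.println(event);
        }
    }
}
```

The point of the ordering is that at most one node is building a tree at any moment, while the snapshots taken in step 1 keep all the trees comparable even though they are built at different times.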