[ https://issues.apache.org/jira/browse/CASSANDRA-5426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621523#comment-13621523 ]
Yuki Morishita commented on CASSANDRA-5426:
-------------------------------------------

Work in progress is pushed to: https://github.com/yukim/cassandra/commits/5426-1

Only the normal case, where everything works, is implemented so far.

--

First of all, ActiveRepairService is broken down into several classes and placed into o.a.c.repair to make my work easier.

The main design change around messages is that all repair-related messages are packed into RepairMessage and handled in RepairMessageVerbHandler, which is executed in the ANTI_ENTROPY stage. RepairMessage carries a RepairMessageHeader and its content (if any). RepairMessageHeader indicates which repair job the message belongs to and specifies the content type. Six content types are currently defined in RepairMessageType: VALIDATION_REQUEST, VALIDATION_COMPLETE, VALIDATION_FAILED, SYNC_REQUEST, SYNC_COMPLETE, and SYNC_FAILED.

*VALIDATION_REQUEST*

VALIDATION_REQUEST is sent from the repair initiator (coordinator) to request a Merkle tree.

*VALIDATION_COMPLETE*/*VALIDATION_FAILED*

The calculated Merkle tree is sent back in a VALIDATION_COMPLETE message. A VALIDATION_FAILED message is sent when something goes wrong on the remote node.

*SYNC_REQUEST*

SYNC_REQUEST is sent when we have to repair two remote nodes. This corresponds to the forwarded StreamingRepairTask we have today.

*SYNC_COMPLETE*/*SYNC_FAILED*

When there is no need to exchange data, or when data had to be exchanged and streaming has completed, the node (this includes the node that received SYNC_REQUEST) sends back SYNC_COMPLETE. If streaming fails, it sends back SYNC_FAILED.

The whole repair process depends on asynchronous message exchange using MessagingService, so there is still a chance of a hang when a node fails to deliver a message (see CASSANDRA-5393).

Any feedback is appreciated.
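
To make the message design described above easier to picture, here is a minimal, self-contained sketch in Java. It is not the code from the 5426-1 branch. The class names RepairMessage, RepairMessageHeader, RepairMessageType and RepairMessageVerbHandler and the six type names come from the comment. Everything else is an assumption made for illustration: the wrapper class RepairMessageSketch, the field names jobId, type and content, the doVerb method, and the log list that stands in for real work such as building a Merkle tree or streaming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Illustrative sketch only; not the code from the 5426-1 branch.
final class RepairMessageSketch {

    /** The six content types named in the comment. */
    enum RepairMessageType {
        VALIDATION_REQUEST, VALIDATION_COMPLETE, VALIDATION_FAILED,
        SYNC_REQUEST, SYNC_COMPLETE, SYNC_FAILED
    }

    /** Says which repair job a message belongs to and what it carries (field names are assumed). */
    static final class RepairMessageHeader {
        final UUID jobId;
        final RepairMessageType type;

        RepairMessageHeader(UUID jobId, RepairMessageType type) {
            this.jobId = jobId;
            this.type = type;
        }
    }

    /** A header plus optional content; content is null when the type carries none. */
    static final class RepairMessage {
        final RepairMessageHeader header;
        final Object content;

        RepairMessage(RepairMessageHeader header, Object content) {
            this.header = header;
            this.content = content;
        }
    }

    /** Single entry point for every repair message; dispatches on the header type. */
    static final class RepairMessageVerbHandler {
        // Hypothetical stand-in for the real actions, so the dispatch can be observed.
        final List<String> log = new ArrayList<>();

        void doVerb(RepairMessage message) {
            UUID job = message.header.jobId;
            switch (message.header.type) {
                case VALIDATION_REQUEST:
                    log.add(job + ": build Merkle tree and reply");
                    break;
                case VALIDATION_COMPLETE:
                    log.add(job + ": received Merkle tree");
                    break;
                case VALIDATION_FAILED:
                    log.add(job + ": validation failed on remote node");
                    break;
                case SYNC_REQUEST:
                    log.add(job + ": sync with the other remote node");
                    break;
                case SYNC_COMPLETE:
                    log.add(job + ": sync complete");
                    break;
                case SYNC_FAILED:
                    log.add(job + ": streaming failed");
                    break;
            }
        }
    }

    public static void main(String[] args) {
        RepairMessageVerbHandler handler = new RepairMessageVerbHandler();
        UUID job = UUID.randomUUID();
        handler.doVerb(new RepairMessage(
                new RepairMessageHeader(job, RepairMessageType.VALIDATION_REQUEST), null));
        handler.doVerb(new RepairMessage(
                new RepairMessageHeader(job, RepairMessageType.SYNC_FAILED), null));
        handler.log.forEach(System.out::println);
    }
}
```

The point of the single handler is that every failure has an explicit message type, so the initiator can be told about a failed validation or a failed stream instead of waiting for a reply that never arrives. A message that is never delivered at all is still not covered, as noted above.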

> Redesign repair messages
> ------------------------
>
> Key: CASSANDRA-5426
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5426
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Yuki Morishita
> Assignee: Yuki Morishita
> Priority: Minor
> Fix For: 2.0
>
>
> Many people have been reporting 'repair hang' when something goes wrong.
> Two major causes of hang are 1) validation failure and 2) streaming failure.
> Currently, when those failures happen, the failed node does not respond to the repair initiator.
> The goal of this ticket is to redesign message flows around repair so that repair never hangs.