[ 
https://issues.apache.org/jira/browse/CASSANDRA-5426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621523#comment-13621523
 ] 

Yuki Morishita commented on CASSANDRA-5426:
-------------------------------------------

Work in progress is pushed to: https://github.com/yukim/cassandra/commits/5426-1

So far only the normal (success) case is implemented, and it works.

--

First of all, ActiveRepairService has been broken down into several classes, 
which are placed in o.a.c.repair, to make my work easier.

The main design change around messages is that every repair-related message is 
packed into a RepairMessage and handled by RepairMessageVerbHandler, which is 
executed in the ANTI_ENTROPY stage. A RepairMessage carries a 
RepairMessageHeader and its content (if any). The RepairMessageHeader indicates 
which repair job the message belongs to and specifies the content type. There 
are currently 6 content types, defined in RepairMessageType: 
VALIDATION_REQUEST, VALIDATION_COMPLETE, VALIDATION_FAILED, SYNC_REQUEST, 
SYNC_COMPLETE, and SYNC_FAILED.
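
To make the shape of these messages concrete, here is a minimal sketch. It is 
not the code on the 5426-1 branch: the field names, the use of a UUID as the 
repair job identifier, and the generic content parameter are all assumptions 
made for illustration. Only the six type names come from the description above.

```java
import java.util.UUID;

public class RepairMessageSketch {
    /** The six content types listed above. */
    public enum RepairMessageType {
        VALIDATION_REQUEST, VALIDATION_COMPLETE, VALIDATION_FAILED,
        SYNC_REQUEST, SYNC_COMPLETE, SYNC_FAILED
    }

    /** Hypothetical header: says which repair job a message belongs to and what it carries. */
    public static final class RepairMessageHeader {
        public final UUID jobId;              // assumed identifier type, for illustration only
        public final RepairMessageType type;

        public RepairMessageHeader(UUID jobId, RepairMessageType type) {
            if (jobId == null || type == null)
                throw new IllegalArgumentException("jobId and type are required");
            this.jobId = jobId;
            this.type = type;
        }
    }

    /** Hypothetical envelope: a header plus optional content (for example a Merkle tree). */
    public static final class RepairMessage<T> {
        public final RepairMessageHeader header;
        public final T content;               // null when the message type carries no content

        public RepairMessage(RepairMessageHeader header, T content) {
            if (header == null)
                throw new IllegalArgumentException("header is required");
            this.header = header;
            this.content = content;
        }

        public boolean hasContent() {
            return content != null;
        }
    }
}
```

In this sketch a VALIDATION_REQUEST would be built with null content, while a 
VALIDATION_COMPLETE would carry the calculated tree as its content.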

*VALIDATION_REQUEST*

VALIDATION_REQUEST is sent from the repair initiator (coordinator) to request a 
Merkle tree.

*VALIDATION_COMPLETE*/*VALIDATION_FAILED*

The calculated Merkle tree is sent back in a VALIDATION_COMPLETE message. A 
VALIDATION_FAILED message is sent instead when something goes wrong on the 
remote node.

*SYNC_REQUEST*

SYNC_REQUEST is sent when we have to repair two remote nodes. This corresponds 
to the forwarded StreamingRepairTask we have today.

*SYNC_COMPLETE*/*SYNC_FAILED*

When there is no need to exchange data, or when data had to be exchanged and 
streaming has completed, the node (this includes the node that received 
SYNC_REQUEST) sends back SYNC_COMPLETE. If streaming fails, it sends back 
SYNC_FAILED.
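
The point of the *_COMPLETE / *_FAILED pairs is that the initiator always gets 
an answer. A rough sketch of how the initiator side could use the replies 
follows. PhaseTracker and Outcome are hypothetical names that do not exist on 
the branch, and the real handling in RepairMessageVerbHandler may look quite 
different.

```java
public class RepairReplySketch {
    public enum RepairMessageType {
        VALIDATION_REQUEST, VALIDATION_COMPLETE, VALIDATION_FAILED,
        SYNC_REQUEST, SYNC_COMPLETE, SYNC_FAILED
    }

    public enum Outcome { PENDING, SUCCEEDED, FAILED }

    /** Hypothetical tracker for one phase (validation or sync) of one repair job. */
    public static final class PhaseTracker {
        private int outstanding;                  // replies still expected
        private Outcome outcome = Outcome.PENDING;

        public PhaseTracker(int expectedReplies) {
            if (expectedReplies <= 0)
                throw new IllegalArgumentException("expectedReplies must be positive");
            this.outstanding = expectedReplies;
        }

        /** Records one reply and returns the phase outcome so far. */
        public synchronized Outcome onReply(RepairMessageType type) {
            if (outcome != Outcome.PENDING)
                return outcome;                   // already decided, ignore late replies
            switch (type) {
                case VALIDATION_FAILED:
                case SYNC_FAILED:
                    outcome = Outcome.FAILED;     // one failure fails the phase right away
                    break;
                case VALIDATION_COMPLETE:
                case SYNC_COMPLETE:
                    if (--outstanding == 0)
                        outcome = Outcome.SUCCEEDED;
                    break;
                default:
                    throw new IllegalArgumentException("not a reply type: " + type);
            }
            return outcome;
        }
    }
}
```

With a tracker like this, a phase that expects two replies stays PENDING after 
the first *_COMPLETE, becomes SUCCEEDED after the second, and becomes FAILED as 
soon as any *_FAILED arrives, instead of waiting forever. It does nothing for 
a reply that is never delivered at all.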

The whole repair process depends on asynchronous message exchange through 
MessagingService, so there is still a chance of a hang when a node fails to 
deliver a message (see CASSANDRA-5393).

Any feedback is appreciated.
                
> Redesign repair messages
> ------------------------
>
>                 Key: CASSANDRA-5426
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5426
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Yuki Morishita
>            Assignee: Yuki Morishita
>            Priority: Minor
>             Fix For: 2.0
>
>
> Many people have been reporting 'repair hang' when something goes wrong.
> Two major causes of hangs are 1) validation failure and 2) streaming failure.
> Currently, when those failures happen, the failed node does not respond back 
> to the repair initiator.
> The goal of this ticket is to redesign message flows around repair so that 
> repair never hangs.
