[ 
https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054748#comment-17054748
 ] 

ZhaoYang commented on CASSANDRA-15566:
--------------------------------------

Thanks for the update.

I don't have any concrete implementation details in mind yet; the C* 4.0 code 
is quite new to me.

Based on my understanding of 3.x, the main reasons for repair hanging are:
 # Request/response messages are dropped if they exceed the expiration time, 
which is 10s.
 # Internode connections are closed, and all queued messages are cleared, due 
to network or gossip status changes.
 # A participant crashed.
 # A failure response was not sent to the coordinator in 
{{RepairMessageVerbHandler.doVerb()}} in case of an unknown exception. Currently 
it only handles dropped tables.
 # A participant is indeed making progress, but very slowly during validation 
because of disk IO throttling.

For problems #1-2, I am thinking of making repair messages idempotent, with the 
sender periodically resending a message until it gets a reply.
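
A minimal sketch of that idea, not actual Cassandra code: every name here 
({{IdempotentRetrySketch}}, {{Participant}}, {{Transport}}, {{sendUntilReplied}}) 
is hypothetical. The sender reuses one message id across resends, and the 
participant remembers ids it has already handled, so a resend returns the 
cached reply instead of redoing the work.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these types exist in Cassandra.
final class IdempotentRetrySketch {

    /** Participant side: caches the reply per message id so resends are harmless. */
    static final class Participant {
        private final Map<UUID, String> replies = new ConcurrentHashMap<>();
        final AtomicInteger executions = new AtomicInteger();

        String handle(UUID messageId, String payload) {
            // The work runs only the first time a given id is seen.
            return replies.computeIfAbsent(messageId, id -> {
                executions.incrementAndGet();
                return "ACK:" + payload;
            });
        }
    }

    /** Assumed transport: returns null when the message or its reply was lost. */
    interface Transport {
        String send(UUID messageId, String payload);
    }

    /** Resends the same message (same id) until a reply arrives or attempts run out. */
    static String sendUntilReplied(Transport transport, String payload,
                                   int maxAttempts, long retryDelayMillis) {
        UUID messageId = UUID.randomUUID();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String reply = transport.send(messageId, payload);
            if (reply != null)
                return reply;
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(retryDelayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to resend", e);
                }
            }
        }
        throw new IllegalStateException("no reply after " + maxAttempts + " attempts");
    }
}
```

Bounding the number of attempts is an assumption of the sketch; it lets the 
sender report a failure instead of hanging when the participant never answers.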

For problem #3, make sure the repair manager responds to endpoint status 
changes (e.g. up/down/remove, etc.) if it doesn't do so already.
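
Roughly what I mean, as a sketch only: {{EndpointFailureSketch}}, 
{{awaitReplyFrom}} and {{onDown}} are made-up names, and how {{onDown}} would be 
wired to the real failure detector or gossip callbacks is left open. The point 
is that every reply the coordinator is waiting on is tracked per endpoint, so 
an endpoint going down fails those waits immediately.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not the real repair manager.
final class EndpointFailureSketch {
    // Replies the coordinator is still waiting for, grouped by participant endpoint.
    private final Map<String, Set<CompletableFuture<String>>> pending = new ConcurrentHashMap<>();

    /** Registers a pending reply from the given endpoint. */
    CompletableFuture<String> awaitReplyFrom(String endpoint) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.computeIfAbsent(endpoint, e -> ConcurrentHashMap.newKeySet()).add(future);
        return future;
    }

    /** Assumed to be invoked when the endpoint is reported down or removed. */
    void onDown(String endpoint) {
        Set<CompletableFuture<String>> futures = pending.remove(endpoint);
        if (futures == null)
            return;
        for (CompletableFuture<String> future : futures)
            future.completeExceptionally(new IllegalStateException(endpoint + " is down"));
    }
}
```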

For problem #4, make sure all exceptions are caught and answered with a failure 
response. We need to add some failure injection to the dtests.
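
The shape I have in mind, as a sketch: this is not the real 
{{RepairMessageVerbHandler}}, and {{FailureReplySketch}}, {{Handler}} and 
{{Replier}} are hypothetical names. Whatever the handler throws, the 
coordinator still gets an explicit failure reply rather than silence.

```java
// Hypothetical sketch; the real doVerb() has a different signature.
final class FailureReplySketch {

    /** Assumed per-message repair work; may throw anything. */
    interface Handler {
        void handle(String message) throws Exception;
    }

    /** Assumed way to answer the coordinator. */
    interface Replier {
        void reply(boolean success, String detail);
    }

    static void doVerb(String message, Handler handler, Replier replier) {
        try {
            handler.handle(message);
        } catch (Throwable t) {
            // Always tell the coordinator, even for exceptions nobody anticipated.
            replier.reply(false, t.getClass().getSimpleName() + ": " + t.getMessage());
            if (t instanceof Error)
                throw (Error) t; // do not swallow JVM-level errors
            return;
        }
        replier.reply(true, "ok");
    }
}
```

A failure-injection test can then pass a {{Handler}} that throws and assert that 
a failure reply was sent.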

For problem #5, as you suggested in CASSANDRA-15399, the coordinator should be 
able to check a participant's in-memory virtual table to determine whether it 
is making progress.
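
A sketch of the check on the coordinator side, assuming the virtual table 
exposes some monotonically increasing progress value (for example, bytes 
validated); that assumption, the class name {{ProgressWatchSketch}} and the 
stall timeout are all hypothetical. A participant is flagged only when the 
polled value has not advanced for the whole timeout, so a slow but moving 
validation is not treated as hung.

```java
// Hypothetical sketch; the polled value and timeout are assumptions.
final class ProgressWatchSketch {
    private final long stallTimeoutMillis;
    private long lastProgress = -1;
    private long lastChangeAtMillis;

    ProgressWatchSketch(long stallTimeoutMillis, long nowMillis) {
        this.stallTimeoutMillis = stallTimeoutMillis;
        this.lastChangeAtMillis = nowMillis;
    }

    /**
     * Records one polled progress value.
     * Returns true if progress has not advanced for at least the stall timeout.
     */
    boolean observe(long progress, long nowMillis) {
        if (progress > lastProgress) {
            lastProgress = progress;
            lastChangeAtMillis = nowMillis;
            return false;
        }
        return nowMillis - lastChangeAtMillis >= stallTimeoutMillis;
    }
}
```

The timeout is a trade-off: a shorter one detects hangs sooner but produces 
more false positives on heavily throttled nodes.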

To make repair great again, I think it's important to be able to identify hung 
repairs automatically (even with some false positives) and abort them via 
nodetool, because I don't expect repair operations to be run manually by 
operators. In production, repair should be managed by an automation tool, such 
as a repair service or Reaper, which will abort and retry a hung repair. 
This can probably be done in CASSANDRA-15399 or a separate ticket.

> Repair coordinator can hang under some cases
> --------------------------------------------
>
>                 Key: CASSANDRA-15566
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> Repair coordination makes a few assumptions about message delivery which 
> cause it to hang forever when those assumptions don’t hold true: fire and 
> forget will not get rejected (participant has an issue and rejects the 
> message), and a very delayed message will one day be seen (messaging can be 
> dropped under load or when failure detector thinks a node is bad but is just 
> GCing).
> Given this and the desire to have better observability with repair (see 
> CASSANDRA-15399), coordination should be changed into a request/response 
> pattern (with retries) and polling (validation status and MerkleTree 
> sending).  This would allow the coordinator to detect changes in state (it 
> was known participant was working on validation, but it no longer knows about 
> the validation task), and to be able to recover from ephemeral issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

