[ https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Capwell updated CASSANDRA-15566:
--------------------------------------
    Change Category: Operability
         Complexity: Normal
             Status: Open  (was: Triage Needed)

> Repair coordinator can hang under some cases
> --------------------------------------------
>
>                 Key: CASSANDRA-15566
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> Repair coordination makes a few assumptions about message delivery which
> cause it to hang forever when those assumptions don’t hold true: that a
> fire-and-forget message will not be rejected (a participant has an issue and
> rejects the message), and that a very delayed message will eventually be
> seen (messages can be dropped under load, or when the failure detector
> thinks a node is bad but it is just GCing).
> Given this and the desire to have better observability with repair (see
> CASSANDRA-15399), coordination should be changed into a request/response
> pattern (with retries) and polling (validation status and MerkleTree
> sending). This would allow the coordinator to detect changes in state (it
> was known that a participant was working on validation, but the participant
> no longer knows about the validation task), and to recover from ephemeral
> issues.
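
The request/response-with-retries and polling pattern proposed in the description can be illustrated with a minimal Java sketch. This is not Cassandra's repair code: the class `RepairCoordinationSketch`, the `ValidationState` enum, `NoResponseException`, and both methods are hypothetical names chosen for illustration. An empty `Optional` stands in for a dropped or timed-out message, and the sketch omits the waiting or backoff between attempts that a real coordinator would presumably need.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Cassandra code base.
final class RepairCoordinationSketch {

    /** Assumed states a participant might report when asked about a validation task. */
    enum ValidationState { RUNNING, COMPLETED, UNKNOWN }

    /** Raised when every attempt went unanswered, so the coordinator fails instead of hanging. */
    static class NoResponseException extends RuntimeException {
        NoResponseException(String message) { super(message); }
    }

    /**
     * Request/response with retries: resend until a response arrives or the
     * attempt budget is spent. An empty Optional models a lost message.
     */
    static <T> T requestWithRetries(Supplier<Optional<T>> request, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> response = request.get();
            if (response.isPresent()) {
                return response.get();
            }
        }
        throw new NoResponseException("no response after " + maxAttempts + " attempts");
    }

    /**
     * Polling: ask for the validation status repeatedly. Returns as soon as the
     * participant reports something other than RUNNING; UNKNOWN means the
     * participant no longer knows about the task. Returns RUNNING if the poll
     * budget runs out first, leaving the decision to the caller.
     */
    static ValidationState pollValidation(Supplier<Optional<ValidationState>> statusRequest,
                                          int maxPolls, int attemptsPerPoll) {
        ValidationState state = ValidationState.RUNNING;
        for (int poll = 0; poll < maxPolls && state == ValidationState.RUNNING; poll++) {
            state = requestWithRetries(statusRequest, attemptsPerPoll);
        }
        return state;
    }

    public static void main(String[] args) {
        // Simulated participant: one dropped message, one RUNNING reply, then
        // the participant has lost track of the validation task.
        List<Optional<ValidationState>> scripted = List.of(
                Optional.<ValidationState>empty(),
                Optional.of(ValidationState.RUNNING),
                Optional.of(ValidationState.UNKNOWN));
        Iterator<Optional<ValidationState>> replies = scripted.iterator();
        System.out.println(pollValidation(replies::next, 5, 3));
    }
}
```

In this sketch a lost message costs one retry rather than an indefinite wait, and an `UNKNOWN` status gives the coordinator an explicit signal that it can act on, for example by failing the repair session or resubmitting the validation request.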