[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases
[ https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519273#comment-17519273 ]

David Capwell commented on CASSANDRA-15566:
---

Now that the vtables are in, progress can be made here. I have a few things to look into outside of repair, but do plan to come back to work on this; if anyone else has cycles, do feel free to pick up whatever you can (there are many edge cases).

> Repair coordinator can hang under some cases
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Repair
> Reporter: David Capwell
> Priority: Normal
> Fix For: 4.x
>
> Repair coordination makes a few assumptions about message delivery which
> cause it to hang forever when those assumptions don’t hold true: that a fire
> and forget message will not get rejected (the participant has an issue and
> rejects the message), and that a very delayed message will one day be seen
> (messages can be dropped under load or when the failure detector thinks a
> node is bad but it is just GCing).
> Given this and the desire to have better observability with repair (see
> CASSANDRA-15399), coordination should be changed into a request/response
> pattern (with retries) and polling (validation status and MerkleTree
> sending). This would allow the coordinator to detect changes in state (it
> was known that the participant was working on validation, but the participant
> no longer knows about the validation task), and to be able to recover from
> ephemeral issues.

--
This message was sent by Atlassian Jira (v8.20.1#820001)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
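The polling half of the request/response proposal in the issue description above can be illustrated with a minimal sketch. This is not Cassandra code: `ValidationPoller`, its `Status` and `Outcome` enums, and the `poll` method are hypothetical names invented for the example, and an empty `Optional` stands in for a dropped request or response. A real poller would also wait between polls; that is left out to keep the sketch short.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch only; none of these names exist in Cassandra.
final class ValidationPoller {
    /** States a participant might report for a validation task (assumed). */
    enum Status { RUNNING, COMPLETED, FAILED, UNKNOWN }

    /** What the coordinator concludes after polling. */
    enum Outcome { SUCCESS, FAILURE, TIMED_OUT }

    /**
     * Polls the participant up to maxPolls times.
     * An empty Optional models a lost request or response and is simply retried.
     */
    static Outcome poll(Supplier<Optional<Status>> participant, int maxPolls) {
        boolean seenRunning = false;
        for (int i = 0; i < maxPolls; i++) {
            Optional<Status> reply = participant.get();
            if (!reply.isPresent()) {
                continue; // message lost: retry on the next poll
            }
            switch (reply.get()) {
                case COMPLETED:
                    return Outcome.SUCCESS;
                case FAILED:
                    return Outcome.FAILURE;
                case RUNNING:
                    seenRunning = true;
                    break;
                case UNKNOWN:
                    // The participant was known to be validating but no longer
                    // knows the task (for example after a restart): fail fast
                    // instead of waiting forever.
                    if (seenRunning) {
                        return Outcome.FAILURE;
                    }
                    break; // not started yet: keep polling
            }
        }
        return Outcome.TIMED_OUT; // bounded wait, so the coordinator never hangs
    }
}
```

The point of the sketch is that every path returns: a lost message costs one poll, a participant that forgets the task is reported as a failure, and the poll budget bounds the total wait.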
[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases
[ https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105683#comment-17105683 ]

David Capwell commented on CASSANDRA-15566:
---

Moved out of 4.0 since these are not regressions but improvements, and these improvements can be done in a minor release. A plan for how to fix and validate these changes should be fleshed out during the 4.0 timeline, but the changes themselves should not be made in the 4.0 timeline.

> Repair coordinator can hang under some cases
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Repair
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Fix For: 4.x
[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases
[ https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055063#comment-17055063 ]

David Capwell commented on CASSANDRA-15566:
---

bq. C* 4.0 code is quite new to me...

Me too :) One of the best ways to start is testing; we need more tests to show where repair needs improvement. When I joined this project I asked operators about their top pain points with repair (all were from 2.1), and as I write tests I see 4.0 has the same issues. More tests which show new areas would be great!

I think your 5 classifications are good, though 1 and 2 can be merged; our networking is lossy (not a bad thing: under load the choice is to crash or to drop).

I would love a smoke test which runs user/operator tasks constantly under “load” (we should be able to artificially lower resources). This test would also help show whether the different subsystems work well or need improvement.

About the participant crashing, I added a jvm dtest which shows this is handled, assuming the failure detector detects it (restarting the node also fails the repair).

About detection and abort, I agree it should be external for now. Anything the external tools need must be identified and tested to show it works (for example, does aborting repair work?).

> Repair coordinator can hang under some cases
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Repair
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Fix For: 4.0-beta
[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases
[ https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054748#comment-17054748 ]

ZhaoYang commented on CASSANDRA-15566:
---

Thanks for the update. I don't have any concrete implementation details in mind yet; the C* 4.0 code is quite new to me.

Based on my understanding of 3.x, the main reasons for repair hanging are:
# request/response messages are dropped if they exceed the expiration time, which is 10s.
# internode connections are closed, and all queued messages are cleared, due to network or gossip status changes.
# the participant crashed.
# a failure response was not sent to the coordinator in {{RepairMessageVerbHandler.doVerb()}} in case of an unknown exception; currently it only handles dropped tables.
# the participant is indeed making progress, but very slowly during validation because of disk IO throttling.

For problems #1-2, I am thinking of making repair messages idempotent, with the sender periodically resending a message until it gets a reply.
For problem #3, make sure the repair manager responds to endpoint status changes (e.g. up/down/remove) if it doesn't do so already.
For problem #4, make sure all exceptions are caught and answered with a failure response. We need to add some failure injection to the dtests.
For problem #5, as you suggested in CASSANDRA-15399, the coordinator should be able to check the participants' in-memory virtual table to determine whether they are making progress.

In order to make repair great again, I think it's important to be able to identify hung repairs automatically (even with some false positives) and abort them via nodetool, because I don't expect repair operations to be run by operators manually. In production, repair should be managed by an automation tool, like a repair service or Reaper, which will abort and retry hung repairs. This can probably be done in CASSANDRA-15399 or a separate ticket.

> Repair coordinator can hang under some cases
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Repair
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Fix For: 4.0-beta
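The ideas proposed above for problems #1-2 (idempotent messages resent until a reply arrives) and problem #4 (always answer, even on an unexpected exception) might look roughly like the following. This is a hedged, single-threaded sketch and not Cassandra's messaging code: `IdempotentRepairMessaging`, `Participant`, `Reply` and `sendUntilReply` are hypothetical names, the channel is modelled as a plain function, and an empty `Optional` represents a dropped message.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.function.Function;

// Hypothetical sketch only; none of these names exist in Cassandra.
final class IdempotentRepairMessaging {
    enum Reply { SUCCESS, FAILURE }

    /** Participant that remembers its reply per message id, so resends are harmless. */
    static final class Participant {
        private final Map<UUID, Reply> replies = new HashMap<>();
        private final Runnable work;
        int executions = 0; // how many times the work actually ran

        Participant(Runnable work) {
            this.work = work;
        }

        Reply handle(UUID messageId) {
            // A duplicate id returns the cached reply without re-running the work.
            return replies.computeIfAbsent(messageId, id -> {
                executions++;
                try {
                    work.run();
                    return Reply.SUCCESS;
                } catch (RuntimeException e) {
                    // Always answer, even on an unexpected error, so the
                    // sender is not left waiting for a reply.
                    return Reply.FAILURE;
                }
            });
        }
    }

    /**
     * Resends the same message id until the channel yields a reply or the
     * attempt budget runs out. An empty Optional models a dropped message.
     */
    static Reply sendUntilReply(Function<UUID, Optional<Reply>> channel, UUID messageId, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Optional<Reply> reply = channel.apply(messageId);
            if (reply.isPresent()) {
                return reply.get();
            }
        }
        throw new IllegalStateException("no reply after " + maxAttempts + " attempts");
    }
}
```

Keying the cached reply on the message id is what makes resending safe: if only the response was dropped, the participant sees the same request again but performs the work once. A real implementation would also need to expire cached replies and wait between resends, both omitted here.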
[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases
[ https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053776#comment-17053776 ]

David Capwell commented on CASSANDRA-15566:
---

Thanks! I have and have not started! There are several things going on, so this task really should be split into smaller tasks (kinda like this is an epic!).

I am currently trying to finish CASSANDRA-15564, which fixes some of the issues but doesn't fix anything that fails because of networking; that JIRA also adds an @Ignored test which fails 100% of the time (which is what triggered this JIRA).

I have a POC patch for CASSANDRA-15399 which exposes the state across the coordinator and participants.

I am working on longevity testing (without explicitly causing failures) with a focus on repair; this is to help boost my confidence in any structural changes made.

That's the current state of this all! A few things have already been found to be problematic, so the next steps are mostly 1) testing and 2) coming up with a proposal for how to fix them.

I have not fleshed out #2, so if you want to take a stab at fleshing out a solution, that would be great! Without that we can't really split this JIRA.

> Repair coordinator can hang under some cases
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Repair
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Fix For: 4.0-beta
[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases
[ https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053336#comment-17053336 ]

ZhaoYang commented on CASSANDRA-15566:
---

Hi, [~dcapwell]

Have you started working on this ticket? I am happy to help.

> Repair coordinator can hang under some cases
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Repair
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Fix For: 4.0-beta