Push vs pull isn’t too critical, but there is one edge case to consider: if the participant got restarted and we don’t account for that, triggering validation again (which may be what caused the process to end) could be a problem.
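To make that edge case concrete, here is a rough sketch in Python. Every name in it (Participant, State, retry_step) is hypothetical and only for illustration; none of it is Cassandra's actual code. It assumes the participant keeps its tracked state in memory, so a restart makes a previously seen session look unknown, and the coordinator checks the state before deciding whether to resend.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    UNKNOWN = "unknown"      # participant has no record of this session
    RUNNING = "running"
    COMPLETED = "completed"


@dataclass
class Participant:
    """Toy in-memory participant (hypothetical); tracked state is lost on restart."""
    states: dict = field(default_factory=dict)
    validations_started: int = 0

    def validate(self, session_id: str) -> None:
        # Idempotent only while the state survives: a duplicate is a no-op.
        if session_id not in self.states:
            self.states[session_id] = State.RUNNING
            self.validations_started += 1

    def state_of(self, session_id: str) -> State:
        return self.states.get(session_id, State.UNKNOWN)

    def restart(self) -> None:
        # Simulates a process restart that forgets in-memory repair state.
        self.states.clear()


def retry_step(participant: Participant, session_id: str, sends_so_far: int) -> str:
    """One coordinator retry tick: pull the state, then decide what to do."""
    state = participant.state_of(session_id)
    if state is not State.UNKNOWN:
        return "wait"   # request was seen; validation is running or done
    if sends_so_far > 0:
        return "fail"   # sent before but unknown now: participant likely restarted
    participant.validate(session_id)
    return "sent"
```

With a blind resend, the retry after a restart would call validate() again and start a second validation. With the state check, the coordinator can fail the session instead of re-triggering the work that may have killed the process.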

> On Aug 26, 2021, at 9:50 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>
>> 2. Add retries to specific stages of coordination, such as prepare and
>> validate. In order to do these retries we first need to know what the
>> state is for the participant which has yet to reply...
>
> If I understand it correctly, does it mean retries only happen in the
> coordinator and the coordinator pulls the states of the participants
> periodically?
> If the handling of the requests in the participant is made to be idempotent
> (which I think is required for retry anyway), pulling the state is
> unnecessary. For example, the coordinator can just send the PrepareRequest
> at regular intervals until it receives the PrepareResponse.
>
> - Yifan
>
> On Thu, Aug 26, 2021 at 8:56 AM Blake Eggleston
> <beggles...@apple.com.invalid> wrote:
>
>> +1 from me, any improvement in this area would be great.
>>
>> It would be nice if this could include visibility into repair streams, but
>> just exposing the repair state will be a big improvement.
>>
>>> On Aug 25, 2021, at 5:46 PM, David Capwell <dcapw...@gmail.com> wrote:
>>>
>>> Now that 4.0 is out, I want to bring up improving repair again (earlier
>>> thread
>>> http://mail-archives.apache.org/mod_mbox/cassandra-commits/201911.mbox/%3cjira.13266448.1572997299000.99567.1572997440...@atlassian.jira%3E
>>> ), specifically the following two JIRAs:
>>>
>>> CASSANDRA-15566 - Repair coordinator can hang under some cases
>>>
>>> CASSANDRA-15399 - Add ability to track state in repair
>>>
>>> Right now repair has an issue if any message is lost, which leads to hung
>>> or timed out repairs; in addition there is a large lack of visibility into
>>> what is going on, and can be even harder if you wish to join coordinator
>>> with participant state.
>>>
>>> I propose the following changes to improve our current repair subsystem:
>>>
>>> 1. New tracking system for coordinator and participants (covered by
>>>    CASSANDRA-15399). This system will expose progress on each instance and
>>>    expose this information for internal access as well as external users
>>> 2. Add retries to specific stages of coordination, such as prepare and
>>>    validate. In order to do these retries we first need to know what the
>>>    state is for the participant which has yet to reply, this will leverage
>>>    CASSANDRA-15399 to see what's going on (has the prepare been seen? Is
>>>    validation running? Did it complete?). In addition to checking the
>>>    state, we will need to store the validation MerkleTree, this allows for
>>>    coordinator to fetch if goes missing (can be dropped in route to
>>>    coordinator or even on the coordinator).
>>>
>>> What is not in scope?
>>>
>>> - Rewriting all of Repair; the idea is specific "small" changes can fix
>>>   80% of the issues
>>> - Handle coordinator node failure. Being able to recover from a failed
>>>   coordinator should be possible after the above work is done, so is seen as
>>>   tangental and can be done later
>>> - Recovery from a downed participant. Similar to the previous bullet,
>>>   with the state being tracked this acts as a kind of checkpoint, so future
>>>   work can come in to handle recovery
>>> - Handling "too large" range. Ideally we should add an ability to split
>>>   the coordination into sub repairs, but this is not the goal of this work.
>>> - Overstreaming. This is a byproduct of the previous "not in scope"
>>>   bullet, and/or large partitions; so is tangental to this work
>>>
>>> Wanted to share here before starting this work again; let me know if there
>>> are any concerns or feedback!
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org