From Jose -

> 1. In the schema for VoteRequest and VoteResponse, you are using
> "boolean" as the type keyword. The correct keyword should be "bool"
> instead.
>
Thanks!

> 2. In the states and state transaction table you have the following entry:
> > * Candidate transitions to:
> >   * ...
> >   * Prospective: After expiration of the election timeout
>
> Can you explain the reason a candidate would transition back to
> prospective? If a voter transitions to the candidate state it is
> because the voters don't support KIP-996 or the replica was able to
> win the majority of the votes at some point in the past. Are we
> concerned that the network partition might have occurred after the
> replica has become a candidate? If so, I think we should state this
> explicitly in the KIP.
>
Added this under Proposed Changes

Also, if a Candidate is unable to be elected (transition to Leader) before
its election timeout expires, it will transition back to Prospective. This
will handle the case if a network partition occurs while the server is in
Candidate state and prevent unnecessary loss of leadership.

> 3. In the proposed section and state transition section, I think it
> would be helpful to explicitly state that we have an invariant that
> only the prospective state can transition to the candidate state. This
> transition to the candidate state from the prospective state can only
> happen because the replica won the majority of the votes or there is
> at least one remote voter that doesn't support pre-vote.
>
Added this under Proposed Changes

A follower will now transition to Prospective instead of Candidate when its
fetch timeout expires. Servers will only be able to transition to Candidate
state from the Prospective state.

> 4. I am a bit confused by this paragraph
> > A candidate will now send a VoteRequest with the PreVote field set to
> true and CandidateEpoch set to its [epoch + 1] when its election timeout
> expires.
> If [majority - 1] of VoteResponse grant the vote, the candidate
> will then bump its epoch up and send a VoteRequest with PreVote set to
> false which is our standard vote that will cause state changes for servers
> receiving the request.
>
> I am assuming that "candidate" refers to the states enumerated on the
> table above this quote. If so, I think you mean "prospective" for the
> first candidate.
>
> CandidateEpoch should be ReplicaEpoch.
>
> [epoch + 1] should just be epoch. I thought we agreed that replicas
> will always send their current epoch to the remote replicas.
>
Thanks, Luke also pointed out that I missed modifying this section. It
should read correctly now.

> 5. I am a bit confused by this bullet section
> > true if the server receives less than [majority] VoteResponse with
> VoteGranted set to false within [election.timeout.ms + a little
> randomness] and the first bullet point does not apply
> Explanation for why we don't send a standard vote at this point
> is explained in rejected alternatives.
>
> Can we explain this case in plain english? I assume that this case is
> trying to cover the scenario where the election timer expired but the
> prospective candidate hasn't received enough votes (granted or
> rejected) to make a decision if it could win an election.
>
Yes, thanks for the better wording! Modified to the following

- true if the server does not receive enough votes (granted or rejected)
  within [election.timeout.ms + a little randomness]

> 6.
> > Yes. If a leader is unable to receive fetch responses from a majority of
> servers, it can impede followers that are able to communicate with it from
> voting in an eligible leader that can communicate with a majority of the
> cluster.
>
> In general, leaders don't receive fetch responses. They receive FETCH
> requests.
> Did you mean "if a leader is able to send FETCH responses to
> the majority - 1 of the voters, it can impede fetching voters
> (followers) from granting their vote to prospective candidates. This
> should stop prospective candidates from getting enough votes to
> transition to the candidate state and increase their epoch".
>
> 7.
> > Check Quorum ensures a leader steps down if it is unable to receive
> fetch responses from a majority of servers.
>
> I think you mean "... if it is unable to receive FETCH requests from
> the majority - 1 of the voters".
>
Yes, thanks for this catch! The section now reads as

Yes. If a leader is unable to send FETCH responses to [majority - 1] of
servers, it can impede its connected followers from granting their vote to
prospectives which *can* communicate with a majority of the cluster. This is
the reason why an additional "Check Quorum" safeguard is needed which is
what KAFKA-15489 <https://github.com/apache/kafka/pull/14428> implements.
Check Quorum ensures a leader steps down if it is unable to receive FETCH
requests from a majority of servers. This will allow all servers to grant
their votes to eligible prospectives.

> 8. At the end of the Proposed changes section you have the following:
> > The logic now looks like the following for servers receiving
> VoteRequests with PreVote set to true:
> >
> > When servers receive VoteRequests with the PreVote field set to true,
> they will respond with VoteGranted set to
> >
> > * true if they are not a Follower and the epoch and offsets in the
> Pre-Vote request satisfy the same requirements as a standard vote
> > * false if they are a Follower or the epoch and end offsets in the
> Pre-Vote request do not satisfy the requirements
>
> This seems to duplicate the same algorithm that was stated earlier in
> the section.
>
Thanks, I forgot to remove this after incorporating the "follower"
requirement in the beginning of the section

> 9.
> I don't understand this rejected idea: Sending Standard Votes after
> failure to win Pre-Vote
>
> In your example in the "Disruptive server scenarios" voters 4 and 5
> are partitioned from the majority of the voters. We don't want voters
> 4 and 5 increasing their epoch and transitioning to the candidate
> state else they would disrupt the quorum established by voters 1, 2
> and 3.
>
Yes, this is basically the scenario where a Prospective does not receive
enough votes (granted or rejected) within election timeout. I'll just remove
this as a rejected alternative since it seems pretty obvious why we wouldn't
want to do this.

On Tue, Dec 5, 2023 at 11:14 AM Alyssa Huang <ahu...@confluent.io> wrote:

> From Jun -
>
>> 10. "If a server happens to receive multiple VoteResponses from another
>> server for a particular VoteRequest, it can take the first and ignore the
>> rest.": Could you explain why a server would receive multiple responses
>> for the same request?
>>
> This was meant to be a coverall for network flakiness and weirdness, it
> wouldn't be expected in the general case.
>
>> 11. "e.g. S1 in the below diagram pg. 41)": What is pg. 41?
>
> Of the Raft paper, I've made the language more clear now
>
> (e.g. S1 in the below diagram, pg. 41 of Raft paper
> <https://purl.stanford.edu/qr033xr6097>)
>
>> 12. "if a server attempts to send out a Pre-Vote request while any other
>> server in the quorum does not understand it, it will get back an
>> UnsupportedVersionException from the network client and knows to default
>> back to the old behavior."
>>
>> 12.1 Based on ApiVersion, a server knows whether a peer supports PreVote or
>> not. If it doesn't, there is no need for the server to send a PreVote
>> request only to be rejected, right?
>
> Correct, the server won't actually send the PreVote request, its network
> client will skip/abort the request when `latestUsableVersion` throws an
> UnsupportedVersionException because the peer does not support PreVote.
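
To make the 12.1 answer concrete, here is a minimal sketch of the skip/abort
behavior, written in Python for brevity rather than Kafka's Java. All names in
it (`PeerApiVersions`, `latest_usable_version`, `UnsupportedVersion`,
`PRE_VOTE_MIN_VERSION`, `try_send_pre_vote`) are hypothetical stand-ins and not
the real network client API. It assumes PreVote is gated behind a minimum
VoteRequest version that each peer advertises through ApiVersions.

```python
from dataclasses import dataclass

# Hypothetical: the first VoteRequest version that carries the PreVote field.
PRE_VOTE_MIN_VERSION = 1


class UnsupportedVersion(Exception):
    """Stand-in for UnsupportedVersionException."""


@dataclass(frozen=True)
class PeerApiVersions:
    """Version range a peer advertises for the Vote API (assumed shape)."""
    min_version: int
    max_version: int


def latest_usable_version(peer, required_min, our_max):
    """Pick the highest version both sides support, or raise if it is too low."""
    usable = min(peer.max_version, our_max)
    if usable < max(peer.min_version, required_min):
        raise UnsupportedVersion(
            f"peer supports up to v{peer.max_version}, need v{required_min}"
        )
    return usable


def try_send_pre_vote(peer, send, our_max=1):
    """Send a PreVote only if the peer understands it; return True if sent."""
    try:
        version = latest_usable_version(peer, PRE_VOTE_MIN_VERSION, our_max)
    except UnsupportedVersion:
        # The request is dropped locally; nothing goes on the wire.
        return False
    send(version)
    return True
```

The point of the sketch is only that the version check happens on the sending
side, so an older peer never sees a request it cannot parse.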
>
>> 12.2 What happens when some servers understand PreVote while some others
>> don't?
>>
> We would default to the original standard vote behavior. I can be more
> explicit about this in the Compatibility section (modified section pasted
> below)
>
> We currently use ApiVersions to gate new/newer versions of Raft APIs from
> being used before all servers can support it. This is useful in the upgrade
> scenario for Pre-Vote - if a server attempts to send out a Pre-Vote request
> while any other server in the quorum does not understand it, it will get
> back an UnsupportedVersionException from the network client and knows to
> default back to the old behavior. Specifically, the server will transition
> from Prospective immediately to Candidate state, and will send standard
> votes instead which can be understood by servers on older software
> versions.
>
> Let's take a look at an edge case. As the network client will only check
> the supported version of the peer that we are intending to send a request
> to, we can imagine a scenario where a server first sends PreVotes to peers
> which understand PreVote, and then attempts to send PreVote to a peer which
> does not. If the server receives and processes a majority of granted
> PreVote responses prior to hitting the UnsupportedVersionException, it can
> transition to Candidate phase. Otherwise, it will also transition to
> Candidate phase once it hits the exception, and send standard vote requests
> to all servers. Any PreVote responses received while in Candidate phase
> would be ignored.
>
>
> On Tue, Dec 5, 2023 at 10:10 AM Alyssa Huang <ahu...@confluent.io> wrote:
>
>> Hey folks, thanks for the reviews!
>> Addressing them one by one. From Luke -
>>
>> Some comments:
>>> 1. Follower transitions to: Prospective: After expiration of the election
>>> timeout
>>> -> Is this the fetch timeout, not election timeout?
>>>
>> Yes, thanks for this catch!
>>
>>
>>> 2. I also agree we don't bump the epoch in prospective state.
>>> A candidate will now send a VoteRequest with the PreVote field set to true
>>> and CandidateEpoch set to its [epoch + 1] when its election timeout
>>> expires.
>>> -> What is "CandidateEpoch"? And I thought you've agreed to not set
>>> [epoch + 1] ?
>>>
>> Forgot to update this section, it now reads
>>
>> A follower will now transition to Prospective when its fetch timeout
>> expires. The Prospective server will send a VoteRequest with the PreVote
>> field set to true and ReplicaEpoch set to its current, unbumped epoch.
>> If [majority - 1] of VoteResponse grant the vote, the server will
>> transition to Candidate and will then bump its epoch up and send a
>> VoteRequest with PreVote set to false (which is the original behavior).
>>
>>
>> On Wed, Nov 29, 2023 at 4:53 PM José Armando García Sancio
>> <jsan...@confluent.io.invalid> wrote:
>>
>>> Hi Alyssa,
>>>
>>> 1. In the schema for VoteRequest and VoteResponse, you are using
>>> "boolean" as the type keyword. The correct keyword should be "bool"
>>> instead.
>>>
>>> 2. In the states and state transaction table you have the following
>>> entry:
>>> > * Candidate transitions to:
>>> >   * ...
>>> >   * Prospective: After expiration of the election timeout
>>>
>>> Can you explain the reason a candidate would transition back to
>>> prospective? If a voter transitions to the candidate state it is
>>> because the voters don't support KIP-996 or the replica was able to
>>> win the majority of the votes at some point in the past. Are we
>>> concerned that the network partition might have occurred after the
>>> replica has become a candidate? If so, I think we should state this
>>> explicitly in the KIP.
>>>
>>> 3. In the proposed section and state transition section, I think it
>>> would be helpful to explicitly state that we have an invariant that
>>> only the prospective state can transition to the candidate state.
>>> This transition to the candidate state from the prospective state can only
>>> happen because the replica won the majority of the votes or there is
>>> at least one remote voter that doesn't support pre-vote.
>>>
>>> 4. I am a bit confused by this paragraph
>>> > A candidate will now send a VoteRequest with the PreVote field set to
>>> true and CandidateEpoch set to its [epoch + 1] when its election timeout
>>> expires. If [majority - 1] of VoteResponse grant the vote, the candidate
>>> will then bump its epoch up and send a VoteRequest with PreVote set to
>>> false which is our standard vote that will cause state changes for servers
>>> receiving the request.
>>>
>>> I am assuming that "candidate" refers to the states enumerated on the
>>> table above this quote. If so, I think you mean "prospective" for the
>>> first candidate.
>>>
>>> CandidateEpoch should be ReplicaEpoch.
>>>
>>> [epoch + 1] should just be epoch. I thought we agreed that replicas
>>> will always send their current epoch to the remote replicas.
>>>
>>> 5. I am a bit confused by this bullet section
>>> > true if the server receives less than [majority] VoteResponse with
>>> VoteGranted set to false within [election.timeout.ms + a little
>>> randomness] and the first bullet point does not apply
>>> Explanation for why we don't send a standard vote at this point
>>> is explained in rejected alternatives.
>>>
>>> Can we explain this case in plain english? I assume that this case is
>>> trying to cover the scenario where the election timer expired but the
>>> prospective candidate hasn't received enough votes (granted or
>>> rejected) to make a decision if it could win an election.
>>>
>>> 6.
>>> > Yes. If a leader is unable to receive fetch responses from a majority
>>> of servers, it can impede followers that are able to communicate with it
>>> from voting in an eligible leader that can communicate with a majority of
>>> the cluster.
>>>
>>> In general, leaders don't receive fetch responses.
>>> They receive FETCH requests. Did you mean "if a leader is able to send
>>> FETCH responses to the majority - 1 of the voters, it can impede fetching
>>> voters (followers) from granting their vote to prospective candidates. This
>>> should stop prospective candidates from getting enough votes to
>>> transition to the candidate state and increase their epoch".
>>>
>>> 7.
>>> > Check Quorum ensures a leader steps down if it is unable to receive
>>> fetch responses from a majority of servers.
>>>
>>> I think you mean "... if it is unable to receive FETCH requests from
>>> the majority - 1 of the voters".
>>>
>>> 8. At the end of the Proposed changes section you have the following:
>>> > The logic now looks like the following for servers receiving
>>> VoteRequests with PreVote set to true:
>>> >
>>> > When servers receive VoteRequests with the PreVote field set to true,
>>> they will respond with VoteGranted set to
>>> >
>>> > * true if they are not a Follower and the epoch and offsets in the
>>> Pre-Vote request satisfy the same requirements as a standard vote
>>> > * false if they are a Follower or the epoch and end offsets in the
>>> Pre-Vote request do not satisfy the requirements
>>>
>>> This seems to duplicate the same algorithm that was stated earlier in
>>> the section.
>>>
>>> 9. I don't understand this rejected idea: Sending Standard Votes after
>>> failure to win Pre-Vote
>>>
>>> In your example in the "Disruptive server scenarios" voters 4 and 5
>>> are partitioned from the majority of the voters. We don't want voters
>>> 4 and 5 increasing their epoch and transitioning to the candidate
>>> state else they would disrupt the quorum established by voters 1, 2
>>> and 3.
>>>
>>>
>>> Thanks,
>>> --
>>> -José
>>>
>>
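
For reference, the transitions discussed in this thread can be sketched as a
small table. This is Python for illustration only; the `State` and `Event`
names and the `next_state` / `epoch_after` helpers are hypothetical and do not
mirror the actual Raft client code. It covers only the transitions mentioned
above (fetch timeout, Pre-Vote majority, peer without Pre-Vote support,
election timeout in Candidate), not the full state table in the KIP.

```python
from enum import Enum, auto


class State(Enum):
    FOLLOWER = auto()
    PROSPECTIVE = auto()
    CANDIDATE = auto()
    LEADER = auto()


class Event(Enum):
    FETCH_TIMEOUT = auto()           # follower's fetch timeout expired
    PRE_VOTE_MAJORITY = auto()       # [majority - 1] peers granted the PreVote
    PEER_LACKS_PRE_VOTE = auto()     # client raised UnsupportedVersionException
    STANDARD_VOTE_MAJORITY = auto()  # won the standard election
    ELECTION_TIMEOUT = auto()        # election timer expired without winning


# Candidate is reachable only from Prospective (the invariant from item 3).
TRANSITIONS = {
    (State.FOLLOWER, Event.FETCH_TIMEOUT): State.PROSPECTIVE,
    (State.PROSPECTIVE, Event.PRE_VOTE_MAJORITY): State.CANDIDATE,
    (State.PROSPECTIVE, Event.PEER_LACKS_PRE_VOTE): State.CANDIDATE,
    (State.CANDIDATE, Event.STANDARD_VOTE_MAJORITY): State.LEADER,
    (State.CANDIDATE, Event.ELECTION_TIMEOUT): State.PROSPECTIVE,
}


def next_state(state, event):
    """Return the new state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


def epoch_after(old_state, new_state, epoch):
    """The epoch is bumped only on entering Candidate, never in Prospective."""
    entering_candidate = (
        new_state is State.CANDIDATE and old_state is not State.CANDIDATE
    )
    return epoch + 1 if entering_candidate else epoch
```

For example, a Follower at epoch 5 whose fetch timeout expires becomes
Prospective and stays at epoch 5; the bump to 6 happens only once it moves on
to Candidate.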