Re: [DISCUSS] KIP-996: Pre-Vote

Alyssa Huang Wed, 06 Dec 2023 16:14:30 -0800

From Jose -

> 1. In the schema for VoteRequest and VoteResponse, you are using
> "boolean" as the type keyword. The correct keyword should be "bool"
> instead.
>
Thanks!



> 2. In the states and state transition table you have the following entry:
> >  * Candidate transitions to:
> > *    ...
> > *    Prospective: After expiration of the election timeout
>
> Can you explain the reason a candidate would transition back to
> prospective? If a voter transitions to the candidate state it is
> because the voters don't support KIP-996 or the replica was able to
> win the majority of the votes at some point in the past. Are we
> concerned that the network partition might have occurred after the
> replica has become a candidate? If so, I think we should state this
> explicitly in the KIP.
>
Added this under Proposed Changes:

Also, if a Candidate is unable to be elected (transition to Leader) before
its election timeout expires, it will transition back to Prospective. This
handles the case where a network partition occurs while the server is in
the Candidate state, and prevents unnecessary loss of leadership.


> 3. In the proposed section and state transition section, I think it
> would be helpful to explicitly state that we have an invariant that
> only the prospective state can transition to the candidate state. This
> transition to the candidate state from the prospective state can only
> happen because the replica won the majority of the votes or there is
> at least one remote voter that doesn't support pre-vote.
>
Added this under Proposed Changes:

A follower will now transition to Prospective instead of Candidate when its
fetch timeout expires. Servers will only be able to transition to Candidate
state from the Prospective state.
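
To make the invariant concrete, here is a rough sketch of the transitions mentioned in this thread, written as a lookup table. It is illustrative only: the names `State`, `ALLOWED`, `transition`, and `sources_of` are hypothetical and not taken from the KIP or the Kafka code base, and the table covers only the transitions discussed here, not the KIP's full state table.

```python
from enum import Enum, auto


class State(Enum):
    FOLLOWER = auto()
    PROSPECTIVE = auto()
    CANDIDATE = auto()
    LEADER = auto()


# Only the transitions mentioned in this thread (assumption: the KIP's
# full table contains more entries than these).
ALLOWED = {
    (State.FOLLOWER, State.PROSPECTIVE),   # fetch timeout expired
    (State.PROSPECTIVE, State.CANDIDATE),  # won Pre-Vote, or a peer lacks Pre-Vote support
    (State.CANDIDATE, State.PROSPECTIVE),  # election timeout expired before winning
    (State.CANDIDATE, State.LEADER),       # won the standard vote
}


def transition(current: State, target: State) -> State:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def sources_of(target: State) -> set:
    """All states that may transition directly into target."""
    return {src for (src, dst) in ALLOWED if dst == target}
```

With this table, `sources_of(State.CANDIDATE)` contains only `State.PROSPECTIVE`, and a direct `transition(State.FOLLOWER, State.CANDIDATE)` raises.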

> 4. I am a bit confused by this paragraph
> > A candidate will now send a VoteRequest with the PreVote field set to
> true and CandidateEpoch set to its [epoch + 1] when its election timeout
> expires. If [majority - 1] of VoteResponse grant the vote, the candidate
> will then bump its epoch up and send a VoteRequest with PreVote set to
> false which is our standard vote that will cause state changes for servers
> receiving the request.
>
> I am assuming that "candidate" refers to the states enumerated on the
> table above this quote. If so, I think you mean "prospective" for the
> first candidate.
>
> CandidateEpoch should be ReplicaEpoch.
>
> [epoch + 1] should just be epoch. I thought we agreed that replicas
> will always send their current epoch to the remote replicas.

Thanks! Luke also pointed out that I missed modifying this section. It
should read correctly now.
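
As a rough illustration of the corrected behavior (the Pre-Vote request carries the current, unbumped epoch, and the epoch is bumped only on becoming a Candidate), here is a minimal sketch. The class and method names (`Replica`, `VoteRequest`, `on_fetch_timeout`, `on_pre_vote_won`) are hypothetical and do not correspond to the actual Kafka implementation; only the `ReplicaEpoch` and `PreVote` field names come from the discussion.

```python
from dataclasses import dataclass


@dataclass
class VoteRequest:
    replica_epoch: int  # stands in for the ReplicaEpoch field
    pre_vote: bool      # stands in for the PreVote field


@dataclass
class Replica:
    epoch: int
    state: str = "follower"

    def on_fetch_timeout(self) -> VoteRequest:
        # Become Prospective and probe peers without touching the epoch.
        self.state = "prospective"
        return VoteRequest(replica_epoch=self.epoch, pre_vote=True)

    def on_pre_vote_won(self) -> VoteRequest:
        # Only now is the epoch bumped and a standard vote sent.
        if self.state != "prospective":
            raise ValueError("only a Prospective replica can become Candidate")
        self.state = "candidate"
        self.epoch += 1
        return VoteRequest(replica_epoch=self.epoch, pre_vote=False)
```

A replica at epoch 5 would send a Pre-Vote with epoch 5, and only after winning it would it move to epoch 6 and send the standard vote.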


> 5. I am a bit confused by this bullet section
> > true if the server receives less than [majority] VoteResponse with
> VoteGranted set to false within [election.timeout.ms + a little
> randomness] and the first bullet point does not apply
>      Explanation for why we don't send a standard vote at this point
> is explained in rejected alternatives.
>
> Can we explain this case in plain english? I assume that this case is
> trying to cover the scenario where the election timer expired but the
> prospective candidate hasn't received enough votes (granted or
> rejected) to make a decision if it could win an election.
>
Yes, thanks for the better wording! Modified to the following:


   - true if the server does not receive enough votes (granted or rejected)
   within [election.timeout.ms + a little randomness]
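
To spell out what "not enough votes (granted or rejected)" means, here is a small sketch that classifies a Pre-Vote tally. It assumes the replica counts its own vote, so it needs [majority - 1] grants from peers; the function name `pre_vote_outcome` and the outcome labels are hypothetical, and what the server does for each outcome is left to the KIP.

```python
def pre_vote_outcome(num_voters: int, granted: int, rejected: int,
                     timed_out: bool) -> str:
    """Classify the Pre-Vote responses received from peers so far."""
    majority = num_voters // 2 + 1
    needed = majority - 1       # assumption: the replica counts its own vote
    peers = num_voters - 1
    if granted >= needed:
        return "won"
    if rejected > peers - needed:
        # Even if every remaining peer granted, the total would fall short.
        return "lost"
    # Neither outcome is decided yet; this is the case the bullet describes
    # once the election timeout has expired.
    return "inconclusive" if timed_out else "waiting"
```

For a 5-voter quorum, one grant and one rejection at timeout is "inconclusive", while one grant and three rejections is "lost" regardless of the timer.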


> 6.
> > Yes. If a leader is unable to receive fetch responses from a majority of
> servers, it can impede followers that are able to communicate with it from
> voting in an eligible leader that can communicate with a majority of the
> cluster.
>
> In general, leaders don't receive fetch responses. They receive FETCH
> requests. Did you mean "if a leader is able to send FETCH responses to
> the majority - 1 of the voters, it can impede fetching voters
> (followers) from granting their vote to prospective candidates. This
> should stop prospective candidates from getting enough votes to
> transition to the candidate state and increase their epoch".

> 7.
> > Check Quorum ensures a leader steps down if it is unable to receive
> fetch responses from a majority of servers.
>
> I think you mean "... if it is unable to receive FETCH requests from
> the majority - 1 of the voters".
>
Yes, thanks for this catch! The section now reads as:

Yes. If a leader is unable to send FETCH responses to [majority - 1] of
servers, it can impede its connected followers from granting their vote to
prospectives which *can* communicate with a majority of the cluster. This
is why an additional "Check Quorum" safeguard is needed, which is what
KAFKA-15489 <https://github.com/apache/kafka/pull/14428> implements.
Check Quorum ensures a leader steps down if it is unable to receive FETCH
requests from a majority of servers. This will allow all servers to grant
their votes to eligible prospectives.
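
A minimal sketch of the Check Quorum idea, for illustration only. It is not the KAFKA-15489 implementation: the class `CheckQuorumTimer`, its methods, and the timeout parameter are hypothetical, and it assumes the leader counts itself toward the majority (so it needs recent FETCH requests from [majority - 1] other voters).

```python
class CheckQuorumTimer:
    """Tracks when each voter last sent a FETCH request to the leader."""

    def __init__(self, voter_ids, leader_id, timeout_ms, elected_at_ms=0):
        self.voter_ids = set(voter_ids)
        self.leader_id = leader_id
        self.timeout_ms = timeout_ms
        self.elected_at_ms = elected_at_ms
        self.last_fetch_ms = {}

    def record_fetch(self, voter_id, now_ms):
        # Fetches from non-voters and from the leader itself are ignored.
        if voter_id in self.voter_ids and voter_id != self.leader_id:
            self.last_fetch_ms[voter_id] = now_ms

    def should_step_down(self, now_ms):
        # Give a newly elected leader one full timeout before checking.
        if now_ms - self.elected_at_ms < self.timeout_ms:
            return False
        majority = len(self.voter_ids) // 2 + 1
        recent = sum(1 for seen in self.last_fetch_ms.values()
                     if now_ms - seen < self.timeout_ms)
        # Assumption: the leader counts itself toward the majority.
        return recent + 1 < majority
```

In a 5-voter quorum, the leader in this sketch keeps leadership while at least two other voters have fetched within the timeout, and steps down otherwise.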



> 8. At the end of the Proposed changes section you have the following:
> > The logic now looks like the following for servers receiving
> VoteRequests with PreVote set to true:
> >
> > When servers receive VoteRequests with the PreVote field set to true,
> they will respond with VoteGranted set to
> >
> > * true if they are not a Follower and the epoch and offsets in the
> Pre-Vote request satisfy the same requirements as a standard vote
> > * false if they are a Follower or the epoch and end offsets in the
> Pre-Vote request do not satisfy the requirements
>
> This seems to duplicate the same algorithm that was stated earlier in
> the section.
>
Thanks, I forgot to remove this after incorporating the "follower"
requirement in the beginning of the section
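
For reference, the rule quoted above (grant only if the receiver is not a Follower and the request passes the same checks as a standard vote) might be sketched as follows. The types and field names are hypothetical, and the log comparison is an assumption: it uses the usual Raft "at least as up-to-date" rule (compare last epoch first, then offset) in place of whatever the standard vote path actually checks.

```python
from dataclasses import dataclass


@dataclass
class PreVoteRequest:
    replica_epoch: int
    last_offset_epoch: int
    last_offset: int


@dataclass
class LocalState:
    is_follower: bool
    epoch: int
    last_offset_epoch: int
    log_end_offset: int


def grant_pre_vote(local: LocalState, req: PreVoteRequest) -> bool:
    """Decide VoteGranted for a Pre-Vote request without changing local state."""
    if local.is_follower:
        return False
    if req.replica_epoch < local.epoch:
        return False
    # Assumed standard-vote check: the requester's log must be at least as
    # up to date as ours (higher last epoch wins, then higher offset).
    return ((req.last_offset_epoch, req.last_offset)
            >= (local.last_offset_epoch, local.log_end_offset))
```

The function deliberately does not modify `local`, in line with the point made in this thread that only the standard vote causes state changes on the receiving server.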

> 9. I don't understand this rejected idea: Sending Standard Votes after
> failure to win Pre-Vote
>
> In your example in the "Disruptive server scenarios" voters 4 and 5
> are partitioned from the majority of the voters. We don't want voters
> 4 and 5 increasing their epoch and transitioning to the candidate
> state else they would disrupt the quorum established by voters 1, 2
> and 3.
>
Yes, this is basically the scenario where a Prospective does not receive
enough votes (granted or rejected) within election timeout. I'll just
remove this as a rejected alternative since it seems pretty obvious why we
wouldn't want to do this.

On Tue, Dec 5, 2023 at 11:14 AM Alyssa Huang <ahu...@confluent.io> wrote:

> From Jun -
>
>> 10. "If a server happens to receive multiple VoteResponses from another
>> server for a particular VoteRequest, it can take the first and ignore the
>> rest.": Could you explain why a server would receive multiple responses
>> for the same request?
>>
> This was meant to be a catch-all for network flakiness and other oddities;
> it wouldn't be expected in the general case.
>
>
>> 11. "e.g. S1 in the below diagram pg. 41)": What is pg. 41?
>
> Of the Raft paper. I've made the language clearer now:
>
> (e.g. S1 in the below diagram, pg. 41 of Raft paper
> <https://purl.stanford.edu/qr033xr6097>)
>
>> 12. "if a server attempts to send out a Pre-Vote request while any other
>> server in the quorum does not understand it, it will get back an
>> UnsupportedVersionException from the network client and knows to default
>> back to the old behavior."
>>
>> 12.1 Based on ApiVersion, a server knows whether a peer supports PreVote or
>> not. If it doesn't, there is no need for the server to send a PreVote
>> request only to be rejected, right?
>
> Correct, the server won't actually send the PreVote request. Its network
> client will skip/abort the request when `latestUsableVersion` throws an
> UnsupportedVersionException because the peer does not support PreVote.
>
>> 12.2 What happens when some servers understand PreVote while some others
>> don't?
>>
> We would default to the original standard vote behavior. I can be more
> explicit about this in the Compatibility section (modified section pasted
> below)
>
> We currently use ApiVersions to gate new/newer versions of Raft APIs from
> being used before all servers can support them. This is useful in the upgrade
> scenario for Pre-Vote - if a server attempts to send out a Pre-Vote request
> while any other server in the quorum does not understand it, it will get
> back an UnsupportedVersionException from the network client and knows to
> default back to the old behavior. Specifically, the server will transition
> from Prospective immediately to Candidate state, and will send standard
> votes instead which can be understood by servers on older software
> versions.
> Let's take a look at an edge case. As the network client will only check
> the supported version of the peer that we are intending to send a request
> to, we can imagine a scenario where a server first sends PreVotes to peers
> which understand PreVote, and then attempts to send PreVote to a peer which
> does not. If the server receives and processes a majority of granted
> PreVote responses prior to hitting the UnsupportedVersionException, it can
> transition to Candidate state. Otherwise, it will also transition to
> Candidate state once it hits the exception, and send standard vote requests
> to all servers. Any PreVote responses received while in Candidate state
> would be ignored.
>
>
> On Tue, Dec 5, 2023 at 10:10 AM Alyssa Huang <ahu...@confluent.io> wrote:
>
>> Hey folks, thanks for the reviews!
>> Addressing them one by one. From Luke -
>>
>> Some comments:
>>> 1. Follower transitions to: Prospective: After expiration of the election
>>> timeout
>>> -> Is this the fetch timeout, not election timeout?
>>>
>> Yes, thanks for this catch!
>>
>>
>>> 2. I also agree we don't bump the epoch in prospective state.
>>>  A candidate will now send a VoteRequest with the PreVote field set to
>>> true and CandidateEpoch set to its [epoch + 1] when its election timeout
>>> expires.
>>> -> What is "CandidateEpoch"? And I thought you've agreed to not set
>>> [epoch + 1] ?
>>>
>> Forgot to update this section. It now reads:
>>
>> A follower will now transition to Prospective when its fetch timeout
>> expires. The Prospective server will send a VoteRequest with the PreVote
>> field set to true and ReplicaEpoch set to its current, unbumped epoch.
>> If [majority - 1] of VoteResponse grant the vote, the server will
>> transition to Candidate and will then bump its epoch up and send a
>> VoteRequest with PreVote set to false (which is the original behavior).
>>
>>
>> On Wed, Nov 29, 2023 at 4:53 PM José Armando García Sancio
>> <jsan...@confluent.io.invalid> wrote:
>>
>>> Hi Alyssa,
>>>
>>> 1. In the schema for VoteRequest and VoteResponse, you are using
>>> "boolean" as the type keyword. The correct keyword should be "bool"
>>> instead.
>>>
>>> 2. In the states and state transition table you have the following
>>> entry:
>>> >  * Candidate transitions to:
>>> > *    ...
>>> > *    Prospective: After expiration of the election timeout
>>>
>>> Can you explain the reason a candidate would transition back to
>>> prospective? If a voter transitions to the candidate state it is
>>> because the voters don't support KIP-996 or the replica was able to
>>> win the majority of the votes at some point in the past. Are we
>>> concerned that the network partition might have occurred after the
>>> replica has become a candidate? If so, I think we should state this
>>> explicitly in the KIP.
>>>
>>> 3. In the proposed section and state transition section, I think it
>>> would be helpful to explicitly state that we have an invariant that
>>> only the prospective state can transition to the candidate state. This
>>> transition to the candidate state from the prospective state can only
>>> happen because the replica won the majority of the votes or there is
>>> at least one remote voter that doesn't support pre-vote.
>>>
>>> 4. I am a bit confused by this paragraph
>>> > A candidate will now send a VoteRequest with the PreVote field set to
>>> true and CandidateEpoch set to its [epoch + 1] when its election timeout
>>> expires. If [majority - 1] of VoteResponse grant the vote, the candidate
>>> will then bump its epoch up and send a VoteRequest with PreVote set to
>>> false which is our standard vote that will cause state changes for servers
>>> receiving the request.
>>>
>>> I am assuming that "candidate" refers to the states enumerated on the
>>> table above this quote. If so, I think you mean "prospective" for the
>>> first candidate.
>>>
>>> CandidateEpoch should be ReplicaEpoch.
>>>
>>> [epoch + 1] should just be epoch. I thought we agreed that replicas
>>> will always send their current epoch to the remote replicas.
>>>
>>> 5. I am a bit confused by this bullet section
>>> > true if the server receives less than [majority] VoteResponse with
>>> VoteGranted set to false within [election.timeout.ms + a little
>>> randomness] and the first bullet point does not apply
>>>      Explanation for why we don't send a standard vote at this point
>>> is explained in rejected alternatives.
>>>
>>> Can we explain this case in plain english? I assume that this case is
>>> trying to cover the scenario where the election timer expired but the
>>> prospective candidate hasn't received enough votes (granted or
>>> rejected) to make a decision if it could win an election.
>>>
>>> 6.
>>> > Yes. If a leader is unable to receive fetch responses from a majority
>>> of servers, it can impede followers that are able to communicate with it
>>> from voting in an eligible leader that can communicate with a majority of
>>> the cluster.
>>>
>>> In general, leaders don't receive fetch responses. They receive FETCH
>>> requests. Did you mean "if a leader is able to send FETCH responses to
>>> the majority - 1 of the voters, it can impede fetching voters
>>> (followers) from granting their vote to prospective candidates. This
>>> should stop prospective candidates from getting enough votes to
>>> transition to the candidate state and increase their epoch".
>>>
>>> 7.
>>> > Check Quorum ensures a leader steps down if it is unable to receive
>>> fetch responses from a majority of servers.
>>>
>>> I think you mean "... if it is unable to receive FETCH requests from
>>> the majority - 1 of the voters".
>>>
>>> 8. At the end of the Proposed changes section you have the following:
>>> > The logic now looks like the following for servers receiving
>>> VoteRequests with PreVote set to true:
>>> >
>>> > When servers receive VoteRequests with the PreVote field set to true,
>>> they will respond with VoteGranted set to
>>> >
>>> > * true if they are not a Follower and the epoch and offsets in the
>>> Pre-Vote request satisfy the same requirements as a standard vote
>>> > * false if they are a Follower or the epoch and end offsets in the
>>> Pre-Vote request do not satisfy the requirements
>>>
>>> This seems to duplicate the same algorithm that was stated earlier in
>>> the section.
>>>
>>> 9. I don't understand this rejected idea: Sending Standard Votes after
>>> failure to win Pre-Vote
>>>
>>> In your example in the "Disruptive server scenarios" voters 4 and 5
>>> are partitioned from the majority of the voters. We don't want voters
>>> 4 and 5 increasing their epoch and transitioning to the candidate
>>> state else they would disrupt the quorum established by voters 1, 2
>>> and 3.
>>>
>>>
>>> Thanks,
>>> --
>>> -José
>>>
>>
