Hi Jun

Thanks for the feedback. See my comments below.

On Wed, Mar 6, 2024 at 4:47 PM Jun Rao <j...@confluent.io.invalid> wrote:
> 20.1. It seems that the PreferredSuccessors field didn't get fixed. It's
> still there together with PreferredCandidates.
> +        { "name": "PreferredSuccessors", "type": "[]int32", "versions":
> "0",
> +          "about": "A sorted list of preferred successors to start the
> election" },
> +        { "name": "PreferredCandidates", "type": "[]ReplicaInfo",
> "versions": "1+",
> +          "about": "A sorted list of preferred candidates to start the
> election", "fields": [

Notice that the PreferredSuccessors field is only for version 0 while
the PreferredCandidates field is for version 1 or greater. I had to
create a new field because arrays of int32 ([]int32) are not
wire-compatible with arrays of structs, since the sub-structs carry
tagged fields.
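As a sketch of what the version-1+ field looks like (the sub-field
names here are illustrative, not necessarily the exact names in the
KIP), the struct element is what makes the type incompatible with a
bare []int32:

```json
{ "name": "PreferredCandidates", "type": "[]ReplicaInfo", "versions": "1+",
  "about": "A sorted list of preferred candidates to start the election",
  "fields": [
    { "name": "CandidateId", "type": "int32", "versions": "1+" },
    { "name": "CandidateDirectoryId", "type": "uuid", "versions": "1+" }
  ]
}
```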

> 37. If we don't support batching in AddVoterResponse, RemoveVoterResponse
> and UpdateVoterResponse, should we combine CurrentLeader and NodeEndpoint
> into a single field?

Yes. I combined the LeaderIdAndEpoch and NodeEndpoint structs into a
single struct that contains the leader id, epoch, host and port.
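For illustration only, a combined field could look something like the
sketch below (the struct name, tag number and exact field layout are my
assumptions, not the final KIP schema):

```json
{ "name": "CurrentLeader", "type": "LeaderIdAndEndpoint", "versions": "0+",
  "taggedVersions": "0+", "tag": 0, "fields": [
    { "name": "LeaderId", "type": "int32", "versions": "0+" },
    { "name": "LeaderEpoch", "type": "int32", "versions": "0+" },
    { "name": "Host", "type": "string", "versions": "0+" },
    { "name": "Port", "type": "uint16", "versions": "0+" }
  ]
}
```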

> 42. We include VoterId and VoterUuid for the receiver in Vote and
> BeginQuorumEpoch requests, but not in EndQuorumEpoch, Fetch and
> FetchSnapshot. Could you explain how they are used?

For the Vote and BeginQuorumEpoch requests, the replica sending the
request (the candidate for Vote, the leader for BeginQuorumEpoch)
needs to make sure that it is sending the request to the correct node.
This is needed for correctness. The most important case that I wanted
to make sure replicas handle correctly is the following:
1. The voter set is A, B, C and the leader is A. Here A stands for
both the replica id and the replica uuid of that voter.
2. Assume that A crashes and loses its disk. When it recovers it will
come back as A', meaning a replica with the same id but a different
replica uuid.

Replica A' will most likely be accepting connections and handling
requests (e.g. Vote and BeginQuorumEpoch) on the same endpoints as A.
The state can become inconsistent if, for example, replica B sends a
Vote request to A' but A' handles it as if it were A. This is the
reason the sender includes the remote replica's id and uuid (VoterId
and VoterUuid) in the request. The same analysis applies to
BeginQuorumEpoch.
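The receiver-side check described above can be sketched roughly as
follows (a hypothetical illustration; ReplicaKey and
should_handle_vote are my names, not the actual Kafka implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReplicaKey:
    """Identifies a replica by both its id and its uuid."""
    replica_id: int
    replica_uuid: str  # changes when a replica loses its disk and restarts

def should_handle_vote(local: ReplicaKey, requested_voter: ReplicaKey) -> bool:
    """Handle the request only if it was addressed to this exact replica.

    After A loses its disk and restarts as A' (same id, new uuid), a
    Vote request that B addressed to A must not be handled by A'.
    """
    return local == requested_voter

a = ReplicaKey(1, "uuid-A")
a_prime = ReplicaKey(1, "uuid-A-prime")  # same id, new uuid after disk loss

print(should_handle_vote(a_prime, a))        # A' rejects a request meant for A
print(should_handle_vote(a_prime, a_prime))  # A' handles a request meant for A'
```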

For the Fetch and FetchSnapshot requests, the closest equivalent would
be the leader id and leader epoch. Those RPCs only carry leader
epochs. One could argue that they don't need the leader id because a
leader epoch can have at most one leader; in other words, the leader
epoch uniquely identifies the leader id if there is one. I am
reluctant to change the Fetch RPC unless it is strictly required,
because that RPC is also used for regular topic partition replication.
I tried to keep the FetchSnapshot RPC as consistent with the Fetch RPC
as possible since they have similar access patterns.

EndQuorumEpoch is not needed for correctness. It is there for improved
availability: it speeds up leadership change when the nodes are
cooperating (controlled shutdown and resignation). The sending replica
(the leader) doesn't need to wait for the response or check that the
RPC was handled correctly.

I'll reread the KIP and update it to better explain the need for
VoterId and VoterUuid in the Vote and BeginQuorumEpoch requests.

Thanks,
-- 
-José
