Hi Jun,

Thanks for the feedback. See my comments below.
On Wed, Mar 6, 2024 at 4:47 PM Jun Rao <j...@confluent.io.invalid> wrote:

> 20.1. It seems that the PreferredSuccessors field didn't get fixed. It's
> still there together with PreferredCandidates.
>
> +      { "name": "PreferredSuccessors", "type": "[]int32", "versions": "0",
> +        "about": "A sorted list of preferred successors to start the election" },
> +      { "name": "PreferredCandidates", "type": "[]ReplicaInfo", "versions": "1+",
> +        "about": "A sorted list of preferred candidates to start the election", "fields": [

Notice that the PreferredSuccessors field is only for version 0 while the
PreferredCandidates field is for version 1 or greater. I had to create a new
field because an array of int32 ([]int32) is not compatible with an array of
structs, since sub-structs may contain tagged fields.

> 37. If we don't support batching in AddVoterResponse, RemoveVoterResponse
> and UpdateVoterResponse, should we combine CurrentLeader and NodeEndpoint
> into a single field?

Yes. I replaced the LeaderIdAndEpoch and NodeEndpoint structs with a single
struct that contains the leader id, epoch, host and port.

> 42. We include VoterId and VoterUuid for the receiver in Vote and
> BeginQuorumEpoch requests, but not in EndQuorumEpoch, Fetch and
> FetchSnapshot. Could you explain how they are used?

For the Vote and BeginQuorumEpoch requests, the replica sending the request
(the candidate for Vote, the leader for BeginQuorumEpoch) needs to make sure
that it is sending the request to the correct node. This is needed for
correctness. The most important case that I wanted to make sure replicas
handle correctly is the following:

1. The voter set is A, B, C and the leader is A. Here A stands for both the
   voter id and the voter uuid.
2. Assume that A crashes and loses its disk. When it recovers it will come
   back as A'. A' means a replica with the same id but a different replica
   uuid. Replica A' will most likely be accepting connections and handling
   requests (e.g.
Vote and BeginQuorumEpoch) on the same endpoints as A.

The state can become inconsistent if, for example, replica B sends a Vote
request to A' but A' handles it as if it were A. This is the reason the
sender includes the remote replica's id and uuid (VoterId and VoterUuid) in
the request. The same analysis applies to BeginQuorumEpoch.

For the Fetch and FetchSnapshot requests the closest equivalent would be the
leader id and leader epoch. Those RPCs only include the leader epoch. You
can argue that they don't need the leader id because a leader epoch can have
at most one leader; in other words, the leader epoch also uniquely
identifies the leader id, if there is one. I am reluctant to change the
Fetch RPC unless it is strictly required because that RPC is also used for
regular topic partition replication. I tried to make the FetchSnapshot RPC
as consistent with the Fetch RPC as possible since they have similar access
patterns.

EndQuorumEpoch is not needed for correctness. It is there for improved
availability: to speed up leadership change when the nodes are cooperating
(controlled shutdown and resignation). The sending replica (the leader)
doesn't need to wait for the response or check that the RPC was handled
correctly.

I'll reread the KIP and update it to better explain the need for VoterId and
VoterUuid in the Vote and BeginQuorumEpoch requests.

Thanks,
--
-José
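P.S. To make the A/A' scenario concrete, here is a minimal, hypothetical
sketch (not the actual Kafka implementation; the names ReplicaIdentity and
isIntendedReceiver are made up for illustration) of the check a receiver
would perform against the VoterId and VoterUuid carried in a Vote or
BeginQuorumEpoch request:

```java
import java.util.Objects;
import java.util.UUID;

// Hypothetical sketch of why the receiver must validate VoterId and
// VoterUuid before handling a Vote or BeginQuorumEpoch request.
public final class VoterIdentityCheck {
    // The identity of a replica: a node id plus a uuid that changes when
    // the replica loses its disk and comes back reformatted.
    record ReplicaIdentity(int id, UUID uuid) {}

    // Returns true only if the request was addressed to this exact
    // replica (same id AND same uuid). A' keeps A's id but has a fresh
    // uuid, so a Vote request addressed to A must not be handled by A'.
    static boolean isIntendedReceiver(ReplicaIdentity local,
                                      int requestVoterId,
                                      UUID requestVoterUuid) {
        return local.id() == requestVoterId
            && Objects.equals(local.uuid(), requestVoterUuid);
    }

    public static void main(String[] args) {
        UUID originalUuid = UUID.randomUUID();
        ReplicaIdentity a = new ReplicaIdentity(1, originalUuid);
        // A': same id as A, but a new uuid after losing its disk.
        ReplicaIdentity aPrime = new ReplicaIdentity(1, UUID.randomUUID());

        // B addressed its Vote request to the original replica A.
        System.out.println(isIntendedReceiver(a, 1, originalUuid));      // true
        System.out.println(isIntendedReceiver(aPrime, 1, originalUuid)); // false
    }
}
```

With only the leader epoch (as in Fetch and FetchSnapshot) this check is
unnecessary, since an epoch identifies at most one leader; the explicit
id+uuid pair is what lets the receiver reject requests meant for a
previous incarnation of itself.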