Hi, Jose,

Thanks for the reply. A few more comments.

37. Have you updated the wiki? It seems that LeaderIdAndEpoch and
NodeEndpoint are still two separate structs.

45. kafka-storage format --cluster-id <cluster-id> --release-version 3.8
--standalone --config controller.properties
It seems that --release-version is optional and that the default value of
the initial metadata.version is statically defined by the binary used to run
the tool. So, in the common case, we could just omit --release-version.
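For example, assuming the tool's default release version matches what the
binary supports, the common-case invocation could presumably be just:

kafka-storage format --cluster-id <cluster-id> --standalone --config controller.properties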

46. The KIP says "The leader will learn the range of supported version from
the UpdateVoter RPC."
46.1 If so, the statement "KRaft will use the information in the controller
registration, broker registration and add voter records to determine if the
new version is compatible." is no longer accurate.
46.2 Do we have a bit of a chicken-and-egg problem? A voter can't send the
UpdateVoter RPC until kraft.version is upgraded to 1, but to upgrade
kraft.version, the leader needs to receive UpdateVoter RPCs.
46.3 If the leader always learns the range of supported kraft.version from
the UpdateVoter RPC, does the VotersRecord need to include KRaftVersionFeature?

47. When a voter is started, in what order does it send the UpdateVoter and
ControllerRegistration RPCs?

48. Voters set: "If the partition doesn't contain a VotersRecord control
record then the value stored in the controller.quorum.voters property will
be used."
  Hmm, I thought controller.quorum.voters was only the fallback
for controller.quorum.bootstrap.servers?
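For reference, this is my rough mental model of the two configs, with made-up
host names (the exact format of controller.quorum.bootstrap.servers below is
my assumption):

# static voter set, id@host:port entries
controller.quorum.voters=1@controller-1:9093,2@controller-2:9093,3@controller-3:9093
# endpoint discovery only, host:port entries
controller.quorum.bootstrap.servers=controller-1:9093,controller-2:9093,controller-3:9093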

49. "If the local replica is getting added to the voters set, it will allow
the transition to prospective candidate when the fetch timer expires."
  Could you explain why this is needed?

50. Quorum state: AppliedOffset will get removed in version 1.
  This is the original description of AppliedOffset: "Reflects the maximum
offset that has been applied to this quorum state. This is used for log
recovery. The broker must scan from this point on initialization to detect
updates to this file." If we remove this field, how do we reason about the
consistency between the quorum state and the metadata log?

51. AddVoter has the following steps.
1. Wait for the fetch offset of the replica (ID, UUID) to catch up to the
log end offset of the leader.
2. Wait until there are no uncommitted VotersRecord. Note that the
implementation may just return a REQUEST_TIMED_OUT error if there are
pending operations.
3. Wait for the LeaderChangeMessage control record from the current epoch
to get committed. Note that the implementation may just return a
REQUEST_TIMED_OUT error if there are pending operations.
4. Send an ApiVersions RPC to the first listener to discover the supported
kraft.version of the new voter.
It seems that checking the supported kraft.version of the new voter in step
4 is too late. If the new voter doesn't support kraft.version 1, it can't
process the metadata log records and step 1 could fail.
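To illustrate (a rough sketch only, with hypothetical names rather than the
actual KafkaRaftClient code), moving the kraft.version check ahead of the
catch-up wait could look something like this:

// Illustrative sketch; hypothetical names, not the actual implementation.
final class AddVoterOrderingSketch {

    record VersionRange(short min, short max) {
        boolean contains(short v) { return v >= min && v <= max; }
    }

    static void handleAddVoter(VersionRange newVoterKraftVersions, // from an early ApiVersions round trip
                               short activeKraftVersion,           // kraft.version currently used by the quorum
                               java.util.function.LongSupplier replicaFetchOffset,
                               long leaderLogEndOffset) {
        // Moved-up check: fail fast if the new voter cannot run the active
        // kraft.version, since it could not replay the metadata log while
        // catching up anyway.
        if (!newVoterKraftVersions.contains(activeKraftVersion)) {
            throw new IllegalStateException(
                "New voter does not support kraft.version " + activeKraftVersion);
        }
        // Only now wait for the replica (ID, UUID) to catch up to the
        // leader's log end offset.
        while (replicaFetchOffset.getAsLong() < leaderLogEndOffset) {
            Thread.onSpinWait(); // stand-in for the real asynchronous wait and timeout handling
        }
        // ...followed by the remaining steps: no uncommitted VotersRecord,
        // committed LeaderChangeMessage for the current epoch, etc.
    }
}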

52. Admin client: Could you provide a bit more details on the changes?

53. A few more typos.
53.1 be bootstrap => be bootstrapped
53.2 If the new leader supports the current kraft.version,
  new leader => new voter
53.3 Voter are removed
  Voter => Voters
53.4 that are part that are voters
  This part doesn't read well.
53.5 the the current
  extra the
53.6 it will be discover
  discover => discovered
53.7 it would beneficial
  beneficial => be beneficial

Jun

On Mon, Mar 11, 2024 at 10:39 AM José Armando García Sancio
<jsan...@confluent.io.invalid> wrote:

> Hi Jun
>
> Thanks for the feedback. See my comments below.
>
> On Wed, Mar 6, 2024 at 4:47 PM Jun Rao <j...@confluent.io.invalid> wrote:
> > 20.1. It seems that the PreferredSuccessors field didn't get fixed. It's
> > still there together with PreferredCandidates.
> > +        { "name": "PreferredSuccessors", "type": "[]int32", "versions":
> > "0",
> > +          "about": "A sorted list of preferred successors to start the
> > election" },
> > +        { "name": "PreferredCandidates", "type": "[]ReplicaInfo",
> > "versions": "1+",
> > +          "about": "A sorted list of preferred candidates to start the
> > election", "fields": [
>
> Notice that the PreferredSuccessors field is only for version 0 while
> the PreferredCandidates field is for version 1 or greater. I had to
> create a new field because arrays of int32 ([]int32) are not
> compatible with arrays of structs because of tagged fields in
> sub-structs.
>
> > 37. If we don't support batching in AddVoterResponse, RemoveVoterResponse
> > and UpdateVoterResponse, should we combine CurrentLeader and NodeEndpoint
> > into a single field?
>
> Yes. I combined the LeaderIdAndEpoch and NodeEndpoint structs into a
> single struct that contains the leader id, epoch, host and port.
>
> > 42. We include VoterId and VoterUuid for the receiver in Vote and
> > BeginQuorumEpoch requests, but not in EndQuorumEpoch, Fetch and
> > FetchSnapshot. Could you explain how they are used?
>
> For the Vote request and BeginQuorumEpoch request the replica
> (candidate for Vote and leader for BeginQuorumEpoch) sending the
> request needs to make sure that it is sending the request to the
> correct node. This is needed for correctness. The most important case
> that I wanted to make sure replicas handle correctly is the
> following:
> 1. The voter set is A, B, C and the leader is A. Here A refers to a
> specific voter id and voter uuid pair.
> 2. Assume that A crashes and loses its disk. When it recovers it will
> come back as A'. A' means a replica with the same id but with a
> different replica uuid.
>
> Replica A' will most likely be accepting connections and handling
> requests (e.g. Vote and BeginQuorumEpoch) on the same endpoints as A.
> There can be inconsistency in the state if, for example, replica B
> sends a Vote request to A' but A' handles it as if it were A. This is
> the reason the sender sends the remote replica's id and uuid (VoterId
> and VoterUuid) in the request. The same analysis applies to
> BeginQuorumEpoch.
>
> For the Fetch and FetchSnapshot requests the closest equivalent would
> be leader id and leader epoch. Those RPCs only have leader epochs. You
> can argue that they don't need the leader id because a leader epoch
> can have at most one leader id. In other words, the leader epoch also
> uniquely identifies the leader id if there is one. I am reluctant to
> change the Fetch RPC unless it is strictly required because that RPC
> is also used for regular topic partition replication. I tried to make
> the FetchSnapshot RPC as consistent with the Fetch RPC as possible since
> they have similar access patterns.
>
> EndQuorumEpoch is not needed for correctness. It is there for improved
> availability: to speed up leadership changes when the nodes are
> cooperating (controlled shutdown and resignation). The sending replica
> (leader) doesn't need to wait for the response or to check that the
> RPC was handled correctly.
>
> I'll reread the KIP and update it to better explain the need for
> VoterId and VoterUuid in the Vote and BeginQuorumEpoch requests.
>
> Thanks,
> --
> -José
>
