Thanks for the feedback, Colin. I updated the KIP with your suggestions and replied to your comments below.
Colin McCabe wrote:
> 1. It seems like the proposal is to have a UUID per partition directory on
> the voter. If I understand correctly, this is sometimes referred to as
> "VoterUUID" and sometimes as "ReplicaUUID." The latter seems more accurate,
> since a single voter could have multiple of these IDs, in a situation where
> we had multiple Raft topics. So it would be good to standardize on that.
> Also, I didn't see a description of how this would be stored in the log
> directory. That would be good to add.

Conceptually, these are all replica UUIDs. In the protocol (Kafka messages) we use different field names depending on the context. This KIP and KIP-595 use the prefixes "Candidate", "Replica", "Voter" and "Voted" depending on the context. I went through the KIP and made some changes to make this consistent.

Diff: https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=217391519&selectedPageVersions=24&selectedPageVersions=23

> 2. When we originally did the Raft and Quorum Controller KIPs, one
> contentious topic was node IDs. We eventually settled on the idea that broker
> and controller IDs were in the same ID space. So you can't (for example) have
> a broker 3 that is in a separate JVM from controller 3. This is pretty easy
> to enforce with a static configuration, but it seems like it will be harder
> to do dynamically.
>
> I would like to keep this invariant. This probably requires us to reject
> attempts to add a new quorum voter which duplicates a broker ID (except in
> the special case of co-location!) Similarly, we should reject broker
> registrations that duplicate an unrelated controller ID. The broker's
> incarnation ID is the key to doing this, I think. But that requires us to
> send the incarnation ID in many of these RPCs.

I think there are two cases that we want to protect against.

1. The admin attempts to add a voter that is a broker using the AddVoter RPC.
The KIP protects against this case by only having controllers (or replicas that support being voters) send the ReplicaUuid to the cluster metadata leader. Since the AddVoter RPC requires both the ID and the UUID to be specified, brokers cannot be added as voters. I thought about adding another field to the Fetch and FetchSnapshot requests to communicate to the leader whether the sending replica supports becoming a voter, but decided against it so as not to increase the already large number of fields in the Fetch request.

2. The admin starts a broker that has the same ID as a voter.

Kafka currently protects against this by validating the broker configuration; the validation fails if the node is a broker and has an ID that is enumerated in controller.quorum.voters. When using dynamic configuration, the controller.quorum.voters configuration will be empty. In the implementation we can still protect against this case by passing enough information to the KafkaRaftClient. For example, if we pass a boolean "canBeVoter" to the KRaft implementation, it can check that its replica ID is not in the voter set. If it is, it will shut down.

> 3. Is it really necessary to put the endpoint information into the
> AddVoterRecord? It seems like that could be figured out at runtime, like we
> do today. If we do need it, it seems particularly weird for it to be
> per-partition (will we have a separate TCP port for each Raft partition?) I
> also don't know why we'd want multiple endpoints. We have that for the broker
> because the endpoints have different uses, but that isn't the case here.

Today, voters/controllers use the hosts and ports specified in controller.quorum.voters to establish a leader. Brokers use the hosts and ports in controller.quorum.voters to discover the leader. The requirements are the following:

1. Voters/controllers need to know the endpoints of all of the voters.
This is required because at a minimum the voters need to send a Vote request to a majority of the voters to establish leadership.

2. Brokers need to discover the leader from a list of nodes.

The endpoints in AddVoterRecord are there so that the voters can discover each other. The configuration controller.quorum.bootstrap.servers is there so that observers (brokers and new voters), which are not required to have replicated the voter set and AddVoterRecords, can discover the leader.

I noticed that I forgot to add endpoint information to all of the RPCs that return the current leader. This is needed because it is not guaranteed that the replica will have an AddVoterRecord for the leader ID returned in the response.

> The original rationale for multiple endpoints on the controllers was to
> support migration from PLAINTEXT to SSL (or whatever). But that only requires
> multiple listeners to be active on the receive side, not send side. A single
> voter never needs more than one endpoint to contact a peer.

I agree. I removed the array ([]) and made it a single endpoint.

> Overall, I think we'd be better off keeping this as soft state rather than
> adding it to the log. Particularly if it's not in the log at all for the
> static configuration case...

I covered this issue in one of my comments above, but in summary: we need the endpoints for each voter persisted in the log when controller.quorum.bootstrap.servers is used, and if we want to support going from controller.quorum.voters to controller.quorum.bootstrap.servers.

> 4. How do you get from the static configuration situation to the dynamic one?
> Can it be done with a rolling restart? I think the answer is yes, but I
> wasn't quite sure on reading.

I didn't address this issue yet, but I wanted to share my update and reply to your other comments. Let me get back to you on this one. There are a few requirements that I need to make sure are satisfied before we can replace the static configuration with a dynamic one.
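To make the "canBeVoter" idea above concrete, here is a minimal sketch of what that startup validation could look like. This is purely illustrative: the class and method names (VoterSetValidator, validateReplicaId) are hypothetical and not part of the KIP or the KafkaRaftClient API.

```java
import java.util.Set;

final class VoterSetValidator {
    // Hypothetical sketch: a replica that cannot be a voter (e.g. a broker)
    // must not appear in the current voter set. If it does, the node should
    // fail startup rather than run with a conflicting ID.
    static void validateReplicaId(int replicaId, boolean canBeVoter, Set<Integer> voterIds) {
        if (!canBeVoter && voterIds.contains(replicaId)) {
            throw new IllegalStateException(
                "Replica " + replicaId + " is configured as a broker but is a " +
                "member of the voter set; shutting down");
        }
    }
}
```

The check replaces the static validation against controller.quorum.voters: since that configuration is empty in the dynamic case, the voter set read from the log is consulted instead.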
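For reference, the two discovery styles discussed above can be sketched as configuration fragments. The host names and ports here are made-up example values; only one of the two properties would be set on a given node.

```properties
# Static configuration (KIP-595): every node lists the full voter set
# as id@host:port entries.
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093

# Dynamic configuration (this KIP): nodes only need endpoints for
# discovering the leader; the voter set itself is replicated in the
# cluster metadata log via AddVoterRecords.
controller.quorum.bootstrap.servers=controller1:9093,controller2:9093,controller3:9093
```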
> Does a leader using the static configuration auto-remove voters that aren't
> in that static config, as well as auto-add? The adding behavior is spelled
> out, but not removing (or maybe I missed it).

Yes. I don't think that the leader should auto-remove voters. It is not safe to change the controller.quorum.voters field after the active controller has written records to the cluster metadata topic partition. I updated the bootstrap section to mention that the replica will fail to start and shut down if the cluster metadata topic partition contains AddVoterRecords for replica IDs that are not in the controller.quorum.voters configuration.

Thank you!
--
-José