Thanks for the feedback, Colin. I updated the KIP with your suggestions and replied to your comments below.
Colin McCabe wrote:
> 1. It seems like the proposal is to have a UUID per partition directory on
> the voter. If I understand correctly, this is sometimes referred to as
> "VoterUUID" and sometimes as "ReplicaUUID." The latter seems more accurate,
> since a single voter could have multiple of these IDs, in a situation where
> we had multiple Raft topics. So it would be good to standardize on that.
> Also, I didn't see a description of how this would be stored in the log
> directory. That would be good to add.

Conceptually, these are all replica UUIDs. In the protocol (Kafka messages) we use different field names depending on the context. This KIP and KIP-595 use the prefixes "Candidate", "Replica", "Voter" and "Voted" depending on the context. I went through the KIP and made some changes to make this consistent.

Diff: https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=217391519&selectedPageVersions=24&selectedPageVersions=23

> 2. When we originally did the Raft and Quorum Controller KIPs, one
> contentious topic was node IDs. We eventually settled on the idea that broker
> and controller IDs were in the same ID space. So you can't (for example) have
> a broker 3 that is in a separate JVM from controller 3. This is pretty easy
> to enforce with a static configuration, but it seems like it will be harder
> to do dynamically.
>
> I would like to keep this invariant. This probably requires us to reject
> attempts to add a new quorum voter which duplicates a broker ID (except in
> the special case of co-location!) Similarly, we should reject broker
> registrations that duplicate an unrelated controller ID. The broker's
> incarnation ID is the key to doing this, I think. But that requires us to
> send the incarnation ID in many of these RPCs.

I think there are two cases that we want to protect against.

1. The admin attempts to add a voter that is a broker using the AddVoter RPC.
The KIP protects against this case by only having controllers (or replicas that support being voters) send the ReplicaUuid to the cluster metadata leader. Since the AddVoter RPC requires both the ID and the UUID to be specified, brokers cannot be added as voters. I thought about adding another field to the Fetch and FetchSnapshot requests to communicate to the leader whether the sending replica supports becoming a voter, but decided against it so as not to increase the already large number of fields in the Fetch request.

2. The admin starts a broker that has the same ID as a voter.

Kafka currently protects against this by validating the broker configuration; the validation fails if the node is a broker and has an ID that is enumerated in controller.quorum.voters. When using dynamic configuration, the controller.quorum.voters configuration will be empty. In the implementation we can still protect against this case by passing enough information to the KafkaRaftClient. For example, if we pass a boolean "canBeVoter" to the KRaft implementation, it can check that its replica ID is not in the voter set. If it is, it will shut down.

> 3. Is it really necessary to put the endpoint information into the
> AddVoterRecord? It seems like that could be figured out at runtime, like we
> do today. If we do need it, it seems particularly weird for it to be
> per-partition (will we have a separate TCP port for each Raft partition?) I
> also don't know why we'd want multiple endpoints. We have that for the broker
> because the endpoints have different uses, but that isn't the case here.

Today, voters/controllers use the hosts and ports specified in controller.quorum.voters to establish a leader. Brokers use the hosts and ports in controller.quorum.voters to discover the leader. The requirements are the following:

1. Voters/controllers need to know the endpoints of all of the voters.
This is required because at a minimum the voters need to send a Vote request to a majority of the voters to establish leadership.

2. Brokers need to discover the leader from a list of nodes.

The endpoints in AddVoterRecord are there so that the voters can discover each other. The configuration controller.quorum.bootstrap.servers is there so that observers (brokers and new voters), which are not required to have replicated the voter set and AddVoterRecords, can discover the leader.

I noticed that I forgot to add endpoint information to all of the RPCs that return the current leader. This is needed because it is not guaranteed that the replica will have an AddVoterRecord for the leader ID returned in the response.

> The original rationale for multiple endpoints on the controllers was to
> support migration from PLAINTEXT to SSL (or whatever). But that only requires
> multiple listeners to be active on the receive side, not send side. A single
> voter never needs more than one endpoint to contact a peer.

I agree. I removed the array ([]) and made it a single endpoint.

> Overall, I think we'd be better off keeping this as soft state rather than
> adding it to the log. Particularly if it's not in the log at all for the
> static configuration case...

I covered this issue in one of my comments above, but in summary: we need the endpoints for each voter persisted in the log when controller.quorum.bootstrap.servers is used, and if we want to support going from controller.quorum.voters to controller.quorum.bootstrap.servers.

> 4. How do you get from the static configuration situation to the dynamic one?
> Can it be done with a rolling restart? I think the answer is yes, but I
> wasn't quite sure on reading.

I didn't address this issue yet, but I wanted to share my update and reply to your other comments. Let me get back to you on this one. There are a few requirements that I need to make sure are satisfied before we can replace the static configuration with a dynamic one.
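To make the "canBeVoter" idea above concrete, here is a minimal sketch of what that startup validation could look like. This is purely illustrative: the class and method names (VoterSetValidator, validateReplicaId) are hypothetical and not part of the KIP or the KafkaRaftClient API.

```java
import java.util.Set;

final class VoterSetValidator {
    // Hypothetical sketch: a replica that cannot be a voter (e.g. a broker)
    // must not appear in the current voter set. If it does, the node should
    // fail startup rather than run with a conflicting ID.
    static void validateReplicaId(int replicaId, boolean canBeVoter, Set<Integer> voterIds) {
        if (!canBeVoter && voterIds.contains(replicaId)) {
            throw new IllegalStateException(
                "Replica " + replicaId + " is configured as a broker but is a " +
                "member of the voter set; shutting down");
        }
    }
}
```

The check replaces the static validation against controller.quorum.voters: since that configuration is empty in the dynamic case, the voter set read from the log is consulted instead.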
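For reference, the two discovery styles discussed above can be sketched as configuration fragments. The host names and ports here are made-up example values; only one of the two properties would be set on a given node.

```properties
# Static configuration (KIP-595): every node lists the full voter set
# as id@host:port entries.
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093

# Dynamic configuration (this KIP): nodes only need endpoints for
# discovering the leader; the voter set itself is replicated in the
# cluster metadata log via AddVoterRecords.
controller.quorum.bootstrap.servers=controller1:9093,controller2:9093,controller3:9093
```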
> Does a leader using the static configuration auto-remove voters that aren't
> in that static config, as well as auto-add? The adding behavior is spelled
> out, but not removing (or maybe I missed it).

Yes. I don't think that the leader should auto-remove voters. It is not safe to change the controller.quorum.voters field after the active controller has written records to the cluster metadata topic partition. I updated the bootstrap section to mention that the replica will fail to start and shut down if the cluster metadata topic partition contains AddVoterRecords for replica IDs that are not in the controller.quorum.voters configuration.

Thank you!
--
-José