Hey Jun,

> Thanks for the response. Should we make clusterId a nullable field consistently in all new requests?
Yes, makes sense. I updated the proposal.

-Jason

On Fri, Jul 31, 2020 at 4:17 PM Jun Rao <j...@confluent.io> wrote:

> Hi, Jason,
>
> Thanks for the response. Should we make clusterId a nullable field consistently in all new requests?
>
> Jun
>
> On Wed, Jul 29, 2020 at 12:20 PM Jason Gustafson <ja...@confluent.io> wrote:
>
>> Hey Jun,
>>
>> I added a section on "Cluster Bootstrapping" which discusses clusterId generation and the process through which brokers find the current leader. The quick summary is that the first controller will be responsible for generating the clusterId and persisting it in the metadata log. Before the first leader has been elected, quorum APIs will skip clusterId validation. This seems reasonable since the check is primarily intended to prevent damage from misconfiguration after a cluster has been running for some time. Upon startup, brokers begin by sending Fetch requests to find the current leader. This will include the cluster.id from meta.properties if it is present. The broker will shut down immediately if it receives INVALID_CLUSTER_ID in the Fetch response.
>>
>> I also added some details about our testing strategy, which you asked about previously.
>>
>> Thanks,
>> Jason
>>
>> On Mon, Jul 27, 2020 at 10:46 PM Boyang Chen <reluctanthero...@gmail.com> wrote:
>>
>>> On Mon, Jul 27, 2020 at 4:58 AM Unmesh Joshi <unmeshjo...@gmail.com> wrote:
>>>
>>>> Just checked etcd and ZooKeeper code, and both support the leader stepping down to a follower to make sure there are no two leaders if the leader has been disconnected from the majority of the followers. For etcd this is https://github.com/etcd-io/etcd/issues/3866 and for ZooKeeper it's https://issues.apache.org/jira/browse/ZOOKEEPER-1699. I was just thinking it might be difficult to implement in the pull-based model, but I guess not. It is possibly the same way the ISR list is managed currently: if the leader of the controller quorum loses the majority of its followers, it should step down and become a follower, thereby telling clients in time that it was disconnected from the quorum rather than continuing to send stale metadata to them.
>>>>
>>>> Thanks,
>>>> Unmesh
>>>>
>>>> On Mon, Jul 27, 2020 at 9:31 AM Unmesh Joshi <unmeshjo...@gmail.com> wrote:
>>>>
>>>>>> Could you clarify on this question? Which part of the raft group doesn't know about leader disconnection?
>>>>>
>>>>> The leader of the controller quorum is partitioned from the controller cluster, and a different leader is elected for the remaining controller cluster.
>>>
>>> I see your concern. For the KIP-595 implementation, since there are no regular heartbeats sent from the leader to the followers, we decided to piggy-back on the fetch timeout, so that if the leader did not receive Fetch requests from a majority of the quorum for that amount of time, it would begin a new election and start sending VoteRequest to voter nodes in the cluster to understand the latest quorum. You could find more details in this section <https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote>.
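A minimal sketch of the fetch-timeout check described just above; the class and method names are illustrative assumptions, not actual KIP-595 code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: the leader infers liveness from Fetch traffic. If it has not seen a
// Fetch from a majority of the voters within the fetch timeout, it assumes it may be
// partitioned, gives up leadership and starts a new election.
public class FetchTimeoutSketch {
    private final int voterCount;       // size of the voter set, including the leader itself
    private final long fetchTimeoutMs;  // the assumed quorum fetch timeout
    private final Map<Integer, Long> lastFetchTimeMs = new ConcurrentHashMap<>();

    public FetchTimeoutSketch(int voterCount, long fetchTimeoutMs) {
        this.voterCount = voterCount;
        this.fetchTimeoutMs = fetchTimeoutMs;
    }

    /** Called by the leader whenever it handles a Fetch request from a voter. */
    public void onFetchFromVoter(int voterId, long nowMs) {
        lastFetchTimeMs.put(voterId, nowMs);
    }

    /** Called periodically by the leader; true means it should start a new election. */
    public boolean shouldStartNewElection(long nowMs) {
        long recentVoters = lastFetchTimeMs.values().stream()
                .filter(lastMs -> nowMs - lastMs <= fetchTimeoutMs)
                .count();
        // Count the leader itself as one member of the quorum.
        return recentVoters + 1 < voterCount / 2 + 1;
    }
}
```

The point of the design is simply that liveness is inferred from the existing Fetch traffic rather than from a separate heartbeat RPC.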
>>>>> I think there are two things here:
>>>>> 1. The old leader will not know that it's disconnected from the rest of the controller quorum cluster unless it receives BeginQuorumEpoch from the new leader. So it will keep on serving stale metadata to the clients (brokers, producers and consumers).
>>>>> 2. I assume the broker leases will be managed on the controller quorum leader. This partitioned leader will keep on tracking the broker leases it has, while the new leader of the quorum will also start managing broker leases. So while the quorum leader is partitioned, there will be two membership views of the Kafka brokers managed on two leaders. Unless broker heartbeats are also replicated as part of the Raft log, there is no way to solve this? I know the LogCabin implementation does replicate client heartbeats. I suspect that the same issue is there in ZooKeeper, which does not replicate client Ping requests.
>>>>>
>>>>> Thanks,
>>>>> Unmesh
>>>>>
>>>>> On Mon, Jul 27, 2020 at 6:23 AM Boyang Chen <reluctanthero...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the questions Unmesh!
>>>>>>
>>>>>> On Sun, Jul 26, 2020 at 6:18 AM Unmesh Joshi <unmeshjo...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> In the FetchRequest handling, how do we make sure we handle scenarios where the leader might have been disconnected from the cluster, but doesn't know it yet?
>>>>>>
>>>>>> Could you clarify on this question? Which part of the raft group doesn't know about leader disconnection?
>>>>>>
>>>>>>> As discussed in the Raft thesis section 6.4, the linearizable semantics of read requests is implemented in LogCabin by sending a heartbeat to followers and waiting till the heartbeats are successful, to make sure that the leader is still the leader. I think for the controller quorum to make sure none of the consumers get stale data, it's important to have linearizable semantics? In the pull-based model, the leader will need to wait for heartbeats from the followers before returning each fetch request from the consumer then? Or do we need to introduce some other request? (ZooKeeper does not have linearizable semantics for read requests, but as of now all the Kafka interactions are through writes and watches.)
>>>>>>
>>>>>> This is a very good question. For our v1 implementation we are not aiming to guarantee linearizable reads, which would be considered as a follow-up effort. Note that today in Kafka there is no guarantee on the metadata freshness either, so no regression is introduced.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Unmesh
>>>>>>>
>>>>>>> On Fri, Jul 24, 2020 at 11:36 PM Jun Rao <j...@confluent.io> wrote:
>>>>>>>
>>>>>>>> Hi, Jason,
>>>>>>>>
>>>>>>>> Thanks for the reply.
>>>>>>>>
>>>>>>>> 101. Sounds good. Regarding clusterId, I am not sure storing it in the metadata log is enough. For example, the vote request includes clusterId. So, no one can vote until they know the clusterId. Also, it would be useful to support the case when a voter completely loses its disk and needs to recover.
>>>>>>>>
>>>>>>>> 210. There is no longer a FindQuorum request. When a follower restarts, how does it discover the leader? Is that based on DescribeQuorum? It would be useful to document this.
>>>>>>>>
>>>>>>>> Jun
>>>>>>>>
>>>>>>>> On Fri, Jul 17, 2020 at 2:15 PM Jason Gustafson <ja...@confluent.io> wrote:
>>>>>>>>
>>>>>>>>> Hi Jun,
>>>>>>>>>
>>>>>>>>> Thanks for the questions.
>>>>>>>>>
>>>>>>>>> 101. I am treating some of the bootstrapping problems as out of the scope of this KIP. I am working on a separate proposal which addresses bootstrapping security credentials specifically. Here is a rough sketch of how I am seeing it:
>>>>>>>>>
>>>>>>>>> 1. Dynamic broker configurations including encrypted passwords will be persisted in the metadata log and cached in the broker's `meta.properties` file.
>>>>>>>>> 2. We will provide a tool which allows users to directly override the values in `meta.properties` without requiring access to the quorum. This can be used to bootstrap the credentials of the voter set itself before the cluster has been started.
>>>>>>>>> 3. Some dynamic config changes will only be allowed when a broker is online. For example, changing a truststore password dynamically would prevent that broker from being able to start if it were offline when the change was made.
>>>>>>>>> 4. I am still thinking a little bit about SCRAM credentials, but most likely they will be handled with an approach similar to `meta.properties`.
>>>>>>>>>
>>>>>>>>> 101.3 As for the question about `clusterId`, I think the way we would do this is to have the first elected leader generate a UUID and write it to the metadata log. Let me add some detail to the proposal about this.
>>>>>>>>>
>>>>>>>>> A few additional answers below:
>>>>>>>>>
>>>>>>>>> 203. Yes, that is correct.
>>>>>>>>>
>>>>>>>>> 204. That is a good question. What happens in this case is that all voters advance their epoch to the one designated by the candidate even if they reject its vote request. Assuming the candidate fails to be elected, the election will be retried until a leader emerges.
>>>>>>>>> 205. I had some discussion with Colin offline about this problem. I think the answer should be "yes," but it probably needs a little more thought. Handling JBOD failures is tricky. For an observer, we can replicate the metadata log from scratch safely in a new log dir. But if the log dir of a voter fails, I do not think it is generally safe to start from an empty state.
>>>>>>>>>
>>>>>>>>> 206. Yes, that is discussed in KIP-631 I believe.
>>>>>>>>>
>>>>>>>>> 207. Good suggestion. I will work on this.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jason
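To make 204 above concrete, a rough sketch of a voter handling a vote request, where the epoch advances even when the vote is rejected; the names and the simplified vote-granting rule are assumptions, not the actual implementation:

```java
// Sketch only: a voter advances its epoch to the candidate's epoch even when it
// rejects the vote (see 204 above). The vote-granting rule here is simplified.
public class VoteHandlingSketch {
    private int currentEpoch;
    private int votedId = -1;       // -1 means no vote granted yet in the current epoch
    private final long logEndOffset;

    public VoteHandlingSketch(int currentEpoch, long logEndOffset) {
        this.currentEpoch = currentEpoch;
        this.logEndOffset = logEndOffset;
    }

    /** Returns true if the vote is granted; the epoch may advance either way. */
    public synchronized boolean handleVoteRequest(int candidateId, int candidateEpoch,
                                                  long candidateLogEndOffset) {
        if (candidateEpoch < currentEpoch) {
            return false;                    // stale candidate: reject, keep local state
        }
        if (candidateEpoch > currentEpoch) {
            currentEpoch = candidateEpoch;   // advance the epoch even if we reject below
            votedId = -1;
        }
        boolean logIsCurrent = candidateLogEndOffset >= logEndOffset;
        boolean alreadyVotedForOther = votedId != -1 && votedId != candidateId;
        if (logIsCurrent && !alreadyVotedForOther) {
            votedId = candidateId;           // grant the vote
            return true;
        }
        return false;                        // reject, but the bumped epoch is kept
    }
}
```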
>>>>>>>>> On Thu, Jul 16, 2020 at 3:44 PM Jun Rao <j...@confluent.io> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, Jason,
>>>>>>>>>>
>>>>>>>>>> Thanks for the updated KIP. Looks good overall. A few more comments below.
>>>>>>>>>>
>>>>>>>>>> 101. I still don't see a section on bootstrapping related issues. It would be useful to document if/how the following is supported.
>>>>>>>>>> 101.1 Currently, we support auto broker id generation. Is this supported for bootstrap brokers?
>>>>>>>>>> 101.2 As Colin mentioned, sometimes we may need to load the security credentials to the broker before it can be connected to. Could you provide a bit more detail on how this will work?
>>>>>>>>>> 101.3 Currently, we use ZK to generate clusterId on a new cluster. With Raft, how does every broker generate the same clusterId in a distributed way?
>>>>>>>>>>
>>>>>>>>>> 200. It would be useful to document whether the various special offsets (log start offset, recovery point, HWM, etc.) for the Raft log are stored in the same existing checkpoint files or not.
>>>>>>>>>> 200.1 Since the Raft log flushes every append, does that allow us to recover from a recovery point within the active segment, or do we still need to scan the full segment including the recovery point? The former can be tricky since multiple records can fall into the same disk page and a subsequent flush may corrupt a page with previously flushed records.
>>>>>>>>>>
>>>>>>>>>> 201. Configurations.
>>>>>>>>>> 201.1 How do the Raft brokers get security related configs for inter broker communication? Is that based on the existing inter.broker.security.protocol?
>>>>>>>>>> 201.2 We have quorum.retry.backoff.max.ms and quorum.retry.backoff.ms, but only quorum.election.backoff.max.ms. This seems a bit inconsistent.
>>>>>>>>>>
>>>>>>>>>> 202. Metrics:
>>>>>>>>>> 202.1 TotalTimeMs, InboundQueueTimeMs, HandleTimeMs, OutboundQueueTimeMs: Are those the same as the existing totalTime, requestQueueTime, localTime, responseQueueTime? Could we reuse the existing ones with the tag request=[request-type]?
>>>>>>>>>> 202.2 Could you explain what InboundChannelSize and OutboundChannelSize are?
>>>>>>>>>> 202.3 ElectionLatencyMax/Avg: It seems that both should be windowed?
>>>>>>>>>>
>>>>>>>>>> 203. Quorum State: I assume that LeaderId will be kept consistent with LeaderEpoch. For example, if a follower transitions to candidate and bumps up LeaderEpoch, it will set leaderId to -1 and persist both in the Quorum state file. Is that correct?
>>>>>>>>>>
>>>>>>>>>> 204. I was thinking about a corner case when a Raft broker is partitioned off. This broker will then be in a continuous loop of bumping up the leader epoch, but failing to get enough votes. When the partitioning is removed, this broker's high leader epoch will force a leader election. I assume other Raft brokers can immediately advance their leader epoch past the already bumped epoch such that leader election won't be delayed. Is that right?
>>>>>>>>>>
>>>>>>>>>> 205. In a JBOD setting, could we use the existing tool to move the Raft log from one disk to another?
>>>>>>>>>>
>>>>>>>>>> 206. The KIP doesn't mention the local metadata store derived from the Raft log. Will that be covered in a separate KIP?
>>>>>>>>>>
>>>>>>>>>> 207. Since this is a critical component, could we add a section on the testing plan for correctness?
>>>>>>>>>>
>>>>>>>>>> 208. Performance. Do we plan to do group commit (e.g. buffer pending appends during a flush and then flush all accumulated pending records together in the next flush) for better throughput?
>>>>>>>>>>
>>>>>>>>>> 209. "the leader can actually defer fsync until it knows "quorum.size - 1" has get to a certain entry offset." Why is that "quorum.size - 1" instead of the majority of the quorum?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Jun
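For 203 above, a minimal sketch of the kind of state transition being discussed, where the epoch bump and the cleared leader id are persisted together; the field names and the key=value file layout are placeholders, not the actual quorum-state checkpoint format:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch only: a follower transitioning to candidate bumps the epoch, clears the
// leader id and votes for itself, persisting all of it together.
public class QuorumStateSketch {
    private final Path stateFile;
    private int leaderEpoch;
    private int leaderId = -1;
    private int votedId = -1;

    public QuorumStateSketch(Path stateFile, int initialEpoch) {
        this.stateFile = stateFile;
        this.leaderEpoch = initialEpoch;
    }

    public synchronized void transitionToCandidate(int localId) throws IOException {
        leaderEpoch += 1;     // bump the epoch for the new election
        leaderId = -1;        // no known leader in the new epoch
        votedId = localId;    // a candidate votes for itself
        persist();            // LeaderEpoch and LeaderId are written together
    }

    private void persist() throws IOException {
        String content = String.format("leaderEpoch=%d%nleaderId=%d%nvotedId=%d%n",
                leaderEpoch, leaderId, votedId);
        Path tmp = stateFile.resolveSibling(stateFile.getFileName() + ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, stateFile, StandardCopyOption.REPLACE_EXISTING);
    }
}
```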
>>>>>>>>>> On Mon, Jul 13, 2020 at 9:43 AM Jason Gustafson <ja...@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> Just a quick update on the proposal. We have decided to move quorum reassignment to a separate KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-642%3A+Dynamic+quorum+reassignment. The way this ties into cluster bootstrapping is complicated, so we felt we needed a bit more time for validation. That leaves the core of this proposal as quorum-based replication. If there are no further comments, we will plan to start a vote later this week.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jason
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 24, 2020 at 10:43 AM Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Jun Rao <jun...@gmail.com>
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding your comment about log compaction: after some deep-diving into this we've decided to propose a new snapshot-based log cleaning mechanism which would be used to replace the current compaction mechanism for this metadata log. A new KIP will be proposed specifically for this idea.
>>>>>>>>>>>>
>>>>>>>>>>>> All,
>>>>>>>>>>>>
>>>>>>>>>>>> I've updated the KIP wiki a bit, renaming one config from "election.jitter.max.ms" to "election.backoff.max.ms" to make its usage clearer: the configured value will be the upper bound of the binary exponential backoff time after a failed election, before starting a new one.
>>>>>>>>>>>>
>>>>>>>>>>>> Guozhang
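A small sketch of the binary exponential backoff that election.backoff.max.ms bounds, as described above; the base backoff parameter and helper names are assumptions for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch only: the wait before retrying a failed election doubles with each attempt,
// capped by election.backoff.max.ms, with the actual delay drawn uniformly below the cap.
public class ElectionBackoffSketch {
    private final long electionBackoffMaxMs;  // corresponds to election.backoff.max.ms
    private final long baseBackoffMs;         // assumed starting bound, e.g. 50 ms

    public ElectionBackoffSketch(long electionBackoffMaxMs, long baseBackoffMs) {
        this.electionBackoffMaxMs = electionBackoffMaxMs;
        this.baseBackoffMs = baseBackoffMs;
    }

    /** Backoff to apply before election attempt number `attempt` (1 = first retry). */
    public long nextBackoffMs(int attempt) {
        long exponentialBound = baseBackoffMs * (1L << Math.min(attempt - 1, 20)); // avoid overflow
        long bound = Math.min(exponentialBound, electionBackoffMaxMs);
        return ThreadLocalRandom.current().nextLong(bound + 1);                    // jitter in [0, bound]
    }
}
```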
>>>>>>>>>>>> On Fri, Jun 12, 2020 at 9:26 AM Boyang Chen <reluctanthero...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the suggestions Guozhang.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jun 11, 2020 at 2:51 PM Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello Boyang,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the updated information. A few questions here:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Should the quorum-file also be updated to support multi-raft?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm neutral about this, as we don't know yet how the multi-raft modules would behave. If we have different threads operating different raft groups, consolidating the `checkpoint` files seems not reasonable. We could always add a `multi-quorum-file` later if possible.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) In the previous proposal, there were fields in the FetchQuorumRecords like latestDirtyOffset; is that dropped intentionally?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I dropped the latestDirtyOffset since it is associated with the log compaction discussion. This is beyond this KIP's scope and we could potentially get a separate KIP to talk about it.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3) I think we also need to elaborate in a bit more detail on when to send the metadata request and discover-brokers; currently we only discussed how these requests would be sent during bootstrap. I think the following scenarios would also need these requests:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3.a) As long as a broker does not know the current quorum (including the leader and the voters), it should continue to periodically ask other brokers via "metadata".
>>>>>>>>>>>>>> 3.b) As long as a broker does not know all the current quorum voters' connections, it should continue to periodically ask other brokers via "discover-brokers".
>>>>>>>>>>>>>> 3.c) When the leader's fetch timeout has elapsed, it should send a metadata request.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Makes sense, will add to the KIP.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jun 10, 2020 at 5:20 PM Boyang Chen <reluctanthero...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Following up on the previous email, we made some more updates:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. The Alter/DescribeQuorum RPCs are also re-structured to use multi-raft.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. We add observer status into the DescribeQuorumResponse as we see it is a low hanging fruit which is very useful for user debugging and reassignment.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3. The FindQuorum RPC is replaced with the DiscoverBrokers RPC, which is purely in charge of discovering broker connections in a gossip manner. The quorum leader discovery is piggy-backed on the Metadata RPC for the topic partition leader, which in our case is the single metadata partition for version one.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let me know if you have any questions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Boyang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jun 9, 2020 at 11:01 PM Boyang Chen <reluctanthero...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the great discussions so far. I'm posting some KIP updates from our working group discussion:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. We will be changing the core RPCs from a single-raft API to multi-raft. This means all protocols will be "batch" in the first version, but the KIP itself only illustrates the design for a single metadata topic partition. The reason is to "keep the door open" for future extensions of this piece of the module, such as a sharded controller or general quorum-based topic replication, beyond the current Kafka replication protocol.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. We will piggy-back on the current Kafka Fetch API instead of inventing a new FetchQuorumRecords RPC. The motivation is about the same as #1, as well as making the integration work easier, instead of letting two similar RPCs diverge.
>>>>>>>>>>>>>>>> 3. In the EndQuorumEpoch protocol, instead of only sending the request to the most caught-up voter, we shall broadcast the information to all voters, with a sorted voter list in descending order of their corresponding replicated offset. In this way, the top voter will become a candidate immediately, while the other voters shall wait for an exponential back-off to trigger elections, which helps ensure the top voter gets elected, and the election eventually happens when the top voter is not responsive.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please see the updated KIP and post any questions or concerns on the mailing thread.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Boyang
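Update 3 above can be sketched roughly as follows: the resigning leader ranks voters by replicated offset, and each voter derives its election delay from its position in that ranking. The names and the backoff formula are illustrative assumptions only:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: the resigning leader ranks voters by replicated offset (highest first);
// the top-ranked voter elects immediately while lower-ranked voters back off longer.
public class EndQuorumEpochSketch {

    /** Voter ids ordered by their replicated end offset, descending. */
    public static List<Integer> preferredSuccessors(Map<Integer, Long> replicatedOffsetByVoter) {
        return replicatedOffsetByVoter.entrySet().stream()
                .sorted(Map.Entry.<Integer, Long>comparingByValue(Comparator.reverseOrder()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** Election delay for a voter based on its position in the successor list. */
    public static long electionBackoffMs(List<Integer> successors, int localId, long baseBackoffMs) {
        int position = successors.indexOf(localId);
        if (position <= 0) {
            return 0L;  // the most caught-up voter (or an unknown one) starts an election right away
        }
        return baseBackoffMs * (1L << Math.min(position, 20));  // grows with the ranking position
    }
}
```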
>>>>>>>>>>>>>>>> On Fri, May 8, 2020 at 5:26 PM Jun Rao <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi, Guozhang and Jason,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for the reply. A couple more replies.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 102. Still not sure about this. How is the tombstone issue addressed for the non-voter and the observer? They can die at any point and restart at an arbitrary later time, and the advancing of the firstDirty offset and the removal of the tombstone can happen independently.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 106. I agree that it would be less confusing if we used "epoch" instead of "leader epoch" consistently.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jun
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, May 7, 2020 at 4:04 PM Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks Jun. Further replies are in-lined.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, May 4, 2020 at 11:58 AM Jun Rao <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi, Guozhang,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for the reply. A few more replies inlined below.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sun, May 3, 2020 at 6:33 PM Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hello Jun,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for your comments! I'm replying inline below:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, May 1, 2020 at 12:36 PM Jun Rao <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 101. Bootstrapping related issues.
>>>>>>>>>>>>>>>>>>>>> 101.1 Currently, we support auto broker id generation. Is this supported for bootstrap brokers?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The vote ids would just be the broker ids. "bootstrap.servers" would be similar to what client configs have today, where "quorum.voters" would be pre-defined config values.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> My question was on the auto generated broker id. Currently, the broker can choose to have its broker id auto generated. The generation is done through ZK to guarantee uniqueness. Without ZK, it's not clear how the broker id is auto generated. "quorum.voters" also can't be set statically if broker ids are auto generated.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Jason has explained some ideas that we've discussed so far; the reason we intentionally did not include them is that we feel it is outside the scope of KIP-595. Under the umbrella of KIP-500 we should definitely address them though.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On the high level, our belief is that "joining a quorum" and "joining (or more specifically, registering brokers in) the cluster" would be de-coupled a bit, where the former should be completed before we do the latter. More specifically, assuming the quorum is already up and running, after the newly started broker has found the leader of the quorum it can send a specific RegisterBroker request including its listener / protocol / etc., and upon handling it the leader can send back the uniquely generated broker id to the new broker, while also executing the "startNewBroker" callback as the controller.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 102. Log compaction. One weak spot of log compaction is for the consumer to deal with deletes. When a key is deleted, it's retained as a tombstone first and then physically removed. If a client misses the tombstone (because it's physically removed), it may not be able to update its metadata properly. The way we solve this in Kafka is based on a configuration (log.cleaner.delete.retention.ms) and we expect a consumer having seen an old key to finish reading the deletion tombstone within that time. There is no strong guarantee for that since a broker could be down for a long time. It would be better if we can have a more reliable way of dealing with deletes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We propose to capture this in the "FirstDirtyOffset" field of the quorum record fetch response: the offset is the maximum offset that log compaction has reached up to. If the follower has fetched beyond this offset, it means it is safe since it has seen all records up to that offset. On getting the response, the follower can then decide if its end offset is actually below that dirty offset (and hence it may miss some tombstones). If that's the case:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1) Naively, it could re-bootstrap the metadata log from the very beginning to catch up.
>>>>>>>>>>>>>>>>>>>> 2) During that time, it would refrain from answering MetadataRequest from any clients.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am not sure that the "FirstDirtyOffset" field fully addresses the issue. Currently, the deletion tombstone is not removed immediately after a round of cleaning. It's removed after a delay in a subsequent round of cleaning. Consider an example where a key insertion is at offset 200 and a deletion tombstone of the key is at 400. Initially, FirstDirtyOffset is at 300. A follower/observer fetches from offset 0 and fetches the key at offset 200. A few rounds of cleaning happen. FirstDirtyOffset is at 500 and the tombstone at 400 is physically removed. The follower/observer continues the fetch, but misses offset 400. It catches up all the way to FirstDirtyOffset and declares its metadata as ready. However, its metadata could be stale since it actually missed the deletion of the key.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yeah, good question, I should have put more details in my explanation :)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The idea is that we will adjust the log compaction for this raft-based metadata log. Before getting into more details: we have two types of "watermarks" here. In Kafka the watermark indicates where every replica has replicated up to, whereas in Raft the watermark indicates where the majority of replicas (here only counting voters of the quorum, not observers) have replicated up to. Let's call them the Kafka watermark and the Raft watermark. For this special log, we would maintain both watermarks.
>>>>>>>>>>>>>>>>>> When log compacting on the leader, we would only compact up to the Kafka watermark, i.e. if there is at least one voter that has not replicated an entry, it would not be compacted. The "dirty-offset" is the offset that we've compacted up to and is communicated to the other voters, and the other voters would also compact up to this value --- i.e. the difference here is that instead of letting each replica do log compaction independently, we'll have the leader decide which offset to compact to and propagate this value to the others to follow, in a more coordinated manner. Also note that when there are new voters joining the quorum who have not replicated up to the dirty-offset, or who because of other issues have truncated their logs to below the dirty-offset, they'd have to re-bootstrap from the beginning; during this period of time the leader, having learned about the lagging voter, would not advance the watermark (nor would it decrement it), and hence would not compact either, until the voter(s) have caught up to that dirty-offset.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So back to your example above: before the bootstrapping voter gets to 300, no log compaction would happen on the leader; and only later, when the voter has got beyond 400 and hence replicated that tombstone, would the log compaction possibly get to that tombstone and remove it. Say later the leader's log compaction reaches 500; it can send this back to the voter, which can then also compact locally up to 500.
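A minimal sketch of the coordinated compaction rule described above, assuming the leader tracks each voter's replicated end offset; the method names are illustrative, not the actual implementation:

```java
import java.util.Collection;
import java.util.Collections;

// Sketch only: the leader compacts no further than the offset every voter has replicated
// (the "Kafka watermark" above), and a follower that is behind the propagated dirty
// offset re-bootstraps from the beginning rather than risk missing tombstones.
public class CoordinatedCompactionSketch {

    /** Leader side: the compaction bound is the minimum replicated offset across the voters. */
    public static long compactionBound(long leaderLogEndOffset, Collection<Long> voterEndOffsets) {
        long minVoterOffset = voterEndOffsets.isEmpty()
                ? leaderLogEndOffset
                : Collections.min(voterEndOffsets);
        return Math.min(leaderLogEndOffset, minVoterOffset);
    }

    /** Follower side: decide whether to re-bootstrap after learning the leader's dirty offset. */
    public static boolean mustRebootstrap(long localEndOffset, long leaderDirtyOffset) {
        // Anything below the dirty offset may already have had its tombstones removed.
        return localEndOffset < leaderDirtyOffset;
    }
}
```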
> > > > >> > > > > > > Say > > > > >> > > > > > > > > > >> later it > > > > >> > > > > > > > > > >> > the leader's log compaction reaches 500, it > > can > > > > >> send > > > > >> > > this > > > > >> > > > > back > > > > >> > > > > > > to > > > > >> > > > > > > > > the > > > > >> > > > > > > > > > >> voter > > > > >> > > > > > > > > > >> > who can then also compact locally up to > 500. > > > > >> > > > > > > > > > >> > > > > > >> > > > > > > > > > >> > > > > > >> > > > > > > > > > >> > > > > 105. Quorum State: In addition to > > > VotedId, > > > > >> do we > > > > >> > > > need > > > > >> > > > > > the > > > > >> > > > > > > > > epoch > > > > >> > > > > > > > > > >> > > > > corresponding to VotedId? Over time, > > the > > > > same > > > > >> > > broker > > > > >> > > > > Id > > > > >> > > > > > > > could > > > > >> > > > > > > > > be > > > > >> > > > > > > > > > >> > voted > > > > >> > > > > > > > > > >> > > in > > > > >> > > > > > > > > > >> > > > > different generations with different > > > epoch. > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > > >> > > > Hmm, this is a good point. Originally I > > > think > > > > >> the > > > > >> > > > > > > > "LeaderEpoch" > > > > >> > > > > > > > > > >> field > > > > >> > > > > > > > > > >> > in > > > > >> > > > > > > > > > >> > > > that file is corresponding to the > "latest > > > > known > > > > >> > > leader > > > > >> > > > > > > epoch", > > > > >> > > > > > > > > not > > > > >> > > > > > > > > > >> the > > > > >> > > > > > > > > > >> > > > "current leader epoch". For example, if > > the > > > > >> > current > > > > >> > > > > epoch > > > > >> > > > > > is > > > > >> > > > > > > > N, > > > > >> > > > > > > > > > and > > > > >> > > > > > > > > > >> > then > > > > >> > > > > > > > > > >> > > a > > > > >> > > > > > > > > > >> > > > vote-request with epoch N+1 is received > > and > > > > the > > > > >> > > voter > > > > >> > > > > > > granted > > > > >> > > > > > > > > the > > > > >> > > > > > > > > > >> vote > > > > >> > > > > > > > > > >> > > for > > > > >> > > > > > > > > > >> > > > it, then it means for this voter it > knows > > > the > > > > >> > > "latest > > > > >> > > > > > epoch" > > > > >> > > > > > > > is > > > > >> > > > > > > > > N > > > > >> > > > > > > > > > + > > > > >> > > > > > > > > > >> 1 > > > > >> > > > > > > > > > >> > > > although it is unknown if that sending > > > > >> candidate > > > > >> > > will > > > > >> > > > > > indeed > > > > >> > > > > > > > > > become > > > > >> > > > > > > > > > >> the > > > > >> > > > > > > > > > >> > > new > > > > >> > > > > > > > > > >> > > > leader (which would only be notified > via > > > > >> > > begin-quorum > > > > >> > > > > > > > request). > > > > >> > > > > > > > > > >> > However, > > > > >> > > > > > > > > > >> > > > when persisting the quorum state, we > > would > > > > >> encode > > > > >> > > > > > > leader-epoch > > > > >> > > > > > > > > to > > > > >> > > > > > > > > > >> N+1, > > > > >> > > > > > > > > > >> > > > while the leaderId to be the older > > leader. 
> > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > >> > > > But now thinking about this a bit > more, I > > > > feel > > > > >> we > > > > >> > > > should > > > > >> > > > > > use > > > > >> > > > > > > > two > > > > >> > > > > > > > > > >> > separate > > > > >> > > > > > > > > > >> > > > epochs, one for the "lates known" and > one > > > for > > > > >> the > > > > >> > > > > > "current" > > > > >> > > > > > > to > > > > >> > > > > > > > > > pair > > > > >> > > > > > > > > > >> > with > > > > >> > > > > > > > > > >> > > > the leaderId. I will update the wiki > > page. > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > >> > > Hmm, it's kind of weird to bump up the > > leader > > > > >> epoch > > > > >> > > > before > > > > >> > > > > > the > > > > >> > > > > > > > new > > > > >> > > > > > > > > > >> leader > > > > >> > > > > > > > > > >> > > is actually elected, right. > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > > >> > > > > 106. "OFFSET_OUT_OF_RANGE: Used in > the > > > > >> > > > > > FetchQuorumRecords > > > > >> > > > > > > > API > > > > >> > > > > > > > > to > > > > >> > > > > > > > > > >> > > indicate > > > > >> > > > > > > > > > >> > > > > that the follower has fetched from an > > > > invalid > > > > >> > > offset > > > > >> > > > > and > > > > >> > > > > > > > > should > > > > >> > > > > > > > > > >> > > truncate > > > > >> > > > > > > > > > >> > > > to > > > > >> > > > > > > > > > >> > > > > the offset/epoch indicated in the > > > > response." > > > > >> > > > Observers > > > > >> > > > > > > can't > > > > >> > > > > > > > > > >> truncate > > > > >> > > > > > > > > > >> > > > their > > > > >> > > > > > > > > > >> > > > > logs. What should they do with > > > > >> > > OFFSET_OUT_OF_RANGE? > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > > >> > > > I'm not sure if I understand your > > question? > > > > >> > > Observers > > > > >> > > > > > should > > > > >> > > > > > > > > still > > > > >> > > > > > > > > > >> be > > > > >> > > > > > > > > > >> > > able > > > > >> > > > > > > > > > >> > > > to truncate their logs as well. > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > >> > > Hmm, I thought only the quorum nodes have > > > local > > > > >> logs > > > > >> > > and > > > > >> > > > > > > > observers > > > > >> > > > > > > > > > >> don't? > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > > >> > > > 107. "The leader will continue sending > > > > >> > > > BeginQuorumEpoch > > > > >> > > > > to > > > > >> > > > > > > > each > > > > >> > > > > > > > > > >> known > > > > >> > > > > > > > > > >> > > > voter > > > > >> > > > > > > > > > >> > > > > until it has received its > endorsement." > > > If > > > > a > > > > >> > voter > > > > >> > > > is > > > > >> > > > > > down > > > > >> > > > > > > > > for a > > > > >> > > > > > > > > > >> long > > > > >> > > > > > > > > > >> > > > time, > > > > >> > > > > > > > > > >> > > > > sending BeginQuorumEpoch seems to add > > > > >> > unnecessary > > > > >> > > > > > > overhead. 
> > > > >> > > > > > > > > > >> > Similarly, > > > > >> > > > > > > > > > >> > > > if a > > > > >> > > > > > > > > > >> > > > > follower stops sending > > > FetchQuorumRecords, > > > > >> does > > > > >> > > the > > > > >> > > > > > leader > > > > >> > > > > > > > > keep > > > > >> > > > > > > > > > >> > sending > > > > >> > > > > > > > > > >> > > > > BeginQuorumEpoch? > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > >> > > > Regarding BeginQuorumEpoch: that is a > > good > > > > >> point. > > > > >> > > The > > > > >> > > > > > > > > > >> > begin-quorum-epoch > > > > >> > > > > > > > > > >> > > > request is for voters to quickly get > the > > > new > > > > >> > leader > > > > >> > > > > > > > information; > > > > >> > > > > > > > > > >> > however > > > > >> > > > > > > > > > >> > > > even if they do not get them they can > > still > > > > >> > > eventually > > > > >> > > > > > learn > > > > >> > > > > > > > > about > > > > >> > > > > > > > > > >> that > > > > >> > > > > > > > > > >> > > > from others via gossiping FindQuorum. I > > > think > > > > >> we > > > > >> > can > > > > >> > > > > > adjust > > > > >> > > > > > > > the > > > > >> > > > > > > > > > >> logic > > > > >> > > > > > > > > > >> > to > > > > >> > > > > > > > > > >> > > > e.g. exponential back-off or with a > > limited > > > > >> > > > num.retries. > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > >> > > > Regarding FetchQuorumRecords: if the > > > follower > > > > >> > sends > > > > >> > > > > > > > > > >> FetchQuorumRecords > > > > >> > > > > > > > > > >> > > > already, it means that follower already > > > knows > > > > >> that > > > > >> > > the > > > > >> > > > > > > broker > > > > >> > > > > > > > is > > > > >> > > > > > > > > > the > > > > >> > > > > > > > > > >> > > > leader, and hence we can stop retrying > > > > >> > > > BeginQuorumEpoch; > > > > >> > > > > > > > however > > > > >> > > > > > > > > > it > > > > >> > > > > > > > > > >> is > > > > >> > > > > > > > > > >> > > > possible that after a follower sends > > > > >> > > > FetchQuorumRecords > > > > >> > > > > > > > already, > > > > >> > > > > > > > > > >> > suddenly > > > > >> > > > > > > > > > >> > > > it stops send it (possibly because it > > > learned > > > > >> > about > > > > >> > > a > > > > >> > > > > > higher > > > > >> > > > > > > > > epoch > > > > >> > > > > > > > > > >> > > leader), > > > > >> > > > > > > > > > >> > > > and hence this broker may be a "zombie" > > > > leader > > > > >> and > > > > >> > > we > > > > >> > > > > > > propose > > > > >> > > > > > > > to > > > > >> > > > > > > > > > use > > > > >> > > > > > > > > > >> > the > > > > >> > > > > > > > > > >> > > > fetch.timeout to let the leader to try > to > > > > >> verify > > > > >> > if > > > > >> > > it > > > > >> > > > > has > > > > >> > > > > > > > > already > > > > >> > > > > > > > > > >> been > > > > >> > > > > > > > > > >> > > > stale. > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > > > > >> > > It just seems that we should handle these > > two > > > > >> cases > > > > >> > > in a > > > > >> > > > > > > > > consistent > > > > >> > > > > > > > > > >> way? 
> > Thanks,
> > Jun
> > > > Jun
> > > > On Wed, Apr 29, 2020 at 8:56 PM Guozhang Wang <wangg...@gmail.com> wrote:
> > > > > Hello Leonard,
> > > > > Thanks for your comments, I'm replying inline below:
> > > > > On Wed, Apr 29, 2020 at 1:58 AM Wang (Leonard) Ge <w...@confluent.io> wrote:
> > > > > > Hi Kafka developers,
> > > > > > It's great to see this proposal, and it took me some time to finish reading it. I have the following questions about the proposal:
> > > > > > - How do we plan to test this design to ensure its correctness? Or more broadly, how do we ensure that our new 'pull' based model is functional and correct given that it is different from the original Raft implementation, which has a formal proof of correctness?
> > > > > We have two planned verifications of the correctness and liveness of the design. One is via model verification (TLA+): https://github.com/guozhangwang/kafka-specification
> > > > > Another is via the concurrent simulation tests: https://github.com/confluentinc/kafka/commit/5c0c054597d2d9f458cad0cad846b0671efa2d91
> > > > > > - Have we considered any sensible defaults for the configuration, i.e. all the election timeout, fetch timeout, etc.? Or do we want to leave this to a later stage when we do the performance testing, etc.?
> > > > > This is a good question. The reason we did not set any default values for the timeout configurations is that we think it may take some benchmarking experiments to get these defaults right.
> > > > > Some high-level principles to consider: 1) the fetch.timeout should be on roughly the same scale as the ZK session timeout, which is now 18 seconds by default -- in practice we've seen unstable networks having more than 10 seconds of transient connectivity; 2) the election.timeout, however, should be smaller than the fetch timeout, as is also suggested as a practical optimization in the literature: https://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf
> > > > > Some more discussion can be found here: https://github.com/confluentinc/kafka/pull/301/files#r415420081
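The principles above could translate into defaults along the following lines once benchmarking settles the numbers; the names and values are placeholders, not decisions from the KIP, and the check simply encodes the stated ordering between the two timeouts:

    public final class QuorumTimeoutDefaultsSketch {
        // Placeholder defaults: fetch timeout on the scale of the old ZK session
        // timeout (18s); election timeout kept strictly below it.
        static final long FETCH_TIMEOUT_MS_DEFAULT = 18_000;
        static final long ELECTION_TIMEOUT_MS_DEFAULT = 5_000;

        static void validate(long electionTimeoutMs, long fetchTimeoutMs) {
            if (electionTimeoutMs >= fetchTimeoutMs) {
                throw new IllegalArgumentException(
                    "election timeout (" + electionTimeoutMs + "ms) should be smaller than "
                        + "fetch timeout (" + fetchTimeoutMs + "ms)");
            }
        }

        public static void main(String[] args) {
            validate(ELECTION_TIMEOUT_MS_DEFAULT, FETCH_TIMEOUT_MS_DEFAULT); // passes
        }
    }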
> > > > > > - Have we considered piggybacking `BeginQuorumEpoch` with the `FetchQuorumRecords`? I might be missing something obvious, but I am just wondering why we don't just use the `FindQuorum` and `FetchQuorumRecords` APIs and remove the `BeginQuorumEpoch` API?
> > > > > Note that Begin/EndQuorumEpoch is sent from the leader to the other voter followers, while FindQuorum / Fetch are sent from follower to leader. Arguably one could eventually realize the new leader and epoch via gossiping FindQuorum, but in practice that could require a long delay. Having a leader -> other voters request helps the new leader epoch propagate faster under a pull model.
> > > > > > - And about the `FetchQuorumRecords` response schema: in the `Records` field of the response, is it just one record or all the records starting from the FetchOffset? It seems a lot more efficient if we sent all the records during the bootstrapping of the brokers.
> > > > > Yes, the fetching is batched: FetchOffset is just the starting offset of the batch of records.
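To make the batching answer concrete, the response can be pictured as carrying a run of consecutive records beginning at the requested FetchOffset; the shape below is illustrative only, not the KIP's wire format:

    import java.util.List;

    public final class FetchBatchSketch {

        record Record(long offset, byte[] payload) {}

        // The Records field holds every record from fetchOffset up to some size limit,
        // so a bootstrapping broker catches up in chunks rather than one record at a time.
        record FetchQuorumRecordsResponse(long fetchOffset, List<Record> records) {
            long nextFetchOffset() {
                return records.isEmpty()
                        ? fetchOffset
                        : records.get(records.size() - 1).offset() + 1;
            }
        }
    }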
> > > > > > - Regarding the disruptive broker issue, does our pull based model suffer from it? If so, have we considered the Pre-Vote stage? If not, why?
> > > > > The disruptive broker is stated in the original Raft paper and is a result of the push model design. Our analysis showed that with the pull model it is no longer an issue.
> > > > > > Thanks a lot for putting this up, and I hope that my questions can be of some value to make this KIP better.
> > > > > > Hope to hear from you soon!
> > > > > > Best wishes,
> > > > > > Leonard
> > > > > > On Wed, Apr 29, 2020 at 1:46 AM Colin McCabe <cmcc...@apache.org> wrote:
> > > > > > > Hi Jason,
> > > > > > > It's amazing to see this coming together :)
> > > > > > > I haven't had a chance to read in detail, but I read the outline and a few things jumped out at me.
> > > > > > > First, for every epoch that is 32 bits rather than 64, I sort of wonder if that's a good long-term choice. I keep reading about stuff like this: https://issues.apache.org/jira/browse/ZOOKEEPER-1277 . Obviously, that JIRA is about zxid, which increments much faster than we expect these leader epochs to, but it would still be good to see some rough calculations about how long 32 bits (or really, 31 bits) will last us in the cases where we're using it, and what the space savings we're really getting. It seems like in most cases the tradeoff may not be worth it?
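For reference, the rough calculation Colin asks for looks like this under deliberately pessimistic assumptions about election frequency (the rates are made up for illustration):

    public final class EpochLifetimeSketch {
        public static void main(String[] args) {
            long maxEpoch = Integer.MAX_VALUE;          // 2^31 - 1 = 2,147,483,647
            double secondsPerYear = 365.25 * 24 * 3600;

            // One leader election every second, continuously (far beyond anything realistic):
            System.out.printf("1 election/sec -> ~%.0f years%n", maxEpoch / secondsPerYear);
            // One leader election every minute:
            System.out.printf("1 election/min -> ~%.0f years%n", maxEpoch * 60 / secondsPerYear);
        }
    }

Even at one election per second a 31-bit epoch lasts roughly 68 years; unlike zxid, the epoch only moves on leader changes rather than on every transaction, which is presumably why the two cases differ.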
> > > > > > > Another thing I've been thinking about is how we do bootstrapping. I would prefer to be in a world where formatting a new Kafka node was a first-class operation explicitly initiated by the admin, rather than something that happened implicitly when you started up the broker and things "looked blank."
> > > > > > > The first problem is that things can "look blank" accidentally if the storage system is having a bad day. Clearly in the non-Raft world, this leads to data loss if the broker that is (re)started this way was the leader for some partitions.
> > > > > > > The second problem is that we have a bit of a chicken-and-egg problem with certain configuration keys. For example, maybe you want to configure some connection security settings in your cluster, but you don't want them ever to be stored in a plaintext config file (for example, SCRAM passwords, etc.). You could use a broker API to set the configuration, but that brings up the chicken-and-egg problem: the broker needs to be configured to know how to talk to you, but you need to configure it before you can talk to it. Using an external secret manager like Vault is one way to solve this, but not everyone uses an external secret manager.
> > > > > > > quorum.voters seems like a similar configuration key. In the current KIP, this is only read if there is no other configuration specifying the quorum voter set. If we had a kafka.mkfs command, we wouldn't need this key, because we could assume that there was always quorum information stored locally.
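One way to picture the explicit-format idea is a two-step flow: an admin-initiated format step writes the quorum bootstrap information to disk, and the broker then treats missing metadata as a fatal error rather than as a blank slate to initialize. The sketch below is purely illustrative -- kafka.mkfs is Colin's hypothetical command, and the file name, keys, and methods here are invented:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Properties;

    public final class StorageFormatSketch {
        private static final String MARKER_FILE = "bootstrap.properties"; // invented name

        // Run once by the admin, analogous to the hypothetical kafka.mkfs.
        static void format(Path logDir, String quorumVoters) throws IOException {
            Files.createDirectories(logDir);
            Properties props = new Properties();
            props.setProperty("quorum.voters", quorumVoters);
            try (var out = Files.newOutputStream(logDir.resolve(MARKER_FILE))) {
                props.store(out, "written by explicit format step");
            }
        }

        // Called on broker startup: blank storage is an error, not an invitation to re-create.
        static Properties loadOrFail(Path logDir) {
            Path marker = logDir.resolve(MARKER_FILE);
            if (!Files.exists(marker)) {
                throw new IllegalStateException(
                    "Log dir " + logDir + " has not been formatted; refusing to start");
            }
            try (var in = Files.newInputStream(marker)) {
                Properties props = new Properties();
                props.load(in);
                return props;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }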
> > > > > > > best,
> > > > > > > Colin
> > > > > > > On Thu, Apr 16, 2020, at 16:44, Jason Gustafson wrote:
> > > > > > > > Hi All,
> > > > > > > > I'd like to start a discussion on KIP-595: https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum . This proposal specifies a Raft protocol to ultimately replace Zookeeper as documented in KIP-500. Please take a look and share your thoughts.
> > > > > > > > A few minor notes to set the stage a little bit:
> > > > > > > > - This KIP does not specify the structure of the messages used to represent metadata in Kafka, nor does it specify the internal API that will be used by the controller. Expect these to come in later proposals. Here we are primarily concerned with the replication protocol and basic operational mechanics.
> > > > > > > > - We expect many details to change as we get closer to integration with the controller. Any changes we make will be made either as amendments to this KIP or, in the case of larger changes, as new proposals.
> > > > > > > > - We have a prototype implementation which I will put online within the next week and which may help in understanding some details. It has diverged a little bit from our proposal, so I am taking a little time to bring it in line. I'll post an update to this thread when it is available for review.
> > > > > > > > Finally, I want to mention that this proposal was drafted by myself, Boyang Chen, and Guozhang Wang.
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > --
> > > > > > Leonard Ge
> > > > > > Software Engineer Intern - Confluent
> > > > > -- Guozhang
> > > -- Guozhang
> -- Guozhang