Hi, Colin,

Thanks for the reply. A few more comments below.

206. "RemoveTopic is the last step, that scrubs all metadata about the
topic.  In order to get to that last step, the topic data needs to removed
from all brokers (after each broker notices that the topic is being
deleted)." Currently, this is done based on the response of
StopReplicaRequest. Since the controller no longer sends this request, how
does the controller know that the data for the deleted topic has
been removed in the brokers?

210. Thanks for the explanation. Sounds good to me.

211. I still don't see an example of when ShouldFence is set to true in
BrokerHeartbeatRequest. Could we add one?

213. The KIP now allows replicas to be assigned to brokers that are fenced,
which is an improvement. How do we permanently remove a broker (e.g. when
shrinking the cluster) to prevent it from being used for future replica
assignments?

214. "Currently, when a node is down, all of its ZK registration
information is gone.  But we need this information in order to understand
things like whether the replicas of a particular partition are
well-balanced across racks." This is not quite true. Currently, even when
ZK registration is gone, the existing replica assignment is still available
in the metadata response.

Jun

On Wed, Dec 16, 2020 at 8:48 AM Colin McCabe <cmcc...@apache.org> wrote:

> On Tue, Dec 15, 2020, at 13:08, Jun Rao wrote:
> > Hi, Colin,
> >
> > Thanks for the reply. A few more follow up comments.
> >
> > 210. initial.broker.registration.timeout.ms: The default value is 90sec,
> > which seems long. If a broker fails the registration because of incorrect
> > configs, we want to fail faster. In comparison, the default for
> > zookeeper.connection.timeout.ms is 18 seconds.
> >
>
> Hi Jun,
>
> I agree that the initial connection timeout here is longer than what we
> had with ZK.  The main reason I selected a slightly longer timeout here is
> to handle the case where the controllers and the brokers are co-located.
> For example, if you have a 3 node cluster and all three nodes are
> controllers+brokers, when you first bring up the cluster, we will have to
> stand up the controller quorum and then handle broker registrations.
> Although we believe Raft will start up relatively quickly, it's good to
> leave some extra margin here.
>
> I don't think there's a big disadvantage to having a slightly longer
> timeout here.  After all, starting up the brokers with no controllers is an
> admin mistake, which we don't expect to see very often.  Maybe let's set
> it to 60 seconds for now and see if we can tweak it in the future if that
> turns out to be too long or short?
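>
> For operators who want registration failures to surface faster (for
> example, when the controller quorum runs on dedicated nodes and is
> already up), the value could be overridden in server.properties.  A
> sketch, assuming the config name stays as currently proposed:
>
>     initial.broker.registration.timeout.ms=10000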
>
> > 211. "Once they are all moved, the controller responds to the heartbeat
> > with a nextState of SHUTDOWN." It seems that nextState is no longer
> > relevant. Also, could you add a bit more explanation on when ShouldFence
> is
> > set to true in BrokerHeartbeatRequest?
> >
>
> Good catch.  I removed the obsolete section referring to nextState and
> added a reference to the new boolean.  I also added some more information
> about ShouldFence and about the rationale for separating fencing from
> registration.
>
> > 212. Related to the above, what does the broker do if IsFenced is true in
> > the BrokerHeartbeatResponse? Will the broker do the same thing on
> receiving
> > a FenceBrokerRecord from the metadata log?
> >
>
> The broker only checks this boolean during the startup process.  After the
> startup process is finished, it ignores this boolean.
>
> The broker uses fence / unfence records in the metadata log to determine
> which brokers should appear in its MetadataResponse.
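>
> As a rough sketch of that broker-side bookkeeping (all names here are
> illustrative, not the actual implementation):
>
>     import java.util.ArrayList;
>     import java.util.HashMap;
>     import java.util.List;
>     import java.util.Map;
>
>     class MetadataImage {
>         // brokerId -> fenced flag, maintained by replaying fence /
>         // unfence records from the metadata log.
>         private final Map<Integer, Boolean> fenced = new HashMap<>();
>
>         void replayFence(int brokerId) { fenced.put(brokerId, true); }
>         void replayUnfence(int brokerId) { fenced.put(brokerId, false); }
>
>         // Only unfenced brokers are listed in MetadataResponses.
>         List<Integer> brokersForMetadataResponse() {
>             List<Integer> result = new ArrayList<>();
>             for (Map.Entry<Integer, Boolean> e : fenced.entrySet()) {
>                 if (!e.getValue()) {
>                     result.add(e.getKey());
>                 }
>             }
>             return result;
>         }
>     }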
>
> best,
> Colin
>
> > Jun
> >
> > On Tue, Dec 15, 2020 at 8:51 AM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > On Tue, Dec 15, 2020, at 04:13, Tom Bentley wrote:
> > > > Hi Colin,
> > > >
> > > > The KIP says that "brokers which are fenced will not appear in
> > > > MetadataResponses.  So clients that have up-to-date metadata will
> not try
> > > > to contact fenced brokers.", which is fine, but it doesn't seem to
> > > > elaborate on what happens for clients which try to contact a broker
> > > (using
> > > > stale metadata, for example) and find it in the registered-but-fenced
> > > > state.
> > >
> > > Hi Tom,
> > >
> > > I have dropped the broker-side fencing from this proposal.  So now the
> > > fencing is basically the same as today's fencing: it's controller-side
> > > only.  That means that clients using stale metadata can contact fenced
> > > brokers and communicate with them.  The only case where the broker
> will be
> > > inaccessible is when it hasn't finished starting yet (i.e., has not yet
> > > attained RUNNING state.)
> > >
> > > Just like today, we expect the safeguards built into the replication
> > > protocol to prevent the worst corner cases that could result from
> this.  I
> > > do think we will probably take up this issue later and find a better
> > > solution for client-side metadata, but this KIP is big enough as-is.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > > You said previously that the broker will respond with a retriable
> > > > error, which is again fine, but you didn't answer my question about
> which
> > > > error code you've chosen for this. I realise that this doesn't really
> > > > affect the correctness of the behaviour, but I'm not aware of any
> > > existing
> > > > error code which is a good fit. So it would be good to understand how
> > > > you're making this backward compatible for clients.
> > > >
> > > > Many thanks,
> > > >
> > > > Tom
> > > >
> > > > On Tue, Dec 15, 2020 at 1:42 AM Colin McCabe <cmcc...@apache.org>
> wrote:
> > > >
> > > > > On Fri, Dec 11, 2020, at 17:07, Jun Rao wrote:
> > > > > > Hi, Colin,
> > > > > >
> > > > > > Thanks for the reply. Just a couple of more comments below.
> > > > > >
> > > > > > 210. Since we are deprecating zookeeper.connection.timeout.ms,
> > > should we
> > > > > > add a new config to bound the time for a broker to connect to the
> > > > > > controller during startup?
> > > > > >
> > > > >
> > > > > Good idea.  I added initial.broker.registration.timeout.ms for
> this.
> > > > >
> > > > > > 211. BrokerHeartbeat no longer has the state field in the
> > > > > request/response.
> > > > > > However, (a) the controller shutdown section still has "In its
> > > periodic
> > > > > > heartbeats, the broker asks the controller if it can transition
> into
> > > the
> > > > > > SHUTDOWN state.  This motivates the controller to move all of the
> > > leaders
> > > > > > off of that broker.  Once they are all moved, the controller
> > > responds to
> > > > > > the heartbeat with a nextState of SHUTDOWN."; (b) the
> description of
> > > > > > BrokerHeartbeat still references currentState and targetState.
> > > > > >
> > > > >
> > > > > Thanks.  I've made these sections clearer and removed the obsolete
> > > > > references to sending states.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Fri, Dec 11, 2020 at 1:33 PM Colin McCabe <cmcc...@apache.org
> >
> > > wrote:
> > > > > >
> > > > > > > On Wed, Dec 9, 2020, at 10:10, Jun Rao wrote:
> > > > > > > > Hi, Colin,
> > > > > > > >
> > > > > > > > Thanks for the update. A few more follow up comments.
> > > > > > > >
> > > > > > >
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > Thanks again for the review.
> > > > > > >
> > > > > > > > 100. FailedReplicaRecord: Since this is reported by each
> broker
> > > > > > > > independently, perhaps we could use a more concise
> representation
> > > > > that
> > > > > > > has
> > > > > > > > a top level broker field, an array of topics, which has an
> array
> > > of
> > > > > > > > partitions.
> > > > > > > >
> > > > > > >
> > > > > > > The issue is that there is a size limit on each record.  Putting
> > > > > > > all of the partitions of a log directory into a single record
> > > > > > > would probably break that in many cases.  Still, we can optimize
> > > > > > > a bit by having an array of partition IDs, since nearly all the
> > > > > > > time, we have more than one from the same topic.
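> > > > > > >
> > > > > > > A sketch of that shape (the field names are assumptions, not the
> > > > > > > final schema):
> > > > > > >
> > > > > > >     import java.util.List;
> > > > > > >
> > > > > > >     // One record per (broker, topic) pair, so a single record
> > > > > > >     // stays small even when a failed log directory holds
> > > > > > >     // replicas of many partitions.
> > > > > > >     class FailedReplicasRecord {
> > > > > > >         int brokerId;             // broker with the failed log dir
> > > > > > >         String topic;             // a topic with replicas there
> > > > > > >         List<Integer> partitions; // failed partition IDs
> > > > > > >     }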
> > > > > > >
> > > > > > > > 200. Sounds good. If we remove the broker-side fencing
> logic, do
> > > we
> > > > > plan
> > > > > > > to
> > > > > > > > still keep FENCED in broker state? Do we plan to expose the
> new
> > > > > states
> > > > > > > > through the existing BrokerState metric and if so, what are
> the
> > > > > values
> > > > > > > for
> > > > > > > > the new states?
> > > > > > > >
> > > > > > >
> > > > > > > No, we don't need FENCED any more.  I have removed it from the
> > > > > > > KIP.
> > > > > > >
> > > > > > > The new states are very similar to the current ones, actually.
> > > > > > > There are no new states or removed ones.  The main change in the
> > > > > > > broker state machine is that the RECOVERING_FROM_UNCLEAN_SHUTDOWN
> > > > > > > state has been renamed to RECOVERY.  Also, unlike previously, the
> > > > > > > broker will always pass through RECOVERY (although it may only
> > > > > > > stay in this state for a very short amount of time).
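> > > > > > >
> > > > > > > In other words, something like the following (a sketch of the
> > > > > > > states as described here, not the exact code):
> > > > > > >
> > > > > > >     enum BrokerState {
> > > > > > >         NOT_RUNNING,
> > > > > > >         STARTING,
> > > > > > >         RECOVERY,  // formerly RECOVERING_FROM_UNCLEAN_SHUTDOWN;
> > > > > > >                    // now always entered, if only briefly
> > > > > > >         RUNNING,
> > > > > > >         PENDING_CONTROLLED_SHUTDOWN,
> > > > > > >         SHUTTING_DOWN
> > > > > > >     }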
> > > > > > >
> > > > > > > > 201. This may be fine too. Could we document what happens
> when
> > > the
> > > > > > > > broker.id/controller.id in meta.properties don't match
> the
> > > > > broker
> > > > > > > > config when the broker starts up?
> > > > > > > >
> > > > > > >
> > > > > > > I added some documentation about this.
> > > > > > >
> > > > > > > > 204. There is still "The highest metadata offset which the
> broker
> > > > > has not
> > > > > > > > reached" referenced under BrokerRegistration.
> > > > > > > >
> > > > > > >
> > > > > > > It should be CurrentMetadataOffset.  Fixed.
> > > > > > >
> > > > > > > > 206. Is that separate step needed given KIP-516? With
> KIP-516 (
> > > > > > > >
> > > > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-516%3A+Topic+Identifiers#KIP516:TopicIdentifiers-LeaderAndIsr
> > > > > > > ),
> > > > > > > > we don't need to wait for the topic data to be removed from all
> > > > > > > > brokers before removing the topic metadata. The combination of a
> > > > > > > > mismatched topicId or a missing topicId in the metadata is
> > > > > > > > enough for the broker to clean up deleted topics asynchronously.
> > > > > > >
> > > > > > > It won't be needed once KIP-516 is adopted, but this hasn't
> been
> > > > > > > implemented yet.
> > > > > > >
> > > > > > > best,
> > > > > > > Colin
> > > > > > >
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Dec 8, 2020 at 5:27 PM Colin McCabe <
> cmcc...@apache.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > On Thu, Dec 3, 2020, at 16:37, Jun Rao wrote:
> > > > > > > > > > Hi, Colin,
> > > > > > > > > >
> > > > > > > > > > Thanks for the updated KIP. A few more comments below.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi Jun,
> > > > > > > > >
> > > > > > > > > Thanks again for the reviews.
> > > > > > > > >
> > > > > > > > > > 80.2 For deprecated configs, we need to include
> zookeeper.*
> > > and
> > > > > > > > > > broker.id.generation.enable.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Added.
> > > > > > > > >
> > > > > > > > > > 83.1 If a broker is down, does the controller keep the
> > > previously
> > > > > > > > > > registered broker epoch forever? If not, how long does
> the
> > > > > controller
> > > > > > > > > keep
> > > > > > > > > > it? What does the controller do when receiving a broker
> > > heartbeat
> > > > > > > request
> > > > > > > > > > with an unknown broker epoch?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes, the controller keeps the previous registration
> forever.
> > > > > > > > >
> > > > > > > > > Broker heartbeat requests with an incorrect broker epoch will
> > > > > > > > > be rejected with STALE_BROKER_EPOCH.
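> > > > > > > > >
> > > > > > > > > A sketch of that controller-side check (the names are
> > > > > > > > > illustrative only):
> > > > > > > > >
> > > > > > > > >     import java.util.HashMap;
> > > > > > > > >     import java.util.Map;
> > > > > > > > >
> > > > > > > > >     class HeartbeatManager {
> > > > > > > > >         // brokerId -> broker epoch assigned at registration.
> > > > > > > > >         private final Map<Integer, Long> epochs = new HashMap<>();
> > > > > > > > >
> > > > > > > > >         // Returns true if the heartbeat carries the current
> > > > > > > > >         // broker epoch; otherwise the controller answers
> > > > > > > > >         // with STALE_BROKER_EPOCH.
> > > > > > > > >         boolean validateHeartbeat(int brokerId, long epoch) {
> > > > > > > > >             Long current = epochs.get(brokerId);
> > > > > > > > >             return current != null && current == epoch;
> > > > > > > > >         }
> > > > > > > > >     }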
> > > > > > > > >
> > > > > > > > > > 100. Have you figured out if we need to add a new record
> > > type for
> > > > > > > > > reporting
> > > > > > > > > > partitions on failed disks?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I added FailedReplicaRecord to reflect the case where a
> JBOD
> > > > > directory
> > > > > > > has
> > > > > > > > > failed, leading to failed replicas.
> > > > > > > > >
> > > > > > > > > > 102. For debugging purposes, sometimes it's useful to
> read
> > > the
> > > > > > > metadata
> > > > > > > > > > topic using tools like console-consumer. Should we
> support
> > > that
> > > > > and
> > > > > > > if
> > > > > > > > > so,
> > > > > > > > > > how?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > For now, we have the ability to read the metadata logs
> with the
> > > > > > > dump-logs
> > > > > > > > > tool.  I think we will come up with some other tools in the
> > > future
> > > > > as
> > > > > > > we
> > > > > > > > > get experience.
> > > > > > > > >
> > > > > > > > > > 200. "brokers which are fenced will not appear in
> > > > > MetadataResponses.
> > > > > > > The
> > > > > > > > > > broker will not respond to these requests-- instead, it
> will
> > > > > simply
> > > > > > > > > > disconnect." If the controller is partitioned off from
> the
> > > > > brokers,
> > > > > > > this
> > > > > > > > > > design will cause every broker to stop accepting new
> client
> > > > > > > requests. In
> > > > > > > > > > contrast, if ZK is partitioned off, the existing
> behavior is
> > > > > that the
> > > > > > > > > > brokers can continue to work based on the last known
> > > metadata.
> > > > > So, I
> > > > > > > am
> > > > > > > > > not
> > > > > > > > > > sure if we should change the existing behavior because
> of the
> > > > > bigger
> > > > > > > > > impact
> > > > > > > > > > in the new one. Another option is to keep the existing
> > > behavior
> > > > > and
> > > > > > > > > expose
> > > > > > > > > > a metric for fenced brokers so that the operator could be
> > > > > alerted.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm skeptical about how well running without ZK currently
> > > works.
> > > > > > > However,
> > > > > > > > > I will move the broker-side fencing into a follow-up KIP.
> This
> > > > > KIP is
> > > > > > > > > already pretty large and there is no hard dependency on
> this.
> > > > > There
> > > > > > > may
> > > > > > > > > also be other ways of accomplishing the positive effects of
> > > > > > > > > broker-side fencing, so more discussion is needed.
> > > > > > > > >
> > > > > > > > > > 201. I read Ron's comment, but I am still not sure the
> > > benefit of
> > > > > > > keeping
> > > > > > > > > > broker.id and controller.id in meta.properties. It seems
> > > that
> > > > > we are
> > > > > > > > > just
> > > > > > > > > > duplicating the same info in two places and have the
> > > additional
> > > > > > > burden of
> > > > > > > > > > making sure the values in the two places are consistent.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think the reasoning is that having broker.id protects us
> > > against
> > > > > > > > > accidentally bringing up a broker with a disk from a
> different
> > > > > > > broker.  I
> > > > > > > > > don't feel strongly about this but it seemed simpler to
> keep
> > > it.
> > > > > > > > >
> > > > > > > > > > 202. controller.connect.security.protocol: Is this needed
> > > since
> > > > > > > > > > controller.listener.names and
> listener.security.protocol.map
> > > > > imply
> > > > > > > the
> > > > > > > > > > security protocol already?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > You're right, this isn't needed.  I'll remove it.
> > > > > > > > >
> > > > > > > > > > 203. registration.heartbeat.interval.ms: It defaults to
> 2k.
> > > ZK
> > > > > uses
> > > > > > > 1/3
> > > > > > > > > of
> > > > > > > > > > the session timeout for heartbeat. So, given the default
> 18k
> > > for
> > > > > > > > > > registration.lease.timeout.ms, should we default
> > > > > > > > > > registration.heartbeat.interval.ms to 6k?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > 6 seconds seems like a pretty long time between heartbeats.
> > > > > > > > > It might be useful to detect that a broker is missing
> > > > > > > > > heartbeats sooner than that.  I provisionally set it to 3
> > > > > > > > > seconds (we can always change it later...)
> > > > > > > > >
> > > > > > > > > I also changed the name of these configurations to
> > > > > > > > > "broker.heartbeat.interval.ms" and "broker.registration.timeout.ms"
> > > > > > > > > to try to clarify them a bit.
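> > > > > > > > >
> > > > > > > > > In server.properties terms, the defaults under discussion
> > > > > > > > > would then look something like this (assuming the 18 second
> > > > > > > > > timeout carries over to the renamed config):
> > > > > > > > >
> > > > > > > > >     broker.heartbeat.interval.ms=3000
> > > > > > > > >     broker.registration.timeout.ms=18000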
> > > > > > > > >
> > > > > > > > > > 204. "The highest metadata offset which the broker has
> not
> > > > > reached."
> > > > > > > It
> > > > > > > > > > seems this should be "has reached".
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I changed this to "one more than the highest metadata
> offset
> > > which
> > > > > the
> > > > > > > > > broker has reached."
> > > > > > > > >
> > > > > > > > > > 205. UnfenceBrokerRecord and UnregisterBrokerRecord: To
> me,
> > > they
> > > > > > > seem to
> > > > > > > > > be
> > > > > > > > > > the same. Do we need both?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Unregistration means that the broker has been removed from
> the
> > > > > cluster.
> > > > > > > > > That is different than unfencing, which marks the broker as
> > > active.
> > > > > > > > >
> > > > > > > > > > 206. TopicRecord: The Deleting field is used to indicate
> > > that the
> > > > > > > topic
> > > > > > > > > is
> > > > > > > > > > being deleted. I am wondering if this is really needed
> since
> > > > > > > RemoveTopic
> > > > > > > > > > already indicates the same thing.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > RemoveTopic is the last step, which scrubs all metadata about
> > > > > > > > > the topic.  In order to get to that last step, the topic data
> > > > > > > > > needs to be removed from all brokers (after each broker
> > > > > > > > > notices that the topic is being deleted).
> > > > > > > > >
> > > > > > > > > best,
> > > > > > > > > Colin
> > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Wed, Dec 2, 2020 at 2:50 PM Colin McCabe <
> > > cmcc...@apache.org>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > On Wed, Dec 2, 2020, at 14:07, Ron Dagostino wrote:
> > > > > > > > > > > > Hi Colin.  Thanks for the updates.  It's now clear
> to me
> > > that
> > > > > > > brokers
> > > > > > > > > > > > keep their broker epoch for the life of their JVM --
> they
> > > > > > > register
> > > > > > > > > > > > once, get their broker epoch in the response, and
> then
> > > never
> > > > > > > > > > > > re-register again.  Brokers may get fenced, but they
> > > keep the
> > > > > > > same
> > > > > > > > > > > > broker epoch for the life of their JVM.  The
> incarnation
> > > ID
> > > > > is
> > > > > > > also
> > > > > > > > > > > > kept for the life of the JVM but is generated by the
> > > broker
> > > > > > > itself
> > > > > > > > > > > > upon startup, and the combination of the two allows
> the
> > > > > > > Controller to
> > > > > > > > > > > > act idempotently if any previously-sent registration
> > > response
> > > > > > > gets
> > > > > > > > > > > > lost.  Makes sense.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks, Ron.  That's a good summary.
> > > > > > > > > > >
> > > > > > > > > > > > One thing I wonder about is if it might be helpful
> for
> > > the
> > > > > > > broker to
> > > > > > > > > > > > send the Cluster ID as determined from its
> > > meta.properties
> > > > > file
> > > > > > > in
> > > > > > > > > its
> > > > > > > > > > > > registration request.  Does it even make sense for
> the
> > > > > broker to
> > > > > > > > > > > > successfully register and enter the Fenced state if
> it
> > > has
> > > > > the
> > > > > > > wrong
> > > > > > > > > > > > Cluster ID?
> > > > > > > > > > >
> > > > > > > > > > > Yeah, that's a good idea.  Let's have the broker pass its
> > > > > > > > > > > cluster ID in the registration RPC, and then registration
> > > > > > > > > > > can fail if the broker is configured for the wrong cluster.
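> > > > > > > > > > >
> > > > > > > > > > > A sketch of that registration-time check (the names and
> > > > > > > > > > > the exception are illustrative; the real code would
> > > > > > > > > > > presumably answer with an INCONSISTENT_CLUSTER_ID style
> > > > > > > > > > > error):
> > > > > > > > > > >
> > > > > > > > > > >     import java.util.Objects;
> > > > > > > > > > >
> > > > > > > > > > >     class RegistrationChecks {
> > > > > > > > > > >         // Fail the registration if the broker's
> > > > > > > > > > >         // configured cluster ID differs from the
> > > > > > > > > > >         // controller's.
> > > > > > > > > > >         static void checkClusterId(String ours, String theirs) {
> > > > > > > > > > >             if (!Objects.equals(ours, theirs)) {
> > > > > > > > > > >                 throw new IllegalStateException(
> > > > > > > > > > >                     "Expected cluster ID " + ours +
> > > > > > > > > > >                     ", but received " + theirs);
> > > > > > > > > > >             }
> > > > > > > > > > >         }
> > > > > > > > > > >     }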
> > > > > > > > > > >
> > > > > > > > > > > >  The nextMetadataOffset value that the broker
> communicates
> > > > > > > > > > > > in its registration request only has meaning within
> the
> > > > > correct
> > > > > > > > > > > > cluster, so it feels to me that the Controller should
> > > have
> > > > > some
> > > > > > > way
> > > > > > > > > to
> > > > > > > > > > > > perform this sanity check.  There is currently
> (pre-KIP
> > > 500)
> > > > > a
> > > > > > > check
> > > > > > > > > > > > in the broker to make sure its configured cluster ID
> > > matches
> > > > > the
> > > > > > > one
> > > > > > > > > > > > stored in ZooKeeper, and we will have to perform this
> > > > > validation
> > > > > > > > > > > > somewhere in the KIP-500 world.  If the Controller
> > > doesn't
> > > > > do it
> > > > > > > > > > > > within the registration request then the broker will
> > > have to
> > > > > > > make a
> > > > > > > > > > > > metadata request to the Controller, retrieve the
> Cluster
> > > ID,
> > > > > and
> > > > > > > > > > > > perform the check itself.  It feels to me that it
> might
> > > be
> > > > > > > better for
> > > > > > > > > > > > the Controller to just do it, and then the broker
> doesn't
> > > > > have to
> > > > > > > > > > > > worry about it anymore once it successfully
> registers.
> > > > > > > > > > > >
> > > > > > > > > > > > I also have a question about the broker.id value and
> > > > > > > > > meta.properties.
> > > > > > > > > > > > The KIP now says "In version 0 of meta.properties,
> there
> > > is a
> > > > > > > > > > > > broker.id field.  Version 1 does not have this
> field.
> > > It
> > > > > is no
> > > > > > > > > longer
> > > > > > > > > > > > needed because we no longer support dynamic broker id
> > > > > > > assignment."
> > > > > > > > > > > > But then there is an example version 1
> meta.properties
> > > file
> > > > > that
> > > > > > > > > shows
> > > > > > > > > > > > the broker.id value.  I actually wonder if maybe the
> > > > > broker.id
> > > > > > > value
> > > > > > > > > > > > would be good to keep in the version 1
> meta.properties
> > > file
> > > > > > > because
> > > > > > > > > it
> > > > > > > > > > > > currently (pre-KIP 500, version 0) acts as a sanity
> > > check to
> > > > > make
> > > > > > > > > sure
> > > > > > > > > > > > the broker is using the correct log directory.
> Similarly
> > > > > with
> > > > > > > the
> > > > > > > > > > > > controller.id value on controllers -- it would
> allow the
> > > > > same
> > > > > > > type
> > > > > > > > > of
> > > > > > > > > > > > sanity check for quorum controllers.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > That's a good point.  I will add broker.id back, and also
> > > > > > > > > > > add controller.id as a possibility.
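> > > > > > > > > > >
> > > > > > > > > > > A version 1 meta.properties might then look something
> > > > > > > > > > > like this (the values are illustrative):
> > > > > > > > > > >
> > > > > > > > > > >     version=1
> > > > > > > > > > >     cluster.id=QzXvR0luSkW3cyCfJtcq5g
> > > > > > > > > > >     broker.id=1
> > > > > > > > > > >
> > > > > > > > > > > with controller.id taking the place of broker.id on a
> > > > > > > > > > > quorum controller node.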
> > > > > > > > > > >
> > > > > > > > > > > cheers,
> > > > > > > > > > > Colin
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Nov 30, 2020 at 7:41 PM Colin McCabe <
> > > > > cmcc...@apache.org
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Oct 23, 2020, at 16:10, Jun Rao wrote:
> > > > > > > > > > > > > > Hi, Colin,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks again for the reply.  Sorry for the long
> > > hiatus.  I
> > > > > was
> > > > > > > on
> > > > > > > > > > > vacation for a while.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 55. There is still text that favors new broker
> > > > > registration.
> > > > > > > > > "When a
> > > > > > > > > > > broker
> > > > > > > > > > > > > > first starts up, when it is in the INITIAL
> state, it
> > > will
> > > > > > > always
> > > > > > > > > > > "win"
> > > > > > > > > > > > > > broker ID conflicts.  However, once it is
> granted a
> > > > > lease, it
> > > > > > > > > > > transitions
> > > > > > > > > > > > > > out of the INITIAL state.  Thereafter, it may
> lose
> > > > > subsequent
> > > > > > > > > > > conflicts if
> > > > > > > > > > > > > > its broker epoch is stale.  (See KIP-380 for some
> > > > > background
> > > > > > > on
> > > > > > > > > > > broker
> > > > > > > > > > > > > > epoch.)  The reason for favoring new processes
> is to
> > > > > > > accommodate
> > > > > > > > > the
> > > > > > > > > > > common
> > > > > > > > > > > > > > case where a process is killed with kill -9 and
> then
> > > > > > > restarted.
> > > > > > > > > We
> > > > > > > > > > > want it
> > > > > > > > > > > > > > to be able to reclaim its old ID quickly in this
> > > case."
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the reminder.  I have clarified the
> language
> > > > > here.
> > > > > > > > > > > Hopefully now it is clear that we don't allow quick
> re-use
> > > of
> > > > > > > broker
> > > > > > > > > IDs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 80.1 Sounds good. Could you document that
> listeners
> > > is a
> > > > > > > required
> > > > > > > > > > > config
> > > > > > > > > > > > > > now? It would also be useful to annotate other
> > > required
> > > > > > > configs.
> > > > > > > > > For
> > > > > > > > > > > > > > example, controller.connect should be required.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I added a note specifying that these are required.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 80.2 Could you list all deprecated existing
> configs?
> > > > > Another
> > > > > > > one
> > > > > > > > > is
> > > > > > > > > > > > > > control.plane.listener.name since the
> controller no
> > > > > longer
> > > > > > > sends
> > > > > > > > > > > > > > LeaderAndIsr, UpdateMetadata and StopReplica
> > > requests.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I added a section specifying some deprecated
> configs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 83.1 It seems that the broker can transition from
> > > FENCED
> > > > > to
> > > > > > > > > RUNNING
> > > > > > > > > > > without
> > > > > > > > > > > > > > registering for a new broker epoch. I am not
> sure how
> > > > > this
> > > > > > > works.
> > > > > > > > > > > Once the
> > > > > > > > > > > > > > controller fences a broker, there is no need for
> the
> > > > > > > controller
> > > > > > > > > to
> > > > > > > > > > > keep the
> > > > > > > > > > > > > > broker epoch around. So, the fenced broker's
> > > > > > > > > > > > > > heartbeat request with the existing broker epoch will
> > > > > > > > > > > > > > be rejected, leading the broker back to the FENCED
> > > > > > > > > > > > > > state again.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > The broker epoch refers to the broker registration.
> > > So we
> > > > > DO
> > > > > > > keep
> > > > > > > > > the
> > > > > > > > > > > broker epoch around even while the broker is fenced.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The broker epoch changes only when there is a new
> > > broker
> > > > > > > > > > > registration.  Fencing or unfencing the broker doesn't
> > > change
> > > > > the
> > > > > > > > > broker
> > > > > > > > > > > epoch.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 83.5 Good point on KIP-590. Then should we
> expose the
> > > > > > > controller
> > > > > > > > > for
> > > > > > > > > > > > > > debugging purposes? If not, we should deprecate
> the
> > > > > > > controllerID
> > > > > > > > > > > field in
> > > > > > > > > > > > > > MetadataResponse?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think it's OK to expose it for now, with the
> proviso
> > > > > that it
> > > > > > > > > won't
> > > > > > > > > > > be reachable by clients.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 90. We rejected the shared ID with just one
> reason
> > > "This
> > > > > is
> > > > > > > not a
> > > > > > > > > > > good idea
> > > > > > > > > > > > > > because NetworkClient assumes a single ID
> space.  So
> > > if
> > > > > > > there is
> > > > > > > > > > > both a
> > > > > > > > > > > > > > controller 1 and a broker 1, we don't have a way
> of
> > > > > picking
> > > > > > > the
> > > > > > > > > > > "right"
> > > > > > > > > > > > > > one." This doesn't seem to be a strong reason.
> For
> > > > > example,
> > > > > > > we
> > > > > > > > > could
> > > > > > > > > > > > > > address the NetworkClient issue with the node
> type
> > > as you
> > > > > > > pointed
> > > > > > > > > > > out or
> > > > > > > > > > > > > > using the negative value of a broker ID as the
> > > > > controller ID.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > It would require a lot of code changes to support
> > > multiple
> > > > > > > types of
> > > > > > > > > > > node IDs.  It's not clear to me that the end result
> would
> > > be
> > > > > > > better --
> > > > > > > > > I
> > > > > > > > > > > tend to think it would be worse, since it would be more
> > > > > complex.
> > > > > > > In a
> > > > > > > > > > > similar vein, using negative numbers seems dangerous,
> > > since we
> > > > > use
> > > > > > > > > > > negatives or -1 as "special values" in many places.
> For
> > > > > example,
> > > > > > > -1
> > > > > > > > > often
> > > > > > > > > > > represents "no such node."
> > > > > > > > > > > > >
> > > > > > > > > > > > > One important thing to keep in mind is that we
> want to
> > > be
> > > > > able
> > > > > > > to
> > > > > > > > > > > transition from a broker and a controller being
> co-located
> > > to
> > > > > them
> > > > > > > no
> > > > > > > > > > > longer being co-located.  This is much easier to do
> when
> > > they
> > > > > have
> > > > > > > > > separate
> > > > > > > > > > > IDs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 100. In KIP-589
> > > > > > > > > > > > > > <
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > > > > > > > > > >,
> > > > > > > > > > > > > > the broker reports all offline replicas due to a
> disk
> > > > > > > failure to
> > > > > > > > > the
> > > > > > > > > > > > > > controller. It seems this information needs to be
> > > > > persisted
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > > > > > metadata
> > > > > > > > > > > > > > log. Do we have a corresponding record for that?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hmm, I have to look into this a little bit more.
> We
> > > may
> > > > > need
> > > > > > > a new
> > > > > > > > > > > record type.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 101. Currently, StopReplica request has 2 modes,
> > > without
> > > > > > > deletion
> > > > > > > > > > > and with
> > > > > > > > > > > > > > deletion. The former is used for controlled
> shutdown
> > > and
> > > > > > > handling
> > > > > > > > > > > disk
> > > > > > > > > > > > > > failure, and causes the follower to stop. The
> latter
> > > is
> > > > > for
> > > > > > > topic
> > > > > > > > > > > deletion
> > > > > > > > > > > > > > and partition reassignment, and causes the
> replica
> > > to be
> > > > > > > deleted.
> > > > > > > > > > > Since we
> > > > > > > > > > > > > > are deprecating StopReplica, could we document
> what
> > > > > triggers
> > > > > > > the
> > > > > > > > > > > stopping
> > > > > > > > > > > > > > of a follower and the deleting of a replica now?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > RemoveTopic triggers deletion.  In general the
> > > > > functionality of
> > > > > > > > > > > StopReplica is subsumed by the metadata records.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 102. Should we include the metadata topic in the
> > > > > > > > > MetadataResponse?
> > > > > > > > > > > If so,
> > > > > > > > > > > > > > when it will be included and what will the
> metadata
> > > > > response
> > > > > > > look
> > > > > > > > > > > like?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > No, it won't be included in the metadata response
> sent
> > > back
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > brokers.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 103. "The active controller assigns the broker a
> new
> > > > > broker
> > > > > > > > > epoch,
> > > > > > > > > > > based on
> > > > > > > > > > > > > > the latest committed offset in the log." This
> seems
> > > > > > > inaccurate
> > > > > > > > > since
> > > > > > > > > > > the
> > > > > > > > > > > > > > latest committed offset doesn't always advance on
> > > every
> > > > > log
> > > > > > > > > append.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Given that the new broker epoch won't be visible
> until
> > > the
> > > > > > > commit
> > > > > > > > > has
> > > > > > > > > > > happened, I have changed this to "the next available
> > > offset in
> > > > > the
> > > > > > > log"
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 104. REGISTERING(1) : It says "Otherwise, the
> broker
> > > > > moves
> > > > > > > into
> > > > > > > > > the
> > > > > > > > > > > FENCED
> > > > > > > > > > > > > > state.". It seems this should be RUNNING?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 105. RUNNING: Should we require the broker to
> catch
> > > up
> > > > > to the
> > > > > > > > > > > metadata log
> > > > > > > > > > > > > > to get into this state?
> > > > > > > > > > > > >
> > > > > > > > > > > > > For 104 and 105, these sections have been reworked.
> > > > > > > > > > > > >
> > > > > > > > > > > > > best,
> > > > > > > > > > > > > Colin
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Oct 23, 2020 at 1:20 PM Colin McCabe <
> > > > > > > cmcc...@apache.org
> > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Oct 21, 2020, at 05:51, Tom Bentley
> wrote:
> > > > > > > > > > > > > > > > Hi Colin,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Oct 19, 2020, at 08:59, Ron Dagostino
> > > wrote:
> > > > > > > > > > > > > > > > > > Hi Colin.  Thanks for the hard work on
> this
> > > KIP.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I have some questions about what happens
> to a
> > > > > broker
> > > > > > > > > when it
> > > > > > > > > > > becomes
> > > > > > > > > > > > > > > > > > fenced (e.g. because it can't send a
> > > heartbeat
> > > > > > > request to
> > > > > > > > > > > keep its
> > > > > > > > > > > > > > > > > > lease).  The KIP says "When a broker is
> > > fenced,
> > > > > it
> > > > > > > cannot
> > > > > > > > > > > process any
> > > > > > > > > > > > > > > > > > client requests.  This prevents brokers
> which
> > > > > are not
> > > > > > > > > > > receiving
> > > > > > > > > > > > > > > > > > metadata updates or that are not
> receiving
> > > and
> > > > > > > processing
> > > > > > > > > > > them fast
> > > > > > > > > > > > > > > > > > enough from causing issues to clients."
> And
> > > in
> > > > > the
> > > > > > > > > > > description of the
> > > > > > > > > > > > > > > > > > FENCED(4) state it likewise says "While
> in
> > > this
> > > > > > > state,
> > > > > > > > > the
> > > > > > > > > > > broker
> > > > > > > > > > > > > > > does
> > > > > > > > > > > > > > > > > > not respond to client requests."  It
> makes
> > > sense
> > > > > > > that a
> > > > > > > > > > > fenced broker
> > > > > > > > > > > > > > > > > > should not accept producer requests -- I
> > > assume
> > > > > any
> > > > > > > such
> > > > > > > > > > > requests
> > > > > > > > > > > > > > > > > > would result in
> NotLeaderOrFollowerException.
> > > > > But
> > > > > > > what
> > > > > > > > > > > about KIP-392
> > > > > > > > > > > > > > > > > > (fetch from follower) consumer
> requests?  It
> > > is
> > > > > > > > > conceivable
> > > > > > > > > > > that
> > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > could continue.  Related to that, would a
> > > fenced
> > > > > > > broker
> > > > > > > > > > > continue to
> > > > > > > > > > > > > > > > > > fetch data for partitions where it
> thinks it
> > > is a
> > > > > > > > > follower?
> > > > > > > > > > > Even if
> > > > > > > > > > > > > > > > > > it rejects consumer requests it might
> still
> > > > > continue
> > > > > > > to
> > > > > > > > > > > fetch as a
> > > > > > > > > > > > > > > > > > follower.  Might it be helpful to clarify
> > > both
> > > > > > > decisions
> > > > > > > > > > > here?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Ron,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Good question.  I think a fenced broker
> should
> > > > > > > continue to
> > > > > > > > > > > fetch on
> > > > > > > > > > > > > > > > > partitions it was already fetching before
> it
> > > was
> > > > > > > fenced,
> > > > > > > > > > > unless it
> > > > > > > > > > > > > > > hits a
> > > > > > > > > > > > > > > > > problem.  At that point it won't be able to
> > > > > continue,
> > > > > > > > > since it
> > > > > > > > > > > doesn't
> > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > the new metadata.  For example, it won't
> know
> > > about
> > > > > > > > > leadership
> > > > > > > > > > > changes
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > the partitions it's fetching.  The
> rationale
> > > for
> > > > > > > > > continuing to
> > > > > > > > > > > fetch
> > > > > > > > > > > > > > > is to
> > > > > > > > > > > > > > > > > try to avoid disruptions as much as
> possible.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I don't think fenced brokers should accept
> > > client
> > > > > > > requests.
> > > > > > > > > > > The issue
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > that the fenced broker may or may not have
> any
> > > > > data it
> > > > > > > is
> > > > > > > > > > > supposed to
> > > > > > > > > > > > > > > > > have.  It may or may not have applied any
> > > > > configuration
> > > > > > > > > > > changes, etc.
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > it is supposed to have applied.  So it
> could
> > > get
> > > > > pretty
> > > > > > > > > > > confusing, and
> > > > > > > > > > > > > > > also
> > > > > > > > > > > > > > > > > potentially waste the client's time.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > When fenced, how would the broker reply to a
> > > client
> > > > > > > which did
> > > > > > > > > > > make a
> > > > > > > > > > > > > > > > request?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Tom,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The broker will respond with a retryable error
> in
> > > that
> > > > > > > case.
> > > > > > > > > Once
> > > > > > > > > > > the
> > > > > > > > > > > > > > > client has re-fetched its metadata, it will no
> > > longer
> > > > > see
> > > > > > > the
> > > > > > > > > > > fenced broker
> > > > > > > > > > > > > > > as part of the cluster.  I added a note to the
> KIP.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > best,
> > > > > > > > > > > > > > > Colin
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
