Hi Colin,

Thanks for the KIP.

Currently ZooKeeper provides a convenient notification mechanism for
knowing that broker and topic configuration has changed. While KIP-500 does
suggest that incremental metadata updates are expected to come to clients
eventually, that would seem to imply that for some number of releases there
would be no equivalent mechanism for knowing about config changes. Is there
any thinking at this point about how a similar notification might be
provided in the future?
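
(For context on what's possible today without watches: a client can always
poll the Admin API and diff the result, along the lines of the rough sketch
below. The topic name and poll interval are invented for illustration.)

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ConfigChangePoller {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            Config previous = null;
            while (true) {
                // Fetch the current topic config and diff it against the last poll.
                Config current = admin.describeConfigs(Collections.singleton(topic))
                        .all().get().get(topic);
                if (previous != null && !current.equals(previous)) {
                    System.out.println("Topic config changed: " + current);
                }
                previous = current;
                Thread.sleep(30_000L); // arbitrary poll interval
            }
        }
    }
}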

Thanks,

Tom

On Mon, Aug 5, 2019 at 3:49 PM Viktor Somogyi-Vass <viktorsomo...@gmail.com>
wrote:

> Hey Colin,
>
> I think this is a long-awaited KIP, thanks for driving it. I'm excited to
> see this land in Kafka. I collected my questions (and I accept "TBD" as an
> answer, as they might be a bit deep for this high level :) ).
> 1.) Are there any specific reasons for the Controller persisting its state
> to disk only periodically instead of asynchronously with every update?
> Wouldn't less frequent saves increase the chance of missing a state change
> if the controller crashes between two saves?
> 2.) Why can't we allow brokers to fetch metadata from the follower
> controllers? I assume that followers would have up-to-date information, so
> brokers could in theory fetch from them.
>
> Thanks,
> Viktor
>
> On Sun, Aug 4, 2019 at 6:58 AM Boyang Chen <reluctanthero...@gmail.com>
> wrote:
>
> > Thanks for explaining, Ismael! Breaking this down into follow-up KIPs
> > sounds like a good idea.
> >
> > On Sat, Aug 3, 2019 at 10:14 AM Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > > Hi Boyang,
> > >
> > > Yes, there will be several KIPs that will discuss the items you describe
> > > in detail. Colin, it may be helpful to make this clear in the KIP-500
> > > description.
> > >
> > > Ismael
> > >
> > > On Sat, Aug 3, 2019 at 9:32 AM Boyang Chen <reluctanthero...@gmail.com> wrote:
> > >
> > > > Thanks Colin for initiating this important effort!
> > > >
> > > > One question I have is whether we have a section discussing controller
> > > > failover in the new architecture? I know we are using the Raft protocol
> > > > to fail over, yet it's still valuable to discuss the steps the new
> > > > cluster is going to take to reach a stable state again, so that we can
> > > > easily measure the availability of the metadata servers.
> > > >
> > > > Another suggestion I have is to write a step-by-step design doc like
> > > > what we did in KIP-98
> > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging>,
> > > > including the new request protocols and how they interact in the new
> > > > cluster. For a complicated change like this, an implementation design
> > > > doc helps a lot in the review process; otherwise most discussions we
> > > > have will focus on the high level and lose important details as we
> > > > discover them in the post-agreement phase.
> > > >
> > > > Boyang
> > > >
> > > > > On Fri, Aug 2, 2019 at 5:17 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > >
> > > > > On Fri, Aug 2, 2019, at 16:33, Jose Armando Garcia Sancio wrote:
> > > > > > Thanks Colin for the detailed KIP. I have a few comments and
> > > > > > questions.
> > > > > >
> > > > > > In the KIP's Motivation and Overview you mentioned the LeaderAndIsr
> > > > > > and UpdateMetadata RPCs. For example, "updates which the controller
> > > > > > pushes, such as LeaderAndIsr and UpdateMetadata messages". Is your
> > > > > > thinking that we will use MetadataFetch as a replacement for just
> > > > > > UpdateMetadata and add topic configuration in this state?
> > > > > >
> > > > >
> > > > > Hi Jose,
> > > > >
> > > > > Thanks for taking a look.
> > > > >
> > > > > The goal is for MetadataFetchRequest to replace both
> > > > > LeaderAndIsrRequest and UpdateMetadataRequest.  Topic configurations
> > > > > would be fetched along with the other metadata.
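> > > > >
> > > > > (Purely as an illustration, not a schema from the KIP: the request
> > > > > might carry little more than the broker's identity and the last
> > > > > metadata offset it has applied.  All field names below are invented.)
> > > > >
> > > > > // Hypothetical shape only; the real schema would come in a follow-up KIP.
> > > > > class MetadataFetchRequest {
> > > > >     int brokerId;           // which broker is fetching (doubles as a heartbeat)
> > > > >     long lastFetchedOffset; // last metadata log offset the broker has applied
> > > > >     int lastFetchedEpoch;   // controller epoch associated with that offset
> > > > > }
> > > > >
> > > > > class MetadataFetchResponse {
> > > > >     long highWatermark; // last committed offset in the metadata log
> > > > >     byte[] records;     // deltas since lastFetchedOffset, or a full image
> > > > > }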
> > > > >
> > > > > > In the section "Broker Metadata Management", you mention "Just
> like
> > > > with
> > > > > a
> > > > > > fetch request, the broker will track the offset of the last
> updates
> > > it
> > > > > > fetched". To keep the log consistent Raft requires that the
> > followers
> > > > > keep
> > > > > > all of the log entries (term/epoch and offset) that are after the
> > > > > > highwatermark. Any log entry before the highwatermark can be
> > > > > > compacted/snapshot. Do we expect the MetadataFetch API to only
> > return
> > > > log
> > > > > > entries up to the highwatermark?  Unlike the Raft replication API
> > > which
> > > > > > will replicate/fetch log entries after the highwatermark for
> > > consensus?
> > > > >
> > > > > Good question.  Clearly, we shouldn't expose metadata updates to the
> > > > > brokers until they've been stored on a majority of the Raft nodes.
> > > > > The most obvious way to do that, like you mentioned, is to have the
> > > > > brokers only fetch up to the HWM, but not beyond.  There might be a
> > > > > more clever way to do it by fetching the data, but not having the
> > > > > brokers act on it until the HWM advances.  I'm not sure if that's
> > > > > worth it or not.  We'll discuss this more in a separate KIP that
> > > > > discusses just Raft.
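> > > > >
> > > > > (A rough sketch of that second option, with invented names: the
> > > > > broker buffers fetched records past the high watermark and only
> > > > > applies them once the HWM advances.)
> > > > >
> > > > > import java.util.ArrayDeque;
> > > > > import java.util.Deque;
> > > > > import java.util.List;
> > > > >
> > > > > // Illustrative only; MetadataRecord and apply() are placeholders, not KIP-500 names.
> > > > > class MetadataApplier {
> > > > >     interface MetadataRecord { long offset(); }
> > > > >
> > > > >     private final Deque<MetadataRecord> pending = new ArrayDeque<>();
> > > > >
> > > > >     void onFetchResponse(List<MetadataRecord> records, long highWatermark) {
> > > > >         pending.addAll(records);  // may include entries past the HWM
> > > > >         while (!pending.isEmpty() && pending.peek().offset() <= highWatermark) {
> > > > >             apply(pending.poll());  // only apply committed entries
> > > > >         }
> > > > >     }
> > > > >
> > > > >     void apply(MetadataRecord record) {
> > > > >         // update the local metadata cache
> > > > >     }
> > > > > }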
> > > > >
> > > > > >
> > > > > > In section "Broker Metadata Management", you mention "the
> > controller
> > > > will
> > > > > > send a full metadata image rather than a series of deltas". This
> > KIP
> > > > > > doesn't go into the set of operations that need to be supported
> on
> > > top
> > > > of
> > > > > > Raft but it would be interested if this "full metadata image"
> could
> > > be
> > > > > > express also as deltas. For example, assuming we are replicating
> a
> > > map
> > > > > this
> > > > > > "full metadata image" could be a sequence of "put" operations
> > (znode
> > > > > create
> > > > > > to borrow ZK semantics).
> > > > >
> > > > > The full image can definitely be expressed as a sum of deltas.  At
> > > > > some point, the number of deltas will get large enough that sending a
> > > > > full image is better, though.  One question that we're still thinking
> > > > > about is how much of this can be shared with generic Kafka log code,
> > > > > and how much should be different.
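> > > > >
> > > > > (A toy version of that decision, with an arbitrary threshold and
> > > > > invented parameter names, just to illustrate the trade-off:)
> > > > >
> > > > > // Toy policy; the real one would likely hinge on snapshot/compaction state.
> > > > > boolean shouldSendFullImage(long brokerOffset, long logStartOffset, long highWatermark) {
> > > > >     if (brokerOffset < logStartOffset) {
> > > > >         return true;  // the deltas the broker needs were already compacted away
> > > > >     }
> > > > >     return highWatermark - brokerOffset > 50_000;  // past some lag, a full image is cheaper
> > > > > }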
> > > > >
> > > > > >
> > > > > > In section "Broker Metadata Management", you mention "This
> request
> > > will
> > > > > > double as a heartbeat, letting the controller know that the
> broker
> > is
> > > > > > alive". In section "Broker State Machine", you mention "The
> > > > MetadataFetch
> > > > > > API serves as this registration mechanism". Does this mean that
> the
> > > > > > MetadataFetch Request will optionally include broker
> configuration
> > > > > > information?
> > > > >
> > > > > I was originally thinking that the MetadataFetchRequest should
> > > > > include broker configuration information.  Thinking about this more,
> > > > > maybe we should just have a special registration RPC that contains
> > > > > that information, to avoid sending it over the wire all the time.
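> > > > >
> > > > > (For illustration only, and subject to the follow-up KIPs, such a
> > > > > registration request might look roughly like this; every name here
> > > > > is invented.)
> > > > >
> > > > > import java.util.List;
> > > > > import java.util.Map;
> > > > >
> > > > > // Invented shape: sent once at startup, so the periodic MetadataFetch
> > > > > // heartbeat can stay small.
> > > > > class BrokerRegistrationRequest {
> > > > >     int brokerId;
> > > > >     String rack;
> > > > >     List<String> listeners;            // e.g. "PLAINTEXT://host:9092"
> > > > >     Map<String, String> staticConfigs; // broker configuration, reported once
> > > > > }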
> > > > >
> > > > > > Does this also mean that the MetadataFetch request will result in
> > > > > > a "write"/AppendEntries through the Raft replication protocol
> > > > > > before you can send the associated MetadataFetch response?
> > > > >
> > > > > I think we should require the broker to be out of the Offline state
> > > > > before allowing it to fetch metadata, yes.  So the separate
> > > > > registration RPC should have completed first.
> > > > >
> > > > > >
> > > > > > In section "Broker State", you mention that a broker can
> transition
> > > to
> > > > > > online after it is caught with the metadata. What do you mean by
> > > this?
> > > > > > Metadata is always changing. How does the broker know that it is
> > > caught
> > > > > up
> > > > > > since it doesn't participate in the consensus or the advancement
> of
> > > the
> > > > > > highwatermark?
> > > > >
> > > > > That's a good point.  Being "caught up" is somewhat of a fuzzy
> > > > > concept here, since the brokers do not participate in the metadata
> > > > > consensus.  I think ideally we would want to define it in terms of
> > > > > time ("the broker has all the updates from the last 2 minutes", for
> > > > > example.)  We should spell this out better in the KIP.
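> > > > >
> > > > > (In code form, that time-based definition could be as simple as the
> > > > > check below; the two-minute bound is just the example above.)
> > > > >
> > > > > // Sketch only; names are illustrative.
> > > > > boolean isCaughtUp(long lastAppliedRecordTimestampMs, long nowMs) {
> > > > >     return nowMs - lastAppliedRecordTimestampMs <= 2 * 60 * 1000L;
> > > > > }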
> > > > >
> > > > > >
> > > > > > In section "Start the controller quorum nodes", you mention "Once
> > it
> > > > has
> > > > > > taken over the /controller node, the active controller will
> proceed
> > > to
> > > > > load
> > > > > > the full state of ZooKeeper.  It will write out this information
> to
> > > the
> > > > > > quorum's metadata storage.  After this point, the metadata quorum
> > > will
> > > > be
> > > > > > the metadata store of record, rather than the data in ZooKeeper."
> > > > During
> > > > > > this migration do should we expect to have a small period
> > controller
> > > > > > unavailability while the controller replicas this state to all of
> > the
> > > > > raft
> > > > > > nodes in the controller quorum and we buffer new controller API
> > > > requests?
> > > > >
> > > > > Yes, the controller would be unavailable during this time.  I don't
> > > > > think this will be that different from the current period of
> > > > > unavailability when a new controller starts up and needs to load the
> > > > > full state from ZK.  The main difference is that in this period, we'd
> > > > > have to write to the controller quorum rather than just to memory.
> > > > > But we believe this should be pretty fast.
> > > > >
> > > > > regards,
> > > > > Colin
> > > > >
> > > > > >
> > > > > > Thanks!
> > > > > > -Jose
> > > > > >
> > > > >
> > > >
> > >
> >
>
