Hi Boyang,

Yes, there will be several KIPs that will discuss the items you describe in
detail. Colin, it may be helpful to make this clear in the KIP-500
description.

Ismael

On Sat, Aug 3, 2019 at 9:32 AM Boyang Chen <reluctanthero...@gmail.com>
wrote:

> Thanks Colin for initiating this important effort!
>
> One question I have is whether there will be a section discussing
> controller failover in the new architecture? I know we are using the Raft
> protocol for failover, but it's still valuable to discuss the steps a new
> cluster will take to reach a stable state again, so that we can easily
> measure the availability of the metadata servers.
>
> Another suggestion I have is to write a step-by-step design doc like what
> we did in KIP-98
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging>,
> including the new request protocols and how they interact in the new
> cluster. For a complicated change like this, an implementation design doc
> helps a lot in the review process; otherwise most discussions we have will
> focus on the high level and lose important details that we only discover
> in the post-agreement phase.
>
> Boyang
>
> On Fri, Aug 2, 2019 at 5:17 PM Colin McCabe <cmcc...@apache.org> wrote:
>
> > On Fri, Aug 2, 2019, at 16:33, Jose Armando Garcia Sancio wrote:
> > > Thanks Colin for the detailed KIP. I have a few comments and questions.
> > >
> > > In the KIP's Motivation and Overview you mentioned the LeaderAndIsr and
> > > UpdateMetadata RPCs. For example, "updates which the controller pushes,
> > > such as LeaderAndIsr and UpdateMetadata messages". Is your thinking that
> > > we will use MetadataFetch as a replacement for just UpdateMetadata, and
> > > add topic configuration to this state?
> > >
> >
> > Hi Jose,
> >
> > Thanks for taking a look.
> >
> > The goal is for MetadataFetchRequest to replace both LeaderAndIsrRequest
> > and UpdateMetadataRequest.  Topic configurations would be fetched along
> > with the other metadata.
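> >
> > To make that concrete, the request and response might carry something
> > roughly like the following (just a sketch; the field names are
> > placeholders, not part of the KIP):
> >
> > import java.util.List;
> > import java.util.Map;
> >
> > // Sketch only: a combined fetch covering leadership, ISR, and configs.
> > class MetadataFetchRequest {
> >     int brokerId;            // which broker is fetching
> >     long lastFetchedOffset;  // offset of the last metadata update applied
> > }
> >
> > class MetadataFetchResponse {
> >     long nextOffset;                        // where the next fetch resumes
> >     List<Object> updates;                   // leadership, ISR, topics...
> >     Map<String, Map<String, String>> topicConfigs;  // ...and topic configs
> > }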
> >
> > > In the section "Broker Metadata Management", you mention "Just like with
> > > a fetch request, the broker will track the offset of the last updates it
> > > fetched". To keep the log consistent, Raft requires that the followers
> > > keep all of the log entries (term/epoch and offset) that are after the
> > > highwatermark. Any log entry before the highwatermark can be
> > > compacted/snapshotted. Do we expect the MetadataFetch API to only return
> > > log entries up to the highwatermark, unlike the Raft replication API
> > > which will replicate/fetch log entries after the highwatermark for
> > > consensus?
> >
> > Good question.  Clearly, we shouldn't expose metadata updates to the
> > brokers until they've been stored on a majority of the Raft nodes.  The
> > most obvious way to do that, like you mentioned, is to have the brokers
> > only fetch up to the HWM, but not beyond.  There might be a more clever
> > way to do it by fetching the data, but not having the brokers act on it
> > until the HWM advances.  I'm not sure if that's worth it or not.  We'll
> > discuss this more in a separate KIP that discusses just Raft.
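> >
> > To illustrate the second option, the broker-side apply loop might look
> > roughly like this (just a sketch; the names are made up):
> >
> > import java.util.List;
> >
> > // Sketch only: fetch past the HWM, but never apply beyond it.
> > class MetadataApplier {
> >     interface MetadataRecord { long offset(); }
> >
> >     private long lastAppliedOffset = -1L;
> >
> >     void maybeApply(List<MetadataRecord> fetched, long highWatermark) {
> >         for (MetadataRecord record : fetched) {
> >             if (record.offset() >= highWatermark) {
> >                 break;  // buffered locally, applied once the HWM advances
> >             }
> >             lastAppliedOffset = record.offset();
> >             // ... update the local metadata cache here ...
> >         }
> >     }
> > }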
> >
> > >
> > > In section "Broker Metadata Management", you mention "the controller
> > > will send a full metadata image rather than a series of deltas". This
> > > KIP doesn't go into the set of operations that need to be supported on
> > > top of Raft, but it would be interesting if this "full metadata image"
> > > could also be expressed as deltas. For example, assuming we are
> > > replicating a map, this "full metadata image" could be a sequence of
> > > "put" operations (znode create, to borrow ZK semantics).
> >
> > The full image can definitely be expressed as a sum of deltas.  At some
> > point, the number of deltas will get large enough that sending a full
> > image is better, though.  One question that we're still thinking about is
> > how much of this can be shared with generic Kafka log code, and how much
> > should be different.
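> >
> > As a rough illustration of that trade-off (the threshold and names here
> > are purely hypothetical):
> >
> > // Sketch only: decide between sending deltas and a full snapshot.
> > class MetadataResponsePlanner {
> >     private static final int MAX_DELTAS_PER_RESPONSE = 10_000;
> >
> >     boolean shouldSendFullImage(long brokerLastFetchedOffset,
> >                                 long logEndOffset,
> >                                 long oldestRetainedOffset) {
> >         // If the broker's position was already compacted away, or the
> >         // backlog of deltas is very large, a full image is cheaper.
> >         return brokerLastFetchedOffset < oldestRetainedOffset
> >             || logEndOffset - brokerLastFetchedOffset > MAX_DELTAS_PER_RESPONSE;
> >     }
> > }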
> >
> > >
> > > In section "Broker Metadata Management", you mention "This request will
> > > double as a heartbeat, letting the controller know that the broker is
> > > alive". In section "Broker State Machine", you mention "The
> > > MetadataFetch API serves as this registration mechanism". Does this
> > > mean that the MetadataFetch request will optionally include broker
> > > configuration information?
> >
> > I was originally thinking that the MetadataFetchRequest should include
> > broker configuration information.  Thinking about this more, maybe we
> > should just have a special registration RPC that contains that
> > information, to avoid sending it over the wire all the time.
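> >
> > Something like the following, roughly (a sketch only; these field names
> > are placeholders, not a proposal):
> >
> > import java.util.List;
> > import java.util.Map;
> >
> > // Sketch only: a one-time registration RPC, separate from MetadataFetch,
> > // so static broker information isn't resent on every heartbeat.
> > class BrokerRegistrationRequest {
> >     int brokerId;
> >     String rack;
> >     List<String> advertisedListeners;   // endpoints for clients/brokers
> >     Map<String, String> staticConfigs;  // configuration sent once at startup
> > }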
> >
> > > Does this also mean that a MetadataFetch request will result in a
> > > "write"/AppendEntries through the Raft replication protocol before you
> > > can send the associated MetadataFetch response?
> >
> > I think we should require the broker to be out of the Offline state
> > before allowing it to fetch metadata, yes.  So the separate registration
> > RPC should have completed first.
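> >
> > In other words, the controller-side check might be as simple as this
> > (a sketch with made-up names, not the real handler):
> >
> > import java.util.Set;
> >
> > // Sketch only: refuse metadata fetches from brokers that never registered.
> > class FetchGate {
> >     private final Set<Integer> registeredBrokerIds;
> >
> >     FetchGate(Set<Integer> registeredBrokerIds) {
> >         this.registeredBrokerIds = registeredBrokerIds;
> >     }
> >
> >     boolean canFetchMetadata(int brokerId) {
> >         // Registration (leaving the Offline state) must happen first.
> >         return registeredBrokerIds.contains(brokerId);
> >     }
> > }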
> >
> > >
> > > In section "Broker State", you mention that a broker can transition to
> > > online after it is caught up with the metadata. What do you mean by
> > > this? Metadata is always changing. How does the broker know that it is
> > > caught up, since it doesn't participate in the consensus or the
> > > advancement of the highwatermark?
> >
> > That's a good point.  Being "caught up" is somewhat of a fuzzy concept
> > here, since the brokers do not participate in the metadata consensus.  I
> > think ideally we would want to define it in terms of time ("the broker
> > has all the updates from the last 2 minutes", for example).  We should
> > spell this out better in the KIP.
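> >
> > A time-based check could be as simple as the following (a sketch; the two
> > minute bound is just the example above, not a decided value):
> >
> > import java.time.Duration;
> >
> > // Sketch only: "caught up" = the newest applied metadata update is recent.
> > class CatchUpCheck {
> >     private static final Duration MAX_METADATA_LAG = Duration.ofMinutes(2);
> >
> >     boolean isCaughtUp(long lastAppliedUpdateTimestampMs, long nowMs) {
> >         return nowMs - lastAppliedUpdateTimestampMs <= MAX_METADATA_LAG.toMillis();
> >     }
> > }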
> >
> > >
> > > In section "Start the controller quorum nodes", you mention "Once it
> > > has taken over the /controller node, the active controller will proceed
> > > to load the full state of ZooKeeper.  It will write out this information
> > > to the quorum's metadata storage.  After this point, the metadata quorum
> > > will be the metadata store of record, rather than the data in
> > > ZooKeeper."  During this migration, should we expect a small period of
> > > controller unavailability while the controller replicates this state to
> > > all of the Raft nodes in the controller quorum and we buffer new
> > > controller API requests?
> >
> > Yes, the controller would be unavailable during this time.  I don't think
> > this will be that different from the current period of unavailability
> > when a new controller starts up and needs to load the full state from ZK.
> > The main difference is that in this period, we'd have to write to the
> > controller quorum rather than just to memory.  But we believe this should
> > be pretty fast.
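> >
> > Roughly, the takeover sequence would look like this (a sketch of the
> > steps only; the method names are placeholders):
> >
> > // Sketch only: order of operations when the quorum takes over from ZK.
> > class ZkMigration {
> >     void migrateFromZooKeeper() {
> >         claimControllerZNode();          // become the active controller in ZK
> >         Object image = loadFullStateFromZooKeeper();
> >         writeToMetadataQuorum(image);    // replicated to a majority of nodes
> >         // From here on, the quorum is the store of record and buffered
> >         // controller requests can start draining.
> >     }
> >
> >     private void claimControllerZNode() { /* ... */ }
> >     private Object loadFullStateFromZooKeeper() { return new Object(); }
> >     private void writeToMetadataQuorum(Object image) { /* ... */ }
> > }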
> >
> > regards,
> > Colin
> >
> > >
> > > Thanks!
> > > -Jose
> > >
> >
>
