Hi David,

Thanks for the thorough (and complicated!) proposal :)
Some questions:

1. "MigrationReady" state: The KIP said:
The KRaft quorum has been started
I don't think we'll automatically enter this state after KRaft quorum
started, do we?
I think KRaft active controller should take over leadership by writing a
znode in /controller path before we entering this sate, is that correct?

2. "MigrationActive" state: the KIP said:
ZK state has been migrated, controller is in dual-write mode, brokers are
being restarted in KRaft mode
What confuses me is the last part: "brokers are being restarted in KRaft
mode".
How could we detect brokers are being restarted in KRaft mode? Old ZK
broker is removed and new KRaft broker is up within N minutes?
I think we don't have to rely on the condition "brokers are being restarted
in KRaft mode" to enter this state.
"brokers are being restarted in KRaft mode" should be a process happened
between "MigrationActive" and "MigrationFinished". Does that make sense?

3. When "Zookeeper" mode trying to enter "MigrationEligible", if there is a
cluster with tens of brokers, how could users know why this cluster cannot
be in "MigrationEligible" state, yet? Check that znode manually one by one?
Or do we plan to create a tool to help them? Or maybe expose the "unready
ZK brokers" metrics?

4. Same for "MigrationActive" entering "MigrationFinished" state. Since
that will be some process for restarting ZK broker into KRaft broker one by
one, could we expose the "remaining ZK brokers" as metrics?

5. When users are in the "MigrationReady"/"MigrationActive"/
"MigrationFinished" states and they accidentally set the KRaft controller
config "kafka.metadata.migration.enable" to false, what will happen to the
cluster? Will dual-write simply stop? Will we have any protection against
this?
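To be concrete about the kind of protection I'm hoping for, something along
these lines at controller startup (entirely hypothetical names, just to show
the idea):

    // Hypothetical guard: if the metadata log already records an in-progress
    // migration, refuse to start with kafka.metadata.migration.enable=false
    // rather than silently dropping dual-write.
    static void checkMigrationConfig(boolean migrationEnabled, boolean migrationInProgress) {
        if (migrationInProgress && !migrationEnabled) {
            throw new IllegalStateException(
                "kafka.metadata.migration.enable=false, but a ZK migration is still " +
                "in progress; refusing to start so dual-write is not silently disabled");
        }
    }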

6. About the "MigrationState" metric:
The "ZooKeeper" and "MigrationEligible" is reported by ZK controller, and
the rest of states are reported by KRaft controller -->  makes sense to me.
One question from it is, when KRaft controller takes over the leadership
from ZK controller, what will the "MigrationState" value in old ZK
controller? keep in "MigrationEligible" doesn't make sense. Will there be a
empty or null state?

7. About the "MetadataType" metric:
An enumeration of: ZooKeeper (1), Dual (2), KRaft (3). Each broker reports
this.
I don't know how we could map the migration state to these 3 types.
What is the metadataType when cluster in "MigrationReady" state? Still
Zookeeper?
When will brokers enter Dual type?
This is unclear in the KIP.
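To show what I'm asking, here is my guess at the mapping, reusing the
MigrationState enum from my sketch under question 2. This is purely my own
assumption, which is why I'd like the KIP to spell it out:

    // My guess (not from the KIP) at how migration states map to the per-broker
    // MetadataType values ZooKeeper (1), Dual (2), KRaft (3).
    static int metadataTypeFor(MigrationState state, boolean brokerIsKRaft) {
        switch (state) {
            case ZOOKEEPER:
            case MIGRATION_ELIGIBLE:
            case MIGRATION_READY:   // still ZooKeeper here? this is the part I'm unsure about
                return 1;           // ZooKeeper
            case MIGRATION_ACTIVE:
                return brokerIsKRaft ? 3 : 2;  // or does every broker report Dual here?
            case MIGRATION_FINISHED:
                return 3;           // KRaft
            default:
                throw new IllegalArgumentException("Unknown state: " + state);
        }
    }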

Thank you.
Luke

On Thu, Oct 20, 2022 at 11:33 PM David Arthur
<david.art...@confluent.io.invalid> wrote:

> Igor, thanks for taking a look! Since JBOD in KRaft is still under
> discussion and not likely to land before the ZK migration, I think we'll
> need to defer it. For migrating JBOD clusters from ZK to KRaft, we'll also
> need to address the log dir failure mechanism which currently uses a
> special ZNode written to by the brokers. There is an old KIP
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> which proposes a new RPC to move that ZK write to the controller, but I'm
> not sure if that's the approach we'll want to take. I read through the
> discussion over in KIP-858 and it sounds like there are some good ideas
> there.
>
> To answer your question more directly, migrating ZK clusters using JBOD is
> out of scope for this KIP. It might be possible to stop using JBOD using
> reassignments, but I'm not sure about that.
>
> -David
>
> On Tue, Oct 18, 2022 at 12:17 PM Igor Soarez <i...@soarez.me> wrote:
>
> > Hi David,
> >
> > Thanks for the KIP, this is very exciting!
> >
> > How does JBOD relate to this work? KRaft mode doesn't yet support
> > configuring brokers with multiple log directories. If the brokers in the
> > existing cluster are configured with multiple log dirs, does the
> migration
> > imply that the existing brokers need to drop use of that feature? Or is
> > there some way to upgrade them later?
> >
> > Thanks,
> >
> > --
> > Igor
> >
> > On Mon, Oct 17, 2022, at 10:07 PM, David Arthur wrote:
> > > I've updated the KIP with the following changes (the confluence diff is
> > not
> > > helpful here since i rearranged some things)
> > >
> > > * Added ZooKeeperBlockingKRaftMillis metric
> > > * Added section on new broker registration JSON
> > > * Removed section on MigrationCheck RPC
> > > * Added change to UpdateMetadataRequest
> > > * Added section "Additional ZK Broker Configs" (includes configs to
> > connect
> > > to KRaft quorum)
> > > * Added section on "Incompatible Brokers" under Failure Modes
> > > * Clarified many things per this discussion thread
> > >
> > > I realized we need the KRaft controller to pick the correct
> > > "metadata.version" when initializing the migration. I included the IBP
> > of a
> > > broker in its registration data so the KRaft controller can verify the
> > IBP
> > > and pick the correct "metadata.version" when starting the migration.
> > > Otherwise, we could inadvertently downgrade the IBP/metadata.version as
> > > part of the migration.
> > >
> > > I also added "clusterId" to the broker registration data so the KRaft
> > could
> > > verify it.
> > >
> > > -David
> > >
> > >
> > >
> > > On Mon, Oct 17, 2022 at 12:14 PM Colin McCabe <cmcc...@apache.org>
> > wrote:
> > >
> > >> On Fri, Oct 14, 2022, at 11:39, Jun Rao wrote:
> > >> > Hi, Colin,
> > >> >
> > >> > 10. "That all goes away in the new mode, and we just have some code
> > which
> > >> > analyzes __cluster_metadata and reflects it in 1) updates to ZK and
> 2)
> > >> > messages sent out to brokers."
> > >> > Hmm, I am not sure it's that simple. Some of the complexity of the
> > >> ZK-based
> > >> > controller are (1) using the pipelining approach to write to ZK for
> > >> better
> > >> > throughput and using conditional writes for correctness; (2) sending
> > the
> > >> > proper LeaderAndIsr and UpdateMetadata requests. For example, during
> > >> > controller failover, the full metadata needs to be sent while during
> > >> > individual broker failure, only some of the metadata needs to be
> > updated.
> > >> > The controlled shutdown handling sometimes uses StopReplicaRequest
> > and
> > >> > some other times uses LeaderAndIsrRequest. (3) triggering new events
> > >> based
> > >> > on the responses of LeaderAndIsr (e.g. for topic deletion). Some of
> > those
> > >> > complexity could be re-implemented in a more efficient way, but we
> > need
> > >> to
> > >> > be really careful not to generate regression. Some of the other
> > >> complexity
> > >> > just won't go away. Reimplementing all those logic for the 30 or so
> > >> events
> > >> > in the ZK-based controller is possible, but seems a bit daunting and
> > >> risky.
> > >>
> > >> Hi Jun,
> > >>
> > >> Thanks for the response.  I agree that there is some work here but I
> > don't
> > >> think it's as bad as it might seem. Most of the code for writing to ZK
> > can
> > >> be reused from the old controller. We should at least evaluate using
> the
> > >> old ControllerChannelManager as well. I don't know if we'll end up
> > doing it
> > >> or not but it's a possibility. (The main reason to not do it is that
> the
> > >> response handling will be a bit different)
> > >>
> > >> The question of what to do during a controller failover is an
> > interesting
> > >> one. Technically, we should be able to continue sending incremental
> > updates
> > >> to the legacy nodes, for the same reason we can in KRaft mode.
> However,
> > >> probably we should just copy what ZK mode does and send them a full
> > >> metadata update. This will allow us to have an easy way to handle
> > >> divergences caused by lost RPCs by changing the controller (just as we
> > do
> > >> in ZK mode, unfortunately). I suppose we should document this in the
> > KIP...
> > >>
> > >> I agree controlled shutdown is tricky to get just right. I suppose
> this
> > is
> > >> a case where the RPCs we send out are not purely "fire and forget"; we
> > have
> > >> to listen for the response. But that can be done in an event-based
> way.
> > >> Controlled shutdown is also probably the last thing we'll implement
> > once we
> > >> have the basic lift and shift.
> > >>
> > >> Topic deletion is another annoying thing. I wonder if we could just
> > delete
> > >> immediately here and skip the complexity of implementing "deleting
> > state."
> > >> Topic IDs will exist in IBP 3.4, in both ZK and KRaft mode, so it
> > should be
> > >> OK, right?
> > >>
> > >> best,
> > >> Colin
> > >>
> > >>
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jun
> > >> >
> > >> > On Fri, Oct 14, 2022 at 9:29 AM Colin McCabe <cmcc...@apache.org>
> > wrote:
> > >> >
> > >> >> On Thu, Oct 13, 2022, at 11:44, Jun Rao wrote:
> > >> >> > Hi, Colin,
> > >> >> >
> > >> >> > Thanks for the reply.
> > >> >> >
> > >> >> > 10. This is a bit on the implementation side. If you look at the
> > >> existing
> > >> >> > ZK-based controller, most of the logic is around maintaining an
> > >> in-memory
> > >> >> > state of all the resources (broker, topic, partition, etc),
> > >> >> reading/writing
> > >> >> > to ZK, sending LeaderAndIsr and UpdateMetadata requests and
> > handling
> > >> the
> > >> >> > responses to brokers. So we need all that logic in the dual write
> > >> mode.
> > >> >> One
> > >> >> > option is to duplicate all that logic in some new code. This can
> > be a
> > >> bit
> > >> >> > error prone and makes the code a bit harder to maintain if we
> need
> > to
> > >> fix
> > >> >> > some critical issues in ZK-based controllers. Another option is
> to
> > try
> > >> >> > reusing the existing code in the ZK-based controller. For
> example,
> > we
> > >> >> could
> > >> >> > start the EventManager in the ZK-based controller, but let the
> > KRaft
> > >> >> > controller ingest new events. This has its own challenges: (1)
> the
> > >> >> existing
> > >> >> > logic only logs ZK failures and doesn't expose them to the
> caller;
> > (2)
> > >> >> the
> > >> >> > existing logic may add new events to the queue itself and we
> > probably
> > >> >> need
> > >> >> > to think through how this is coordinated with the KRaft
> controller;
> > >> (3)
> > >> >> it
> > >> >> > registers some ZK listeners unnecessarily (may not be a big
> > concern).
> > >> So
> > >> >> we
> > >> >> > need to get around those issues somehow. I am wondering if we
> have
> > >> >> > considered both options and which approach we are leaning towards
> > for
> > >> the
> > >> >> > implementation.
> > >> >> >
> > >> >>
> > >> >> Yes, this is a good question. My take is that a big part of the
> > >> complexity
> > >> >> in the old controller code results from the fact that we use ZK as
> a
> > >> >> multi-writer database for propagating information between different
> > >> >> components. So in the old controller, every write to ZK needs to be
> > >> >> structured as a compare and swap to be fully correct. Every time we
> > get
> > >> >> notified about something, it's usually in the form of "this znode
> > >> changed"
> > >> >> which prompts a full reload of part of the data in ZK (which itself
> > has
> > >> >> multiple parts, loading, deserializing, reconciling, etc.) That all
> > goes
> > >> >> away in the new mode, and we just have some code which analyzes
> > >> >> __cluster_metadata and reflects it in 1) updates to ZK and 2)
> > messages
> > >> sent
> > >> >> out to brokers.
> > >> >>
> > >> >> This is pretty decoupled from the other logic in QuorumController
> and
> > >> >> should be easy to unit test, since the same inputs from the log
> > always
> > >> >> produce the same output in ZK. Basically, ZK is write-only for us,
> > we do
> > >> >> not read it (with the big exception of broker registration znodes)
> > and I
> > >> >> think that will greatly simplify things.
> > >> >>
> > >> >> So I think dual-write mode as described here will be substantially
> > >> simpler
> > >> >> than trying to run part or all of the old controller in parallel. I
> > do
> > >> >> think we will reuse a bunch of the serialization / deserialization
> > code
> > >> for
> > >> >> znodes and possibly the code for communicating with ZK.
> > >> >>
> > >> >> best,
> > >> >> Colin
> > >> >>
> > >> >>
> > >> >> >
> > >> >> > 14. Good point and make sense.
> > >> >> >
> > >> >> > Thanks,
> > >> >> >
> > >> >> > Jun
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > On Wed, Oct 12, 2022 at 3:27 PM Colin McCabe <cmcc...@apache.org
> >
> > >> wrote:
> > >> >> >
> > >> >> >> Hi Jun,
> > >> >> >>
> > >> >> >> Thanks for taking a look. I can answer some questions here
> > because I
> > >> >> >> collaborated on this a bit, and David is on vacation for a few
> > days.
> > >> >> >>
> > >> >> >> On Wed, Oct 12, 2022, at 14:41, Jun Rao wrote:
> > >> >> >> > Hi, David,
> > >> >> >> >
> > >> >> >> > Thanks for the KIP. A few comments below.
> > >> >> >> >
> > >> >> >> > 10. It's still not very clear to me how the KRaft controller
> > works
> > >> in
> > >> >> the
> > >> >> >> > dual writes mode to KRaft log and ZK when the brokers still
> run
> > in
> > >> ZK
> > >> >> >> mode.
> > >> >> >> > Does the KRaft controller run a ZK based controller in
> parallel
> > or
> > >> do
> > >> >> we
> > >> >> >> > derive what needs to be written to ZK based on KRaft
> controller
> > >> logic?
> > >> >> >>
> > >> >> >> We derive what needs to be written to ZK based on KRaft
> controller
> > >> >> logic.
> > >> >> >>
> > >> >> >> > I am also not sure how the KRaft controller handles broker
> > >> >> >> > registration/deregistration, since brokers are still running
> in
> > ZK
> > >> >> mode
> > >> >> >> and
> > >> >> >> > are not heartbeating to the KRaft controller.
> > >> >> >>
> > >> >> >> The new controller will listen for broker registrations under
> > >> /brokers.
> > >> >> >> This is the only znode watch that the new controller will do.
> > >> >> >>
> > >> >> >> We did consider changing how ZK-based broker registration
> worked,
> > >> but it
> > >> >> >> just ended up being too much work for not enough gain.
> > >> >> >>
> > >> >> >> >
> > >> >> >> > 12. "A new set of nodes will be provisioned to host the
> > controller
> > >> >> >> quorum."
> > >> >> >> > I guess we don't support starting the KRaft controller quorum
> on
> > >> >> existing
> > >> >> >> > brokers. It would be useful to make that clear.
> > >> >> >> >
> > >> >> >>
> > >> >> >> Agreed
> > >> >> >>
> > >> >> >> > 13. "Once the quorum is established and a leader is elected,
> the
> > >> >> >> controller
> > >> >> >> > will check the state of the cluster using the MigrationCheck
> > RPC."
> > >> How
> > >> >> >> does
> > >> >> >> > the quorum controller detect other brokers? Does the
> controller
> > >> node
> > >> >> need
> > >> >> >> > to be configured with ZK connection string? If so, it would be
> > >> useful
> > >> >> to
> > >> >> >> > document the additional configs that the quorum controller
> > needs to
> > >> >> set.
> > >> >> >> >
> > >> >> >>
> > >> >> >> Yes, the controllers monitor ZK for broker registrations, as I
> > >> mentioned
> > >> >> >> above. So they need zk.connect and the other ZK connection
> > >> >> configurations.
> > >> >> >>
> > >> >> >> > 14. "In order to prevent further writes to ZK, the first thing
> > the
> > >> new
> > >> >> >> > KRaft quorum must do is take over leadership of the ZK
> > controller.
> > >> "
> > >> >> The
> > >> >> >> ZK
> > >> >> >> > controller processing changes to /controller update
> > asynchronously.
> > >> >> How
> > >> >> >> > does the KRaft controller know when the ZK controller has
> > resigned
> > >> >> before
> > >> >> >> > it can safely copy the ZK data?
> > >> >> >> >
> > >> >> >>
> > >> >> >> This should be done through expectedControllerEpochZkVersion,
> just
> > >> like
> > >> >> in
> > >> >> >> ZK mode, right? We should bump this epoch value so that any
> writes
> > >> from
> > >> >> the
> > >> >> >> old controller will not go through. I agree we should spell this
> > out
> > >> in
> > >> >> the
> > >> >> >> KIP.
> > >> >> >>
> > >> >> >> > 15. We have the following sentences. One says ControllerId is
> a
> > >> random
> > >> >> >> > KRaft broker and the other says it's the active controller.
> > Which
> > >> one
> > >> >> is
> > >> >> >> > correct?
> > >> >> >> > "UpdateMetadata: for certain metadata changes, the KRaft
> > controller
> > >> >> will
> > >> >> >> > need to send UpdateMetadataRequests to the ZK brokers. For the
> > >> >> >> > “ControllerId” field in this request, the controller should
> > >> specify a
> > >> >> >> > random KRaft broker."
> > >> >> >> > "In the UpdateMetadataRequest sent by the KRaft controller to
> > the
> > >> ZK
> > >> >> >> > brokers, the ControllerId will point to the active controller
> > which
> > >> >> will
> > >> >> >> be
> > >> >> >> > used for the inter-broker requests."
> > >> >> >> >
> > >> >> >>
> > >> >> >> Yeah, this seems like an error to me as well. A random value is
> > not
> > >> >> really
> > >> >> >> useful. Plus the text here is self-contradictory, as you pointed
> > out.
> > >> >> >>
> > >> >> >> I suspect what we should do here is add a new field,
> > >> KRaftControllerId,
> > >> >> >> and populate it with the real controller ID, and leave the old
> > >> >> controllerId
> > >> >> >> field as -1. A ZK-based broker that sees this can then consult
> its
> > >> >> >> controller.quorum.voters configuration to see where it should
> send
> > >> >> >> controller-bound RPCs. That (static) configuration lets us map
> > >> between
> > >> >> >> controller ID and host:port.
> > >> >> >>
> > >> >> >> We should still keep our existing epoch logic for deciding when
> > >> >> >> UpdateMetadataRequest / LeaderAndIsrRequests are stale, with the
> > >> caveat
> > >> >> >> that any kraft-based epoch should be treated as greater than any
> > >> >> ZK-based
> > >> >> >> epoch. After all, the kraft epoch is coming from the epoch of
> > >> >> >> __cluster_metadata, whereas the ZK epoch comes from ZK.
> > >> >> >>
> > >> >> >> >
> > >> >> >> > 16. "Additionally, the controller must specify if a broker in
> > >> >> >> “LiveBrokers”
> > >> >> >> > is KRaft or ZK." Does that require any protocol changes to
> > >> >> >> UpdateMetadata?
> > >> >> >> >
> > >> >> >>
> > >> >> >> Yeah, I am also curious why the we need to care whether brokers
> > are
> > >> ZK
> > >> >> or
> > >> >> >> KRaft in UpdateMetadataRequest. We don't reveal this to clients,
> > so
> > >> can
> > >> >> we
> > >> >> >> just leave this out?
> > >> >> >>
> > >> >> >> best,
> > >> >> >> Colin
> > >> >> >>
> > >> >> >> > Thanks,
> > >> >> >> >
> > >> >> >> > Jun
> > >> >> >> >
> > >> >> >> > On Wed, Oct 5, 2022 at 10:07 AM Mickael Maison <
> > >> >> mickael.mai...@gmail.com
> > >> >> >> >
> > >> >> >> > wrote:
> > >> >> >> >
> > >> >> >> >> Hi David,
> > >> >> >> >>
> > >> >> >> >> Thanks for starting this important KIP.
> > >> >> >> >>
> > >> >> >> >> I've just taken a quick look so far but I've got a couple of
> > >> initial
> > >> >> >> >> questions:
> > >> >> >> >>
> > >> >> >> >> 1) What happens if a non KRaft compatible broker (or with
> > >> >> >> >> kafka.metadata.migration.enable set to false) joins the
> cluster
> > >> after
> > >> >> >> >> the migration is triggered?
> > >> >> >> >>
> > >> >> >> >> 2) In the Failure Modes section you mention a scenario where
> a
> > >> write
> > >> >> >> >> to ZK fails. What happens when the divergence limit is
> > reached? Is
> > >> >> >> >> this a fatal condition? How much divergence should we allow?
> > >> >> >> >>
> > >> >> >> >> Thanks,
> > >> >> >> >> Mickael
> > >> >> >> >>
> > >> >> >> >> On Wed, Oct 5, 2022 at 12:20 AM David Arthur <
> mum...@gmail.com
> > >
> > >> >> wrote:
> > >> >> >> >> >
> > >> >> >> >> > Hey folks, I wanted to get the ball rolling on the
> discussion
> > >> for
> > >> >> the
> > >> >> >> >> > ZooKeeper migration KIP. This KIP details how we plan to do
> > an
> > >> >> online
> > >> >> >> >> > migration of metadata from ZooKeeper to KRaft as well as a
> > >> rolling
> > >> >> >> >> > upgrade of brokers to KRaft mode.
> > >> >> >> >> >
> > >> >> >> >> > The general idea is to keep KRaft and ZooKeeper in sync
> > during
> > >> the
> > >> >> >> >> > migration, so both types of brokers can exist
> simultaneously.
> > >> Then,
> > >> >> >> >> > once everything is migrated and updated, we can turn off
> > >> ZooKeeper
> > >> >> >> >> > writes.
> > >> >> >> >> >
> > >> >> >> >> > This is a pretty complex KIP, so please take a look :)
> > >> >> >> >> >
> > >> >> >> >> >
> > >> >> >> >>
> > >> >> >>
> > >> >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration
> > >> >> >> >> >
> > >> >> >> >> > Thanks!
> > >> >> >> >> > David
> > >> >> >> >>
> > >> >> >>
> > >> >>
> > >>
> > >
> > >
> > > --
> > > -David
> >
>
>
> --
> -David
>
