Jun,

20/21 If we use a tagged field, then I don't think clients need to be
concerned with it, right? In the ApiVersionsResponse sent by brokers to
clients, this field would be omitted. Clients can't connect directly to the
KRaft controller nodes. Also, we have a precedent of sending controller
node state between controllers through ApiVersions ("metadata.version"
feature), so I think it fits well.
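
To make that concrete, here is a rough sketch of the idea (the class and
method names are made up for illustration, not the actual response builder
in the code):

    // Hypothetical sketch: the ZkMigrationReady tagged field is only populated
    // on ApiVersionsResponse sent over the controller listener, so brokers and
    // clients never see it.
    public final class ApiVersionsResponseSketch {
        private Boolean zkMigrationReady; // tagged/optional: null means "omitted"

        static ApiVersionsResponseSketch forListener(boolean isControllerListener,
                                                     boolean migrationConfigured) {
            ApiVersionsResponseSketch resp = new ApiVersionsResponseSketch();
            if (isControllerListener) {
                // Only other controllers get this field.
                resp.zkMigrationReady = migrationConfigured;
            }
            return resp;
        }
    }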

40. For new KRaft clusters, we could add another state to indicate it was
not migrated from ZK, but created as a KRaft cluster. Maybe
"kafka.controller:type=KafkaController,name=MigrationState" => "None" ? We
could also omit that metric for unmigrated clusters, but I'm not a fan of
using the absence of a value to signal something.

-----

Akhilesh, thanks for reviewing the KIP!

1. MigrationState and MetadataType are mostly the same on the controller,
but we do have the "MigratingZkData" state that seems useful to report as a
metric. Aside from looking at logs, observing the controller in this state
is the only way to see how long it's taking to copy data from ZK.

"KRaftMode" instead of "MigrationFinalized" is similar to Jun's question
about non-migrated clusters. I think it's useful to have a distinct
MigrationState for clusters that have been migrated and those that were
never migrated. This does mean we'll report the MigrationState long after
the migration is complete, but we can drop these metrics in 4.0 once ZK is
removed.
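
To sketch the distinction I have in mind (names are not final, and this is
only an illustration of the states mentioned in this thread, leaving out the
ZK-side states):

    // Rough sketch of possible MigrationState gauge values; names are illustrative.
    public enum MigrationStateSketch {
        NONE,                 // created as a KRaft cluster, never migrated from ZK
        MIGRATING_ZK_DATA,    // KRaft controller is copying ZK data into the metadata log
        DUAL_WRITE,           // dual-write mode: metadata written to the KRaft log and ZK
        MIGRATION_FINALIZED   // migration complete; reported until ZK support is removed in 4.0
    }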

2. The "ZkMigrationReady" will indicate that the controller has
"kafka.metadata.migration.enable" _and_ the ZK configs set. We need some
way to indicate that the whole quorum is correctly configured to handle the
migration so we don't fail over to a controller that's not configured for
ZK. Did I understand your question correctly?
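
In other words, before entering (or resuming) the migration, the active
controller would want something like the following to hold across the quorum
(hypothetical names, just to illustrate the check):

    // Sketch: only proceed if every quorum member advertised ZkMigrationReady,
    // i.e. it has kafka.metadata.migration.enable and the ZK configs set.
    import java.util.Collection;

    public final class ZkMigrationQuorumCheck {
        public enum PeerReadiness { READY, NOT_READY, UNKNOWN }

        public static boolean quorumReady(Collection<PeerReadiness> peers) {
            return !peers.isEmpty()
                && peers.stream().allMatch(p -> p == PeerReadiness.READY);
        }
    }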

3. Yea, good idea. While the KRaft controller has
MigrationState=MigrationIneligible, we could also report
"kafka.controller:type=KafkaController,name=MigrationInelgibleBrokerCount".
It might be useful to report ineligible controllers as well since that can
prevent the migration from starting.
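
Something along these lines is what I'm picturing for the gauge (again,
hypothetical names and data source, just to illustrate):

    // Sketch: gauge counting registered ZK brokers that have not set the
    // migration flag and are therefore blocking the migration from starting.
    import java.util.Collection;
    import java.util.function.Supplier;

    public final class MigrationIneligibleBrokerCountSketch {
        interface ZkBrokerRegistration { boolean migrationEnabled(); }

        static Supplier<Integer> ineligibleBrokerCount(
                Supplier<Collection<ZkBrokerRegistration>> zkRegistrations) {
            return () -> (int) zkRegistrations.get().stream()
                    .filter(b -> !b.migrationEnabled())
                    .count();
        }
    }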

4. I think I covered this in "Incompatible Brokers". We effectively fence
these brokers by not sending them metadata RPCs.

Thanks!
David


On Tue, Nov 1, 2022 at 3:58 PM Akhilesh Chaganti <akhilesh....@gmail.com>
wrote:

> Hi David,
>
>
> Thanks for the KIP. I have some questions/suggestions.
>
>
> 1) I see two new metrics:
> kafka.controller:type=KafkaController,name=MetadataType and
> kafka.controller:type=KafkaController,name=MigrationState. Won't the second
> metric already cover the cases of the first metric? Also, instead of
> MigrationFinalized, we could directly say the state is KRaftMode. So we can
> use the same value for default KRaft clusters.
>
>
> 2) ZkMigrationReady in ApiVersionsResponse from KRaft Controller. By
> default, we plan to start the Controller quorum in "
> *kafka.metadata.migration.enable*" config set to true. Then do we need this
> additional information again to make sure The controllers are ready for
> migration? What would happen if the Controller assumes it is ready for
> migration from 3.4 by default if it doesn't see both MigrationMetadata
> records?
>
>
> 3) I see that we do not impose order on rolling the brokers with migration
> flags and provisioning the controller quorum. Along with the KRaft
> controller emitting migration state metrics, it may be better to emit the
> broker count for the brokers not ready for migration yet. This will give us
> more insight into any roll issues.
>
>
> 4) Once the KRaft controller is in migration mode, we should also
> prevent/handle ZkBrokerRegistrations that don't enable migration mode.
>
>
> Thanks
> Akhilesh
>
>
> On Tue, Nov 1, 2022 at 10:49 AM Jun Rao <j...@confluent.io.invalid> wrote:
>
> > Hi, David,
> >
> > Thanks for the reply.
> >
> > 20/21. Regarding the new ZkMigrationReady field in ApiVersionsResponse,
> it
> > seems that this is a bit intrusive since it exposes unneeded info to the
> > clients. Another option is to add that field as part of the Fetch
> request.
> > We can choose to only set that field in the very first Fetch request
> from a
> > Quorum follower.
> >
> > 40. For kafka.controller:type=KafkaController,name=MigrationState, what
> is
> > the value for a brand new KRaft cluster?
> >
> > Jun
> >
> > On Mon, Oct 31, 2022 at 2:35 PM David Arthur
> > <david.art...@confluent.io.invalid> wrote:
> >
> > > 30. I think we can keep the single ControllerId field in those requests
> > > since they are only used for fencing (as far as I know). Internally,
> the
> > > broker components that handle those requests will compare the
> > ControllerId
> > > with that of MetadataCache (which is updated via UMR).
> > >
> > > The reason we need the separate KRaftControllerId in the UpdateMetadata
> > > code path is so that we can have different connection behavior for a KRaft
> > > controller vs ZK controller.
> > >
> > > 31. It seems reasonable to keep the MigrationRecord in the snapshot. I
> > was
> > > thinking the same thing in terms of understanding the loss for a
> > > migration-after-finalization. However, once a snapshot has been taken
> > that
> > > includes the final MigrationRecord, we can't easily see which records
> > came
> > > after it.
> > >
> > > 32. You're correct, we can just use the modify time from the Stat. The
> > > other two fields are primarily informational and are there for
> operators
> > > who want to inspect the state of the migration. They aren't required
> for
> > > correctness
> > >
> > > 33. Yes that's right. I detail that in "Controller Leadership" section
> > >
> > > 34. Right, I'll fix that.
> > >
> > > Thanks,
> > > David
> > >
> > > On Mon, Oct 31, 2022 at 2:55 PM Jun Rao <j...@confluent.io.invalid>
> > wrote:
> > >
> > > > Hi, David,
> > > >
> > > > Thanks for the updated KIP. A few more comments.
> > > >
> > > > 30. LeaderAndIsrRequest/StopReplicaRequest both have a controllerId
> > > field.
> > > > Should we add a KRaftControllerId field like UpdateMetadata?
> > > >
> > > > 31. "If a migration has been finalized, but the KRaft quroum comes up
> > > with
> > > > kafka.metadata.migration.enable, we must not re-enter the migration
> > mode.
> > > > In this case, while replaying the log, the controller can see the
> > second
> > > > MigrationRecord and know that the migration is finalized and should
> not
> > > be
> > > > resumed. " Hmm, do we want to keep the MigrationRecord in the
> snapshot
> > > and
> > > > the metadata log forever after migration is finalized? If not, we
> can't
> > > > know for sure whether a migration has happened or not. Also, it might
> > be
> > > > useful to support switching back to ZK mode after the migration is
> > > > finalized, with the understanding of potential metadata loss. In that
> > > case,
> > > > we could just trim all metadata log and recopy the ZK metadata back.
> > > >
> > > > 32. The /migration node in ZK: Do we need last_update_time_ms since
> ZK
> > > > Stats already has an MTime? Also, how do we plan to use the
> > > > kraft_controller_id and kraft_controller_epoch fields?
> > > >
> > > > 33. Controller migration: We will force a write to the "/controller"
> > and
> > > > "/controller_epoch" ZNodes before copying ZK data, right?
> > > >
> > > > 34. "Operator can remove the persistent "/controller" and
> > > > "/controller_epoch" nodes allowing for ZK controller election to take
> > > > place". I guess the operator only needs to remove the /controller
> path?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Oct 31, 2022 at 7:17 AM David Arthur
> > > > <david.art...@confluent.io.invalid> wrote:
> > > >
> > > > > Happy Monday, everyone! I've updated the KIP with the following
> > > changes:
> > > > >
> > > > > * Clarified MetadataType metric usages (broker vs controller)
> > > > > * Added ZkMigrationReady tagged field to ApiVersionsResponse (for
> use
> > > by
> > > > > KRaft controller quorum)
> > > > > * Added MigrationRecord with two states: Started and Finished
> > > > > * Documented ZK configs for KRaft controller
> > > > > * Simplified state machine description (internally, more states
> will
> > > > exist,
> > > > > but only the four documented are interesting to operators)
> > > > > * Clarified some things in Controller Migration section
> > > > > * Removed KRaft -> ZK parts of Broker Registration
> > > > > * Added Misconfigurations section to Failure Modes
> > > > >
> > > > > Let me know if I've missed anything from the past two weeks of
> > > > discussion.
> > > > >
> > > > > Thanks again to everyone who has reviewed this KIP so far!
> > > > >
> > > > > -David
> > > > >
> > > > > On Fri, Oct 28, 2022 at 2:26 PM Jun Rao <j...@confluent.io.invalid>
> > > > wrote:
> > > > >
> > > > > > Hi, David,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 20/21. Sounds good.
> > > > > >
> > > > > > Could you update the doc with all the changes being discussed?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Fri, Oct 28, 2022 at 10:11 AM David Arthur
> > > > > > <david.art...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > Jun,
> > > > > > >
> > > > > > > 20/21. I was also wondering about a "migration" record. In
> > addition
> > > > to
> > > > > > the
> > > > > > > scenario you mentioned, we also need a way to prevent the
> cluster
> > > > from
> > > > > > > re-entering the dual write mode after the migration has been
> > > > > finalized. I
> > > > > > > could see this happening inadvertently via a change in some
> > > > > configuration
> > > > > > > management system. How about we add a record that marks the
> > > beginning
> > > > > and
> > > > > > > end of the dual-write mode. The first occurrence of the record
> > > could
> > > > be
> > > > > > > included in the metadata transaction when we migrate data from
> > ZK.
> > > > > > >
> > > > > > > With this, the active controller would decide whether to enter
> > dual
> > > > > write
> > > > > > > mode, finalize the migration, or fail, based on:
> > > > > > >
> > > > > > > * Metadata log state
> > > > > > > * Its own configuration ("kafka.metadata.migration.enable",
> > > > > > > "zookeeper.connect", etc)
> > > > > > > * The other controllers' configuration (via ApiVersionsResponse)
> > > > > > >
> > > > > > > WDYT?
> > > > > > >
> > > > > > > 22. Since we will need the fencing anyway as a safeguard, then I
> > > > > > > agree we could skip the registration of KRaft brokers in ZK to
> > > > > > > simplify things a bit.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > David
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Oct 27, 2022 at 5:11 PM Jun Rao
> <j...@confluent.io.invalid
> > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi, David,
> > > > > > > >
> > > > > > > > Thanks for the reply.
> > > > > > > >
> > > > > > > > 20/21. Relying upon the presence of ZK configs to determine
> > > whether
> > > > > the
> > > > > > > > KRaft controller is in a dual write mode seems a bit error
> > prone.
> > > > If
> > > > > > > > someone accidentally adds a ZK configuration to a brand new
> > KRaft
> > > > > > > cluster,
> > > > > > > > ideally it shouldn't cause the controller to get into a weird
> > > > state.
> > > > > > Have
> > > > > > > > we considered storing the migration state in a metadata
> record?
> > > > > > > >
> > > > > > > > 22. If we have the broker fencing logic, do we need to write
> > the
> > > > > broker
> > > > > > > > registration path in ZK for KRaft brokers at all?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Oct 27, 2022 at 1:02 PM David Arthur
> > > > > > > > <david.art...@confluent.io.invalid> wrote:
> > > > > > > >
> > > > > > > > > Jun,
> > > > > > > > >
> > > > > > > > > 20/21. A KRaft controller will recover the migration state
> by
> > > > > reading
> > > > > > > the
> > > > > > > > > "/migration" ZNode. If the migration enable config is set,
> > and
> > > > the
> > > > > ZK
> > > > > > > > > migration is complete, it will enter the dual-write mode.
> > > Before
> > > > an
> > > > > > > > > operator can decommission ZK, they will need to finalize
> the
> > > > > > migration
> > > > > > > > > which involves removing the migration config and the ZK
> > config.
> > > > > I'll
> > > > > > > > > clarify this in the KIP.
> > > > > > > > >
> > > > > > > > > 22. Yea, we could see an incorrect broker ID during that
> > > window.
> > > > > If
> > > > > > we
> > > > > > > > > ended up with a state where we saw a ZK broker ID that
> > > conflicted
> > > > > > with
> > > > > > > a
> > > > > > > > > KRaft broker ID, we would need to fence one of them. I
> would
> > > > > probably
> > > > > > > opt
> > > > > > > > > to fence the KRaft broker in that case since broker
> > > registration
> > > > > and
> > > > > > > > > fencing is more robust in KRaft. Hopefully this is a rare
> > case.
> > > > > > > > >
> > > > > > > > > 26. Sounds good.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Oct 27, 2022 at 1:34 PM Jun Rao
> > > <j...@confluent.io.invalid
> > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi, David,
> > > > > > > > > >
> > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > >
> > > > > > > > > > 20/21. Using a tagged field in ApiVersionRequest could
> > work.
> > > > > > Related
> > > > > > > to
> > > > > > > > > > this, how does a KRaft controller know that it's in the
> > dual
> > > > > write
> > > > > > > > mode?
> > > > > > > > > > Does it need to read the /controller path from ZK? After
> > the
> > > > > > > migration,
> > > > > > > > > > people may have the ZK cluster decommissioned, but still
> > have
> > > > the
> > > > > > ZK
> > > > > > > > > > configs left in the KRaft controller. Will this cause the
> > > KRaft
> > > > > > > > > controller
> > > > > > > > > > to be stuck because it doesn't know which mode it is in?
> > > > > > > > > >
> > > > > > > > > > 22. Using the ephemeral node matches the current ZK-based
> > > > broker
> > > > > > > > behavior
> > > > > > > > > > better. However, it leaves a window for incorrect broker
> > > > > > registration
> > > > > > > > to
> > > > > > > > > > sneak in during KRaft controller failover.
> > > > > > > > > >
> > > > > > > > > > 26. Then, we could just remove Broker Registration in
> that
> > > > > section.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Wed, Oct 26, 2022 at 2:21 PM David Arthur
> > > > > > > > > > <david.art...@confluent.io.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > > Jun
> > > > > > > > > > >
> > > > > > > > > > > 20/21 It could definitely cause problems if we failover
> > to
> > > a
> > > > > > > > controller
> > > > > > > > > > > without "kafka.metadata.migration.enable". The only
> > > > mechanism I
> > > > > > > know
> > > > > > > > of
> > > > > > > > > > for
> > > > > > > > > > > controllers to learn things about one another is
> > > ApiVersions.
> > > > > We
> > > > > > > > > > currently
> > > > > > > > > > > use this for checking support for "metadata.version"
> (in
> > > > KRaft
> > > > > > > mode).
> > > > > > > > > We
> > > > > > > > > > > could add a "zk.migration" feature flag that's enabled
> > on a
> > > > > > > > controller
> > > > > > > > > > only
> > > > > > > > > > > if the config is set. Another possibility would be a
> > tagged
> > > > > field
> > > > > > > on
> > > > > > > > > > > ApiVersionResponse that indicated if the config was
> set.
> > > Both
> > > > > > seem
> > > > > > > > > > somewhat
> > > > > > > > > > > inelegant. I think a tagged field would be a bit
> simpler
> > > (and
> > > > > > > > arguably
> > > > > > > > > > less
> > > > > > > > > > > hacky).
> > > > > > > > > > >
> > > > > > > > > > > For 20, we could avoid entering the migration state
> > unless
> > > > the
> > > > > > > whole
> > > > > > > > > > quorum
> > > > > > > > > > > had the field present in their NodeApiVersions. For 21,
> > we
> > > > > could
> > > > > > > > avoid
> > > > > > > > > > > leaving the migration state unless the whole quorum did
> > not
> > > > > have
> > > > > > > the
> > > > > > > > > > field
> > > > > > > > > > > in their NodeApiVersions. Do you think this would be
> > > > > sufficient?
> > > > > > > > > > >
> > > > > > > > > > > 22. Right, we need to write the broker info back to ZK
> > just
> > > > as
> > > > > a
> > > > > > > > > > safeguard
> > > > > > > > > > > against incorrect broker IDs getting registered into
> ZK.
> > I
> > > > was
> > > > > > > > thinking
> > > > > > > > > > > these would be persistent nodes, but it's probably fine
> > to
> > > > make
> > > > > > > them
> > > > > > > > > > > ephemeral and have the active KRaft controller keep
> them
> > up
> > > > to
> > > > > > > date.
> > > > > > > > > > >
> > > > > > > > > > > 23. Right. When the broker comes up as a KRaft broker,
> > its
> > > > old
> > > > > > > > > > > /brokers/ids ZNode will be gone and it will register
> > itself
> > > > > with
> > > > > > > the
> > > > > > > > > > KRaft
> > > > > > > > > > > controller. The controller will know that it is now in
> > > KRaft
> > > > > mode
> > > > > > > and
> > > > > > > > > > will
> > > > > > > > > > > stop sending it the old RPCs.
> > > > > > > > > > >
> > > > > > > > > > > 24. Ok, I'll add these
> > > > > > > > > > >
> > > > > > > > > > > 25. I realize now the "/controller_epoch" node is
> already
> > > > > > > persistent.
> > > > > > > > > It
> > > > > > > > > > > should be sufficient to remove the "/controller" node
> to
> > > > > trigger
> > > > > > an
> > > > > > > > > > > election.
> > > > > > > > > > >
> > > > > > > > > > > 26. Hmm, not sure, but I don't think it uses watches.
> > > > > KafkaServer
> > > > > > > > > > registers
> > > > > > > > > > > the broker info and later loads all brokers in the
> > cluster
> > > to
> > > > > > check
> > > > > > > > > that
> > > > > > > > > > > the endpoints don't conflict. Things that do use
> watches
> > > are
> > > > > > > dynamic
> > > > > > > > > > > configs, ACLs, and some others (listed in the KIP).
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Oct 26, 2022 at 4:26 PM David Arthur <
> > > > > > > > > david.art...@confluent.io>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Luke and Andrew, thanks for taking a look!
> > > > > > > > > > > >
> > > > > > > > > > > > I think the names of the state machine in the KIP
> > aren't
> > > > the
> > > > > > > best.
> > > > > > > > > I'll
> > > > > > > > > > > > try to improve that section in general.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. "MigrationReady" is probably better named
> > > > > "MigratingFromZk"
> > > > > > or
> > > > > > > > > > > > something. It's meant to be the state when the KRaft
> > > > > controller
> > > > > > > is
> > > > > > > > > > > actively
> > > > > > > > > > > > migrating the data out of ZK. This happens after we
> > have
> > > > > > detected
> > > > > > > > > that
> > > > > > > > > > > the
> > > > > > > > > > > > cluster is eligible and before we enter the dual
> write
> > > > mode.
> > > > > > > > > > > >
> > > > > > > > > > > > 2. "MigrationActive" should probably be called
> > > > > > "BrokerMigration".
> > > > > > > > > > > > Transitioning to "MigrationFinished" can happen
> > > > automatically
> > > > > > > when
> > > > > > > > > all
> > > > > > > > > > > the
> > > > > > > > > > > > known ZK brokers have been migrated. Since we don't
> > have
> > > > > > > permanent
> > > > > > > > > > > > registrations of ZK brokers, we can use the partition
> > > > > > assignments
> > > > > > > > as
> > > > > > > > > a
> > > > > > > > > > > > proxy for this.
> > > > > > > > > > > >
> > > > > > > > > > > > 3. A metric for unready brokers makes sense. We can
> > also
> > > > log
> > > > > a
> > > > > > > > > message
> > > > > > > > > > on
> > > > > > > > > > > > the controller when it tries to start the migration
> > > > > > > > > > > >
> > > > > > > > > > > > 4. The MetadataType metric is meant to help see this.
> > > E.g.,
> > > > > > some
> > > > > > > > > > brokers
> > > > > > > > > > > > would have MetadataType=ZK, some would have
> > > > MetadataType=Dual
> > > > > > > > > > > >
> > > > > > > > > > > > 5. On ZK brokers, not having this config set would
> > > prevent
> > > > > the
> > > > > > > new
> > > > > > > > > > broker
> > > > > > > > > > > > registration data from being written to ZK. This
> means
> > > the
> > > > > > KRaft
> > > > > > > > > > > controller
> > > > > > > > > > > > won't send RPCs to it. If a broker has been migrated
> to
> > > > KRaft
> > > > > > > > > already,
> > > > > > > > > > > I'm
> > > > > > > > > > > > not sure there is any harm in removing this config.
> If
> > we
> > > > > > decide
> > > > > > > > that
> > > > > > > > > > we
> > > > > > > > > > > > need to guarantee that KRaft brokers have this config
> > set
> > > > > > during
> > > > > > > > the
> > > > > > > > > > > > migration, we can include it in the broker
> registration
> > > > > that's
> > > > > > > sent
> > > > > > > > > to
> > > > > > > > > > > > KRaft. This would let the controller keep that broker
> > > > fenced.
> > > > > > > > > > > >
> > > > > > > > > > > > 6. Once the KRaft controller takes leadership, the ZK
> > > > > > controller
> > > > > > > > > won't
> > > > > > > > > > be
> > > > > > > > > > > > active any more and will stop reporting the
> > > MigrationState
> > > > > > > metric.
> > > > > > > > > > > >
> > > > > > > > > > > > 7. The "Dual" MetadataType is reported by brokers
> > running
> > > > in
> > > > > > > KRaft
> > > > > > > > > mode
> > > > > > > > > > > > when the migration enable config is set. I think the
> > > > > controller
> > > > > > > > > should
> > > > > > > > > > > also
> > > > > > > > > > > > report this, not just the brokers. I'll clarify in
> the
> > > KIP.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Andrew:
> > > > > > > > > > > >
> > > > > > > > > > > > > How will the code get all the ZooKeeper config?
> Will
> > it
> > > > do
> > > > > > some
> > > > > > > > > sort
> > > > > > > > > > of
> > > > > > > > > > > > scan of the ZooKeeper data store? ... Do we need to
> do
> > > > > anything
> > > > > > > > > special
> > > > > > > > > > > to
> > > > > > > > > > > > make sure we get all data from ZooKeeper?
> > > > > > > > > > > >
> > > > > > > > > > > > With a few exceptions, all data written to ZK by
> Kafka
> > > > > happens
> > > > > > on
> > > > > > > > the
> > > > > > > > > > > > controller (single writer principle). In our
> migration
> > > > code,
> > > > > we
> > > > > > > can
> > > > > > > > > > > > unconditionally update the "/controller" and
> > > > > > "/controller_epoch"
> > > > > > > > > ZNodes
> > > > > > > > > > > and
> > > > > > > > > > > > effectively force the migration component to gain
> > > > leadership
> > > > > of
> > > > > > > the
> > > > > > > > > ZK
> > > > > > > > > > > > controller. Once this happens, we don't expect any
> data
> > > to
> > > > be
> > > > > > > > written
> > > > > > > > > > to
> > > > > > > > > > > > ZK, so we can read it iteratively without worrying
> > about
> > > > any
> > > > > > > > > concurrent
> > > > > > > > > > > > writes.
> > > > > > > > > > > >
> > > > > > > > > > > > As for the linearizable thing, I believe this just
> > means
> > > > that
> > > > > > > reads
> > > > > > > > > may
> > > > > > > > > > > be
> > > > > > > > > > > > served by a quorum follower which has stale data.
> We'll
> > > be
> > > > > > > reading
> > > > > > > > > from
> > > > > > > > > > > ZK
> > > > > > > > > > > > the same way the ZK controller does, so I think we
> will
> > > be
> > > > > fine
> > > > > > > > > > regarding
> > > > > > > > > > > > consistency. If we wanted to be extra careful, we
> could
> > > > add a
> > > > > > > delay
> > > > > > > > > > prior
> > > > > > > > > > > > to iterating through the znodes to give partitioned
> ZK
> > > > > > followers
> > > > > > > a
> > > > > > > > > > chance
> > > > > > > > > > > > to get kicked out of the quorum. I don't think we'll
> > need
> > > > > that
> > > > > > > > though
> > > > > > > > > > > >
> > > > > > > > > > > > > What happens if we commit the transaction then fail
> > > right
> > > > > > after
> > > > > > > > > that
> > > > > > > > > > > and
> > > > > > > > > > > > restart?
> > > > > > > > > > > >
> > > > > > > > > > > > If we commit the "migration" transaction to the
> > metadata
> > > > log,
> > > > > > it
> > > > > > > > > won't
> > > > > > > > > > be
> > > > > > > > > > > > a problem if we failover or restart. We can recover
> our
> > > > state
> > > > > > > based
> > > > > > > > > on
> > > > > > > > > > > the
> > > > > > > > > > > > metadata log and what exists in ZK. If we see a
> > migration
> > > > > > > > transaction
> > > > > > > > > > in
> > > > > > > > > > > > the kraft log, but no "/migration" ZNode, we'll know
> we
> > > > > failed
> > > > > > > > before
> > > > > > > > > > > > writing to ZK. If we see a partial transaction in the
> > > log,
> > > > > then
> > > > > > > we
> > > > > > > > > can
> > > > > > > > > > > > abort it and restart the migration.
> > > > > > > > > > > >
> > > > > > > > > > > > > This sort of leads to me wondering if we
> will/should
> > > > check
> > > > > > the
> > > > > > > > > > metadata
> > > > > > > > > > > > log state before doing the migration? That could also
> > > catch
> > > > > > > > mistakes
> > > > > > > > > > such
> > > > > > > > > > > > as if a KRaft quorum is used that already had some
> > > metadata
> > > > > > > which I
> > > > > > > > > > > assume
> > > > > > > > > > > > we want to prevent.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, we should check that the metadata log is empty
> > > before
> > > > > > > > > attempting a
> > > > > > > > > > > > migration (perhaps excepting the bootstrap metadata
> --
> > > need
> > > > > to
> > > > > > > > think
> > > > > > > > > on
> > > > > > > > > > > > that one still).
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Oct 24, 2022 at 8:57 AM Andrew Grant
> > > > > > > > > > <agr...@confluent.io.invalid
> > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> Hey David,
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> Thanks for the KIP. I had a few small questions.
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> “The ZK data migration will copy the existing ZK
> data
> > > into
> > > > > the
> > > > > > > > KRaft
> > > > > > > > > > > >> metadata log and establish the new KRaft active
> > > controller
> > > > > as
> > > > > > > the
> > > > > > > > > > active
> > > > > > > > > > > >> controller from a ZK perspective.”
> > > > > > > > > > > >>
> > > > > > > > > > > >> How will the code get all the ZooKeeper config? Will
> > it
> > > do
> > > > > > some
> > > > > > > > sort
> > > > > > > > > > of
> > > > > > > > > > > >> scan of the ZooKeeper data store? Also I’m not a
> > > ZooKeeper
> > > > > > > expert
> > > > > > > > > but
> > > > > > > > > > > per
> > > > > > > > > > > >>
> > > > > > >
> https://zookeeper.apache.org/doc/current/zookeeperInternals.html
> > > > > > > > > > “Read
> > > > > > > > > > > >> operations in ZooKeeper are *not linearizable* since
> > > they
> > > > > can
> > > > > > > > return
> > > > > > > > > > > >> potentially stale data.” Do we need to do anything
> > > special
> > > > > to
> > > > > > > make
> > > > > > > > > > sure
> > > > > > > > > > > we
> > > > > > > > > > > >> get all data from ZooKeeper?
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> “For the initial migration, the controller will
> > utilize
> > > > > > KIP-868
> > > > > > > > > > Metadata
> > > > > > > > > > > >> Transactions
> > > > > > > > > > > >> <
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-868+Metadata+Transactions
> > > > > > > > > > > >> >
> > > > > > > > > > > >> to
> > > > > > > > > > > >> write all of the ZK metadata in a single
> transaction.
> > If
> > > > the
> > > > > > > > > > controller
> > > > > > > > > > > >> fails before this transaction is finalized, the next
> > > > active
> > > > > > > > > controller
> > > > > > > > > > > >> will
> > > > > > > > > > > >> abort the transaction and restart the migration
> > > process.”
> > > > > What
> > > > > > > > > happens
> > > > > > > > > > > if
> > > > > > > > > > > >> we commit the transaction then fail right after that
> > and
> > > > > > > restart?
> > > > > > > > > This
> > > > > > > > > > > >> sort
> > > > > > > > > > > >> of leads to me wondering if we will/should check the
> > > > > metadata
> > > > > > > log
> > > > > > > > > > state
> > > > > > > > > > > >> before doing the migration? That could also catch
> > > mistakes
> > > > > > such
> > > > > > > as
> > > > > > > > > if
> > > > > > > > > > a
> > > > > > > > > > > >> KRaft quorum is used that already had some metadata
> > > which
> > > > I
> > > > > > > assume
> > > > > > > > > we
> > > > > > > > > > > want
> > > > > > > > > > > >> to prevent.
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> For the Test Plan section do you think it’s worth
> > > calling
> > > > > out
> > > > > > > > > > > performance
> > > > > > > > > > > >> testing migrations of ZooKeeper deployments that
> have
> > > > > “large”
> > > > > > > > > amounts
> > > > > > > > > > of
> > > > > > > > > > > >> metadata?
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> Thanks,
> > > > > > > > > > > >>
> > > > > > > > > > > >> Andrew
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Mon, Oct 24, 2022 at 3:20 AM Luke Chen <
> > > > > show...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> > Hi David,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Thanks for the good and complicated proposal! :)
> > > > > > > > > > > >> > Some questions:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 1. "MigrationReady" state: The KIP said:
> > > > > > > > > > > >> > The KRaft quorum has been started
> > > > > > > > > > > >> > I don't think we'll automatically enter this state
> > > after
> > > > > > KRaft
> > > > > > > > > > quorum
> > > > > > > > > > > >> > started, do we?
> > > > > > > > > > > >> > I think KRaft active controller should take over
> > > > > leadership
> > > > > > by
> > > > > > > > > > > writing a
> > > > > > > > > > > >> > znode in /controller path before we entering this
> > > sate,
> > > > is
> > > > > > > that
> > > > > > > > > > > correct?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2. "MigrationActive" state: the KIP said:
> > > > > > > > > > > >> > ZK state has been migrated, controller is in
> > > dual-write
> > > > > > mode,
> > > > > > > > > > brokers
> > > > > > > > > > > >> are
> > > > > > > > > > > >> > being restarted in KRaft mode
> > > > > > > > > > > >> > What confuses me is the last part: "brokers are
> > being
> > > > > > > restarted
> > > > > > > > in
> > > > > > > > > > > KRaft
> > > > > > > > > > > >> > mode".
> > > > > > > > > > > >> > How could we detect brokers are being restarted in
> > > KRaft
> > > > > > mode?
> > > > > > > > Old
> > > > > > > > > > ZK
> > > > > > > > > > > >> > broker is removed and new KRaft broker is up
> within
> > N
> > > > > > minutes?
> > > > > > > > > > > >> > I think we don't have to rely on the condition
> > > "brokers
> > > > > are
> > > > > > > > being
> > > > > > > > > > > >> restarted
> > > > > > > > > > > >> > in KRaft mode" to enter this state.
> > > > > > > > > > > >> > "brokers are being restarted in KRaft mode" should
> > be
> > > a
> > > > > > > process
> > > > > > > > > > > happened
> > > > > > > > > > > >> > between "MigrationActive" and "MigrationFinished".
> > > Does
> > > > > that
> > > > > > > > make
> > > > > > > > > > > sense?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 3. When "Zookeeper" mode trying to enter
> > > > > > "MigrationEligible",
> > > > > > > if
> > > > > > > > > > there
> > > > > > > > > > > >> is a
> > > > > > > > > > > >> > cluster with tens of brokers, how could users know
> > why
> > > > > this
> > > > > > > > > cluster
> > > > > > > > > > > >> cannot
> > > > > > > > > > > >> > be in "MigrationEligible" state, yet? Check that
> > znode
> > > > > > > manually
> > > > > > > > > one
> > > > > > > > > > by
> > > > > > > > > > > >> one?
> > > > > > > > > > > >> > Or do we plan to create a tool to help them? Or
> > maybe
> > > > > expose
> > > > > > > the
> > > > > > > > > > > >> "unready
> > > > > > > > > > > >> > ZK brokers" metrics?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 4. Same for "MigrationActive" entering
> > > > "MigrationFinished"
> > > > > > > > state.
> > > > > > > > > > > Since
> > > > > > > > > > > >> > that will be some process for restarting ZK broker
> > > into
> > > > > > KRaft
> > > > > > > > > broker
> > > > > > > > > > > >> one by
> > > > > > > > > > > >> > one, could we expose the "remaining ZK brokers" as
> > > > > metrics?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 5. When users are in
> > > > > > > > > > > >>
> "MigrationReady"/"MigrationActive"/"MigrationFinished"
> > > > > > > > > > > >> > states, and they accidentally change the KRaft
> > > > controller
> > > > > > > > config:
> > > > > > > > > > > >> > "kafka.metadata.migration.enable"
> > > > > > > > > > > >> > to false. What will happen to this cluster? No
> > > > dual-write
> > > > > > for
> > > > > > > > it?
> > > > > > > > > > Will
> > > > > > > > > > > >> we
> > > > > > > > > > > >> > have any protection for it?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 6. About the "MigrationState" metric:
> > > > > > > > > > > >> > The "ZooKeeper" and "MigrationEligible" is
> reported
> > by
> > > > ZK
> > > > > > > > > > controller,
> > > > > > > > > > > >> and
> > > > > > > > > > > >> > the rest of states are reported by KRaft
> controller
> > > -->
> > > > > > makes
> > > > > > > > > sense
> > > > > > > > > > > to
> > > > > > > > > > > >> me.
> > > > > > > > > > > >> > One question from it is, when KRaft controller
> takes
> > > > over
> > > > > > the
> > > > > > > > > > > leadership
> > > > > > > > > > > >> > from ZK controller, what will the "MigrationState"
> > > value
> > > > > in
> > > > > > > old
> > > > > > > > ZK
> > > > > > > > > > > >> > controller? keep in "MigrationEligible" doesn't
> make
> > > > > sense.
> > > > > > > Will
> > > > > > > > > > there
> > > > > > > > > > > >> be a
> > > > > > > > > > > >> > empty or null state?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 7. About the "MetadataType" metric:
> > > > > > > > > > > >> > An enumeration of: ZooKeeper (1), Dual (2), KRaft
> > (3).
> > > > > Each
> > > > > > > > broker
> > > > > > > > > > > >> reports
> > > > > > > > > > > >> > this.
> > > > > > > > > > > >> > I don't know how we could map the migration state
> to
> > > > > these 3
> > > > > > > > > types.
> > > > > > > > > > > >> > What is the metadataType when cluster in
> > > > "MigrationReady"
> > > > > > > state?
> > > > > > > > > > Still
> > > > > > > > > > > >> > Zookeeper?
> > > > > > > > > > > >> > When will brokers enter Dual type?
> > > > > > > > > > > >> > This is unclear in the KIP.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Thank you.
> > > > > > > > > > > >> > Luke
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > On Thu, Oct 20, 2022 at 11:33 PM David Arthur
> > > > > > > > > > > >> > <david.art...@confluent.io.invalid> wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > Igor, thanks for taking a look! Since JBOD in
> > KRaft
> > > is
> > > > > > still
> > > > > > > > > under
> > > > > > > > > > > >> > > discussion and not likely to land before the ZK
> > > > > > migration, I
> > > > > > > > > think
> > > > > > > > > > > >> we'll
> > > > > > > > > > > >> > > need to defer it. For migrating JBOD clusters
> from
> > > ZK
> > > > to
> > > > > > > > KRaft,
> > > > > > > > > > > we'll
> > > > > > > > > > > >> > also
> > > > > > > > > > > >> > > need to address the log dir failure mechanism
> > which
> > > > > > > currently
> > > > > > > > > > uses a
> > > > > > > > > > > >> > > special ZNode written to by the brokers. There
> is
> > an
> > > > old
> > > > > > KIP
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > > > > > > > > > >> > > which proposes a new RPC to move that ZK write
> to
> > > the
> > > > > > > > > controller,
> > > > > > > > > > > but
> > > > > > > > > > > >> I'm
> > > > > > > > > > > >> > > not sure if that's the approach we'll want to
> > take.
> > > I
> > > > > read
> > > > > > > > > through
> > > > > > > > > > > the
> > > > > > > > > > > >> > > discussion over in KIP-858 and it sounds like
> > there
> > > > are
> > > > > > some
> > > > > > > > > good
> > > > > > > > > > > >> ideas
> > > > > > > > > > > >> > > there.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > To answer your question more directly, migrating
> > ZK
> > > > > > clusters
> > > > > > > > > using
> > > > > > > > > > > >> JBOD
> > > > > > > > > > > >> > is
> > > > > > > > > > > >> > > out of scope for this KIP. It might be possible
> to
> > > > stop
> > > > > > > using
> > > > > > > > > JBOD
> > > > > > > > > > > >> using
> > > > > > > > > > > >> > > reassignments, but I'm not sure about that.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > -David
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > On Tue, Oct 18, 2022 at 12:17 PM Igor Soarez <
> > > > > i...@soarez.me
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > Hi David,
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > Thanks for the KIP, this is very exciting!
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > How does JBOD relate to this work? KRaft mode
> > > > doesn't
> > > > > > yet
> > > > > > > > > > support
> > > > > > > > > > > >> > > > configuring brokers with multiple log
> > directories.
> > > > If
> > > > > > the
> > > > > > > > > > brokers
> > > > > > > > > > > in
> > > > > > > > > > > >> > the
> > > > > > > > > > > >> > > > existing cluster are configured with multiple
> > log
> > > > > dirs,
> > > > > > > does
> > > > > > > > > the
> > > > > > > > > > > >> > > migration
> > > > > > > > > > > >> > > > imply that the existing brokers need to drop
> use
> > > of
> > > > > that
> > > > > > > > > > feature?
> > > > > > > > > > > >> Or is
> > > > > > > > > > > >> > > > there some way to upgrade them later?
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > Thanks,
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > --
> > > > > > > > > > > >> > > > Igor
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > On Mon, Oct 17, 2022, at 10:07 PM, David
> Arthur
> > > > wrote:
> > > > > > > > > > > >> > > > > I've updated the KIP with the following
> > changes
> > > > (the
> > > > > > > > > > confluence
> > > > > > > > > > > >> diff
> > > > > > > > > > > >> > is
> > > > > > > > > > > >> > > > not
> > > > > > > > > > > >> > > > > helpful here since i rearranged some things)
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > * Added ZooKeeperBlockingKRaftMillis metric
> > > > > > > > > > > >> > > > > * Added section on new broker registration
> > JSON
> > > > > > > > > > > >> > > > > * Removed section on MigrationCheck RPC
> > > > > > > > > > > >> > > > > * Added change to UpdateMetadataRequest
> > > > > > > > > > > >> > > > > * Added section "Additional ZK Broker
> Configs"
> > > > > > (includes
> > > > > > > > > > configs
> > > > > > > > > > > >> to
> > > > > > > > > > > >> > > > connect
> > > > > > > > > > > >> > > > > to KRaft quorum)
> > > > > > > > > > > >> > > > > * Added section on "Incompatible Brokers"
> > under
> > > > > > Failure
> > > > > > > > > Modes
> > > > > > > > > > > >> > > > > * Clarified many things per this discussion
> > > thread
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > I realized we need the KRaft controller to
> > pick
> > > > the
> > > > > > > > correct
> > > > > > > > > > > >> > > > > "metadata.version" when initializing the
> > > > migration.
> > > > > I
> > > > > > > > > included
> > > > > > > > > > > the
> > > > > > > > > > > >> > IBP
> > > > > > > > > > > >> > > > of a
> > > > > > > > > > > >> > > > > broker in its registration data so the KRaft
> > > > > > controller
> > > > > > > > can
> > > > > > > > > > > verify
> > > > > > > > > > > >> > the
> > > > > > > > > > > >> > > > IBP
> > > > > > > > > > > >> > > > > and pick the correct "metadata.version" when
> > > > > starting
> > > > > > > the
> > > > > > > > > > > >> migration.
> > > > > > > > > > > >> > > > > Otherwise, we could inadvertently downgrade
> > the
> > > > > > > > > > > >> IBP/metadata.version
> > > > > > > > > > > >> > as
> > > > > > > > > > > >> > > > > part of the migration.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > I also added "clusterId" to the broker
> > > > registration
> > > > > > data
> > > > > > > > so
> > > > > > > > > > the
> > > > > > > > > > > >> KRaft
> > > > > > > > > > > >> > > > could
> > > > > > > > > > > >> > > > > verify it.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > -David
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > On Mon, Oct 17, 2022 at 12:14 PM Colin
> McCabe
> > <
> > > > > > > > > > > cmcc...@apache.org
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > >> On Fri, Oct 14, 2022, at 11:39, Jun Rao
> > wrote:
> > > > > > > > > > > >> > > > >> > Hi, Colin,
> > > > > > > > > > > >> > > > >> >
> > > > > > > > > > > >> > > > >> > 10. "That all goes away in the new mode,
> > and
> > > we
> > > > > > just
> > > > > > > > have
> > > > > > > > > > > some
> > > > > > > > > > > >> > code
> > > > > > > > > > > >> > > > which
> > > > > > > > > > > >> > > > >> > analyzes __cluster_metadata and reflects
> it
> > > in
> > > > 1)
> > > > > > > > updates
> > > > > > > > > > to
> > > > > > > > > > > ZK
> > > > > > > > > > > >> > and
> > > > > > > > > > > >> > > 2)
> > > > > > > > > > > >> > > > >> > messages sent out to brokers."
> > > > > > > > > > > >> > > > >> > Hmm, I am not sure it's that simple. Some
> > of
> > > > the
> > > > > > > > > complexity
> > > > > > > > > > > of
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > >> ZK-based
> > > > > > > > > > > >> > > > >> > controller are (1) using the pipelining
> > > > approach
> > > > > to
> > > > > > > > write
> > > > > > > > > > to
> > > > > > > > > > > ZK
> > > > > > > > > > > >> > for
> > > > > > > > > > > >> > > > >> better
> > > > > > > > > > > >> > > > >> > throughput and using conditional writes
> for
> > > > > > > > correctness;
> > > > > > > > > > (2)
> > > > > > > > > > > >> > sending
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > >> > proper LeaderAndIsr and UpdateMetadata
> > > > requests.
> > > > > > For
> > > > > > > > > > example,
> > > > > > > > > > > >> > during
> > > > > > > > > > > >> > > > >> > controller failover, the full metadata
> > needs
> > > to
> > > > > be
> > > > > > > sent
> > > > > > > > > > while
> > > > > > > > > > > >> > during
> > > > > > > > > > > >> > > > >> > individual broker failure, only some of
> the
> > > > > > metadata
> > > > > > > > > needs
> > > > > > > > > > to
> > > > > > > > > > > >> be
> > > > > > > > > > > >> > > > updated.
> > > > > > > > > > > >> > > > >> > The controlled shutdown handling
> sometimes
> > > uses
> > > > > > > > > > > >> StopReplicaRequest
> > > > > > > > > > > >> > > > and
> > > > > > > > > > > >> > > > >> > some other times uses
> LeaderAndIsrRequest.
> > > (3)
> > > > > > > > triggering
> > > > > > > > > > new
> > > > > > > > > > > >> > events
> > > > > > > > > > > >> > > > >> based
> > > > > > > > > > > >> > > > >> > on the responses of LeaderAndIsr (e.g.
> for
> > > > topic
> > > > > > > > > deletion).
> > > > > > > > > > > >> Some
> > > > > > > > > > > >> > of
> > > > > > > > > > > >> > > > those
> > > > > > > > > > > >> > > > >> > complexity could be re-implemented in a
> > more
> > > > > > > efficient
> > > > > > > > > way,
> > > > > > > > > > > >> but we
> > > > > > > > > > > >> > > > need
> > > > > > > > > > > >> > > > >> to
> > > > > > > > > > > >> > > > >> > be really careful not to generate
> > regression.
> > > > > Some
> > > > > > of
> > > > > > > > the
> > > > > > > > > > > other
> > > > > > > > > > > >> > > > >> complexity
> > > > > > > > > > > >> > > > >> > just won't go away. Reimplementing all
> > those
> > > > > logic
> > > > > > > for
> > > > > > > > > the
> > > > > > > > > > 30
> > > > > > > > > > > >> or
> > > > > > > > > > > >> > so
> > > > > > > > > > > >> > > > >> events
> > > > > > > > > > > >> > > > >> > in the ZK-based controller is possible,
> but
> > > > > seems a
> > > > > > > bit
> > > > > > > > > > > >> daunting
> > > > > > > > > > > >> > and
> > > > > > > > > > > >> > > > >> risky.
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> Hi Jun,
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> Thanks for the response.  I agree that
> there
> > is
> > > > > some
> > > > > > > work
> > > > > > > > > > here
> > > > > > > > > > > >> but I
> > > > > > > > > > > >> > > > don't
> > > > > > > > > > > >> > > > >> think it's as bad as it might seem. Most of
> > the
> > > > > code
> > > > > > > for
> > > > > > > > > > > writing
> > > > > > > > > > > >> to
> > > > > > > > > > > >> > ZK
> > > > > > > > > > > >> > > > can
> > > > > > > > > > > >> > > > >> be reused from the old controller. We
> should
> > at
> > > > > least
> > > > > > > > > > evaluate
> > > > > > > > > > > >> using
> > > > > > > > > > > >> > > the
> > > > > > > > > > > >> > > > >> old ControllerChannelManager as well. I
> don't
> > > > know
> > > > > if
> > > > > > > > we'll
> > > > > > > > > > end
> > > > > > > > > > > >> up
> > > > > > > > > > > >> > > > doing it
> > > > > > > > > > > >> > > > >> or not but it's a possibility. (The main
> > reason
> > > > to
> > > > > > not
> > > > > > > do
> > > > > > > > > it
> > > > > > > > > > is
> > > > > > > > > > > >> that
> > > > > > > > > > > >> > > the
> > > > > > > > > > > >> > > > >> response handling will be a bit different)
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> The question of what to do during a
> > controller
> > > > > > failover
> > > > > > > > is
> > > > > > > > > an
> > > > > > > > > > > >> > > > interesting
> > > > > > > > > > > >> > > > >> one. Technically, we should be able to
> > continue
> > > > > > sending
> > > > > > > > > > > >> incremental
> > > > > > > > > > > >> > > > updates
> > > > > > > > > > > >> > > > >> to the legacy nodes, for the same reason we
> > can
> > > > in
> > > > > > > KRaft
> > > > > > > > > > mode.
> > > > > > > > > > > >> > > However,
> > > > > > > > > > > >> > > > >> probably we should just copy what ZK mode
> > does
> > > > and
> > > > > > send
> > > > > > > > > them
> > > > > > > > > > a
> > > > > > > > > > > >> full
> > > > > > > > > > > >> > > > >> metadata update. This will allow us to have
> > an
> > > > easy
> > > > > > way
> > > > > > > > to
> > > > > > > > > > > handle
> > > > > > > > > > > >> > > > >> divergences caused by lost RPCs by changing
> > the
> > > > > > > > controller
> > > > > > > > > > > (just
> > > > > > > > > > > >> as
> > > > > > > > > > > >> > we
> > > > > > > > > > > >> > > > do
> > > > > > > > > > > >> > > > >> in ZK mode, unfortunately). I suppose we
> > should
> > > > > > > document
> > > > > > > > > this
> > > > > > > > > > > in
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > KIP...
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> I agree controlled shutdown is tricky to
> get
> > > just
> > > > > > > right.
> > > > > > > > I
> > > > > > > > > > > >> suppose
> > > > > > > > > > > >> > > this
> > > > > > > > > > > >> > > > is
> > > > > > > > > > > >> > > > >> a case where the RPCs we send out are not
> > > purely
> > > > > > "fire
> > > > > > > > and
> > > > > > > > > > > >> forget";
> > > > > > > > > > > >> > we
> > > > > > > > > > > >> > > > have
> > > > > > > > > > > >> > > > >> to listen for the response. But that can be
> > > done
> > > > in
> > > > > > an
> > > > > > > > > > > >> event-based
> > > > > > > > > > > >> > > way.
> > > > > > > > > > > >> > > > >> Controlled shutdown is also probably the
> last
> > > > thing
> > > > > > > we'll
> > > > > > > > > > > >> implement
> > > > > > > > > > > >> > > > once we
> > > > > > > > > > > >> > > > >> have the basic lift and shift.
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> Topic deletion is another annoying thing. I
> > > > wonder
> > > > > if
> > > > > > > we
> > > > > > > > > > could
> > > > > > > > > > > >> just
> > > > > > > > > > > >> > > > delete
> > > > > > > > > > > >> > > > >> immediately here and skip the complexity of
> > > > > > > implementing
> > > > > > > > > > > >> "deleting
> > > > > > > > > > > >> > > > state."
> > > > > > > > > > > >> > > > >> Topic IDs will exist in IBP 3.4, in both ZK
> > and
> > > > > KRaft
> > > > > > > > mode,
> > > > > > > > > > so
> > > > > > > > > > > it
> > > > > > > > > > > >> > > > should be
> > > > > > > > > > > >> > > > >> OK, right?
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> best,
> > > > > > > > > > > >> > > > >> Colin
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> >
> > > > > > > > > > > >> > > > >> > Thanks,
> > > > > > > > > > > >> > > > >> >
> > > > > > > > > > > >> > > > >> > Jun
> > > > > > > > > > > >> > > > >> >
> > > > > > > > > > > >> > > > >> > On Fri, Oct 14, 2022 at 9:29 AM Colin
> > McCabe
> > > <
> > > > > > > > > > > >> cmcc...@apache.org>
> > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > >> > > > >> >
> > > > > > > > > > > >> > > > >> >> On Thu, Oct 13, 2022, at 11:44, Jun Rao
> > > wrote:
> > > > > > > > > > > >> > > > >> >> > Hi, Colin,
> > > > > > > > > > > >> > > > >> >> >
> > > > > > > > > > > >> > > > >> >> > Thanks for the reply.
> > > > > > > > > > > >> > > > >> >> >
> > > > > > > > > > > >> > > > >> >> > 10. This is a bit on the
> implementation
> > > > side.
> > > > > If
> > > > > > > you
> > > > > > > > > > look
> > > > > > > > > > > at
> > > > > > > > > > > >> > the
> > > > > > > > > > > >> > > > >> existing
> > > > > > > > > > > >> > > > >> >> > ZK-based controller, most of the logic
> > is
> > > > > around
> > > > > > > > > > > >> maintaining an
> > > > > > > > > > > >> > > > >> in-memory
> > > > > > > > > > > >> > > > >> >> > state of all the resources (broker,
> > topic,
> > > > > > > > partition,
> > > > > > > > > > > etc),
> > > > > > > > > > > >> > > > >> >> reading/writing
> > > > > > > > > > > >> > > > >> >> > to ZK, sending LeaderAndIsr and
> > > > UpdateMetadata
> > > > > > > > > requests
> > > > > > > > > > > and
> > > > > > > > > > > >> > > > handling
> > > > > > > > > > > >> > > > >> the
> > > > > > > > > > > >> > > > >> >> > responses to brokers. So we need all
> > that
> > > > > logic
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > dual
> > > > > > > > > > > >> > write
> > > > > > > > > > > >> > > > >> mode.
> > > > > > > > > > > >> > > > >> >> One
> > > > > > > > > > > >> > > > >> >> > option is to duplicate all that logic
> in
> > > > some
> > > > > > new
> > > > > > > > > code.
> > > > > > > > > > > This
> > > > > > > > > > > >> > can
> > > > > > > > > > > >> > > > be a
> > > > > > > > > > > >> > > > >> bit
> > > > > > > > > > > >> > > > >> >> > error prone and makes the code a bit
> > > harder
> > > > to
> > > > > > > > > maintain
> > > > > > > > > > if
> > > > > > > > > > > >> we
> > > > > > > > > > > >> > > need
> > > > > > > > > > > >> > > > to
> > > > > > > > > > > >> > > > >> fix
> > > > > > > > > > > >> > > > >> >> > some critical issues in ZK-based
> > > > controllers.
> > > > > > > > Another
> > > > > > > > > > > >> option is
> > > > > > > > > > > >> > > to
> > > > > > > > > > > >> > > > try
> > > > > > > > > > > >> > > > >> >> > reusing the existing code in the
> > ZK-based
> > > > > > > > controller.
> > > > > > > > > > For
> > > > > > > > > > > >> > > example,
> > > > > > > > > > > >> > > > we
> > > > > > > > > > > >> > > > >> >> could
> > > > > > > > > > > >> > > > >> >> > start the EventManager in the ZK-based
> > > > > > controller,
> > > > > > > > but
> > > > > > > > > > let
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > KRaft
> > > > > > > > > > > >> > > > >> >> > controller ingest new events. This has
> > its
> > > > own
> > > > > > > > > > challenges:
> > > > > > > > > > > >> (1)
> > > > > > > > > > > >> > > the
> > > > > > > > > > > >> > > > >> >> existing
> > > > > > > > > > > >> > > > >> >> > logic only logs ZK failures and
> doesn't
> > > > expose
> > > > > > > them
> > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > > >> > > caller;
> > > > > > > > > > > >> > > > (2)
> > > > > > > > > > > >> > > > >> >> the
> > > > > > > > > > > >> > > > >> >> > existing logic may add new events to
> the
> > > > queue
> > > > > > > > itself
> > > > > > > > > > and
> > > > > > > > > > > we
> > > > > > > > > > > >> > > > probably
> > > > > > > > > > > >> > > > >> >> need
> > > > > > > > > > > >> > > > >> >> > to think through how this is
> coordinated
> > > > with
> > > > > > the
> > > > > > > > > KRaft
> > > > > > > > > > > >> > > controller;
> > > > > > > > > > > >> > > > >> (3)
> > > > > > > > > > > >> > > > >> >> it
> > > > > > > > > > > >> > > > >> >> > registers some ZK listeners
> > unnecessarily
> > > > (may
> > > > > > not
> > > > > > > > be
> > > > > > > > > a
> > > > > > > > > > > big
> > > > > > > > > > > >> > > > concern).
> > > > > > > > > > > >> > > > >> So
> > > > > > > > > > > >> > > > >> >> we
> > > > > > > > > > > >> > > > >> >> > need to get around those issues
> > somehow. I
> > > > am
> > > > > > > > > wondering
> > > > > > > > > > if
> > > > > > > > > > > >> we
> > > > > > > > > > > >> > > have
> > > > > > > > > > > >> > > > >> >> > considered both options and which
> > approach
> > > > we
> > > > > > are
> > > > > > > > > > leaning
> > > > > > > > > > > >> > towards
> > > > > > > > > > > >> > > > for
> > > > > > > > > > > >> > > > >> the
> > > > > > > > > > > >> > > > >> >> > implementation.
> > > > > > > > > > > >> > > > >> >> >
> > > > > > > > > > > >> > > > >> >>
> > > > > > > > > > > >> > > > >> >> Yes, this is a good question. My take is that a big part of the complexity in the
> > > > > > > > > > > >> > > > >> >> old controller code results from the fact that we use ZK as a multi-writer database
> > > > > > > > > > > >> > > > >> >> for propagating information between different components. So in the old controller,
> > > > > > > > > > > >> > > > >> >> every write to ZK needs to be structured as a compare and swap to be fully correct.
> > > > > > > > > > > >> > > > >> >> Every time we get notified about something, it's usually in the form of "this znode
> > > > > > > > > > > >> > > > >> >> changed" which prompts a full reload of part of the data in ZK (which itself has
> > > > > > > > > > > >> > > > >> >> multiple parts, loading, deserializing, reconciling, etc.) That all goes away in the
> > > > > > > > > > > >> > > > >> >> new mode, and we just have some code which analyzes __cluster_metadata and reflects
> > > > > > > > > > > >> > > > >> >> it in 1) updates to ZK and 2) messages sent out to brokers.
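
For readers less used to the ZooKeeper client, the "compare and swap" pattern described above is the versioned-write loop sketched below. This is only an illustration of the pattern using the plain ZooKeeper API; the class and method names are made up and this is not the actual controller code.

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class ZkCasWrite {
        // Re-reads the znode and retries until the conditional write succeeds.
        static void casUpdate(ZooKeeper zk, String path, byte[] newData)
                throws KeeperException, InterruptedException {
            while (true) {
                Stat stat = new Stat();
                zk.getData(path, false, stat);                    // read current version
                try {
                    zk.setData(path, newData, stat.getVersion()); // CAS on that version
                    return;
                } catch (KeeperException.BadVersionException e) {
                    // another writer got in first, so reload and retry
                }
            }
        }
    }
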
> > > > > > > > > > > >> > > > >> >>
> > > > > > > > > > > >> > > > >> >> This is pretty decoupled from the other logic in QuorumController and should be easy
> > > > > > > > > > > >> > > > >> >> to unit test, since the same inputs from the log always produce the same output in
> > > > > > > > > > > >> > > > >> >> ZK. Basically, ZK is write-only for us, we do not read it (with the big exception of
> > > > > > > > > > > >> > > > >> >> broker registration znodes) and I think that will greatly simplify things.
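
As a sketch of the shape described above, the dual-write path can be thought of as a pure function from a batch of __cluster_metadata records to the ZK writes they imply, which is what makes it easy to unit test. Every type name below is hypothetical and only illustrative.

    import java.util.ArrayList;
    import java.util.List;

    interface ZkWrite {}                                   // e.g. "set /brokers/topics/foo"
    record TopicRecord(String name, int partitions) {}     // stand-in for a metadata record
    record SetTopicZnode(String topic, int partitions) implements ZkWrite {}

    class MetadataToZk {
        // Same input records always produce the same ZK writes.
        static List<ZkWrite> toZkWrites(List<TopicRecord> records) {
            List<ZkWrite> writes = new ArrayList<>();
            for (TopicRecord r : records) {
                writes.add(new SetTopicZnode(r.name(), r.partitions()));
            }
            return writes;
        }
    }
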
> > > > > > > > > > > >> > > > >> >>
> > > > > > > > > > > >> > > > >> >> So I think dual-write mode as described here will be substantially simpler than
> > > > > > > > > > > >> > > > >> >> trying to run part or all of the old controller in parallel. I do think we will
> > > > > > > > > > > >> > > > >> >> reuse a bunch of the serialization / deserialization code for znodes and possibly
> > > > > > > > > > > >> > > > >> >> the code for communicating with ZK.
> > > > > > > > > > > >> > > > >> >>
> > > > > > > > > > > >> > > > >> >> best,
> > > > > > > > > > > >> > > > >> >> Colin
> > > > > > > > > > > >> > > > >> >>
> > > > > > > > > > > >> > > > >> >>
> > > > > > > > > > > >> > > > >> >> >
> > > > > > > > > > > >> > > > >> >> > 14. Good point and make sense.
> > > > > > > > > > > >> > > > >> >> >
> > > > > > > > > > > >> > > > >> >> > Thanks,
> > > > > > > > > > > >> > > > >> >> >
> > > > > > > > > > > >> > > > >> >> > Jun
> > > > > > > > > > > >> > > > >> >> >
> > > > > > > > > > > >> > > > >> >> > On Wed, Oct 12, 2022 at 3:27 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > > > > > > > > >> > > > >> >> >
> > > > > > > > > > > >> > > > >> >> >> Hi Jun,
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> Thanks for taking a look. I can answer some questions here because I collaborated
> > > > > > > > > > > >> > > > >> >> >> on this a bit, and David is on vacation for a few days.
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> On Wed, Oct 12, 2022, at 14:41, Jun Rao wrote:
> > > > > > > > > > > >> > > > >> >> >> > Hi, David,
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> > Thanks for the KIP. A few comments below.
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> > 10. It's still not very clear to me how the KRaft controller works in the dual
> > > > > > > > > > > >> > > > >> >> >> > writes mode to KRaft log and ZK when the brokers still run in ZK mode. Does the
> > > > > > > > > > > >> > > > >> >> >> > KRaft controller run a ZK based controller in parallel or do we derive what
> > > > > > > > > > > >> > > > >> >> >> > needs to be written to ZK based on KRaft controller logic?
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> We derive what needs to be written to ZK based on KRaft controller logic.
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> > I am also not sure how the KRaft controller handles broker
> > > > > > > > > > > >> > > > >> >> >> > registration/deregistration, since brokers are still running in ZK mode and
> > > > > > > > > > > >> > > > >> >> >> > are not heartbeating to the KRaft controller.
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> The new controller will listen for broker registrations under /brokers. This is
> > > > > > > > > > > >> > > > >> >> >> the only znode watch that the new controller will do.
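
As an illustration, the single watch mentioned above could be set up roughly as follows with the plain ZooKeeper client (Kafka keeps broker registrations under /brokers/ids); the class is a sketch, not the real implementation.

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    class BrokerRegistrationWatcher implements Watcher {
        private final ZooKeeper zk;

        BrokerRegistrationWatcher(ZooKeeper zk) { this.zk = zk; }

        // Reads the current broker ids and re-arms the one-shot watch.
        List<String> refresh() throws KeeperException, InterruptedException {
            return zk.getChildren("/brokers/ids", this);
        }

        @Override
        public void process(WatchedEvent event) {
            // Children changed => a broker registered or deregistered.
            // A real implementation would hand an event to the controller
            // and then call refresh() to re-register the watch.
        }
    }
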
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> We did consider changing how ZK-based broker registration worked, but it just
> > > > > > > > > > > >> > > > >> >> >> ended up being too much work for not enough gain.
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> > 12. "A new set of nodes will be provisioned to host the controller quorum."
> > > > > > > > > > > >> > > > >> >> >> > I guess we don't support starting the KRaft controller quorum on existing
> > > > > > > > > > > >> > > > >> >> >> > brokers. It would be useful to make that clear.
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> Agreed
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> > 13. "Once the quorum is established and a leader is elected, the controller
> > > > > > > > > > > >> > > > >> >> >> > will check the state of the cluster using the MigrationCheck RPC." How does
> > > > > > > > > > > >> > > > >> >> >> > the quorum controller detect other brokers? Does the controller node need to
> > > > > > > > > > > >> > > > >> >> >> > be configured with ZK connection string? If so, it would be useful to document
> > > > > > > > > > > >> > > > >> >> >> > the additional configs that the quorum controller needs to set.
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> Yes, the controllers monitor ZK for broker registrations, as I mentioned above.
> > > > > > > > > > > >> > > > >> >> >> So they need zk.connect and the other ZK connection configurations.
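
As a rough sketch, a migration-mode controller's configuration would combine the usual KRaft controller settings with the ZooKeeper client settings that ZK brokers already use. The key names below are illustrative; the KIP and the final documentation are authoritative.

    import java.util.Properties;

    class MigrationControllerConfigExample {
        static Properties example() {
            Properties props = new Properties();
            props.put("process.roles", "controller");
            props.put("node.id", "3000");
            props.put("controller.quorum.voters", "3000@controller1:9093");
            props.put("kafka.metadata.migration.enable", "true");        // from this KIP
            props.put("zookeeper.connect", "zk1:2181,zk2:2181/kafka");   // same as the ZK brokers
            return props;
        }
    }
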
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> > 14. "In order to prevent further writes to ZK, the first thing the new KRaft
> > > > > > > > > > > >> > > > >> >> >> > quorum must do is take over leadership of the ZK controller." The ZK controller
> > > > > > > > > > > >> > > > >> >> >> > processes changes to /controller asynchronously. How does the KRaft controller
> > > > > > > > > > > >> > > > >> >> >> > know when the ZK controller has resigned before it can safely copy the ZK data?
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> This should be done through expectedControllerEpochZkVersion, just like in ZK
> > > > > > > > > > > >> > > > >> >> >> mode, right? We should bump this epoch value so that any writes from the old
> > > > > > > > > > > >> > > > >> >> >> controller will not go through. I agree we should spell this out in the KIP.
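
The fencing idea above can be pictured as bundling every ZK write into a multi() that first checks the zkVersion of /controller_epoch, so that a write from a deposed controller fails the whole batch. This is only an illustration of the mechanism, not Kafka's actual code.

    import java.util.Arrays;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooKeeper;

    class FencedZkWriter {
        private final ZooKeeper zk;
        private final int expectedControllerEpochZkVersion;

        FencedZkWriter(ZooKeeper zk, int expectedControllerEpochZkVersion) {
            this.zk = zk;
            this.expectedControllerEpochZkVersion = expectedControllerEpochZkVersion;
        }

        void write(String path, byte[] data)
                throws KeeperException, InterruptedException {
            zk.multi(Arrays.asList(
                // fails if another controller has bumped the epoch znode
                Op.check("/controller_epoch", expectedControllerEpochZkVersion),
                Op.setData(path, data, -1)));              // -1 = any data version
        }
    }
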
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> > 15. We have the following sentences. One says ControllerId is a random KRaft
> > > > > > > > > > > >> > > > >> >> >> > broker and the other says it's the active controller. Which one is correct?
> > > > > > > > > > > >> > > > >> >> >> > "UpdateMetadata: for certain metadata changes, the KRaft controller will need
> > > > > > > > > > > >> > > > >> >> >> > to send UpdateMetadataRequests to the ZK brokers. For the “ControllerId” field
> > > > > > > > > > > >> > > > >> >> >> > in this request, the controller should specify a random KRaft broker."
> > > > > > > > > > > >> > > > >> >> >> > "In the UpdateMetadataRequest sent by the KRaft controller to the ZK brokers,
> > > > > > > > > > > >> > > > >> >> >> > the ControllerId will point to the active controller which will be used for
> > > > > > > > > > > >> > > > >> >> >> > the inter-broker requests."
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> Yeah, this seems like an error to me as well. A random value is not really
> > > > > > > > > > > >> > > > >> >> >> useful. Plus the text here is self-contradictory, as you pointed out.
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> I suspect what we should do here is add a new field, KRaftControllerId, and
> > > > > > > > > > > >> > > > >> >> >> populate it with the real controller ID, and leave the old controllerId field
> > > > > > > > > > > >> > > > >> >> >> as -1. A ZK-based broker that sees this can then consult its
> > > > > > > > > > > >> > > > >> >> >> controller.quorum.voters configuration to see where it should send
> > > > > > > > > > > >> > > > >> >> >> controller-bound RPCs. That (static) configuration lets us map between
> > > > > > > > > > > >> > > > >> >> >> controller ID and host:port.
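
For illustration, a ZK-mode broker could resolve the proposed KRaftControllerId against its static controller.quorum.voters value (entries of the form "id@host:port") roughly like this; the helper is hypothetical.

    import java.util.HashMap;
    import java.util.Map;

    class QuorumVoterResolver {
        // Returns "host:port" for the given controller id, or null if unknown.
        static String resolve(String voters, int kraftControllerId) {
            Map<Integer, String> byId = new HashMap<>();
            for (String voter : voters.split(",")) {
                String[] parts = voter.trim().split("@", 2);   // "id@host:port"
                byId.put(Integer.parseInt(parts[0]), parts[1]);
            }
            return byId.get(kraftControllerId);
        }
    }

For example, resolve("3000@controller1:9093,3001@controller2:9093", 3001) would return "controller2:9093".
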
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> We should still keep our existing epoch logic for deciding when
> > > > > > > > > > > >> > > > >> >> >> UpdateMetadataRequest / LeaderAndIsrRequests are stale, with the caveat that
> > > > > > > > > > > >> > > > >> >> >> any kraft-based epoch should be treated as greater than any ZK-based epoch.
> > > > > > > > > > > >> > > > >> >> >> After all, the kraft epoch is coming from the epoch of __cluster_metadata,
> > > > > > > > > > > >> > > > >> >> >> whereas the ZK epoch comes from ZK.
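
One illustrative way to encode the staleness rule above, where any KRaft-based epoch is treated as newer than any ZK-based epoch, is sketched below; this is not the actual implementation.

    class ControllerEpochCheck {
        record Epoch(boolean fromKRaft, int value) {}

        // true if the incoming request should be accepted as at least as new
        static boolean accept(Epoch current, Epoch incoming) {
            if (incoming.fromKRaft() != current.fromKRaft()) {
                return incoming.fromKRaft();   // a KRaft epoch always beats a ZK epoch
            }
            return incoming.value() >= current.value();
        }
    }
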
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> > 16. "Additionally, the controller must specify if a broker in “LiveBrokers”
> > > > > > > > > > > >> > > > >> >> >> > is KRaft or ZK." Does that require any protocol changes to UpdateMetadata?
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> Yeah, I am also curious why we need to care whether brokers are ZK or KRaft in
> > > > > > > > > > > >> > > > >> >> >> UpdateMetadataRequest. We don't reveal this to clients, so can we just leave
> > > > > > > > > > > >> > > > >> >> >> this out?
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> best,
> > > > > > > > > > > >> > > > >> >> >> Colin
> > > > > > > > > > > >> > > > >> >> >>
> > > > > > > > > > > >> > > > >> >> >> > Thanks,
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> > Jun
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> > On Wed, Oct 5, 2022 at 10:07 AM Mickael Maison <mickael.mai...@gmail.com> wrote:
> > > > > > > > > > > >> > > > >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> >> Hi David,
> > > > > > > > > > > >> > > > >> >> >> >>
> > > > > > > > > > > >> > > > >> >> >> >> Thanks for starting this important KIP.
> > > > > > > > > > > >> > > > >> >> >> >>
> > > > > > > > > > > >> > > > >> >> >> >> I've just taken a quick look so far but I've got a couple of initial questions:
> > > > > > > > > > > >> > > > >> >> >> >>
> > > > > > > > > > > >> > > > >> >> >> >> 1) What happens if a non KRaft compatible broker (or with
> > > > > > > > > > > >> > > > >> >> >> >> kafka.metadata.migration.enable set to false) joins the cluster after the
> > > > > > > > > > > >> > > > >> >> >> >> migration is triggered?
> > > > > > > > > > > >> > > > >> >> >> >>
> > > > > > > > > > > >> > > > >> >> >> >> 2) In the Failure Modes section you mention a scenario where a write to ZK
> > > > > > > > > > > >> > > > >> >> >> >> fails. What happens when the divergence limit is reached? Is this a fatal
> > > > > > > > > > > >> > > > >> >> >> >> condition? How much divergence should we allow?
> > > > > > > > > > > >> > > > >> >> >> >>
> > > > > > > > > > > >> > > > >> >> >> >> Thanks,
> > > > > > > > > > > >> > > > >> >> >> >> Mickael
> > > > > > > > > > > >> > > > >> >> >> >>
> > > > > > > > > > > >> > > > >> >> >> >> On Wed, Oct 5, 2022 at 12:20 AM David Arthur <mum...@gmail.com> wrote:
> > > > > > > > > > > >> > > > >> >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> >> > Hey folks, I wanted to get the ball rolling on the discussion for the
> > > > > > > > > > > >> > > > >> >> >> >> > ZooKeeper migration KIP. This KIP details how we plan to do an online
> > > > > > > > > > > >> > > > >> >> >> >> > migration of metadata from ZooKeeper to KRaft as well as a rolling
> > > > > > > > > > > >> > > > >> >> >> >> > upgrade of brokers to KRaft mode.
> > > > > > > > > > > >> > > > >> >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> >> > The general idea is to keep KRaft and ZooKeeper in sync during the
> > > > > > > > > > > >> > > > >> >> >> >> > migration, so both types of brokers can exist simultaneously. Then, once
> > > > > > > > > > > >> > > > >> >> >> >> > everything is migrated and updated, we can turn off ZooKeeper writes.
> > > > > > > > > > > >> > > > >> >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> >> > This is a pretty complex KIP, so please take a look :)
> > > > > > > > > > > >> > > > >> >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration
> > > > > > > > > > > >> > > > >> >> >> >> >
> > > > > > > > > > > >> > > > >> >> >> >> > Thanks!
> > > > > > > > > > > >> > > > >> >> >> >> > David


-- 
-David
