Hi Justine, Luke, and others,

I believe a 3.8 version would make sense, and I would say KIP-853 should be
part of it as well.

Best,

On Wed, Dec 20, 2023 at 4:11 PM Justine Olshan <jols...@confluent.io.invalid>
wrote:

> Hey Luke,
>
> I think your point is valid. This is another good reason to have a 3.8
> release.
> Would you say that implementing KIP-966 in 3.8 would be an acceptable way
> to move forward?
>
> Thanks,
> Justine
>
>
> On Tue, Dec 19, 2023 at 4:35 AM Luke Chen <show...@gmail.com> wrote:
>
> > Hi Justine,
> >
> > Thanks for your reply.
> >
> > > I think that for folks that want to prioritize availability over
> > durability, the aggressive recovery strategy from KIP-966 should be
> > preferable to the old unclean leader election configuration.
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas#KIP966:EligibleLeaderReplicas-Uncleanrecovery
> >
> > Yes, I'm aware that we're going to implement the new way of leader
> > election in KIP-966.
> > But obviously, KIP-966 is not included in v3.7.0.
> > What I'm worried about is users who prioritize availability over
> > durability and enable unclean leader election in ZK mode.
> > Once they migrate to KRaft, there will be an availability impact
> > whenever unclean leader election is needed.
> > And as you said, they can run unclean leader election via the CLI (see
> > the sketch below), but by then availability is already impacted, which
> > might be unacceptable in some cases.
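> >
> > For reference, a hedged sketch of that manual CLI step (the bootstrap
> > server, topic, and partition below are placeholders):
> >
> >   bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
> >     --election-type UNCLEAN --topic my-topic --partition 0
> >
> > But by the time an operator runs this, the partition has already been
> > offline for a while.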
> >
> > IMO, we should prioritize this missing feature and include it in a 3.x
> > release.
> > Including it in a 3.x release means users can migrate to KRaft in
> > dual-write mode, and run it for a while to make sure everything works
> > fine, before they decide to upgrade to 4.0.
> >
> > Does that make sense?
> >
> > Thanks.
> > Luke
> >
> > On Tue, Dec 19, 2023 at 12:15 AM Justine Olshan
> > <jols...@confluent.io.invalid> wrote:
> >
> > > Hey Luke --
> > >
> > > There were some previous discussions on the mailing list about this,
> > > but it looks like we never filed a ticket:
> > > https://lists.apache.org/thread/sqsssos1d9whgmo92vdn81n9r5woy1wk
> > >
> > > When I asked some of the folks who worked on KRaft about this, they
> > > communicated to me that it was intentional to make unclean leader
> > > election a manual action.
> > >
> > > I think that for folks that want to prioritize availability over
> > > durability, the aggressive recovery strategy from KIP-966 should be
> > > preferable to the old unclean leader election configuration.
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas#KIP966:EligibleLeaderReplicas-Uncleanrecovery
> > >
> > > Let me know if we don't think this is sufficient.
> > >
> > > Justine
> > >
> > > On Mon, Dec 18, 2023 at 4:39 AM Luke Chen <show...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We found that currently (on the latest trunk branch), unclean leader
> > > > election is not supported in KRaft mode.
> > > > That is, when users enable `unclean.leader.election.enable` in KRaft
> > > > mode, the config doesn't take effect and the cluster behaves as if
> > > > `unclean.leader.election.enable` were disabled.
> > > > KAFKA-12670 <https://issues.apache.org/jira/browse/KAFKA-12670> was
> > > > opened for this and is still not resolved.
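> > > >
> > > > For concreteness, a hedged sketch of the config in question (the
> > > > bootstrap server and topic name are placeholders):
> > > >
> > > >   bin/kafka-configs.sh --bootstrap-server localhost:9092 \
> > > >     --entity-type topics --entity-name my-topic --alter \
> > > >     --add-config unclean.leader.election.enable=true
> > > >
> > > > In ZK mode this allows an out-of-sync replica to be elected leader;
> > > > on current trunk in KRaft mode the config is accepted but has no
> > > > effect.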
> > > >
> > > > I think this is a regression in KRaft mode, and we should complete
> > > > this missing feature in a 3.x release instead of adding it in 4.0.
> > > > Does anyone know what the status of this issue is?
> > > >
> > > > Thanks.
> > > > Luke
> > > >
> > > >
> > > >
> > > > On Mon, Nov 27, 2023 at 4:38 PM Colin McCabe <cmcc...@apache.org>
> > wrote:
> > > >
> > > > > On Fri, Nov 24, 2023, at 03:47, Anton Agestam wrote:
> > > > > > In your last message you wrote:
> > > > > >
> > > > > > > But, on the KRaft side, I still maintain that nothing is
> missing
> > > > except
> > > > > > > JBOD, which we already have a plan for.
> > > > > >
> > > > > > But earlier in this thread you mentioned an issue with "torn
> > writes",
> > > > > > possibly missing tests, as well as the fact that the recommended
> > > method
> > > > > of
> > > > > > replacing controller nodes is undocumented. Would you mind
> > clarifying
> > > > > what
> > > > > > your stance is on these three issues? Do you think that they are
> > > > > important
> > > > > > enablers of upgrade paths or not?
> > > > >
> > > > > Hi Anton,
> > > > >
> > > > > There shouldn't be anything blocking controller disk replacement
> now.
> > > > From
> > > > > memory (not looking at the code now), we do log recovery on our
> > single
> > > > log
> > > > > directory every time we start the controller, so it should handle
> > > partial
> > > > > records there. I do agree that a test would be good, and some
> > > > > documentation. I'll probably take a look at that this week if I get
> > > some
> > > > > time.
> > > > >
> > > > > > > Well, the line was drawn in KIP-833. If we redraw it, what is
> to
> > > stop
> > > > > us
> > > > > > > from redrawing it again and again?
> > > > > >
> > > > > > I'm fairly new to the Kafka community, so please forgive me if
> > > > > > I'm missing things that have been said in earlier discussions.
> > > > > > But reading up on that KIP, I see it has language like "Note:
> > > > > > this timeline is very rough and subject to change." in the
> > > > > > versions section, while it also says "As outlined above, we
> > > > > > expect to close these gaps soon" in relation to the outstanding
> > > > > > features. From my perspective this doesn't really look like an
> > > > > > agreement that dynamic quorum membership changes shall not be a
> > > > > > blocker for 4.0.
> > > > >
> > > > > The timeline was rough because we wrote that in 2022, trying to
> look
> > > > > forward multiple releases. The gaps that were discussed have all
> been
> > > > > closed -- except for JBOD, which we are working on this quarter.
> > > > >
> > > > > The set of features needed for 4.0 is very clearly described in
> > > KIP-833.
> > > > > There's no uncertainty on that point.
> > > > >
> > > > > >
> > > > > > To answer the specific question you pose here, "what is to stop
> us
> > > from
> > > > > > redrawing it again and again?", wouldn't the suggestion of
> parallel
> > > > work
> > > > > > lanes brought up by Josep address this concern?
> > > > > >
> > > > >
> > > > > It's very important not to fragment the community by supporting
> > > multiple
> > > > > long-running branch lines. At the end of the day, once branch 3's
> > time
> > > > has
> > > > > come, it needs to fade away, just like JDK 6 support or the old
> Scala
> > > > > producer.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > >
> > > > > > BR,
> > > > > > Anton
> > > > > >
> > > > > > On Thu, Nov 23, 2023 at 05:48 Colin McCabe <cmcc...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > >> On Tue, Nov 21, 2023, at 19:30, Luke Chen wrote:
> > > > > >> > Yes, KIP-853 and disk failure support are both very important
> > > > > >> > missing features. As for disk failure support, I don't think
> > > > > >> > this is a "good-to-have" feature; it should be a "must-have"
> > > > > >> > IMO. We can't announce the 4.0 release without a good solution
> > > > > >> > for disk failure in KRaft.
> > > > > >>
> > > > > >> Hi Luke,
> > > > > >>
> > > > > >> Thanks for the reply.
> > > > > >>
> > > > > >> Controller disk failure support is not missing from KRaft. I
> > > described
> > > > > how
> > > > > >> to handle controller disk failures earlier in this thread.
> > > > > >>
> > > > > >> I should note here that the broker in ZooKeeper mode also
> requires
> > > > > manual
> > > > > >> handling of disk failures. Restarting a broker with the same ID,
> > but
> > > > an
> > > > > >> empty disk, breaks the invariants of replication when in ZK
> mode.
> > > > > Consider:
> > > > > >>
> > > > > >> 1. Broker 1 goes down. A ZK state change notification for
> /brokers
> > > > fires
> > > > > >> and goes on the controller queue.
> > > > > >>
> > > > > >> 2. Broker 1 comes back up with an empty disk.
> > > > > >>
> > > > > >> 3. The controller processes the zk state change notification for
> > > > > /brokers.
> > > > > >> Since broker 1 is up no action is taken.
> > > > > >>
> > > > > >> 4. Now broker 1 is in the ISR for any partitions it was
> > > > > >> previously in, but has no data. If it is or becomes leader for
> > > > > >> any partitions, irreversible data loss will occur.
> > > > > >>
> > > > > >> This problem is more than theoretical. We at Confluent have
> > observed
> > > > it
> > > > > in
> > > > > >> production and put in place special workarounds for the ZK
> > clusters
> > > we
> > > > > >> still have.
> > > > > >>
> > > > > >> KRaft has never had this problem because brokers are removed
> from
> > > ISRs
> > > > > >> when a new incarnation of the broker registers.
> > > > > >>
> > > > > >> So perhaps ZK mode is not ready for production for Aiven, since
> > > > > >> disk failures do in fact require special handling there (and/or
> > > > > >> bringing up new nodes with empty disks, which seems to be their
> > > > > >> main concern)?
> > > > > >>
> > > > > >> >
> > > > > >> > It’s also worth thinking about how Apache Kafka users who
> > > > > >> > depend on JBOD might look at the risks of not having a 3.8
> > > > > >> > release. JBOD support on KRaft is planned to be added in 3.7,
> > > > > >> > and is still in progress so far. So it’s hard to say whether
> > > > > >> > it’s a blocker or not. But in practice, even if the feature
> > > > > >> > makes it into 3.7 in time, a lot of new code for this feature
> > > > > >> > is unlikely to be entirely bug free. We need to maintain the
> > > > > >> > confidence of those users, and forcing them to migrate through
> > > > > >> > 3.7, where this new code is hardly battle-tested, doesn’t
> > > > > >> > appear to do that.
> > > > > >> >
> > > > > >>
> > > > > >> As Ismael said, if there are JBOD bugs in 3.7, we will do
> > follow-on
> > > > > point
> > > > > >> releases to address them.
> > > > > >>
> > > > > >> > Our goal for 4.0 should be that all the “main” features in
> > > > > >> > KRaft are in a production-ready state. To reach that goal, I
> > > > > >> > think having one more release makes sense. We can have
> > > > > >> > different opinions about what the “main features” in KRaft
> > > > > >> > are, but we should all agree that JBOD is one of them.
> > > > > >>
> > > > > >> The current plan is for JBOD to be production-ready in the 3.7
> > > branch.
> > > > > >>
> > > > > >> The other features of KRaft have been in a production-ready
> > > > > >> state since the 3.3 release. (Well, except for delegation tokens
> > > > > >> and SCRAM, which were implemented in 3.5 and 3.6.)
> > > > > >>
> > > > > >> > I totally agree with you that we could keep delaying the 4.0
> > > > > >> > release forever. I'd also like to draw a line. So, in my
> > > > > >> > opinion, the 3.8 release is the line: no 3.9 or 3.10 releases
> > > > > >> > after that. If this is the decision, will your concern about
> > > > > >> > this infinite loop disappear?
> > > > > >>
> > > > > >> Well, the line was drawn in KIP-833. If we redraw it, what is to
> > > stop
> > > > us
> > > > > >> from redrawing it again and again?
> > > > > >>
> > > > > >> >
> > > > > >> > Final note: Speaking of the missing features, I can always
> > > cooperate
> > > > > with
> > > > > >> > you and all other community contributors to make them happen,
> > like
> > > > we
> > > > > >> have
> > > > > >> > discussed earlier. Just let me know.
> > > > > >> >
> > > > > >>
> > > > > >> Thanks, Luke. I appreciate the offer.
> > > > > >>
> > > > > >> But, on the KRaft side, I still maintain that nothing is missing
> > > > except
> > > > > >> JBOD, which we already have a plan for.
> > > > > >>
> > > > > >> best,
> > > > > >> Colin
> > > > > >>
> > > > > >>
> > > > > >> > Thank you.
> > > > > >> > Luke
> > > > > >> >
> > > > > >> > On Wed, Nov 22, 2023 at 2:54 AM Colin McCabe <
> > cmcc...@apache.org>
> > > > > wrote:
> > > > > >> >
> > > > > >> >> On Tue, Nov 21, 2023, at 03:47, Josep Prat wrote:
> > > > > >> >> > Hi Colin,
> > > > > >> >> >
> > > > > >> >> > I think it's great that Confluent runs KRaft clusters in
> > > > > >> >> > production, and it means that it is production ready for
> > > > > >> >> > Confluent and its users. But luckily for Kafka, the
> > > > > >> >> > community is bigger than this (self-managed in the cloud or
> > > > > >> >> > on-prem, or customers of other SaaS companies).
> > > > > >> >>
> > > > > >> >> Hi Josep,
> > > > > >> >>
> > > > > >> >> Confluent is not the only company using or developing KRaft.
> > Most
> > > > of
> > > > > the
> > > > > >> >> big organizations developing Kafka are involved. I mentioned
> > > > > Confluent's
> > > > > >> >> deployments because I wanted to be clear that KRaft mode is
> not
> > > > > >> >> experimental or new. Talking about software in production is
> a
> > > good
> > > > > way
> > > > > >> to
> > > > > >> >> clear up these misconceptions.
> > > > > >> >>
> > > > > >> >> Indeed, KRaft mode is many years old. It started around 2020,
> > > > > >> >> and became production-ready in AK 3.3 in 2022. ZK mode was
> > > > > >> >> deprecated in AK 3.5, which was released in June 2023. If we
> > > > > >> >> release AK 4.0 around April (or maybe a month or two later),
> > > > > >> >> then that will be almost a full year between deprecation and
> > > > > >> >> removal of ZK mode. We've talked about this a lot, in KIPs, in
> > > > > >> >> Apache blog posts, at conferences, and so forth.
> > > > > >> >>
> > > > > >> >> > We've heard at least from 1 SaaS company, Aiven
> (disclaimer,
> > it
> > > > is
> > > > > my
> > > > > >> >> > employer) where the current feature set makes it not
> trivial
> > to
> > > > > >> >> > migrate. This same issue might happen not only at Aiven but
> > > with
> > > > > any
> > > > > >> >> > user of Kafka who uses immutable infrastructure.
> > > > > >> >>
> > > > > >> >> Can you discuss why you feel it is "not trivial to migrate"?
> > From
> > > > the
> > > > > >> >> discussion above, the main gap is that we should improve the
> > > > > >> documentation
> > > > > >> >> for handling failed disks.
> > > > > >> >>
> > > > > >> >> > Another case is users that have hundreds (or more) of
> > > > > >> >> > clusters and more than 100k nodes, who experience node
> > > > > >> >> > failures multiple times during a single day. In this
> > > > > >> >> > situation, not having KIP-853 makes these power users unable
> > > > > >> >> > to join the game, as introducing a new error-prone manual
> > > > > >> >> > (or to-be-automated) operation is usually a huge no-go.
> > > > > >> >>
> > > > > >> >> We have thousands of KRaft clusters in production and haven't
> > > seen
> > > > > these
> > > > > >> >> problems, as I described above.
> > > > > >> >>
> > > > > >> >> best,
> > > > > >> >> Colin
> > > > > >> >>
> > > > > >> >> >
> > > > > >> >> > But I hear the concerns about delaying 4.0 for another 3 to
> > > > > >> >> > 4 months. Would it help if we aimed at shortening the
> > > > > >> >> > timeline for 3.8.0 and started with 4.0.0 a bit earlier?
> > > > > >> >> > Maybe we could work on 3.8.0 almost in parallel with 4.0.0:
> > > > > >> >> > - Start with 3.8.0 release process
> > > > > >> >> > - After a small time (let's say a week) create the release
> > > branch
> > > > > >> >> > - Start with 4.0.0 release process as usual
> > > > > >> >> > - Cherry pick KRaft related issues to 3.8.0
> > > > > >> >> > - Release 3.8.0
> > > > > >> >> > I suspect 4.0.0 will need a bit more time than usual to
> > ensure
> > > > the
> > > > > >> code
> > > > > >> >> > is cleaned up of deprecated classes and methods on top of
> the
> > > > usual
> > > > > >> >> > work we have. For this reason I think there would be enough
> > > time
> > > > > >> >> > between releasing 3.8.0 and 4.0.0.
> > > > > >> >> >
> > > > > >> >> > What do you all think?
> > > > > >> >> >
> > > > > >> >> > Best,
> > > > > >> >> > Josep Prat
> > > > > >> >> >
> > > > > >> >> > On 2023/11/20 20:03:18 Colin McCabe wrote:
> > > > > >> >> >> Hi Josep,
> > > > > >> >> >>
> > > > > >> >> >> I think there is some confusion here. Quorum
> reconfiguration
> > > is
> > > > > not
> > > > > >> >> needed for KRaft to become production ready. Confluent runs
> > > > > thousands of
> > > > > >> >> KRaft clusters without quorum reconfiguration, and has for
> > years.
> > > > > While
> > > > > >> >> dynamic quorum reconfiguration is a nice feature, it doesn't
> > > block
> > > > > >> >> anything: not migration, not deployment. As best as I
> > understand
> > > > it,
> > > > > the
> > > > > >> >> use-case Aiven has isn't even reconfiguration per se, just
> > > wiping a
> > > > > >> disk.
> > > > > >> >> There are ways to handle this -- I discussed some earlier in
> > the
> > > > > >> thread. I
> > > > > >> >> think it would be productive to continue that discussion --
> > > > > especially
> > > > > >> the
> > > > > >> >> part around documentation and testing of these cases.
> > > > > >> >> >>
> > > > > >> >> >> A lot of people have done a lot of work to get Kafka 4.0
> > > ready.
> > > > I
> > > > > >> would
> > > > > >> >> not want to delay that because we want an additional feature.
> > And
> > > > we
> > > > > >> will
> > > > > >> >> always want additional features. So I am concerned we will
> end
> > up
> > > > in
> > > > > an
> > > > > >> >> infinite loop of people asking for "just one more feature"
> > before
> > > > > they
> > > > > >> >> migrate.
> > > > > >> >> >>
> > > > > >> >> >> best,
> > > > > >> >> >> Colin
> > > > > >> >> >>
> > > > > >> >> >>
> > > > > >> >> >> On Mon, Nov 20, 2023, at 04:15, Josep Prat wrote:
> > > > > >> >> >> > Hi all,
> > > > > >> >> >> >
> > > > > >> >> >> > I wanted to share my opinion regarding this topic. I know
> > > > > >> >> >> > some discussions happened some time ago (over a year), but
> > > > > >> >> >> > I believe it's wise to reflect and re-evaluate whether
> > > > > >> >> >> > those decisions are still valid.
> > > > > >> >> >> > KRaft, as of Kafka 3.6.x and 3.7.x, does not yet have
> > > > > >> >> >> > feature parity with ZooKeeper. By dropping ZooKeeper
> > > > > >> >> >> > altogether before achieving such parity, we are opening
> > > > > >> >> >> > the door to leaving a chunk of Apache Kafka users without
> > > > > >> >> >> > an easy way to upgrade to 4.0.
> > > > > >> >> >> > In favor of making upgrades as smooth as possible, I
> > > > > >> >> >> > propose having a Kafka version where KIP-853 is merged and
> > > > > >> >> >> > ZooKeeper is still supported. This will enable community
> > > > > >> >> >> > members who can't migrate to KRaft yet to do so in a safe
> > > > > >> >> >> > way (rolling back if something goes wrong). Additionally,
> > > > > >> >> >> > this will give us more confidence in KRaft successfully
> > > > > >> >> >> > replacing ZooKeeper without any big problems, by
> > > > > >> >> >> > discovering and fixing bugs or by confirming that KRaft
> > > > > >> >> >> > works as expected.
> > > > > >> >> >> > For this reason I strongly believe we should have a 3.8.x
> > > > > >> >> >> > version before 4.0.x.
> > > > > >> >> >> >
> > > > > >> >> >> > What do others think in this regard?
> > > > > >> >> >> >
> > > > > >> >> >> > Best,
> > > > > >> >> >> >
> > > > > >> >> >> > On 2023/11/14 20:47:10 Colin McCabe wrote:
> > > > > >> >> >> >> On Tue, Nov 14, 2023, at 04:37, Anton Agestam wrote:
> > > > > >> >> >> >> > Hi Colin,
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > Thank you for your thoughtful and comprehensive
> > response.
> > > > > >> >> >> >> >
> > > > > >> >> >> >> >> KIP-853 is not a blocker for either 3.7 or 4.0. We
> > > > discussed
> > > > > >> this
> > > > > >> >> in
> > > > > >> >> >> >> >> several KIPs that happened this year and last year.
> > The
> > > > most
> > > > > >> >> notable was
> > > > > >> >> >> >> >> probably KIP-866, which was approved in May 2022.
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > I understand this is the case, I'm raising my concern
> > > > > because I
> > > > > >> was
> > > > > >> >> >> >> > foreseeing some major pain points as a consequence of
> > > this
> > > > > >> >> decision. Just
> > > > > >> >> >> >> > to make it clear though: I am not asking for anyone
> to
> > do
> > > > > work
> > > > > >> for
> > > > > >> >> me, and
> > > > > >> >> >> >> > I understand the limitations of resources available
> to
> > > > > implement
> > > > > >> >> features.
> > > > > >> >> >> >> > What I was asking is rather to consider the
> > implications
> > > of
> > > > > >> >> _removing_
> > > > > >> >> >> >> > features before there exists a replacement for them.
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > I understand that the timeframe for 3.7 isn't
> feasible,
> > > and
> > > > > >> >> because of that
> > > > > >> >> >> >> > I think what I was asking is rather: can we make sure
> > > that
> > > > > there
> > > > > >> >> are more
> > > > > >> >> >> >> > 3.x releases until controller quorum online resizing
> is
> > > > > >> >> implemented?
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > From your response, I gather that your stance is that
> > > it's
> > > > > >> >> important to
> > > > > >> >> >> >> > drop ZK support sooner rather than later and that the
> > > > > necessary
> > > > > >> >> pieces for
> > > > > >> >> >> >> > doing so are already in place.
> > > > > >> >> >> >>
> > > > > >> >> >> >> Hi Anton,
> > > > > >> >> >> >>
> > > > > >> >> >> >> Yes. I'm basically just repeating what we agreed upon
> in
> > > 2022
> > > > > as
> > > > > >> >> part of KIP-833.
> > > > > >> >> >> >>
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > ---
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > I want to make sure I've understood your suggested
> > > > > >> >> >> >> > sequence for controller node replacement. I hope the
> > > > > >> >> >> >> > mentions of Kubernetes are examples of how to carry
> > > > > >> >> >> >> > things out, rather than saying "this is only supported
> > > > > >> >> >> >> > on Kubernetes"?
> > > > > >> >> >> >>
> > > > > >> >> >> >> Apache Kafka is supported in lots of environments,
> > > including
> > > > > >> non-k8s
> > > > > >> >> ones. I was just pointing out that using k8s means that you
> > > control
> > > > > your
> > > > > >> >> own DNS resolution, which simplifies matters. If you don't
> > > control
> > > > > DNS
> > > > > >> >> there are some extra steps for changing the quorum voters.
> > > > > >> >> >> >>
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > Given we have three existing nodes as such:
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > - a.local -> 192.168.0.100
> > > > > >> >> >> >> > - b.local -> 192.168.0.101
> > > > > >> >> >> >> > - c.local -> 192.168.0.102
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > As well as a candidate node 192.168.0.103 that we want
> > > > > >> >> >> >> > to take over the role of c.local.
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > 1. Shut down controller process on node .102 (to make
> > > sure
> > > > we
> > > > > >> >> don't "go
> > > > > >> >> >> >> > back in time").
> > > > > >> >> >> >> > 2. rsync state from leader to .103.
> > > > > >> >> >> >> > 3. Start controller process on .103.
> > > > > >> >> >> >> > 4. Point the c.local entry at .103.
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > I have a few questions about this sequence:
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > 1. Would this sequence be safe against leadership
> > > changes?
> > > > > >> >> >> >> >
> > > > > >> >> >> >>
> > > > > >> >> >> >> If the leader changes, the new leader should have all
> of
> > > the
> > > > > >> >> committed entries that the old leader had.
> > > > > >> >> >> >>
> > > > > >> >> >> >> > 2. Does it work
> > > > > >> >> >> >>
> > > > > >> >> >> >> Probably the biggest issue is dealing with "torn
> writes"
> > > that
> > > > > >> happen
> > > > > >> >> because you're copying the current log segment while it's
> being
> > > > > written
> > > > > >> to.
> > > > > >> >> The system should be robust against this. However, we don't
> > > > > regularly do
> > > > > >> >> this, so there hasn't been a lot of testing.
> > > > > >> >> >> >>
> > > > > >> >> >> >> I think Jose had a PR for improving the handling of
> this
> > > > which
> > > > > we
> > > > > >> >> might want to dig up. We'd want the system to auto-truncate
> the
> > > > > partial
> > > > > >> >> record at the end of the log, if there is one.
> > > > > >> >> >> >>
> > > > > >> >> >> >> > 3. By "state", do we mean `metadata.log.dir`?
> Something
> > > > else?
> > > > > >> >> >> >>
> > > > > >> >> >> >> Yes, the state of the metadata.log.dir. Keep in mind
> you
> > > will
> > > > > need
> > > > > >> >> to change the node ID in meta.properties after copying, of
> > > course.
> > > > > >> >> >> >>
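> > > > > >> >> >> >> A hedged sketch of what that edit might look like (the
> > > > > >> >> >> >> cluster ID below is a placeholder):
> > > > > >> >> >> >>
> > > > > >> >> >> >>   # <metadata.log.dir>/meta.properties on the new node
> > > > > >> >> >> >>   version=1
> > > > > >> >> >> >>   cluster.id=AEIoVonbTOWdPZ7FFZXYzg
> > > > > >> >> >> >>   node.id=102   # changed from the leader's ID after the copy
> > > > > >> >> >> >>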
> > > > > >> >> >> >> > 4. What are the effects on cluster availability? (I
> > think
> > > > > this
> > > > > >> is
> > > > > >> >> the same
> > > > > >> >> >> >> > as asking what happens if a or b crashes during the
> > > > process,
> > > > > or
> > > > > >> if
> > > > > >> >> network
> > > > > >> >> >> >> > partitions occur).
> > > > > >> >> >> >>
> > > > > >> >> >> >> Cluster metadata state tends to be pretty small,
> > > > > >> >> >> >> typically a hundred megabytes or so. Therefore, I do not
> > > > > >> >> >> >> think it will take more than a second or two to copy from
> > > > > >> >> >> >> one node to another. However, if you do experience a
> > > > > >> >> >> >> crash when one node out of three is down, then you will
> > > > > >> >> >> >> be unavailable until you can bring up a second node to
> > > > > >> >> >> >> regain a majority.
> > > > > >> >> >> >>
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > ---
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > If this is considered the official way of handling
> > > > > >> >> >> >> > controller node replacements, does it make sense to
> > > > > >> >> >> >> > improve documentation in this area? Is there already a
> > > > > >> >> >> >> > plan for this documentation laid out in some KIPs? This
> > > > > >> >> >> >> > is something I'd be happy to contribute to.
> > > > > >> >> >> >> >
> > > > > >> >> >> >>
> > > > > >> >> >> >> Yes, I think we should have official documentation
> about
> > > > this.
> > > > > >> We'd
> > > > > >> >> be happy to review anything in that area.
> > > > > >> >> >> >>
> > > > > >> >> >> >> >> To circle back to KIP-853, I think it stands a good
> > > chance
> > > > > of
> > > > > >> >> making it
> > > > > >> >> >> >> >> into AK 4.0.
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > This sounds good, but the point I was making was
> > > > > >> >> >> >> > whether we could have a release with both KRaft and ZK
> > > > > >> >> >> >> > supporting this feature, to ease the migration out of
> > > > > >> >> >> >> > ZK.
> > > > > >> >> >> >> >
> > > > > >> >> >> >>
> > > > > >> >> >> >> The problem is, supporting multiple controller
> > > > > >> >> >> >> implementations is a huge burden. So we don't want to
> > > > > >> >> >> >> extend the 3.x release line past the point that's needed
> > > > > >> >> >> >> to complete all the must-dos (SCRAM, delegation tokens,
> > > > > >> >> >> >> JBOD).
> > > > > >> >> >> >>
> > > > > >> >> >> >> best,
> > > > > >> >> >> >> Colin
> > > > > >> >> >> >>
> > > > > >> >> >> >>
> > > > > >> >> >> >> > BR,
> > > > > >> >> >> >> > Anton
> > > > > >> >> >> >> >
> > > > > >> >> >> >> > On Thu, Nov 9, 2023 at 23:04 Colin McCabe
> > > > > >> >> >> >> > <cmcc...@apache.org> wrote:
> > > > > >> >> >> >> >
> > > > > >> >> >> >> >> Hi Anton,
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> It rarely makes sense to scale up and down the
> number
> > of
> > > > > >> >> controller nodes
> > > > > >> >> >> >> >> in the cluster. Only one controller node will be
> > active
> > > at
> > > > > any
> > > > > >> >> given time.
> > > > > >> >> >> >> >> The main reason to use 5 nodes would be to be able
> to
> > > > > tolerate
> > > > > >> 2
> > > > > >> >> failures
> > > > > >> >> >> >> >> instead of 1.
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> At Confluent, we generally run KRaft with 3
> > controllers.
> > > > We
> > > > > >> have
> > > > > >> >> not seen
> > > > > >> >> >> >> >> problems with this setup, even with thousands of
> > > clusters.
> > > > > We
> > > > > >> have
> > > > > >> >> >> >> >> discussed using 5 node controller clusters on
> certain
> > > very
> > > > > big
> > > > > >> >> clusters,
> > > > > >> >> >> >> >> but we haven't done that yet. This is all very
> similar
> > > to
> > > > > ZK,
> > > > > >> >> where most
> > > > > >> >> >> >> >> deployments were 3 nodes as well.
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> KIP-853 is not a blocker for either 3.7 or 4.0. We
> > > > discussed
> > > > > >> this
> > > > > >> >> in
> > > > > >> >> >> >> >> several KIPs that happened this year and last year.
> > The
> > > > most
> > > > > >> >> notable was
> > > > > >> >> >> >> >> probably KIP-866, which was approved in May 2022.
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> Many users these days run in a Kubernetes
> environment
> > > > where
> > > > > >> >> Kubernetes
> > > > > >> >> >> >> >> actually controls the DNS. This makes changing the
> set
> > > of
> > > > > >> voters
> > > > > >> >> less
> > > > > >> >> >> >> >> important than it was historically.
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> For example, in a world with static DNS, you might
> > have
> > > to
> > > > > >> change
> > > > > >> >> the
> > > > > >> >> >> >> >> controller.quorum.voters setting from:
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> 100@a.local:9073,101@b.local:9073,102@c.local:9073
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> to:
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> 100@a.local:9073,101@b.local:9073,102@d.local:9073
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> In a world with k8s controlling the DNS, you simply
> > > > > >> >> >> >> >> remap c.local to point at the IP address of your new
> > > > > >> >> >> >> >> pod for controller 102, and you're done. No need to
> > > > > >> >> >> >> >> update controller.quorum.voters.
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> Another question is whether you re-create the pod
> data
> > > > from
> > > > > >> >> scratch every
> > > > > >> >> >> >> >> time you add a new node. If you store the controller
> > > data
> > > > > on an
> > > > > >> >> EBS volume
> > > > > >> >> >> >> >> (or cloud-specific equivalent), you really only have
> > to
> > > > > detach
> > > > > >> it
> > > > > >> >> from the
> > > > > >> >> >> >> >> previous pod and re-attach it to the new pod. k8s
> also
> > > > > handles
> > > > > >> >> this
> > > > > >> >> >> >> >> automatically, of course.
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> If you want to reconstruct the full controller pod
> > state
> > > > > each
> > > > > >> >> time you
> > > > > >> >> >> >> >> create a new pod (for example, so that you can use
> > only
> > > > > >> instance
> > > > > >> >> storage),
> > > > > >> >> >> >> >> you should be able to rsync that state from the
> > > > > >> >> >> >> >> leader. In general, the invariant that we want to
> > > > > >> >> >> >> >> maintain is that the state should not "go back in
> > > > > >> >> >> >> >> time" -- if controller 102 promised to hold all log
> > > > > >> >> >> >> >> data up to offset X, it should come back with
> > > > > >> >> >> >> >> committed data up to at least that offset.
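> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> A minimal sketch of that copy step, assuming the
> > > > > >> >> >> >> >> metadata log lives under /var/kafka/metadata (the
> > > > > >> >> >> >> >> hostname and paths are placeholders, and the copy
> > > > > >> >> >> >> >> should be re-run or the source quiesced so the state
> > > > > >> >> >> >> >> is stable):
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >>   rsync -a --delete leader.local:/var/kafka/metadata/ \
> > > > > >> >> >> >> >>       /var/kafka/metadata/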
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> There are lots of new features we'd like to
> implement
> > > for
> > > > > >> KRaft,
> > > > > >> >> and Kafka
> > > > > >> >> >> >> >> in general. If you have some you really would like
> to
> > > > see, I
> > > > > >> >> think everyone
> > > > > >> >> >> >> >> in the community would be happy to work with you.
> The
> > > flip
> > > > > >> side,
> > > > > >> >> of course,
> > > > > >> >> >> >> >> is that since there are an unlimited number of
> > features
> > > we
> > > > > >> could
> > > > > >> >> do, we
> > > > > >> >> >> >> >> can't really block the release for any one feature.
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> To circle back to KIP-853, I think it stands a good
> > > chance
> > > > > of
> > > > > >> >> making it
> > > > > >> >> >> >> >> into AK 4.0. Jose, Alyssa, and some other people
> have
> > > > > worked on
> > > > > >> >> it. It
> > > > > >> >> >> >> >> definitely won't make it into 3.7, since we have
> only
> > a
> > > > few
> > > > > >> weeks
> > > > > >> >> left
> > > > > >> >> >> >> >> before that release happens.
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> best,
> > > > > >> >> >> >> >> Colin
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >>
> > > > > >> >> >> >> >> On Thu, Nov 9, 2023, at 00:20, Anton Agestam wrote:
> > > > > >> >> >> >> >> > Hi Luke,
> > > > > >> >> >> >> >> >
> > > > > >> >> >> >> >> > We have been looking into what switching from ZK
> to
> > > > KRaft
> > > > > >> will
> > > > > >> >> mean for
> > > > > >> >> >> >> >> > Aiven.
> > > > > >> >> >> >> >> >
> > > > > >> >> >> >> >> > We heavily depend on an “immutable infrastructure”
> > > model
> > > > > for
> > > > > >> >> deployments.
> > > > > >> >> >> >> >> > This means that, when we perform upgrades, we
> > > introduce
> > > > > new
> > > > > >> >> nodes to our
> > > > > >> >> >> >> >> > clusters, scale the cluster up to incorporate the
> > new
> > > > > nodes,
> > > > > >> >> and then
> > > > > >> >> >> >> >> phase
> > > > > >> >> >> >> >> > the old ones out once all partitions are moved to
> > the
> > > > new
> > > > > >> >> generation.
> > > > > >> >> >> >> >> This
> > > > > >> >> >> >> >> > allows us, and anyone else using a similar model,
> to
> > > do
> > > > > >> >> upgrades as well
> > > > > >> >> >> >> >> as
> > > > > >> >> >> >> >> > cluster resizing with zero downtime.
> > > > > >> >> >> >> >> >
> > > > > >> >> >> >> >> > Reading up on KRaft and the ZK-to-KRaft migration
> > > path,
> > > > > this
> > > > > >> is
> > > > > >> >> somewhat
> > > > > >> >> >> >> >> > worrying for us. It seems like, if KIP-853 is not
> > > > included
> > > > > >> >> prior to
> > > > > >> >> >> >> >> > dropping support for ZK, we will essentially have
> no
> > > > > >> satisfying
> > > > > >> >> upgrade
> > > > > >> >> >> >> >> > path. Even if KIP-853 is included in 4.0, I’m
> unsure
> > > if
> > > > > that
> > > > > >> >> would allow
> > > > > >> >> >> >> >> a
> > > > > >> >> >> >> >> > migration path for us, since a new cluster
> > generation
> > > > > would
> > > > > >> not
> > > > > >> >> be able
> > > > > >> >> >> >> >> to
> > > > > >> >> >> >> >> > use ZK during the migration step.
> > > > > >> >> >> >> >> > On the other hand, if KIP-853 was released in a
> > > version
> > > > > prior
> > > > > >> >> to dropping
> > > > > >> >> >> >> >> > ZK support, because it allows online resizing of
> > KRaft
> > > > > >> >> clusters, this
> > > > > >> >> >> >> >> would
> > > > > >> >> >> >> >> > allow us and others that use an immutable
> > > infrastructure
> > > > > >> >> deployment
> > > > > >> >> >> >> >> model,
> > > > > >> >> >> >> >> > to provide a zero downtime migration path.
> > > > > >> >> >> >> >> >
> > > > > >> >> >> >> >> > For that reason, we’d like to raise awareness
> around
> > > > this
> > > > > >> issue
> > > > > >> >> and
> > > > > >> >> >> >> >> > encourage considering the implementation of
> KIP-853
> > or
> > > > > >> >> equivalent a
> > > > > >> >> >> >> >> blocker
> > > > > >> >> >> >> >> > not only for 4.0, but for the last version prior
> to
> > > 4.0.
> > > > > >> >> >> >> >> >
> > > > > >> >> >> >> >> > BR,
> > > > > >> >> >> >> >> > Anton
> > > > > >> >> >> >> >> >
> > > > > >> >> >> >> >> > On 2023/10/11 12:17:23 Luke Chen wrote:
> > > > > >> >> >> >> >> >> Hi all,
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> While Kafka 3.6.0 is released, I’d like to start
> > the
> > > > > >> >> discussion for the
> > > > > >> >> >> >> >> >> “road to Kafka 4.0”. Based on the plan in KIP-833
> > > > > >> >> >> >> >> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-Kafka3.7>,
> > > > > >> >> >> >> >> >> the next release, 3.7, will be the final release
> > > > > >> >> >> >> >> >> before moving to Kafka 4.0 to remove ZooKeeper from
> > > > > >> >> >> >> >> >> Kafka. Before making this major change, I'd like to
> > > > > >> >> >> >> >> >> get consensus on the "must-have features/fixes for
> > > > > >> >> >> >> >> >> Kafka 4.0", to avoid some users being surprised when
> > > > > >> >> >> >> >> >> upgrading to Kafka 4.0. The intent is to have clear
> > > > > >> >> >> >> >> >> communication about what to expect in the following
> > > > > >> >> >> >> >> >> months.
> > > > > >> >> >> >> >> >> In particular we should be signaling what features
> > > > > >> >> >> >> >> >> and configurations are not supported, or at risk (if
> > > > > >> >> >> >> >> >> no one is able to add support or fix known bugs).
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> Here is the list of JIRA tickets
> > > > > >> >> >> >> >> >> <https://issues.apache.org/jira/issues/?jql=labels%20%3D%204.0-blocker>
> > > > > >> >> >> >> >> >> I labeled as "4.0-blocker". The criteria I used to
> > > > > >> >> >> >> >> >> label tickets as "4.0-blocker" are:
> > > > > >> >> >> >> >> >> 1. The feature is supported in ZooKeeper mode, but
> > > > > >> >> >> >> >> >> not yet supported in KRaft mode (ex: KIP-858: JBOD
> > > > > >> >> >> >> >> >> in KRaft)
> > > > > >> >> >> >> >> >> 2. Critical bugs in KRaft (ex: KAFKA-15489: split
> > > > > >> >> >> >> >> >> brain in the KRaft controller quorum)
> > > > > >> >> >> >> >> >> controller quorum)
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> If you disagree with my current list, you're welcome
> > > > > >> >> >> >> >> >> to start a discussion in the specific JIRA ticket.
> > > > > >> >> >> >> >> >> Or, if you think there are some tickets I missed,
> > > > > >> >> >> >> >> >> you're welcome to start a discussion in the JIRA
> > > > > >> >> >> >> >> >> ticket and ping me or other people. After we reach
> > > > > >> >> >> >> >> >> consensus, we can label/unlabel it accordingly.
> > > > > >> >> >> >> >> >> Again, the goal is to have open communication with
> > > > > >> >> >> >> >> >> the community about what will be coming in 4.0.
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> Below are the high-level categories of the list
> > > > > >> >> >> >> >> >> content:
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> 1. Recovery from disk failure
> > > > > >> >> >> >> >> >> KIP-856
> > > > > >> >> >> >> >> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-856:+KRaft+Disk+Failure+Recovery>:
> > > > > >> >> >> >> >> >> KRaft Disk Failure Recovery
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> 2. Pre-vote, to support more than 3 controllers
> > > > > >> >> >> >> >> >> KIP-650
> > > > > >> >> >> >> >> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-650%3A+Enhance+Kafkaesque+Raft+semantics>:
> > > > > >> >> >> >> >> >> Enhance Kafkaesque Raft semantics
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> 3. JBOD support
> > > > > >> >> >> >> >> >> KIP-858
> > > > > >> >> >> >> >> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft>:
> > > > > >> >> >> >> >> >> Handle JBOD broker disk failure in KRaft
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> 4. Scale up/down Controllers
> > > > > >> >> >> >> >> >> KIP-853
> > > > > >> >> >> >> >> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes>:
> > > > > >> >> >> >> >> >> KRaft Controller Membership Changes
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> 5. Modifying dynamic configurations on the KRaft
> > > > > controller
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> 6. Critical bugs in KRaft
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> Does this make sense?
> > > > > >> >> >> >> >> >> Any feedback is welcomed.
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >> >> Thank you.
> > > > > >> >> >> >> >> >> Luke
> > > > > >> >> >> >> >> >>
> > > > > >> >> >> >> >>
> > > > > >> >> >> >>
> > > > > >> >> >>
> > > > > >> >>
> > > > > >>
> > > > >
> > > >
> > >
> >
>


-- 

Josep Prat
Open Source Engineering Director, Aiven
josep.p...@aiven.io   |   +491715557497
aiven.io <https://www.aiven.io>
Aiven Deutschland GmbH
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B
