Hi Colin,
I am the one who raised the rollback issue, and I had already tried what
you describe, without success.
During the first roll, I left zookeeper.metadata.migration.enable=true
in place but removed controller.quorum.voters and
controller.listener.names.
This is what the brokers log on restart:

[2024-01-15 09:19:14,172] ERROR Exiting Kafka due to fatal exception
(kafka.Kafka$)
org.apache.kafka.common.config.ConfigException: If using
zookeeper.metadata.migration.enable, controller.quorum.voters must contain
a parseable set of voters.
        at
kafka.server.KafkaConfig.validateNonEmptyQuorumVotersForMigration$1(KafkaConfig.scala:2286)
        at kafka.server.KafkaConfig.validateValues(KafkaConfig.scala:2371)
        at kafka.server.KafkaConfig.<init>(KafkaConfig.scala:2233)
        at kafka.server.KafkaConfig.<init>(KafkaConfig.scala:1604)
        at kafka.server.KafkaConfig$.fromProps(KafkaConfig.scala:1527)
        at kafka.Kafka$.buildServer(Kafka.scala:72)
        at kafka.Kafka$.main(Kafka.scala:91)
        at kafka.Kafka.main(Kafka.scala)

Did you try it?
Am I missing anything in your procedure?
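
For reference, my reading of the trace is that the broker refuses to start
whenever migration is enabled and the voters list is empty. Below is a rough
Python sketch of that check as I understand it. It is not the actual
KafkaConfig.scala code; the helper name and the use of ValueError are my own
stand-ins, and only the two property names and the message text come from the
log above.

```python
def validate_migration_config(props):
    """Hypothetical stand-in for the check the stack trace points at."""
    # Assumption: the flag is treated as a boolean and defaults to false.
    migration = props.get("zookeeper.metadata.migration.enable", "false")
    migration_enabled = migration.strip().lower() == "true"

    # Assumption: an absent or blank voters string counts as "no voters".
    raw_voters = props.get("controller.quorum.voters", "")
    voters = [v.strip() for v in raw_voters.split(",") if v.strip()]

    if migration_enabled and not voters:
        raise ValueError(
            "If using zookeeper.metadata.migration.enable, "
            "controller.quorum.voters must contain a parseable set of voters."
        )


# The combination I had on the brokers during the first roll:
first_roll = {"zookeeper.metadata.migration.enable": "true"}
try:
    validate_migration_config(first_roll)
    outcome = "started"
except ValueError:
    outcome = "rejected"
```

If that reading is right, a broker can only drop controller.quorum.voters in
the same roll that also drops zookeeper.metadata.migration.enable.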

Thanks,
Paolo

On Mon, 15 Jan 2024 at 09:13, Colin McCabe <co...@cmccabe.xyz> wrote:

> Docs fix discussed in the thread is here:
> https://github.com/apache/kafka/pull/15193
>
> best,
> Colin
>
>
> On Sun, Jan 14, 2024, at 23:56, Colin McCabe wrote:
> > Hi Stanislav,
> >
> > Thanks for making the first RC. The fact that it's titled RC2 is
> > messing with my mind a bit. I hope this doesn't make people think that
> > we're farther along than we are, heh.
> >
> > On Sun, Jan 14, 2024, at 13:54, Jakub Scholz wrote:
> >> > Nice catch! It does seem like we should have gated this behind the
> >> > metadata version as KIP-858 implies. Is the cluster configured with
> >> > multiple log dirs? What is the impact of the error messages?
> >>
> >> I did not observe any obvious impact. I was able to send and receive
> >> messages as normal. But to be honest, I have no idea what else
> >> this might impact, so I did not try anything special.
> >>
> >> I think everyone upgrading an existing KRaft cluster will go through
> this
> >> stage (running Kafka 3.7 with an older metadata version for at least a
> >> while). So even if it is just a logged exception without any other
> impact I
> >> wonder if it might scare users from upgrading. But I leave it to others
> to
> >> decide if this is a blocker or not.
> >>
> >
> > Hi Jakub,
> >
> > Thanks for trying the RC. I think what you found is a blocker bug
> > because it will generate a huge amount of logspam. I guess we didn't find
> > it in junit tests since logspam doesn't fail the automated tests. But
> > certainly it's not suitable for production. Did you file a JIRA yet?
> >
> >> On Sun, Jan 14, 2024 at 10:17 PM Stanislav Kozlovski
> >> <stanis...@confluent.io.invalid> wrote:
> >>
> >>> Hey Luke,
> >>>
> >>> This is an interesting problem. Given the fact that the KIP for having
> a
> >>> 3.8 release passed, I think it tips the scale towards not calling
> this a
> >>> blocker and expecting it to be solved in 3.7.1.
> >>>
> >>> It is unfortunate that it would not seem safe to migrate to KRaft in
> 3.7.0
> >>> (given the inability to rollback safely), but if that's true - the same
> >>> case would apply for 3.6.0. So in any case users would be expected to
> use a
> >>> patch release for this.
> >
> > Hi Luke,
> >
> > Thanks for testing rollback. I think this is a case where the
> > documentation is wrong. The intention was for the steps to basically
> > be:
> >
> > 1. roll all the brokers into zk mode, but with migration enabled
> > 2. take down the kraft quorum
> > 3. rmr /controller, allowing a hybrid broker to take over.
> > 4. roll all the brokers into zk mode without migration enabled (if
> desired)
> >
> > With these steps, there isn't really unavailability since a ZK
> > controller can be elected quickly after the kraft quorum is gone.
> >
> >>> Further, since we will have a 3.8 release - it is
> >>> likely we will ultimately recommend users upgrade from that version
> given
> >>> its aim is to have strategic KRaft feature parity with ZK.
> >>> That being said, I am not 100% on this. Let me know whether you think
> this
> >>> should block the release, Luke. I am also tagging Colin and David to
> weigh
> >>> in with their opinions, as they worked on the migration logic.
> >
> > The rollback docs are new in 3.7 so the fact that they're wrong is a
> > clear blocker, I think. But easy to fix, I believe. I will create a PR.
> >
> > best,
> > Colin
> >
> >>>
> >>> Hey Kirk and Chris,
> >>>
> >>> Unless I'm missing something - KAFKA-16029 is simply a bad log due
> to
> >>> improper closing. And the PR description implies this has been present
> >>> since 3.5. While annoying, I don't see a strong reason for this to
> block
> >>> the release.
> >>>
> >>> Hey Jakub,
> >>>
> >>> Nice catch! It does seem like we should have gated this behind the
> metadata
> >>> version as KIP-858 implies. Is the cluster configured with multiple log
> >>> dirs? What is the impact of the error messages?
> >>>
> >>> Tagging Igor (the author of the KIP) to weigh in.
> >>>
> >>> Best,
> >>> Stanislav
> >>>
> >>> On Sat, Jan 13, 2024 at 7:22 PM Jakub Scholz <ja...@scholz.cz> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > I was trying the RC2 and ran into the following issue ... when I run
> >>> > 3.7.0-RC2 KRaft cluster with metadata version set to 3.6-IV2 metadata
> >>> > version, I seem to be getting repeated errors like this in the
> controller
> >>> > logs:
> >>> >
> >>> > 2024-01-13 16:58:01,197 INFO [QuorumController id=0]
> >>> assignReplicasToDirs:
> >>> > event failed with UnsupportedVersionException in 15 microseconds.
> >>> > (org.apache.kafka.controller.QuorumController)
> >>> > [quorum-controller-0-event-handler]
> >>> > 2024-01-13 16:58:01,197 ERROR [ControllerApis nodeId=0] Unexpected
> error
> >>> > handling request RequestHeader(apiKey=ASSIGN_REPLICAS_TO_DIRS,
> >>> > apiVersion=0, clientId=1000, correlationId=14, headerVersion=2) --
> >>> > AssignReplicasToDirsRequestData(brokerId=1000, brokerEpoch=5,
> >>> > directories=[DirectoryData(id=w_uxN7pwQ6eXSMrOKceYIQ,
> >>> > topics=[TopicData(topicId=bvAKLSwmR7iJoKv2yZgygQ,
> >>> > partitions=[PartitionData(partitionIndex=2),
> >>> > PartitionData(partitionIndex=1)]),
> >>> > TopicData(topicId=uNe7f5VrQgO0zST6yH1jDQ,
> >>> > partitions=[PartitionData(partitionIndex=0)])])]) with context
> >>> > RequestContext(header=RequestHeader(apiKey=ASSIGN_REPLICAS_TO_DIRS,
> >>> > apiVersion=0, clientId=1000, correlationId=14, headerVersion=2),
> >>> > connectionId='172.16.14.219:9090-172.16.14.217:53590-7',
> clientAddress=/
> >>> > 172.16.14.217, principal=User:CN=my-cluster-kafka,O=io.strimzi,
> >>> > listenerName=ListenerName(CONTROLPLANE-9090), securityProtocol=SSL,
> >>> > clientInformation=ClientInformation(softwareName=apache-kafka-java,
> >>> > softwareVersion=3.7.0), fromPrivilegedListener=false,
> >>> >
> >>> >
> >>>
> principalSerde=Optional[org.apache.kafka.common.security.authenticator.DefaultKafkaPrincipalBuilder@71004ad2
> >>> > ])
> >>> > (kafka.server.ControllerApis) [quorum-controller-0-event-handler]
> >>> > java.util.concurrent.CompletionException:
> >>> > org.apache.kafka.common.errors.UnsupportedVersionException: Directory
> >>> > assignment is not supported yet.
> >>> >
> >>> >  at
> >>> >
> >>> >
> >>>
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
> >>> >  at
> >>> >
> >>> >
> >>>
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
> >>> >  at
> >>> >
> >>> >
> >>>
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636)
> >>> >  at
> >>> >
> >>> >
> >>>
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
> >>> >  at
> >>> >
> >>> >
> >>>
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
> >>> >  at
> >>> >
> >>> >
> >>>
> org.apache.kafka.controller.QuorumController$ControllerWriteEvent.complete(QuorumController.java:880)
> >>> >  at
> >>> >
> >>> >
> >>>
> org.apache.kafka.controller.QuorumController$ControllerWriteEvent.handleException(QuorumController.java:871)
> >>> >  at
> >>> >
> >>> >
> >>>
> org.apache.kafka.queue.KafkaEventQueue$EventContext.completeWithException(KafkaEventQueue.java:148)
> >>> >  at
> >>> >
> >>> >
> >>>
> org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:137)
> >>> >  at
> >>> >
> >>> >
> >>>
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
> >>> >  at
> >>> >
> >>> >
> >>>
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
> >>> >  at java.base/java.lang.Thread.run(Thread.java:840)
> >>> >
> >>> > Caused by:
> org.apache.kafka.common.errors.UnsupportedVersionException:
> >>> > Directory assignment is not supported yet.
> >>> >
> >>> > Is that expected? I guess with the metadata version set to 3.6-IV2,
> it
> >>> > makes sense that the request is not supported. But shouldn't then the
> >>> > request not be sent at all by the brokers? (I did not open a JIRA
> for
> >>> it,
> >>> > but I can open one if you agree this is not expected)
> >>> >
> >>> > Thanks & Regards
> >>> > Jakub
> >>> >
> >>> > On Sat, Jan 13, 2024 at 8:03 AM Luke Chen <show...@gmail.com> wrote:
> >>> >
> >>> > > Hi Stanislav,
> >>> > >
> >>> > > I commented in the "Apache Kafka 3.7.0 Release" thread, but maybe
> you
> >>> > > missed it.
> >>> > > cross-posting here:
> >>> > >
> >>> > > There is a bug KAFKA-16101
> >>> > > <https://issues.apache.org/jira/browse/KAFKA-16101> reporting that
> >>> > "Kafka
> >>> > > cluster will be unavailable during KRaft migration rollback".
> >>> > > The impact for this issue is that if brokers try to rollback to ZK
> mode
> >>> > > during KRaft migration process, there will be a period of time the
> >>> > cluster
> >>> > > is unavailable.
> >>> > > Since ZK-to-KRaft migration is a production-ready feature,
> I
> >>> > think
> >>> > > this should be addressed soon.
> >>> > > Do you think this is a blocker for v3.7.0?
> >>> > >
> >>> > > Thanks.
> >>> > > Luke
> >>> > >
> >>> > > On Sat, Jan 13, 2024 at 8:36 AM Chris Egerton <
> fearthecel...@gmail.com
> >>> >
> >>> > > wrote:
> >>> > >
> >>> > > > Thanks, Kirk!
> >>> > > >
> >>> > > > @Stanislav--do you believe that this warrants a new RC?
> >>> > > >
> >>> > > > On Fri, Jan 12, 2024, 19:08 Kirk True <k...@kirktrue.pro> wrote:
> >>> > > >
> >>> > > > > Hi Chris/Stanislav,
> >>> > > > >
> >>> > > > > I'm working on the 'Unable to find FetchSessionHandler' log
> problem
> >>> > > > > (KAFKA-16029) and have put out a draft PR (
> >>> > > > > https://github.com/apache/kafka/pull/15186). I will use the
> >>> > quickstart
> >>> > > > > approach as a second means to reproduce/verify while I wait
> for the
> >>> > > PR's
> >>> > > > > Jenkins job to finish.
> >>> > > > >
> >>> > > > > Thanks,
> >>> > > > > Kirk
> >>> > > > >
> >>> > > > > On Fri, Jan 12, 2024, at 11:31 AM, Chris Egerton wrote:
> >>> > > > > > Hi Stanislav,
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > Thanks for running this release!
> >>> > > > > >
> >>> > > > > > To verify, I:
> >>> > > > > > - Built from source using Java 11 with both:
> >>> > > > > > - - the 3.7.0-rc2 tag on GitHub
> >>> > > > > > - - the kafka-3.7.0-src.tgz artifact from
> >>> > > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> >>> > > > > > - Checked signatures and checksums
> >>> > > > > > - Ran the quickstart using both:
> >>> > > > > > - - The kafka_2.13-3.7.0.tgz artifact from
> >>> > > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> >>> with
> >>> > > Java
> >>> > > > > 11
> >>> > > > > > and Scala 13 in KRaft mode
> >>> > > > > > - - Our shiny new broker Docker image, apache/kafka:3.7.0-rc2
> >>> > > > > > - Ran all unit tests
> >>> > > > > > - Ran all integration tests for Connect and MM2
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > I found two minor areas for concern:
> >>> > > > > >
> >>> > > > > > 1. (Possibly a blocker)
> >>> > > > > > When running the quickstart, I noticed this ERROR-level log
> >>> message
> >>> > > > being
> >>> > > > > > emitted frequently (but not every time) when I killed my
> console
> >>> > > > consumer
> >>> > > > > > via ctrl-C:
> >>> > > > > >
> >>> > > > > > > [2024-01-12 11:00:31,088] ERROR [Consumer
> >>> > > clientId=console-consumer,
> >>> > > > > > groupId=console-consumer-74388] Unable to find
> >>> FetchSessionHandler
> >>> > > for
> >>> > > > > node
> >>> > > > > > 1. Ignoring fetch response
> >>> > > > > > (org.apache.kafka.clients.consumer.internals.AbstractFetch)
> >>> > > > > >
> >>> > > > > > I see that this error message is already reported in
> >>> > > > > > https://issues.apache.org/jira/browse/KAFKA-16029. I think
> we
> >>> > should
> >>> > > > > > prioritize fixing it for this release. I know it's probably
> >>> benign
> >>> > > but
> >>> > > > > it's
> >>> > > > > > really not a good look for us when basic operations log error
> >>> > > messages,
> >>> > > > > and
> >>> > > > > > it may give new users some headaches.
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > 2. (Probably not a blocker)
> >>> > > > > > The following unit tests failed the first time around, and
> all of
> >>> > > them
> >>> > > > > > passed the second time I ran them:
> >>> > > > > >
> >>> > > > > > - (clients)
> >>> > > > >
> ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
> >>> > > > > > - (clients) SelectorTest.testConnectionsByClientMetric()
> >>> > > > > > - (clients) Tls13SelectorTest.testConnectionsByClientMetric()
> >>> > > > > > - (connect)
> >>> > > TopicAdminTest.retryEndOffsetsShouldRetryWhenTopicNotFound
> >>> > > > (I
> >>> > > > > > thought I fixed this one! 🤬🤬)
> >>> > > > > > - (core)
> ProducerIdManagerTest.testUnrecoverableErrors(Errors)[2]
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > Thanks again for your work on this release, and
> congratulations
> >>> to
> >>> > > > Kafka
> >>> > > > > > Streams for having zero flaky unit tests during my
> >>> > > highly-experimental
> >>> > > > > > single laptop run!
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > Cheers,
> >>> > > > > >
> >>> > > > > > Chris
> >>> > > > > >
> >>> > > > > > On Thu, Jan 11, 2024 at 1:33 PM Stanislav Kozlovski
> >>> > > > > > <stanis...@confluent.io.invalid> wrote:
> >>> > > > > >
> >>> > > > > > > Hello Kafka users, developers, and client-developers,
> >>> > > > > > >
> >>> > > > > > > This is the first candidate for release of Apache Kafka
> 3.7.0.
> >>> > > > > > >
> >>> > > > > > > Note it's named "RC2" because I had a few "failed" RCs
> that I
> >>> had
> >>> > > > > > > cut/uploaded but ultimately had to scrap prior to
> announcing
> >>> due
> >>> > to
> >>> > > > new
> >>> > > > > > > blockers arriving before I could even announce them.
> >>> > > > > > >
> >>> > > > > > > Further - I haven't yet been able to set up the system
> tests
> >>> > > > > successfully.
> >>> > > > > > > And the integration/unit tests do have a few failures that
> I
> >>> have
> >>> > > to
> >>> > > > > spend
> >>> > > > > > > time triaging. I would appreciate any help in case anyone
> >>> notices
> >>> > > any
> >>> > > > > tests
> >>> > > > > > > failing that they're subject matter experts in. Expect me
> to
> >>> > > follow
> >>> > > > > up in
> >>> > > > > > > a day or two with more detailed analysis.
> >>> > > > > > >
> >>> > > > > > > Major changes include:
> >>> > > > > > > - Early Access to KIP-848 - the next generation of the
> consumer
> >>> > > > > rebalance
> >>> > > > > > > protocol
> >>> > > > > > > - KIP-858: Adding JBOD support to KRaft
> >>> > > > > > > - KIP-714: Observability into Client metrics via a
> standardized
> >>> > > > > interface
> >>> > > > > > >
> >>> > > > > > > Check more information in the WIP blog post:
> >>> > > > > > > https://github.com/apache/kafka-site/pull/578
> >>> > > > > > >
> >>> > > > > > > Release notes for the 3.7.0 release:
> >>> > > > > > >
> >>> > > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/RELEASE_NOTES.html
> >>> > > > > > >
> >>> > > > > > > *** Please download, test and vote by Thursday, January
> 18, 9am
> >>> > PT
> >>> > > > ***
> >>> > > > > > >
> >>> > > > > > > Usually these deadlines tend to be 2-3 days, but due to
> this
> >>> > being
> >>> > > > the
> >>> > > > > > > first RC and the tests not having run yet, I am giving it
> a bit
> >>> > > more
> >>> > > > > time.
> >>> > > > > > >
> >>> > > > > > > Kafka's KEYS file containing PGP keys we use to sign the
> >>> release:
> >>> > > > > > > https://kafka.apache.org/KEYS
> >>> > > > > > >
> >>> > > > > > > * Release artifacts to be voted upon (source and binary):
> >>> > > > > > >
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> >>> > > > > > >
> >>> > > > > > > * Docker release artifact to be voted upon:
> >>> > > > > > > apache/kafka:3.7.0-rc2
> >>> > > > > > >
> >>> > > > > > > * Maven artifacts to be voted upon:
> >>> > > > > > >
> >>> > > >
> >>> https://repository.apache.org/content/groups/staging/org/apache/kafka/
> >>> > > > > > >
> >>> > > > > > > * Javadoc:
> >>> > > > > > >
> >>> > >
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/javadoc/
> >>> > > > > > >
> >>> > > > > > > * Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag:
> >>> > > > > > > https://github.com/apache/kafka/releases/tag/3.7.0-rc2
> >>> > > > > > >
> >>> > > > > > > * Documentation:
> >>> > > > > > > https://kafka.apache.org/37/documentation.html
> >>> > > > > > >
> >>> > > > > > > * Protocol:
> >>> > > > > > > https://kafka.apache.org/37/protocol.html
> >>> > > > > > >
> >>> > > > > > > * Successful Jenkins builds for the 3.7 branch:
> >>> > > > > > > Unit/integration tests:
> >>> > > > > > >
> https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.7/58/
> >>> > > > > > > There are failing tests here. I have to follow up with
> triaging
> >>> > > some
> >>> > > > of
> >>> > > > > > > the failures and figuring out if they're actual problems or
> >>> > simply
> >>> > > > > flakes.
> >>> > > > > > >
> >>> > > > > > > System tests:
> >>> > > > > https://jenkins.confluent.io/job/system-test-kafka/job/3.7/
> >>> > > > > > >
> >>> > > > > > > No successful system test runs yet. I am working on
> getting the
> >>> > job
> >>> > > > to
> >>> > > > > run.
> >>> > > > > > >
> >>> > > > > > > * Successful Docker Image Github Actions Pipeline for 3.7
> >>> branch:
> >>> > > > > > > Attached are the scan_report and report_jvm output files
> from
> >>> the
> >>> > > > > Docker
> >>> > > > > > > Build run:
> >>> > > > > > >
> >>> > > > >
> >>> > >
> >>>
> https://github.com/apache/kafka/actions/runs/7486094960/job/20375761673
> >>> > > > > > >
> >>> > > > > > > And the final docker image build job - Docker Build Test
> >>> > Pipeline:
> >>> > > > > > > https://github.com/apache/kafka/actions/runs/7486178277
> >>> > > > > > >
> >>> > > > > > > The image is apache/kafka:3.7.0-rc2 -
> >>> > > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> https://hub.docker.com/layers/apache/kafka/3.7.0-rc2/images/sha256-5b4707c08170d39549fbb6e2a3dbb83936a50f987c0c097f23cb26b4c210c226?context=explore
> >>> > > > > > >
> >>> > > > > > > /**************************************
> >>> > > > > > >
> >>> > > > > > > Thanks,
> >>> > > > > > > Stanislav Kozlovski
> >>> > > > > > >
> >>> > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>>
> >>> --
> >>> Best,
> >>> Stanislav
> >>>
>


-- 
Paolo Patierno

*Senior Principal Software Engineer @ Red Hat*
*Microsoft MVP on Azure*

Twitter : @ppatierno <http://twitter.com/ppatierno>
Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
GitHub : ppatierno <https://github.com/ppatierno>
