Hey Luke,

This is an interesting problem. Given that the KIP for having a 3.8
release passed, I think it tips the scale towards not calling this a
blocker and expecting it to be fixed in 3.7.1.

It is unfortunate that it does not seem safe to migrate to KRaft in 3.7.0
(given the inability to roll back safely), but if that's true, the same
would apply to 3.6.0. So in any case users would be expected to use a
patch release for this. Further, since we will have a 3.8 release, it is
likely we will ultimately recommend that users migrate from that version,
given its aim is to have strategic KRaft feature parity with ZK.
That being said, I am not 100% on this. Let me know whether you think this
should block the release, Luke. I am also tagging Colin and David to weigh
in with their opinions, as they worked on the migration logic.

Hey Kirk and Chris,

Unless I'm missing something - KAFKA-16029 is simply a bad log due to
improper closing. And the PR description implies this has been present
since 3.5. While annoying, I don't see a strong reason for this to block
the release.

Hey Jakub,

Nice catch! It does seem like we should have gated this behind the metadata
version as KIP-858 implies. Is the cluster configured with multiple log
dirs? What is the impact of the error messages?
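
To make the gating idea concrete, here is a minimal sketch of what I
mean. It is not the actual broker code: the class, the enum and the
method names are all hypothetical, and the only fact taken from this
thread is that a controller at 3.6-IV2 rejects ASSIGN_REPLICAS_TO_DIRS.
The point is just that the broker would check the finalized metadata
version before building the request, rather than sending it and letting
the controller fail it.

```java
import java.util.Optional;

// Hypothetical sketch only; none of these names are Kafka's real API.
final class DirAssignmentGateSketch {

    // Simplified stand-in for a metadata version. V3_6_IV2 is marked as
    // unsupported because the controller in this thread rejects the request
    // at that level; JBOD_CAPABLE is an invented placeholder for any
    // version that does support directory assignment.
    enum SketchMetadataVersion {
        V3_6_IV2(false),
        JBOD_CAPABLE(true);

        private final boolean directoryAssignmentSupported;

        SketchMetadataVersion(boolean directoryAssignmentSupported) {
            this.directoryAssignmentSupported = directoryAssignmentSupported;
        }

        boolean isDirectoryAssignmentSupported() {
            return directoryAssignmentSupported;
        }
    }

    // Returns a description of the request to send, or empty when the
    // finalized metadata version means the request must not be sent at all.
    static Optional<String> maybeBuildAssignRequest(SketchMetadataVersion mv, int brokerId) {
        if (!mv.isDirectoryAssignmentSupported()) {
            return Optional.empty();
        }
        return Optional.of("ASSIGN_REPLICAS_TO_DIRS(brokerId=" + brokerId + ")");
    }

    public static void main(String[] args) {
        System.out.println(maybeBuildAssignRequest(SketchMetadataVersion.V3_6_IV2, 1000));
        System.out.println(maybeBuildAssignRequest(SketchMetadataVersion.JBOD_CAPABLE, 1000));
    }
}
```

With a check like this on the sending side, a cluster at 3.6-IV2 would
never see the repeated UnsupportedVersionException in the controller
logs, because the request would not be issued in the first place.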

Tagging Igor (the author of the KIP) to weigh in.

Best,
Stanislav

On Sat, Jan 13, 2024 at 7:22 PM Jakub Scholz <ja...@scholz.cz> wrote:

> Hi,
>
> I was trying out RC2 and ran into the following issue: when I run a
> 3.7.0-RC2 KRaft cluster with the metadata version set to 3.6-IV2, I seem
> to be getting repeated errors like this in the controller
> logs:
>
> 2024-01-13 16:58:01,197 INFO [QuorumController id=0] assignReplicasToDirs:
> event failed with UnsupportedVersionException in 15 microseconds.
> (org.apache.kafka.controller.QuorumController)
> [quorum-controller-0-event-handler]
> 2024-01-13 16:58:01,197 ERROR [ControllerApis nodeId=0] Unexpected error
> handling request RequestHeader(apiKey=ASSIGN_REPLICAS_TO_DIRS,
> apiVersion=0, clientId=1000, correlationId=14, headerVersion=2) --
> AssignReplicasToDirsRequestData(brokerId=1000, brokerEpoch=5,
> directories=[DirectoryData(id=w_uxN7pwQ6eXSMrOKceYIQ,
> topics=[TopicData(topicId=bvAKLSwmR7iJoKv2yZgygQ,
> partitions=[PartitionData(partitionIndex=2),
> PartitionData(partitionIndex=1)]),
> TopicData(topicId=uNe7f5VrQgO0zST6yH1jDQ,
> partitions=[PartitionData(partitionIndex=0)])])]) with context
> RequestContext(header=RequestHeader(apiKey=ASSIGN_REPLICAS_TO_DIRS,
> apiVersion=0, clientId=1000, correlationId=14, headerVersion=2),
> connectionId='172.16.14.219:9090-172.16.14.217:53590-7', clientAddress=/
> 172.16.14.217, principal=User:CN=my-cluster-kafka,O=io.strimzi,
> listenerName=ListenerName(CONTROLPLANE-9090), securityProtocol=SSL,
> clientInformation=ClientInformation(softwareName=apache-kafka-java,
> softwareVersion=3.7.0), fromPrivilegedListener=false,
>
> principalSerde=Optional[org.apache.kafka.common.security.authenticator.DefaultKafkaPrincipalBuilder@71004ad2
> ])
> (kafka.server.ControllerApis) [quorum-controller-0-event-handler]
> java.util.concurrent.CompletionException:
> org.apache.kafka.common.errors.UnsupportedVersionException: Directory
> assignment is not supported yet.
>
>  at
>
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
>  at
>
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
>  at
>
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636)
>  at
>
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
>  at
>
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
>  at
>
> org.apache.kafka.controller.QuorumController$ControllerWriteEvent.complete(QuorumController.java:880)
>  at
>
> org.apache.kafka.controller.QuorumController$ControllerWriteEvent.handleException(QuorumController.java:871)
>  at
>
> org.apache.kafka.queue.KafkaEventQueue$EventContext.completeWithException(KafkaEventQueue.java:148)
>  at
>
> org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:137)
>  at
>
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
>  at
>
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
>  at java.base/java.lang.Thread.run(Thread.java:840)
>
> Caused by: org.apache.kafka.common.errors.UnsupportedVersionException:
> Directory assignment is not supported yet.
>
> Is that expected? I guess with the metadata version set to 3.6-IV2, it
> makes sense that the request is not supported. But then shouldn't the
> brokers not send the request at all? (I have not opened a JIRA for it,
> but I can open one if you agree this is not expected.)
>
> Thanks & Regards
> Jakub
>
> On Sat, Jan 13, 2024 at 8:03 AM Luke Chen <show...@gmail.com> wrote:
>
> > Hi Stanislav,
> >
> > I commented in the "Apache Kafka 3.7.0 Release" thread, but maybe you
> > missed it.
> > cross-posting here:
> >
> > There is a bug, KAFKA-16101
> > <https://issues.apache.org/jira/browse/KAFKA-16101>, reporting that
> > "Kafka cluster will be unavailable during KRaft migration rollback".
> > The impact of this issue is that if brokers try to roll back to ZK mode
> > during the KRaft migration process, there will be a period of time when
> > the cluster is unavailable.
> > Since ZK-to-KRaft migration is a production-ready feature, I think
> > this should be addressed soon.
> > Do you think this is a blocker for v3.7.0?
> >
> > Thanks.
> > Luke
> >
> > On Sat, Jan 13, 2024 at 8:36 AM Chris Egerton <fearthecel...@gmail.com>
> > wrote:
> >
> > > Thanks, Kirk!
> > >
> > > @Stanislav--do you believe that this warrants a new RC?
> > >
> > > On Fri, Jan 12, 2024, 19:08 Kirk True <k...@kirktrue.pro> wrote:
> > >
> > > > Hi Chris/Stanislav,
> > > >
> > > > I'm working on the 'Unable to find FetchSessionHandler' log problem
> > > > (KAFKA-16029) and have put out a draft PR (
> > > > https://github.com/apache/kafka/pull/15186). I will use the
> quickstart
> > > > approach as a second means to reproduce/verify while I wait for the
> > PR's
> > > > Jenkins job to finish.
> > > >
> > > > Thanks,
> > > > Kirk
> > > >
> > > > On Fri, Jan 12, 2024, at 11:31 AM, Chris Egerton wrote:
> > > > > Hi Stanislav,
> > > > >
> > > > >
> > > > > Thanks for running this release!
> > > > >
> > > > > To verify, I:
> > > > > - Built from source using Java 11 with both:
> > > > > - - the 3.7.0-rc2 tag on GitHub
> > > > > - - the kafka-3.7.0-src.tgz artifact from
> > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> > > > > - Checked signatures and checksums
> > > > > - Ran the quickstart using both:
> > > > > - - The kafka_2.13-3.7.0.tgz artifact from
> > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/ with
> > Java
> > > > 11
> > > > > and Scala 13 in KRaft mode
> > > > > - - Our shiny new broker Docker image, apache/kafka:3.7.0-rc2
> > > > > - Ran all unit tests
> > > > > - Ran all integration tests for Connect and MM2
> > > > >
> > > > >
> > > > > I found two minor areas for concern:
> > > > >
> > > > > 1. (Possibly a blocker)
> > > > > > When running the quickstart, I noticed this ERROR-level log message
> > > > > > being emitted frequently (but not every time) when I killed my
> > > > > > console consumer via ctrl-C:
> > > > >
> > > > > > [2024-01-12 11:00:31,088] ERROR [Consumer
> > clientId=console-consumer,
> > > > > groupId=console-consumer-74388] Unable to find FetchSessionHandler
> > for
> > > > node
> > > > > 1. Ignoring fetch response
> > > > > (org.apache.kafka.clients.consumer.internals.AbstractFetch)
> > > > >
> > > > > I see that this error message is already reported in
> > > > > https://issues.apache.org/jira/browse/KAFKA-16029. I think we
> should
> > > > > prioritize fixing it for this release. I know it's probably benign
> > but
> > > > it's
> > > > > really not a good look for us when basic operations log error
> > messages,
> > > > and
> > > > > it may give new users some headaches.
> > > > >
> > > > >
> > > > > 2. (Probably not a blocker)
> > > > > The following unit tests failed the first time around, and all of
> > them
> > > > > passed the second time I ran them:
> > > > >
> > > > > - (clients)
> > > > ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
> > > > > - (clients) SelectorTest.testConnectionsByClientMetric()
> > > > > - (clients) Tls13SelectorTest.testConnectionsByClientMetric()
> > > > > - (connect)
> > TopicAdminTest.retryEndOffsetsShouldRetryWhenTopicNotFound
> > > (I
> > > > > thought I fixed this one! 🤬🤬)
> > > > > - (core) ProducerIdManagerTest.testUnrecoverableErrors(Errors)[2]
> > > > >
> > > > >
> > > > > Thanks again for your work on this release, and congratulations to
> > > Kafka
> > > > > Streams for having zero flaky unit tests during my
> > highly-experimental
> > > > > single laptop run!
> > > > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Thu, Jan 11, 2024 at 1:33 PM Stanislav Kozlovski
> > > > > <stanis...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hello Kafka users, developers, and client-developers,
> > > > > >
> > > > > > This is the first candidate for release of Apache Kafka 3.7.0.
> > > > > >
> > > > > > Note it's named "RC2" because I had a few "failed" RCs that I had
> > > > > > cut/uploaded but ultimately had to scrap prior to announcing due
> to
> > > new
> > > > > > blockers arriving before I could even announce them.
> > > > > >
> > > > > > > Further - I haven't yet been able to set up the system tests
> > > > > > > successfully. And the integration/unit tests do have a few
> > > > > > > failures that I have to spend time triaging. I would appreciate
> > > > > > > any help in case anyone notices any tests failing that they're
> > > > > > > subject matter experts in. Expect me to follow up in a day or two
> > > > > > > with more detailed analysis.
> > > > > >
> > > > > > Major changes include:
> > > > > > - Early Access to KIP-848 - the next generation of the consumer
> > > > rebalance
> > > > > > protocol
> > > > > > - KIP-858: Adding JBOD support to KRaft
> > > > > > - KIP-714: Observability into Client metrics via a standardized
> > > > interface
> > > > > >
> > > > > > Check more information in the WIP blog post:
> > > > > > https://github.com/apache/kafka-site/pull/578
> > > > > >
> > > > > > Release notes for the 3.7.0 release:
> > > > > >
> > > > > >
> > > >
> > >
> >
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/RELEASE_NOTES.html
> > > > > >
> > > > > > *** Please download, test and vote by Thursday, January 18, 9am
> PT
> > > ***
> > > > > >
> > > > > > > Usually these deadlines tend to be 2-3 days, but due to this being
> > > > > > > the first RC and the tests not having run yet, I am giving it a
> > > > > > > bit more time.
> > > > > >
> > > > > > Kafka's KEYS file containing PGP keys we use to sign the release:
> > > > > > https://kafka.apache.org/KEYS
> > > > > >
> > > > > > * Release artifacts to be voted upon (source and binary):
> > > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> > > > > >
> > > > > > * Docker release artifact to be voted upon:
> > > > > > apache/kafka:3.7.0-rc2
> > > > > >
> > > > > > * Maven artifacts to be voted upon:
> > > > > >
> > > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> > > > > >
> > > > > > * Javadoc:
> > > > > >
> > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/javadoc/
> > > > > >
> > > > > > * Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag:
> > > > > > https://github.com/apache/kafka/releases/tag/3.7.0-rc2
> > > > > >
> > > > > > * Documentation:
> > > > > > https://kafka.apache.org/37/documentation.html
> > > > > >
> > > > > > * Protocol:
> > > > > > https://kafka.apache.org/37/protocol.html
> > > > > >
> > > > > > * Successful Jenkins builds for the 3.7 branch:
> > > > > > Unit/integration tests:
> > > > > > https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.7/58/
> > > > > > There are failing tests here. I have to follow up with triaging
> > some
> > > of
> > > > > > the failures and figuring out if they're actual problems or
> simply
> > > > flakes.
> > > > > >
> > > > > > System tests:
> > > > https://jenkins.confluent.io/job/system-test-kafka/job/3.7/
> > > > > >
> > > > > > No successful system test runs yet. I am working on getting the
> job
> > > to
> > > > run.
> > > > > >
> > > > > > * Successful Docker Image Github Actions Pipeline for 3.7 branch:
> > > > > > Attached are the scan_report and report_jvm output files from the
> > > > Docker
> > > > > > Build run:
> > > > > >
> > > >
> > https://github.com/apache/kafka/actions/runs/7486094960/job/20375761673
> > > > > >
> > > > > > And the final docker image build job - Docker Build Test
> Pipeline:
> > > > > > https://github.com/apache/kafka/actions/runs/7486178277
> > > > > >
> > > > > > The image is apache/kafka:3.7.0-rc2 -
> > > > > >
> > > >
> > >
> >
> https://hub.docker.com/layers/apache/kafka/3.7.0-rc2/images/sha256-5b4707c08170d39549fbb6e2a3dbb83936a50f987c0c097f23cb26b4c210c226?context=explore
> > > > > >
> > > > > > /**************************************
> > > > > >
> > > > > > Thanks,
> > > > > > Stanislav Kozlovski
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Best,
Stanislav
