Hi Paolo, Colin,

Let's discuss the details of this issue in the PR: https://github.com/apache/kafka/pull/15193
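For anyone following along, the startup failure in Paolo's message quoted below comes down to one rule: if zookeeper.metadata.migration.enable is true, controller.quorum.voters must still parse to a non-empty voter set. Here is a minimal Python sketch of that rule. It is only an illustration under my own assumptions, not the actual Scala code in KafkaConfig; the function names parse_voters and validate_migration_config are hypothetical, and the voter format assumed is the usual id@host:port list.

```python
def parse_voters(value):
    """Parse 'id@host:port,id@host:port' into {id: (host, port)}.

    Raises ValueError on a malformed entry. Hypothetical helper for illustration.
    """
    voters = {}
    for entry in filter(None, (e.strip() for e in value.split(","))):
        node_id, at, endpoint = entry.partition("@")
        host, colon, port = endpoint.rpartition(":")
        if not (at and colon and node_id.isdigit() and host and port.isdigit()):
            raise ValueError(f"cannot parse voter entry: {entry!r}")
        voters[int(node_id)] = (host, int(port))
    return voters


def validate_migration_config(props):
    """Mimic the rule behind the error in the quoted stack trace (a sketch only)."""
    migration = props.get("zookeeper.metadata.migration.enable", "false").lower() == "true"
    # An absent or empty controller.quorum.voters yields an empty dict, which fails the check.
    if migration and not parse_voters(props.get("controller.quorum.voters", "")):
        raise ValueError(
            "If using zookeeper.metadata.migration.enable, "
            "controller.quorum.voters must contain a parseable set of voters."
        )
```

Under this rule, removing controller.quorum.voters while leaving migration enabled fails at startup, which matches what Paolo reports.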
On Mon, Jan 15, 2024 at 4:21 PM Paolo Patierno <paolo.patie...@gmail.com> wrote: > Hi Colin, > I was the one raising the issue about rollback and I also already tried > what you mentioned but with no success. > During the first rolling, I left > the zookeeper.metadata.migration.enable=true but > removed controller.quorum.voters and controller.listener.names. > This is what I get from the brokers on restarting: > > 2024-01-15 09:19:14,172] ERROR Exiting Kafka due to fatal exception > (kafka.Kafka$) > org.apache.kafka.common.config.ConfigException: If using > zookeeper.metadata.migration.enable, controller.quorum.voters must contain > a parseable set of voters. > at > > kafka.server.KafkaConfig.validateNonEmptyQuorumVotersForMigration$1(KafkaConfig.scala:2286) > at kafka.server.KafkaConfig.validateValues(KafkaConfig.scala:2371) > at kafka.server.KafkaConfig.<init>(KafkaConfig.scala:2233) > at kafka.server.KafkaConfig.<init>(KafkaConfig.scala:1604) > at kafka.server.KafkaConfig$.fromProps(KafkaConfig.scala:1527) > at kafka.Kafka$.buildServer(Kafka.scala:72) > at kafka.Kafka$.main(Kafka.scala:91) > at kafka.Kafka.main(Kafka.scala) > > Did you try it? > Am I missing anything in your procedure? > > Thanks, > Paolo > > On Mon, 15 Jan 2024 at 09:13, Colin McCabe <co...@cmccabe.xyz> wrote: > > > Docs fix discussed in the thread is here: > > https://github.com/apache/kafka/pull/15193 > > > > best, > > Colin > > > > > > On Sun, Jan 14, 2024, at 23:56, Colin McCabe wrote: > > > Hi Stanislav, > > > > > > Thanks for making the first RC. The fact that it's titled RC2 is > > > messing with my mind a bit. I hope this doesn't make people think that > > > we're farther along than we are, heh. > > > > > > On Sun, Jan 14, 2024, at 13:54, Jakub Scholz wrote: > > >> *> Nice catch! It does seem like we should have gated this behind the > > >> metadata> version as KIP-858 implies. Is the cluster configured with > > >> multiple log> dirs? 
What is the impact of the error messages?* > > >> > > >> I did not observe any obvious impact. I was able to send and receive > > >> messages as normal. But to be honest, I have no idea what else > > >> this might impact, so I did not try anything special. > > >> > > >> I think everyone upgrading an existing KRaft cluster will go through > > this > > >> stage (running Kafka 3.7 with an older metadata version for at least a > > >> while). So even if it is just a logged exception without any other > > impact I > > >> wonder if it might scare users from upgrading. But I leave it to > others > > to > > >> decide if this is a blocker or not. > > >> > > > > > > Hi Jakub, > > > > > > Thanks for trying the RC. I think what you found is a blocker bug > > > because it will generate a huge amount of logspam. I guess we didn't find > > > it in junit tests since logspam doesn't fail the automated tests. But > > > certainly it's not suitable for production. Did you file a JIRA yet? > > > > > >> On Sun, Jan 14, 2024 at 10:17 PM Stanislav Kozlovski > > >> <stanis...@confluent.io.invalid> wrote: > > >> > > >>> Hey Luke, > > >>> > > >>> This is an interesting problem. Given the fact that the KIP for > having > > a > > >>> 3.8 release passed, I think it tips the scale towards not calling > > this a > > >>> blocker and expecting it to be solved in 3.7.1. > > >>> > > >>> It is unfortunate that it would not seem safe to migrate to KRaft in > > 3.7.0 > > >>> (given the inability to roll back safely), but if that's true - the > same > > >>> case would apply for 3.6.0. So in any case users would be expected > to > > use a > > >>> patch release for this. > > > > > > Hi Luke, > > > > > > Thanks for testing rollback. I think this is a case where the > > > documentation is wrong. The intention was for the steps to basically > > > be: > > > > > > 1. roll all the brokers into zk mode, but with migration enabled > > > 2. take down the kraft quorum > > > 3.
rmr /controller, allowing a hybrid broker to take over. > > > 4. roll all the brokers into zk mode without migration enabled (if > > desired) > > > > > > With these steps, there isn't really unavailability since a ZK > > > controller can be elected quickly after the kraft quorum is gone. > > > > > >>> Further, since we will have a 3.8 release - it is > > >>> likely we will ultimately recommend users upgrade from that version > > given > > >>> its aim is to have strategic KRaft feature parity with ZK. > > >>> That being said, I am not 100% on this. Let me know whether you think > > this > > >>> should block the release, Luke. I am also tagging Colin and David to > > weigh > > >>> in with their opinions, as they worked on the migration logic. > > > > > > The rollback docs are new in 3.7 so the fact that they're wrong is a > > > clear blocker, I think. But easy to fix, I believe. I will create a PR. > > > > > > best, > > > Colin > > > > > >>> > > >>> Hey Kirk and Chris, > > >>> > > >>> Unless I'm missing something - KAFKA-16029 is simply a bad log > due > > to > > >>> improper closing. And the PR description implies this has been > present > > >>> since 3.5. While annoying, I don't see a strong reason for this to > > block > > >>> the release. > > >>> > > >>> Hey Jakub, > > >>> > > >>> Nice catch! It does seem like we should have gated this behind the > > metadata > > >>> version as KIP-858 implies. Is the cluster configured with multiple > log > > >>> dirs? What is the impact of the error messages? > > >>> > > >>> Tagging Igor (the author of the KIP) to weigh in. > > >>> > > >>> Best, > > >>> Stanislav > > >>> > > >>> On Sat, Jan 13, 2024 at 7:22 PM Jakub Scholz <ja...@scholz.cz> > wrote: > > >>> > > >>> > Hi, > > >>> > > > >>> > I was trying the RC2 and ran into the following issue ...
when I > run > > >>> > 3.7.0-RC2 KRaft cluster with metadata version set to 3.6-IV2 > metadata > > >>> > version, I seem to be getting repeated errors like this in the > > controller > > >>> > logs: > > >>> > > > >>> > 2024-01-13 16:58:01,197 INFO [QuorumController id=0] > > >>> assignReplicasToDirs: > > >>> > event failed with UnsupportedVersionException in 15 microseconds. > > >>> > (org.apache.kafka.controller.QuorumController) > > >>> > [quorum-controller-0-event-handler] > > >>> > 2024-01-13 16:58:01,197 ERROR [ControllerApis nodeId=0] Unexpected > > error > > >>> > handling request RequestHeader(apiKey=ASSIGN_REPLICAS_TO_DIRS, > > >>> > apiVersion=0, clientId=1000, correlationId=14, headerVersion=2) -- > > >>> > AssignReplicasToDirsRequestData(brokerId=1000, brokerEpoch=5, > > >>> > directories=[DirectoryData(id=w_uxN7pwQ6eXSMrOKceYIQ, > > >>> > topics=[TopicData(topicId=bvAKLSwmR7iJoKv2yZgygQ, > > >>> > partitions=[PartitionData(partitionIndex=2), > > >>> > PartitionData(partitionIndex=1)]), > > >>> > TopicData(topicId=uNe7f5VrQgO0zST6yH1jDQ, > > >>> > partitions=[PartitionData(partitionIndex=0)])])]) with context > > >>> > RequestContext(header=RequestHeader(apiKey=ASSIGN_REPLICAS_TO_DIRS, > > >>> > apiVersion=0, clientId=1000, correlationId=14, headerVersion=2), > > >>> > connectionId='172.16.14.219:9090-172.16.14.217:53590-7', > > clientAddress=/ > > >>> > 172.16.14.217, principal=User:CN=my-cluster-kafka,O=io.strimzi, > > >>> > listenerName=ListenerName(CONTROLPLANE-9090), securityProtocol=SSL, > > >>> > clientInformation=ClientInformation(softwareName=apache-kafka-java, > > >>> > softwareVersion=3.7.0), fromPrivilegedListener=false, > > >>> > > > >>> > > > >>> > > > principalSerde=Optional[org.apache.kafka.common.security.authenticator.DefaultKafkaPrincipalBuilder@71004ad2 > > >>> > ]) > > >>> > (kafka.server.ControllerApis) [quorum-controller-0-event-handler] > > >>> > java.util.concurrent.CompletionException: > > >>> > 
org.apache.kafka.common.errors.UnsupportedVersionException: > Directory > > >>> > assignment is not supported yet. > > >>> > > > >>> > at > > >>> > > > >>> > > > >>> > > > java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332) > > >>> > at > > >>> > > > >>> > > > >>> > > > java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347) > > >>> > at > > >>> > > > >>> > > > >>> > > > java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636) > > >>> > at > > >>> > > > >>> > > > >>> > > > java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) > > >>> > at > > >>> > > > >>> > > > >>> > > > java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) > > >>> > at > > >>> > > > >>> > > > >>> > > > org.apache.kafka.controller.QuorumController$ControllerWriteEvent.complete(QuorumController.java:880) > > >>> > at > > >>> > > > >>> > > > >>> > > > org.apache.kafka.controller.QuorumController$ControllerWriteEvent.handleException(QuorumController.java:871) > > >>> > at > > >>> > > > >>> > > > >>> > > > org.apache.kafka.queue.KafkaEventQueue$EventContext.completeWithException(KafkaEventQueue.java:148) > > >>> > at > > >>> > > > >>> > > > >>> > > > org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:137) > > >>> > at > > >>> > > > >>> > > > >>> > > > org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210) > > >>> > at > > >>> > > > >>> > > > >>> > > > org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181) > > >>> > at java.base/java.lang.Thread.run(Thread.java:840) > > >>> > > > >>> > Caused by: > > org.apache.kafka.common.errors.UnsupportedVersionException: > > >>> > Directory assignment is not supported yet. > > >>> > > > >>> > Is that expected? 
I guess with the metadata version set to 3.6-IV2, > > it > > >>> > makes sense that the request is not supported. But shouldn't then > the > > >>> > request not be sent at all by the brokers? (I did not open a JIRA > > for > > >>> it, > > >>> > but I can open one if you agree this is not expected) > > >>> > > > >>> > Thanks & Regards > > >>> > Jakub > > >>> > > > >>> > On Sat, Jan 13, 2024 at 8:03 AM Luke Chen <show...@gmail.com> > wrote: > > >>> > > > >>> > > Hi Stanislav, > > >>> > > > > >>> > > I commented in the "Apache Kafka 3.7.0 Release" thread, but maybe > > you > > >>> > > missed it. > > >>> > > cross-posting here: > > >>> > > > > >>> > > There is a bug KAFKA-16101 > > >>> > > <https://issues.apache.org/jira/browse/KAFKA-16101> reporting > that > > >>> > "Kafka > > >>> > > cluster will be unavailable during KRaft migration rollback". > > >>> > > The impact of this issue is that if brokers try to roll back to > ZK > > mode > > >>> > > during the KRaft migration process, there will be a period of time > the > > >>> > cluster > > >>> > > is unavailable. > > >>> > > Since the ZK to KRaft migration is a production-ready > feature, > > I > > >>> > think > > >>> > > this should be addressed soon. > > >>> > > Do you think this is a blocker for v3.7.0? > > >>> > > > > >>> > > Thanks. > > >>> > > Luke > > >>> > > > > >>> > > On Sat, Jan 13, 2024 at 8:36 AM Chris Egerton < > > fearthecel...@gmail.com > > >>> > > > >>> > > wrote: > > >>> > > > > >>> > > > Thanks, Kirk! > > >>> > > > > > >>> > > > @Stanislav--do you believe that this warrants a new RC? > > >>> > > > > > >>> > > > On Fri, Jan 12, 2024, 19:08 Kirk True <k...@kirktrue.pro> > wrote: > > >>> > > > > > >>> > > > > Hi Chris/Stanislav, > > >>> > > > > > > >>> > > > > I'm working on the 'Unable to find FetchSessionHandler' log > > problem > > >>> > > > > (KAFKA-16029) and have put out a draft PR ( > > >>> > > > > https://github.com/apache/kafka/pull/15186).
I will use the > > >>> > quickstart > > >>> > > > > approach as a second means to reproduce/verify while I wait > > for the > > >>> > > PR's > > >>> > > > > Jenkins job to finish. > > >>> > > > > > > >>> > > > > Thanks, > > >>> > > > > Kirk > > >>> > > > > > > >>> > > > > On Fri, Jan 12, 2024, at 11:31 AM, Chris Egerton wrote: > > >>> > > > > > Hi Stanislav, > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > Thanks for running this release! > > >>> > > > > > > > >>> > > > > > To verify, I: > > >>> > > > > > - Built from source using Java 11 with both: > > >>> > > > > > - - the 3.7.0-rc2 tag on GitHub > > >>> > > > > > - - the kafka-3.7.0-src.tgz artifact from > > >>> > > > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/ > > >>> > > > > > - Checked signatures and checksums > > >>> > > > > > - Ran the quickstart using both: > > >>> > > > > > - - The kafka_2.13-3.7.0.tgz artifact from > > >>> > > > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/ > > >>> with > > >>> > > Java > > >>> > > > > 11 > > >>> > > > > > and Scala 2.13 in KRaft mode > > >>> > > > > > - - Our shiny new broker Docker image, > apache/kafka:3.7.0-rc2 > > >>> > > > > > - Ran all unit tests > > >>> > > > > > - Ran all integration tests for Connect and MM2 > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > I found two minor areas for concern: > > >>> > > > > > > > >>> > > > > > 1. (Possibly a blocker) > > >>> > > > > > When running the quickstart, I noticed this ERROR-level log > > >>> message > > >>> > > > being > > >>> > > > > > emitted frequently (but not every time) when I killed my > > console > > >>> > > > consumer > > >>> > > > > > via ctrl-C: > > >>> > > > > > > > >>> > > > > > > [2024-01-12 11:00:31,088] ERROR [Consumer > > >>> > > clientId=console-consumer, > > >>> > > > > > groupId=console-consumer-74388] Unable to find > > >>> FetchSessionHandler > > >>> > > for > > >>> > > > > node > > >>> > > > > > 1.
Ignoring fetch response > > >>> > > > > > (org.apache.kafka.clients.consumer.internals.AbstractFetch) > > >>> > > > > > > > >>> > > > > > I see that this error message is already reported in > > >>> > > > > > https://issues.apache.org/jira/browse/KAFKA-16029. I think > > we > > >>> > should > > >>> > > > > > prioritize fixing it for this release. I know it's probably > > >>> benign > > >>> > > but > > >>> > > > > it's > > >>> > > > > > really not a good look for us when basic operations log > error > > >>> > > messages, > > >>> > > > > and > > >>> > > > > > it may give new users some headaches. > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > 2. (Probably not a blocker) > > >>> > > > > > The following unit tests failed the first time around, and > > all of > > >>> > > them > > >>> > > > > > passed the second time I ran them: > > >>> > > > > > > > >>> > > > > > - (clients) > > >>> > > > > > > ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup() > > >>> > > > > > - (clients) SelectorTest.testConnectionsByClientMetric() > > >>> > > > > > - (clients) > Tls13SelectorTest.testConnectionsByClientMetric() > > >>> > > > > > - (connect) > > >>> > > TopicAdminTest.retryEndOffsetsShouldRetryWhenTopicNotFound > > >>> > > > (I > > >>> > > > > > thought I fixed this one! 🤬🤬) > > >>> > > > > > - (core) > > ProducerIdManagerTest.testUnrecoverableErrors(Errors)[2] > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > Thanks again for your work on this release, and > > congratulations > > >>> to > > >>> > > > Kafka > > >>> > > > > > Streams for having zero flaky unit tests during my > > >>> > > highly-experimental > > >>> > > > > > single laptop run! 
> > >>> > > > > > > > >>> > > > > > > > >>> > > > > > Cheers, > > >>> > > > > > > > >>> > > > > > Chris > > >>> > > > > > > > >>> > > > > > On Thu, Jan 11, 2024 at 1:33 PM Stanislav Kozlovski > > >>> > > > > > <stanis...@confluent.io.invalid> wrote: > > >>> > > > > > > > >>> > > > > > > Hello Kafka users, developers, and client-developers, > > >>> > > > > > > > > >>> > > > > > > This is the first candidate for release of Apache Kafka > > 3.7.0. > > >>> > > > > > > > > >>> > > > > > > Note it's named "RC2" because I had a few "failed" RCs > > that I > > >>> had > > >>> > > > > > > cut/uploaded but ultimately had to scrap prior to > > announcing > > >>> due > > >>> > to > > >>> > > > new > > >>> > > > > > > blockers arriving before I could even announce them. > > >>> > > > > > > > > >>> > > > > > > Further - I haven't yet been able to set up the system > > tests > > >>> > > > > successfully. > > >>> > > > > > > And the integration/unit tests do have a few failures > that > > I > > >>> have > > >>> > > to > > >>> > > > > spend > > >>> > > > > > > time triaging. I would appreciate any help in case anyone > > >>> notices > > >>> > > any > > >>> > > > > tests > > >>> > > > > > > failing that they're subject matters experts in. Expect > me > > to > > >>> > > follow > > >>> > > > > up in > > >>> > > > > > > a day or two with more detailed analysis. 
> > >>> > > > > > > > > >>> > > > > > > Major changes include: > > >>> > > > > > > - Early Access to KIP-848 - the next generation of the > > consumer > > >>> > > > > rebalance > > >>> > > > > > > protocol > > >>> > > > > > > - KIP-858: Adding JBOD support to KRaft > > >>> > > > > > > - KIP-714: Observability into Client metrics via a > > standardized > > >>> > > > > interface > > >>> > > > > > > > > >>> > > > > > > Check more information in the WIP blog post: > > >>> > > > > > > https://github.com/apache/kafka-site/pull/578 > > >>> > > > > > > > > >>> > > > > > > Release notes for the 3.7.0 release: > > >>> > > > > > > > > >>> > > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/RELEASE_NOTES.html > > >>> > > > > > > > > >>> > > > > > > *** Please download, test and vote by Thursday, January > > 18, 9am > > >>> > PT > > >>> > > > *** > > >>> > > > > > > > > >>> > > > > > > Usually these deadlines tend to be 2-3 days, but due to > > this > > >>> > being > > >>> > > > the > > >>> > > > > > > first RC and the tests not having ran yet, I am giving it > > a bit > > >>> > > more > > >>> > > > > time. 
> > >>> > > > > > > > > >>> > > > > > > Kafka's KEYS file containing PGP keys we use to sign the > > >>> release: > > >>> > > > > > > https://kafka.apache.org/KEYS > > >>> > > > > > > > > >>> > > > > > > * Release artifacts to be voted upon (source and binary): > > >>> > > > > > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/ > > >>> > > > > > > > > >>> > > > > > > * Docker release artifact to be voted upon: > > >>> > > > > > > apache/kafka:3.7.0-rc2 > > >>> > > > > > > > > >>> > > > > > > * Maven artifacts to be voted upon: > > >>> > > > > > > > > >>> > > > > > >>> > https://repository.apache.org/content/groups/staging/org/apache/kafka/ > > >>> > > > > > > > > >>> > > > > > > * Javadoc: > > >>> > > > > > > > > >>> > > > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/javadoc/ > > >>> > > > > > > > > >>> > > > > > > * Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag: > > >>> > > > > > > https://github.com/apache/kafka/releases/tag/3.7.0-rc2 > > >>> > > > > > > > > >>> > > > > > > * Documentation: > > >>> > > > > > > https://kafka.apache.org/37/documentation.html > > >>> > > > > > > > > >>> > > > > > > * Protocol: > > >>> > > > > > > https://kafka.apache.org/37/protocol.html > > >>> > > > > > > > > >>> > > > > > > * Successful Jenkins builds for the 3.7 branch: > > >>> > > > > > > Unit/integration tests: > > >>> > > > > > > > > https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.7/58/ > > >>> > > > > > > There are failing tests here. I have to follow up with > > triaging > > >>> > > some > > >>> > > > of > > >>> > > > > > > the failures and figuring out if they're actual problems > or > > >>> > simply > > >>> > > > > flakes. > > >>> > > > > > > > > >>> > > > > > > System tests: > > >>> > > > > https://jenkins.confluent.io/job/system-test-kafka/job/3.7/ > > >>> > > > > > > > > >>> > > > > > > No successful system test runs yet. I am working on > > getting the > > >>> > job > > >>> > > > to > > >>> > > > > run. 
> > >>> > > > > > > > > >>> > > > > > > * Successful Docker Image Github Actions Pipeline for 3.7 > > >>> branch: > > >>> > > > > > > Attached are the scan_report and report_jvm output files > > from > > >>> the > > >>> > > > > Docker > > >>> > > > > > > Build run: > > >>> > > > > > > > > >>> > > > > > > >>> > > > > >>> > > https://github.com/apache/kafka/actions/runs/7486094960/job/20375761673 > > >>> > > > > > > > > >>> > > > > > > And the final docker image build job - Docker Build Test > > >>> > Pipeline: > > >>> > > > > > > https://github.com/apache/kafka/actions/runs/7486178277 > > >>> > > > > > > > > >>> > > > > > > The image is apache/kafka:3.7.0-rc2 - > > >>> > > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > > https://hub.docker.com/layers/apache/kafka/3.7.0-rc2/images/sha256-5b4707c08170d39549fbb6e2a3dbb83936a50f987c0c097f23cb26b4c210c226?context=explore > > >>> > > > > > > > > >>> > > > > > > /************************************** > > >>> > > > > > > > > >>> > > > > > > Thanks, > > >>> > > > > > > Stanislav Kozlovski > > >>> > > > > > > > > >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > > >>> -- > > >>> Best, > > >>> Stanislav > > >>> > > > > > -- > Paolo Patierno > > *Senior Principal Software Engineer @ Red Hat**Microsoft MVP on **Azure* > > Twitter : @ppatierno <http://twitter.com/ppatierno> > Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno> > GitHub : ppatierno <https://github.com/ppatierno> >