Roland Sommer created KAFKA-15330:
-------------------------------------

             Summary: Migration from ZK to KRaft works with 3.4 but fails from 
3.5 upwards
                 Key: KAFKA-15330
                 URL: https://issues.apache.org/jira/browse/KAFKA-15330
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.5.1, 3.5.0
         Environment: Debian Bookworm/12.1
kafka 3.4 and 3.5 / scala 2.13
OpenJDK Runtime Environment (build 17.0.8+7-Debian-1deb12u1)
            Reporter: Roland Sommer


We recently tested migrating our old ZK-based Kafka clusters to KRaft while 
still on Kafka 3.4, and the migration succeeded on the first try. We have since 
updated to Kafka 3.5/3.5.1, and when we resumed the migration work we ran into 
unexpected problems.

On the controller we get messages like:
{code:java}
Aug 10 06:49:33 kafkactl01 kafka-server-start.sh[48572]: [2023-08-10 
06:49:33,072] INFO [KRaftMigrationDriver id=495] Still waiting for all 
controller nodes ready to begin the migration. due to: Missing apiVersion from 
nodes: [514, 760] 
(org.apache.kafka.metadata.migration.KRaftMigrationDriver){code}
On the broker side, we see:
{code:java}
06:52:56,109] INFO [BrokerLifecycleManager id=6 isZkBroker=true] Unable to 
register the broker because the RPC got timed out before it could be sent. 
(kafka.server.BrokerLifecycleManager){code}
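For triage, the node IDs named in the controller message can be mapped back to the {{controller.quorum.voters}} entries. The following is only a minimal Python sketch and not part of the reported setup: the helper names ({{missing_api_version_nodes}}, {{parse_voters}}) are hypothetical, and the regular expression assumes the message wording shown in the log above.

```python
import re

# Assumption: the message wording matches the KRaftMigrationDriver log line
# quoted in this report ("Missing apiVersion from nodes: [..]").
MISSING_RE = re.compile(r"Missing apiVersion from nodes:\s*\[([^\]]*)\]")


def missing_api_version_nodes(log_text: str) -> list[int]:
    """Return the node IDs listed in a 'Missing apiVersion' message."""
    # Log output may be wrapped across lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    match = MISSING_RE.search(flat)
    if match is None:
        return []
    return [int(part) for part in match.group(1).split(",") if part.strip()]


def parse_voters(value: str) -> dict[int, str]:
    """Parse a controller.quorum.voters value into {node_id: 'host:port'}."""
    voters = {}
    for entry in value.split(","):
        node_id, _, endpoint = entry.strip().partition("@")
        voters[int(node_id)] = endpoint
    return voters
```

With the values from this report, nodes 514 and 760 resolve to kafkactl03:9093 and kafkactl02:9093, i.e. the two controllers other than the one (id=495) that logged the message.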
If we reinstall the same development cluster with Kafka 3.4, using the exact 
same steps from the migration documentation (the only difference being 
{{inter.broker.protocol.version=3.4}} instead of 
{{inter.broker.protocol.version=3.5}}), everything works as expected. 
Updating to Kafka 3.5/3.5.1 reproduces the same problems.

Testing is done on a three-node kafka cluster with a three-node zookeeper 
ensemble and a three-node controller setup.

Besides our default configuration containing the active zookeeper hosts etc., 
this is what was added on the brokers:
{code:java}
# Migration
advertised.listeners=PLAINTEXT://kafka03:9092
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
zookeeper.metadata.migration.enable=true
controller.quorum.voters=495@kafkactl01:9093,760@kafkactl02:9093,514@kafkactl03:9093
controller.listener.names=CONTROLLER
{code}
The main controller config looks like this:
{code:java}
process.roles=controller
node.id=495
controller.quorum.voters=495@kafkactl01:9093,760@kafkactl02:9093,514@kafkactl03:9093
listeners=CONTROLLER://:9093
inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
zookeeper.metadata.migration.enable=true
{code}
Both configs contain identical {{zookeeper.connect}} settings. Everything is 
set up automatically, so the setup should be identical on every run, and we can 
reliably reproduce migration success on Kafka 3.4 and migration failure with 
the same setup on Kafka 3.5.
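
To rule out configuration drift between the broker and controller files, the migration-related settings can be compared mechanically. This is a minimal Python sketch under stated assumptions, not something used in the reported setup: the function names and the {{SHARED_KEYS}} selection are hypothetical, and comma-separated values are compared as unordered sets because the two files list {{listener.security.protocol.map}} in a different order.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value properties text, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Assumption: these are the settings expected to agree on brokers and controllers.
SHARED_KEYS = (
    "zookeeper.metadata.migration.enable",
    "controller.quorum.voters",
    "controller.listener.names",
    "listener.security.protocol.map",
)


def _as_set(value):
    # Compare comma-separated values independent of their order.
    if value is None:
        return None
    return frozenset(part.strip() for part in value.split(","))


def config_mismatches(broker: dict, controller: dict, keys=SHARED_KEYS) -> dict:
    """Return {key: (broker_value, controller_value)} for settings that differ."""
    return {
        key: (broker.get(key), controller.get(key))
        for key in keys
        if _as_set(broker.get(key)) != _as_set(controller.get(key))
    }
```

Applied to the two snippets shown above, this reports no mismatches, which is consistent with the configuration itself not being the cause of the different behaviour between 3.4 and 3.5.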

There are other issues mentioning problems with ApiVersions, such as 
KAFKA-15230; we are not sure whether this is a duplicate of the underlying 
problem there.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
