[
https://issues.apache.org/jira/browse/KAFKA-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viktor Somogyi-Vass resolved KAFKA-17950.
-----------------------------------------
Resolution: Invalid
Ok, my mistake. It seems that an incorrect voter list configuration in
controller1.properties caused the issue.
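
For anyone hitting something similar, a quick consistency check across the controller property files may catch this kind of misconfiguration before the final restart. The following is only a minimal sketch. It assumes the voter list is set through the `controller.quorum.voters` key (the comment above does not name the exact key) and that the files use plain `key=value` properties syntax. The helper names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumed key: the resolution comment only mentions a "voter list".
VOTERS_KEY = "controller.quorum.voters"


def parse_properties(text):
    """Parse simple key=value properties text, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def voter_set(props):
    """Return the configured voters as an order-insensitive set of entries."""
    raw = props.get(VOTERS_KEY, "")
    return frozenset(entry.strip() for entry in raw.split(",") if entry.strip())


def find_voter_mismatches(named_texts):
    """Given {file name: file text}, return the files whose voter set
    differs from the most common voter set among all files."""
    sets = {name: voter_set(parse_properties(text))
            for name, text in named_texts.items()}
    if not sets:
        return {}
    majority, _ = Counter(sets.values()).most_common(1)[0]
    return {name: sorted(voters)
            for name, voters in sets.items() if voters != majority}


if __name__ == "__main__":
    import sys
    # Example: python check_voters.py repro-conf/controller*.properties
    texts = {path: Path(path).read_text() for path in sys.argv[1:]}
    for name, voters in find_voter_mismatches(texts).items():
        print(f"{name}: voter list differs from the majority: {voters}")
```

The check only compares the files with each other, so it cannot tell which voter list is the intended one; it just flags the odd one out.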
> The leader requested truncation to below the current high watermark
> -------------------------------------------------------------------
>
> Key: KAFKA-17950
> URL: https://issues.apache.org/jira/browse/KAFKA-17950
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.9.0, 3.9.1
> Reporter: Viktor Somogyi-Vass
> Priority: Blocker
> Attachments: broker1.log, broker2.log, broker3.log,
> controller-logs.zip, controller1-migration-enabled.properties,
> controller1.properties, controller2-migration-enabled.properties,
> controller2.properties, controller3-migration-enabled.properties,
> controller3.properties, kraft1.log, kraft2.log, kraft3.log,
> producer-perf.log, producer.properties, server1-migrated-to-kraft.properties,
> server1-migration-enabled.properties, server1.properties,
> server2-migrated-to-kraft.properties, server2-migration-enabled.properties,
> server2.properties, server3-migrated-to-kraft.properties,
> server3-migration-enabled.properties, server3.properties, zookeeper.log
>
>
> While testing the migration from 3.9 ZK Kafka to 3.9 KRaft, I found that in
> the last step (finalization), where I restart the controllers in non-migration
> mode, the last controller restart causes a fatal failure in the cluster:
> every node (broker and controller) stops except the controller I restarted.
> The failing nodes throw the same exception at the same time:
> {noformat}
> [2024-11-06 14:02:13,498] ERROR Encountered fatal fault: Unexpected error in raft IO thread (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
> org.apache.kafka.common.KafkaException: The leader requested truncation to offset 484, which is below the current high watermark LogOffsetMetadata(offset=508, metadata=Optional.empty)
>     at org.apache.kafka.raft.KafkaRaftClient.lambda$handleFetchResponse$11(KafkaRaftClient.java:1619)
>     at java.base/java.util.Optional.ifPresent(Optional.java:183)
>     at org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1616)
>     at org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2457)
>     at org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2613)
>     at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3312)
>     at org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
>     at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
> {noformat}
> Setup:
> * single Zookeeper node
> * 3 brokers
> * 1 running producer-performance client
> * 3 controllers
> Repro:
> # Start Zookeeper with zookeeper.properties
> {noformat}
> bin/zookeeper-server-start.sh repro-conf/zookeeper.properties
> {noformat}
> # Start brokers with serverX.properties
> {noformat}
> bin/kafka-server-start.sh repro-conf/server1.properties
> bin/kafka-server-start.sh repro-conf/server2.properties
> bin/kafka-server-start.sh repro-conf/server3.properties
> {noformat}
> # Start the producer-performance tool
> {noformat}
> bin/kafka-producer-perf-test.sh --topic test1 --num-records 1000000 \
>   --throughput 100 --record-size 10000 \
>   --producer.config repro-conf/producer.properties
> {noformat}
> # Get the cluster ID and format all controller log dirs
> # Start the controllers in migration mode
> {noformat}
> bin/kafka-server-start.sh repro-conf/controller1-migration-enabled.properties
> bin/kafka-server-start.sh repro-conf/controller2-migration-enabled.properties
> bin/kafka-server-start.sh repro-conf/controller3-migration-enabled.properties
> {noformat}
> # Restart the brokers (rolling) in migration mode with the following configs.
> (My restart order was 1,2,3.)
> {noformat}
> bin/kafka-server-start.sh repro-conf/server1-migration-enabled.properties
> bin/kafka-server-start.sh repro-conf/server2-migration-enabled.properties
> bin/kafka-server-start.sh repro-conf/server3-migration-enabled.properties
> {noformat}
> # Restart the brokers (rolling) in migrated mode with the following configs
> (at this point they are connected to the controllers and not ZK). My restart
> order was 1,2,3.
> {noformat}
> bin/kafka-server-start.sh repro-conf/server1-migrated-to-kraft.properties
> bin/kafka-server-start.sh repro-conf/server2-migrated-to-kraft.properties
> bin/kafka-server-start.sh repro-conf/server3-migrated-to-kraft.properties
> {noformat}
> # At this point all brokers run with KRaft, so do a rolling restart of the
> controllers to finalize. (The order was 3,2,1.)
> {noformat}
> bin/kafka-server-start.sh repro-conf/controller3.properties
> bin/kafka-server-start.sh repro-conf/controller2.properties
> bin/kafka-server-start.sh repro-conf/controller1.properties
> {noformat}
> At the last restart, when controller1 starts up, all other nodes crash at
> once. I have attached all logs and configuration.
> I've been working from the 3.9 branch; the hash is 4a562cd.
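
For context on the exception in the stack trace above: as far as I understand it, records below the high watermark are considered committed, so a follower treats a request to truncate below that point as a fatal inconsistency instead of discarding committed data. The snippet below is a simplified Python model of that guard, not the Kafka source; the function and exception names are hypothetical.

```python
class FatalRaftError(Exception):
    """Hypothetical stand-in for the KafkaException in the stack trace."""


def validate_truncation(truncation_offset, high_watermark):
    """Return the truncation offset if it is safe to apply.

    high_watermark is None when the follower does not know it yet;
    in that case there is nothing to compare against.
    """
    if high_watermark is not None and truncation_offset < high_watermark:
        # Truncating here would drop records that were already committed.
        raise FatalRaftError(
            f"The leader requested truncation to offset {truncation_offset}, "
            f"which is below the current high watermark {high_watermark}"
        )
    return truncation_offset


if __name__ == "__main__":
    # Offsets taken from the log line in the report.
    try:
        validate_truncation(484, 508)
    except FatalRaftError as err:
        print(f"fatal: {err}")
```

With the offsets from the report (truncation to 484, high watermark 508) the guard fires, which is consistent with a node whose view of the quorum diverged because of a differing voter list.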
--
This message was sent by Atlassian Jira
(v8.20.10#820010)