Viktor Somogyi-Vass created KAFKA-17950:
-------------------------------------------
Summary: The leader requested truncation to below the current high
watermark
Key: KAFKA-17950
URL: https://issues.apache.org/jira/browse/KAFKA-17950
Project: Kafka
Issue Type: Bug
Affects Versions: 3.9.0, 3.9.1
Reporter: Viktor Somogyi-Vass
Attachments: broker1.log, broker2.log, broker3.log,
controller-logs.zip, controller1-migration-enabled.properties,
controller1.properties, controller2-migration-enabled.properties,
controller2.properties, controller3-migration-enabled.properties,
controller3.properties, kraft1.log, kraft2.log, kraft3.log, producer-perf.log,
producer.properties, server1-migrated-to-kraft.properties,
server1-migration-enabled.properties, server1.properties,
server2-migrated-to-kraft.properties, server2-migration-enabled.properties,
server2.properties, server3-migrated-to-kraft.properties,
server3-migration-enabled.properties, server3.properties, zookeeper.log
While testing the migration from 3.9 ZK-based Kafka to 3.9 KRaft, I found that in the
last step (finalization), where I restart the controllers in non-migration mode,
the last controller restart causes a fatal failure in the cluster: every node
(broker and controller) stops except the controller I restarted.
The failing nodes all throw the same exception at the same time:
{noformat}
[2024-11-06 14:02:13,498] ERROR Encountered fatal fault: Unexpected error in raft IO thread (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
org.apache.kafka.common.KafkaException: The leader requested truncation to offset 484, which is below the current high watermark LogOffsetMetadata(offset=508, metadata=Optional.empty)
    at org.apache.kafka.raft.KafkaRaftClient.lambda$handleFetchResponse$11(KafkaRaftClient.java:1619)
    at java.base/java.util.Optional.ifPresent(Optional.java:183)
    at org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1616)
    at org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2457)
    at org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2613)
    at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3312)
    at org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
    at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
{noformat}
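For context on the message: offsets below the high watermark are considered committed, so a follower treats a leader's request to truncate below that point as fatal rather than discard committed records. A minimal sketch of such a guard follows, using the offsets from the log above. The class and method names are hypothetical, and this is an illustration, not the actual KafkaRaftClient code:

```java
import java.util.OptionalLong;

// Hypothetical illustration of a follower-side truncation guard.
// This is NOT Kafka's implementation; names and types are assumptions.
public class TruncationGuardSketch {

    /**
     * Rejects a truncation request that would drop records below the
     * high watermark. If no high watermark is known yet, nothing is checked.
     */
    public static void checkTruncation(long truncationOffset, OptionalLong highWatermark) {
        highWatermark.ifPresent(hw -> {
            if (truncationOffset < hw) {
                throw new IllegalStateException(
                    "The leader requested truncation to offset " + truncationOffset
                        + ", which is below the current high watermark " + hw);
            }
        });
    }

    public static void main(String[] args) {
        // Truncating exactly at the high watermark discards nothing committed.
        checkTruncation(508, OptionalLong.of(508));

        // Offsets taken from the exception in this report: 484 < 508.
        try {
            checkTruncation(484, OptionalLong.of(508));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the reported failure the equivalent condition is hit with truncation offset 484 and high watermark 508, and the fault handler terminates the process instead of catching the exception.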
Setup:
* single ZooKeeper node
* 3 brokers
* 1 running producer-performance client
* 3 controllers
Repro:
# Start ZooKeeper with zookeeper.properties
{noformat}
bin/zookeeper-server-start.sh repro-conf/zookeeper.properties
{noformat}
# Start brokers with serverX.properties
{noformat}
bin/kafka-server-start.sh repro-conf/server1.properties
bin/kafka-server-start.sh repro-conf/server2.properties
bin/kafka-server-start.sh repro-conf/server3.properties
{noformat}
# Start the producer-performance tool
{noformat}
bin/kafka-producer-perf-test.sh --topic test1 --num-records 1000000 \
  --throughput 100 --record-size 10000 \
  --producer.config repro-conf/producer.properties
{noformat}
# Start the controllers in migration mode
{noformat}
bin/kafka-server-start.sh repro-conf/controller1-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/controller2-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/controller3-migration-enabled.properties
{noformat}
# Restart the brokers in migration mode with the following configs. (My restart
order was 1,2,3.)
{noformat}
bin/kafka-server-start.sh repro-conf/server1-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/server2-migration-enabled.properties
bin/kafka-server-start.sh repro-conf/server3-migration-enabled.properties
{noformat}
# Restart the brokers in migrated mode with the following configs (at this
point they are connected to the controllers and not ZK). My restart order was
1,2,3.
{noformat}
bin/kafka-server-start.sh repro-conf/server1-migrated-to-kraft.properties
bin/kafka-server-start.sh repro-conf/server2-migrated-to-kraft.properties
bin/kafka-server-start.sh repro-conf/server3-migrated-to-kraft.properties
{noformat}
# At this point all brokers run with KRaft. Restart the controllers in
non-migration mode to finalize. (My restart order was 3,2,1.)
{noformat}
bin/kafka-server-start.sh repro-conf/controller3.properties
bin/kafka-server-start.sh repro-conf/controller2.properties
bin/kafka-server-start.sh repro-conf/controller1.properties
{noformat}
At the last restart, when controller1 starts up, all other nodes crash at once.
All logs and configuration files are attached.