[
https://issues.apache.org/jira/browse/KAFKA-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Colin McCabe resolved KAFKA-14197.
----------------------------------
Resolution: Duplicate
> Kraft broker fails to startup after topic creation failure
> ----------------------------------------------------------
>
> Key: KAFKA-14197
> URL: https://issues.apache.org/jira/browse/KAFKA-14197
> Project: Kafka
> Issue Type: Bug
> Components: kraft
> Reporter: Luke Chen
> Priority: Blocker
> Fix For: 3.3.0
>
>
> In the KRaft ControllerWriteEvent, we first apply the records to the
> controller's in-memory state, then send the records out via the Raft client.
> But if an error occurs while sending the records, there is no way to revert
> the change to the controller's in-memory state[1].
> The issue occurs when creating topics: the controller state is updated with
> topic and partition metadata (e.g., the broker-to-ISR map), but the records
> are not sent successfully (e.g., RecordBatchTooLargeException). Then, when the
> node shuts down, the controlled shutdown tries to remove the broker from the
> ISR by[2]:
> {code:java}
> generateLeaderAndIsrUpdates("enterControlledShutdown[" + brokerId + "]",
> brokerId, NO_LEADER, records,
> brokersToIsrs.partitionsWithBrokerInIsr(brokerId));{code}
>
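The ordering problem described above can be illustrated with a minimal, self-contained sketch. This is not Kafka code: ToyLog, ToyController, and both createTopics methods are hypothetical names invented for illustration, and the size limit stands in for whatever triggers RecordBatchTooLargeException. The append-first variant is only one possible way to avoid the stale state, not necessarily the approach the project took.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ApplyBeforeAppendSketch {
    /** Hypothetical stand-in for the Raft client: rejects batches above a size limit. */
    static class ToyLog {
        final List<String> committed = new ArrayList<>();
        final int maxBatchSize;

        ToyLog(int maxBatchSize) {
            this.maxBatchSize = maxBatchSize;
        }

        void append(List<String> records) {
            if (records.size() > maxBatchSize) {
                throw new IllegalStateException("batch too large: " + records.size());
            }
            committed.addAll(records);
        }
    }

    /** Hypothetical stand-in for the controller's in-memory topic state. */
    static class ToyController {
        final Set<String> topics = new HashSet<>();
        final ToyLog log;

        ToyController(ToyLog log) {
            this.log = log;
        }

        // Ordering described in the issue: mutate in-memory state, then append.
        void createTopicsApplyFirst(List<String> names) {
            topics.addAll(names);
            log.append(names); // if this throws, topics already contains the names
        }

        // One possible alternative: append first, mutate state only on success.
        void createTopicsAppendFirst(List<String> names) {
            log.append(names);
            topics.addAll(names);
        }
    }

    public static void main(String[] args) {
        List<String> batch = List.of("a", "b", "c"); // exceeds the toy limit of 2

        ToyController applyFirst = new ToyController(new ToyLog(2));
        try {
            applyFirst.createTopicsApplyFirst(batch);
        } catch (IllegalStateException e) {
            // The append failed, but the in-memory state was already changed.
        }
        System.out.println("apply-first:  in-memory topics=" + applyFirst.topics.size()
                + ", committed records=" + applyFirst.log.committed.size());

        ToyController appendFirst = new ToyController(new ToyLog(2));
        try {
            appendFirst.createTopicsAppendFirst(batch);
        } catch (IllegalStateException e) {
            // The append failed before any in-memory change was made.
        }
        System.out.println("append-first: in-memory topics=" + appendFirst.topics.size()
                + ", committed records=" + appendFirst.log.committed.size());
    }
}
```

In the apply-first case the toy controller ends up holding topics that have no corresponding records in the log, which is the kind of divergence that later lets a controlled shutdown generate partition changes for topics that were never created.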
> After the partitionChangeRecords are appended and sent to the metadata topic
> successfully, the brokers fail to "replay" these partition changes, because
> the topics/partitions were never created successfully.
> Even worse, after the node restarts, all the metadata records are replayed
> again and the same error occurs again, so the broker cannot start up
> successfully.
>
> The error and call stack look like this; basically, it complains that the
> topic image can't be found:
> {code:java}
> [2022-09-02 16:29:16,334] ERROR Encountered metadata loading fault: Error replaying metadata log record at offset 81 (org.apache.kafka.server.fault.LoggingFaultHandler)
> java.lang.NullPointerException
> at org.apache.kafka.image.TopicDelta.replay(TopicDelta.java:69)
> at org.apache.kafka.image.TopicsDelta.replay(TopicsDelta.java:91)
> at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:248)
> at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:186)
> at kafka.server.metadata.BrokerMetadataListener.$anonfun$loadBatches$3(BrokerMetadataListener.scala:239)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
> at kafka.server.metadata.BrokerMetadataListener.kafka$server$metadata$BrokerMetadataListener$$loadBatches(BrokerMetadataListener.scala:232)
> at kafka.server.metadata.BrokerMetadataListener$HandleCommitsEvent.run(BrokerMetadataListener.scala:113)
> at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
> at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
> at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {code}
>
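As a rough illustration of why replay fails, and why it fails again on every restart, here is a toy model of replaying a partition change for a topic that has no creation record in the log. ToyImage, ToyTopicImage, and the replay methods are hypothetical and do not mirror the real TopicsDelta/TopicDelta classes; the sketch only assumes that the lookup for an unknown topic yields null.

```java
import java.util.HashMap;
import java.util.Map;

public class ReplayMissingTopicSketch {
    /** Hypothetical per-topic state: partition id -> leader id. */
    static class ToyTopicImage {
        final Map<Integer, Integer> leaders = new HashMap<>();
    }

    /** Hypothetical metadata image keyed by topic name. */
    static class ToyImage {
        final Map<String, ToyTopicImage> topics = new HashMap<>();

        // Replays a "topic created" record.
        void replayTopic(String name) {
            topics.put(name, new ToyTopicImage());
        }

        // Replays a "partition changed" record without checking that the topic exists.
        void replayPartitionChange(String topic, int partition, int leader) {
            ToyTopicImage image = topics.get(topic); // null if the topic was never created
            image.leaders.put(partition, leader);    // NullPointerException when image is null
        }
    }

    public static void main(String[] args) {
        // The log holds a partition change but no matching topic record,
        // so every replay from the start of the log fails the same way.
        for (int attempt = 1; attempt <= 2; attempt++) {
            ToyImage image = new ToyImage();
            try {
                image.replayPartitionChange("foo", 0, -1);
                System.out.println("attempt " + attempt + ": replay succeeded");
            } catch (NullPointerException e) {
                System.out.println("attempt " + attempt + ": replay failed, topic image not found");
            }
        }
    }
}
```

Because the offending record is already committed, rebuilding the image from the log reproduces the failure deterministically, which matches the restart behaviour described in the issue.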
> [1]
> [https://github.com/apache/kafka/blob/ef65b6e566ef69b2f9b58038c98a5993563d7a68/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L779-L804]
>
> [2]
> [https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L1270]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)