[ 
https://issues.apache.org/jira/browse/KAFKA-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin McCabe resolved KAFKA-14197.
----------------------------------
    Resolution: Duplicate

> Kraft broker fails to startup after topic creation failure
> ----------------------------------------------------------
>
>                 Key: KAFKA-14197
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14197
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>            Reporter: Luke Chen
>            Priority: Blocker
>             Fix For: 3.3.0
>
>
> In the KRaft ControllerWriteEvent, we first apply the records to the
> controller's in-memory state, then send the records out via the Raft client.
> But if an error occurs while sending the records, there is no way to revert
> the change to the controller's in-memory state[1].
> The issue happened when creating topics: the controller state is updated with
> the topic and partition metadata (e.g. the broker-to-ISR map), but the records
> are not sent out successfully (e.g. RecordBatchTooLargeException). Then, when
> shutting down the node, the controlled shutdown tries to remove the broker
> from the ISR by[2]:
> {code:java}
> generateLeaderAndIsrUpdates("enterControlledShutdown[" + brokerId + "]",
>     brokerId, NO_LEADER, records,
>     brokersToIsrs.partitionsWithBrokerInIsr(brokerId));
> {code}
>  
> After the PartitionChangeRecords are appended and sent to the metadata topic
> successfully, the brokers fail to "replay" these partition changes, because
> the topics/partitions they refer to were never created successfully.
> Even worse, after restarting the node, all the metadata records are replayed
> again and the same error happens again, so the broker cannot start up
> successfully.
>  
> The error and call stack look like this; basically, it complains that the
> topic image can't be found:
> {code:java}
> [2022-09-02 16:29:16,334] ERROR Encountered metadata loading fault: Error replaying metadata log record at offset 81 (org.apache.kafka.server.fault.LoggingFaultHandler)
> java.lang.NullPointerException
>     at org.apache.kafka.image.TopicDelta.replay(TopicDelta.java:69)
>     at org.apache.kafka.image.TopicsDelta.replay(TopicsDelta.java:91)
>     at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:248)
>     at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:186)
>     at kafka.server.metadata.BrokerMetadataListener.$anonfun$loadBatches$3(BrokerMetadataListener.scala:239)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
>     at kafka.server.metadata.BrokerMetadataListener.kafka$server$metadata$BrokerMetadataListener$$loadBatches(BrokerMetadataListener.scala:232)
>     at kafka.server.metadata.BrokerMetadataListener$HandleCommitsEvent.run(BrokerMetadataListener.scala:113)
>     at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {code}
>  
> [1] [https://github.com/apache/kafka/blob/ef65b6e566ef69b2f9b58038c98a5993563d7a68/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L779-L804]
>  
> [2] [https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L1270]
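
The sequence described in the issue (state applied in memory, append fails, a later record refers to a topic that was never logged, replay hits a NullPointerException) can be sketched with a toy model. This is only an illustration under stated assumptions: the classes `Rec`, `Log`, `Controller` and the method `replay` below are hypothetical stand-ins, not Kafka's actual `QuorumController`, `ReplicationControlManager`, or `TopicDelta` code, and the batch size limit of 3 is arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the failure; these are NOT Kafka classes.
final class ApplyBeforeAppendSketch {

    /** A metadata record: "TOPIC" creates a topic, other types refer to an existing one. */
    static final class Rec {
        final String type;
        final String topic;
        Rec(String type, String topic) { this.type = type; this.topic = topic; }
    }

    /** Stand-in for the metadata log; rejects batches above a size limit. */
    static final class Log {
        final List<Rec> committed = new ArrayList<>();
        final int maxBatchSize;
        Log(int maxBatchSize) { this.maxBatchSize = maxBatchSize; }

        void append(List<Rec> batch) {
            if (batch.size() > maxBatchSize) {
                throw new IllegalStateException("batch too large: " + batch.size());
            }
            committed.addAll(batch);
        }
    }

    /** Stand-in for the controller: it mutates in-memory state before appending. */
    static final class Controller {
        final Set<String> topicsWithBrokerInIsr = new LinkedHashSet<>();
        final Log log;
        Controller(Log log) { this.log = log; }

        void createTopics(List<String> names) {
            List<Rec> batch = new ArrayList<>();
            for (String name : names) {
                topicsWithBrokerInIsr.add(name);      // in-memory state applied first
                batch.add(new Rec("TOPIC", name));
                batch.add(new Rec("PARTITION", name));
            }
            log.append(batch);                        // may throw; nothing is reverted
        }

        void controlledShutdown() {
            List<Rec> batch = new ArrayList<>();
            for (String name : topicsWithBrokerInIsr) {
                batch.add(new Rec("PARTITION_CHANGE", name));
            }
            log.append(batch);
        }
    }

    /** Stand-in for broker-side replay: builds a topic image from the log. */
    static void replay(List<Rec> records) {
        Map<String, List<String>> image = new HashMap<>();
        for (Rec r : records) {
            if (r.type.equals("TOPIC")) {
                image.put(r.topic, new ArrayList<>());
            } else {
                image.get(r.topic).add(r.type);       // NPE if the topic was never logged
            }
        }
    }

    static String demo() {
        Log log = new Log(3);
        Controller controller = new Controller(log);
        try {
            controller.createTopics(Arrays.asList("a", "b")); // 4 records > limit of 3
        } catch (IllegalStateException e) {
            // Append failed, but the controller still believes "a" and "b" exist.
        }
        controller.controlledShutdown();              // 2 records, appended successfully
        try {
            replay(log.committed);
            return "replay ok";
        } catch (NullPointerException e) {
            return "replay failed: NullPointerException";
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this toy model the log ends up holding only the two PARTITION_CHANGE records, so every replay of it fails in the same way, which mirrors why restarting the node does not help.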



--
This message was sent by Atlassian Jira
(v8.20.10#820010)