Luke Chen created KAFKA-14197:
---------------------------------

             Summary: Kraft broker fails to startup after topic creation failure
                 Key: KAFKA-14197
                 URL: https://issues.apache.org/jira/browse/KAFKA-14197
             Project: Kafka
          Issue Type: Bug
          Components: kraft
            Reporter: Luke Chen


In the KRaft ControllerWriteEvent, we first apply the records to the 
controller's in-memory state and then send them out via the raft client. But if 
an error occurs while sending the records, there is no way to revert the change 
to the controller's in-memory state[1].
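
To illustrate the ordering problem, here is a minimal sketch. It is not the 
QuorumController code: WriteEventSketch, FakeLog and both method names are 
hypothetical, and the snapshot-and-restore variant is only one possible way to 
keep the in-memory state consistent, not necessarily the right fix.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; this only models the apply-then-send ordering.
final class WriteEventSketch {
    /** Stand-in for the raft client: the next append fails when failNext is set. */
    static final class FakeLog {
        final List<String> records = new ArrayList<>();
        boolean failNext = false;

        void append(String record) {
            if (failNext) {
                failNext = false;
                throw new IllegalStateException("buffer allocation failed");
            }
            records.add(record);
        }
    }

    /** Stand-in for the controller's in-memory state. */
    final Set<String> topics = new HashSet<>();
    final FakeLog log = new FakeLog();

    /** Ordering described above: apply first, then send, with no rollback. */
    void createTopicApplyFirst(String topic) {
        topics.add(topic);
        log.append("TopicRecord:" + topic);
    }

    /** Assumed mitigation: copy the state first and restore it if the send fails. */
    void createTopicWithRevert(String topic) {
        Set<String> snapshot = new HashSet<>(topics);
        try {
            topics.add(topic);
            log.append("TopicRecord:" + topic);
        } catch (RuntimeException e) {
            topics.clear();
            topics.addAll(snapshot);
            throw e;
        }
    }

    public static void main(String[] args) {
        WriteEventSketch s = new WriteEventSketch();
        s.log.failNext = true;
        try {
            s.createTopicApplyFirst("foo");
        } catch (IllegalStateException e) {
            // The send failed, but "foo" is still present in memory.
        }
        System.out.println("in memory: " + s.topics + ", in log: " + s.log.records);
    }
}
```

With createTopicApplyFirst, a failed append leaves the topic in memory but not 
in the log, which is the divergence described above. With 
createTopicWithRevert, the in-memory state and the log stay in agreement after 
the same failure.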

The issue occurs when creating topics: the controller state is updated with the 
topic and partition metadata (e.g. the broker-to-ISR map), but the records are 
not sent out successfully (e.g. because of a buffer allocation error). Then, 
when the node shuts down, the controlled shutdown tries to remove the broker 
from the ISR via[2]:
{code:java}
generateLeaderAndIsrUpdates("enterControlledShutdown[" + brokerId + "]", 
brokerId, NO_LEADER, records, 
brokersToIsrs.partitionsWithBrokerInIsr(brokerId));{code}
 

After the partition change records are appended and successfully sent to the 
metadata topic, the brokers fail to "replay" these partition changes, because 
the corresponding topics/partitions were never created successfully.

Even worse, after the node restarts, all the metadata records are replayed 
again and the same error recurs, so the broker cannot start up successfully.
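
The replay failure can be modelled with a small sketch. This does not 
reproduce TopicDelta or BrokerMetadataListener: ReplaySketch, Rec and the 
record type strings are hypothetical, and the only assumption is that a 
partition change looks up a topic that was never written to the log.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of log replay; not the real image/delta classes.
final class ReplaySketch {
    /** A log record: either a topic creation or a partition change for a topic. */
    static final class Rec {
        final String type;
        final String topic;

        Rec(String type, String topic) {
            this.type = type;
            this.topic = topic;
        }
    }

    /** Rebuilds the state from the log: topic name to number of changes applied. */
    static Map<String, Integer> replay(List<Rec> log) {
        Map<String, Integer> image = new HashMap<>();
        for (Rec r : log) {
            if (r.type.equals("TOPIC")) {
                image.put(r.topic, 0);
            } else {
                // null when the topic record never reached the log
                Integer changes = image.get(r.topic);
                // unboxing a null Integer throws NullPointerException
                image.put(r.topic, changes + 1);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        // The topic record was lost, but the partition change was logged.
        List<Rec> log = Arrays.asList(new Rec("PARTITION_CHANGE", "foo"));
        for (int restart = 1; restart <= 2; restart++) {
            try {
                replay(log);
            } catch (NullPointerException e) {
                System.out.println("restart " + restart + ": replay failed");
            }
        }
    }
}
```

Because the replay is deterministic and the log content does not change, every 
restart hits the same record and fails in the same way.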

 

The error and call stack look like this; essentially, it complains that the 
topic image cannot be found:
{code:java}
[2022-09-02 16:29:16,334] ERROR Encountered metadata loading fault: Error replaying metadata log record at offset 81 (org.apache.kafka.server.fault.LoggingFaultHandler)
java.lang.NullPointerException
    at org.apache.kafka.image.TopicDelta.replay(TopicDelta.java:69)
    at org.apache.kafka.image.TopicsDelta.replay(TopicsDelta.java:91)
    at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:248)
    at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:186)
    at kafka.server.metadata.BrokerMetadataListener.$anonfun$loadBatches$3(BrokerMetadataListener.scala:239)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
    at kafka.server.metadata.BrokerMetadataListener.kafka$server$metadata$BrokerMetadataListener$$loadBatches(BrokerMetadataListener.scala:232)
    at kafka.server.metadata.BrokerMetadataListener$HandleCommitsEvent.run(BrokerMetadataListener.scala:113)
    at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
    at java.base/java.lang.Thread.run(Thread.java:829)
{code}
 

[1] 
https://github.com/apache/kafka/blob/ef65b6e566ef69b2f9b58038c98a5993563d7a68/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L779-L804
 

[2] 
https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L1270



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
