Haoze Wu created KAFKA-13468:
--------------------------------

             Summary: Consumers may hang because IOException in Log#<init> does not trigger KafkaStorageException
                 Key: KAFKA-13468
                 URL: https://issues.apache.org/jira/browse/KAFKA-13468
             Project: Kafka
          Issue Type: Bug
          Components: log
    Affects Versions: 2.8.0
            Reporter: Haoze Wu

When the Kafka Log class (`core/src/main/scala/kafka/log/Log.scala`) is initialized, it may encounter an IOException in the `locally` block, e.g., when the log directory cannot be created due to a permission issue, or when `initializeLeaderEpochCache`, `initializePartitionMetadata`, etc. throw an IOException.
{code:java}
class Log(...) {
  // ...
  locally {
    // create the log directory if it doesn't exist
    Files.createDirectories(dir.toPath)

    initializeLeaderEpochCache()
    initializePartitionMetadata()

    val nextOffset = loadSegments()
    // ...
  }
  // ...
}
{code}
We found that the broker encountering the IOException prints a KafkaApi error log like the following and then proceeds.
{code:java}
[2021-11-17 22:41:30,057] ERROR [KafkaApi-1] Error when handling request: clientId=1, correlationId=1, api=LEADER_AND_ISR, version=5, body=LeaderAndIsrRequestData(controllerId=1, controllerEpoch=1, brokerEpoch=4294967362, type=0, ungroupedPartitionStates=[], topicStates=[LeaderAndIsrTopicState(topicName='gray-2-0', topicId=573bAVHfRQeXApzAKevNIg, partitionStates=[LeaderAndIsrPartitionState(topicName='gray-2-0', partitionIndex=1, controllerEpoch=1, leader=1, leaderEpoch=0, isr=[1, 3], zkVersion=0, replicas=[1, 3], addingReplicas=[], removingReplicas=[], isNew=true)]), LeaderAndIsrTopicState(topicName='gray-1-0', topicId=12dW2FxLTiyKmGi41HhdZQ, partitionStates=[LeaderAndIsrPartitionState(topicName='gray-1-0', partitionIndex=1, controllerEpoch=1, leader=3, leaderEpoch=0, isr=[3, 1], zkVersion=0, replicas=[3, 1], addingReplicas=[], removingReplicas=[], isNew=true)]), LeaderAndIsrTopicState(topicName='gray-3-0', topicId=_yvmANyZSoK_PTV0e-nqCA, partitionStates=[LeaderAndIsrPartitionState(topicName='gray-3-0', partitionIndex=1, controllerEpoch=1, leader=1, leaderEpoch=0, isr=[1, 3], zkVersion=0, replicas=[1, 3], addingReplicas=[], removingReplicas=[], isNew=true)])], liveLeaders=[LeaderAndIsrLiveLeader(brokerId=1, hostName='localhost', port=9791), LeaderAndIsrLiveLeader(brokerId=3, hostName='localhost', port=9793)])
(kafka.server.RequestHandlerHelper)
{code}
However, all consumers that are consuming data from the affected topics (“gray-2-0”, “gray-1-0”, “gray-3-0”) are unable to make progress. These consumers do not log any error related to this issue. They hang for more than 3 minutes.

The IOException sometimes affects multiple partitions of the `__consumer_offsets` topic:
{code:java}
[2021-11-18 10:57:41,289] ERROR [KafkaApi-1] Error when handling request: clientId=1, correlationId=11, api=LEADER_AND_ISR, version=5, body=LeaderAndIsrRequestData(controllerId=1, controllerEpoch=1, brokerEpoch=4294967355, type=0, ungroupedPartitionStates=[], topicStates=[LeaderAndIsrTopicState(topicName='__consumer_offsets', topicId=_MiMTCViS76osIyDdxekIg, partitionStates=[LeaderAndIsrPartitionState(topicName='__consumer_offsets', partitionIndex=15, controllerEpoch=1, leader=1, leaderEpoch=0, isr=[1], zkVersion=0, replicas=[1], addingReplicas=[], removingReplicas=[], isNew=true), LeaderAndIsrPartitionState(topicName='__consumer_offsets', partitionIndex=48, controllerEpoch=1, leader=1, leaderEpoch=0, isr=[1], zkVersion=0, replicas=[1], addingReplicas=[], removingReplicas=[], isNew=true), LeaderAndIsrPartitionState(topicName='__consumer_offsets', partitionIndex=45, controllerEpoch=1, leader=1, leaderEpoch=0, isr=[1], zkVersion=0, replicas=[1], addingReplicas=[], removingReplicas=[], isNew=true), ...
addingReplicas=[], removingReplicas=[], isNew=true), LeaderAndIsrPartitionState(topicName='__consumer_offsets', partitionIndex=33, controllerEpoch=1, leader=1, leaderEpoch=0, isr=[1], zkVersion=0, replicas=[1], addingReplicas=[], removingReplicas=[], isNew=true)])], liveLeaders=[LeaderAndIsrLiveLeader(brokerId=1, hostName='localhost', port=9791)]) (kafka.server.RequestHandlerHelper)
{code}
*Analysis*

The key stack trace is as follows:
{code:java}
"java.lang.Thread,run,748",
"kafka.server.KafkaRequestHandler,run,74",
"kafka.server.KafkaApis,handle,236",
"kafka.server.KafkaApis,handleLeaderAndIsrRequest,258",
"kafka.server.ReplicaManager,becomeLeaderOrFollower,1411",
"kafka.server.ReplicaManager,makeLeaders,1566",
"scala.collection.mutable.HashMap,foreachEntry,499",
"scala.collection.mutable.HashMap$Node,foreachEntry,633",
"kafka.utils.Implicits$MapExtensionMethods$,$anonfun$forKeyValue$1,62",
"kafka.server.ReplicaManager,$anonfun$makeLeaders$5,1568",
"kafka.cluster.Partition,makeLeader,548",
"kafka.cluster.Partition,$anonfun$makeLeader$1,564",
"kafka.cluster.Partition,createLogIfNotExists,324",
"kafka.cluster.Partition,createLog,344",
"kafka.log.LogManager,getOrCreateLog,783",
"scala.Option,getOrElse,201",
"kafka.log.LogManager,$anonfun$getOrCreateLog$1,830",
"kafka.log.Log$,apply,2601",
"kafka.log.Log,<init>,323"
{code}
Basically, the IOException is not handled by Log but instead propagates all the way back to `handle` in `core/src/main/scala/kafka/server/KafkaApis.scala`:
{code:java}
override def handle(request: RequestChannel.Request): Unit = {
  try {
    request.header.apiKey match {
      // ...
      case ApiKeys.LEADER_AND_ISR => handleLeaderAndIsrRequest(request)
      // ...
    }
  } catch {
    case e: FatalExitError => throw e
    case e: Throwable => requestHelper.handleError(request, e)
  } finally {
    // ...
  }
}
{code}
I also noticed that the ReplicaManager in `core/src/main/scala/kafka/server/ReplicaManager.scala` has a relevant comment about an “unexpected error”, with a TODO.
{code:java}
/*
 * Make the current broker to become leader for a given set of partitions by:
 *
 * 1. Stop fetchers for these partitions
 * 2. Update the partition metadata in cache
 * 3. Add these partitions to the leader partitions set
 *
 * If an unexpected error is thrown in this function, it will be propagated to KafkaApis where
 * the error message will be set on each partition since we do not know which partition caused it. Otherwise,
 * return the set of partitions that are made leader due to this method
 *
 *  TODO: the above may need to be fixed later
 */
private def makeLeaders(...): Set[Partition] = {
  // ...
  try {
    // ...
    partitionStates.forKeyValue { (partition, partitionState) =>
      try {
        if (partition.makeLeader(partitionState, highWatermarkCheckpoints)) // line 1568
          partitionsToMakeLeaders += partition
        else
          stateChangeLogger.info(...)
      } catch {
        case e: KafkaStorageException =>
          stateChangeLogger.error(...)
          val dirOpt = getLogDir(partition.topicPartition)
          error(...)
          responseMap.put(partition.topicPartition, Errors.KAFKA_STORAGE_ERROR)
      }
    }
  } catch {
    case e: Throwable =>
      partitionStates.keys.foreach { partition =>
        stateChangeLogger.error(...)
      }
      // Re-throw the exception for it to be caught in KafkaApis
      throw e
  }
  // ...
}
{code}
The inner handler only matches KafkaStorageException, so a plain IOException from `partition.makeLeader` escapes to the outer handler and is re-thrown to KafkaApis.

*Fix*

To fix this issue, I think we should catch the potential IOException when Log is initialized and throw a KafkaStorageException instead, just like many other IOException handlers in Kafka, e.g., [https://github.com/apache/kafka/blob/ebb1d6e21cc9213071ee1c6a15ec3411fc215b81/core/src/main/scala/kafka/server/checkpoints/CheckpointFile.scala#L92-L120]

After applying this fix, the symptoms described above disappear: the consumers no longer hang, and they proceed to finish the remaining workload.
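
The proposed conversion could look roughly like the sketch below. It is only an illustration under stated assumptions, not the actual patch: `LogInitSketch`, `withStorageErrorHandling` and `initializeLog` are hypothetical names, and the `KafkaStorageException` class defined here is a local stand-in for `org.apache.kafka.common.errors.KafkaStorageException` so that the example compiles without Kafka on the classpath.
{code:scala}
import java.io.{File, IOException}
import java.nio.file.Files

// Local stand-in for org.apache.kafka.common.errors.KafkaStorageException
// (assumption: defined here only to keep the sketch self-contained).
class KafkaStorageException(message: String, cause: Throwable)
  extends RuntimeException(message, cause)

object LogInitSketch {

  // Hypothetical helper: evaluates `init` and converts any IOException
  // into a KafkaStorageException that keeps the original as its cause.
  def withStorageErrorHandling[T](dir: File)(init: => T): T =
    try init
    catch {
      case e: IOException =>
        throw new KafkaStorageException(s"Error while initializing log in $dir", e)
    }

  // Hypothetical counterpart of the `locally` block in Log.
  def initializeLog(dir: File): Unit = withStorageErrorHandling(dir) {
    // create the log directory if it doesn't exist
    Files.createDirectories(dir.toPath)
    // initializeLeaderEpochCache(), initializePartitionMetadata() and
    // loadSegments() would run here, inside the same handler
  }

  def main(args: Array[String]): Unit = {
    // A regular file used as the parent path makes createDirectories fail
    // with an IOException, which simulates the failing log directory.
    val blocker = File.createTempFile("not-a-dir", ".tmp")
    blocker.deleteOnExit()
    try initializeLog(new File(blocker, "gray-2-0-1"))
    catch {
      case e: KafkaStorageException =>
        println(s"KafkaStorageException caused by ${e.getCause.getClass.getName}")
    }
  }
}
{code}
With the exception converted this way, the existing `case e: KafkaStorageException` branch in `makeLeaders` would see it and report `Errors.KAFKA_STORAGE_ERROR` for the affected partition, instead of the error propagating to `KafkaApis.handle`.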

One question is whether we should also use `logDirFailureChannel.maybeAddOfflineLogDir` to handle the IOException, like [https://github.com/apache/kafka/blob/ebb1d6e21cc9213071ee1c6a15ec3411fc215b81/core/src/main/scala/kafka/server/checkpoints/CheckpointFile.scala#L92-L120] and [https://github.com/apache/kafka/blob/ebb1d6e21cc9213071ee1c6a15ec3411fc215b81/core/src/main/scala/kafka/server/checkpoints/CheckpointFile.scala#L126-L139]. If so, `logDirFailureChannel.maybeAddOfflineLogDir` would crash the node according to the protocol in [https://github.com/apache/kafka/blob/ebb1d6e21cc9213071ee1c6a15ec3411fc215b81/core/src/main/scala/kafka/server/ReplicaManager.scala#L268-L277] and [https://github.com/apache/kafka/blob/ebb1d6e21cc9213071ee1c6a15ec3411fc215b81/core/src/main/scala/kafka/server/ReplicaManager.scala#L327-L332]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)