Haoze Wu created KAFKA-15339:
--------------------------------
Summary: Transient I/O error happening in appending records could
lead to the halt of the whole cluster
Key: KAFKA-15339
URL: https://issues.apache.org/jira/browse/KAFKA-15339
Project: Kafka
Issue Type: Improvement
Components: connect, producer
Affects Versions: 3.5.0
Reporter: Haoze Wu
We are running an integration test in which we start an embedded Connect
cluster on the active 3.5 branch. Because of a transient disk error, we may
encounter an IOException while appending records to a topic, as shown in the
following stack trace:
{code:java}
[2023-08-13 16:53:51,016] ERROR Error while appending records to connect-config-topic-connect-cluster-0 in dir /tmp/EmbeddedKafkaCluster8003464883598783225 (org.apache.kafka.storage.internals.log.LogDirFailureChannel:61)
java.io.IOException:
    at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:92)
    at org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188)
    at kafka.log.LogSegment.append(LogSegment.scala:161)
    at kafka.log.LocalLog.append(LocalLog.scala:436)
    at kafka.log.UnifiedLog.append(UnifiedLog.scala:853)
    at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:664)
    at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1281)
    at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1269)
    at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:977)
    at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
    at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
    at scala.collection.mutable.HashMap.map(HashMap.scala:35)
    at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:965)
    at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:623)
    at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:680)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:180)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:76)
    at java.lang.Thread.run(Thread.java:748)
{code}
However, the failure to append records to a single partition has a much wider
effect: the fetchers for all the other partitions are removed, the broker shuts
down, and finally the embedded Connect cluster is killed as a whole.
{code:java}
[2023-08-13 17:35:37,966] WARN Stopping serving logs in dir /tmp/EmbeddedKafkaCluster6777164631574762227 (kafka.log.LogManager:70)
[2023-08-13 17:35:37,968] ERROR Shutdown broker because all log dirs in /tmp/EmbeddedKafkaCluster6777164631574762227 have failed (kafka.log.LogManager:143)
[2023-08-13 17:35:37,968] WARN Abrupt service halt with code 1 and message null (org.apache.kafka.connect.util.clusters.EmbeddedConnectCluster:130)
[2023-08-13 17:35:37,968] ERROR [LogDirFailureHandler]: Error due to (kafka.server.ReplicaManager$LogDirFailureHandler:135)
org.apache.kafka.connect.util.clusters.UngracefulShutdownException: Abrupt service halt with code 1 and message null
{code}
I am wondering whether we could add a configurable retry around the root cause
to tolerate such transient I/O faults, so that if a retry succeeds, the
embedded Connect cluster can keep operating.
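To make the idea concrete, here is a minimal sketch of what such a retry wrapper might look like. It is not existing Kafka code: the class IoRetry, the interface IoAction, and the parameters maxRetries and backoffMs are hypothetical names chosen for illustration, and they do not correspond to any current broker configuration.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

// Hypothetical helper for illustration only; not part of Kafka.
final class IoRetry {

    /** An I/O action that returns a value and may fail with an IOException. */
    @FunctionalInterface
    interface IoAction<T> {
        T run() throws IOException;
    }

    /**
     * Runs the action once and, if it throws an IOException, retries it up to
     * maxRetries more times, sleeping backoffMs between attempts.
     * If every attempt fails, the last IOException is rethrown.
     */
    static <T> T withRetry(IoAction<T> action, int maxRetries, long backoffMs)
            throws IOException {
        if (maxRetries < 0 || backoffMs < 0) {
            throw new IllegalArgumentException("maxRetries and backoffMs must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                last = e;
                if (attempt < maxRetries) {
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt status and stop retrying.
                        Thread.currentThread().interrupt();
                        InterruptedIOException iioe =
                                new InterruptedIOException("interrupted while waiting to retry");
                        iioe.initCause(e);
                        throw iioe;
                    }
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Simulated append that fails twice with a transient error, then succeeds.
        String result = withRetry(() -> {
            if (calls[0]++ < 2) {
                throw new IOException("simulated transient disk error");
            }
            return "appended";
        }, 3, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

One caveat with this approach: a failed append may have written part of a batch before the exception was thrown, so an actual implementation would presumably need to make sure the log segment is back in a consistent state before retrying, rather than simply calling the append again.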
Any comments and suggestions would be appreciated.