Haozhong Ma created KAFKA-19646:
-----------------------------------
Summary: CLONE - Broker Startup: Handle Checkpoint Creation
Failure via logDirFailureChannel
Key: KAFKA-19646
URL: https://issues.apache.org/jira/browse/KAFKA-19646
Project: Kafka
Issue Type: Improvement
Components: core
Reporter: Haozhong Ma
Assignee: Haozhong Ma
In our production environment, a broker failed to start because checkpoint
creation failed on a single disk (caused by disk corruption or filesystem
errors). According to Kafka's design, such disk-level failures should be
isolated via {{logDirFailureChannel}}, so that the remaining healthy disks can
continue serving traffic. However, reviewing the
{{CheckpointFileWithFailureHandler}} implementation shows that methods such as
{{write}}, {{read}}, and {{writeIfDirExists}} handle {{IOException}} by routing
the affected {{log.dir}} to {{logDirFailureChannel}}, whereas the checkpoint
initialization path has no such handling, so a failure there can abort broker
startup instead of taking only the bad directory offline. Should checkpoint
creation adopt the same failure-handling logic? If the current behavior is not
intentional, I will submit a PR to fix it.
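
To make the proposal concrete, here is a minimal sketch of the behavior in
question. It is not the actual Kafka code: {{SketchFailureChannel}} and
{{SketchCheckpoint}} are hypothetical stand-ins, and {{UncheckedIOException}}
takes the place of whatever storage exception the real fix would throw. The
only point is that an {{IOException}} raised while creating the checkpoint
file is reported against the affected log directory before it propagates.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a log-dir failure channel (not the real Kafka class).
class SketchFailureChannel {
    private final Set<String> offlineDirs = ConcurrentHashMap.newKeySet();

    // This sketch only records the directory as offline.
    void maybeAddOfflineLogDir(String logDir, String msg, IOException cause) {
        offlineDirs.add(logDir);
    }

    boolean isOffline(String logDir) {
        return offlineDirs.contains(logDir);
    }
}

// Hypothetical checkpoint wrapper whose creation path reports I/O failures.
class SketchCheckpoint {
    final Path file;

    private SketchCheckpoint(Path file) {
        this.file = file;
    }

    static SketchCheckpoint create(Path logDir, String fileName, SketchFailureChannel channel) {
        Path file = logDir.resolve(fileName);
        try {
            if (!Files.exists(file)) {
                // Throws an IOException if the directory is missing or unusable.
                Files.createFile(file);
            }
            return new SketchCheckpoint(file);
        } catch (IOException e) {
            // Same pattern as the write/read paths: mark the directory offline, then rethrow.
            String msg = "Error while creating checkpoint file " + file;
            channel.maybeAddOfflineLogDir(logDir.toString(), msg, e);
            throw new UncheckedIOException(msg, e);
        }
    }
}
```

With this shape, a caller that iterates over the configured log directories
could catch the exception for a bad directory, skip it, and keep initializing
the healthy ones.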