[
https://issues.apache.org/jira/browse/KAFKA-19548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Haozhong Ma resolved KAFKA-19548.
---------------------------------
Resolution: Not A Problem
> Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel
> ---------------------------------------------------------------------------
>
> Key: KAFKA-19548
> URL: https://issues.apache.org/jira/browse/KAFKA-19548
> Project: Kafka
> Issue Type: Improvement
> Components: core
> Reporter: Haozhong Ma
> Assignee: Haozhong Ma
> Priority: Major
>
> In our production environment, a broker failed to start because checkpoint
> creation failed on a single disk (caused
> by disk corruption or filesystem errors). According to Kafka's design, such
> disk-level failures should be isolated via {{logDirFailureChannel}},
> allowing other healthy disks to continue serving traffic. However, upon
> reviewing the {{CheckpointFileWithFailureHandler}} implementation, we
> observed that while methods like {{write}}, {{read}}, and
> {{writeIfDirExists}} handle {{IOException}} by routing the affected
> {{log.dir}} to {{logDirFailureChannel}}, the checkpoint initialization
> process lacks this fault-tolerant behavior. Should checkpoint creation adopt
> the same failure-handling logic? If the current behavior is not intentional, I
> will submit a PR to fix it.
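
For illustration only, the behavior asked about in the description could look roughly like the sketch below. It is not Kafka code: SketchFailureChannel and SketchCheckpoint are hypothetical stand-ins for logDirFailureChannel and the checkpoint wrapper, and returning an empty Optional on failure is an assumption made for the sketch, not a statement about how Kafka does or should behave.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for the failure channel: it only records which
// log directories have been reported as offline.
final class SketchFailureChannel {
    private final ConcurrentLinkedQueue<String> offlineDirs = new ConcurrentLinkedQueue<>();

    void maybeAddOfflineLogDir(String logDir, String msg, IOException e) {
        // Report each directory at most once.
        if (!offlineDirs.contains(logDir)) {
            offlineDirs.add(logDir);
        }
    }

    boolean isOffline(String logDir) {
        return offlineDirs.contains(logDir);
    }
}

// Hypothetical checkpoint wrapper whose creation step handles IOException
// the same way the description says write and read already do.
final class SketchCheckpoint {
    private final Path file;

    private SketchCheckpoint(Path file) {
        this.file = file;
    }

    Path file() {
        return file;
    }

    // Assumption for this sketch: a creation failure marks the directory
    // offline and yields an empty result instead of aborting the caller.
    static Optional<SketchCheckpoint> create(Path logDir, String name, SketchFailureChannel channel) {
        Path file = logDir.resolve(name);
        try {
            if (!Files.exists(file)) {
                Files.createFile(file);
            }
            return Optional.of(new SketchCheckpoint(file));
        } catch (IOException e) {
            String msg = "Error while creating checkpoint file " + file;
            channel.maybeAddOfflineLogDir(logDir.toString(), msg, e);
            return Optional.empty();
        }
    }
}
```

With this shape, a caller iterating over several log directories would skip any directory for which create returns empty and keep serving from the rest, which is the isolation the description expects.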
--
This message was sent by Atlassian Jira
(v8.20.10#820010)