[ https://issues.apache.org/jira/browse/KAFKA-19548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haozhong Ma updated KAFKA-19548:
--------------------------------
    Description: 
In our production environment, we encountered a scenario where a broker failed to start due to checkpoint creation failure on a single disk (caused by disk corruption or filesystem errors). According to Kafka's design, such disk-level failures should be isolated via {{logDirFailureChannel}}, allowing other healthy disks to continue serving traffic.

However, upon reviewing the {{CheckpointFileWithFailureHandler}} implementation, we observed that while methods like {{write}}, {{read}}, and {{writeIfDirExists}} handle {{IOException}} by routing the affected {{log.dir}} to {{logDirFailureChannel}}, the checkpoint initialization process lacks this fault-tolerant behavior.

Should checkpoint creation adopt the same failure-handling logic? If this is not an intentional design, I will submit a PR to fix this issue.

  (was: In our production environment, we encountered a scenario where a broker failed to start due to checkpoint creation failure on a single disk (caused by disk corruption or filesystem errors). According to Kafka's design, such disk-level failures should be isolated via {{logDirFailureChannel}}, allowing other healthy disks to continue serving traffic.

However, upon reviewing the {{CheckpointFileWithFailureHandler}} implementation, we observed that while methods like {{write}}, {{read}}, and {{writeIfDirExists}} handle {{IOException}} by routing the affected {{log.dir}} to {{logDirFailureChannel}}, the checkpoint initialization process lacks this fault-tolerant behavior.

Is this an oversight? Should checkpoint creation adopt the same failure-handling logic?)
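To make the question concrete, here is a minimal, self-contained sketch of the behavior being asked about: an {{IOException}} raised while creating a checkpoint file marks only the affected log directory as failed, and initialization continues for the remaining directories. This is not Kafka's code. {{FailureChannel}}, {{Checkpoint}}, {{createOrReport}}, {{initAll}} and {{demo}} are hypothetical names invented for illustration, the checkpoint file name is only an example, and the missing parent directory merely simulates a disk that cannot be written to.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CheckpointInitSketch {

    /** Hypothetical stand-in for logDirFailureChannel: remembers which log dirs failed. */
    public static final class FailureChannel {
        private final List<String> offline = new ArrayList<>();

        public synchronized void markOffline(String logDir, String msg, IOException cause) {
            if (!offline.contains(logDir)) offline.add(logDir);
        }

        public synchronized List<String> offlineDirs() {
            return new ArrayList<>(offline);
        }
    }

    /** Hypothetical checkpoint handle; only the creation path matters here. */
    public static final class Checkpoint {
        public final Path file;

        private Checkpoint(Path file) { this.file = file; }

        /** Reports the log dir and returns null instead of propagating the IOException. */
        public static Checkpoint createOrReport(Path logDir, String name, FailureChannel channel) {
            Path file = logDir.resolve(name);
            try {
                if (!Files.exists(file)) Files.createFile(file);
                return new Checkpoint(file);
            } catch (IOException e) {
                channel.markOffline(logDir.toString(), "Error while creating checkpoint " + file, e);
                return null;
            }
        }
    }

    /** Initializes one checkpoint per log dir, skipping dirs whose creation failed. */
    public static List<Checkpoint> initAll(List<Path> logDirs, FailureChannel channel) {
        List<Checkpoint> usable = new ArrayList<>();
        for (Path dir : logDirs) {
            // Illustrative file name; any checkpoint file would do for this sketch.
            Checkpoint cp = Checkpoint.createOrReport(dir, "recovery-point-offset-checkpoint", channel);
            if (cp != null) usable.add(cp);
        }
        return usable;
    }

    /** One healthy dir plus one dir that does not exist, simulating a failed disk. */
    public static String demo() {
        try {
            Path healthy = Files.createTempDirectory("logdir-ok");
            Path broken = healthy.resolve("missing"); // never created, so createFile fails
            FailureChannel channel = new FailureChannel();
            List<Checkpoint> usable = initAll(List.of(healthy, broken), channel);
            return "usable=" + usable.size() + " offline=" + channel.offlineDirs().size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "usable=1 offline=1"
    }
}
```

In this sketch the failed directory is reported and skipped rather than aborting startup, which mirrors what {{write}}, {{read}} and {{writeIfDirExists}} are described as doing above. Whether the real initialization path can safely continue after such a failure is exactly the open question in this issue.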
> Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-19548
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19548
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Haozhong Ma
>            Assignee: Haozhong Ma
>            Priority: Major
>
> In our production environment, we encountered a scenario where a broker
> failed to start due to checkpoint creation failure on a single disk (caused
> by disk corruption or filesystem errors). According to Kafka's design, such
> disk-level failures should be isolated via {{logDirFailureChannel}},
> allowing other healthy disks to continue serving traffic. However, upon
> reviewing the {{CheckpointFileWithFailureHandler}} implementation, we
> observed that while methods like {{write}}, {{read}}, and
> {{writeIfDirExists}} handle {{IOException}} by routing the affected
> {{log.dir}} to {{logDirFailureChannel}}, the checkpoint initialization
> process lacks this fault-tolerant behavior. Should checkpoint creation adopt
> the same failure-handling logic? If this is not an intentional design, I will
> submit a PR to fix this issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)