[ 
https://issues.apache.org/jira/browse/KAFKA-19548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haozhong Ma updated KAFKA-19548:
--------------------------------
    Description: In our production environment, we encountered a scenario where 
a broker failed to start due to checkpoint creation failure on a single disk 
(caused by disk corruption or filesystem errors). According to Kafka's design, 
such disk-level failures should be isolated via {{logDirFailureChannel}}, 
allowing other healthy disks to continue serving traffic. However, upon 
reviewing the {{CheckpointFileWithFailureHandler}} implementation, we observed 
that while methods like {{write}}, {{read}}, and {{writeIfDirExists}} 
handle {{IOException}} by routing the affected {{log.dir}} to 
{{logDirFailureChannel}}, the checkpoint initialization process lacks this 
fault-tolerant behavior. Is this an oversight? Should checkpoint creation adopt 
the same failure-handling logic?  (was: the same text, followed by the 
attachment !image-2025-07-25-15-07-18-919.png!)

> Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-19548
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19548
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Haozhong Ma
>            Assignee: Haozhong Ma
>            Priority: Major
>
> In our production environment, we encountered a scenario where a broker 
> failed to start due to checkpoint creation failure on a single disk (caused 
> by disk corruption or filesystem errors). According to Kafka's design, such 
> disk-level failures should be isolated via {{logDirFailureChannel}}, 
> allowing other healthy disks to continue serving traffic. However, upon 
> reviewing the {{CheckpointFileWithFailureHandler}} implementation, we 
> observed that while methods like {{write}}, {{read}}, and 
> {{writeIfDirExists}} handle {{IOException}} by routing the affected 
> {{log.dir}} to {{logDirFailureChannel}}, the checkpoint initialization 
> process lacks this fault-tolerant behavior. Is this an oversight? Should 
> checkpoint creation adopt the same failure-handling logic?
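
The behavior asked for above could look roughly like the following sketch. It 
is not Kafka code: SketchFailureChannel and SketchCheckpointInit are 
hypothetical stand-ins written for illustration, and the method name 
maybeAddOfflineLogDir is only assumed to resemble what the real failure 
channel offers. The point is that an IOException raised while creating the 
checkpoint file is reported for that one log directory instead of propagating 
and aborting broker startup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for logDirFailureChannel: it only records which
// log directories have been reported as failed.
final class SketchFailureChannel {
    private final Queue<String> offlineDirs = new ConcurrentLinkedQueue<>();

    // Assumed method name, chosen for illustration only.
    void maybeAddOfflineLogDir(String logDir, String message, IOException cause) {
        if (!offlineDirs.contains(logDir)) {
            offlineDirs.add(logDir);
        }
    }

    boolean isOffline(String logDir) {
        return offlineDirs.contains(logDir);
    }
}

// Hypothetical helper: creates the checkpoint file for one log directory and
// routes an IOException to the channel instead of letting it escape.
final class SketchCheckpointInit {
    static boolean createCheckpoint(Path logDir, String fileName, SketchFailureChannel channel) {
        Path checkpoint = logDir.resolve(fileName);
        try {
            if (!Files.exists(checkpoint)) {
                Files.createFile(checkpoint);
            }
            return true;
        } catch (IOException e) {
            // Same idea as the existing write/read handling: mark only this
            // directory as failed so the other directories keep working.
            channel.maybeAddOfflineLogDir(logDir.toString(),
                "Error while creating checkpoint file " + checkpoint, e);
            return false;
        }
    }
}
```

With this shape, a caller iterating over all configured log directories would 
skip any directory for which createCheckpoint returns false and continue with 
the rest. Whether startup should then proceed with the remaining directories 
is the design question raised in this issue.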



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
