[ 
https://issues.apache.org/jira/browse/KAFKA-19548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haozhong Ma updated KAFKA-19548:
--------------------------------
    Description: In our production environment, we encountered a scenario where 
a broker failed to start due to checkpoint creation failure on a single disk 
(caused by disk corruption or filesystem errors). According to Kafka's design, 
such disk-level failures should be isolated via {{logDirFailureChannel}}, 
allowing other healthy disks to continue serving traffic. However, upon 
reviewing the {{CheckpointFileWithFailureHandler}} implementation, we observed 
that while methods like {{write}}, {{read}}, and {{writeIfDirExists}} 
handle {{IOException}} by routing the affected {{log.dir}} to 
{{logDirFailureChannel}}, the checkpoint initialization process lacks this 
fault-tolerant behavior. Should checkpoint creation adopt the same 
failure-handling logic? If the current behavior is not intentional, I will 
submit a PR to fix this issue.  (was: In our production environment, we 
encountered a scenario where a broker failed to start due to checkpoint 
creation failure on a single disk (caused by disk corruption or filesystem 
errors). According to Kafka's design, such disk-level failures should be 
isolated via {{logDirFailureChannel}}, allowing other healthy disks to 
continue serving traffic. However, upon reviewing the 
{{CheckpointFileWithFailureHandler}} implementation, we observed that while 
methods like {{write}}, {{read}}, and {{writeIfDirExists}} handle 
{{IOException}} by routing the affected {{log.dir}} to 
{{logDirFailureChannel}}, the checkpoint initialization process lacks this 
fault-tolerant behavior. Is this an oversight? Should checkpoint creation 
adopt the same failure-handling logic?)

> Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-19548
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19548
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Haozhong Ma
>            Assignee: Haozhong Ma
>            Priority: Major
>
> In our production environment, we encountered a scenario where a broker 
> failed to start due to checkpoint creation failure on a single disk (caused 
> by disk corruption or filesystem errors). According to Kafka's design, such 
> disk-level failures should be isolated via {{logDirFailureChannel}}, 
> allowing other healthy disks to continue serving traffic. However, upon 
> reviewing the {{CheckpointFileWithFailureHandler}} implementation, we 
> observed that while methods like {{write}}, {{read}}, and 
> {{writeIfDirExists}} handle {{IOException}} by routing the affected 
> {{log.dir}} to {{logDirFailureChannel}}, the checkpoint initialization 
> process lacks this fault-tolerant behavior. Should checkpoint creation adopt 
> the same failure-handling logic? If the current behavior is not intentional, 
> I will submit a PR to fix this issue.
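
The failure handling asked about above could look roughly like the sketch below. It is a self-contained Java illustration, not Kafka's actual code: {{CheckpointInitSketch}}, {{OfflineDirs}}, {{markOffline}}, and {{createCheckpoint}} are hypothetical names standing in for the checkpoint wrapper and {{logDirFailureChannel}}. It assumes checkpoint creation surfaces disk problems as an {{IOException}}, and that recording the directory as offline is an acceptable alternative to aborting startup.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class CheckpointInitSketch {

    /** Hypothetical stand-in for a log-dir failure channel: it only records offline dirs. */
    static final class OfflineDirs {
        private final Set<String> offline = ConcurrentHashMap.newKeySet();

        void markOffline(String logDir, String reason, IOException cause) {
            // Report each directory once, even if several checkpoints fail on it.
            if (offline.add(logDir)) {
                System.err.println("Marking " + logDir + " offline: " + reason + " (" + cause + ")");
            }
        }

        boolean isOffline(String logDir) {
            return offline.contains(logDir);
        }
    }

    /**
     * Creates the checkpoint file if it is missing. On an I/O error the log dir is
     * reported as offline and an empty Optional is returned, so the caller can
     * skip this directory instead of failing the whole startup.
     */
    static Optional<Path> createCheckpoint(Path logDir, String fileName, OfflineDirs channel) {
        Path file = logDir.resolve(fileName);
        try {
            try {
                Files.createFile(file);
            } catch (FileAlreadyExistsException e) {
                // An existing checkpoint file is fine; reuse it.
            }
            return Optional.of(file);
        } catch (IOException e) {
            channel.markOffline(logDir.toString(), "Error while creating checkpoint file " + file, e);
            return Optional.empty();
        }
    }
}
```

With this shape, startup code would call {{createCheckpoint}} once per configured log directory and continue with the directories that returned a value, which mirrors how the description says {{write}} and {{read}} failures are already isolated.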



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
