[jira] [Created] (KAFKA-19646) CLONE - Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel

Haozhong Ma (Jira) Tue, 26 Aug 2025 01:52:09 -0700

Haozhong Ma created KAFKA-19646:
-----------------------------------

             Summary: CLONE - Broker Startup: Handle Checkpoint Creation 
Failure via logDirFailureChannel
                 Key: KAFKA-19646
                 URL: https://issues.apache.org/jira/browse/KAFKA-19646
             Project: Kafka
          Issue Type: Improvement
          Components: core
            Reporter: Haozhong Ma
            Assignee: Haozhong Ma



In our production environment, a broker failed to start because checkpoint 
creation failed on a single disk (caused by disk corruption or filesystem 
errors). By Kafka's design, such disk-level failures should be isolated via 
{{logDirFailureChannel}}, so that the remaining healthy disks continue serving 
traffic. However, on reviewing the {{CheckpointFileWithFailureHandler}} 
implementation, we observed that while methods such as {{write}}, {{read}}, 
and {{writeIfDirExists}} handle {{IOException}} by routing the affected 
{{log.dir}} to {{logDirFailureChannel}}, checkpoint initialization has no 
such fault-tolerant handling. Should checkpoint creation adopt the same 
failure-handling logic? If the current behavior is not intentional, I will 
submit a PR to fix it.
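
To make the proposal concrete, here is a minimal, self-contained sketch of the 
behavior being asked for. It is not Kafka's code: {{FailureChannelSketch}}, 
{{reportOffline}}, {{CheckpointInitSketch}}, and {{createCheckpoint}} are 
hypothetical names that stand in for {{logDirFailureChannel}} and the 
checkpoint initialization path. The only point is that an {{IOException}} 
during file creation marks one directory offline instead of propagating and 
aborting startup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for logDirFailureChannel; NOT the real Kafka API. */
interface FailureChannelSketch {
    void reportOffline(String logDir, String message, IOException cause);
}

/** Illustrative only: checkpoint creation that reports failures per directory. */
final class CheckpointInitSketch {

    /**
     * Creates the checkpoint file in logDir if it does not exist yet.
     * On IOException the directory is reported to the channel and an empty
     * Optional is returned, so the caller can carry on with other directories.
     */
    static Optional<Path> createCheckpoint(Path logDir, String fileName,
                                           FailureChannelSketch channel) {
        Path file = logDir.resolve(fileName);
        try {
            if (!Files.exists(file)) {
                Files.createFile(file);
            }
            return Optional.of(file);
        } catch (IOException e) {
            channel.reportOffline(logDir.toString(),
                "Error while creating checkpoint file " + file, e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> offline = new ArrayList<>();
        FailureChannelSketch channel = (dir, msg, cause) -> offline.add(dir);

        Path healthy = Files.createTempDirectory("logdir-ok");
        // This directory is never created, so file creation inside it fails.
        Path broken = healthy.resolve("missing-dir");

        for (Path dir : List.of(healthy, broken)) {
            boolean ok = createCheckpoint(dir, "checkpoint-sketch", channel).isPresent();
            System.out.println(dir + " usable: " + ok);
        }
        System.out.println("offline dirs: " + offline);
    }
}
```

In this sketch the failing directory ends up in the offline list while the 
healthy one still gets its checkpoint file. Whether the real initialization 
path can safely defer the failure this way is exactly the open question above.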



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
