[jira] [Created] (HDFS-17821) Fix the SNN repeatedly checkpoint after fsimage transfer failure on one of the multiple NNs

caozhiqiang (Jira) Sat, 16 Aug 2025 09:28:14 -0700

caozhiqiang created HDFS-17821:
----------------------------------

             Summary: Fix the SNN repeatedly checkpoint after fsimage transfer 
failure on one of the multiple NNs
                 Key: HDFS-17821
                 URL: https://issues.apache.org/jira/browse/HDFS-17821
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.5.0
            Reporter: caozhiqiang
            Assignee: caozhiqiang



In our cluster with observer NNs, when the standby NN performs a checkpoint and 
sends the fsimage to the other NNs, if the transfer to one NN fails due to 
network anomalies, an NN restart, or another exception, the standby considers 
the checkpoint failed, does not update its lastCheckpointTime, and retries the 
checkpoint. 
However, the active or observer NNs that successfully received the fsimage have 
updated their lastCheckpointTime, while the NN that failed to receive it has 
not, so lastCheckpointTime becomes inconsistent across the NNs. Subsequent 
checkpoints then repeatedly fail to send the fsimage to some or all of the 
active and observer NNs, because the DFS_NAMENODE_CHECKPOINT_PERIOD_KEY 
condition is not satisfied on them. 
As a result, the SNN keeps failing to checkpoint and keeps retrying. I think 
the SNN should consider the checkpoint successful and update its 
lastCheckpointTime if the fsimage transfer succeeds on at least half of the 
NNs.
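
The proposed rule could be sketched roughly as follows. This is only an 
illustration of the "at least half" idea, not the actual standby checkpointer 
code: the class CheckpointQuorumSketch, its method names, and the choice to 
treat "no upload targets" as success are all assumptions made for the example.

```java
import java.util.List;

/** Hypothetical sketch of the proposed rule; not taken from the Hadoop source. */
final class CheckpointQuorumSketch {
    private CheckpointQuorumSketch() {}

    /**
     * Returns true when the fsimage upload succeeded on at least half of the
     * target NNs. With no targets there is nothing that could fail, so this
     * sketch (by assumption) reports success.
     */
    static boolean isCheckpointSuccessful(int successfulUploads, int totalTargets) {
        if (totalTargets <= 0) {
            return true;
        }
        // Integer form of successfulUploads / totalTargets >= 1/2.
        return 2 * successfulUploads >= totalTargets;
    }

    /**
     * Computes the value the standby would keep as its lastCheckpointTime:
     * "now" if enough uploads succeeded, otherwise the previous value so the
     * checkpoint is retried.
     */
    static long nextLastCheckpointTime(List<Boolean> uploadResults,
                                       long previousCheckpointTime,
                                       long now) {
        int ok = 0;
        for (boolean succeeded : uploadResults) {
            if (succeeded) {
                ok++;
            }
        }
        return isCheckpointSuccessful(ok, uploadResults.size())
            ? now
            : previousCheckpointTime;
    }
}
```

With three target NNs, a single failed upload would still advance 
lastCheckpointTime under this rule, while two failed uploads would leave it 
unchanged and trigger a retry.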



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
