[
https://issues.apache.org/jira/browse/HDFS-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012816#comment-18012816
]
ASF GitHub Bot commented on HDFS-17815:
---------------------------------------
lfxy commented on code in PR #7845:
URL: https://github.com/apache/hadoop/pull/7845#discussion_r2262481006
##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java:
##########
@@ -459,9 +459,10 @@ private void doWork() {
uncheckpointed, checkpointConf.getTxnCount());
needCheckpoint = true;
} else if (secsSinceLast >= checkpointConf.getPeriod()) {
- LOG.info("Triggering checkpoint because it has been {} seconds " +
- "since the last checkpoint, which exceeds the configured " +
- "interval {}", secsSinceLast, checkpointConf.getPeriod());
+ LOG.info("Triggering checkpoint because it has been {} seconds "
Review Comment:
@Hexiaoqiao Thanks for the review. I have resolved it.
> Fix upload fsimage failure when checkpoint takes a long time
> ------------------------------------------------------------
>
> Key: HDFS-17815
> URL: https://issues.apache.org/jira/browse/HDFS-17815
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.5.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
>
> The capacity of our HDFS federation cluster is more than 500 PB, with one NS
> containing over 600 million files. A single checkpoint takes nearly two hours.
> We found that checkpoints frequently fail because the fsimage upload to the
> active NameNode is rejected, leading to repeated checkpoints. We have
> dfs.recent.image.check.enabled=true configured. After debugging, the cause is
> that the standby NN updates lastCheckpointTime with the start time of the
> checkpoint rather than its end time. In our cluster, the standby node's
> lastCheckpointTime is approximately 80 minutes earlier than the active NN's.
> When the interval measured on the standby NN exceeds
> dfs.namenode.checkpoint.period, the next checkpoint is performed. But because
> the active NN's lastCheckpointTime is later than the standby NN's, the
> interval the active computes is still less than dfs.namenode.checkpoint.period,
> so the fsimage upload is rejected, causing the checkpoint to fail and be
> retried (see the timing sketch after the quoted logs below).
> ANN's log:
> {code:java}
> 2025-07-31 07:14:29,845 INFO [qtp231311211-8404]
> org.apache.hadoop.hdfs.server.namenode.ImageServlet: New txnid cnt is
> 126487459, expecting at least 300000000. now is 1753917269845,
> lastCheckpointTime is 1753875142580, timeDelta is 42127, expecting period at
> least 43200 unless too long since last upload.. {code}
> SNN's log:
> {code:java}
> last checkpoint start time:
> 2025-07-30 18:13:08,729 INFO [Standby State Checkpointer]
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering
> checkpoint because it has been 48047 seconds since the last checkpoint, which
> exceeds the configured interval 43200
> last checkpoint end time:
> 2025-07-30 20:11:51,330 INFO [Standby State Checkpointer]
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Checkpoint
> finished successfully.
> this time checkpoint start time:
> 2025-07-31 06:13:51,681 INFO [Standby State Checkpointer]
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering
> checkpoint because it has been 43242 seconds since the last checkpoint, which
> exceeds the configured interval 43200{code}
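A minimal, self-contained sketch of the timing mismatch described in the quoted report, assuming the simplified numbers from the logs above (12-hour checkpoint period, roughly 2-hour checkpoint duration). The class and variable names below are hypothetical illustrations, not the actual StandbyCheckpointer or ImageServlet code.
{code:java}
// Hypothetical illustration of the timing mismatch; not Hadoop code.
public class CheckpointTimingSketch {

  // dfs.namenode.checkpoint.period in seconds (43200 = 12h, as in the logs above).
  static final long PERIOD_SECS = 43_200;
  // Rough duration of one checkpoint on the standby (~2h in the cluster described above).
  static final long CHECKPOINT_DURATION_SECS = 7_200;

  public static void main(String[] args) {
    // t = 0: the standby starts a checkpoint and (current behavior) records
    // that start time as its lastCheckpointTime.
    long standbyStamp = 0;
    // The active NN only stamps its lastCheckpointTime when the fsimage upload
    // completes, i.e. roughly at the end of the checkpoint.
    long activeStamp = standbyStamp + CHECKPOINT_DURATION_SECS;

    // The standby triggers the next checkpoint PERIOD_SECS after *its* stamp...
    long nextTrigger = standbyStamp + PERIOD_SECS;
    // ...but at that moment the active NN has only seen this much elapsed time:
    long deltaSeenByActive = nextTrigger - activeStamp;

    System.out.printf("standby waited %d s, active sees only %d s (needs >= %d s)"
        + " -> upload rejected%n", PERIOD_SECS, deltaSeenByActive, PERIOD_SECS);

    // Stamping the standby's lastCheckpointTime at the *end* of the checkpoint
    // instead keeps the two stamps aligned:
    long fixedStandbyStamp = standbyStamp + CHECKPOINT_DURATION_SECS;
    long fixedDelta = (fixedStandbyStamp + PERIOD_SECS) - activeStamp;
    System.out.printf("with end-time stamping the active sees %d s -> upload accepted%n",
        fixedDelta);
  }
}
{code}
With these assumed numbers the active NN would compute a delta of 36,000 s against a required 43,200 s, the same kind of rejection shown in the ANN log above (timeDelta 42,127 s vs. the required 43,200 s).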