[ https://issues.apache.org/jira/browse/HDFS-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ke Han updated HDFS-17163:
--------------------------
Description: 

While performing the full-stop upgrade from 2.10.2 to 3.3.6, I noticed the following error message:

*2023-08-17 10:43:11,665 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory /tmp/hadoop-root/dfs/namesecondary*
{code:java}
2023-08-17 10:43:11,407 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-root/dfs/namesecondary/current/fsimage.ckpt_0000000000000000020 of size 340 bytes saved in 0 seconds .
2023-08-17 10:43:11,427 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: RECEIVED SIGNAL 15: SIGTERM
2023-08-17 10:43:11,434 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: FSImageSaver clean checkpoint: txid = 20 when meet shutdown.
2023-08-17 10:43:11,434 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down SecondaryNameNode at 5371b4aeefe1/192.168.78.3
************************************************************/
2023-08-17 10:43:11,663 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Unable to rename checkpoint in Storage Directory /tmp/hadoop-root/dfs/namesecondary
java.io.IOException: renaming /tmp/hadoop-root/dfs/namesecondary/current/fsimage.ckpt_0000000000000000020 to /tmp/hadoop-root/dfs/namesecondary/current/fsimage_0000000000000000020 FAILED
        at org.apache.hadoop.hdfs.server.namenode.FSImage.renameImageFileInDir(FSImage.java:1329)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.renameCheckpoint(FSImage.java:1263)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1224)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1172)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1105)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:563)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:360)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:325)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:321)
        at java.lang.Thread.run(Thread.java:750)
2023-08-17 10:43:11,665 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory /tmp/hadoop-root/dfs/namesecondary
2023-08-17 10:43:11,665 WARN org.apache.hadoop.hdfs.server.common.Storage: About to remove corresponding storage: /tmp/hadoop-root/dfs/namesecondary
{code}
The cluster I am using has four nodes: 1 NN, 1 SNN, 2 DNs. The upgrade order is: (1) stop the SNN, (2) stop the NN, (3) stop DN1 and DN2. The error occurs on the SNN while it is stopping.

The command sequence I was executing and the configurations are attached. I tried to reproduce the problem with the same command sequence, but it could not be reproduced even after repeating the command sequence plus the upgrade two thousand times. It might require some special timing constraints. I am not sure whether this could impact data integrity.
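For illustration only, here is a minimal shell sketch of the interleaving the timestamps above suggest (this is not the FSImage code, and the ordering is my assumption based on the 10:43:11,434 cleanup message and the 10:43:11,663 rename failure): the shutdown-time checkpoint cleanup removes the fsimage.ckpt file first, so the later rename to fsimage_<txid> has nothing to move and fails.
{code:bash}
# Hypothetical re-creation of the suspected ordering; NOT the HDFS implementation.
DIR=/tmp/namesecondary-demo/current        # stand-in for /tmp/hadoop-root/dfs/namesecondary/current
mkdir -p "$DIR"
touch "$DIR/fsimage.ckpt_0000000000000000020"

# 10:43:11,434  "FSImageSaver clean checkpoint: txid = 20 when meet shutdown."
rm "$DIR/fsimage.ckpt_0000000000000000020"

# 10:43:11,663  renameCheckpoint(): the source checkpoint is already gone, so the
# rename fails, matching "renaming ...fsimage.ckpt_... to ...fsimage_... FAILED"
mv "$DIR/fsimage.ckpt_0000000000000000020" "$DIR/fsimage_0000000000000000020"
{code}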
== Command Sequence ==
{code:java}
// Start up cluster
bin/hdfs dfsadmin -safemode enter
bin/hdfs dfsadmin -rollingUpgrade prepare
bin/hdfs dfsadmin -safemode leave
// Execute commands
dfs -mkdir /ymlAOGQU
dfs -mkdir /ymlAOGQU/xXVm
dfs -touchz /ymlAOGQU/xXVm/xXVm.xml
dfs -mv /ymlAOGQU/xXVm /ymlAOGQU/
dfs -setacl -k -m acl / --set acl2 /
dfsadmin -saveNamespace
dfs -touchz /ymlAOGQU/xXVm.txt
dfs -put /tmp/upfuzz/hdfs/orDmixfM/D /ymlAOGQU/
dfs -rm -f -R -safely -skipTrash /ymlAOGQU/
dfsadmin -report -live -enteringmaintenance -inmaintenance
dfsadmin -saveNamespace
dfsadmin -report -dead -enteringmaintenance
dfsadmin -rollEdits
dfsadmin -refreshNodes
// stop SNN
// stop NN
// stop DN1 & DN2
{code}
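For readability, the `dfs ...` and `dfsadmin ...` lines above are shorthand for full `bin/hdfs dfs ...` / `bin/hdfs dfsadmin ...` invocations, and the three `// stop ...` comments are plain daemon stops in the order described earlier. A sketch of the expansion on the 2.10.2 cluster is below; the `sbin/hadoop-daemon.sh` commands are my assumption about how the daemons are stopped, since the test harness drives this internally.
{code:bash}
# Sketch: expanding the shorthand above into runnable commands (Hadoop 2.10.2 layout).
bin/hdfs dfs -mkdir /ymlAOGQU
bin/hdfs dfs -mkdir /ymlAOGQU/xXVm
bin/hdfs dfs -touchz /ymlAOGQU/xXVm/xXVm.xml
bin/hdfs dfsadmin -saveNamespace
# ... the remaining dfs/dfsadmin commands follow the same pattern ...

# Full-stop order from the report (exact stop commands are an assumption):
sbin/hadoop-daemon.sh stop secondarynamenode   # (1) stop SNN -- the error appears here
sbin/hadoop-daemon.sh stop namenode            # (2) stop NN
sbin/hadoop-daemon.sh stop datanode            # (3) stop DN1 and DN2 (run on each DN host)
{code}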
> ERROR Log Message when upgrading from 2.10.2 to 3.3.6
> -----------------------------------------------------
>
>                 Key: HDFS-17163
>                 URL: https://issues.apache.org/jira/browse/HDFS-17163
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.10.2
>            Reporter: Ke Han
>            Priority: Major
>         Attachments: core-site.xml, hdfs-site.xml, log.tar.gz, orDmixfM.tar.gz
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org