[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104187#comment-17104187 ] Yicong Cai commented on HDFS-15175: --- Hi, [~wanchang] We solved this problem by completely resetting the OP object. So far I have not been able to cover this problem with a UT case, so I have not provided a patch for the fix. Do you have a UT case that reproduces this problem?
> Multiple CloseOp shared block instance causes the standby namenode to crash > when rolling editlog > > > Key: HDFS-15175 > URL: https://issues.apache.org/jira/browse/HDFS-15175 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Critical > >
> {panel:title=Crash exception} > 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log > tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp > [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, > atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], > permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, > clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, > txid=32625024993] > java.io.IOException: File is not under construction: .. > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:360) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873) > at > org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361) > {panel} > >
> {panel:title=Editlog} > > OP_REASSIGN_LEASE > > 32625021150 > DFSClient_NONMAPREDUCE_-969060727_197760 > .. > DFSClient_NONMAPREDUCE_1000868229_201260 > > > .. > > OP_CLOSE > > 32625023743 > 0 > 0 > .. > 3 > 1581816135883 > 1581814760398 > 536870912 > > > false > > 5568434562 > 185818644 > 4495417845 > > > da_music > hdfs > 416 > > > > .. > > OP_TRUNCATE > > 32625024049 > .. > DFSClient_NONMAPREDUCE_1000868229_201260 > .. > 185818644 > 1581816136336 > > 5568434562 > 185818648 > 4495417845 > > > > .. > > OP_CLOSE > > 32625024993 > 0 > 0 > .. > 3 > 1581816138774 > 1581814760398 > 536870912 > > > false > > 5568434562 > 185818644 > 4495417845 > > > da_music > hdfs > 416 > > > > {panel} > >
> The block size should be 185818648 in the first CloseOp. When truncate is used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp records are synchronized to the JournalNode in the same batch. The block referenced by the two CloseOps is the same instance, which causes the first CloseOp to carry the wrong block size. When the SNN rolls the editlog, replaying the TruncateOp does not put the file into the UnderConstruction state; then, when the second CloseOp is applied, the file is not in the UnderConstruction state and the SNN crashes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
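The failure above is, at heart, an object-aliasing problem. The sketch below is not the actual FSEditLogOp/Block code from HDFS; it is a minimal, self-contained illustration (reusing the block id, generation stamp and sizes from this report) of why a CloseOp that keeps a reference to the live block ends up carrying the size a later truncate wrote into it, and why snapshotting the block when the op is built (or completely resetting the op object, as the comment above describes) avoids that.

{code:java}
// Minimal, self-contained sketch (NOT the actual HDFS FSEditLogOp/Block classes) of the
// aliasing hazard: a CloseOp that keeps a reference to the live Block object later
// serializes whatever value the block has been mutated to (e.g. by a subsequent
// truncate), instead of the value it had when the op was logged.
public class SharedBlockInstanceDemo {

  /** Stand-in for a block: id, numBytes, generation stamp. */
  static final class Block {
    final long id;
    long numBytes;
    final long genStamp;

    Block(long id, long numBytes, long genStamp) {
      this.id = id;
      this.numBytes = numBytes;
      this.genStamp = genStamp;
    }

    /** Defensive copy, i.e. a snapshot of the block state at logging time. */
    Block(Block other) {
      this(other.id, other.numBytes, other.genStamp);
    }

    @Override
    public String toString() {
      return "blk_" + id + "_" + genStamp + " numBytes=" + numBytes;
    }
  }

  /** Stand-in for an OP_CLOSE record waiting to be flushed to the JournalNodes. */
  static final class CloseOp {
    final Block[] blocks;

    CloseOp(Block[] blocks) {
      this.blocks = blocks;
    }

    String serialize() {
      return java.util.Arrays.toString(blocks);
    }
  }

  public static void main(String[] args) {
    Block live = new Block(5568434562L, 185818648L, 4495417845L);

    // Buggy behaviour: the op shares the live Block instance.
    CloseOp sharedOp = new CloseOp(new Block[] { live });
    // Safe behaviour: the op snapshots the block state when it is built.
    CloseOp copiedOp = new CloseOp(new Block[] { new Block(live) });

    // A later truncate mutates the live block before the batch is flushed.
    live.numBytes = 185818644L;

    System.out.println("shared instance: " + sharedOp.serialize());  // mutated size 185818644
    System.out.println("copied instance: " + copiedOp.serialize());  // original size 185818648
  }
}
{code}

Run as-is, the op that shares the instance prints the mutated size (185818644), while the copied op keeps the size it had when it was logged (185818648).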
[jira] [Updated] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-15175: -- Description: {panel:title=Crash exception} 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993] java.io.IOException: File is not under construction: .. at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361) {panel} {panel:title=Editlog} OP_REASSIGN_LEASE 32625021150 DFSClient_NONMAPREDUCE_-969060727_197760 .. DFSClient_NONMAPREDUCE_1000868229_201260 .. OP_CLOSE 32625023743 0 0 .. 3 1581816135883 1581814760398 536870912 false 5568434562 185818644 4495417845 da_music hdfs 416 .. OP_TRUNCATE 32625024049 .. DFSClient_NONMAPREDUCE_1000868229_201260 .. 185818644 1581816136336 5568434562 185818648 4495417845 .. OP_CLOSE 32625024993 0 0 .. 3 1581816138774 1581814760398 536870912 false 5568434562 185818644 4495417845 da_music hdfs 416 {panel} The block size should be 185818648 in the first CloseOp. When truncate is used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp is synchronized to the JournalNode in the same batch. The block used by CloseOp twice is the same instance, which causes the first CloseOp has wrong block size. When SNN rolling Editlog, TruncateOp does not make the file to the UnderConstruction state. Then, when the second CloseOp is executed, the file is not in the UnderConstruction state, and SNN crashes. was: {panel:title=Crash exception} 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993] java.io.IOException: File is not under construction: .. 
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479) at
[jira] [Created] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
Yicong Cai created HDFS-15175: - Summary: Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog Key: HDFS-15175 URL: https://issues.apache.org/jira/browse/HDFS-15175 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.9.2 Reporter: Yicong Cai Assignee: Yicong Cai {panel:title=Crash exception} 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993] java.io.IOException: File is not under construction: .. at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361) {panel} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911074#comment-16911074 ] Yicong Cai commented on HDFS-14311: --- Thanks [~sodonnell] [~surendrasingh] [~jojochuang] for your attention and review on this issue. It is very difficult to reproduce with a UT, and my attempts so far have failed. I have first fixed the checkstyle-related issues, and I will continue trying to reproduce the problem with a UT.
> multi-threading conflict at layoutVersion when loading block pool storage > - > > Key: HDFS-14311 > URL: https://issues.apache.org/jira/browse/HDFS-14311 > Project: Hadoop HDFS > Issue Type: Bug > Components: rolling upgrades >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14311.1.patch, HDFS-14311.2.patch, > HDFS-14311.branch-2.1.patch > > > When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at > StorageInfo.layoutVersion in loading block pool storage process. > It will cause this exception: > > {panel:title=exceptions} > 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] > - Restored 36974 block files from trash before the layout upgrade. These > blocks will be moved to the previous directory during the upgrade > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] > - Failed to analyze storage directories for block pool > BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed > to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block > pool BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at >
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > {panel} > > root cause: > BlockPoolSliceStorage instance is shared for all storage locations recover > transition. In BlockPoolSliceStorage.doTransition, it will read the old
[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14311: -- Attachment: HDFS-14311.branch-2.1.patch > multi-threading conflict at layoutVersion when loading block pool storage > - > > Key: HDFS-14311 > URL: https://issues.apache.org/jira/browse/HDFS-14311 > Project: Hadoop HDFS > Issue Type: Bug > Components: rolling upgrades >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14311.1.patch, HDFS-14311.2.patch, > HDFS-14311.branch-2.1.patch > > > When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at > StorageInfo.layoutVersion in loading block pool storage process. > It will cause this exception: > > {panel:title=exceptions} > 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] > - Restored 36974 block files from trash before the layout upgrade. These > blocks will be moved to the previous directory during the upgrade > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] > - Failed to analyze storage directories for block pool > BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed > to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block > pool BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > 
org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > {panel} > > root cause: > BlockPoolSliceStorage instance is shared for all storage locations recover > transition. In BlockPoolSliceStorage.doTransition, it will read the old > layoutVersion from local storage, compare with current DataNode version, then > do upgrade. In doUpgrade, add the transition work as a sub-thread, the > transition work will set the BlockPoolSliceStorage's layoutVersion to current > DN version. The next
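The root cause quoted above comes down to several storage directories being loaded against one shared, mutable layoutVersion while an upgrade sub-thread rewrites it. The sketch below uses latches only to make that window deterministic; the names and version numbers are illustrative, and it is neither the real BlockPoolSliceStorage code nor the HDFS-14311 patch.

{code:java}
import java.util.concurrent.CountDownLatch;

// Deterministic sketch of the window: directory A's upgrade sub-thread bumps the shared
// layoutVersion while directory B is between "load my VERSION file" and "validate it".
public class SharedLayoutVersionRace {
  static final int ON_DISK_LV = -56;   // value written by the old DataNode release (illustrative)
  static final int CURRENT_LV = -57;   // value of the upgraded DataNode software (illustrative)

  /** Stand-in for the single storage object shared by all directory loads. */
  static volatile int sharedLayoutVersion;

  public static void main(String[] args) throws Exception {
    CountDownLatch dirBLoaded = new CountDownLatch(1);
    CountDownLatch upgradeDone = new CountDownLatch(1);

    // Directory A decided to upgrade; the upgrade work runs in a sub-thread and
    // mutates the shared field.
    Thread dirAUpgrade = new Thread(() -> {
      try {
        dirBLoaded.await();
      } catch (InterruptedException ignored) {
        return;
      }
      sharedLayoutVersion = CURRENT_LV;   // shared state now looks "already upgraded"
      upgradeDone.countDown();
    });
    dirAUpgrade.start();

    // Directory B reads its own VERSION file into the shared field ...
    sharedLayoutVersion = ON_DISK_LV;
    dirBLoaded.countDown();
    upgradeDone.await();                  // ... directory A's upgrade lands exactly here ...
    // ... so directory B's consistency check no longer sees what it just read from disk.
    if (sharedLayoutVersion != ON_DISK_LV) {
      System.out.println("Directory B sees LV = " + sharedLayoutVersion
          + " instead of the on-disk LV = " + ON_DISK_LV
          + ", so a \"state is newer than expected\" style check misfires");
    }
    dirAUpgrade.join();
  }
}
{code}

Keeping the value read from each directory's VERSION file in per-directory state, or deferring the shared update until every directory has been checked, closes this window; which of these the committed patch actually does is not spelled out in this thread.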
[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14311: -- Attachment: HDFS-14311.2.patch > multi-threading conflict at layoutVersion when loading block pool storage > - > > Key: HDFS-14311 > URL: https://issues.apache.org/jira/browse/HDFS-14311 > Project: Hadoop HDFS > Issue Type: Bug > Components: rolling upgrades >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14311.1.patch, HDFS-14311.2.patch > > > When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at > StorageInfo.layoutVersion in loading block pool storage process. > It will cause this exception: > > {panel:title=exceptions} > 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] > - Restored 36974 block files from trash before the layout upgrade. These > blocks will be moved to the previous directory during the upgrade > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] > - Failed to analyze storage directories for block pool > BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed > to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block > pool BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > 
org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > {panel} > > root cause: > BlockPoolSliceStorage instance is shared for all storage locations recover > transition. In BlockPoolSliceStorage.doTransition, it will read the old > layoutVersion from local storage, compare with current DataNode version, then > do upgrade. In doUpgrade, add the transition work as a sub-thread, the > transition work will set the BlockPoolSliceStorage's layoutVersion to current > DN version. The next storage dir transition check will concurrent
[jira] [Commented] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906309#comment-16906309 ] Yicong Cai commented on HDFS-14311: --- [~sodonnell] Thanks for your detailed reply. I will add the corresponding test cases to reproduce the issue and adjust the code format.
> multi-threading conflict at layoutVersion when loading block pool storage > - > > Key: HDFS-14311 > URL: https://issues.apache.org/jira/browse/HDFS-14311 > Project: Hadoop HDFS > Issue Type: Bug > Components: rolling upgrades >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14311.1.patch > > > When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at > StorageInfo.layoutVersion in loading block pool storage process. > It will cause this exception: > > {panel:title=exceptions} > 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] > - Restored 36974 block files from trash before the layout upgrade. These > blocks will be moved to the previous directory during the upgrade > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] > - Failed to analyze storage directories for block pool > BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed > to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block > pool BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at >
org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > {panel} > > root cause: > BlockPoolSliceStorage instance is shared for all storage locations recover > transition. In BlockPoolSliceStorage.doTransition, it will read the old > layoutVersion from local storage, compare with current DataNode version, then > do upgrade. In doUpgrade, add the transition work as a sub-thread, the > transition work will set
[jira] [Commented] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875357#comment-16875357 ] Yicong Cai commented on HDFS-14429: --- [~jojochuang] Before this fix, a block whose replicas are all on decommissioning nodes would not be completed, so the redundancy check was not performed. After the fix, the redundancy check runs and updateNeededReconstructions is called. Replicas in maintenance are counted as effective, but decommissioning replicas are not, so neededReconstruction.update can compute a negative curReplicas.
{code:java}
// handle low redundancy/extra redundancy
short fileRedundancy = getExpectedRedundancyNum(storedBlock);
if (!isNeededReconstruction(storedBlock, num, pendingNum)) {
  neededReconstruction.remove(storedBlock, numCurrentReplica,
      num.readOnlyReplicas(), num.outOfServiceReplicas(), fileRedundancy);
} else {
  // Perform update
  updateNeededReconstructions(storedBlock, curReplicaDelta, 0);
}
{code}
{code:java}
if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) {
  neededReconstruction.update(block, repl.liveReplicas() + pendingNum,
      repl.readOnlyReplicas(), repl.outOfServiceReplicas(),
      curExpectedReplicas, curReplicasDelta, expectedReplicasDelta);
}
{code}
{code:java}
synchronized void update(BlockInfo block, int curReplicas,
    int readOnlyReplicas, int outOfServiceReplicas, int curExpectedReplicas,
    int curReplicasDelta, int expectedReplicasDelta) {
  // Cause Negative here
  int oldReplicas = curReplicas-curReplicasDelta;
  int oldExpectedReplicas = curExpectedReplicas-expectedReplicasDelta;
  int curPri = getPriority(block, curReplicas, readOnlyReplicas,
      outOfServiceReplicas, curExpectedReplicas);
  int oldPri = getPriority(block, oldReplicas, readOnlyReplicas,
      outOfServiceReplicas, oldExpectedReplicas);
  if(NameNode.stateChangeLog.isDebugEnabled()) {
    NameNode.stateChangeLog.debug("LowRedundancyBlocks.update " + block
        + " curReplicas " + curReplicas
        + " curExpectedReplicas " + curExpectedReplicas
        + " oldReplicas " + oldReplicas
        + " oldExpectedReplicas " + oldExpectedReplicas
        + " curPri " + curPri + " oldPri " + oldPri);
  }
  // oldPri is mostly correct, but not always. If not found with oldPri,
  // other levels will be searched until the block is found & removed.
  remove(block, oldPri, oldExpectedReplicas);
  if(add(block, curPri, curExpectedReplicas)) {
    NameNode.blockStateChangeLog.debug(
        "BLOCK* NameSystem.LowRedundancyBlock.update: {} has only {} " +
            "replicas and needs {} replicas so is added to " +
            "neededReconstructions at priority level {}", block, curReplicas,
        curExpectedReplicas, curPri);
  }
}
{code}
> Block remain in COMMITTED but not COMPLETE cause by Decommission > > > Key: HDFS-14429 > URL: https://issues.apache.org/jira/browse/HDFS-14429 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, > HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, > HDFS-14429.branch-2.02.patch > > > In the following scenario, the Block will remain in the COMMITTED but not > COMPLETE state and cannot be closed properly: > # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3). > # bk1 has been completely written to three data nodes, and the data node > succeeds FinalizeBlock, joins IBR and waits to report to NameNode. > # The client commits bk1 after receiving the ACK. > # When the DN has not been reported to the IBR, all three nodes dn1/dn2/dn3 > enter Decommissioning.
> # The DN reports the IBR, but the block cannot be completed normally. > > Then it will lead to the following related exceptions: > {panel:title=Exception} > 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem > (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* > blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= > minimum = 1) in file xxx > 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - > IPC Server handler 499 on 8020, call Call#122552 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615 > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not > replicated yet: xxx > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846) > at >
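The negative curReplicas mentioned in the comment above can be seen directly from the quoted update() signature; the sketch below only reproduces that subtraction, with replica counts that are illustrative for this scenario (all finalized replicas on decommissioning DataNodes when the IBRs arrive). It is not HDFS code.

{code:java}
// Tiny sketch of the arithmetic described in the comment, following the quoted
// LowRedundancyBlocks.update() signature.
public class NegativeCurReplicasDemo {
  public static void main(String[] args) {
    int liveReplicas = 0;        // decommissioning replicas are not counted as live
    int pendingNum = 0;          // nothing pending reconstruction for this block
    int curReplicasDelta = 1;    // addStoredBlock just accounted one newly stored replica

    int curReplicas = liveReplicas + pendingNum;       // value passed to update()
    int oldReplicas = curReplicas - curReplicasDelta;  // "int oldReplicas = curReplicas-curReplicasDelta;"

    // oldReplicas = -1: the priority derived from it no longer matches the queue the
    // block actually sits in, which is the inconsistency the comment points at.
    System.out.println("curReplicas=" + curReplicas + ", oldReplicas=" + oldReplicas);
  }
}
{code}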
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14429: -- Attachment: HDFS-14429.03.patch > Block remain in COMMITTED but not COMPLETE cause by Decommission > > > Key: HDFS-14429 > URL: https://issues.apache.org/jira/browse/HDFS-14429 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, > HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, > HDFS-14429.branch-2.02.patch > > > In the following scenario, the Block will remain in the COMMITTED but not > COMPLETE state and cannot be closed properly: > # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3). > # bk1 has been completely written to three data nodes, and the data node > succeeds FinalizeBlock, joins IBR and waits to report to NameNode. > # The client commits bk1 after receiving the ACK. > # When the DN has not been reported to the IBR, all three nodes dn1/dn2/dn3 > enter Decommissioning. > # The DN reports the IBR, but the block cannot be completed normally. > > Then it will lead to the following related exceptions: > {panel:title=Exception} > 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem > (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* > blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= > minimum = 1) in file xxx > 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - > IPC Server handler 499 on 8020, call Call#122552 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615 > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not > replicated yet: xxx > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606) > {panel} > And will cause the scenario described in HDFS-12747 > The root cause is that addStoredBlock does not consider the case where the > replications are in Decommission. > This problem needs to be fixed like HDFS-11499. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14429: -- Attachment: HDFS-14429.branch-2.02.patch > Block remain in COMMITTED but not COMPLETE cause by Decommission > > > Key: HDFS-14429 > URL: https://issues.apache.org/jira/browse/HDFS-14429 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, > HDFS-14429.branch-2.01.patch, HDFS-14429.branch-2.02.patch > > > In the following scenario, the Block will remain in the COMMITTED but not > COMPLETE state and cannot be closed properly: > # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3). > # bk1 has been completely written to three data nodes, and the data node > succeeds FinalizeBlock, joins IBR and waits to report to NameNode. > # The client commits bk1 after receiving the ACK. > # When the DN has not been reported to the IBR, all three nodes dn1/dn2/dn3 > enter Decommissioning. > # The DN reports the IBR, but the block cannot be completed normally. > > Then it will lead to the following related exceptions: > {panel:title=Exception} > 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem > (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* > blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= > minimum = 1) in file xxx > 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - > IPC Server handler 499 on 8020, call Call#122552 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615 > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not > replicated yet: xxx > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606) > {panel} > And will cause the scenario described in HDFS-12747 > The root cause is that addStoredBlock does not consider the case where the > replications are in Decommission. > This problem needs to be fixed like HDFS-11499. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14429: -- Attachment: (was: HDFS-14429.branch-2.02.patch) > Block remain in COMMITTED but not COMPLETE cause by Decommission > > > Key: HDFS-14429 > URL: https://issues.apache.org/jira/browse/HDFS-14429 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, > HDFS-14429.branch-2.01.patch > > > In the following scenario, the Block will remain in the COMMITTED but not > COMPLETE state and cannot be closed properly: > # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3). > # bk1 has been completely written to three data nodes, and the data node > succeeds FinalizeBlock, joins IBR and waits to report to NameNode. > # The client commits bk1 after receiving the ACK. > # When the DN has not been reported to the IBR, all three nodes dn1/dn2/dn3 > enter Decommissioning. > # The DN reports the IBR, but the block cannot be completed normally. > > Then it will lead to the following related exceptions: > {panel:title=Exception} > 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem > (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* > blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= > minimum = 1) in file xxx > 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - > IPC Server handler 499 on 8020, call Call#122552 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615 > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not > replicated yet: xxx > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606) > {panel} > And will cause the scenario described in HDFS-12747 > The root cause is that addStoredBlock does not consider the case where the > replications are in Decommission. > This problem needs to be fixed like HDFS-11499. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14429: -- Attachment: (was: HDFS-14429.03.patch) > Block remain in COMMITTED but not COMPLETE cause by Decommission > > > Key: HDFS-14429 > URL: https://issues.apache.org/jira/browse/HDFS-14429 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, > HDFS-14429.branch-2.01.patch > > > In the following scenario, the Block will remain in the COMMITTED but not > COMPLETE state and cannot be closed properly: > # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3). > # bk1 has been completely written to three data nodes, and the data node > succeeds FinalizeBlock, joins IBR and waits to report to NameNode. > # The client commits bk1 after receiving the ACK. > # When the DN has not been reported to the IBR, all three nodes dn1/dn2/dn3 > enter Decommissioning. > # The DN reports the IBR, but the block cannot be completed normally. > > Then it will lead to the following related exceptions: > {panel:title=Exception} > 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem > (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* > blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= > minimum = 1) in file xxx > 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - > IPC Server handler 499 on 8020, call Call#122552 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615 > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not > replicated yet: xxx > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606) > {panel} > And will cause the scenario described in HDFS-12747 > The root cause is that addStoredBlock does not consider the case where the > replications are in Decommission. > This problem needs to be fixed like HDFS-11499. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871032#comment-16871032 ] Yicong Cai commented on HDFS-14429: --- Thanks [~hexiaoqiao] for reviewing my patch. I have fixed the three issues you mentioned (a/b/c). trunk: [^HDFS-14429.03.patch] branch-2: [^HDFS-14429.branch-2.02.patch] d. Do we need to add {{pendingNum}} when calculating numUsableReplicas? No, there is no need to add pendingNum: a block can enter COMPLETE only when its FINALIZED replicas reach the minimum replication, and a pending replica is not yet FINALIZED.
> Block remain in COMMITTED but not COMPLETE cause by Decommission > > > Key: HDFS-14429 > URL: https://issues.apache.org/jira/browse/HDFS-14429 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, > HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, > HDFS-14429.branch-2.02.patch > > > In the following scenario, the Block will remain in the COMMITTED but not > COMPLETE state and cannot be closed properly: > # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3). > # bk1 has been completely written to three data nodes, and the data node > succeeds FinalizeBlock, joins IBR and waits to report to NameNode. > # The client commits bk1 after receiving the ACK. > # When the DN has not been reported to the IBR, all three nodes dn1/dn2/dn3 > enter Decommissioning. > # The DN reports the IBR, but the block cannot be completed normally. > > Then it will lead to the following related exceptions: > {panel:title=Exception} > 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem > (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* > blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= > minimum = 1) in file xxx > 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - > IPC Server handler 499 on 8020, call Call#122552 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615 > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not > replicated yet: xxx > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606) > {panel} > And will cause the scenario described in HDFS-12747 > The root cause is that addStoredBlock does not consider the case where the > replications are in Decommission.
> This problem needs to be fixed like HDFS-11499. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
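For the numUsableReplicas question above, a hedged sketch of the completion check being discussed (the names numUsableReplicas and minReplication follow the thread and are not necessarily the exact shape of the committed patch): replicas on decommissioning nodes are treated as usable when deciding whether a COMMITTED block may move to COMPLETE, and pendingNum is not added because pending replicas are not yet FINALIZED.

{code:java}
// Illustrative only; not the committed HDFS-14429 patch.
public class CommittedBlockCompletionCheck {

  /** Decommissioning replicas count as usable; pendingNum is deliberately not added,
      because pending replicas are not yet FINALIZED. */
  static boolean canComplete(int liveReplicas, int decommissioningReplicas, int minReplication) {
    int numUsableReplicas = liveReplicas + decommissioningReplicas;
    return numUsableReplicas >= minReplication;
  }

  public static void main(String[] args) {
    // Scenario from this issue: all three finalized replicas are on decommissioning
    // DataNodes and the minimum replication is 1.
    System.out.println("decommissioning replicas ignored: " + canComplete(0, 0, 1)); // false -> stays COMMITTED
    System.out.println("decommissioning replicas counted: " + canComplete(0, 3, 1)); // true  -> can become COMPLETE
  }
}
{code}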
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14429: -- Attachment: HDFS-14429.branch-2.02.patch > Block remain in COMMITTED but not COMPLETE cause by Decommission > > > Key: HDFS-14429 > URL: https://issues.apache.org/jira/browse/HDFS-14429 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, > HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch, > HDFS-14429.branch-2.02.patch > > > In the following scenario, the Block will remain in the COMMITTED but not > COMPLETE state and cannot be closed properly: > # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3). > # bk1 has been completely written to three data nodes, and the data node > succeeds FinalizeBlock, joins IBR and waits to report to NameNode. > # The client commits bk1 after receiving the ACK. > # When the DN has not been reported to the IBR, all three nodes dn1/dn2/dn3 > enter Decommissioning. > # The DN reports the IBR, but the block cannot be completed normally. > > Then it will lead to the following related exceptions: > {panel:title=Exception} > 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem > (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* > blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= > minimum = 1) in file xxx > 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - > IPC Server handler 499 on 8020, call Call#122552 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615 > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not > replicated yet: xxx > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606) > {panel} > And will cause the scenario described in HDFS-12747 > The root cause is that addStoredBlock does not consider the case where the > replications are in Decommission. > This problem needs to be fixed like HDFS-11499. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14429: -- Attachment: HDFS-14429.03.patch > Block remain in COMMITTED but not COMPLETE cause by Decommission > > > Key: HDFS-14429 > URL: https://issues.apache.org/jira/browse/HDFS-14429 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Major > Attachments: HDFS-14429.01.patch, HDFS-14429.02.patch, > HDFS-14429.03.patch, HDFS-14429.branch-2.01.patch > > > In the following scenario, the Block will remain in the COMMITTED but not > COMPLETE state and cannot be closed properly: > # Client writes Block(bk1) to three data nodes (dn1/dn2/dn3). > # bk1 has been completely written to three data nodes, and the data node > succeeds FinalizeBlock, joins IBR and waits to report to NameNode. > # The client commits bk1 after receiving the ACK. > # When the DN has not been reported to the IBR, all three nodes dn1/dn2/dn3 > enter Decommissioning. > # The DN reports the IBR, but the block cannot be completed normally. > > Then it will lead to the following related exceptions: > {panel:title=Exception} > 2019-04-02 13:40:31,882 INFO namenode.FSNamesystem > (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* > blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= > minimum = 1) in file xxx > 2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - > IPC Server handler 499 on 8020, call Call#122552 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615 > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not > replicated yet: xxx > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606) > {panel} > And will cause the scenario described in HDFS-12747 > The root cause is that addStoredBlock does not consider the case where the > replications are in Decommission. > This problem needs to be fixed like HDFS-11499. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14429:
    Attachment: HDFS-14429.branch-2.01.patch
[jira] [Commented] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870579#comment-16870579 ]
Yicong Cai commented on HDFS-14429:
    Provided the branch-2 patch [^HDFS-14429.branch-2.01.patch] and the trunk patch [^HDFS-14429.02.patch]. [~jojochuang]
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14429:
    Target Version/s: 2.10.0, 3.3.0, 2.9.3 (was: 3.3.0, 2.9.3)
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14429:
    Attachment: HDFS-14429.02.patch
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14429:
    Attachment: (was: HDFS-14429.02.patch)
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14429:
    Attachment: HDFS-14429.02.patch
[jira] [Commented] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867237#comment-16867237 ]
Yicong Cai commented on HDFS-14465:
    [~jojochuang] [^HDFS-14465.branch-2.9.01.patch] is ready.

> When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
>
>                 Key: HDFS-14465
>                 URL: https://issues.apache.org/jira/browse/HDFS-14465
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Yicong Cai
>            Assignee: Yicong Cai
>            Priority: Major
>             Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>         Attachments: HDFS-14465.01.patch, HDFS-14465.02.patch, HDFS-14465.branch-2.9.01.patch
>
> Scenario:
> There is a small HDFS cluster with 5 DataNodes. One of them needs maintenance, so it is added to the maintenance list, and dfs.namenode.maintenance.replication.min is set to 1.
> When the nodes are refreshed, the NameNode starts checking whether the blocks on that node require additional replicas.
> The replication factor of a MapReduce job file is 10 by default, so isNeededReplicationForMaintenance returns false and isSufficientlyReplicated returns false; the blocks of the job file therefore need additional replicas.
> When adding a replica, since the cluster has only 5 DataNodes and every node already holds a replica of the block, chooseTargetInOrder throws a NotEnoughReplicasException. The replication can never be increased, and the Entering Maintenance state can never end.
> This issue makes maintenance mode unusable on small standalone clusters.
>
> {panel:title=chooseTarget exception log}
> 2019-05-03 23:42:31,008 [31545331] - WARN [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough replicas, still in need of 1 to reach 5 (unavailableStorages=[], storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
> {panel}
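To illustrate why the check above can never be satisfied, here is a minimal sketch. The names (MaintenanceReplicationSketch, sufficientlyReplicatedNaive, sufficientlyReplicatedCapped) are hypothetical, and the capping logic is only one possible reading of a fix, not the actual BlockManager code from the attached patches.

{code:java}
/**
 * Hypothetical sketch of the replication check discussed in HDFS-14465.
 * When deciding whether a block on an ENTERING_MAINTENANCE node is
 * sufficiently replicated, an expected replication larger than the cluster
 * can ever satisfy should not keep the node from finishing the transition.
 */
public class MaintenanceReplicationSketch {

    /**
     * Naive check: sufficiently replicated only when live replicas reach the
     * file's full replication factor. With a factor of 10 on a 5-node cluster
     * this can never become true.
     */
    static boolean sufficientlyReplicatedNaive(int liveReplicas, int expectedReplication) {
        return liveReplicas >= expectedReplication;
    }

    /**
     * Capped check (one possible interpretation of a fix): never require more
     * replicas than there are DataNodes able to hold one, and never fewer than
     * the configured maintenance minimum.
     */
    static boolean sufficientlyReplicatedCapped(int liveReplicas,
                                                int expectedReplication,
                                                int numDataNodes,
                                                int maintenanceMinReplication) {
        int required = Math.max(maintenanceMinReplication,
                                Math.min(expectedReplication, numDataNodes));
        return liveReplicas >= required;
    }

    public static void main(String[] args) {
        int expectedReplication = 10;  // MapReduce job file default
        int numDataNodes = 5;          // small cluster from the description
        int liveReplicas = 4;          // every node except the one entering maintenance
        int maintenanceMin = 1;        // dfs.namenode.maintenance.replication.min

        System.out.println("naive check:  "
                + sufficientlyReplicatedNaive(liveReplicas, expectedReplication));   // false forever
        System.out.println("capped check: "
                + sufficientlyReplicatedCapped(liveReplicas, expectedReplication,
                                               numDataNodes, maintenanceMin));       // true
    }
}
{code}

The sketch only shows why an uncapped requirement can never be met on a 5-node cluster; how the actual patch resolves it is not restated in this thread.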
[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14465:
    Attachment: HDFS-14465.branch-2.9.01.patch
[jira] [Commented] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866890#comment-16866890 ]
Yicong Cai commented on HDFS-14465:
    Okay, I'll provide the branch-2 patch as soon as possible. [~jojochuang]
[jira] [Commented] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866891#comment-16866891 ]
Yicong Cai commented on HDFS-14429:
    Okay, I'll provide relevant test cases.
[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14465:
    Attachment: HDFS-14465.02.patch
    Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14465:
    Status: Open (was: Patch Available)
[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14465:
    Attachment: (was: HDFS-14465.02.patch)
[jira] [Commented] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834534#comment-16834534 ]
Yicong Cai commented on HDFS-14465:
    [^HDFS-14465.02.patch] fixes the checkstyle issues. I ran the hadoop.hdfs.web.TestWebHdfsTimeouts case separately and it works fine.
[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14465:
    Attachment: HDFS-14465.02.patch
[jira] [Updated] (HDFS-14429) Block remain in COMMITTED but not COMPLETE cause by Decommission
[ https://issues.apache.org/jira/browse/HDFS-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14429:
    Attachment: HDFS-14429.01.patch
    Target Version/s: 3.3.0, 2.9.3 (was: 2.9.3)
    Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14311:
    Attachment: (was: HDFS-14311.1.patch)

> multi-threading conflict at layoutVersion when loading block pool storage
>
>                 Key: HDFS-14311
>                 URL: https://issues.apache.org/jira/browse/HDFS-14311
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: rolling upgrades
>    Affects Versions: 2.9.2
>            Reporter: Yicong Cai
>            Priority: Major
>             Fix For: 3.3.0, 2.9.3
>         Attachments: HDFS-14311.1.patch
>
> When a DataNode is upgraded from 2.7.3 to 2.9.2, there is a conflict on StorageInfo.layoutVersion while loading the block pool storage.
> It causes this exception:
>
> {panel:title=exceptions}
> 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] - Restored 36974 block files from trash before the layout upgrade. These blocks will be moved to the previous directory during the upgrade
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] - Failed to analyze storage directories for block pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0
> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
> at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
> at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
> at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
> at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
> at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
> at java.lang.Thread.run(Thread.java:748)
> 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block pool BP-1216718839-10.120.232.23-1548736842023
> java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0
> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406)
> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177)
> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221)
> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250)
> at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460)
> at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390)
> at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)
> at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649)
> at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
> at java.lang.Thread.run(Thread.java:748)
> {panel}
>
> Root cause:
> A single BlockPoolSliceStorage instance is shared by the recover-transition of all storage locations. In BlockPoolSliceStorage.doTransition, the old layoutVersion is read from local storage and compared with the current DataNode version before the upgrade is performed. doUpgrade submits the transition work to a sub-thread, and that work sets the shared BlockPoolSliceStorage layoutVersion to the current DN version. The next storage directory's transition check then runs concurrently with
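To make the root cause above easier to follow, the sketch below models the race with hypothetical classes (LayoutVersionRaceSketch, SharedStorage). It is not the real BlockPoolSliceStorage/DataStorage code; it only isolates the problem of a shared mutable layoutVersion being bumped by an upgrade sub-thread while later directory transitions still read it.

{code:java}
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the HDFS-14311 race: one storage object is reused
 * for every storage directory, an upgrade sub-thread bumps its shared
 * layoutVersion, and the next directory's transition check then sees the
 * already-bumped value instead of that directory's on-disk value.
 */
public class LayoutVersionRaceSketch {

    static final int OLD_LV = -57;      // layout version found on disk before the upgrade
    static final int CURRENT_LV = -63;  // layout version of the running DataNode software

    /** Shared mutable state, reused across directories (models the bug). */
    static class SharedStorage {
        volatile int layoutVersion = OLD_LV;
    }

    /** Buggy flow: each directory's check reads the shared layoutVersion. */
    static List<String> transitionAllDirsShared(List<String> dirs) throws InterruptedException {
        SharedStorage shared = new SharedStorage();
        List<String> failures = new ArrayList<>();
        for (String dir : dirs) {
            if (shared.layoutVersion == CURRENT_LV) {
                // Models the "Failed to analyze storage directories" outcome.
                failures.add(dir);
                continue;
            }
            // Models doUpgrade() handing the work to a sub-thread that mutates shared state.
            Thread upgrade = new Thread(() -> shared.layoutVersion = CURRENT_LV);
            upgrade.start();
            upgrade.join();
        }
        return failures;
    }

    /** Fix idea: decide from a per-directory snapshot of the on-disk version. */
    static List<String> transitionAllDirsPerDir(List<String> dirs) throws InterruptedException {
        List<String> failures = new ArrayList<>();
        for (String dir : dirs) {
            final int onDiskLayoutVersion = OLD_LV;  // read from this directory's VERSION file
            if (onDiskLayoutVersion == CURRENT_LV) {
                failures.add(dir);
                continue;
            }
            Thread upgrade = new Thread(() -> { /* upgrade this directory only */ });
            upgrade.start();
            upgrade.join();
        }
        return failures;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> dirs = List.of("/mnt/dfs/1", "/mnt/dfs/2", "/mnt/dfs/3");
        System.out.println("shared layoutVersion, failed dirs: " + transitionAllDirsShared(dirs));
        System.out.println("per-dir snapshot, failed dirs:     " + transitionAllDirsPerDir(dirs));
    }
}
{code}

The attached HDFS-14311.1.patch may take a different approach; the sketch only isolates the shared-state race described in the root-cause paragraph.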
[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14311:
    Fix Version/s: (was: 2.9.3) (was: 3.3.0)
    Attachment: HDFS-14311.1.patch
    Target Version/s: 3.3.0, 2.9.3
    Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yicong Cai updated HDFS-14465:
    Status: Open (was: Patch Available)
[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14465: -- Attachment: HDFS-14465.01.patch Target Version/s: 3.3.0, 2.9.3 Status: Patch Available (was: Open) > When the Block expected replications is larger than the number of DataNodes, > entering maintenance will never exit. > -- > > Key: HDFS-14465 > URL: https://issues.apache.org/jira/browse/HDFS-14465 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Priority: Major > Attachments: HDFS-14465.01.patch > > > Scenes: > There is a small HDFS cluster with 5 DataNodes; one of them is maintained, > added to the maintenance list, and set > dfs.namenode.maintenance.replication.min to 1. > When refresh Nodes, the NameNode starts checking whether the blocks on the > node require a new replication. > The replications of the MapReduce task job file is 10 by default, > isNeededReplicationForMaintenance will determine to false, and > isSufficientlyReplicated will determine to false, so the block of the job > file needs to increase the replication. > When adding a replication, since the cluster has only 5 DataNodes, all the > nodes have the replications of the block, chooseTargetInOrder will throw a > NotEnoughReplicasException, so that the replication cannot be increase, and > the Entering Maintenance cannot be ended. > This issue will cause the independent small cluster to be unable to use the > maintenance mode. > > {panel:title=chooseTarget exception log} > 2019-05-03 23:42:31,008 [31545331] - WARN > [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough > replicas, still in need of 1 to reach 5 (unavailableStorages=[], > storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], > creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For > more information, please enable DEBUG log level on > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and > org.apache.hadoop.net.NetworkTopology > {panel} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14465: -- Fix Version/s: (was: 2.9.3) (was: 3.3.0) > When the Block expected replications is larger than the number of DataNodes, > entering maintenance will never exit. > -- > > Key: HDFS-14465 > URL: https://issues.apache.org/jira/browse/HDFS-14465 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Priority: Major > > Scenes: > There is a small HDFS cluster with 5 DataNodes; one of them is maintained, > added to the maintenance list, and set > dfs.namenode.maintenance.replication.min to 1. > When refresh Nodes, the NameNode starts checking whether the blocks on the > node require a new replication. > The replications of the MapReduce task job file is 10 by default, > isNeededReplicationForMaintenance will determine to false, and > isSufficientlyReplicated will determine to false, so the block of the job > file needs to increase the replication. > When adding a replication, since the cluster has only 5 DataNodes, all the > nodes have the replications of the block, chooseTargetInOrder will throw a > NotEnoughReplicasException, so that the replication cannot be increase, and > the Entering Maintenance cannot be ended. > This issue will cause the independent small cluster to be unable to use the > maintenance mode. > > {panel:title=chooseTarget exception log} > 2019-05-03 23:42:31,008 [31545331] - WARN > [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough > replicas, still in need of 1 to reach 5 (unavailableStorages=[], > storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], > creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For > more information, please enable DEBUG log level on > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and > org.apache.hadoop.net.NetworkTopology > {panel} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
[ https://issues.apache.org/jira/browse/HDFS-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14465: -- Fix Version/s: 2.9.3 3.3.0 Status: Patch Available (was: Open) > When the Block expected replications is larger than the number of DataNodes, > entering maintenance will never exit. > -- > > Key: HDFS-14465 > URL: https://issues.apache.org/jira/browse/HDFS-14465 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Priority: Major > Fix For: 3.3.0, 2.9.3 > > > Scenes: > There is a small HDFS cluster with 5 DataNodes; one of them is maintained, > added to the maintenance list, and set > dfs.namenode.maintenance.replication.min to 1. > When refresh Nodes, the NameNode starts checking whether the blocks on the > node require a new replication. > The replications of the MapReduce task job file is 10 by default, > isNeededReplicationForMaintenance will determine to false, and > isSufficientlyReplicated will determine to false, so the block of the job > file needs to increase the replication. > When adding a replication, since the cluster has only 5 DataNodes, all the > nodes have the replications of the block, chooseTargetInOrder will throw a > NotEnoughReplicasException, so that the replication cannot be increase, and > the Entering Maintenance cannot be ended. > This issue will cause the independent small cluster to be unable to use the > maintenance mode. > > {panel:title=chooseTarget exception log} > 2019-05-03 23:42:31,008 [31545331] - WARN > [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough > replicas, still in need of 1 to reach 5 (unavailableStorages=[], > storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], > creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For > more information, please enable DEBUG log level on > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and > org.apache.hadoop.net.NetworkTopology > {panel} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-14465) When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit.
Yicong Cai created HDFS-14465: - Summary: When the Block expected replications is larger than the number of DataNodes, entering maintenance will never exit. Key: HDFS-14465 URL: https://issues.apache.org/jira/browse/HDFS-14465 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.9.2 Reporter: Yicong Cai
Scenario: There is a small HDFS cluster with 5 DataNodes; one of them is put into maintenance by adding it to the maintenance list, and dfs.namenode.maintenance.replication.min is set to 1.
When refreshNodes is run, the NameNode starts checking whether the blocks on that node need additional replication.
The MapReduce job file's replication is 10 by default, so isNeededReplicationForMaintenance evaluates to false and isSufficientlyReplicated evaluates to false, and the job file's block needs additional replicas.
When adding a replica, since the cluster has only 5 DataNodes and every node already holds a replica of the block, chooseTargetInOrder throws a NotEnoughReplicasException, so the replication can never be increased and the Entering Maintenance state can never end.
This issue makes maintenance mode unusable on small, stand-alone clusters.
{panel:title=chooseTarget exception log}
2019-05-03 23:42:31,008 [31545331] - WARN [ReplicationMonitor:BlockPlacementPolicyDefault@431] - Failed to place enough replicas, still in need of 1 to reach 5 (unavailableStorages=[], storagePolicy=BlockStoragePolicy\{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
{panel}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
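As a rough illustration of the sufficiency check described above (a minimal sketch with hypothetical names such as isSufficient; this is not the actual BlockManager code): with 5 DataNodes and an expected replication of 10, a check that compares live replicas against the raw expected replication can never be satisfied, whereas capping the target at the cluster size would allow entering maintenance to finish.

{code:java}
// Hypothetical, simplified model of the maintenance replication check; names are
// illustrative, not the real NameNode API.
public class MaintenanceCheck {

    /**
     * Returns true when a block on an ENTERING_MAINTENANCE node no longer needs
     * extra replication, i.e. the node may finish entering maintenance.
     */
    static boolean isSufficient(int liveReplicas, int expectedReplication,
                                int numDataNodes, boolean capAtClusterSize) {
        int target = expectedReplication;
        if (capAtClusterSize) {
            // Possible mitigation (assumption): never demand more replicas than
            // there are DataNodes able to hold one each.
            target = Math.min(expectedReplication, numDataNodes);
        }
        return liveReplicas >= target;
    }

    public static void main(String[] args) {
        // Scenario from the report: 5 DataNodes, job file replication = 10,
        // dfs.namenode.maintenance.replication.min = 1.
        int live = 5, expected = 10, dataNodes = 5;

        // Without the cap the check can never succeed, so the node stays in
        // ENTERING_MAINTENANCE forever (chooseTarget keeps failing as well).
        System.out.println(isSufficient(live, expected, dataNodes, false)); // false
        // With the cap, the 5 existing replicas are considered sufficient.
        System.out.println(isSufficient(live, expected, dataNodes, true));  // true
    }
}
{code}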
[jira] [Created] (HDFS-14429) Block remains in COMMITTED but not COMPLETE state caused by Decommission
Yicong Cai created HDFS-14429: - Summary: Block remains in COMMITTED but not COMPLETE state caused by Decommission Key: HDFS-14429 URL: https://issues.apache.org/jira/browse/HDFS-14429 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.9.2 Reporter: Yicong Cai
In the following scenario, the Block will remain in the COMMITTED but not COMPLETE state and cannot be closed properly:
# The client writes Block bk1 to three DataNodes (dn1/dn2/dn3).
# bk1 is fully written to the three DataNodes; each DataNode finalizes the block and queues it in an incremental block report (IBR) to be sent to the NameNode.
# The client commits bk1 after receiving the ACK.
# Before the IBRs have been reported, all three nodes dn1/dn2/dn3 enter Decommissioning.
# The DataNodes then report the IBRs, but the block cannot be completed normally.
This leads to the following exception:
{panel:title=Exception}
2019-04-02 13:40:31,882 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= minimum = 1) in file xxx
2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - IPC Server handler 499 on 8020, call Call#122552 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet: xxx
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
{panel}
It also causes the scenario described in HDFS-12747.
The root cause is that addStoredBlock does not consider the case where the replicas are on decommissioning nodes. This needs to be fixed along the lines of HDFS-11499.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
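A minimal sketch of the completion condition discussed above, under the assumption that the problem is which replica states count toward minReplication (names such as canComplete are hypothetical, not the real addStoredBlock API): when every finalized replica sits on a decommissioning node and only fully live replicas are counted, a COMMITTED block can never be promoted to COMPLETE and addBlock keeps failing with NotReplicatedYetException.

{code:java}
import java.util.List;

// Hypothetical replica-holder states for the sketch.
enum ReplicaNodeState { LIVE, DECOMMISSIONING, DECOMMISSIONED }

public class CommittedBlockCheck {

    /** Can a COMMITTED block be promoted to COMPLETE? (simplified model) */
    static boolean canComplete(List<ReplicaNodeState> reportedReplicas,
                               int minReplication, boolean countDecommissioning) {
        long usable = reportedReplicas.stream()
            .filter(s -> s == ReplicaNodeState.LIVE
                      || (countDecommissioning && s == ReplicaNodeState.DECOMMISSIONING))
            .count();
        return usable >= minReplication;
    }

    public static void main(String[] args) {
        // Scenario from the report: all 3 replicas are finalized, but the IBRs
        // arrive only after dn1/dn2/dn3 have entered Decommissioning.
        List<ReplicaNodeState> replicas = List.of(
            ReplicaNodeState.DECOMMISSIONING,
            ReplicaNodeState.DECOMMISSIONING,
            ReplicaNodeState.DECOMMISSIONING);

        // Counting only LIVE replicas, the block never reaches COMPLETE.
        System.out.println(canComplete(replicas, 1, false)); // false
        // Also counting decommissioning replicas (the kind of adjustment the
        // report points to via HDFS-11499) lets the block complete.
        System.out.println(canComplete(replicas, 1, true));  // true
    }
}
{code}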
[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14311: -- Attachment: HDFS-14311.1.patch > multi-threading conflict at layoutVersion when loading block pool storage > - > > Key: HDFS-14311 > URL: https://issues.apache.org/jira/browse/HDFS-14311 > Project: Hadoop HDFS > Issue Type: Bug > Components: rolling upgrades >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Priority: Major > Fix For: 3.3.0, 2.9.3 > > Attachments: HDFS-14311.1.patch > > > When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at > StorageInfo.layoutVersion in loading block pool storage process. > It will cause this exception: > > {panel:title=exceptions} > 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] > - Restored 36974 block files from trash before the layout upgrade. These > blocks will be moved to the previous directory during the upgrade > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] > - Failed to analyze storage directories for block pool > BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed > to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block > pool BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > 
org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > {panel} > > root cause: > BlockPoolSliceStorage instance is shared for all storage locations recover > transition. In BlockPoolSliceStorage.doTransition, it will read the old > layoutVersion from local storage, compare with current DataNode version, then > do upgrade. In doUpgrade, add the transition work as a sub-thread, the > transition work will set the BlockPoolSliceStorage's layoutVersion to current > DN version. The next storage dir transition check will concurrent with pre >
[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14311: -- Status: Open (was: Patch Available) > multi-threading conflict at layoutVersion when loading block pool storage > - > > Key: HDFS-14311 > URL: https://issues.apache.org/jira/browse/HDFS-14311 > Project: Hadoop HDFS > Issue Type: Bug > Components: rolling upgrades >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Priority: Major > Fix For: 3.3.0, 2.9.3 > > > When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at > StorageInfo.layoutVersion in loading block pool storage process. > It will cause this exception: > > {panel:title=exceptions} > 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] > - Restored 36974 block files from trash before the layout upgrade. These > blocks will be moved to the previous directory during the upgrade > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] > - Failed to analyze storage directories for block pool > BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed > to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block > pool BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > {panel} > > root cause: > BlockPoolSliceStorage instance is shared for all storage locations recover > transition. In BlockPoolSliceStorage.doTransition, it will read the old > layoutVersion from local storage, compare with current DataNode version, then > do upgrade. In doUpgrade, add the transition work as a sub-thread, the > transition work will set the BlockPoolSliceStorage's layoutVersion to current > DN version. The next storage dir transition check will concurrent with pre > storage dir real transition work, then the
[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14311: -- Fix Version/s: 2.9.3 3.3.0 Status: Patch Available (was: Open) > multi-threading conflict at layoutVersion when loading block pool storage > - > > Key: HDFS-14311 > URL: https://issues.apache.org/jira/browse/HDFS-14311 > Project: Hadoop HDFS > Issue Type: Bug > Components: rolling upgrades >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Priority: Major > Fix For: 3.3.0, 2.9.3 > > > When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at > StorageInfo.layoutVersion in loading block pool storage process. > It will cause this exception: > > {panel:title=exceptions} > 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] > - Restored 36974 block files from trash before the layout upgrade. These > blocks will be moved to the previous directory during the upgrade > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] > - Failed to analyze storage directories for block pool > BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed > to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block > pool BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > 
org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > {panel} > > root cause: > BlockPoolSliceStorage instance is shared for all storage locations recover > transition. In BlockPoolSliceStorage.doTransition, it will read the old > layoutVersion from local storage, compare with current DataNode version, then > do upgrade. In doUpgrade, add the transition work as a sub-thread, the > transition work will set the BlockPoolSliceStorage's layoutVersion to current > DN version. The next storage dir transition check will
[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14311: -- Attachment: (was: HDFS-14311.1.patch) > multi-threading conflict at layoutVersion when loading block pool storage > - > > Key: HDFS-14311 > URL: https://issues.apache.org/jira/browse/HDFS-14311 > Project: Hadoop HDFS > Issue Type: Bug > Components: rolling upgrades >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Priority: Major > > When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at > StorageInfo.layoutVersion in loading block pool storage process. > It will cause this exception: > > {panel:title=exceptions} > 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] > - Restored 36974 block files from trash before the layout upgrade. These > blocks will be moved to the previous directory during the upgrade > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] > - Failed to analyze storage directories for block pool > BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed > to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block > pool BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > {panel} > > root cause: > BlockPoolSliceStorage instance is shared for all storage locations recover > transition. In BlockPoolSliceStorage.doTransition, it will read the old > layoutVersion from local storage, compare with current DataNode version, then > do upgrade. In doUpgrade, add the transition work as a sub-thread, the > transition work will set the BlockPoolSliceStorage's layoutVersion to current > DN version. The next storage dir transition check will concurrent with pre > storage dir real transition work, then the BlockPoolSliceStorage instance
[jira] [Updated] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
[ https://issues.apache.org/jira/browse/HDFS-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yicong Cai updated HDFS-14311: -- Attachment: HDFS-14311.1.patch > multi-threading conflict at layoutVersion when loading block pool storage > - > > Key: HDFS-14311 > URL: https://issues.apache.org/jira/browse/HDFS-14311 > Project: Hadoop HDFS > Issue Type: Bug > Components: rolling upgrades >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Priority: Major > > When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at > StorageInfo.layoutVersion in loading block pool storage process. > It will cause this exception: > > {panel:title=exceptions} > 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] > - Restored 36974 block files from trash before the layout upgrade. These > blocks will be moved to the previous directory during the upgrade > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] > - Failed to analyze storage directories for block pool > BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed > to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block > pool BP-1216718839-10.120.232.23-1548736842023 > java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the > namespace state: LV = -63 CTime = 0 > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) > at > org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) > at > 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) > at java.lang.Thread.run(Thread.java:748) > {panel} > > root cause: > BlockPoolSliceStorage instance is shared for all storage locations recover > transition. In BlockPoolSliceStorage.doTransition, it will read the old > layoutVersion from local storage, compare with current DataNode version, then > do upgrade. In doUpgrade, add the transition work as a sub-thread, the > transition work will set the BlockPoolSliceStorage's layoutVersion to current > DN version. The next storage dir transition check will concurrent with pre > storage dir real transition work, then the BlockPoolSliceStorage instance >
[jira] [Created] (HDFS-14311) multi-threading conflict at layoutVersion when loading block pool storage
Yicong Cai created HDFS-14311: - Summary: multi-threading conflict at layoutVersion when loading block pool storage Key: HDFS-14311 URL: https://issues.apache.org/jira/browse/HDFS-14311 Project: Hadoop HDFS Issue Type: Bug Components: rolling upgrades Affects Versions: 2.9.2 Reporter: Yicong Cai When DataNode upgrade from 2.7.3 to 2.9.2, there is a conflict at StorageInfo.layoutVersion in loading block pool storage process. It will cause this exception: {panel:title=exceptions} 2019-02-15 10:18:01,357 [13783] - INFO [Thread-33:BlockPoolSliceStorage@395] - Restored 36974 block files from trash before the layout upgrade. These blocks will be moved to the previous directory during the upgrade 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:BlockPoolSliceStorage@226] - Failed to analyze storage directories for block pool BP-1216718839-10.120.232.23-1548736842023 java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0 at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816) at java.lang.Thread.run(Thread.java:748) 2019-02-15 10:18:01,358 [13784] - WARN [Thread-33:DataStorage@472] - Failed to add storage directory [DISK]file:/mnt/dfs/2/hadoop/hdfs/data/ for block pool BP-1216718839-10.120.232.23-1548736842023 java.io.IOException: Datanode state: LV = -57 CTime = 0 is newer than the namespace state: LV = -63 CTime = 0 at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:406) at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:177) at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:221) at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:250) at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadBlockPoolSliceStorage(DataStorage.java:460) at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:390) at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556) at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1649) at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1610) at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:388)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:280)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:816)
at java.lang.Thread.run(Thread.java:748)
{panel}
Root cause: a single BlockPoolSliceStorage instance is shared by all storage locations during the recover-transition process. BlockPoolSliceStorage.doTransition reads the old layoutVersion from local storage, compares it with the current DataNode version, and then performs the upgrade. doUpgrade hands the transition work to a sub-thread, and that work sets the shared BlockPoolSliceStorage layoutVersion to the current DataNode version. The transition check for the next storage directory therefore runs concurrently with the previous directory's actual transition work, and the layoutVersion of the shared BlockPoolSliceStorage instance becomes inconsistent.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
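A stripped-down sketch of the race described in the root cause above (hypothetical names; this is not the real BlockPoolSliceStorage code): a layoutVersion field shared by every storage directory, combined with per-directory upgrade threads that are not joined before the next directory is checked, lets a later version check observe a value written by an earlier directory's upgrade.

{code:java}
import java.util.List;

public class LayoutVersionRaceSketch {

    static final int OLD_LV = -57;       // layout written by the 2.7.3 DataNode
    static final int CURRENT_LV = -63;   // layout of the upgraded 2.9.2 software

    // Stands in for the single BlockPoolSliceStorage instance reused for every
    // storage directory, so its layoutVersion is shared mutable state.
    static volatile int sharedLayoutVersion;

    /** Very loose model of doTransition() for one storage directory. */
    static void transition(String dir) {
        sharedLayoutVersion = OLD_LV;            // "read this dir's VERSION file"
        if (sharedLayoutVersion != CURRENT_LV) {
            // Models doUpgrade() delegating the real work to a sub-thread that
            // eventually bumps the shared field to the new software version.
            new Thread(() -> sharedLayoutVersion = CURRENT_LV).start();
            // The caller does not join this thread before moving on, so the next
            // directory's transition() races with the assignment above.
        }
        System.out.println(dir + " checked with shared layoutVersion = " + sharedLayoutVersion);
    }

    public static void main(String[] args) {
        // Two of the data dirs from the report; both still hold the old layout.
        for (String dir : List.of("/mnt/dfs/1/hadoop/hdfs/data", "/mnt/dfs/2/hadoop/hdfs/data")) {
            transition(dir);
        }
        // Depending on scheduling, the second directory may see the field already
        // set to CURRENT_LV by the first directory's upgrade thread, which is the
        // kind of inconsistency behind the "is newer than the namespace state"
        // IOException. Keeping a per-directory local copy of the on-disk
        // layoutVersion, or finishing one directory's upgrade before checking the
        // next, would remove the shared-state race (one possible approach, not
        // necessarily the shipped fix).
    }
}
{code}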